What is FuzzyOcr?

Brief Description

FuzzyOcr is a plugin for SpamAssassin that targets unsolicited bulk mail (also known as "spam") in which images are the main content carrier. It analyzes the content and properties of images with several methods to distinguish normal mail (ham) from spam. The main methods are:

  • Optical Character Recognition using different engines and settings
  • Fuzzy word matching algorithm applied to OCR results
  • Image hashing system to learn unique properties of known spam images
  • Dimension, size and integrity checking of images
  • Content-Type verification for the containing email
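To illustrate the idea of fuzzy word matching applied to OCR results, here is a minimal Python sketch. It is not FuzzyOcr's actual algorithm, which this page does not describe: it assumes plain Levenshtein edit distance over a sliding window, and the names `edit_distance`, `fuzzy_hits` and the `max_ratio` tolerance are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_hits(ocr_text: str, words: list[str], max_ratio: float = 0.25) -> list[str]:
    """Return the words found in ocr_text within an allowed error ratio.

    max_ratio is an assumed tolerance: the fraction of a word's
    characters that may differ from the OCR output.
    """
    # Lowercase and drop everything except letters and digits, so that
    # spacing and punctuation noise in the OCR output is ignored.
    text = "".join(ch for ch in ocr_text.lower() if ch.isalnum())
    hits = []
    for word in words:
        target = word.lower()
        n = len(target)
        allowed = int(n * max_ratio)
        # Slide a window of the word's length over the normalized text.
        for start in range(max(1, len(text) - n + 1)):
            if edit_distance(text[start:start + n], target) <= allowed:
                hits.append(word)
                break
    return hits


if __name__ == "__main__":
    print(fuzzy_hits("Buy V1AGRA n0w, ch3ap me-ds", ["viagra", "meds", "mortgage"]))
```

A tolerance relative to word length keeps short words strict while still catching single-character OCR errors or deliberate obfuscation in longer ones.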

For an overview of features, resource usage and scalability, see the detailed list below (it may be incomplete):

  • Matching and learning techniques
    • Flexible Optical Character Recognition interface
    • Fuzzy word matching algorithm applied to OCR results
    • Recognition of duplicate (already processed) or similar images using feature vectors (hashing)
      • Efficient MLDBM database
      • MySQL support
    • Dimension, size and integrity checking
    • Content-Type checking of containing email
    • Generic preprocessor interface
    • Efficient and fast GIF deanimation algorithm
  • Resource saving techniques
    • Only scan mails that have not yet been recognized as ham or spam by other SpamAssassin rules or plugins (using score thresholds)
    • Optionally skip the remaining scanners once one of them has already reached a given score threshold
    • Mail skipping based on direct feature analysis (Dimensions and file size)
    • Automatic optimization of scanner order based on previous results

  • Safety measures
    • Configurable timeout to guard against denial-of-service attacks on the third-party tools
    • Context-based word sets instead of simple lists to prevent false positives (planned for 3.6)

  • For information about the latest branch, see the ChangeLog
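The recognition of duplicate or similar images via feature vectors, listed above, can be sketched as follows. This is only an assumption-laden illustration, not FuzzyOcr's implementation: the features chosen here (image dimensions plus a coarse intensity histogram), the tolerance `tol`, and the names `feature_vector`, `is_similar` and `HashDb` are all hypothetical, and an in-memory dictionary stands in for the MLDBM or MySQL backend.

```python
from typing import Dict, List, Optional, Tuple

Image = List[List[int]]        # rows of 0-255 grayscale pixel values
Features = Tuple[float, ...]   # (width, height, histogram fractions...)


def feature_vector(img: Image, bins: int = 4) -> Features:
    """Build a small feature vector: width, height, normalized histogram."""
    height = len(img)
    width = len(img[0]) if height else 0
    counts = [0] * bins
    for row in img:
        for px in row:
            counts[min(px * bins // 256, bins - 1)] += 1
    total = max(1, width * height)
    return (float(width), float(height), *(c / total for c in counts))


def is_similar(a: Features, b: Features, tol: float = 0.05) -> bool:
    """True if dimensions match and each histogram bin differs by at most tol."""
    if a[:2] != b[:2]:
        return False
    return all(abs(x - y) <= tol for x, y in zip(a[2:], b[2:]))


class HashDb:
    """In-memory stand-in for a persistent store of known image features."""

    def __init__(self) -> None:
        self._known: Dict[Features, float] = {}

    def learn(self, img: Image, score: float) -> None:
        """Remember the score that was assigned to this image."""
        self._known[feature_vector(img)] = score

    def lookup(self, img: Image) -> Optional[float]:
        """Return the stored score of a similar known image, if any."""
        fv = feature_vector(img)
        for known, score in self._known.items():
            if is_similar(fv, known):
                return score
        return None
```

With such a lookup, an image that was already scored can be recognized again even after small pixel-level changes, so the more expensive OCR step can be skipped for it.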

Screenshots

All screenshots show SpamAssassin (SA) running on the command line, displaying various points where FuzzyOcr strikes.

An animated GIF is scanned for the first time

source:tags/screenshots/normal_result.png

The same GIF is scanned a second time

source:tags/screenshots/known_hash.png

Various tricks used by image spammers

source:tags/screenshots/broken.png

FuzzyOcr Debug output when running SpamAssassin with -D (or enabling debug mode in the config)

source:tags/screenshots/debug.png