Installation Instructions for the 3.x branch

Please note that this manual is only valid for version 3.x.

Make sure you also read the Operating System specific notes.

If you find differences or inconsistencies between this documentation and the one supplied in your tarball, please report this so we can improve our documentation :)

Dependencies

Please install all of the following dependencies:

  • SpamAssassin, version 3.1.4 or higher
  • NetPBM Tools, preferably 10.xx or higher
  • GifSicle
  • GifLib/Libungif
    • Please note that the source should be patched with the patch provided on the Downloads page, to fix a bug which causes segmentation faults with some images
  • Gocr, preferably version 0.40 (some people reported bad recognition with 0.41)
    • Please note that the source should be patched with the patch provided on the Downloads page, to fix a bug which causes segmentation faults with some images
    • When compiling gocr, make sure you enable NetPBM support; otherwise the results are not as good
    • Please read OS specific notes when installing from RPMs
  • Perl modules:
  • Optional: Ocrad OCR Engine (at least 0.14!)
    • Ocrad can be used in addition to gocr, or to replace it completely
    • In some situations Ocrad performs better, in others gocr does, so if you have false negatives, try both :)
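
Once everything is installed, a quick check that the external programs are reachable can save some debugging later. The following is only a sketch: the helper name missing_tools is made up for this example, and the list of program names is an assumption based on the tools mentioned in this manual.

```shell
#!/bin/sh
# Hypothetical helper: print every program from the argument list
# that cannot be found in the current PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the program is not found
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed program names (ocrad is optional); adjust to your setup.
missing_tools gocr ocrad gifsicle pnmnorm pnmquant pnminvert
```

If nothing is printed, all listed programs were found. Programs installed in a non-standard location will show up as missing here and need to be configured by hand.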

Installing the plugin

Installing the necessary files

  • Put the FuzzyOcr.cf and the FuzzyOcr.pm files into /etc/mail/spamassassin.
  • The FuzzyOcr.cf file already contains a line to load the plugin. If you want to put the .pm file in a different location, change this line accordingly.
  • Create a wordlist file (a sample wordlist is shipped with this release) and put it in /etc/mail/spamassassin as well.
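
The steps above could be scripted roughly like this. It is a minimal sketch, not an official installer: the function name install_fuzzyocr is made up, and the wordlist is passed in as a path because its file name depends on what you created.

```shell
#!/bin/sh
# Hypothetical helper: copy the plugin files (and optionally a wordlist)
# into the SpamAssassin configuration directory.
#   $1 = directory containing FuzzyOcr.cf and FuzzyOcr.pm
#   $2 = target directory (defaults to /etc/mail/spamassassin)
#   $3 = path to your wordlist file (optional)
install_fuzzyocr() {
    src=$1
    dest=${2:-/etc/mail/spamassassin}
    wordlist=${3:-}
    cp "$src/FuzzyOcr.cf" "$src/FuzzyOcr.pm" "$dest/" || return 1
    if [ -n "$wordlist" ]; then
        cp "$wordlist" "$dest/" || return 1
    fi
}

# Example (assumed paths), run from the unpacked release directory:
# install_fuzzyocr . /etc/mail/spamassassin ./my-wordlist
```

Writing to /etc/mail/spamassassin usually requires root privileges.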

Necessary Configuration

  • Open FuzzyOcr.cf. Make sure that the logfile you specify is either a writable file or located in a directory the plugin can write to, so it can create the logfile itself.
  • Make sure that the global wordlist setting points to an existing wordlist file.
  • If any of your external programs is in a non-standard location, change the configuration file accordingly to reflect the location of the binary in question.
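
To check whether a logfile location satisfies the "writable file, or writable directory" rule, something like the following sketch can be used. The helper name can_write and the example path are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file can be written, or, when it
# does not exist yet, if its directory is writable so that the file
# can be created there.
can_write() {
    if [ -e "$1" ]; then
        [ -w "$1" ]
    else
        [ -d "$(dirname "$1")" ] && [ -w "$(dirname "$1")" ]
    fi
}

# Example with an assumed logfile path:
# can_write /var/log/FuzzyOcr.log || echo "logfile location is not writable"
```

Run the check as the user SpamAssassin runs as, not as root, since root can write almost everywhere and the result would tell you nothing.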

With these changes, FuzzyOcr is ready to work, but feel free to read the meanings of the other configuration variables in the .cf file and adjust them if you want.

Enabling the Image hashing database (optional)

The image hashing database feature allows the plugin to store a vector of image features in a database, so it "knows" an image when it arrives a second time (and therefore does not need to scan it again). The special thing about this feature is that it also recognizes an image if it was changed slightly (which spammers do). If you want to use this feature, follow these steps:

  • In version 3.x, there are 2 different operation modes for the database. The first one uses an MLDBM database file and stores much more information (recommended); the second one uses a flat file like the 2.x branch did. If you were using 2.x before, switching to the MLDBM mode now will import your old flat file, so you don't lose any hashes.

Recommended way:

  • Set the following options in your configuration file:
         focr_enable_image_hashing 2
         focr_db_hash <full_path_to_file>
         focr_db_safe <full_path_to_file>
         focr_db_max_days <number_of_days>
    
  • Make sure that the specified files are writable or that the directory is writable.
  • By default, all images recognized as spam are added to this database automatically.
  • The score is saved as well and reused later.
  • Images not recognized as spam are added to the database as well, marked as ham.
  • Using Utils/fuzzy-find.pl you can display information about hashes or remove them from the database, either by specifying a hash directly or by passing an image file to the utility.
  • Using Utils/fuzzy-stats.pl you can display daily statistics about the db.
  • After <number_of_days>, hashes expire and are removed. This is recommended to prevent an endlessly growing db.

Deprecated way:

  • Set the following options in your configuration file:
         focr_enable_image_hashing 1
         focr_digest_db <full_path_to_file>
    
  • Make sure that the specified file is writable or that the directory is writable.
  • By default, all images recognized as spam are added to this database automatically.
  • The score is saved as well and reused later.

Testing

Go into the samples subdirectory and invoke spamassassin with one of the *.eml files, for example animated-gif.eml:

spamassassin --debug FuzzyOcr < animated-gif.eml > /dev/null

If you're using amavisd-new, try this instead:

su -c "spamassassin --debug FuzzyOcr < animated-gif.eml > /dev/null" amavis
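
To run all sample mails in one go, a small loop helps. This is only a sketch; the helper name run_samples is made up, and the command to run is passed as arguments so the same loop works with or without su.

```shell
#!/bin/sh
# Hypothetical helper: feed every .eml file in a directory to a command
# on stdin. $1 is the directory, the remaining arguments are the command.
run_samples() {
    dir=$1
    shift
    for f in "$dir"/*.eml; do
        [ -f "$f" ] || continue   # skip if the glob matched nothing
        echo "== $f" >&2          # show which sample is being processed
        "$@" < "$f"
    done
}

# Example, from the release directory:
# run_samples samples spamassassin --debug FuzzyOcr > /dev/null
```

As in the single-file commands above, the processed message on stdout is discarded and the debug output stays visible on stderr.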

Tweaking scansets (advanced and optional)

Everyone gets different image spam, and most of the time no single scan method is successful with all types of spam you get. That's where the focr_scansets setting can help you. This setting takes a comma separated list of scansets. Each scanset starts with the name of a program, optionally followed by other programs connected with pipes. The only important thing is that the input for this "program chain" is a picture in the PNM format, and the output is ASCII text.

An example might clarify this:

  • This will do a single scan with gocr default settings:

focr_scansets $gocr -i $pfile

  • This will do 2 scans, one with the default settings, and the second one with a modified -l value:

focr_scansets $gocr -i $pfile, $gocr -l 180 -i $pfile

  • This will do 2 scans, one with gocr default settings, and the second one with ocrad tweaked settings:

focr_scansets $gocr -i $pfile, $ocrad -s5 -T 0.5 $pfile

  • A complicated example which shows how to use piping and redirecting in scansets:

focr_scansets pnmnorm $pfile 2>$efile | pnmquant 3 2>>$efile | pnmnorm 2>>$efile | $gocr -l 180 -d 2 -i -

You are now free to select which scansets catch the most spam for you, but don't pick too many, as each additional scanset also uses more resources.

Here are some hints:

  • If your resources allow it, use both ocrad and gocr; this will most likely catch more spam
  • If you get images which are littered with small dots/lines, try -d 2 as an argument to gocr
  • The -l setting often helps for gocr; try values like 180, 140, or 100
  • Ocrad performs best with the -s5 and -T 0.5 settings most of the time
  • pnminvert or pnmquant are useful with white text or text with many colors

A few syntax remarks:

  • Instead of writing "gocr" and "ocrad", write "$gocr" and "$ocrad" (as shown in the examples), as these will be replaced with the correct paths to your binaries as specified in the config file.
  • You can redirect the stderr output by using pnminvert 2>>$efile
  • $pfile is always replaced by the input image file in PNM format

If the scanset then fails and debug logging is enabled, you will see this stderr output in the logfile :)
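
To get a feeling for what a scansets line turns into, the substitution can be imitated in the shell. This is purely an illustration and not how the plugin does it internally: the helper name expand_scansets and the binary paths are assumptions, and the naive split breaks if a command itself contains a comma.

```shell
#!/bin/sh
# Assumed example values; FuzzyOcr takes the real ones from its config.
GOCR_BIN=/usr/bin/gocr
OCRAD_BIN=/usr/bin/ocrad

# Hypothetical helper: print each scanset of a comma separated list on
# its own line, with $gocr, $ocrad, $pfile and $efile filled in.
#   $1 = scansets string, $2 = PNM input file, $3 = stderr file
expand_scansets() {
    scansets=$1 pfile=$2 efile=$3
    printf '%s\n' "$scansets" |
        tr ',' '\n' |
        sed -e 's/^ *//' -e 's/ *$//' \
            -e 's#\$gocr#'"$GOCR_BIN"'#g' \
            -e 's#\$ocrad#'"$OCRAD_BIN"'#g' \
            -e 's#\$pfile#'"$pfile"'#g' \
            -e 's#\$efile#'"$efile"'#g'
}

# Note the single quotes, so the shell leaves the placeholders alone.
expand_scansets '$gocr -i $pfile, $ocrad -s5 -T 0.5 $pfile' /tmp/img.pnm /tmp/err.log
```

With the values above, this prints "/usr/bin/gocr -i /tmp/img.pnm" and "/usr/bin/ocrad -s5 -T 0.5 /tmp/img.pnm" on two lines. Running such an expanded command by hand on a PNM file is a handy way to see what text a scanset actually produces.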

I know this may seem confusing to some, so if anything is unclear, feel free to write an email to the list.