Installation Instructions for the 2.x branch

Please note that this manual is only valid for version 2.x.

Make sure you also read the Operating System specific notes.

If you find differences or inconsistencies between this documentation and the one supplied in your tarball, please report this so we can improve our documentation :)

Dependencies

Please install all of the following dependencies:

  • SpamAssassin, version 3.1.4 or higher
  • NetPBM Tools, preferably 10.xx or higher
  • ImageMagick
  • GifLib/Libungif
    • Please note that the source should be patched with the patch provided on the Downloads page, to fix a bug which causes segmentation faults with some images
  • Gocr, preferably version 0.40 (some people reported bad recognition with 0.41)
    • Please note that the source should be patched with the patch provided on the Downloads page, to fix a bug which causes segmentation faults with some images
    • When compiling gocr, make sure you enable NetPBM support; otherwise, recognition results are not as good
    • Please read OS specific notes when installing from RPMs
  • Perl modules:
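
For the command-line tools in the list above (not the Perl modules), a quick way to see what is still missing is to ask the shell which binaries it can find. The following is only a sketch: the binary names are assumptions about what each package usually installs (for example, convert for ImageMagick and giftext/giffix for GifLib), and missing_tools is a hypothetical helper, not part of FuzzyOcr.

```shell
# Print every name from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Assumed binary names; adjust them to what your packages actually install.
missing_tools spamassassin pnminvert convert giftext giffix gocr
```

An empty result means every listed binary was found. It does not check versions or whether the patches mentioned above were applied.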

Installing the plugin

Installing the necessary files

  • Put the FuzzyOcr.cf and the FuzzyOcr.pm files into /etc/mail/spamassassin.
  • The FuzzyOcr.cf file already contains a line to load the plugin. If you want to put the .pm file in a different location, change this line accordingly.
  • Create a wordlist file (a sample wordlist is shipped with this release) and put it in /etc/mail/spamassassin as well.
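
The copy step above could be scripted roughly as follows. This is a minimal sketch: install_fuzzyocr is a hypothetical helper, the source directory is wherever you unpacked the tarball, and the wordlist is left out because its filename depends on your release.

```shell
# Copy FuzzyOcr.cf and FuzzyOcr.pm from an unpacked tarball directory
# into the SpamAssassin configuration directory.
install_fuzzyocr() {
  src=$1
  dest=${2:-/etc/mail/spamassassin}
  for f in FuzzyOcr.cf FuzzyOcr.pm; do
    if [ ! -f "$src/$f" ]; then
      echo "missing $src/$f" >&2
      return 1
    fi
  done
  mkdir -p "$dest" && cp "$src/FuzzyOcr.cf" "$src/FuzzyOcr.pm" "$dest/"
}

# Example call (run with sufficient privileges):
# install_fuzzyocr /path/to/unpacked/tarball
```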

Necessary Configuration

  • Open FuzzyOcr.cf. Specify either a writable file as the logfile, or a directory the plugin can write to, so it can create the logfile itself.
  • Make sure that the global wordlist setting points to your wordlist file.
  • If any of your external programs is in a non-standard location, change the configuration file to reflect the location of the binary in question.
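
Permission problems with the logfile are easy to miss, so it can help to test the path before starting SpamAssassin. The sketch below uses a hypothetical helper, log_target_ok, and an example path that is only an assumption. Run it as the user SpamAssassin runs as, since root can write almost anywhere.

```shell
# Succeed if the path is a writable file or directory, or if it does
# not exist yet but its parent directory is writable.
log_target_ok() {
  target=$1
  if [ -e "$target" ]; then
    [ -w "$target" ]
  else
    [ -w "$(dirname "$target")" ]
  fi
}

# Example call with an assumed logfile path:
# log_target_ok /var/log/FuzzyOcr.log && echo "logfile path is usable"
```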

With these changes, FuzzyOcr is ready to work. The other configuration variables are described in the .cf file; feel free to read about them and adjust them if you want.

Enabling the Image hashing database (optional)

The Image hashing database feature allows the plugin to store a vector of image features in a database, so it "knows" an image when it arrives a second time (and therefore does not need to scan it again). What makes this feature special is that it also recognizes an image that has been changed slightly, which spammers often do. If you want to use this feature, follow these steps:

  • Set focr_enable_image_hashing to 1 in the config file, and make sure that focr_digest_db points to a writable file/directory.
  • You can also create this file yourself if you like.
  • By default, all images recognized as spam are added to this database automatically.
  • The score is saved as well and reused later.
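
In FuzzyOcr.cf, the two settings named above would look roughly like this. The database path is only an assumption; any location the plugin can write to should do.

```
focr_enable_image_hashing 1
focr_digest_db /etc/mail/spamassassin/FuzzyOcr.hashdb
```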

Tweaking scansets (advanced and optional)

Everyone gets different image spam, and a single scan method is rarely successful with every type of spam you get. That's where the focr_scansets setting can help you. This setting takes a comma-separated list of scansets. Each scanset starts with the name of a program, optionally followed by other programs connected with pipes. The only requirement is that the input for this "program chain" is a picture in the PNM format, and the output is ASCII text.

An example might clarify this:

  • This will do a single scan with gocr default settings:

focr_scansets gocr -i -

  • This will use pnminvert on the image and then do the scan:

focr_scansets pnminvert | gocr -i -

  • This will do 2 scans, one with the default settings, and the second one with a modified -l value:

focr_scansets gocr -i -, gocr -l 180 -i -
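
To make the format of the last example concrete, the following sketch splits a scanset list at the commas and prints one program chain per line. It only illustrates how the value is structured; split_scansets is a hypothetical helper and not the plugin's own parser.

```shell
# Split a comma-separated scanset list into one chain per line and
# trim the whitespace around each chain.
split_scansets() {
  printf '%s\n' "$1" | tr ',' '\n' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}

split_scansets 'gocr -i -, gocr -l 180 -i -'
```

Each printed line is one scan that will be run on every image, which is why a long list costs more resources.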

You are now free to select the scansets that catch the most spam for you, but don't pick too many, as each additional scanset uses more resources.

Here are some hints:

  • pnminvert or pnmquant are useful with white text or text with many colors
  • If you get images which are littered with small dots/lines, try -d 2 as an argument to gocr
  • The -l setting often helps; try values like 180, 140, or 100
  • You can also use different OCR engines (e.g. Ocrad) instead of Gocr
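
Before adding a chain to the configuration, it can be tried by hand on a sample spam image that has already been converted to PNM. The sketch below assumes such a file exists; try_scanset is a hypothetical helper, and the file name sample.pnm is a placeholder.

```shell
# Run one program chain with the given image on standard input and
# print whatever text the chain produces.
try_scanset() {
  chain=$1
  image=$2
  sh -c "$chain" < "$image"
}

# Example calls (need gocr and NetPBM installed, and a sample image):
# try_scanset 'gocr -i -' sample.pnm
# try_scanset 'pnminvert | gocr -l 180 -d 2 -i -' sample.pnm
```

Comparing the text each chain prints for the same image gives a rough idea of which variants are worth keeping.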

Two syntax remarks:

  • Instead of writing "gocr", write "$gocr", as this will be replaced with the correct path to your gocr binary. (This does not work with $ocrad in 2.x yet; use the full path to the binary instead.)
  • If you invoke custom binaries (like pnminvert for example), you can redirect the stderr output by using:

pnminvert 2>>$errfile

If the scanset then fails and debug logging is enabled, you will see this stderr output in the logfile :)
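
The redirection itself is ordinary shell syntax: 2>> appends standard error to a file and leaves standard output alone, so the OCR text still travels down the pipe. A small sketch outside the plugin, where run_with_errfile is a hypothetical helper and the error file is an ordinary file rather than the plugin's $errfile:

```shell
# Run a command line, appending anything it writes to stderr to the
# given file while stdout passes through unchanged.
run_with_errfile() {
  errfile=$1
  shift
  sh -c "$1" 2>>"$errfile"
}

# Example call:
# run_with_errfile /tmp/scan.err 'pnminvert < sample.pnm | gocr -i -'
```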

I know this may seem confusing at first. If anything is unclear, feel free to write an email to the list.