Installation Instructions for the 3.5.x branch

Please note that this manual is only valid for version 3.5.x.

Make sure you also read the Operating System specific notes.

If you find differences or inconsistencies between this documentation and the one supplied in your tarball, please report this so we can improve our documentation :)

Dependencies

Please install all of the following dependencies:

  •  SpamAssassin, version 3.1.4 or higher (SA 3.2 currently works only with the SVN version of this plugin)
  •  NetPBM Tools, preferably 10.xx or higher
  •  GifSicle
  •  GifLib/Libungif
    • Please note that the source should be patched with the patch provided on the Downloads page, to fix a bug which causes segmentation faults with some images
  • At least one OCR Engine (though multiple engines at once are possible), for example:
    •  Ocrad OCR Engine (at least 0.14!)
      • Ocrad and gocr each perform better in some situations, but Ocrad is nowadays considered the best choice
    •  Gocr, preferably version 0.43
      • Please note that version 0.43 no longer needs to be patched against segmentation faults; only 0.40 requires the patch
      • Versions 0.41 and 0.42 should NOT be used.
      • When compiling gocr, make sure you enable NetPBM support, otherwise, results are not as good
      • Please read OS specific notes when installing from RPMs
  • Perl modules:
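For the external programs in the list above (not the Perl modules), a quick check of what is already on your PATH can look like the following sketch. The function name check_programs is made up for this example, and the binary names in the usage comment are assumptions that may differ on your system.

```shell
#!/bin/sh
# Report which of the given program names can be found on PATH.
# Returns non-zero if at least one of them is missing.
check_programs() {
    missing=0
    for prog in "$@"; do
        if command -v "$prog" > /dev/null 2>&1; then
            echo "found:   $prog"
        else
            echo "missing: $prog"
            missing=1
        fi
    done
    return "$missing"
}

# Example (binary names are assumptions, adjust them to your system):
# check_programs spamassassin gifsicle ocrad gocr
```

This only tells you whether a program exists, not whether its version is recent enough, so the version notes above still have to be checked by hand.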

Installing the plugin

Installing the necessary files

  • Put the FuzzyOcr.cf, FuzzyOcr.scansets, FuzzyOcr.preps and the FuzzyOcr.pm files, as well as the FuzzyOcr/ folder into /etc/mail/spamassassin.
  • The FuzzyOcr.cf file already contains a line to load the plugin. If you want to put the .pm file in a different location, change this line accordingly.
  • Create a wordlist file (a sample wordlist is shipped with this release) and put it in /etc/mail/spamassassin as well.
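The copy step above can be sketched as a small shell function. The function name install_fuzzyocr and the source path in the usage comment are placeholders invented for this example; the wordlist is left out on purpose, since you create it yourself.

```shell
#!/bin/sh
# Copy the FuzzyOcr files from an unpacked release directory into the
# SpamAssassin configuration directory. Both arguments are placeholders.
install_fuzzyocr() {
    src=$1
    dest=${2:-/etc/mail/spamassassin}
    if [ ! -d "$dest" ]; then
        echo "destination $dest does not exist" >&2
        return 1
    fi
    for f in FuzzyOcr.cf FuzzyOcr.scansets FuzzyOcr.preps FuzzyOcr.pm; do
        cp "$src/$f" "$dest/" || return 1
    done
    # The FuzzyOcr/ folder is copied recursively.
    cp -R "$src/FuzzyOcr" "$dest/" || return 1
}

# Example (source path is hypothetical):
# install_fuzzyocr /path/to/unpacked/release
```

You will usually need root privileges to write into /etc/mail/spamassassin.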

Necessary Configuration

  • Open FuzzyOcr.cf. If you want to log anything, specify a logfile that is writable, or one located in a directory the plugin can write to, so that it can create the file itself. The log level is set with the focr_verbose option.
  • Make sure that you specify a correct file as global wordlist.
  • If any of your external programs is in a non-standard location, change the configuration file accordingly to reflect the location of the binary in question.
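Whether a configured file is usable can be checked up front with a sketch like the one below. The function name can_write and the path in the usage comment are hypothetical. Run the check as the user SpamAssassin runs as, since that is the account that needs the write permission.

```shell
#!/bin/sh
# Succeed if the target is a writable regular file, or if it does not exist
# yet but its parent directory is writable (so the file could be created).
can_write() {
    target=$1
    if [ -e "$target" ]; then
        [ -f "$target" ] && [ -w "$target" ]
    else
        [ -w "$(dirname "$target")" ]
    fi
}

# Example (path is hypothetical):
# can_write /var/log/FuzzyOcr.log && echo "ok" || echo "not writable"
```

The same check applies to any other file the plugin has to write, such as the image hashing database files.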

With these changes, FuzzyOcr is ready to work, but feel free to read the meanings of the other configuration variables in the .cf file and adjust them if you want.

Enabling the Image hashing database (optional)

The Image hashing database feature allows the plugin to store a vector of image features to a database, so it "knows" this image when it arrives a second time (and therefore does not need to scan it again). The special thing about this function is that it also recognizes the image again if it was changed slightly (which is done by spammers). If you want to use this feature, follow these steps:

  • In version 3.5.x, the database has three operation modes. The first (option "2", recommended) uses an MLDBM database file and stores much more information. The second (option "1") uses a flat file, as the 2.x branch did. If you used 2.x before, switching to MLDBM mode imports your old flat file, so you don't lose any hashes. The third (option "3") is an experimental interface to a MySQL database, which you need to initialize with the supplied mysql file.

Recommended way:

  • Set the following options in your configuration file:
         focr_enable_image_hashing 2
         focr_db_hash <full_path_to_file>
         focr_db_safe <full_path_to_file>
         focr_db_max_days <number_of_days>
    
  • Make sure that the specified files are writable or that the directory is writable.
  • By default, all images recognized as spam are added to this database automatically.
  • The score is saved as well and reused later.
  • Images not recognized as spam are added to the database too, as HAM.
  • With Utils/fuzzy-find.pl you can display information about hashes or remove them from the database, either by specifying a hash directly or by passing an image file to the utility.
  • With Utils/fuzzy-stats.pl you can display daily statistics about the database.
  • After <number_of_days>, hashes expire and are removed. This is recommended to keep the database from growing endlessly.

Deprecated way:

  • Set the following options in your configuration file:
         focr_enable_image_hashing 1
         focr_digest_db <full_path_to_file>
    
  • Make sure that the specified file is writable or that the directory is writable.
  • By default, all images recognized as spam are added to this database automatically.
  • The score is saved as well and reused later.

Experimental way:

  • Set the following options in your configuration file:
         focr_enable_image_hashing 3
         focr_mysql_* (Adjust all values)
    
  • You need to initialize the database with the supplied mysql file
  • By default, all images recognized as spam are added to this database automatically.
  • The score is saved as well and reused later.
  • Images not recognized as spam are added to the database too, as HAM.
  • There is currently no utility to manage this type of database.

Testing

Go into the samples subdirectory and invoke spamassassin with the *.eml files, for example with the animated-gif.eml:

spamassassin --debug FuzzyOcr < animated-gif.eml > /dev/null

If you're using amavisd-new, try this instead:

su -c "spamassassin --debug FuzzyOcr < animated-gif.eml > /dev/null" amavis
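To run all sample messages in one go, the command above can be wrapped in a loop. This is a minimal sketch: the function name scan_samples and the SPAMASSASSIN override variable are made up for this example and are not part of FuzzyOcr or SpamAssassin.

```shell
#!/bin/sh
# Run every .eml file in a directory through SpamAssassin with FuzzyOcr
# debugging enabled. The message itself (stdout) is discarded, so only the
# debug output on stderr remains visible.
# SPAMASSASSIN is a hypothetical override for the binary name.
scan_samples() {
    dir=${1:-.}
    sa=${SPAMASSASSIN:-spamassassin}
    for msg in "$dir"/*.eml; do
        [ -e "$msg" ] || continue    # the glob matched nothing
        echo "== $msg"
        "$sa" --debug FuzzyOcr < "$msg" > /dev/null
    done
}

# Example, from inside the samples subdirectory:
# scan_samples .
```

With amavisd-new, run the function as the amavis user, as in the su example above.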

Tweaking scansets (advanced and optional)

Everyone gets different image spam, and in most cases a single scanning method does not work for all the types of spam you get. That's where scansets can help you. The file FuzzyOcr.scansets contains all scansets, and the FuzzyOcr.preps file contains all preprocessors.

Both files (as well as the wordlist file) accept comments starting with #, either on a line of their own or at the end of a regular line.

Preprocessors

A preprocessor is a unit which receives input, and gives output. Many preprocessors can be chained in a scanset to modify the data the OCR program will receive.

An example:

# requires ImageMagick convert
preprocessor maketiff {
    command = convert
    args = $input tiff:$output
}

This example consists of a label (maketiff, must be unique), a command, and args.

The "command" line is required. It contains the command to be executed, without ANY arguments. It can either be a binary name (with or without path), or a macro like $program, where "program" is a registered helper application (see FuzzyOcr.cf). Those macros are automatically replaced with the correct path and binary name.

The "args" line is optional. It contains the arguments, which are appended to the command. Two special macros are accepted here: $input is replaced with the input filename of the preprocessor, and $output with the output filename. You cannot influence the filenames themselves; you can only specify where in the full command they appear. Omitting $input in args implies input from STDIN, omitting $output implies output to STDOUT.

You do not need to worry about compatibility between chained preprocessors; FuzzyOcr does all the work. If a preprocessor only accepts STDIN data, FuzzyOcr feeds it the contents of the file on STDIN. STDOUT data is likewise captured in a file internally.

The preprocessor in our example receives input and outputs TIFF data using ImageMagick. The first preprocessor in the chain receives a .pnm file as $input.

The second preprocessor in a chain will receive the output of the first preprocessor, and so on...
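To try out a chain by hand before putting it into a scanset, the idea of passing data from step to step through intermediate files can be sketched in shell. This is only an illustration of the concept, not FuzzyOcr's actual implementation. The function name run_chain is made up, and it assumes every filter reads STDIN and writes STDOUT.

```shell
#!/bin/sh
# Feed a file through a series of filter commands, one after the other.
# Each step reads the previous step's output from a temporary file.
run_chain() {
    input=$1
    shift
    workdir=$(mktemp -d) || return 1
    current=$input
    step=0
    for filter in "$@"; do
        step=$((step + 1))
        next="$workdir/step$step"
        # $filter is deliberately unquoted so "tr a-z A-Z" splits into words.
        if ! $filter < "$current" > "$next"; then
            rm -rf "$workdir"
            return 1
        fi
        current=$next
    done
    cat "$current"    # the final result goes to STDOUT
    rm -rf "$workdir"
}

# Example (assumes the NetPBM tools pnmnorm and pnminvert are installed):
# run_chain image.pnm pnmnorm pnminvert pnmnorm > prepared.pnm
# gocr -i prepared.pnm
```

The pipes and redirections here belong to your interactive shell only. In the .preps and .scansets files themselves they are not allowed.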

Scansets

A scanset has a syntax and behavior similar to those of a preprocessor. Here is an example:

scanset gocr-invert {
    preprocessors = normalize, invert, normalize
    command = $gocr
    args = -i $input
}

The "command" and "args" lines behave the same way as they do in preprocessors.

A new line here is the "preprocessors" line. It is optional and specifies a comma-separated list of preprocessor labels. All preprocessors in this list are chained: the first preprocessor receives a .pnm file (or PNM data if it reads STDIN), and the output of the last preprocessor is the input for the OCR command (specified by command and args).

Assuming that "normalize" and "invert" are preprocessors here in our example, then this scanset will first normalize the PNM data, then invert the picture, then normalize the PNM data again, and then use gocr on it. gocr outputs to STDOUT here, but if the program supports it, it could also output to a file using $output.

WARNING: DO NOT USE SHELL REDIRECTIONS OR PIPES IN COMMAND OR ARGS

FuzzyOcr will refuse to accept shell redirections or pipes in commands/args because they are dangerous in combination with exec(). exec() would then run the command through a shell, spawning multiple processes that we can't kill later. If you think you need a pipe or a redirection, something is wrong with your setup, as the preprocessor/scanset system allows everything that would be possible with a pipe or redirection.
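Before loading an edited file, you can look for such characters yourself. The following is a rough sketch; the function name find_shell_metachars is made up, and the pattern only covers |, < and >, which may not match exactly what FuzzyOcr itself rejects.

```shell
#!/bin/sh
# Print (with line numbers) every "command" or "args" line in a scanset or
# preprocessor file that contains a pipe or a redirection character.
# The exit status is 0 if such a line was found and non-zero otherwise.
find_shell_metachars() {
    grep -nE '^[[:space:]]*(command|args)[[:space:]]*=.*[|<>]' "$1"
}

# Example:
# find_shell_metachars /etc/mail/spamassassin/FuzzyOcr.scansets
```

Note that a trailing # comment containing one of these characters is reported as well, so check each hit by hand.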

There is one more option available in scansets, called "force_output_in". It forces FuzzyOcr to read the OCR output from the given file instead of from the file it used for $output. ($output itself can also be used in this option and is substituted properly.) An example with Tesseract will clarify this:

scanset tesseract {
    preprocessors = maketiff
    command = $tesseract
    args = $input $output batch 
    force_output_in = $output.txt
}

Tesseract receives an $input file (which must be in TIFF format, hence the maketiff preprocessor) and takes an $output argument. But it then creates three files:

  • $output.map
  • $output.raw
  • $output.txt

($output is here the filename that FuzzyOcr created for this purpose). Now FuzzyOcr would normally try to read from $output, but the real OCR output is in $output.txt.

Hence we force FuzzyOcr to use $output.txt instead.

You are now free to select the scansets that catch the most spam for you, but don't pick too many, as this will also use more resources. This release of FuzzyOcr supplies example scansets for popular gocr and ocrad settings, as well as an experimental tesseract scanset (not recommended).

Here are some hints:

  • If your resources allow it, use both ocrad and gocr; this will most likely catch more spam
  • If you get images which are littered with small dots/lines, try -d 2 as an argument to gocr
  • The -l setting often helps for gocr; try values like 180, 140, or 100
  • Ocrad usually performs best with the -s5 and -T 0.4 settings
  • Dark images with bright text can be scanned with Ocrad and the -i switch (inverts the picture)
  • pnminvert or pnmquant are useful with white text or text with many colors

If a scanset fails and debug logging is enabled, you will see the stderr output of the failing command in the logfile :)

I know this may seem confusing, so if anything is unclear, feel free to write an email to the list.