How to install FuzzyOCR on Debian Sarge and get it working

As this is Debian, we are going to stick with .deb packages all the way, but this requires some custom work. Links and versions are current as of 1 Feb 2007. Update them as required; the general idea works with whatever version you use.

As listed on the previous page, FuzzyOCR has some prerequisites that we can install with apt-get. (Other packages have build dependencies, shown below in each section.)

apt-get install netpbm
apt-get install libstring-approx-perl libmldbm-sync-perl
apt-get install dpatch

Pick a directory (e.g. /usr/src) to hold everything we need to download.

cd /usr/src/

To start with, let's get gifsicle, which isn't in Sarge but does exist as a Debian source package. gifsicle depends on quite a few X11 libraries, which I didn't want installed on a machine with no X. Fortunately, dpkg-buildpackage takes a -d option that skips the build-dependency check, so the package can be built without all of them installed. It worked for me, but YMMV.

wget ftp.debian.org/debian/pool/main/g/gifsicle/gifsicle_1.48.orig.tar.gz
wget ftp.debian.org/debian/pool/main/g/gifsicle/gifsicle_1.48-1.diff.gz
wget ftp.debian.org/debian/pool/main/g/gifsicle/gifsicle_1.48-1.dsc
dpkg-source -x gifsicle_1.48-1.dsc
cd gifsicle-1.48/
apt-get install autotools-dev libsm-dev libice-dev debhelper
dpkg-buildpackage -rfakeroot -b -d
cd ..
dpkg -i gifsicle_1.48-1_i386.deb
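
Because -d skips the dependency check, it may be worth confirming that the installed binary can find all of its shared libraries. The following is a minimal sketch, assuming ldd is available; check_libs is a made-up helper name and not part of gifsicle or dpkg.

```shell
# Hypothetical helper (not part of gifsicle or dpkg): report whether a binary
# has unresolved shared libraries according to ldd.
check_libs() {
    if ldd "$1" 2>/dev/null | grep -q 'not found'; then
        echo "missing libraries for $1"
        return 1
    fi
    echo "ok: $1"
}
```

For example, check_libs "$(command -v gifsicle)" should print an "ok:" line if nothing is missing.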

FuzzyOCR can use either gocr or ocrad to scan images, and its authors recommend both, so we build both. First, ocrad:

wget ftp.debian.org/debian/pool/main/o/ocrad/ocrad_0.16.orig.tar.gz
wget ftp.debian.org/debian/pool/main/o/ocrad/ocrad_0.16-1.dsc
wget ftp.debian.org/debian/pool/main/o/ocrad/ocrad_0.16-1.diff.gz
dpkg-source -x ocrad_0.16-1.dsc
cd ocrad-0.16/
dpkg-buildpackage -rfakeroot -b
cd ..
dpkg -i ocrad_0.16-1_i386.deb

That was simple. Now for gocr. This build creates 4 different .deb files, but only the main one is required; install the others if you want. The source package also includes an extra patch (as described on the FuzzyOCR downloads page) that is not used by default. The mv command below changes that so the patch is used.

wget ftp.debian.org/debian/pool/main/g/gocr/gocr_0.41-1.dsc
wget ftp.debian.org/debian/pool/main/g/gocr/gocr_0.41.orig.tar.gz
wget ftp.debian.org/debian/pool/main/g/gocr/gocr_0.41-1.diff.gz
dpkg-source -x gocr_0.41-1.dsc
cd gocr-0.41/
mv debian/patches_not_used/gocr-segfault.patch debian/patches
apt-get install tetex-bin autoconf libgtk1.2-dev libnetpbm10-dev transfig gs gsfonts
dpkg-buildpackage -rfakeroot -b
cd ..
dpkg -i gocr_0.41-1_i386.deb

The libungif included in Sarge is outdated, unpatched, and won't work well with FuzzyOCR. Let's get a better one. Before the 4.1.4 package, an additional patch was required, as described on the FuzzyOCR downloads page. It is now included by default; see libungif4-4.1.4/debian/patches/03_no_global_color_map.dpatch

apt-get remove giflib-bin
apt-get remove libungif-bin
wget ftp.debian.org/debian/pool/main/libu/libungif4/libungif4_4.1.4-4.dsc
wget ftp.debian.org/debian/pool/main/libu/libungif4/libungif4_4.1.4-4.diff.gz
wget ftp.debian.org/debian/pool/main/libu/libungif4/libungif4_4.1.4.orig.tar.gz
dpkg-source -x libungif4_4.1.4-4.dsc
cd libungif4-4.1.4/
dpkg-buildpackage -rfakeroot -b
cd ..
dpkg -i libungif4g_4.1.4-4_i386.deb
dpkg -i libungif-bin_4.1.4-4_i386.deb
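
The four builds above all follow the same unpack-and-build pattern, so it can be wrapped in a small function. This is only a sketch: dsc_to_srcdir and build_deb are names invented here, and it assumes the .dsc, .orig.tar.gz and .diff.gz files have already been downloaded into the current directory.

```shell
# Hypothetical helper: derive the unpacked source directory name from a .dsc
# file name, e.g. gifsicle_1.48-1.dsc -> gifsicle-1.48
dsc_to_srcdir() {
    base=${1##*/}           # strip any leading path
    base=${base%.dsc}       # gifsicle_1.48-1
    name=${base%%_*}        # gifsicle
    version=${base#*_}      # 1.48-1
    upstream=${version%-*}  # 1.48 (drop the Debian revision, if any)
    printf '%s-%s\n' "$name" "$upstream"
}

# Hypothetical helper: unpack a source package and build the binary .deb.
# Extra arguments (such as -d) are passed on to dpkg-buildpackage.
build_deb() {
    dsc=$1
    shift
    dpkg-source -x "$dsc" || return 1
    ( cd "$(dsc_to_srcdir "$dsc")" && dpkg-buildpackage -rfakeroot -b "$@" )
}
```

With that in place, build_deb gifsicle_1.48-1.dsc -d and build_deb ocrad_0.16-1.dsc would replace the dpkg-source, cd and dpkg-buildpackage lines; installing build dependencies and the final dpkg -i are still done by hand.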

That's all the prerequisites for FuzzyOCR. Now let's get the latest version and install it. There are a couple of options in FuzzyOcr.cf I should mention here.

  • focr_autodisable_score defaults to 10. This means any message that already scores above 10 skips the FuzzyOCR scan. Initially set this to something stupidly high, e.g. 100, so FuzzyOCR always runs. Later, once everything is well tested, set it back to a lower value, e.g. 10.
  • Remove pamthreshold, pamtopnm and tesseract from the "focr_bin_helper" option. The first two don't exist in the netpbm package included with Debian Sarge, and tesseract is a separate OCR program that this guide doesn't install. This isn't a critical failure. The netpbm tools are apparently included in netpbm 10.31, which isn't available as a .deb yet. Feel free to compile and install by hand.
  • Don't use "focr_enable_image_hashing 2" unless the directory you defined in "focr_db_hash" is chmod 666. Hashing creates lock files there, and at that point spamd is running as the individual user. Hashing isn't that important.
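
To see which helper programs are actually on the PATH before trimming focr_bin_helper, something like the following can help. missing_helpers is a hypothetical function, and the names in the usage line are only the programs mentioned on this page, not the full default list from FuzzyOcr.cf.

```shell
# Hypothetical helper: print every program name given as an argument
# that cannot be found on the PATH.
missing_helpers() {
    for bin in "$@"; do
        command -v "$bin" >/dev/null 2>&1 || echo "$bin"
    done
}
```

For example, missing_helpers gifsicle gocr ocrad pamthreshold pamtopnm tesseract prints one line per program that is not installed.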

wget users.own-hero.net/~decoder/fuzzyocr/fuzzyocr-3.5.1-devel.tar.gz
tar zxvf fuzzyocr-3.5.1-devel.tar.gz
cd FuzzyOcr-3.5.1/
vi FuzzyOcr.cf

In FuzzyOcr.cf, set:

focr_verbose 2
focr_logfile /var/log/FuzzyOcr.log
focr_global_wordlist /etc/spamassassin/FuzzyOcr.words
focr_enable_image_hashing 0
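
If you would rather not edit the file by hand, the same settings can be applied with a small function. This is a sketch only: set_focr_option is an invented name, it assumes GNU sed (for -i), and it assumes each option sits on its own line as "name value", possibly commented out with a leading #.

```shell
# Hypothetical helper: set "key value" in a FuzzyOcr.cf-style file.
# An existing line for the key (commented out or not) is rewritten;
# otherwise the setting is appended. Note that a comment line that merely
# starts with the key name would be rewritten too, so review the result.
set_focr_option() {
    file=$1 key=$2 value=$3
    if grep -q "^[#[:space:]]*$key[[:space:]]" "$file"; then
        sed -i "s|^[#[:space:]]*$key[[:space:]].*|$key $value|" "$file"
    else
        printf '%s %s\n' "$key" "$value" >> "$file"
    fi
}
```

For example, set_focr_option FuzzyOcr.cf focr_verbose 2 followed by set_focr_option FuzzyOcr.cf focr_logfile /var/log/FuzzyOcr.log. Values containing | or & would need escaping, which this sketch does not handle.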
    

Depending on where your default SA install is, you may need to edit FuzzyOcr.pm. The default is:

"use lib qw(/etc/mail/spamassassin);"

For Debian it probably needs to be:

"use lib qw(/etc/spamassassin);"

Create the log file as defined above.

touch /var/log/FuzzyOcr.log
chmod 666 /var/log/FuzzyOcr.log
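
The last two manual steps (the use lib edit and creating the log file) can also be scripted. A minimal sketch, assuming GNU sed and assuming the use lib line in FuzzyOcr.pm starts at the beginning of a line exactly as quoted above; fix_use_lib and make_log are names invented for this example.

```shell
# Hypothetical helper: point FuzzyOcr.pm at the Debian config directory.
# Requires GNU sed for -i. Lines that do not match are left untouched.
fix_use_lib() {
    sed -i 's|^use lib qw(/etc/mail/spamassassin);|use lib qw(/etc/spamassassin);|' "$1"
}

# Hypothetical helper: create the log file if needed and make it
# world-writable, as in the touch/chmod commands above.
make_log() {
    touch "$1" && chmod 666 "$1"
}
```

Usage would be fix_use_lib FuzzyOcr.pm (run inside the unpacked FuzzyOcr-3.5.1 directory) and make_log /var/log/FuzzyOcr.log.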

Make sure the log file rotates. Yeah, the permissions and ownership could probably be tighter. FIXME :)

vi /etc/logrotate.d/fuzzyocr
/var/log/FuzzyOcr.log {
 rotate 5
 weekly
 compress
 delaycompress
 create 666 root root
 }
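
As an alternative to typing the stanza into vi, it can be written with a here-document. This is just a sketch; write_fuzzyocr_logrotate is an invented function name, and the optional argument exists only so the output can be sent somewhere other than /etc/logrotate.d/fuzzyocr.

```shell
# Hypothetical helper: write the logrotate stanza shown above to the given
# path (default: /etc/logrotate.d/fuzzyocr).
write_fuzzyocr_logrotate() {
    target=${1:-/etc/logrotate.d/fuzzyocr}
    cat > "$target" <<'EOF'
/var/log/FuzzyOcr.log {
    rotate 5
    weekly
    compress
    delaycompress
    create 666 root root
}
EOF
}
```

After writing the file, logrotate -d /etc/logrotate.d/fuzzyocr does a debug run that reports what would happen without rotating anything.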

Now we are ready to put FuzzyOCR into the right place to be used. This includes the .cf, .pm, .scanset, .words, .preps and .mysql files and the FuzzyOcr directory.

cp -r FuzzyOcr* /etc/spamassassin/

Note: Any changes you make are not active until you restart spamassassin

Test, test and test again

spamassassin --lint
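
A plain --lint is silent when everything parses, so it can be reassuring to look at the debug output as well. The sketch below assumes the debug lines mention the plugin by name; the exact wording varies between SpamAssassin versions, and lint_fuzzyocr is an invented name.

```shell
# Hypothetical helper: run lint with debugging on and keep only the lines
# that mention FuzzyOcr. Debug output goes to stderr, hence the 2>&1.
lint_fuzzyocr() {
    spamassassin -D --lint 2>&1 | grep -i 'fuzzyocr'
}
```

If lint_fuzzyocr prints nothing, the plugin is probably not being loaded at all, which is worth checking before moving on.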

If you see this error message:

failed to parse plugin /etc/spamassassin/FuzzyOcr.pm: Can't locate Mail/SpamAssassin/Logger.pm in @INC (@INC contains: /usr/share/perl5 /etc/perl /usr/local/lib/perl/5.8.4 /usr/local/share/perl/5.8.4 /usr/lib/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl) at /etc/spamassassin/FuzzyOcr.pm line 11.
BEGIN failed--compilation aborted at /etc/spamassassin/FuzzyOcr.pm line 11.

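You must install the Perl module Mail::SpamAssassin::Logger. It ships as part of the Mail::SpamAssassin distribution, so the CPAN shell will install that distribution rather than a single standalone module.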

# perl -MCPAN -e shell
cpan> install Mail::SpamAssassin::Logger

You will probably need at least these extra packages, and possibly more depending on your setup.

apt-get install liblog-agent-logger-perl
apt-get install libmldbm-sync-perl
apt-get install libmldbm-perl
apt-get install libstring-approx-perl libtie-cache-perl
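
To check which Perl modules are still missing without waiting for the next lint failure, each one can be test-loaded. This is a sketch; have_perl_module and report_perl_modules are invented names, and the module names in the usage line are my assumption of what the packages above provide.

```shell
# Hypothetical helper: succeed if the named Perl module can be loaded.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Hypothetical helper: print an ok/MISSING line for each module named.
report_perl_modules() {
    for mod in "$@"; do
        if have_perl_module "$mod"; then
            echo "ok      $mod"
        else
            echo "MISSING $mod"
        fi
    done
}
```

For example: report_perl_modules Mail::SpamAssassin::Logger String::Approx MLDBM MLDBM::Sync Tie::Cache Log::Agent::Logger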

For extra logging, make sure spamd (the spamassassin daemon) is started with -s /var/log/spamassassin as an option. Check that log file while doing a restart.

Once all that's done, run some tests on FuzzyOCR. Read the README file; it describes what the output should look like. If your output isn't similar, check the log files and re-test.

cd /usr/src/FuzzyOcr-3.5.1/samples/
less README
spamassassin -t ocr-animated.eml
spamassassin -t ocr-gif.eml
spamassassin -t ocr-jpg.eml
spamassassin -t ocr-multi.eml
spamassassin -t ocr-obfuscated.eml
spamassassin -t ocr-png.eml
spamassassin -t ocr-wrongext.eml
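
The same tests can be run in one go with a loop. A minimal sketch, assuming the FuzzyOcr rule names in the report contain the word "fuzzy" (check the README for the real names); run_samples is an invented function name.

```shell
# Hypothetical helper: run every ocr-*.eml sample in a directory through
# spamassassin -t and show only the report lines that mention "fuzzy".
run_samples() {
    dir=${1:-/usr/src/FuzzyOcr-3.5.1/samples}
    for msg in "$dir"/ocr-*.eml; do
        [ -f "$msg" ] || continue
        echo "== $msg"
        spamassassin -t "$msg" | grep -i 'fuzzy' || echo "   (no FuzzyOcr hits)"
    done
}
```

Calling run_samples with no argument uses the samples directory from above. Any sample that reports no hits deserves a closer look in the full spamassassin -t output and in /var/log/FuzzyOcr.log.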

If all that comes up clean, FuzzyOCR is working. Feel free to reset focr_autodisable_score to something normal and say goodbye to image spam. If you have corrections, feel free to update this page. This is just what I did.

--

Sabre