How to install FuzzyOCR on Debian Sarge and have it work
As this is Debian we are going to stick with .debs all the way, but this is going to require some custom work. Links and versions are current as of 1 Feb 2007. Update as required; the general idea works with whatever version you use.
As listed on the previous page, FuzzyOCR has some pre-requisites that can be installed with apt. (Other packages have dependencies, as shown below in each section.)
apt-get install netpbm
apt-get install libstring-approx-perl libmldbm-sync-perl
apt-get install dpatch
Pick a directory (e.g. /usr/src) to put all the bits we need to download.
cd /usr/src/
To start with, let's get gifsicle, which isn't in Sarge but does exist as a Debian source package. Its build dependencies include quite a few X11 libraries, which I didn't want installed on a machine with no X. Fortunately the dpkg-buildpackage command takes a -d option, which skips the build-dependency check so the package can be built without them. It worked for me but YMMV.
wget ftp.debian.org/debian/pool/main/g/gifsicle/gifsicle_1.48.orig.tar.gz
wget ftp.debian.org/debian/pool/main/g/gifsicle/gifsicle_1.48-1.diff.gz
wget ftp.debian.org/debian/pool/main/g/gifsicle/gifsicle_1.48-1.dsc
dpkg-source -x gifsicle_1.48-1.dsc
cd gifsicle-1.48/
apt-get install autotools-dev libsm-dev libice-dev debhelper
dpkg-buildpackage -rfakeroot -b -d
cd ..
dpkg -i gifsicle_1.48-1_i386.deb
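Every source package in this guide is fetched as the same three files (.orig.tar.gz, .diff.gz, .dsc) from the Debian pool. As a rough sketch, a small helper could print those URLs for a given package. The function name src_urls is made up for illustration; the pool layout (first letter of the package name, or "lib" plus one letter for library packages) is taken from the URLs used on this page.

```shell
# Print the three source-package URLs for a Debian package.
# Usage: src_urls PACKAGE UPSTREAM_VERSION DEBIAN_REVISION
# (src_urls is a hypothetical helper, not part of any Debian tool.)
src_urls() {
    pkg=$1 ver=$2 rev=$3
    # Pool directories use the first letter of the source package name,
    # or "lib" plus the next letter for names starting with "lib".
    case $pkg in
        lib*) dir=$(printf '%s' "$pkg" | cut -c1-4) ;;
        *)    dir=$(printf '%s' "$pkg" | cut -c1) ;;
    esac
    base="ftp.debian.org/debian/pool/main/$dir/$pkg"
    echo "$base/${pkg}_${ver}.orig.tar.gz"
    echo "$base/${pkg}_${ver}-${rev}.diff.gz"
    echo "$base/${pkg}_${ver}-${rev}.dsc"
}

# Example: list the gifsicle files used above.
src_urls gifsicle 1.48 1
```

Something like `wget $(src_urls gifsicle 1.48 1)` would then fetch all three files in one go, which makes it easier to bump the version later.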
FuzzyOCR can use either gocr or ocrad to scan images, and its authors recommend both, so we install both. First ocrad:
wget ftp.debian.org/debian/pool/main/o/ocrad/ocrad_0.16.orig.tar.gz
wget ftp.debian.org/debian/pool/main/o/ocrad/ocrad_0.16-1.dsc
wget ftp.debian.org/debian/pool/main/o/ocrad/ocrad_0.16-1.diff.gz
dpkg-source -x ocrad_0.16-1.dsc
cd ocrad-0.16/
dpkg-buildpackage -rfakeroot -b
cd ..
dpkg -i ocrad_0.16-1_i386.deb
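The .deb file names on this page assume an i386 machine. On another architecture the built file has a different suffix, so here is a minimal sketch that works out the name instead of hard-coding it. The helper name deb_file is invented for this example, and the i386 fallback is only an assumption for when dpkg is not available.

```shell
# Print the file name dpkg-buildpackage produces for a binary package.
# Usage: deb_file PACKAGE UPSTREAM_VERSION DEBIAN_REVISION [ARCH]
# (deb_file is a hypothetical helper for illustration.)
deb_file() {
    # Use the given architecture, else ask dpkg, else fall back to i386.
    arch=${4:-$(dpkg --print-architecture 2>/dev/null || echo i386)}
    echo "${1}_${2}-${3}_${arch}.deb"
}

deb_file ocrad 0.16 1 i386   # prints ocrad_0.16-1_i386.deb
```

With that, the install step could be written as `dpkg -i "$(deb_file ocrad 0.16 1)"` and should pick up the right file on non-i386 boxes.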
That was simple, and now for gocr. This build creates 4 different .debs but only the main one is required. Install the others if you want. You will also notice that an extra patch (as described on the FuzzyOCR downloads page) is included in the source package but not used by default. The commands below change that so the patch is used.
wget ftp.debian.org/debian/pool/main/g/gocr/gocr_0.41-1.dsc
wget ftp.debian.org/debian/pool/main/g/gocr/gocr_0.41.orig.tar.gz
wget ftp.debian.org/debian/pool/main/g/gocr/gocr_0.41-1.diff.gz
dpkg-source -x gocr_0.41-1.dsc
cd gocr-0.41/
mv debian/patches_not_used/gocr-segfault.patch debian/patches
apt-get install tetex-bin autoconf libgtk1.2-dev libnetpbm10-dev transfig gs gsfonts
dpkg-buildpackage -rfakeroot -b
cd ..
dpkg -i gocr_0.41-1_i386.deb
The libungif included in the Sarge distro is outdated and unpatched, and won't work well with FuzzyOCR. Let's get a better one. Before the 4.1.4 package an additional patch was required, as described on the FuzzyOCR downloads page. It is now included by default; see libungif4-4.1.4/debian/patches/03_no_global_color_map.dpatch
apt-get remove giflib-bin
apt-get remove libungif-bin
wget ftp.debian.org/debian/pool/main/libu/libungif4/libungif4_4.1.4-4.dsc
wget ftp.debian.org/debian/pool/main/libu/libungif4/libungif4_4.1.4-4.diff.gz
wget ftp.debian.org/debian/pool/main/libu/libungif4/libungif4_4.1.4.orig.tar.gz
dpkg-source -x libungif4_4.1.4-4.dsc
cd libungif4-4.1.4/
dpkg-buildpackage -rfakeroot -b
cd ..
dpkg -i libungif4g_4.1.4-4_i386.deb
dpkg -i libungif-bin_4.1.4-4_i386.deb
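Before moving on it can help to confirm the helper programs built so far are actually on the PATH. The following is only a sketch: check_helpers is a made-up function name, and the list passed to it covers just the three programs named on this page, so extend it to match whatever your focr_bin_helper line contains.

```shell
# Report which of the named programs are missing from PATH.
# Prints one "missing: NAME" line per absent program and
# returns non-zero if any program is missing.
# (check_helpers is a hypothetical helper for illustration.)
check_helpers() {
    status=0
    for prog in "$@"; do
        if ! command -v "$prog" >/dev/null 2>&1; then
            echo "missing: $prog"
            status=1
        fi
    done
    return $status
}

# Programs installed by the steps above.
check_helpers gifsicle ocrad gocr || echo "some helpers are not installed yet"
```

If this prints nothing, all three were found. Anything listed as missing is worth sorting out before touching the FuzzyOCR configuration.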
That's all the pre-requisites for FuzzyOCR. Now let's get the latest version and install it. There are a few options in FuzzyOcr.cf I should mention here.
- focr_autodisable_score has a default of 10. This means any message that already scores above 10 skips the FuzzyOCR scan. Initially set this to something stupidly high, e.g. 100, so FuzzyOCR always runs. Later, once it is well tested, set it back to a lower value, e.g. 10.
- Remove pamthreshold, pamtopnm and tesseract from the "focr_bin_helper" option, as they are not available here: pamthreshold and pamtopnm don't exist in the netpbm package included with Debian Sarge (they are apparently included in netpbm 10.31, which isn't in deb format yet), and tesseract is a separate OCR engine that this guide does not install. This isn't a critical failure. Feel free to compile/install them by hand.
- Don't use "focr_enable_image_hashing 2" unless the directory you defined in "focr_db_hash" is chmod 666. Hashing creates lock files, and when spamd runs as individual users those users need write access. Hashing isn't that important.
wget users.own-hero.net/~decoder/fuzzyocr/fuzzyocr-3.5.1-devel.tar.gz
tar zxvf fuzzyocr-3.5.1-devel.tar.gz
cd FuzzyOcr-3.5.1/
vi FuzzyOcr.cf
and modify:
focr_verbose 2
focr_logfile /var/log/FuzzyOcr.log
focr_global_wordlist /etc/spamassassin/FuzzyOcr.words
focr_enable_image_hashing 0
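If you would rather not edit the file by hand (say, to repeat this on several machines), the same settings could be applied with a small script. This is a sketch only: set_option is an invented helper, and it assumes each option sits on its own line as "name value", which is how the settings above are written.

```shell
# Set KEY to VALUE in a "key value" style config file.
# Replaces any existing uncommented line for KEY in place,
# or appends a new line if KEY is not set anywhere.
# (set_option is a hypothetical helper for illustration.)
set_option() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        $1 == k { print k " " v; found = 1; next }
        { print }
        END { if (!found) print k " " v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (run from the unpacked FuzzyOcr-3.5.1 directory):
#   set_option FuzzyOcr.cf focr_verbose 2
#   set_option FuzzyOcr.cf focr_logfile /var/log/FuzzyOcr.log
#   set_option FuzzyOcr.cf focr_global_wordlist /etc/spamassassin/FuzzyOcr.words
#   set_option FuzzyOcr.cf focr_enable_image_hashing 0
```

Commented-out lines are left alone on purpose, so the documentation comments in the shipped file survive.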
Depending on where your SpamAssassin configuration directory is, you may need to edit FuzzyOcr.pm. The default is:
"use lib qw(/etc/mail/spamassassin);"
For Debian it probably needs to be:
"use lib qw(/etc/spamassassin);"
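The same change could be scripted rather than made in an editor. A minimal sketch, assuming the "use lib qw(...);" statement appears on a single line as quoted above; fix_use_lib is a name invented for this example.

```shell
# Rewrite the "use lib qw(...);" line in a Perl file to point at SA_DIR.
# Usage: fix_use_lib FILE SA_DIR
# (fix_use_lib is a hypothetical helper for illustration.)
fix_use_lib() {
    pm_file=$1 sa_dir=$2
    tmp=$(mktemp) || return 1
    # In basic regular expressions the parentheses match literally.
    sed "s|use lib qw([^)]*);|use lib qw(${sa_dir});|" "$pm_file" > "$tmp" &&
        cat "$tmp" > "$pm_file"
    rm -f "$tmp"
}

# Example (run from the unpacked FuzzyOcr-3.5.1 directory):
#   fix_use_lib FuzzyOcr.pm /etc/spamassassin
```

Check the result with `grep 'use lib' FuzzyOcr.pm` before copying the file into place.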
Create the log file as defined above.
touch /var/log/FuzzyOcr.log
chmod 666 /var/log/FuzzyOcr.log
Make sure the log file rotates. Yeah, the permissions/ownership could probably be better. FIXME :)
vi /etc/logrotate.d/fuzzyocr
/var/log/FuzzyOcr.log {
rotate 5
weekly
compress
delaycompress
create 666 root root
}
Now we are ready to put FuzzyOCR into the right place to be used. This includes the .cf, .pm, .scanset, .words, .preps and .mysql files and the FuzzyOcr directory.
cp -r FuzzyOcr* /etc/spamassassin/
Note: any changes you make do not take effect until you restart SpamAssassin.
Test, test and test again
spamassassin --lint
If you see this error message:
failed to parse plugin /etc/spamassassin/FuzzyOcr.pm: Can't locate Mail/SpamAssassin/Logger.pm in @INC (@INC contains: /usr/share/perl5 /etc/perl /usr/local/lib/perl/5.8.4 /usr/local/share/perl/5.8.4 /usr/lib/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl) at /etc/spamassassin/FuzzyOcr.pm line 11. BEGIN failed--compilation aborted at /etc/spamassassin/FuzzyOcr.pm line 11.
You must install the Perl module Mail::SpamAssassin::Logger.
# perl -MCPAN -e shell
cpan> install Mail::SpamAssassin::Logger
You will probably need at least the following extra packages, and possibly more depending on your situation.
apt-get install liblog-agent-logger-perl
apt-get install libmldbm-sync-perl
apt-get install libmldbm-perl
apt-get install libstring-approx-perl libtie-cache-perl
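Perl reports a missing module as a file path (Mail/SpamAssassin/Logger.pm), while CPAN wants the module name (Mail::SpamAssassin::Logger). As a rough sketch, a filter like the one below could pull the module names out of the lint output. missing_perl_modules is a made-up name, and it relies only on the "Can't locate ... in @INC" wording shown in the error above.

```shell
# Read lint output on stdin and print each Perl module named in a
# "Can't locate Some/Module.pm in @INC" error, in Some::Module form.
# (missing_perl_modules is a hypothetical helper for illustration.)
missing_perl_modules() {
    sed -n "s/.*Can't locate \([^ ]*\)\.pm in @INC.*/\1/p" |
        sed 's|/|::|g' |
        sort -u
}

# Example:
#   spamassassin --lint 2>&1 | missing_perl_modules
```

Each line of output is a module name that can be looked up as a Debian package or handed to the cpan shell.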
For extra logging, make sure spamd is started with the option -s /var/log/spamassassin. Check the log file while doing a restart.
Once all that's done, run some tests on FuzzyOCR. Read the README file; it describes what the output should be. If your output isn't similar, check the log files and re-test.
cd /usr/src/FuzzyOcr-3.5.1/samples/
less README
spamassassin -t ocr-animated.eml
spamassassin -t ocr-gif.eml
spamassassin -t ocr-jpg.eml
spamassassin -t ocr-multi.eml
spamassassin -t ocr-obfuscated.eml
spamassassin -t ocr-png.eml
spamassassin -t ocr-wrongext.eml
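Running the samples one by one gets tedious when re-testing, so here is a sketch of a loop that scans every sample and prints a one-line summary for each. Two assumptions to flag: check_samples and SCAN_CMD are names invented here, and the check looks for rule names starting with FUZZY_OCR in the report, which is what I believe the plugin's rules are called. The README remains the reference for what each sample should produce.

```shell
# Command used to scan one message read from stdin; override for testing.
SCAN_CMD=${SCAN_CMD:-spamassassin -t}

# Scan every .eml file in DIR and report whether the output mentions
# FUZZY_OCR (assumed rule-name prefix; compare against the README).
# (check_samples is a hypothetical helper for illustration.)
check_samples() {
    for eml in "$1"/*.eml; do
        [ -f "$eml" ] || continue
        if $SCAN_CMD < "$eml" 2>/dev/null | grep -q 'FUZZY_OCR'; then
            echo "HIT  $(basename "$eml")"
        else
            echo "MISS $(basename "$eml")"
        fi
    done
}

# Example:
#   check_samples /usr/src/FuzzyOcr-3.5.1/samples
```

A MISS is a prompt to run that sample by hand and read the full report and the FuzzyOcr log, not proof of a broken install.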
If all that comes up clean, FuzzyOCR is working. Feel free to reset the focr_autodisable_score value to something normal and say goodbye to image spam. If you have any corrections, feel free to update this page. This is just what I did.
--
Sabre
