Revision 3, 7.8 kB (checked in by decoder, 2 years ago)

Added current stable and testing release
Added samples
Added patches to external toolchain

Installation manual for FuzzyOcr 2.3:

1. Dependencies you require for this plugin to work

    Before starting, also make sure to read the OS/distribution specific
    notes at the end of this section.

    1.1 SpamAssassin 3.x

        This plugin requires SpamAssassin 3.1.4.

    1.2 NetPBM tools

        Install the NetPBM tools (http://netpbm.sourceforge.net/). If you
        don't install the binaries in /usr/bin, please make sure to adjust
        FuzzyOcr.cf to point to the correct binaries.

    1.3 ImageMagick

        At least one feature requires the convert binary from ImageMagick
        (http://www.imagemagick.org/). Again, make sure the configuration
        file points to the convert binary if it is not placed in /usr/bin.

    1.4 Giflib (also known as libungif)

        Several tools from this package are required; see
        http://sourceforge.net/projects/libungif.

        Attention: the giftext binary from this package has a bug which can
        cause segfaults. A patch is provided in the patches directory that
        fixes this.

    1.5 Gocr

        For optical character recognition, gocr
        (http://jocr.sourceforge.net/) must be installed.

        Attention: the gocr binary has a bug which can cause segfaults with
        specific images. A patch is provided in the patches directory which
        fixes this.
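
        A quick way to see which of the external programs from sections
        1.2 to 1.5 are already reachable is to look them up in PATH. The
        following is only a sketch: the helper name check_binaries is made
        up for this manual, and the list of programs in the example call
        contains just the binaries named in this document, so extend it to
        match the tools referenced in your FuzzyOcr.cf.

```shell
#!/bin/sh
# Report which of the given programs can be found in PATH.
# Returns 0 if all of them are present, 1 if at least one is missing.
check_binaries() {
    missing=0
    for bin in "$@"; do
        if path=$(command -v "$bin"); then
            echo "found:   $bin ($path)"
        else
            echo "MISSING: $bin"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (binaries mentioned in this manual):
# check_binaries convert giftext giffix gocr pnminvert pnmquant
```

        A program reported as MISSING is either not installed or lives
        outside PATH; in the latter case, point FuzzyOcr.cf at its full
        path as described above.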

    1.6 Perl modules

        These Perl modules are required:
            Image::Magick
            String::Approx
            MLDBM
            DB_File
            Storable

    Notes for Fedora Core 5 (or higher) users:
        The package libungif-utils provides the necessary libungif binaries.

    Notes for other Redhat/FC users:
        The packages libungif and libungif-progs should be installed.

    Notes for Debian users:
        The package libungif-bin provides the necessary libungif binaries.

    Notes for Slackware users:
        I have no clue about this distro, but Andy Lyttle sent me a mail
        about it:

            "Slackware doesn't currently have a libungif-utils/progs/bin
            package, and the libungif package does not include the binaries
            such as giffix.  So, you have to hack it a bit.

            1. Download (or copy from CD) the /source/l/libungif directory,
               don't untar anything
            2. Edit the libungif.SlackBuild and comment out this line:
                # I don't believe we need all this slop.  Correct me if I'm wrong.
                rm -rf $PKG/usr/bin
            3. Run "sh libungif.SlackBuild"
            4. Uninstall the libungif package, if it's already installed
            5. Look in /tmp, and install the new libungif package there"
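
        The steps from the mail above could be scripted roughly as follows.
        This is an untested sketch and not part of Andy's mail: the helper
        name keep_usr_bin is made up, and the package file name pattern in
        step 5 is an assumption, so check what the build really leaves in
        /tmp.

```shell
#!/bin/sh
# Step 2: comment out the "rm -rf $PKG/usr/bin" line in a SlackBuild
# script, so that the binaries are kept in the resulting package.
keep_usr_bin() {
    script=$1
    tmp=$(mktemp) || return 1
    sed 's|^\([[:space:]]*\)\(rm -rf \$PKG/usr/bin\)|\1# \2|' "$script" > "$tmp" &&
        cat "$tmp" > "$script"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Run inside your copy of the /source/l/libungif directory (step 1):
# keep_usr_bin libungif.SlackBuild    # step 2
# sh libungif.SlackBuild              # step 3
# removepkg libungif                  # step 4
# installpkg /tmp/libungif-*.tgz      # step 5 (file name pattern assumed)
```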

    Notes for Gentoo users:
        All dependencies except the Perl modules can be installed via portage.
        Because of the bugs in giftext and gocr, however, you might need to
        write an ebuild which applies the two patches provided in the patches
        directory. The Perl modules can easily be installed with g-cpan.


2. Installing the plugin:

    2.1 Installing the required files

        Put the FuzzyOcr.cf and FuzzyOcr.pm files into /etc/mail/spamassassin.

        The FuzzyOcr.cf file already contains a line to load the plugin. If
        you want to put the .pm file in a different location, change this line
        accordingly.

        Create a wordlist file (a sample wordlist is shipped with this
        release) and put it in /etc/mail/spamassassin as well.
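
        The copy steps above can be sketched as a small shell function. The
        function name install_fuzzyocr and its arguments are made up for
        this example; the wordlist is passed in explicitly because its file
        name is up to you.

```shell
#!/bin/sh
# Copy the plugin, its configuration, and a wordlist into the
# SpamAssassin configuration directory.
install_fuzzyocr() {
    src=$1                               # directory with the unpacked release
    wordlist=$2                          # path to your wordlist file
    dest=${3:-/etc/mail/spamassassin}    # target directory from section 2.1
    mkdir -p "$dest" &&
        cp "$src/FuzzyOcr.cf" "$src/FuzzyOcr.pm" "$dest/" &&
        cp "$wordlist" "$dest/"
}

# Example call, run as root from the unpacked release directory:
# install_fuzzyocr . /path/to/your/wordlist
```

        Afterwards, running "spamassassin --lint" is a reasonable way to
        confirm that SpamAssassin still parses its configuration.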

    2.2 Necessary configuration

        No changes need to be made to the default FuzzyOcr.cf file.

3. Further adjustments

    3.1 Enabling the image hash database

        Set focr_enable_image_hashing to 1 in the config file, and make sure that
        focr_digest_db points to a writable file/directory. You can also create
        this file yourself if you like. By default, all images recognized as spam
        are added to this database automatically. The score is saved as well and
        reused later.
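
        To double-check these two settings, something like the following
        sketch can be used. The helper names cf_value and check_hash_db are
        hypothetical, and the parsing is deliberately simple: it assumes one
        "name value" pair per line. Run it as the user SpamAssassin runs as,
        since the writability test depends on who executes it.

```shell
#!/bin/sh
# Print the value of a single-valued setting from a .cf file.
# Commented-out lines are ignored; the last occurrence is printed.
cf_value() {
    awk -v key="$2" '$1 == key { val = $2 } END { if (val != "") print val }' "$1"
}

# Check that image hashing is enabled and that focr_digest_db (or, if the
# file does not exist yet, its directory) is writable by the current user.
check_hash_db() {
    cf=$1
    [ "$(cf_value "$cf" focr_enable_image_hashing)" = "1" ] || {
        echo "focr_enable_image_hashing is not set to 1"; return 1; }
    db=$(cf_value "$cf" focr_digest_db)
    [ -n "$db" ] || { echo "focr_digest_db is not set"; return 1; }
    if [ -e "$db" ]; then target=$db; else target=$(dirname "$db"); fi
    [ -w "$target" ] || { echo "$target is not writable"; return 1; }
    echo "ok: $db"
}

# Example call:
# check_hash_db /etc/mail/spamassassin/FuzzyOcr.cf
```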

    3.2 Optional storage for the hash database

        Set focr_enable_image_hashing to 2 in the config file, and make sure that
        focr_db_hash as well as focr_db_safe point to a writable file/directory.
        These files can be created by running the fuzzy-stats utility in the utils
        directory.

        These are the default values:
        focr_db_hash        /etc/mail/spamassassin/FuzzyOcr.db
        focr_db_safe        /etc/mail/spamassassin/FuzzyOcr.safe.db
        focr_db_max_days    35

        Remember to keep the default value:
        focr_hashing_learn_scanned 1

        Setting focr_score_ham to 1 in the config file will give images that score
        below the focr_counts_required threshold a score based on the formula:

            Score = focr_add_score * Words Found

        Images with fewer matched words than the required count are then saved
        in focr_db_hash with this low score. This way, messages containing
        images that are difficult to convert into legible text still receive a
        small score contribution.

        Image hashes will be removed from the DB after focr_db_max_days.
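
        As a worked example of the focr_score_ham formula above, the
        following sketch computes the score for images that stay below the
        threshold. It uses focr_add_score 0.375 from section 3.3; the
        function name ham_score is made up for this illustration and is not
        part of the plugin.

```shell
#!/bin/sh
# Score for an image below the focr_counts_required threshold:
#   Score = focr_add_score * Words Found
ham_score() {
    awk -v add="$1" -v words="$2" 'BEGIN { print add * words }'
}

ham_score 0.375 1    # prints 0.375
ham_score 0.375 2    # prints 0.75
```

        With focr_counts_required set to 3, an image with one or two matched
        words would thus be stored with a score of 0.375 or 0.75.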

    3.3 My Default Values

        These are the values used in my configuration:

        focr_base_score 5
        focr_add_score 0.375
        focr_counts_required 3
        focr_autodisable_score 20
        focr_score_ham 1

        These values allow for substantially lower scores to be stored in the database.
        When more than the required number of words are found, the plugin will add
        enough points to mark the message as spam, generating a slightly higher score
        with additional word matches.

        I have set focr_autodisable_score to 20 points in order to scan most messages,
        because if set to the default value of 10, the plugin is skipped on most
        occasions.
    3.4 Tweaking Scansets

        Everyone gets different image spam, and most of the time no single scan
        method is successful with all types of spam you get. That's where the
        focr_scansets setting can help you. This setting takes a comma separated
        list of scansets. Each scanset starts with the name of a program,
        optionally followed by further programs connected with pipes. The only
        important thing is that the input for this "program chain" is a picture
        in the PNM format, and the output is ASCII text.

        An example might clarify this:
            focr_scanset gocr -i -

        This will do a single scan with gocr default settings.

            focr_scanset pnminvert | gocr -i -

        This will use pnminvert on the image and then do the scan.

            focr_scanset gocr -i -, gocr -l 180 -i -

        This will do 2 scans, one with the default settings, and the second one with
        a modified -l value.
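
        Before putting a scanset into the config file, it can help to try the
        program chain by hand on a sample image. The sketch below only
        imitates the contract described above (PNM image in, ASCII text out)
        by feeding the image on stdin; it is not how the plugin itself
        invokes the chain. The function name run_scanset and the file names
        are hypothetical.

```shell
#!/bin/sh
# Run one scanset by hand: feed a PNM image to the program chain on
# stdin and print whatever text the chain produces.
run_scanset() {
    image=$1    # image file in PNM format
    chain=$2    # program chain, for example 'pnminvert | gocr -i -'
    sh -c "$chain" < "$image"
}

# Convert a sample GIF to PNM first (giftopnm is part of NetPBM), then
# try the three example scansets from above:
# giftopnm sample.gif > sample.pnm
# run_scanset sample.pnm 'gocr -i -'
# run_scanset sample.pnm 'pnminvert | gocr -i -'
# run_scanset sample.pnm 'gocr -l 180 -i -'
```

        Comparing the text each chain prints shows which scansets work on
        the kind of spam you receive.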

        You are now free to select the scansets that catch the most spam for
        you, but don't pick too many, as each additional scanset also uses
        more resources.

        Here are some hints:
            - pnminvert or pnmquant are useful with white text or text with
              many colors.
            - If you get images which are littered with small dots/lines, try
              -d 2 as an argument to gocr.
            - The -l setting often helps; try values like 180, 140, or 100.

        Two syntax remarks:
            - Instead of writing "gocr", write "$gocr", as this will be
              replaced with the correct path to your gocr binary.
            - If you invoke custom binaries (like pnminvert, for example), you
              can redirect the stderr output by using:
                  "pnminvert 2>>$errfile"
              If the scanset then fails and debug logging is enabled, you will
              see this stderr output in the logfile :)

        I know this may seem confusing; if anything is unclear, feel free to
        write an email to the list.


And now, where it gets most thrilling...

To be continued...