root/tags/FuzzyOcr-2.3j/CHANGES

Revision 3, 12.1 kB (checked in by decoder, 2 years ago)

Added current stable and testing release
Added samples
Added patches to external toolchain

1 version 2.3j:
2     Fixed:
3         sh: $efile: ambiguous redirect
4         This message was being generated when using complex scansets, because
5         the 'value' was only translated once. In complex scansets, this value
6         may be specified multiple times.
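
The difference between translating the placeholder once and translating every occurrence can be sketched as follows. This is Python for illustration only (the plugin is Perl); the `expand_scanset` helper and the two-stage command string are hypothetical, not taken from the plugin.

```python
def expand_scanset(command: str, pfile: str, efile: str) -> str:
    """Replace every $pfile/$efile placeholder, not just the first one."""
    # str.replace without a count argument substitutes all occurrences.
    return command.replace("$pfile", pfile).replace("$efile", efile)


# Hypothetical two-stage scanset; both stages redirect stderr to $efile.
template = "stage1 $pfile 2>$efile | stage2 - 2>>$efile"

# Replacing only the first occurrence leaves a literal "$efile" behind,
# which is what the shell complained about.
buggy = template.replace("$efile", "/tmp/x/img.err", 1)
fixed = expand_scanset(template, "/tmp/x/img.pnm", "/tmp/x/img.err")
```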
7
8         FuzzyOcr.cf
9         Fixed outstanding errors. Variable mismatches are now fixed.
10
11         FuzzyOcr.pm
12         ImageMagick errors are now trapped more reliably and logged.
13
14         When processing Animated-GIF files, due to the algorithm, it is possible
15         to discard all frames, leaving an empty image.  Now, this special case
16         is treated as a corrupt image, and triggers FUZZY_OCR_CORRUPT_IMG with
17         $Score{corrupt} points (2.5 by default).
18
19     Changed:
20         Option: focr_personal_wordlist
21         Now, if the option value begins with '/', the value is not treated as
22         relative to the effective user's HOME directory, but as a fixed path.
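
A minimal sketch of this path rule, in Python for illustration only (the plugin is Perl). The function name `resolve_wordlist` and the example paths are assumptions.

```python
def resolve_wordlist(value: str, home: str) -> str:
    """Absolute values are used as-is; anything else is taken
    relative to the effective user's home directory."""
    if value.startswith("/"):
        return value
    return f"{home.rstrip('/')}/{value}"
```

With a home directory of `/home/u`, the value `wordlist` would resolve to `/home/u/wordlist`, while `/etc/wordlist` would stay unchanged.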
23
24
25 version 2.3i:
26     Added:
27         Option: 'focr_score_ham'  Default: 0.0
28         When set to 1, images that fall below the 'focr_counts_required' threshold
29         are scored with the formula $Score{Add} * $cnt. This gives marginally bad
30         images some positive score instead of just letting them through unscored.
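
The formula can be sketched like this, in Python for illustration only. The parameter `add` stands in for $Score{Add} and `cnt` for $cnt; the function name and the sample values in the usage note are assumptions.

```python
def below_threshold_score(cnt: int, add: float, score_ham: bool) -> float:
    """Score for an image whose match count stayed below
    focr_counts_required."""
    if not score_ham:
        # Old behaviour: the image passes without any score.
        return 0.0
    return add * cnt
```

For example, with `add = 0.5` and one matched word, the image would receive 0.5 points when the option is enabled and 0.0 otherwise.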
31        
32     Removed:
33         Util: gif2anim
34         This script is no longer used in the plugin, so it is removed from the
35         distribution, although if needed, it may be found in the previous version.
36
37     Fixed:
38         The plugin got stuck in an infinite loop when there was more than one
39         attachment with the same name, because the tie-breaking was not working.
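
One way to break such ties so that the loop always terminates is sketched below, in Python for illustration only. The suffix scheme and the name `unique_name` are assumptions; the changelog does not say how the plugin renames duplicates.

```python
def unique_name(name: str, seen: set) -> str:
    """Return a name not yet in `seen`, appending a counter on clashes."""
    candidate = name
    suffix = 1
    # The counter grows on every pass, so a free name is always reached.
    while candidate in seen:
        candidate = f"{name}.{suffix}"
        suffix += 1
    seen.add(candidate)
    return candidate
```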
40
41         When processing GIF files, extra care has to be taken so that ImageMagick
42         properly recognizes the files as GIF images, otherwise, an error occurs
43         because ImageMagick cannot properly determine the image 'type' and cannot
44         determine the image size, resulting in an invalid hash. Code is now in place
45         to prevent this, and in the case where invalid image size is encountered,
46         the processing of this image is skipped.
47
48     Changed:
49         When the plugin determines that words from the lists are found in the images,
50         it now stores these words in 'focr_db_hash' so that when we encounter the same
51         image hash in another message, the report will add the words 'found' to the
52         report, giving the end user more information, instead of just the
53         FOCR_KNOWN_IMAGE_HASH rule firing with the previous score.
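
The idea of storing the matched words next to the hash can be sketched as follows, in Python for illustration only. The record layout, the function names and the report wording are all assumptions, not the plugin's actual format.

```python
def remember(db: dict, digest: str, score: float, words: list) -> None:
    """Store the score together with the words that were found."""
    db[digest] = {"score": score, "words": list(words)}


def report_known(db: dict, digest: str):
    """Build a report line for an already known image hash,
    or return None if the hash is unknown."""
    record = db.get(digest)
    if record is None:
        return None
    found = ", ".join(record["words"])
    return f"known image hash, score {record['score']}, words found: {found}"
```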
54
55 version 2.3h:
56     Require:
57         New Perl Module
58             Image::Magick;
59     Added:
60         Option: 'focr_anim_delay'  Default: 100
61             This option is used with animated GIF files, and keeps all frames
62             that are displayed for at least this delay (the default, 100, is 1 sec).
63
64         Option: 'focr_anim_max_frames' Default: 2
65             This option is used with animated GIF files, and keeps top N
66             largest frames.
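
How the two options might work together is sketched below, in Python for illustration only. Applying the delay filter first and the size limit second is an assumption, as is the `(width, height, delay)` tuple layout; the changelog does not spell out the order.

```python
def select_frames(frames, anim_delay=100, max_frames=2):
    """frames: list of (width, height, delay) tuples.

    Keep frames shown for at least `anim_delay`, then keep the
    `max_frames` largest of those by pixel area."""
    kept = [f for f in frames if f[2] >= anim_delay]
    kept.sort(key=lambda f: f[0] * f[1], reverse=True)
    return kept[:max_frames]
```

Note that a filter like this can return an empty list when no frame is displayed long enough.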
67
68     Fixed:
69         Option: 'focr_digest_hash'
70             Fixed internal parameter to reflect option from original plugin (Thanks Bill).
71
72         Option: 'focr_db_hash'
73             Updated FuzzyOcr.cf to reflect plugin option.
74
75         Option: 'focr_db_safe'
76             Updated FuzzyOcr.cf to reflect plugin option.
77
78         Option: 'focr_counts_required'
79             The default value of '2' had been set to '5'; this is fixed, making it behave as the original plugin.
80
81     Removed:
82         Option: 'focr_bin_identify'
83         Option: 'focr_bin_convert'
84             These options are no longer valid, since the external programs are no longer called;
85             the Perl module is used instead. Makes things 'simpler'.
86
87         Option: 'focr_bin_gifasm'
88         Option: 'focr_bin_tifftopnm'
89             These external programs are not used anymore.
90
91     Changed:
92         The plugin now uses the Image::Magick module to access ImageMagick functions from Perl instead
93         of accessing external programs. This makes for fewer system calls to run external programs.
94         (Idea from Eric Yiu)
95        
96 version 2.3g:
97     Added:
98         Option: focr_keep_bad_images
99             The default value for this option is zero(0).
100             When set to 1, the plugin will not remove a tempdir whenever it registers
101                 an error or timeout from any of the 'helper' apps.
102             When set to 2, the plugin will always keep the tempdir. Beware that on heavily
103                 loaded systems, this might fill your /tmp partition.
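
The three levels can be summed up in a small decision function, shown in Python for illustration only. The name `keep_tempdir` is an assumption.

```python
def keep_tempdir(level: int, had_error: bool) -> bool:
    """Decide whether the tempdir survives after processing.

    0: always remove, 1: keep on error or timeout, 2: always keep."""
    if level >= 2:
        return True
    if level == 1:
        return had_error
    return False
```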
104        
105         Util: fuzzy-cleantmp
106             This utility can be used to remove tempdirs left behind if the plugin was
107             configured to save them.  It takes one parameter: hours to keep (12 by default)
108             This can safely be placed inside CRON to prune /tmp.
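
The pruning such a utility performs can be sketched as follows, in Python for illustration only. This is not the fuzzy-cleantmp script itself; the `prefix` argument used to recognise the plugin's tempdirs is an assumption, since the changelog does not say how they are named.

```python
import os
import shutil
import time


def clean_tmp(root: str, prefix: str, hours: float = 12) -> list:
    """Remove directories under `root` whose name starts with `prefix`
    and whose modification time is older than `hours` hours."""
    cutoff = time.time() - hours * 3600
    removed = []
    # Materialise the listing first so deletions do not disturb the scan.
    for entry in list(os.scandir(root)):
        if not entry.is_dir(follow_symlinks=False):
            continue
        if not entry.name.startswith(prefix):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            shutil.rmtree(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```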
109
110         Util: gif2anim
111             This utility (from ImageMagick) extracts images from animated gifs as well
112             as giving information regarding delays and image sizes. Requires identify and
113             convert to work (these are required, so not a problem).
114
115     Fixed:
116         Bug: 'convert'
117             An invalid parameter was specified when using 'convert' to assemble animated gifs
118             resulting in an error message, and the image was not scanned.
119
120         Bug: 'safe_db'
121             When checking for images in safe_db hash, because we score them as zero (0),
122             we did not 'short circuit' correctly. This has now been fixed.
123
124         Bug: wrong_ctype
125             The wrong index into the Score hash was used, which kept the 'focr_wrongctype_score'
126             parameter from taking effect. This has now been fixed.
127
128     Changed:
129         known_image_hash
130             This procedure was called with two parameters: $digest and $score.
131             $digest was not used, so it has been removed. Also, just in the off chance
132             that $score is zero, it uses $Score{base} to score the image.
133
134         fuzzyocr_check
135             Added code to better determine the name of the attachment. Sometimes, the name
136             is hidden in the 'content-id' header of the image/* MIME part, so we extract
137             it from there if no name is given when this header is available. Also it makes
138             sure that problematic characters are changed so as to not give Perl any more
139             grief.
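
The name lookup described above can be sketched as follows, in Python for illustration only. Which characters count as problematic is an assumption (here everything except letters, digits, dot, underscore and hyphen), as are the function name and the `unnamed` fallback.

```python
import re


def attachment_name(filename, content_id):
    """Prefer the given filename; fall back to the Content-ID value."""
    name = filename or ""
    if not name and content_id:
        # Content-ID values are usually wrapped in angle brackets.
        name = content_id.strip().strip("<>")
    # Replace anything outside a conservative whitelist.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    return name or "unnamed"
```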
140
141             A copy of the original message is now saved in the tempdir created, so that
142             when we instruct the plugin to keep the created tempdir, we have a copy of the
143             original message to further assist in troubleshooting problems.
144
145             A file is created in tempdir containing all the expanded commands used to
146             process the images. This can help to troubleshoot invalid command errors.
147
148             Removed some debuglog lines to reduce the lines logged.
149
150             Uses gif2anim (if available) to extract images from animated gifs.
151             TODO:
152                 I will try to use the generated anim file to root out animated gif spam where
153                 the spam message is not in the largest frame, or is in the frame with the
154                 largest delay, as well as other tricks...
155            
156 version 2.3f:
157     Fixed:
158         Properly initialized $h and $w to zero so that when getting the height and width
159         from an image, if the size parameters cannot be parsed, they can get properly tested.
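
The point of the zero initialization can be sketched like this, in Python for illustration only. The `WIDTHxHEIGHT` text format and the name `parse_size` are assumptions.

```python
import re


def parse_size(text: str):
    """Return (width, height); both stay 0 if nothing can be parsed."""
    width = height = 0
    match = re.search(r"(\d+)x(\d+)", text)
    if match:
        width, height = int(match.group(1)), int(match.group(2))
    return width, height
```

A caller can then test for `(0, 0)` instead of tripping over undefined values.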
160
161     Fixed:
162         Hashing now works. $digest was getting reset because it went out of scope. grrr.
163
164     Fixed:
165         $efile was only being replaced for first occurrence in complex scansets.
166
167     Fixed:
168         Various bugs where 'Use of uninitialized value' warnings were reported.
169
170 version 2.3e:
171     Fixed:
172         Option: 'focr_db_safe'
173             This option was not included in the @pgm_options array.... oops (thanks UxBoD)
174
175         Score: wrongctype
176             This was not used correctly, thus it was not scoring... (thanks Eric)
177
178     Changed:
179         It now works with tempfiles only
180             This hopefully reduces the need to read/write image data from memory after each
181             'filter', and with it the IO and memory usage of the plugin.
182
183         Scanset Syntax: $pfile
184             Because of the use of tempfiles, there is a need to specify the image file to be
185             used as input. '$pfile' must be used to specify the input filename. Please note
186             that in cases where scansets use pipes, only specify $pfile as the input to the
187             first 'filter' program.
188
189         Scanset Syntax: $efile
190             With every scanset, stderr is redirected to '$efile', which is different for each
191             image. When using multiple filters in a scanset, use '$efile' to redirect stderr
192             to this file, making sure the plugin will correctly recognize an error when it
193             occurs.
194            
195
196 version 2.3d:
197     Require:
198         Plugin officially requires SA 3.1.4 or higher
199         New Perl Modules
200             DB_File
201             Storable
202             MLDBM
203         Previous
204             String::Approx
205
206     Removed:
207         Option: 'focr_pre314'
208             Not used as it now requires SA 3.1.4
209
210     Added:
211         Option: 'focr_path_bin'
212             Its value is treated as path for searching of @bin_utils, potentially
213                 requiring less configuration options;
214             Directories in the path that do not exist are skipped;
215             Default value: /usr/local/netpbm/bin:/usr/local/bin:/usr/bin
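
A search of this kind can be sketched as follows, in Python for illustration only. The function name `find_in_path` is an assumption; only the colon-separated path format and the skipping of missing directories come from the option description.

```python
import os


def find_in_path(program: str, search_path: str):
    """Return the first executable `program` found in the
    colon-separated `search_path`, or None."""
    for directory in search_path.split(":"):
        if not os.path.isdir(directory):
            continue  # directories that do not exist are skipped
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```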
216
217         Option: 'focr_db_hash'
218             Its value holds the filename to use for storing hash database; See below.
219             Default value: /etc/mail/spamassassin/FuzzyOcr.db
220
221         Option: 'focr_db_safe'
222             Its value holds the filename to use for storing the database of 'clean' image hashes; See below.
223             Default value: /etc/mail/spamassassin/FuzzyOcr.safe.db
224
225         Option: 'focr_db_max_days'
226             Its value holds the number of days a record may go unmatched before it expires from the hash database; See below.
227             Default value: 35
228
229         Option: 'focr_keep_bad_images'
230             If this is set to 1, then this plugin will not remove the temporary image
231                 directory created where the images are stored and processed if it
232                 determines that the image was corrupt, or an error occurred with any
233                 of the auxiliary programs that process the images. Useful while
234                 debugging.
235             Default value: 0
236            
237
238     Changed:
239         Option: 'focr_logfile'
240             Defaults to 'stderr' so that logging goes there
241         Option: 'focr_enable_image_hashing' if set to 2:
242             Use MLDBM to store Hash info in true DB file for faster access.
243             Stores hashes of images that exceed set thresholds in file
244                 specified by option focr_db_hash
245             Stores hashes of 'clean' images (without matching words)
246                 specified by option focr_db_safe to also cache good images.
247             Keeps statistics of Hash-Hits and displays #times matched in log.
248             Saves name of attachment and content/type as reference
249             Automatically imports known-hashes from focr_digest_db into focr_db_hash
250             Automatically expire 'old' records if not matched in more than
251                 the number of days specified in option 'focr_db_max_days'
252         Instead of having a 'global' timeout, the 'focr_timeout' is used per
253             external program used, this will ensure that there are no timeouts
254             recorded because of complex scansets, or because of temporary spikes
255             in load. Also, it now displays the name and return code information
256             for the binary that timed out, making it easier to debug problems.
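
A per-program timeout, as opposed to one global timeout, can be sketched as follows. This is Python for illustration only; the plugin itself relies on Mail::SpamAssassin::Timeout, and the helper name and return layout here are assumptions.

```python
import subprocess


def run_with_timeout(argv: list, timeout: float):
    """Run one external program with its own timeout.

    Returns (program name, return code or None, timed_out)."""
    try:
        proc = subprocess.run(argv, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child is killed; report which binary timed out.
        return argv[0], None, True
    return argv[0], proc.returncode, False
```

Because each stage gets the full allowance, a long chain of filters does not exhaust a shared budget.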
257
258     Fixed:
259         A bug where option focr_counts_required was not recognized;
260         Logging to file when option 'focr_logfile' set now works;
261         Individual word scores are now applied correctly
262         Storing only images with matched words to hash database (Thanks to Robert LeBlanc)
263         Explicitly use Mail::SpamAssassin::Timeout (Thanks Eric Yiu)
264         Ignores empty lines in wordlists (global and local)
265         Ignores comments starting with (#) to EOL
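
The wordlist handling from the last two items can be sketched as follows, in Python for illustration only. The function name `parse_wordlist` and the sample words in the usage note are assumptions.

```python
def parse_wordlist(text: str) -> list:
    """Return the entries of a wordlist, skipping empty lines and
    stripping comments that run from '#' to the end of the line."""
    words = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            words.append(line)
    return words
```

A list containing `stock`, an empty line, a `# comment` line and `price  # note` would yield the two entries `stock` and `price`.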
266
267 version 2.3c:
268     Require:
269         Plugin officially requires SA 3.1.1 or higher
270    
271     Added:
272         Support for BMP/TIFF Images
273
274     Changed:
275         Major internal restructuring
276         Use SpamAssassin Logging Facility instead of own logfile
277
278     Fixed:
279         A bug related to database hashing