Timestamp:
10.12.2006 16:30:01
Author:
decoder
Message:

Last tweaks, commented out some lines in FuzzyOcr.cf
Added samples, updated samples README.
Replaced INSTALL and CHANGES files with files pointing to the online version of these files.
It is easier for us to maintain one source of INSTALL/CHANGELOG; otherwise we'll always end up with outdated docs.

Files:

  • trunk/devel/CHANGES

    r4 r104  
    1 version 2.3j: 
    2     Fixed: 
    3         sh: $efile: ambiguous redirect 
    4         This message was being generated when using complex scansets, because 
    5         the 'value' was only translated once. In complex scansets, this value 
    6         may be specified multiple times. 
     1 The changelog for the 3.5.x branch is maintained online at: 
    7 2 
    8         FuzzyOcr.cf 
    9         Fixed outstanding errors. Variable mismatches are now fixed. 
    10  
    11         FuzzyOcr.pm 
    12         Traps ImageMagick errors better and logs them. 
    13  
    14         When processing Animated-GIF files, due to the algorithm, it is possible 
    15         to discard all frames, leaving an empty image.  Now, this special case 
    16         is treated as a corrupt image, and triggers FUZZY_OCR_CORRUPT_IMG with 
    17         $Score{corrupt} points (2.5 by default). 
    18  
    19     Changed: 
    20         Option: focr_personal_wordlist 
    21         Now, if the option value begins with '/', the value is not treated as 
    22         relative to the effective user's HOME directory, but as a fixed path. 
    23  
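The path rule for 'focr_personal_wordlist' can be illustrated with a minimal sketch. It is written in Python rather than the plugin's Perl, and the function name and example paths are hypothetical; it only mirrors the rule as worded in this entry.

```python
import posixpath

def resolve_wordlist_path(option_value, home_dir):
    """Return the wordlist path for a focr_personal_wordlist-style value."""
    # A value that begins with '/' is used as a fixed path.
    if option_value.startswith("/"):
        return option_value
    # Any other value is taken relative to the effective user's HOME directory.
    return posixpath.join(home_dir, option_value)

# Hypothetical values, for illustration only.
print(resolve_wordlist_path("/etc/mail/wordlist", "/home/alice"))
print(resolve_wordlist_path(".spamassassin/wordlist", "/home/alice"))
```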
    24  
    25 version 2.3i: 
    26     Added: 
    27         Option: 'focr_score_ham'  Default: 0.0 
    28         When set to 1, images that fall below the 'focr_counts_required' threshold 
    29         are scored with the formula $Score{Add} * $cnt; this gives marginally bad 
    30         images some positive score instead of letting them through unscored. 
    31          
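As a rough illustration of the 'focr_score_ham' formula, the sketch below computes the score for an image that is already known to be below the threshold. It is Python, not the plugin's Perl; the function and parameter names are hypothetical, and the value used for $Score{Add} is an assumed placeholder, not the plugin's default.

```python
def below_threshold_score(cnt, add_score, score_ham):
    """Score for an image whose match count is below focr_counts_required."""
    # With the option off, such images pass without any score.
    if not score_ham:
        return 0.0
    # With the option on, the score is $Score{Add} * $cnt.
    return add_score * cnt

# Assumed values: one matched word, $Score{Add} taken as 0.5 for illustration.
print(below_threshold_score(1, 0.5, True))
print(below_threshold_score(1, 0.5, False))
```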
    32     Removed: 
    33         Util: gif2anim 
    34         This script is no longer used in the plugin, so it is removed from the 
    35         distribution, although if needed, it may be found in the previous version. 
    36  
    37     Fixed: 
    38         The plugin got stuck in an infinite loop when more than one attachment 
    39         had the same name, because the tie-breaking was not working. 
    40  
    41         When processing GIF files, extra care has to be taken so that ImageMagick 
    42         properly recognizes the files as GIF images, otherwise, an error occurs  
    43         because ImageMagick cannot properly determine the image 'type' and cannot 
    44         determine the image size, resulting in an invalid hash. Code is now in place 
    45         to prevent this, and in the case where invalid image size is encountered, 
    46         the processing of this image is skipped. 
    47  
    48     Changed: 
    49         When the plugin determines that words from the lists are found in the images, 
    50         it now stores these words in 'focr_db_hash' so that when we encounter the same 
    51         image hash in another message, the report will add the words 'found' to the 
    52         report, giving the end user more information, instead of just the  
    53         FOCR_KNOWN_IMAGE_HASH rule firing with the previous score. 
    54  
    55 version 2.3h: 
    56     Require: 
    57         New Perl Module 
    58             Image::Magick; 
    59     Added: 
    60         Option: 'focr_anim_delay'  Default: 100 
    61             This option is used with animated GIF files, and keeps all images 
    62             that are displayed for at least 1 sec. 
    63  
    64         Option: 'focr_anim_max_frames' Default: 2 
    65             This option is used with animated GIF files, and keeps top N 
    66             largest frames.  
    67  
    68     Fixed: 
    69         Option: 'focr_digest_hash' 
    70             Fixed internal parameter to reflect option from original plugin (Thanks Bill). 
    71  
    72         Option: 'focr_db_hash' 
    73             Updated FuzzyOcr.cf to reflect plugin option. 
    74  
    75         Option: 'focr_db_safe' 
    76             Updated FuzzyOcr.cf to reflect plugin option. 
    77  
    78         Option: 'focr_counts_required' 
    79             Fixed default value of '2' was set to '5' making it behave as the original plugin. 
    80  
    81     Removed: 
    82         Option: 'focr_bin_identify' 
    83         Option: 'focr_bin_convert' 
    84             These options are no longer valid, since the plugin now uses the Perl module 
    85             instead of calling the external programs. Makes things 'simpler'. 
    86  
    87         Option: 'focr_bin_gifasm' 
    88         Option: 'focr_bin_tifftopnm' 
    89             These external programs are not used anymore. 
    90  
    91     Changed: 
    92         The plugin now uses Image::Magick module to access ImageMagick functions from PERL instead 
    93         of accessing external programs. This makes for fewer system calls to run external programs. 
    94         (Idea from Eric Yiu) 
    95          
    96 version 2.3g: 
    97     Added: 
    98         Option: focr_keep_bad_images 
    99             The default value for this option is zero(0). 
    100             When set to 1, the plugin will not remove a tempdir whenever it registers 
    101                 an error or timeout from any of the 'helper' apps. 
    102             When set to 2, the plugin will always keep the tempdir. Beware that on heavily 
    103                 loaded systems, this might fill your /tmp partition. 
    104          
    105         Util: fuzzy-cleantmp 
    106             This utility can be used to remove tempdirs left behind if the plugin was  
    107             configured to save them.  It takes one parameter: hours to keep (12 by default) 
    108             This can safely be placed inside CRON to prune /tmp. 
    109  
    110         Util: gif2anim 
    111             This utility (from ImageMagick) extracts images from animated gifs as well 
    112             as giving information regarding delays and image sizes. Requires identify and 
    113             convert to work (these are required, so not a problem). 
    114  
    115     Fixed: 
    116         Bug: 'convert' 
    117             An invalid parameter was specified when using 'convert' to assemble animated gifs 
    118             resulting in an error message, and the image was not scanned. 
    119  
    120         Bug: 'safe_db' 
    121             When checking for images in safe_db hash, because we score them as zero (0), 
    122             we did not 'short circuit' correctly. This has now been fixed. 
    123  
    124         Bug: wrong_ctype 
    125             The wrong index into the Score hash was used, not allowing the 'focr_wrongctype_score' 
    126             parameter to take effect. This has now been fixed. 
    127  
    128     Changed: 
    129         known_image_hash 
    130             This procedure was called with two parameters: $digest and $score. 
    131             $digest was not used, so it has been removed. Also, just in the off chance 
    132             that $score is zero, it uses $Score{base} to score the image. 
    133  
    134         fuzzyocr_check 
    135             Added code to better determine the name of the attachment. Sometimes, the name 
    136             is hidden in the 'content-id' header of the image/* MIME part, so we extract 
    137             it from there if no name is given when this header is available. Also it makes 
    138             sure that problematic characters are changed so as to not give PERL any more 
    139             grief. 
    140  
    141             A copy of the original message is now saved in the tempdir created, so that 
    142             when we instruct the plugin to keep the created tempdir, we have a copy of the 
    143             original message to further assist in troubleshooting problems. 
    144  
    145             A file is created in tempdir containing all the expanded commands used to 
    146             process the images. This can help to troubleshoot invalid command errors.  
    147  
    148             Removed some debuglog lines to reduce the lines logged. 
    149  
    150             Uses gif2anim (if available) to extract images from animated gifs. 
    151             TODO: 
    152                 I will try to use the generated anim file to root out animated gif spam where 
    153                 the spam message is not in the largest frame, or is in the frame with the 
    154                 largest delay, as well as other tricks... 
    155              
    156 version 2.3f: 
    157     Fixed: 
    158         Properly initialized $h and $w to zero so that when getting the height and width 
    159         from an image, if the size parameters cannot be parsed, they can get properly tested. 
    160  
    161     Fixed: 
    162         Hashing now works. $digest was getting reset because it went out of scope. grrr. 
    163  
    164     Fixed: 
    165         $efile was only being replaced for first occurrence in complex scansets. 
    166  
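The difference between replacing only the first placeholder and replacing every occurrence can be sketched as follows. This is Python rather than the plugin's Perl, and the command string with `filter_a` and `filter_b` is a made-up scanset used only for illustration.

```python
def expand_first_only(command, efile):
    """Buggy behaviour: only the first '$efile' placeholder is replaced."""
    return command.replace("$efile", efile, 1)

def expand_all(command, efile):
    """Fixed behaviour: every '$efile' placeholder is replaced."""
    return command.replace("$efile", efile)

# Hypothetical complex scanset that names the error file twice.
cmd = "filter_a $pfile 2>$efile | filter_b 2>>$efile"

print(expand_first_only(cmd, "/tmp/img1.err"))  # second placeholder left as-is
print(expand_all(cmd, "/tmp/img1.err"))         # no placeholder left
```

A placeholder left unexpanded ends up in the command line handed to the shell, which is why the first variant fails on scansets with more than one filter.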
    167     Fixed: 
    168         Various bugs where 'Use of uninitialized value' warnings were reported. 
    169  
    170 version 2.3e: 
    171     Fixed: 
    172         Option: 'focr_db_safe' 
    173             This option was not included in the @pgm_options array.... oops (thanks UxBoD) 
    174  
    175         Score: wrongctype 
    176             This was not used correctly, thus it was not scoring... (thanks Eric) 
    177  
    178     Changed: 
    179         It now works with tempfiles only 
    180             This should reduce the need to read/write image data from memory after each 
    181             'filter', and will hopefully reduce IO and memory usage for the plugin. 
    182  
    183         Scanset Syntax: $pfile 
    184             Because of the use of tempfiles, there is a need to specify the image file to be 
    185             used as input. '$pfile' must be used to specify the input filename. Please note 
    186             that in cases where scansets use pipes, only specify $pfile as the input to the 
    187             first 'filter' program. 
    188  
    189         Scanset Syntax: $efile 
    190             With every scanset, stderr is redirected to '$efile', which is different for each 
    191             image. When using multiple filters in a scanset, use '$efile' to redirect stderr 
    192             to this file, making sure the plugin will correctly recognize an error when it 
    193             occurs. 
    194              
    195  
    196 version 2.3d: 
    197     Require: 
    198         Plugin officially requires SA 3.1.4 or higher 
    199         New Perl Modules 
    200             DB_File 
    201             Storable 
    202             MLDBM 
    203         Previous 
    204             String::Approx 
    205  
    206     Removed: 
    207         Option: 'focr_pre314' 
    208             Not used as it now requires SA 3.1.4 
    209  
    210     Added: 
    211         Option: 'focr_path_bin' 
    212             Its value is treated as path for searching of @bin_utils, potentially 
    213                 requiring less configuration options; 
    214             Directories in the path that don't exist are skipped; 
    215             Default value: /usr/local/netpbm/bin:/usr/local/bin:/usr/bin 
    216  
    217         Option: 'focr_db_hash' 
    218             Its value holds the filename to use for storing hash database; See below. 
    219             Default value: /etc/mail/spamassassin/FuzzyOcr.db 
    220  
    221         Option: 'focr_db_safe' 
    222             Its value holds the filename to use for storing hash database; See below. 
    223             Default value: /etc/mail/spamassassin/FuzzyOcr.safe.db 
    224  
    225         Option: 'focr_db_max_days' 
    226             Its value holds the number of days after which unmatched records expire; See below. 
    227             Default value: 35 
    228  
    229         Option: 'focr_keep_bad_images' 
    230             If this is set to 1, then this plugin will not remove the temporary image 
    231                 directory created where the images are stored and processed if it  
    232                 determines that the image was corrupt, or an error occurred with any 
    233                 of the auxiliary programs that process the images. Useful while 
    234                 debugging. 
    235             Default value: 0 
    236              
    237  
    238     Changed: 
    239         Option: 'focr_logfile' 
    240             Defaults to 'stderr' so that logging goes there 
    241         Option: 'focr_enable_image_hashing' if set to 2: 
    242             Use MLDBM to store Hash info in true DB file for faster access. 
    243             Stores hashes of images that exceed set thresholds in file 
    244                 specified by option focr_db_hash 
    245             Stores hashes of 'clean' images (without matching words) 
    246                 specified by option focr_db_safe to also cache good images. 
    247             Keeps statistics of Hash-Hits and displays #times matched in log. 
    248             Saves name of attachment and content/type as reference 
    249             Automatically imports known-hashes from focr_digest_db into focr_db_hash 
    250             Automatically expire 'old' records if not matched in more than 
    251                 the number of days specified in option 'focr_db_max_days' 
    252         Instead of having a 'global' timeout, the 'focr_timeout' is used per 
    253             external program used, this will ensure that there are no timeouts 
    254             recorded because of complex scansets, or because of temporary spikes 
    255             in load. Also, it now displays the name and return code information 
    256             for the binary that timed out, making it easier to debug problems. 
    257  
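The expiry rule tied to 'focr_db_max_days' can be pictured with a short sketch. It is Python, not the plugin's Perl/MLDBM code; the record layout (a dict with a `last_match` timestamp) and the hash names are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta

def expire_old_records(records, now, max_days):
    """Drop records that have not been matched within the last max_days days."""
    cutoff = now - timedelta(days=max_days)
    # Keep a record only if its most recent match is not older than the cutoff.
    return {digest: rec for digest, rec in records.items()
            if rec["last_match"] >= cutoff}

# Hypothetical database contents.
now = datetime(2006, 12, 10)
records = {
    "hash-a": {"last_match": datetime(2006, 12, 1)},  # matched 9 days ago
    "hash-b": {"last_match": datetime(2006, 10, 1)},  # matched 70 days ago
}
print(sorted(expire_old_records(records, now, 35)))
```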
    258     Fixed: 
    259         A bug where option focr_counts_required was not recognized; 
    260         Logging to file when option 'focr_logfile' set now works; 
    261         Individual word scores are now applied correctly 
    262         Storing only images with matched words to hash database (Thanks to Robert LeBlanc) 
    263         Explicitly use Mail::SpamAssassin::Timeout (Thanks Eric Yiu) 
    264         Ignores empty lines in wordlists (global and local) 
    265         Ignores comments starting with (#) to EOL 
    266  
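The two wordlist fixes above (skipping empty lines, and ignoring comments from '#' to end of line) amount to a small parsing rule. The sketch below is Python rather than the plugin's Perl and treats each remaining entry as an opaque string; the function name and sample words are hypothetical.

```python
def parse_wordlist(text):
    """Return wordlist entries, skipping blank lines and '#' comments."""
    entries = []
    for line in text.splitlines():
        # Drop everything from the first '#' to the end of the line.
        line = line.split("#", 1)[0].strip()
        # Skip lines that are empty once comments are removed.
        if line:
            entries.append(line)
    return entries

sample = "alpha\n\n# a full-line comment\nbeta   # trailing comment\n"
print(parse_wordlist(sample))
```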
    267 version 2.3c: 
    268     Require: 
    269         Plugin officially requires SA 3.1.1 or higher 
    270      
    271     Added: 
    272         Support for BMP/TIFF Images 
    273  
    274     Changed: 
    275         Major internal restructuring 
    276         Use SpamAssassin Logging Facility instead of own logfile 
    277  
    278     Fixed: 
    279         A bug related to database hashing 
     3 http://fuzzyocr.own-hero.net/wiki/Changelog-3.x#version3.5.0