Changeset 104

Timestamp:
10.12.2006 16:30:01 (2 years ago)
Author:
decoder
Message:

Last tweaks, commented out some lines in FuzzyOcr.cf
Added samples, updated samples README.
Replaced the INSTALL and CHANGES files with files pointing to the online versions of these files.
It is easier for us to maintain one source for INSTALL/CHANGELOG; otherwise we'll always end up with outdated docs.

Files:

  • trunk/devel/CHANGES

    r4 r104  
    1 version 2.3j: 
    2     Fixed: 
    3         sh: $efile: ambiguous redirect 
    4         This message was being generated when using complex scansets, because 
    5         the 'value' was only translated once. In complex scansets, this value 
    6         may be specified multiple times. 
     1 The changelog for the 3.5.x branch is maintained online at:
    7 2
    8         FuzzyOcr.cf 
    9         Fixed outstanding errors. Variable mismatches are now fixed. 
    10  
    11         FuzzyOcr.pm 
    12         Trap ImageMagick errors better, and logs them. 
    13  
    14         When processing Animated-GIF files, due to the algorithm, it is possible 
    15         to discard all frames, leaving an empty image.  Now, this special case 
    16         is treated as a corrupt image, and triggers FUZZY_OCR_CORRUPT_IMG with 
    17         $Score{corrupt} points (2.5 by default). 
    18  
    19     Changed: 
    20         Option: focr_personal_wordlist 
    21         Now, if the option value begins with '/', the value is not treated as 
    22         relative to the efective user's HOME directory, but as a fixed path. 
    23  
    24  
    25 version 2.3i: 
    26     Added: 
    27         Option: 'focr_score_ham'  Default: 0.0 
    28         When set to 1, images that are below the 'focr_counts_required' threshold, 
    29         are scored with the formula: $Score{Add} * $cnt; this gives marginally bad 
    30         images some positive score instead of just allowing them without score. 
    31          
    32     Removed: 
    33         Util: gif2anim 
    34         This script is no longer used in the plugin, so it is removed from the 
    35         distribution, although if needed, it may be found in the previous version. 
    36  
    37     Fixed: 
    38         The plugin was stuck in infinite loop in the case where there is more 
    39         than one attachment with the same name. The tie-breaking was not working. 
    40  
    41         When processing GIF files, extra care has to be taken so that ImageMagick 
    42         properly recognizes the files as GIF images, otherwise, an error occurs  
    43         because ImageMagick cannot properly determine the image 'type' and cannot 
    44         determine the image size, resulting in an invalid hash. Code is now in place 
    45         to prevent this, and in the case where invalid image size is encountered, 
    46         the processing of this image is skipped. 
    47  
    48     Changed: 
    49         When the plugin determines that words from the lists are found in the images, 
    50         it now stores these words in 'focr_db_hash' so that when we encounter the same 
    51         image hash in another message, the report will add the words 'found' to the 
    52         report, giving the end user more information, instead of just the  
    53         FOCR_KNOWN_IMAGE_HASH rule firing with the previous score. 
    54  
    55 version 2.3h: 
    56     Require: 
    57         New Perl Module 
    58             Image::Magick; 
    59     Added: 
    60         Option: 'focr_anim_delay'  Default: 100 
    61             This option is used with animated GIF files, and keeps all images 
    62             that are displayed for at least 1 sec. 
    63  
    64         Option: 'focr_anim_max_frames' Default: 2 
    65             This option is used with animated GIF files, and keeps top N 
    66             largest frames.  
    67  
    68     Fixed: 
    69         Option: 'focr_digest_hash' 
    70             Fixed internal parameter to reflect option from original plugin (Thanks Bill). 
    71  
    72         Option: 'focr_db_hash' 
    73             Updated FuzzyOcr.cf to reflect plugin option. 
    74  
    75         Option: 'focr_db_safe' 
    76             Updated FuzzyOcr.cf to reflect plugin option. 
    77  
    78         Option: 'focr_counts_required' 
    79             Fixed default value of '2' was set to '5' making it behave as the original plugin. 
    80  
    81     Removed: 
    82         Option: 'focr_bin_identify' 
    83         Option: 'focr_bin_convert' 
    84             These options are no longer valid, since the external programs are no longer called 
    85             in favor of using PERL module. Makes things 'simpler'. 
    86  
    87         Option: 'focr_bin_gifasm' 
    88         Option: 'focr_bin_tifftopnm' 
    89             external program not used anymore. 
    90  
    91     Changed: 
    92         The plugin now uses Image::Magick module to access ImageMagick functions from PERL instead 
    93         of accessing external programs. This makes for fewer system calls to run external programs. 
    94         (Idea from Eric Yiu) 
    95          
    96 version 2.3g: 
    97     Added: 
    98         Option: focr_keep_bad_images 
    99             The default value for this option is zero(0). 
    100             When set to 1, the plugin will not remove a tempdir whenever it registers 
    101                 an error or timeout from any of the 'helper' apps. 
    102             When set to 2, the plugin will always keep the tempdir. Beware that on heavily 
    103                 loaded systems, this might fill your /tmp partition. 
    104          
    105         Util: fuzzy-cleantmp 
    106             This utility can be used to remove tempdirs left behind if the plugin was  
    107             configured to save them.  It takes one parameter: hours to keep (12 by default) 
    108             This can safely be placed inside CRON to prune /tmp. 
    109  
    110         Util: gif2anim 
    111             This utility (from ImageMagic) extracts images from animated gifs as well 
    112             as giving information regarding delays and image sizes. Requires identify and 
    113             convert to work (these are required, so not a problem). 
    114  
    115     Fixed: 
    116         Bug: 'convert' 
    117             An invalid parameter was specified when using 'convert' to assemble animated gifs 
    118             resulting in an error message, and the image was not scanned. 
    119  
    120         Bug: 'safe_db' 
    121             When checking for images in safe_db hash, because we score then as zero (0), 
    122             we did not 'short circuit' correctly. This has now been fixed. 
    123  
    124         Bug: wrong_ctype 
    125             There wrong index to the Score hash was used, not allowing the 'focr_wrongctype_score' 
    126             parameter to take effect. This has now been fixed. 
    127  
    128     Changed: 
    129         known_image_hash 
    130             This procedure was called with two parameters: $digest and $score. 
    131             $digest was not used, so it has been removed. Also, just in the off chance 
    132             that $score is zero, it uses $Score{base} to score the image. 
    133  
    134         fuzzyocr_check 
    135             Added code to better determine the name of the attachment. Sometimes, the name 
    136             is hidden in the 'content-id' header of the image/* MIME part, so we extract 
    137             it from there if no name is given when this header is available. Also it makes 
    138             shure that problematic characters are changed so as to not give PERL any more 
    139             grief. 
    140  
    141             A copy of the original message is now saved in the tempdir created, so that 
    142             when we instruct the plugin to keep the created tempdir, we have a copy of the 
    143             original message to further assist in troubleshooting problems. 
    144  
    145             A file is created in tempdir containing all the expanded commands used to 
    146             process the images. This can help to troubleshoot invalid command errors.  
    147  
    148             Removed some debuglog lines to reduce the lines logged. 
    149  
    150             Uses gif2anim (if available) to extract images from animated gifs. 
    151             TODO: 
    152                 I will try to the generated anim file to root out animated gif spam where 
    153                 the spam message is not in the largest frame, or is in the frame with the 
    154                 largest delay, as well as other tricks... 
    155              
    156 version 2.3f: 
    157     Fixed: 
    158         Properly initialized $h and $w to zero so that when getting the height and width 
    159         from an image, if the size parameters cannot be parsed, they can get properly tested. 
    160  
    161     Fixed: 
    162         Hashing now works. $digest was getting reset because it went out of scope. grrr. 
    163  
    164     Fixed: 
    165         $efile was only being replaced for first occurrence in complex scansets. 
    166  
    167     Fixed: 
    168         Various bugs where: Use of uninitialized values were reported. 
    169  
    170 version 2.3e: 
    171     Fixed: 
    172         Option: 'focr_db_safe' 
    173             This option was not included in the @pgm_options array.... oops (thanks UxBoD) 
    174  
    175         Score: wrongctype 
    176             This was not used correctly, thus it was not scoring... (thanks Eric) 
    177  
    178     Changed: 
    179         It now works with tempfiles only 
    180             This hopefully reducing the need to read/write image data from memory after each 
    181             'filter'. This will hopefully reduce IO and memory usage for the plugin. 
    182  
    183         Scanset Syntax: $pfile 
    184             Because of the use of tempfiles, there is a need to specify the image file to be 
    185             used as input. '$pfile' must be used to specify the input filename. Please note 
    186             that in cases where scansets use pipes, only specify $pfile as the input to the 
    187             first 'filter' program. 
    188  
    189         Scanset Syntax: $efile 
    190             With every scanset, stderr is redirected to '$efile', which is different for each 
    191             image. When using multiple filters in a scanset, use '$efile' to redirect stderr 
    192             to this file, making shure the plugin will correctly recognize an error when it 
    193             occurs. 
    194              
    195  
    196 version 2.3d: 
    197     Require: 
    198         Plugin officially requires SA 3.1.4 or higher 
    199         New Perl Modules 
    200             DB_File 
    201             Storable 
    202             MLDBM 
    203         Previous 
    204             String::Approx 
    205  
    206     Removed: 
    207         Option: 'focr_pre314' 
    208             Not used as it now requires SA 3.1.4 
    209  
    210     Added: 
    211         Option: 'focr_path_bin' 
    212             Its value is treated as path for searching of @bin_utils, potentially 
    213                 requiring less configuration options; 
    214             Directories in the path that don't exists, are skipped; 
    215             Default value: /usr/local/netpbm/bin:/usr/local/bin:/usr/bin 
    216  
    217         Option: 'focr_db_hash' 
    218             Its value holds the filename to use for storing hash database; See below. 
    219             Default value: /etc/mail/spamassassin/FuzzyOcr.db 
    220  
    221         Option: 'focr_db_safe' 
    222             Its value holds the filename to use for storing hash database; See below. 
    223             Default value: /etc/mail/spamassassin/FuzzyOcr.safe.db 
    224  
    225         Option: 'focr_db_max_days' 
    226             Its value holds the filename to use for storing hash database; See below. 
    227             Default value: 35 
    228  
    229         Option: 'focr_keep_bad_images' 
    230             If this is set to 1, then this plugin will not remove the temporary image 
    231                 directory created where the images are stored and processed if it  
    232                 determines that the image was corrupt, or an error occurred with any 
    233                 of the auxiliary programs that process the images. Usefull while 
    234                 debugging. 
    235             Default value: 0 
    236              
    237  
    238     Changed: 
    239         Option: 'focr_logfile' 
    240             Defaults to 'stderr' so that logging goes there 
    241         Option: 'focr_enable_image_hashing' if set to 2: 
    242             Use MLDBM to store Hash info in true DB file for faster access. 
    243             Stores hashes of images that exceed set thresholds in file 
    244                 specified by option focr_db_hash 
    245             Stores hashes of 'clean' images (without matching words) 
    246                 specified by option focr_db_safe to also cache good images. 
    247             Keeps statistics of Hash-Hits and displays #times matched in log. 
    248             Saves name of attachment and content/type as reference 
    249             Automatically imports known-hashes from focr_digest_db into focr_db_hash 
    250             Automatically expire 'old' records if not matched in more than 
    251                 the number of days specified in option 'focr_db_max_days' 
    252         Instead of having a 'global' timeout, the 'focr_timeout' is used per 
    253             external program used, this will ensure that there are no timeouts 
    254             recorded because of complex scansets, or because of temporary spikes 
    255             in load. Also, it now displays the name and return code information 
    256             for the binary that timedout, making it easier to debug problems. 
    257  
    258     Fixed: 
    259         A bug where option focr_counts_required was not recognized; 
    260         Logging to file when option 'focr_logfile' set now works; 
    261         Individual word scores are now applied correctly 
    262         Storing only images with matched words to hash database (Thanks to Robert LeBlanc) 
    263         Explicitly use Mail::SpamAssassin::Timeout (Thanks Eric Yiu) 
    264         Ignores empty lines in wordlists (global and local) 
    265         Ignores comments starting with (#) to EOL 
    266  
    267 version 2.3c: 
    268     Require: 
    269         Plugin officially requires SA 3.1.1 or higher 
    270      
    271     Added: 
    272         Support for BMP/TIFF Images 
    273  
    274     Changed: 
    275         Major internal restructuring 
    276         Use SpamAssassin Logging Facility instead of own logfile 
    277  
    278     Fixed: 
    279         A bug related to database hashing 
     3 http://fuzzyocr.own-hero.net/wiki/Changelog-3.x#version3.5.0
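
The focr_score_ham entry under version 2.3i in the removed changelog states its formula explicitly ($Score{Add} * $cnt). The rule can be sketched in Python as follows. This is only an illustration of the changelog text, not code from FuzzyOcr.pm, and the function and parameter names are hypothetical.

```python
def below_threshold_score(cnt, counts_required, add_score, score_ham):
    """Score for an image whose match count is below the threshold.

    Mirrors the changelog formula $Score{Add} * $cnt. All names here are
    hypothetical stand-ins for the plugin's internals.
    """
    if cnt >= counts_required:
        raise ValueError("image reached the threshold; normal scoring applies")
    # With the option off, sub-threshold images pass without any score.
    return add_score * cnt if score_ham else 0.0


print(below_threshold_score(1, 2, 0.375, score_ham=True))   # 0.375
print(below_threshold_score(1, 2, 0.375, score_ham=False))  # 0.0
```

The values 2 and 0.375 are sample inputs chosen for the illustration, not verified plugin defaults.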
  • trunk/devel/FuzzyOcr.cf

    r100 r104
      85   85  # Include additional scanner/preprocessor commands here:
      86   86  #
      87       focr_bin_helper pnmnorm, pnminvert, convert
           87  focr_bin_helper pnmnorm, pnminvert, pamthreshold, ppmtopgm, pamtopnm
      88   88  focr_bin_helper tesseract
      89   89

     158  158  # Timeout for the plugin, in seconds. (Maximum runtime of the plugin)
     159  159  # Default value: 10
     160       focr_timeout 15
          160  #focr_timeout 15
     161  161
     162  162  # Use a global timeout value instead of per helper application.
     163  163  # Default value: 0
     164       focr_global_timeout 1
          164  #focr_global_timeout 1
     165  165
     166  166  # Maximum file size for different formats in byte, bigger pictures

     193  193  # This is the score for a hit after focr_counts_required matches
     194  194  # Default value: 5
     195       focr_base_score 5
          195  #focr_base_score 5
     196  196
     197  197  # This is the additional score for every additional match after
     198  198  # focr_counts_required matches
     199  199  # Default value: 1
     200       focr_add_score 0.375
          200  #focr_add_score 0.375
     201  201
     202  202  # This option defines the factor, which is multiplied with the number

     286  286  # Auto-prune: Expire records from hasing databases after these many days
     287  287  # Default value: 35
     288       focr_db_max_days 15
          288  #focr_db_max_days 15
     289  289
     290  290  ###
    290290### 
  • trunk/devel/FuzzyOcr.pm

    r102 r104
     820  820              }
     821  821              if ($mcnt >= $conf->{focr_counts_required} and $conf->{focr_minimal_scanset}) {
     822                       warnlog("Scanset \"$scanlabel\" generates enough hits ($mcnt), skipping further scansets...");
          822                  infolog("Scanset \"$scanlabel\" generates enough hits ($mcnt), skipping further scansets...");
     823  823                  if ($conf->{focr_autosort_scanset}) {
     824  824                      foreach my $s (@$scansets) {
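
The hunk above only lowers the log level of the short-circuit message from warning to info. The control flow around it, as far as it is visible here, can be sketched in Python as follows. The scanset labels and the scan callables are hypothetical, and the focr_autosort_scanset branch that follows in the Perl source is not modelled.

```python
import logging

log = logging.getLogger("FuzzyOcr")


def run_scansets(scansets, counts_required, minimal_scanset=True):
    """Run (label, scan) pairs in order and return the (label, hits) results.

    scan is a hypothetical zero-argument callable returning a hit count.
    """
    results = []
    for label, scan in scansets:
        hits = scan()
        results.append((label, hits))
        if minimal_scanset and hits >= counts_required:
            # Routine short-circuit, so log at INFO rather than WARNING.
            log.info('Scanset "%s" generates enough hits (%s), '
                     "skipping further scansets...", label, hits)
            break
    return results


demo = [("first", lambda: 1), ("second", lambda: 3), ("third", lambda: 9)]
print(run_scansets(demo, counts_required=2))
# [('first', 1), ('second', 3)]
```

With minimal_scanset set to False every scanset runs, which matches the advice in the samples README to disable the option when testing a scanset that sits later in the list.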
  • trunk/devel/INSTALL

    r58 r104  
    1 Requirements: 
    2 ~~~~~~~~~~~~~ 
    3   libungif: 
    4     http://sourceforge.net/project/showfiles.php?group_id=102202 
     1 The installation manual for the 3.5.x branch is maintained online at:
    5 2
    6   netpbm: 
    7     http://sourceforge.net/project/showfiles.php?group_id=5128 
    8  
    9   gifsicle: 
    10     http://www.lcdf.org/gifsicle/gifsicle-1.44.tar.gz (latest) 
    11  
    12   gocr: (v0.40) suggested ... needs to be patched! 
    13     http://sourceforge.net/project/showfiles.php?group_id=7147 
    14  
    15   ocrad: 
    16     Please use your closest GNU mirror: 
    17       http://www.gnu.org/prep/ftp.html 
    18  
    19   mysql: 
    20     http://www.mysql.com (should work with 3.23+) 
    21  
    22 Perl Packages: 
    23 ~~~~~~~~~~~~~~ 
    24   String::Approx 
    25  
    26   MLDBM             - used with type 2 hasing 
    27   Storable          - used with type 2 hasing 
    28   DB_File           - used with type 2 hasing 
    29  
    30   DBI               - used with type 3 hasing 
    31   DBD::mysql        - used with type 3 hasing 
    32  
    33 Make shure all the above requirements are met, or else!!! 
    34 I personally think it is better to compile all from source, but 
    35 binary packages are available if you decide to go that way. The 
    36 only package that should be compiled from source is gocr (since 
    37 it requires some patching to make it work better ;) 
    38  
    39 Place a copy of the following files in your SpamAssassin local  
    40 configuration directory (/etc/mail/spamassassin by default): 
    41  
    42   FuzzyOcr.pm 
    43   FuzzyOcr.cf (change to taste) 
    44   FuzzyOcr.words 
    45  
    46 Skipping Scans 
    47 ~~~~~~~~~~~~~~ 
    48 Due to possible false positives, you also have the option not to  
    49 scan a particular type of image using the following configuration 
    50 option: 
    51  
    52 focr_skip_<img_type> 1 
    53  
    54 Optionally you could skip scanning of images that are 'too big' by 
    55 specifying the following configuration option: 
    56  
    57 focr_max_size_<img_type> <max-size> 
    58  
    59 where <max-size> is expressed in bytes (compared to the pnm 
    60 filesize), and <img_type> is one of the following: 
    61  
    62 - gif 
    63 - jpeg 
    64 - png 
    65 - bmp 
    66 - tiff 
    67  
    68 Timeouts 
    69 ~~~~~~~~ 
    70 There are two types of timeouts available for FuzzyOCR: 
    71  
    72 1.- Per Application Timeout (Default) 
    73     Set by setting the following: 
    74  
    75     focr_timeout <secs> 
    76     focr_global_timeout 0 (Default) 
    77  
    78     Each external helper application is given <secs> seconds 
    79     to complete, after which time it is assumed that it failed 
    80     and processing continues. 
    81  
    82 2.- Global Timeout 
    83     Set by setting the following: 
    84  
    85     focr_timeout <secs> 
    86     focr_global_timeout 1 
    87  
    88     If scanning takes longer than <secs> seconds, the scan is 
    89     aborted and the images (if any) are not scored or checked. 
    90  
    91 Image Hashing 
    92 ~~~~~~~~~~~~~ 
    93 If using image-hasing option (disabled by default) you need to specify 
    94 the following options in FuzzyOcr.cf: 
    95  
    96 focr_enable_image_hashing 1 
    97 focr_digest_db <full_path_to_file> 
    98  
    99 or 
    100  
    101 focr_enable_image_hashing 2 
    102 focr_db_hash <full_path_to_file> 
    103 focr_db_safe <full_path_to_file> 
    104 focr_db_max_days ##                     (default: 35) 
    105  
    106 In either case, you need to make shure the effective user running  
    107 SpamAssassin has the proper permissions to write to the specified files, 
    108 or change permissions on the files so that the effective user has 
    109 write permissions on these files. 
    110  
    111 Now if you decide to store the data in MySQL tables, 
    112  
    113 focr_enable_image_hashing 3 
    114 focr_db_max_days ##                     (default: 35) 
    115 focr_mysql_db <database_name>           (default: FuzzyOcr) 
    116 focr_mysql_hash <hash_table>            (default: Hash) 
    117 focr_mysql_safe <safe_table>            (default: Safe) 
    118 focr_mysql_user <username>              (default: fuzzyocr) 
    119 focr_mysql_pass <password>              (default: fuzzyocr) 
    120  
    121 and 
    122  
    123  focr_mysql_socket <path_to_socket>     (default: undefined) 
    124 or 
    125  focr_mysql_host <hostname>             (default: localhost) 
    126  focr_mysql_port <mysql_port>           (default: 3306) 
    127  
    128  
    129 Test with: 
    130 ~~~~~~~~~~ 
    131   spamassassin --debug FuzzyOcr < path_to_email > /dev/null 
    132  
    133 If you do not get errors, you are ready to go, and restart SPAMD which is 
    134 the (*strongly*) recomended way to use this plugin. 
    135  
     3 http://fuzzyocr.own-hero.net/wiki/Installation-3.5.x
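
The removed INSTALL text describes two timeout modes: with focr_global_timeout 0 each helper application gets the full focr_timeout, and with focr_global_timeout 1 the whole scan shares one budget. A minimal Python sketch of that difference follows. The function name and the elapsed-time bookkeeping are assumptions made for illustration and do not come from FuzzyOcr.pm.

```python
def helper_time_budget(timeout_secs, global_timeout, elapsed_secs=0.0):
    """Seconds the next helper application may run.

    Per-application mode: every helper gets the full timeout.
    Global mode: helpers share one budget, so only the remainder is left.
    """
    if global_timeout:
        return max(0.0, timeout_secs - elapsed_secs)
    return float(timeout_secs)


# Eight seconds of a 10 second timeout have already been used by earlier helpers.
print(helper_time_budget(10, global_timeout=False, elapsed_secs=8.0))  # 10.0
print(helper_time_budget(10, global_timeout=True, elapsed_secs=8.0))   # 2.0
```

A budget of 0.0 in global mode corresponds to the case where the scan is aborted and the images are not scored or checked.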
  • trunk/devel/samples/README

    r103 r104
       2    2
       3    3  Use spamassassin -t < samplefile.eml to test :)
            4
            5  ATTENTION: If FuzzyOcr does not trigger on one of the messages, then make sure you have the focr_autodisable_score set high enough.
            6  Otherwise, if a message gets enough hits by SA, FuzzyOcr will not scan it. This is generally depending on your other SA rules.
            7
       4    8
       5    9  ocr-gif.eml: Contains a corrupted gif image, additionally I changed the content-type to jpeg, so the output should show:

       9   13                              Image has format "GIF" but content-type is
      10   14                              "image/jpeg"
      11        3.0 FUZZY_OCR_CORRUPT_IMG  BODY: Mail contains a corrupted image
           15   2.5 FUZZY_OCR_CORRUPT_IMG  BODY: Mail contains a corrupted image
      12   16                              Corrupt image: GIF-LIB error: Image is
      13   17                              defective, decoding aborted.
           18   8.8 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
           19                              Words found:
           20                              "target" in 1 lines
           21                              "service" in 1 lines
           22                              "stock" in 2 lines
           23                              "price" in 2 lines
           24                              "company" in 1 lines
           25                              "recommendation" in 1 lines
           26                              (12 word occurrences found)
      14   27
      15         10 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
           28  ocr-animated.eml: Contains an animated gif. If all deanimation routines are working properly on your system, the output should contain:
           29
           30   6.5 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
      16   31                              Words found:
      17                                   "stock" with fuzz of 0.2
      18                                   "price" with fuzz of 0.2
      19                                   "price" with fuzz of 0.2
      20                                   "stock" with fuzz of 0
      21                                   "company" with fuzz of 0
      22                                   "trade" with fuzz of 0.2
      23                                   "service" with fuzz of 0.285714285714286
      24                                   "investor" with fuzz of 0.25
      25                                   (8 word occurrences found)
           32                              "price" in 1 lines
           33                              "company" in 1 lines
           34                              "alert" in 1 lines
           35                              "news" in 1 lines
           36
           37  ocr-obfuscated.eml: Contains an obfuscated gif image, to test the ocrad-decolorize scansets. If you want to test this scanset, either set the minimal_scanset option to 0 or put the decolorize scanset temporarily at the beginning of the scansets file. The output should be:
           38
           39   5.9 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
           40                              Words found:
           41                              "target" in 1 lines
           42                              "profit" in 1 lines
           43                              "trade" in 1 lines
           44                              (4.5 word occurrences found)
           45
      26   46
      27   47  ocr-jpg.eml: Contains a jpeg file. Output should show:
      28   48
      29        6.0 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
           49   5.9 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
      30   50                              Words found:
      31                                   "viagra" with fuzz of 0
      32                                   "cialis" with fuzz of 0
      33                                   "viagra" with fuzz of 0
      34                                   "levitra" with fuzz of 0
      35                                   (4 word occurrences found)
           51                              "levitra" in 1 lines
           52                              "viagra" in 2 lines
           53                              (4.5 word occurrences found)
      36   54
      37   55
      38   56  ocr-png.eml: Contains a png file. Output should show:
      39   57
      40         20 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
           58    14 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
      41   59                              Words found:
      42                                   "price" with fuzz of 0.2
      43                                   "company" with fuzz of 0
      44                                   "price" with fuzz of 0
      45                                   "price" with fuzz of 0.2
      46                                   "software" with fuzz of 0
      47                                   "investor" with fuzz of 0
      48                                   "trade" with fuzz of 0.2
      49                                   "price" with fuzz of 0.2
      50                                   "service" with fuzz of 0
      51                                   "software" with fuzz of 0
      52                                   "company" with fuzz of 0
      53                                   "service" with fuzz of 0
      54                                   "stock" with fuzz of 0
      55                                   "trade" with fuzz of 0
      56                                   "levitra" with fuzz of 0.285714285714286
      57                                   "price" with fuzz of 0
      58                                   "buy" with fuzz of 0
      59                                   "price" with fuzz of 0.2
      60                                   (18 word occurrences found)
           60                              "buy" in 1 lines
           61                              "target" in 2 lines
           62                              "service" in 1 lines
           63                              "stock" in 1 lines
           64                              "investor" in 1 lines
           65                              "price" in 3 lines
           66                              "company" in 2 lines
           67                              "trade" in 1 lines
           68                              "software" in 1 lines
           69                              "recommendation" in 1 lines
           70                              "news" in 3 lines
           71                              (25.5 word occurrences found)
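
The new sample scores can be related to the scoring options described in the FuzzyOcr.cf comments: a base score once focr_counts_required matches are reached, plus an additional score for every further match. The Python sketch below assumes a linear formula, the values 5 and 0.375 shown in the FuzzyOcr.cf hunk, and a focr_counts_required of 2. The formula and the threshold value are assumptions inferred from this changeset, not confirmed against FuzzyOcr.pm, and the function name is hypothetical.

```python
def fuzzy_ocr_score(occurrences, base_score=5.0, add_score=0.375,
                    counts_required=2):
    """Assumed linear scoring: base score plus add_score per extra match."""
    if occurrences < counts_required:
        # Assumption: images below the threshold get no score here.
        return 0.0
    return base_score + add_score * (occurrences - counts_required)


samples = [("ocr-gif.eml", 12), ("ocr-jpg.eml", 4.5), ("ocr-png.eml", 25.5)]
for name, occurrences in samples:
    print(name, fuzzy_ocr_score(occurrences))
```

Under these assumptions the three samples evaluate to 8.75, 5.9375 and 13.8125, which is consistent with the 8.8, 5.9 and 14 shown in the new README if the report rounds the score for display.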