Ticket #406 (new enhancement)

Opened 1 year ago

Last modified 1 year ago

add support for netpbm image attachments

Reported by: keithr
Assigned to: decoder
Priority: major
Milestone:
Component: Image Analysis
Version:
Keywords:
Cc:

Description

While working on a SpamAssassin plugin to extract text and images from PDF files, I was surprised to find that FuzzyOcr doesn't actually handle netpbm images, even though it converts other image types to netpbm for the OCR programs. Netpbm support is needed for PDF images because pdfimages(1) extracts them as ppm files. The attached diff adds support for netpbm images.

Attachments

fuzzyocr.diffs.txt (3.6 kB) - added by keithr on 19.07.2007 07:48:04.
Patch to add support for netpbm images

Change History

11.07.2007 22:33:57 changed by keithr

I tried attaching the diff, but got an error from the server, so here it is:

--- ./FuzzyOcr.pm 2007-07-07 21:45:20.000000000 -0700
+++ /etc/mail/spamassassin/FuzzyOcr.pm 2007-07-11 13:07:55.000000000 -0700
@@ -148,21 +148,34 @@ sub fuzzyocr_do {
             $fname =~ s/[<>]//g;
             $fname =~ tr/\@\$\%\&/_/s;
         }
         my $filename = $fname;
         $filename =~ tr{a-zA-Z0-9\-.}{_}cs;
         debuglog("fname: \"$fname\" => \"$filename\"");
         my $pdata = $p->decode();
         my $pdatalen = length($pdata);
         my $w = 0;
         my $h = 0;
-        if ( substr($pdata,0,3) eq "\x47\x49\x46" ) {
+        if ($pdata =~ /P([1-7])\n/o) {
+            ## NETPBM File
+            if ($pdata =~ /P[1-7]\n([0-9]+) ([0-9]+)\n/o) {
+                $w = $1;
+                $h = $2;
+            } else {
+                errorlog("Cannot find image dimensions");
+            }
+
+            $imgfiles{$filename}{ftype} = 0;
+            $imgfiles{$filename}{width} = $w;
+            $imgfiles{$filename}{height} = $h;
+            infolog("NETPBM: [${h}x${w}] $filename ($pdatalen)");
+        } elsif ( substr($pdata,0,3) eq "\x47\x49\x46" ) {
             ## GIF File
             $imgfiles{$filename}{ftype} = 1;
             ($w,$h) = unpack("vv",substr($pdata,6,4));
             infolog("GIF: [${h}x${w}] $filename ($pdatalen)");
             $imgfiles{$filename}{width} = $w;
             $imgfiles{$filename}{height} = $h;
         } elsif ( substr($pdata,0,2) eq "\xff\xd8" ) {
             ## JPEG File
             my @Markers = (0xC0,0xC1,0xC2,0xC3,0xC5,0xC6,0xC7,0xC9,0xCA,0xCB,0xCD,0xCE,0xCF);
             my $pos = 2;
@@ -369,21 +382,58 @@ sub fuzzyocr_do {
         if($$pic{fname} =~ /\.([\w-]+)$/) {
             $suffix = $1;
         }
         if ($suffix) {
             debuglog("File has Content-Type \"$mimetype\" and File Extension \"$suffix\"");
         } else {
             debuglog("File has Content-Type \"$mimetype\" and no File Extension");
         }
-        if ( $$pic{ftype} == 1 ) {
+        if ( $$pic{ftype} == 0 ) {
+            infolog("Found NETPBM header name=\"$$pic{fname}\"");
+
+            if ($conf->{focr_skip_ppm}) {
+                infolog("Skipping image check");
+                next;
+            }
+
+            my $max_size;
+
+            if (defined($conf->{focr_max_size_ppm}) and ($$pic{fsize} > $conf->{focr_max_size_ppm})) {
+                infolog("PPM file size ($$pic{fsize}) exceeds maximum file size for this format, skipping...");
+                next;
+            }
+
+            if ( ($$pic{ctype} !~ /(pbm|pgm|ppm|pnm|pam)/i) and not $generic_ctype) {
+                wrong_ctype( "PPM", $$pic{ctype} );
+                $internal_score += $conf->{'focr_wrongctype_score'};
+            }
+
+            if ( $suffix and $suffix !~ /ppm/i) {
+                wrong_extension( "PPM", $suffix);
+                $internal_score += $conf->{'focr_wrongext_score'};
+            }
+
+            unless (defined $conf->{'focr_bin_ppmtopgm'}) {
+                errorlog("Cannot exec ppmtopgm, skipping image");
+                next;
+            }
+
+            printf RAWERR qq(## link($file, $pfile)\n) if ($haserr>0);
+            unless (link($file, $pfile)) {
+                printf RAWERR "?? link failed: $!\n" if ($haserr>0);
+                errorlog("link($file, $pfile) failed, skipping...");
+                ++$imgerr if $conf->{focr_keep_bad_images}>0; next;
+            }
+        }
+        elsif ( $$pic{ftype} == 1 ) {
             infolog("Found GIF header name=\"$$pic{fname}\"");
             if ($conf->{focr_skip_gif}) {
                 infolog("Skipping image check");
                 next;
             }
             if (defined($conf->{focr_max_size_gif}) and ($$pic{fsize} > $conf->{focr_max_size_gif})) {
                 infolog("GIF file size ($$pic{fsize}) exceeds maximum file size for this format, skipping...");
                 next;
             }
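For readers who want to see the header check in isolation, here is a minimal sketch of netpbm detection, written in Python rather than the plugin's Perl. It is not part of the patch: the name parse_netpbm_header is hypothetical, and the sketch deliberately differs from the diff in two ways that are my own assumptions about what a stricter check might look like. It anchors the magic number at the first byte (the patch's /P([1-7])\n/ can match anywhere in the data), and it tolerates "#" comment lines in the header. It covers only P1 through P6; P7 (PAM) uses a keyword-based header and is left out.

```python
import re

# Hypothetical helper, not FuzzyOcr code: reads the magic number and the
# dimensions from the start of a PBM/PGM/PPM (P1-P6) file.
_NETPBM_HEADER = re.compile(
    rb"P([1-6])"              # magic number; .match() anchors it at byte 0
    rb"(?:\s|#[^\n]*\n)+"     # whitespace and/or '#' comment lines
    rb"(\d+)"                 # width
    rb"(?:\s|#[^\n]*\n)+"
    rb"(\d+)"                 # height
    rb"\s"                    # at least one whitespace byte ends the field
)


def parse_netpbm_header(data: bytes):
    """Return (magic, width, height), or None if data is not P1-P6."""
    m = _NETPBM_HEADER.match(data)
    if m is None:
        return None
    magic = "P" + m.group(1).decode("ascii")
    return magic, int(m.group(2)), int(m.group(3))


if __name__ == "__main__":
    sample = b"P6\n# extracted image\n640 480\n255\n" + b"\x00" * 12
    print(parse_netpbm_header(sample))
```

Returning None for anything that does not start with a P1-P6 magic number lets the caller fall through to the GIF and JPEG checks, in the same way the elsif chain in the diff does.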

12.07.2007 16:13:58 changed by decoder

There is no need to support ppm/pnm files, because they are usually simply too big to send around. Furthermore, FuzzyOcr already supports PDF file attachments, although the code is experimental and can be improved further (error detection, different tools, etc.). However, I do not have the time to continue working on that problem until I've finished my Bachelor Thesis.

Best regards,

Chris
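To put the size argument in numbers: a raw (P6) PPM stores every pixel uncompressed, three samples per pixel, after a short text header. The sketch below, in Python, computes that size. The function name raw_ppm_size is illustrative only and does not come from FuzzyOcr or netpbm.

```python
def raw_ppm_size(width: int, height: int, maxval: int = 255) -> int:
    """Size in bytes of a raw (P6) PPM with a minimal, comment-free header."""
    # Samples take one byte when maxval < 256, two bytes otherwise.
    bytes_per_sample = 1 if maxval < 256 else 2
    header = f"P6\n{width} {height}\n{maxval}\n".encode("ascii")
    return len(header) + width * height * 3 * bytes_per_sample


if __name__ == "__main__":
    # A 640x480 image needs 640 * 480 * 3 = 921600 bytes of pixel data.
    print(raw_ppm_size(640, 480))
```

A 640x480 image therefore comes to roughly 900 kB before any mail encoding overhead, which is the kind of size the comment above is objecting to.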

19.07.2007 07:48:04 changed by keithr

  • attachment fuzzyocr.diffs.txt added.

Patch to add support for netpbm images

