Ticket #15 (new enhancement)

Opened 2 years ago

Last modified 3 months ago

Pushing OCR'ed text back to SA

Reported by: jake@infinitylimited.net Assigned to: decoder
Priority: minor Milestone:
Component: Image Analysis Version:
Keywords: Cc:

Description

Just wondering if it would be possible to decode the image and then send the results back into spamassassin, to take advantage of all the rules in there?

Otherwise, FuzzyOCR will have to maintain rules for (say) viagra and stock symbols, whereas spamassassin can have daily updated rules for those things.

Change History

23.02.2007 18:40:32 changed by anonymous

This is a great idea if you ask me. I think it would be best if the image were translated into plain text and sent through the existing, thoroughly tested gauntlet of spamassassin tests. I am really not comfortable with this "you said the magic word" style of scoring, especially given how quickly the points rack up. Most of the available spamassassin tests carry a lot of nuance, and that nuance is either lost or duplicated (which is a waste).

Also, hashing image files for later lookup is pretty much what razor/pyzor already do.

It seems like fuzzyocr is reinventing the wheel for everything that spamassassin already does.

Just get the text, in whatever form it is found, and pass it to spamassassin - if you mis-read the characters and get \/ instead of V, that's already something spamassassin looks for, anyway.
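To illustrate the point about misread characters, here is a minimal sketch of a pattern that tolerates "\/" in place of "V". The pattern and the function name are made up for illustration; this is not an actual spamassassin rule, only the general idea of matching look-alike characters.

```python
import re

# Hypothetical pattern, NOT an actual spamassassin rule: accept "v" or the
# look-alike "\/" as the first letter, and "1", "l" or "!" in place of "i".
OBFUSCATED_VIAGRA = re.compile(r"(?:v|\\/)[i1l!]agra", re.IGNORECASE)


def matches_obfuscated(text: str) -> bool:
    """Return True if the text contains the word or a look-alike spelling."""
    return OBFUSCATED_VIAGRA.search(text) is not None
```

With a pattern like this, OCR output that contains "\/iagra" would be caught the same way as a correctly read "Viagra", so the OCR step would not need to correct the misreading itself.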

07.07.2007 03:24:07 changed by anonymous

17.08.2007 18:42:23 changed by anonymous

(in reply to: ↑ description ) 08.11.2007 08:53:26 changed by on

  • summary changed from An idea for future versions to Pushing OCR'ed text back to SA.

I fully second that.

But that would imply quite a few functional changes in FuzzyOcr, as far as I can see:

- different context: the plugin would run on the message before any rule matching, so there is no way to say that FuzzyOcr should be skipped if the spam score is already high;

- no way to keep the score for a given image hash; the only thing that can be kept is the decoded text;

- no way to score corrupted images, because the plugin would only do decoding/rendering, not scoring;

- a mechanism is needed to tell that the applied scanset did a reasonable job and that the resulting text is worth passing back to SA. I think word detection is still a way to do that, possibly using a spell checker (if enough words pass the spell check, we have actually read some text);

- a huge change in the structure of the plugin: right now it extracts all the images to files and then scans each file, whereas it would need to extract and scan one image and push the text back to SA before moving on to the next one.
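The last two points could be sketched roughly as below. This is only an illustration in Python, not FuzzyOcr code: the names looks_like_text and text_for_sa, the thresholds, and the idea of a scanset as a plain function from image bytes to text are all assumptions, and a set of known words stands in for a real spell checker. How the accepted text is actually handed to SA is left out.

```python
import re
from typing import Callable, Iterable, List, Set


def looks_like_text(decoded: str, dictionary: Set[str],
                    min_words: int = 3, min_ratio: float = 0.5) -> bool:
    """Crude stand-in for a spell check: enough words must be known."""
    # Only consider runs of three or more letters as candidate words.
    words = re.findall(r"[a-z]{3,}", decoded.lower())
    if len(words) < min_words:
        return False
    known = sum(1 for word in words if word in dictionary)
    return known / len(words) >= min_ratio


def text_for_sa(images: Iterable[bytes],
                scansets: List[Callable[[bytes], str]],
                dictionary: Set[str]) -> List[str]:
    """Scan one image at a time; keep the first scanset result that passes."""
    accepted = []
    for image in images:
        for scan in scansets:  # hypothetical scansets: image bytes -> text
            decoded = scan(image)
            if looks_like_text(decoded, dictionary):
                accepted.append(decoded)
                break  # this image is done, move on to the next one
    return accepted
```

Each image is handled completely (scan, check, accept or drop) before the next one is touched, and text that fails the word check is never passed on.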

25.05.2008 20:14:25 changed by hiebmsp

