Ticket #2946 (assigned enhancement)

Opened 12 months ago

Last modified 10 months ago

Avoiding conversion to PNM would be better

Reported by: Melkhior Owned by: decoder
Priority: minor Milestone:
Component: External toolchain Version:
Keywords: Cc: brad-fuzzy@…

Description

Right now, FuzzyOcr starts by converting whatever images it gets into a PNM file, before applying the list of preprocessors and then the OCR. The problem is that the conversion to PNM sometimes removes information from the images and degrades the OCR results.

GOCR can directly accept JPEG as input. Tesseract only accepts TIFF, but direct JPEG to TIFF gives better results than going through PNM.

In particular, the preprocessing I've mentioned on the GoodOcrSettings wiki pages gives much better results if applied directly to the JPEG to directly output the TIFF. On some images, it goes from matching 2 or 3 words at best to 5 or 6, with a lower fuzz.

I have a patch that adds a "need_pnm" option to Scansets; if set to 'no', then the raw image file ($tfile) is sent to the scanset instead of the PNM file ($pfile). As the first preprocessor in my case is ImageMagick's convert, it works just fine (it goes hand-in-hand with the 'maketiff' patch to work with tesseract).
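The selection described above can be sketched roughly as follows. This is only an illustration in Python, not the attached patch (FuzzyOcr itself is written in Perl): the function name `choose_scanset_input`, the dictionary representation of a scanset, and the default of 'yes' when the option is absent are all assumptions made for the example.

```python
def choose_scanset_input(scanset: dict, tfile: str, pfile: str) -> str:
    """Pick the file handed to a scanset's first preprocessor.

    scanset -- hypothetical dict of scanset options (an assumption of this sketch)
    tfile   -- path of the raw image file as received
    pfile   -- path of the PNM conversion of that image
    """
    # Assumed default: keep the existing behaviour (use the PNM file)
    # unless the scanset explicitly says need_pnm = no.
    need_pnm = str(scanset.get("need_pnm", "yes")).strip().lower()
    return tfile if need_pnm == "no" else pfile
```

With this shape, a scanset that sets need_pnm to 'no' receives the raw file, and every other scanset keeps receiving the PNM file.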

Attachments

combined_patch.3.6.0 (2.9 KB) - added by Melkhior 11 months ago.
Combined patch to support jpg-to-tesseract (including the maketiff patch)
FuzzyOcr.preps (1.0 KB) - added by Melkhior 11 months ago.
Sample preprocessors for the combined patch
FuzzyOcr.scansets (2.3 KB) - added by Melkhior 11 months ago.
Sample scansets for the combined patch

Change History

Changed 11 months ago by bfritz

Melkhior,

If you're willing to share the patch, I'd like to try it out. And thanks for the GoodOcrSettings tips.

Changed 11 months ago by bfritz

  • cc brad-fuzzy@… added

Add myself to CC list.

Changed 11 months ago by Melkhior

Combined patch to support jpg-to-tesseract (including the maketiff patch)

Changed 11 months ago by Melkhior

Sample preprocessors for the combined patch

Changed 11 months ago by Melkhior

Sample scansets for the combined patch

Changed 11 months ago by Melkhior

I've attached the patch for those interested. Works fine on some images, not at all on others :-( For me, it gives much better results than either gocr or ocrad with the default settings. The exact threshold to use for optimal results is unfortunately dependent on the image.

Note that the direct conversion from JPG (or other) to TIFF must be made by a preprocessor whose name includes the string "maketiff".

You will likely need to change the path to "convert" (from ImageMagick). Recent versions of "convert" support the option "-deskew", but on the wavy/flag-like spams, "-deskew" doesn't do much (for me).
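As a rough illustration of the two notes above, the sketch below checks the "maketiff" naming rule and builds a direct image-to-TIFF command line for "convert". It is written in Python for readability and is not taken from the patch: the helper names, the `/usr/bin/convert` path, and the `40%` deskew threshold are assumptions, and the sketch only builds the argument list without running anything.

```python
from typing import List

# Hypothetical location of ImageMagick's convert; adjust to your system.
CONVERT_PATH = "/usr/bin/convert"


def is_maketiff(preprocessor_name: str) -> bool:
    """True if the preprocessor name contains the string "maketiff"."""
    return "maketiff" in preprocessor_name


def build_convert_command(src: str, dst: str, deskew: bool = False,
                          convert_path: str = CONVERT_PATH) -> List[str]:
    """Build the argument list for a direct conversion, e.g. JPEG to TIFF.

    convert infers the output format from the extension of dst.
    """
    cmd = [convert_path, src]
    if deskew:
        # -deskew takes a threshold; 40% is an assumed value for this sketch.
        cmd += ["-deskew", "40%"]
    cmd.append(dst)
    return cmd
```

For example, `build_convert_command("spam.jpg", "spam.tif", deskew=True)` yields an argument list that could be passed to `subprocess.run` once the path to "convert" has been checked.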

Changed 10 months ago by decoder

  • status changed from new to assigned

Hello Melkhior,

thanks for the contribution. Right now, I'm too busy to review/merge the patch into the main version, but until I can do that, people can get the patch from here anyway. Thanks for sharing it :)

Chris
