Ticket #2946 (assigned enhancement)
Avoiding conversion to PNM would be better
| Reported by: | Melkhior | Owned by: | decoder |
|---|---|---|---|
| Priority: | minor | Milestone: | |
| Component: | External toolchain | Version: | |
| Keywords: | | Cc: | brad-fuzzy@… |
Description
Right now, FuzzyOcr starts by converting whatever image it gets into a PNM file, before applying the list of preprocessors and then the OCR. The problem is that the conversion to PNM sometimes removes information from the image and degrades the OCR results.
GOCR can directly accept JPEG as input. Tesseract only accepts TIFF, but converting JPEG directly to TIFF gives better results than going through PNM.
In particular, the preprocessing I've mentioned on the GoodOcrSettings wiki pages gives much better results if applied directly to the JPEG so that it outputs the TIFF directly. On some images, it goes from matching 2 or 3 words at best to 5 or 6, with a lower fuzz.
I have a patch that adds a "need_pnm" option to Scansets; if set to 'no', then the raw image file ($tfile) is sent to the scanset instead of the PNM file ($pfile). Since in my case the first preprocessor is ImageMagick's convert, it works just fine (it goes hand-in-hand with the 'maketiff' patch to work with Tesseract).
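
For illustration only, here is a minimal sketch of the input selection that the "need_pnm" option implies. It is written in Python rather than FuzzyOcr's Perl and is not the patch itself. The function name `select_scanset_input` and the dict-based scanset are hypothetical; only `need_pnm`, `tfile` and `pfile` come from this ticket. Defaulting to 'yes' when the option is absent is an assumption, made so that existing scansets keep their current behaviour.

```python
def select_scanset_input(scanset, tfile, pfile):
    """Return the path of the file a scanset should receive.

    scanset: dict of scanset options (hypothetical representation)
    tfile:   path to the raw image file as received
    pfile:   path to the PNM conversion of that image
    """
    # Assumption: a missing option behaves like 'yes', so scansets
    # that do not mention need_pnm still get the PNM file.
    need_pnm = str(scanset.get("need_pnm", "yes")).strip().lower()
    return tfile if need_pnm == "no" else pfile


# A scanset whose first preprocessor reads the raw image itself
raw_scanset = {"need_pnm": "no"}
# A scanset that says nothing about need_pnm
legacy_scanset = {}

print(select_scanset_input(raw_scanset, "image.jpg", "image.pnm"))
print(select_scanset_input(legacy_scanset, "image.jpg", "image.pnm"))
```

With this shape, only scansets that explicitly opt out of PNM see the raw file, so preprocessors that expect PNM input are unaffected.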

