Good OCR settings for FuzzyOcr
Note: this page was originally written for FuzzyOcr version 3.6.0. Other releases may have different default values.
By default, FuzzyOcr uses both ocrad and gocr. It also adds some basic preprocessing for ocrad and passes special options to gocr. All of this is defined in the FuzzyOcr.scansets file.
The goal of this page is to help get better results from the OCR process.
OCR software
GOCR
GOCR is a GPL OCR program.
By default, FuzzyOcr uses it in two different ways:
- Using the default settings (autodetect for everything)
- Forcing the grey level (option -l) and the dust size (option -d)
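As a rough sketch, the two gocr variants above could be written as scansets like the following. The scanset names and the values passed to -l and -d are illustrative assumptions, not necessarily the shipped defaults; check the FuzzyOcr.scansets file of your release for the actual values.
# Sketch only: scanset names and numeric values are assumptions
scanset gocr {
command = $gocr
args = -i $input
}
scanset gocr-forced {
command = $gocr
args = -l 180 -d 2 -i $input
}
Here -l sets the grey level threshold and -d the dust size, while -i names the input file.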
OCRAD
OCRAD is a GPL OCR program.
By default, FuzzyOcr uses it in 4 different ways. In all cases, option -s is used to scale the input image to 500%.
- Using the default settings (except for the -s option)
- Inverting the level (i.e. white on black)
- Decolorizing the image with a preprocessor, then inverting
- Decolorizing the image with a preprocessor
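A minimal sketch of the first two ocrad variants might look like this. It assumes that -s takes a scale factor (5 for 500%) and that -i inverts the image; the scanset names are made up for illustration and may differ from those in your FuzzyOcr.scansets file.
# Sketch only: scanset names are assumptions
scanset ocrad {
command = $ocrad
args = -s 5 $input
}
scanset ocrad-invert {
command = $ocrad
args = -s 5 -i $input
}
The two decolorizing variants would additionally carry a "preprocessors = ..." line naming preprocessors defined in FuzzyOcr.preps.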
Tesseract
Optionally, FuzzyOcr can use tesseract. Getting a recent version of tesseract to work might require a patch, because tesseract wants its input file name to end in ".tif". See ticket number 500 (http://fuzzyocr.own-hero.net/ticket/500) for a patch kept in the Debian bug tracking system.
At the time of this writing, tesseract is released as version 2.04. Preliminary code for version 3 is available in the SVN repository. The two versions give different results, with version 3 finding words that version 2 doesn't, but occasionally the reverse is true. Choosing a version (or even using both with appropriate scansets) is up to you.
Improving scanning
The default scanset can be improved in two ways:
- Specify the language to match
- Preprocess the image
Language
If you expect spam in a language other than the default, you can use the following scanset to specify the language (here, French):
scanset tesseract-fra {
preprocessors = maketiff
command = $tesseract
args = $input $output -l fra
force_output_in = $output.txt
}
You can define more than one such scanset if you need more than one language and can afford the extra CPU consumption and delay.
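For instance, a second scanset for German could sit next to the French one. This is only a sketch: it assumes that the German language data for tesseract (language code deu) is installed, and the scanset name is an arbitrary choice.
# Sketch only: assumes the "deu" language data is installed
scanset tesseract-deu {
preprocessors = maketiff
command = $tesseract
args = $input $output -l deu
force_output_in = $output.txt
}
Each additional scanset means one more tesseract run per image, so the scan time grows with the number of languages.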
Preprocessing
Preprocessing can be done by various tools. One of them is the highly configurable ImageMagick. If you receive image-based spam with very noisy pictures, then you can clean up the image first by using a preprocessor like this one (in file FuzzyOcr.preps):
preprocessor im-resize-threshold-despeckle {
command = /soft/local/ImageMagick-6.5.4-10/bin/convert
args = -resize 200% -threshold 50% -despeckle $input $output
}
This doubles the image size, turns it into black and white, then removes the speckles, hopefully helping tesseract get better results. The scanset then simply becomes:
scanset im-tesseract-fra {
preprocessors = im-resize-threshold-despeckle, maketiff
command = $tesseract
args = $input $output -l fra
force_output_in = $output.txt
}
Obviously, such preprocessing can also be used with the other OCR tools. But be warned: the more scansets you define, the longer the scan will take. Also, the more preprocessing you apply, the higher the chance of a false positive. So be careful.
