This is the FuzzyOcr Frequently Asked Questions (FAQ). Please read it before sending any support requests :)

Question 1: I've installed the FuzzyOcr plugin according to the INSTALL instructions,
but it doesn't seem to do anything. What can I do?

Answer 1: Try running FuzzyOcr on the samples provided within this tarball,
or download them separately from the download page.
The archive contains a README file with instructions for testing.

Question 2: I've installed the FuzzyOcr plugin according to the INSTALL instructions,
and I want to check that it is all working correctly.

Answer 2: See Answer 1.

Question 3: I've run SA on the samples, but FuzzyOcr isn't doing anything.

Answer 3: First of all, enable debug mode by setting focr_verbose 2 in the config file.
Also make sure that the logfile specified in the config file is writable.
Then run one of the samples and check the logfile for messages indicating
errors. See the remaining questions if you can't resolve an error message.

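The two checks from Answer 3 (is the logfile writable, and does it contain error messages) can be sketched in Python. This is only a rough helper, not part of FuzzyOcr: the log path and the keywords used to spot error lines are assumptions, since the log format is not described in this FAQ.

```python
import os
import re

# Assumption: error lines can be spotted by these keywords; the real
# FuzzyOcr log format is not described in this FAQ.
ERROR_PATTERN = re.compile(r"error|fail|skipping", re.IGNORECASE)


def logfile_is_writable(path):
    """Return True if the log file is writable.

    If the file does not exist yet, check its directory instead.
    """
    if os.path.exists(path):
        return os.access(path, os.W_OK)
    directory = os.path.dirname(path) or "."
    return os.access(directory, os.W_OK)


def find_error_lines(log_text):
    """Return the log lines that look like error messages."""
    return [line for line in log_text.splitlines() if ERROR_PATTERN.search(line)]


if __name__ == "__main__":
    log_path = "/tmp/FuzzyOcr.log"  # hypothetical; use the path from your config
    print("writable:", logfile_is_writable(log_path))
    if os.path.exists(log_path):
        with open(log_path, errors="replace") as handle:
            for line in find_error_lines(handle.read()):
                print(line)
```

Run it as the same user that runs SpamAssassin, otherwise the writability check tells you nothing about the real setup.
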
Question 4: My installation is working but I'm still getting image spam, what can I do?

Answer 4: There are several steps you can try to get rid of the remaining image spam:
- Save the image. If it is a GIF file, check whether it is animated/interlaced or normal.
- On a normal picture, run gocr -i filename
  Check the output. If it looks like garbage only (i.e. even with a bit of approximation, there is no word the plugin could match),
  then you need to try different settings. Experiment with the -l setting and, if the image is noisy, with the -d parameter,
  trying different values. If you get good results and are getting this kind of spam a lot, then add this setting
  to your scansets.
  Also make sure that you have enough keywords for this kind of spam in your wordlist.
- If you fail to get a usable result with gocr alone, try involving pnm processors like pnmnorm, pnmquant or pnminvert.
  There are no limits on what you can involve in a scanset to get text from a pnm file.
  You can even use commercial software, although that has never been tested.
  If you find scansets which generally improve the recognition rate, please send them to the mailing list.
- If even that fails and you are getting this image often, you can still add its md5 sum to the md5 database manually.

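Experimenting with -l and -d by hand gets tedious, so here is a small Python sketch that runs gocr once per combination and prints the results side by side. The value ranges and the file name are arbitrary examples, not recommended settings; pick whatever suits the spam you are seeing.

```python
import itertools
import shutil
import subprocess


def gocr_commands(image_path, grey_levels, dust_sizes):
    """Build one gocr command line per (-l, -d) combination."""
    return [
        ["gocr", "-l", str(level), "-d", str(dust), "-i", image_path]
        for level, dust in itertools.product(grey_levels, dust_sizes)
    ]


def run_sweep(image_path, grey_levels=(0, 120, 160, 200), dust_sizes=(-1, 2, 5)):
    """Run every combination and print the recognized text for comparison.

    The default values are example guesses, not tuned settings.
    """
    if shutil.which("gocr") is None:
        print("gocr not found in PATH")
        return
    for command in gocr_commands(image_path, grey_levels, dust_sizes):
        result = subprocess.run(command, capture_output=True,
                                text=True, errors="replace")
        print(" ".join(command))
        print(result.stdout.strip() or "(no output)")
        print("-" * 40)


if __name__ == "__main__":
    run_sweep("sample.pnm")  # hypothetical file name
```

If one combination gives readable text where the others give garbage, that is the candidate to add to your scansets.
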
Question 5: I'm often getting false positives because mails contain screenshots, what can I do?

Answer 5: There are some things you can try:
- Decrease the focr_threshold value to 0.2 or 0.21, which makes the matching more exact.
- Check if the false positives are caused only by specific words on your wordlist and remove these error-prone words.

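To see why a lower threshold makes matching more exact, here is a toy Python illustration. It assumes the threshold is compared against an edit distance divided by the keyword length, and that a word matches only when that ratio is strictly below the threshold. Plain Levenshtein distance is used as a stand-in; the plugin's real matching may differ in detail, and the words and the 0.25 value are just examples.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(keyword, candidate, threshold):
    """Assumed rule: match when distance / len(keyword) is below the threshold."""
    return edit_distance(keyword, candidate) / len(keyword) < threshold


# "stack" (harmless screenshot text) differs from the keyword "stock" by one letter.
for threshold in (0.25, 0.2):
    print(threshold, fuzzy_match("stock", "stack", threshold))
```

Under these assumptions, one wrong letter in a five-letter word still counts as a match at 0.25 but no longer does at 0.2, which is the kind of near miss that screenshots tend to produce.
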
Question 6: I'm using Red Hat or a Red Hat based distribution and all my gocr results look bad.

Answer 6: On Red Hat based systems, some RPMs/SRPMs are built incorrectly with the parameter "--with-netpbm=no".
This is wrong; you need to make sure that you have a gocr build compiled WITH netpbm support.

Question 7: My gocr segfaults on some pictures.

Answer 7: Please patch your gocr source with the patch available on my download page and rebuild it.

Question 8: My giftext segfaults on some pictures.

Answer 8: Please patch your giftext source with the patch available on my download page and rebuild it.

Question 9: I'm getting "Failed to open pipe to external programs with pipe command..."

Answer 9: This indicates a failure in opening the pipe itself, most likely caused by a missing binary.

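A quick way to look for a missing binary is to check the search path for each tool. The sketch below uses only the tool names mentioned in this FAQ; your scansets may call others, so treat the list as a starting point.

```python
import shutil

# Tools named in this FAQ; extend with whatever your scansets call.
TOOLS = ["gocr", "giftext", "jpegtopnm", "pnmnorm", "pnmquant", "pnminvert"]


def missing_tools(names, path=None):
    """Return the names that cannot be found as executables on the search path."""
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    missing = missing_tools(TOOLS)
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all tools found")
```

Run it as the user SpamAssassin runs as, since that user's PATH may differ from yours.
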
Question 10: I am using MailScanner and I'm getting "Unexpected error in pipe to external programs...."
with the graphics tool pipes (for example, jpegtopnm failing).

Answer 10: MailScanner by default only passes the first 30 KB of the mail to SpamAssassin.
If the image is bigger, this sometimes causes it to be truncated in the middle.
The only way to fix this at the moment is to disable this option in MailScanner (see your documentation).

Question 11: I'm getting "Unexpected error in pipe to external programs...."

Answer 11: This indicates a failure somewhere in the pipe, most likely caused by a missing binary.
Also, if a program within the executed chain fails, this will cause such an error.
If you get this only rarely on some specific mails, then it can be caused by extremely broken images.
To find out which binary fails, get the picture which causes the error, and run the chain of programs
manually over the picture. You can do this step by step and check for error messages along the way.
If you get pictures that cause such errors in the program chain, please send them to me.

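Running the chain step by step can also be scripted. The Python sketch below feeds the picture through each program in turn and reports the first one that fails, together with its error output. The chain shown is a hypothetical example; substitute the commands from the scanset that actually fails on your system.

```python
import os
import subprocess
import sys


def run_chain(stages, input_bytes):
    """Run each stage in turn, feeding it the previous stage's output.

    Returns (index, message) for the first stage that fails,
    or None if every stage exits with status 0.
    """
    data = input_bytes
    for index, command in enumerate(stages):
        try:
            result = subprocess.run(command, input=data, capture_output=True)
        except FileNotFoundError:
            return index, "binary not found: " + command[0]
        if result.returncode != 0:
            return index, result.stderr.decode(errors="replace")
        data = result.stdout
    return None


if __name__ == "__main__":
    # Hypothetical chain and file name; adjust both to your failing scanset.
    stages = [["jpegtopnm"], ["pnmnorm"], ["gocr", "-i", "-"]]
    image = sys.argv[1] if len(sys.argv) > 1 else "broken.jpg"
    if os.path.isfile(image):
        with open(image, "rb") as handle:
            failure = run_chain(stages, handle.read())
        if failure is None:
            print("all stages succeeded")
        else:
            print("stage", failure[0], "failed:", failure[1])
    else:
        print("no such file:", image)
```
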
Question 12: I'm getting "Skipping scanset "xyz" because of errors, trying next...",
what does that mean?

Answer 12: This indicates that the scanset command "xyz" failed, either because a program
in the scanset was missing or because it produced an error. This doesn't need to be your fault;
it can happen with some images, especially with the pnmquant scanset.
This is only critical if you get it for every scanset. That most likely indicates that
your gocr path is wrong or something else is wrong with the gocr binary (check Question 7).