Ticket #405 (assigned enhancement)

Opened 1 year ago

Last modified 4 months ago

add pdf processing

Reported by: AnonymousDog
Assigned to: decoder (accepted)
Priority: major
Milestone: Development Release Version 3.5
Component: Image Analysis
Version: SVN
Keywords: pdf
Cc:

Description

This mostly just needs PDF recognition and preprocessing to PNM, perhaps using pdf2ps and then pstopnm. The rest should flow through what's already in the code.
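The conversion suggested above could be sketched roughly as follows. This is only an illustration, not code from the project: the function names `build_commands` and `pdf_to_pnm` are made up for this example, and it assumes `pdf2ps` (from Ghostscript) and `pstopnm` (from netpbm) are on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(pdf_path, ps_path):
    """Return the two command lines: PDF -> PostScript, then PostScript -> PNM."""
    return [
        ["pdf2ps", str(pdf_path), str(ps_path)],
        # -stdout asks pstopnm to write the image data to standard output
        ["pstopnm", "-stdout", str(ps_path)],
    ]


def pdf_to_pnm(pdf_path):
    """Convert a PDF to PNM data via an intermediate PostScript file (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        ps_path = Path(tmp) / "page.ps"
        to_ps, to_pnm = build_commands(pdf_path, ps_path)
        subprocess.run(to_ps, check=True)
        result = subprocess.run(to_pnm, check=True, stdout=subprocess.PIPE)
        return result.stdout
```

The PNM bytes returned by `pdf_to_pnm` could then be handed to whatever image handling already exists.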

Attachments

Change History

02.07.2007 18:55:34 changed by AnonymousDog

I see now that you replied recently to a similar request. Here's my counterargument to the reasoning for not scanning PDFs: I understand you don't see a future in PDF image spam, but at present the spammers think differently, and my sites that have a history of getting lots of stock spam have been getting hit hard with PDF image spam. The problem is that, much like other image spam, nothing else identifies these emails as spam, since they usually have blank or Bayes-poison email contents (with the PDF as the only identifiable-as-spam data). Non-draconian SA rules would be very difficult (or impossible) to write, since there is very little objective difference (except the PDF contents) between this spam and the many blank emails serving as PDF attachment envelopes that businesses need to be able to receive.

Regardless of whether it's efficient as a spam technique, it's being used and is driving up false negatives.

I'd be grateful (and donate some cash) if you could write a patch or make it a feature that must be manually enabled.

03.07.2007 06:12:44 changed by Neubian

PDF spam is increasing. I'm in for $50 USD to see this happen. That may not buy a pizza after conversion, but I hope I'm not the only one to chip in to prioritize this. And I WILL follow through. Others, don't pledge and flake out.

03.07.2007 15:16:11 changed by decoder

  • status changed from new to assigned.
  • version changed from 3.5.1 to SVN.

Ok, I give up. The feature has been added to the current SVN tree; you can test it by checking out the latest SVN revision. Because I lack samples of new PDF spam, I didn't test this intensively, but I expect you to test it.

To enable the feature, see the new config file.

Also, in addition to pstopnm (which is part of netpbm), you need pdfinfo and pdftops, which are both part of the poppler package (which you most likely have installed). Note that pstopnm requires a working Ghostscript installation, as its manpage states.
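Before enabling the feature, it may help to check that the external helpers can actually be found. A minimal sketch, assuming the tools are looked up on the PATH; `missing_tools` is a hypothetical helper, and the list in the example (including `gs` as the usual Ghostscript binary name) is only illustrative and should be adjusted to what your config actually needs.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Illustrative list only; extend it with the other converters your setup uses.
    missing = missing_tools(["pstopnm", "pdfinfo", "gs"])
    if missing:
        print("Missing helper programs:", ", ".join(missing))
    else:
        print("All listed helper programs were found.")
```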

Best regards,

Chris

03.07.2007 17:21:25 changed by Jason

Every image-spam PDF that I've seen so far is damaged according to pdfinfo. For example:

    pdfinfo Alert.YVVSJNIZETFTTQ.pdf
    Error (0): PDF file is damaged - attempting to reconstruct xref table...
    Tagged: no
    Pages: 1
    Encrypted: no
    Page size: 547 x 199 pts
    File size: 17084 bytes
    Optimized: no
    PDF version: 1.3

I assume this has to do with the way the spammers are building the pdf document. Right now I have mimedefang running 'pdfinfo $file 2>&1' for each pdf, and if it has an error, I'm quarantining the entire message. So far no false positives.
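The check described above (run pdfinfo with stderr merged into stdout, and flag the file if an error line shows up) might look something like this. It is a sketch under assumptions, not the mimedefang filter itself: the function names are invented here, and treating any line that starts with "Error" or "Syntax Error" as damage is a guess about pdfinfo's message format, which can differ between versions.

```python
import subprocess

# Assumed prefixes of pdfinfo error lines; adjust for your pdfinfo version.
ERROR_PREFIXES = ("Error", "Syntax Error")


def parse_pdfinfo(output):
    """Split pdfinfo output into a list of error lines and a dict of fields."""
    errors, fields = [], {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(ERROR_PREFIXES):
            errors.append(line)
        elif ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return errors, fields


def run_pdfinfo(pdf_path):
    """Run pdfinfo with stderr merged into stdout (like '2>&1') and parse the result."""
    proc = subprocess.run(
        ["pdfinfo", str(pdf_path)],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        errors="replace",
    )
    return parse_pdfinfo(proc.stdout)


sample = """Error (0): PDF file is damaged - attempting to reconstruct xref table...
Tagged: no
Pages: 1
Page size: 547 x 199 pts
"""
errors, fields = parse_pdfinfo(sample)
is_damaged = bool(errors)
```

A caller could quarantine the message when `is_damaged` is true, or feed the flag into a score instead.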

Also, the image-spam PDFs contain actual images, so it might be better to use pdfimages to extract just the images. This should lead to fewer false positives. The only issue with pdfimages is that it doesn't output to STDOUT; it creates a file for each image within the PDF.

04.07.2007 00:32:18 changed by decoder

Thank you very much, that info is very valuable.

Tomorrow, I will continue to work on PDF recognition and add the following things:

- Catch damaged PDFs and score them like damaged images
- Test pdfimages

The fact that pdfimages writes its output to files might be a problem, but maybe we can work around that...
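One possible workaround for the file-output behaviour is to point pdfimages at a temporary directory and read the results back. This is a minimal sketch, assuming pdfimages is on the PATH and is called as `pdfimages <PDF-file> <image-root>`; the helper names and the `img` root are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_command(pdf_path, out_dir, root="img"):
    """Build the pdfimages command line; output files are named <out_dir>/<root>-NNN.*"""
    return ["pdfimages", str(pdf_path), str(Path(out_dir) / root)]


def extract_images(pdf_path):
    """Return the raw bytes of every image pdfimages extracts from the PDF."""
    images = []
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(pdfimages_command(pdf_path, tmp), check=True)
        # Read everything back before the temporary directory is removed.
        for image_file in sorted(Path(tmp).iterdir()):
            images.append(image_file.read_bytes())
    return images
```

The temporary directory is deleted automatically, so nothing is left on disk after the images have been read.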

We'll see then how well this works.

Chris

05.07.2007 11:48:38 changed by decoder

Sorry, but I have to postpone this a bit. I have quite a lot of other work at the moment that needs to be done, but as soon as I get the time, I will work on this :)

10.07.2007 17:06:18 changed by anonymous

I just wanted to add a "me too" on this. I just got 4 PDF spams this morning. In most cases I get very little spam, so I am waiting to hear from my user who gets lots of spam. It looks like it is a growing problem. I am glad you have some support; I might have to switch to SVN.

27.07.2007 16:46:31 changed by anonymousdog

Just wanted to add this to the ticket as a suggestion (from discussion at http://www.freespamfilter.org/forum FOCR topic):

What we really need is a module that:

  • tests the pdf(s) for corruption
  • parses them reliably for metadata
  • extracts embedded images and either passes them to FOCR or OCRs them itself
  • uses something like pdftotext to extract body text and calls SA to process it with body rules

and passes a score back to SA. It looks like PDF::OCR and PDF::OCR::Thorough do most of that (except for the metadata), especially Thorough, which uses PDF::API2 to check for corruption as well as both pdftotext and tesseract to extract text.
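To make the shape of such a module a bit more concrete, here is a rough sketch of how the individual findings could be collected and turned into a single score. Everything in it is hypothetical: the `PdfReport` fields, the `score_report` function, the score values and the word list are placeholders for illustration, not anything taken from FOCR, SA, or the Perl modules named above.

```python
from dataclasses import dataclass, field


@dataclass
class PdfReport:
    """Findings gathered from one PDF attachment (hypothetical structure)."""
    damaged: bool = False                          # result of the corruption test
    metadata: dict = field(default_factory=dict)   # parsed document metadata
    ocr_text: list = field(default_factory=list)   # text recognized in embedded images
    body_text: str = ""                            # text extracted from the PDF body


def score_report(report, damaged_score=2.0, word_score=0.5, spam_words=("stock",)):
    """Combine the findings into one number that could be passed back to the caller."""
    score = damaged_score if report.damaged else 0.0
    text = " ".join(report.ocr_text + [report.body_text]).lower()
    # Placeholder scoring: add a fixed amount for each listed word found in the text.
    score += sum(word_score for word in spam_words if word in text)
    return score


report = PdfReport(damaged=True, ocr_text=["Hot STOCK tip"])
total = score_report(report)
```

In a real setup the body text would more likely be handed to SA's own body rules rather than matched against a fixed word list.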

PDF::Parse and Image::ExifTool both can extract metadata.

PDFInfo wrapped around PDF::OCR and Image::ExifTool functionality could be a winner.

01.08.2007 08:47:37 changed by anonymous

01.08.2007 08:47:54 changed by anonymous

10.08.2007 14:12:17 changed by anonymous

