k**********g Posts: 989 | 3
PDF: iText, iTextSharp. If the PDF embeds images, see the image entry below.
PDF is a composition-based (rendering-based) format.
However, some PDFs cannot be parsed unless they are rendered graphically.
For this type of PDF you must use a PDF renderer. The PDF renderers I know
of are all commercial.
If the PDF contains text (you can test this by trying to select text in the
document with any PDF reader), the text can be extracted with an IFilter
plugin. See below (under PPT).
JPEG or any other image format: you have to use an OCR library, commercial
or free (e.g. Tesseract).
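For the free route, a rough sketch of calling the Tesseract command-line tool from Python. It assumes the `tesseract` binary is installed and on PATH; the function names are hypothetical, and `chart.jpg` in the usage note is a placeholder file name.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argv for: tesseract <image> stdout -l <lang>."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Assumption: the tesseract executable and the requested language data
    are installed. Raises CalledProcessError if Tesseract exits non-zero.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Usage would be something like `ocr_image("chart.jpg")`, or `ocr_image("chart.jpg", lang="chi_sim")` for Simplified Chinese. The output is plain text only; it says nothing about the lines connecting the boxes.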
PPT, Office documents, and PDFs containing text: use a Windows IFilter
plugin. This may require both C++ and C# programming.
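Since the right tool depends on the file type, one way to tie the three cases together is to look at the first bytes of each file and pick a route. This is only a sketch: the route labels ("pdf", "ocr", "ifilter") and function names are invented for illustration, and it assumes the signature sits at offset 0 of the file.

```python
PDF_MAGIC = b"%PDF-"
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .ppt/.doc/.xls container
ZIP_MAGIC = b"PK\x03\x04"  # .pptx/.docx/.xlsx are ZIP files (so is any other ZIP)


def choose_extractor(header):
    """Map the leading bytes of a file to one of the routes described above."""
    if header.startswith(PDF_MAGIC):
        return "pdf"      # text layer via a PDF library/IFilter, else render + OCR
    if header.startswith((JPEG_MAGIC, PNG_MAGIC)):
        return "ocr"      # image formats go to an OCR library
    if header.startswith((OLE2_MAGIC, ZIP_MAGIC)):
        return "ifilter"  # Office documents go to a Windows IFilter
    return "unknown"


def choose_extractor_for_file(path):
    """Read the first 8 bytes of the file and pick a route."""
    with open(path, "rb") as handle:
        return choose_extractor(handle.read(8))
```

The ZIP check is deliberately loose: it cannot tell a .pptx from an ordinary archive, so a real pipeline would still need to look inside the container.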
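[Quoting s******a's post]
: Some of the input files are in PDF, JPEG, or PPT format. These files all contain block diagrams.
: The diagrams may contain relationships between organizations and people. Is there any software or open source library
: that can extract these organization and person names, and also extract the relationships (mainly the connecting lines in the diagrams)?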