C# PDF Parsing (ASP.NET)

Published: 2019/10/22 16:21:19  Author: Admin  Views: 433

1. Introduction

This project reads and parses PDF files and displays their internal structure. The PDF file specification is available from Adobe. This project is based on “PDF Reference, Sixth Edition, Adobe Portable Document Format Version 1.7, November 2006”, an intimidating 1310-page document. This article provides a concise overview of the specification. The associated project defines C# classes for reading and parsing a PDF file. To test these classes, the attached test program PdfFileAnalyzer reads a PDF file, analyzes it, and displays and saves the result. The program breaks the PDF file into individual page descriptions, fonts, images and other objects.

Version 2.0 supports encrypted files. The software is divided into a PDF reader library and a test/demo program.

2. Overview

The PDF file is structured to allow Adobe Acrobat to display and print each page on a variety of screens and printers. If you open the file with a binary editor you will see that most of the file is unreadable. The small sections that are readable look like:

1 0 obj
<</Lang(en-CA)/MarkInfo<</Marked true>>/Pages 2 0 R
/StructTreeRoot 10 0 R/Type/Catalog>>
endobj
2 0 obj
<</Count 1/Kids[4 0 R]/Type/Pages>>
endobj 
4 0 obj
<</Contents 5 0 R/Group <</CS/DeviceRGB /S/Transparency /Type/Group>>
/MediaBox[0 0 612 792] /Parent 2 0 R
/Resources <</Font <</F1 6 0 R /F2 8 0 R>>
/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>
/StructParents 0/Tabs/S/Type/Page>>
endobj
5 0 obj
<</Filter/FlateDecode/Length 2319>>
stream
. . .
endstream
endobj

The file is made of objects enclosed between the “n 0 obj” and “endobj” keywords. The PDF term is indirect objects. The two numbers before “obj” are the object number and the generation number. The generation number is zero in this example, as it is in most files. Items enclosed within double angle brackets <<>> are dictionaries. Items enclosed within square brackets [] are arrays. Items starting with a slash / are parameter names (e.g. /Pages). In the example above the first item “1 0 obj” is the document catalog, or root object. The catalog dictionary has an entry “/Pages 2 0 R”. This is a reference to the object that defines the tree of pages. In this case, object number 2 refers to a single page, “/Kids[4 0 R]”, so this is a one-page document. Object number 4 is the only page definition. The page size is 612 by 792 points, in other words 8.5” by 11” (1” is 72 points). The page uses two fonts, F1 and F2, defined in objects 6 and 8. The page contents are described in object number 5, which has a stream that describes the painting of the page. In the example, “. . .” is a placeholder for this description. If you look at the PDF file with a binary editor, the stream appears as a long block of unreadable bytes because the data is compressed. The stream is compressed with the ZLib deflate method, as specified in the dictionary by “/Filter/FlateDecode”. The compressed stream is 2319 bytes long. If you decompress the stream, the first few items look something like this:

q
37.08 56.424 537.84 679.18 re
W* n
/P <</MCID 0>> BDC 0.753 g
36.6 465.43 537.96 24.84 re
f*
EMC  /P <</MCID 1/Lang (x-none)>> BDC BT
/F1 18 Tf
1 0 0 1 39.6 718.8 Tm
0 g
0 G
[(GRA)29(NOTECH LI)-3(MIT)-4(ED)] TJ
ET

This is a small sample of page description language. In this example “re” stands for rectangle. The four numbers before it are position and size “X Y Width Height”.
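
As a rough illustration of the FlateDecode step described above, the following Python sketch compresses a fragment of the sample page description with zlib and decompresses it again. It is not part of the PdfFileAnalyzer project (which is written in C#). The helper name decode_flate is made up for this example, and the sketch assumes the stream dictionary has no /DecodeParms predictor.

```python
import zlib


def decode_flate(raw: bytes) -> bytes:
    """Undo /FlateDecode: the stream body is a zlib (deflate) payload."""
    return zlib.decompress(raw)


# Stand-in for the bytes between "stream" and "endstream" in object 5.
page_description = b"q\n37.08 56.424 537.84 679.18 re\nW* n\n"
compressed = zlib.compress(page_description)

decoded = decode_flate(compressed)
print(decoded.decode("ascii"))
```

In a real file the compressed bytes would be read from the file using the stream position and the /Length value from the stream dictionary.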

This simplified example demonstrates the general idea behind PDF files. You start with a root object that points to a hierarchy of pages. Each page defines resources such as fonts, images and contents streams. Contents streams are made of the operators and arguments required to paint the pages. The PdfFileAnalyzer produces an object summary file that contains all the objects without the streams. Each stream is decoded and saved as a separate file. Page descriptions and other text streams are saved as .txt files, image streams as .jpg or .bmp files, font streams as .ttf files, and other binary streams as .bin files. Page descriptions go through another parsing pass that translates the cryptic one- or two-letter operator codes into pseudo C# source.
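
The idea of translating operator codes into call-like pseudo source can be sketched in a few lines of Python. This is only an illustration: the readable names in OPERATOR_NAMES are invented here and may not match what PdfFileAnalyzer emits, and the sketch handles numeric operands only (no names, strings, arrays or inline dictionaries).

```python
# Hypothetical readable names; the real tool's naming may differ.
OPERATOR_NAMES = {
    "q": "SaveGraphicsState",
    "re": "Rectangle",
    "W*": "ClipEvenOdd",
    "n": "EndPathNoPaint",
    "g": "SetGrayNonStroking",
    "f*": "FillEvenOdd",
}


def is_number(token: str) -> bool:
    """Return True when the token parses as a number."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def to_pseudo_source(content: str) -> list[str]:
    """Turn 'operands... operator' sequences into call-like lines."""
    lines, operands = [], []
    for token in content.split():
        if is_number(token):
            operands.append(token)          # operands come before the operator
        else:
            name = OPERATOR_NAMES.get(token, token)
            lines.append(f"{name}({', '.join(operands)}); // {token}")
            operands = []
    return lines


sample = ("q 37.08 56.424 537.84 679.18 re W* n "
          "0.753 g 36.6 465.43 537.96 24.84 re f*")
for line in to_pseudo_source(sample):
    print(line)
```

The postfix layout of the page description language (arguments first, operator last) is what makes this single left-to-right pass sufficient.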

The remainder of this article covers the PDF file structure and the parsing process in more detail. The following sections cover object definitions, file structure, file parsing, file reading, and using the PdfFileAnalyzer program.

3. Object Definitions

A PDF file is made of objects. Each PDF object type has a corresponding class in the PdfFileAnalyzer project. All of these object classes are derived from the PdfBase class. The source file for the object class definitions is BasicObjects.cs. The exact definition of PDF objects is given in chapter 3 of Adobe's PDF specification.

3.1. Basic Objects

  • Boolean object is implemented by the PdfBoolean class. The PDF definition of Boolean is the same as in C#.
  • Integer object is implemented by the PdfInt class. The PDF definition is the same as Int32 in C#.
  • Real number object is implemented by the PdfReal class. The PDF definition is the same as Single in C#.
  • String object is implemented by the PdfStr class. The PDF definition is different from C#: a string is made of bytes, not characters. It is enclosed in parentheses (). The PdfFileAnalyzer saves the PDF string in a C# string, including the parentheses. PDF strings are useful for ASCII encoding.
  • Hexadecimal string object is implemented by the PdfHex class. It is a string defined by two hex digits per byte and enclosed within angle brackets <>. The PdfFileAnalyzer saves the PDF hex string in a C# string, including the angle brackets. For PDF readers the string and the hex string objects serve the same purpose. The string (AB) is equivalent to <4142>. PDF hex strings are useful for any encoding.
  • Name object is implemented by the PdfName class. A name object is made of a forward slash followed by a sequence of characters, for example /Width. Name objects are used as parameter names. The PdfFileAnalyzer saves the name object in a C# string, including the leading /.
  • Null object is implemented by the PdfNull class. The PDF definition of null is basically the same as in C#.
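
The equivalence of (AB) and <4142> can be checked with a small Python sketch. The function names are hypothetical and are not taken from the PdfFileAnalyzer classes. The literal-string helper deliberately ignores escape sequences and nested parentheses, which a full reader must handle.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hex string such as <4142> into raw bytes."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("hex string must be enclosed in <>")
    digits = "".join(token[1:-1].split())   # white space inside is ignored
    if len(digits) % 2:                     # odd count: last digit is taken as 0
        digits += "0"
    return bytes.fromhex(digits)


def decode_simple_literal(token: str) -> bytes:
    """Decode a literal string that has no escapes or nested parentheses."""
    if not (token.startswith("(") and token.endswith(")")):
        raise ValueError("literal string must be enclosed in ()")
    return token[1:-1].encode("latin-1")


print(decode_hex_string("<4142>"), decode_simple_literal("(AB)"))
```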

3.2. Compound Objects

  • Array object is implemented by the PdfArray class. A PDF array is a collection of objects enclosed within square brackets []. The objects of one array can be a mix of any type except stream. The PdfFileAnalyzer saves the objects in a C# array of PdfBase. Since all object classes derive from PdfBase, a mix of object types can be saved within this array. When an array object is converted to a string (ToString() method), the program adds leading and trailing square brackets. An array can be empty. Example of an array with six objects: [120 9.56 true null (string) <414243>].
  • Dictionary object is implemented by the PdfDict class. A PDF dictionary is a collection of key and value pairs enclosed within double angle brackets <<>>. The key is a name object and the value is any object except stream. The PdfFileAnalyzer saves one key value pair in a PdfPair class. The key is a C# string and the value is a PdfBase. The PdfDict class has an array of PdfPair classes. A dictionary is accessed by key, so pair ordering is not important. PdfFileAnalyzer sorts the pairs by key. Example of a dictionary with three pairs: <</CropBox [0 0 612 792] /Rotate 0 /Type /Page>>.
  • Stream object is implemented by PdfStream. Streams hold page description language, images and fonts. A PDF stream is made of two parts: a dictionary and a stream of bytes. The dictionary defines the stream parameters. One of the stream dictionary entries is /Filter. The PDF specification defines 10 types of filters. PdfFileAnalyzer supports four, the only ones I found to be in general use. FlateDecode, the compression filter most used by current PDF writers, is ZLib deflate compression. LZWDecode is a compression filter used by older PDF writers; the program supports it in order to read older PDF files. ASCII85Decode converts printable ASCII to binary. DCTDecode is JPEG image compression. The PdfFileAnalyzer implements decompression for the first three. A DCTDecode stream is saved as is with the file extension .jpg; it is an image file that can be viewed.
  • Object stream was introduced in PDF 1.5. It is a stream that contains multiple indirect objects (described below). Stream objects described above are compressed one stream at a time. An object stream compresses all the objects it contains in one compressed section.
  • Cross-reference stream was introduced in PDF 1.5. It is a stream that contains the cross-reference table described later in the article.
  • Inline image object is implemented by PdfInlineImage. It is a stream within a stream. An inline image is part of the page description language. It is made of three operators: BI (begin image), ID (image data) and EI (end image). The area between BI and ID is an image dictionary and the area between ID and EI is the image data.
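
Of the filters listed above, ASCII85Decode is easy to try out with Python's standard library. The sketch below is an illustration only and is unrelated to the C# implementation. The helper name decode_ascii85 is an assumption, and the code strips the trailing ~> end marker itself before handing the body to base64.a85decode.

```python
import base64


def decode_ascii85(raw: bytes) -> bytes:
    """Decode an /ASCII85Decode stream body (printable ASCII to binary)."""
    body = raw.strip()
    if body.endswith(b"~>"):        # end-of-data marker used in PDF streams
        body = body[:-2]
    return base64.a85decode(body)   # white space inside the body is ignored


# Build a test stream by encoding known bytes and appending the end marker.
encoded = base64.a85encode(b"Hello PDF") + b"~>"
print(encoded, decode_ascii85(encoded))
```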

3.3. Indirect Objects

  • Indirect object is implemented by PdfIndirectObject. It is the main building block of a PDF document. An indirect object is any object enclosed between “n 0 obj” and “endobj”. Other objects can refer to an indirect object by specifying “n 0 R”. The “n” is the object number and the “0” is the generation number. This program does not support generation numbers other than 0, although the PDF specification allows them. The idea behind multiple generations is to allow PDF modifications by keeping the original file and appending changes.
  • Object reference is a way of referring to indirect objects. For example, /Pages 2 0 R is a dictionary entry in the catalog object. It is a pointer to the /Pages object, which is indirect object number 2.
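
To make the “n 0 obj” and “n 0 R” notation concrete, here is a small Python sketch that picks object headers and references out of raw bytes with regular expressions. It is a simplification for illustration: a pattern scan like this can be fooled by text inside strings or streams, which is why a real reader tokenizes the file instead. All names are hypothetical.

```python
import re

# "n g R": object number, generation number, then the R keyword.
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")
# "n g obj": start of an indirect object.
OBJECT_START = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def find_references(data: bytes) -> list[tuple[int, int]]:
    """Return (object number, generation) for every reference found."""
    return [(int(n), int(g)) for n, g in REFERENCE.findall(data)]


def find_object_starts(data: bytes) -> list[tuple[int, int]]:
    """Return (object number, generation) for every 'n g obj' header."""
    return [(int(n), int(g)) for n, g in OBJECT_START.findall(data)]


catalog = (b"1 0 obj\n<</Lang(en-CA)/MarkInfo<</Marked true>>/Pages 2 0 R\n"
           b"/StructTreeRoot 10 0 R/Type/Catalog>>\nendobj\n")
print(find_object_starts(catalog), find_references(catalog))
```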

3.4. Operators and Keywords

  • Operators and keywords are not considered PDF objects. However, the PdfFileAnalyzer program has PdfOp and PdfKeyword classes that are derived from PdfBase. During the parsing process the parser creates a PdfOp or a PdfKeyword for each valid sequence of characters. Appendix A, Operator Summary, of Adobe's PDF file specification lists all the operators. The list is made of 73 operators. Here are some examples of operators: BT (begin text object), G (set gray level for stroking operations), m (move to), re (rectangle) and Tc (set character spacing). Examples of keywords: stream, obj, endobj, xref.

4. File Structure

A PDF file is made of four parts: header, body, cross-reference and trailer signature.

  • Header: The header is the file signature. It must be %PDF-1.x where x is 0 to 7.
  • Body: The body area contains all the indirect objects.
  • Cross-reference: The cross-reference is a table of file position pointers to all indirect objects. There are two types of cross-reference tables. The original style is made of ASCII characters. The new style is a stream within an indirect object, with the information encoded as binary numbers. At the end of the cross-reference table there is a trailer dictionary. A file can have more than one cross-reference area.
  • Trailer signature: The trailer signature is made of the keyword “startxref”, the byte offset of the last cross-reference table, and the end signature %%EOF. Please note: the trailer dictionary is part of the cross-reference area.
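
The header and trailer signatures described above can be located with a short Python sketch. The function name read_signatures and the 1024-byte search windows are assumptions made for this example, not values taken from PdfFileAnalyzer.

```python
import re


def read_signatures(data: bytes) -> tuple[str, int]:
    """Return (version, startxref offset) found at the two ends of the file."""
    header = re.search(rb"%PDF-(1\.[0-7])", data[:1024])
    if header is None:
        raise ValueError("PDF header signature not found")
    # The last startxref/%%EOF pair near the end of the file wins.
    trailers = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    if not trailers:
        raise ValueError("PDF trailer signature not found")
    return header.group(1).decode("ascii"), int(trailers[-1])


tiny = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\nstartxref\n116\n%%EOF\n"
print(read_signatures(tiny))
```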

5. File Parsing

The PDF file is a sequence of bytes. Some of the bytes have special meaning.

White space is defined as: null, tab, line feed, form feed, carriage return and space.

Delimiters are defined as: (, ), <, >, [, ], {, }, /, %, and white space characters.

File parsing is done with the PdfParser class. To start the parsing process the program sets the file position to the area to be parsed. ParseNextItem() is the method that extracts the next object.

The parser skips white space and comments. If the next byte is “(” the object is a string. If the next byte is “[” the object is an array. If the next two bytes are “<<” the object is a dictionary. If the next byte is “<” the object is a hex string. If the next byte is “/” the object is a name. If the next byte is none of the above, the parser accumulates the following bytes until a delimiter is found. The delimiter is not part of the current token. The token can be an integer, a real number, an operator or a keyword. In the case of an integer, the program searches further for an object reference “n 0 R” or an indirect object “n 0 obj”, where n is the integer. The value returned from ParseNextItem() is the appropriate object as per Section 3. Object Definitions. The object is returned as a PdfBase.

In the case of an array or a dictionary, the program calls ParseNextItem() recursively to parse the objects inside the array or dictionary.
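
The dispatch on the next byte and the recursive handling of arrays and dictionaries can be sketched in Python as follows. This is a simplified illustration of the approach, not a port of PdfParser: the class and method names are invented, values are returned as plain Python types rather than PdfBase subclasses, true, false, null, names and operators all come back as strings, and “n 0 obj” headers are not handled.

```python
import re

WHITE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%" + WHITE
# "n g R" followed by a delimiter or the end of the data.
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R(?=[\s()<>\[\]{}/%]|$)")


class Parser:
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def _skip(self) -> None:
        """Skip white space and % comments."""
        d = self.data
        while self.pos < len(d):
            if d[self.pos] == ord("%"):
                while self.pos < len(d) and d[self.pos] not in b"\r\n":
                    self.pos += 1
            elif d[self.pos] in WHITE:
                self.pos += 1
            else:
                return

    def _sequence(self, closer: bytes) -> list:
        """Recursively parse items until the closing delimiter."""
        items = []
        while True:
            self._skip()
            if self.pos >= len(self.data):
                raise ValueError("missing " + closer.decode())
            if self.data.startswith(closer, self.pos):
                self.pos += len(closer)
                return items
            items.append(self.parse_next_item())

    def _literal_string(self) -> str:
        """Read a (...) string, honoring nesting and backslash escapes."""
        d, start, depth = self.data, self.pos, 0
        while self.pos < len(d):
            c = d[self.pos]
            self.pos += 1
            if c == ord("\\"):
                self.pos += 1                 # skip the escaped byte
            elif c == ord("("):
                depth += 1
            elif c == ord(")"):
                depth -= 1
                if depth == 0:
                    return d[start:self.pos].decode("latin-1")
        raise ValueError("unterminated string")

    def parse_next_item(self):
        self._skip()
        d = self.data
        if self.pos >= len(d):
            raise EOFError("no more items")
        if d[self.pos] == ord("("):
            return self._literal_string()
        if d[self.pos] == ord("["):
            self.pos += 1
            return self._sequence(b"]")
        if d.startswith(b"<<", self.pos):
            self.pos += 2
            items = self._sequence(b">>")
            return dict(zip(items[0::2], items[1::2]))
        if d[self.pos] == ord("<"):
            end = d.index(b">", self.pos) + 1
            token, self.pos = d[self.pos:end], end
            return token.decode("ascii")
        ref = REFERENCE.match(d, self.pos)
        if ref:                               # integer followed by "g R"
            self.pos = ref.end()
            return ("R", int(ref.group(1)), int(ref.group(2)))
        start = self.pos
        self.pos += 1                         # consume "/" or first token byte
        while self.pos < len(d) and d[self.pos] not in DELIMITERS:
            self.pos += 1
        token = d[start:self.pos].decode("latin-1")
        for convert in (int, float):
            try:
                return convert(token)
            except ValueError:
                pass
        return token                          # name, operator or keyword


print(Parser(b"<</Count 1/Kids[4 0 R]/Type/Pages>>").parse_next_item())
```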

6. File Reading

PdfReader class is the main class of PDF file analysis. The entry method is OpenPdfFile(String FileName, string Password = null). The program opens the PDF file for binary reading (one byte at a time).

File analysis starts with checking the header signature %PDF-1.x where x is 0 to 7 and the trailer end signature %%EOF. One would think that all PDF writers would put the header at position zero of the file and the trailer at the very end of the file. Unfortunately it is not the case. The program has to search for these two signatures at the two ends of the file. If the header signature is not at position zero, all indirect objects file position pointers have to be adjusted.

Just before the trailer signature there is a pointer to the start of the last cross-reference table.

The parser sets the file position to the cross-reference table. If the next object is the “xref” keyword, we have the original style cross-reference table. Otherwise, it is the new stream-based cross-reference. A file can have more than one cross-reference table, and it can have both the new and the old style of tables. Each table is a list of object numbers and file position pointers to the starting point of the indirect objects. For each active object the program creates a PdfIndirectObject and saves it in ObjectArray. The object is empty except for its object number and position. For the original cross-reference table the position is relative to the file. For the stream type cross-reference, the position is either relative to the file or, for objects stored in an object stream, an index within the parent object stream.
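
For the original ASCII style, reading the table amounts to walking subsection headers and fixed-format entries. The Python sketch below illustrates this under simplifying assumptions: it stops at the “trailer” keyword, ignores free entries, and rejects non-zero generation numbers in the same spirit as the program described here. The function name is hypothetical.

```python
def parse_xref_table(data: bytes, xref_pos: int) -> dict[int, int]:
    """Return {object number: file position} for in-use entries."""
    lines = iter(data[xref_pos:].splitlines())
    if next(lines).strip() != b"xref":
        raise ValueError("not an original style cross-reference table")
    positions: dict[int, int] = {}
    number = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == b"trailer":
            break                       # the trailer dictionary follows
        if len(fields) == 2:            # subsection header: first number, count
            number = int(fields[0])
            continue
        offset, generation, kind = fields
        if kind == b"n":                # "n" = in use, "f" = free
            if int(generation) != 0:
                raise ValueError("multi-generation files are not supported")
            positions[number] = int(offset)
        number += 1
    return positions


table = (b"xref\n0 3\n0000000000 65535 f \n0000000017 00000 n \n"
         b"0000000081 00000 n \ntrailer\n<</Size 3/Root 1 0 R>>\n")
print(parse_xref_table(table, 0))
```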

During this process if indirect object has generation number other than zero, program execution will be aborted. PdfFileAnalyzer does not support multi-generation.

At the end of the cross-reference table we have a trailer dictionary. In order to include this dictionary in the analysis we create a dummy indirect object with negative object number and save the dictionary in it.

The program looks for four particular entries in the trailer dictionary. If /Encrypt is found, the file will be decrypted. Next the program looks for /Root, which gives the object number of the catalog object. If an /XRefStm entry exists, the file has both types of cross-reference. Finally, if /Prev exists, there is another cross-reference table to process.

After the cross-reference processing is done we have an array of all indirect objects. The available information at this stage of the process is object number and position. Next, the program loops through the array and reads and parses each indirect object. This process sets the object value. If the object is a stream, only the dictionary part is being parsed. The reason is that the stream length might not be known at this time. In addition to the object, the system sets object type and subtype members for dictionary and stream objects if these two values are available.

Next the program loops through all objects and processes object streams. Object streams have an object type equal to “/ObjStm”. The program reads the stream associated with each of these objects and breaks it into multiple indirect objects.
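
Breaking an object stream apart can be sketched as follows. A decoded object stream starts with /N pairs of integers (object number and byte offset), and the objects themselves begin at the /First offset. The Python below is an illustration with a hypothetical function name. It assumes the stream has already been decompressed and that the offsets are listed in increasing order.

```python
def split_object_stream(decoded: bytes, n: int, first: int) -> dict[int, bytes]:
    """Split a decoded /ObjStm stream into {object number: object bytes}."""
    header = decoded[:first].split()
    numbers = [int(x) for x in header[0:2 * n:2]]
    offsets = [int(x) for x in header[1:2 * n:2]]     # relative to /First
    ends = offsets[1:] + [len(decoded) - first]
    return {
        number: decoded[first + start:first + end].strip()
        for number, start, end in zip(numbers, offsets, ends)
    }


# Two objects: number 10 at offset 0 and number 11 at offset 8.
sample_stream = b"10 0 11 8 <</A 1>>[1 2]"
print(split_object_stream(sample_stream, n=2, first=10))
```

Each extracted byte string can then be handed to the same parser used for ordinary indirect objects.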

Next the program searches all dictionary objects and stream dictionary objects for object reference objects. The program is looking for key value pairs such as: “/name n 0 R”. If a pair like that is found, the program checks the object type. If the object type was not set during object parsing phase, the type is set to the /name value.

The next step is to read all streams that were not read before. The system reads the stream from the file. Each stream is decoded and saved to an appropriate file. The PdfFileAnalyzer supports the following filters: /FlateDecode, /LZWDecode, /ASCII85Decode and /DCTDecode. Text file will have extension .txt, binary files .bin, image files .jpg or .bmp, font files .ttf and cross-reference file .xref. The /FlateDecode is ZLib Deflate compression method.

The next step is to build page contents. The program follows the page tree starting from the root. Page objects are not stream objects; in other words, page description commands are not available directly within the page object. Page object dictionaries have a /Contents key value pair. If this pair is missing, the page is blank. The value of the contents entry can be a single reference or an array of references. The program creates a dummy contents stream for the page from the one or more contents streams. The page contents dummy streams are saved in PageObj_xx.txt and in PageSource_xx.txt. The former file is the actual page description contents for the page. The latter file is the same information converted to pseudo C# source code. Section 2. Overview shows an example of a page description.
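
The walk over the page tree and the merging of contents streams can be illustrated with plain Python dictionaries standing in for parsed objects. Everything here is a made-up model for the example: references are reduced to bare object numbers, and the streams mapping is assumed to hold already decoded stream bytes.

```python
def collect_pages(objects: dict, node_number: int) -> list[int]:
    """Return page object numbers in document order, starting at a tree node."""
    node = objects[node_number]
    if node["/Type"] == "/Page":
        return [node_number]
    pages = []
    for kid in node["/Kids"]:             # /Pages node: recurse into children
        pages.extend(collect_pages(objects, kid))
    return pages


def page_contents(objects: dict, streams: dict, page_number: int) -> bytes:
    """Concatenate the contents streams of one page; blank page gives b''."""
    contents = objects[page_number].get("/Contents")
    if contents is None:
        return b""
    refs = contents if isinstance(contents, list) else [contents]
    return b"\n".join(streams[ref] for ref in refs)


objects = {
    1: {"/Type": "/Catalog", "/Pages": 2},
    2: {"/Type": "/Pages", "/Kids": [4, 7], "/Count": 2},
    4: {"/Type": "/Page", "/Contents": 5},
    7: {"/Type": "/Page", "/Contents": [8, 9]},
}
streams = {5: b"q Q", 8: b"BT", 9: b"ET"}

for page in collect_pages(objects, objects[1]["/Pages"]):
    print(page, page_contents(objects, streams, page))
```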

The page contents stream is made of arguments and operators. For example rectangle will be four real numbers followed by re. Inline image is the exception to this rule. It is described above in Section 3. Object Definitions.

Finally, the program produces the object summary file ObjectSummary.txt. The file shows all indirect objects information without the streams.

7. TestPdfFileAnalyzer Program

The PdfFileAnalyzer application was developed to test the PDF file parsing classes. If you want to test the executable program outside the development environment, create a PdfFileAnalyzer directory and copy the TestPdfFileAnalyzer.exe program and the PdfFileAnalyser.dll class library into this directory and run the program. If you run the project from the Visual C# development environment, make sure you define a working directory in the Debug tab of the project properties. This program was developed using Microsoft Visual C# 2019.

Start the program. The available options are: Open PDF File, and Recent Files.

On first program execution you must run Setup and define project directory. This directory will hold all sub-directories that will be created for each PDF file being analyzed.

Open button will display a standard file selection dialog. Navigate to the PDF file you want to analyze.

The PdfFileAnalyzer screen will change to object summary screen:

Each row represents an indirect PDF object. The columns are:

  • Object No. The indirect object number. In the case of a trailer dictionary, the object number is a dummy number; it is negative, but on the screen it shows as TRn.
  • Object. The type of object as per Section 3. Object Definitions.
  • Type. If the object is a dictionary or a stream, the type is the value of the /Type dictionary pair. If the object is not a dictionary or the dictionary does not contain /Type, the displayed value comes from an indirect reference to this object.
  • Subtype. If the object is a dictionary or a stream and the dictionary contains a /Subtype entry, it is displayed in this column.
  • Parent Object No. If the indirect object is part of an object stream (see Section 3.2. Compound Objects), this column is the object number of the object stream.
  • Parent Index. If the indirect object is part of an object stream, this number is the index number within the parent object stream.
  • Object Position. For indirect objects that are not part of an object stream, this is the object position within the PDF file. For indirect objects that are part of an object stream, this is the position within the parent. The position is given in decimal and hexadecimal for programmers who would like to view the PDF file in a binary editor.
  • Stream Position and Stream Length. The position and length of the stream. The position is relative to the file or the parent in the same way as the object position above.

To view the ObjectSummary.txt file, press the Summary button. Below is an example of the start of this file.

PDF file name: interactiveform_DATA.pdf

Trailer Dictionary
------------------
<</DecodeParms<</Columns 5/Predictor 12>>/Filter/FlateDecode/ID[<f681c578264452c4ab65398fdc7c0daa><b4
25aedbd5c8c544a84d960c3f738458>]/Index[3 1 7 1 18 1 100 5 108 2 116 1 123 1 126 1 128 1 134 1 136 1 173
11]/Info 18 0 R/Length 71/Prev 116/Root 20 0 R/Size 184/Type/XRef/W[1 3 1]>>

Indirect Objects
----------------
Object number: 1
Object Value Type: Stream
File Position: 67126 Hex: 10636
Stream Position: 67201 Hex: 10681
Stream Length: 695 Hex: 2B7
Object Type: /ObjStm
<</Filter/FlateDecode/First 22/Length 695/N 4/Type/ObjStm>>

Object number: 2
Object Value Type: Stream
File Position: 67915 Hex: 1094B
Stream Position: 67990 Hex: 10996
Stream Length: 354 Hex: 162
Object Type: /ObjStm
<</Filter/FlateDecode/First 33/Length 354/N 5/Type/ObjStm>>

Object number: 3
Object Value Type: Stream
File Position: 91134 Hex: 163FE
Stream Position: 91193 Hex: 16439
Stream Length: 21616 Hex: 5470
Object Type: /Metadata
Object Subtype: /XML
<</Length 21616/Subtype/XML/Type/Metadata>>

To view the details of an indirect object either select a row and press the View button or double click on a row. The object analysis screen will be displayed.

For all non stream objects, the first three buttons are disabled. The only information available is the object itself. You can view it in text or hexadecimal formats.

For stream objects the first button name is the object type. The first two buttons, object type and Stream, toggle between viewing the object and the stream. The Hex and Text buttons switch between binary and text format. If the stream is an image, the image is displayed rather than text. If the stream is a cross-reference stream, the text format shows four columns: (1) object number, (2) type (0-unused, 1-normal object, 2-object stored in an object stream), (3) position for type 1 and parent for type 2 and (4) parent index number. If the stream is binary (e.g. a font), it can be viewed in hexadecimal only.
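
The type, position and index columns come from fixed-width binary fields whose sizes are given by the /W array of the cross-reference stream (W[1 3 1] in the trailer dictionary shown earlier). The Python sketch below decodes such rows. It is an illustration with a hypothetical function name, and it assumes the stream has already been decompressed and any /DecodeParms predictor has been undone.

```python
def decode_xref_rows(data: bytes, widths: list[int]) -> list[tuple[int, ...]]:
    """Split decoded cross-reference stream bytes into big-endian integer fields."""
    row_size = sum(widths)
    rows = []
    for start in range(0, len(data) - row_size + 1, row_size):
        fields, pos = [], start
        for width in widths:
            fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width
        rows.append(tuple(fields))
    return rows


# Row 1: type 1 (normal object) at file position 0x010636 = 67126.
# Row 2: type 2 (inside object stream 1), index 3 within that stream.
raw_rows = bytes([1, 0x01, 0x06, 0x36, 0,
                  2, 0x00, 0x00, 0x01, 3])
print(decode_xref_rows(raw_rows, [1, 3, 1]))
```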

Page object is treated as a stream object. The text displayed is the concatenation of all contents objects. In addition, the Source button allows you to view the page description language in what appears as C# code.

Images (.jpg and .bmp) can be rotated and scaled.

Page indirect object example.

Object number: 22
Object Value Type: Dictionary
File Position: 13810 Hex: 35F2
Object Type: /Page
<</Annots 97 0 R/ArtBox[0 0 612 792]/BleedBox[0 0 612 792]/Contents 81 0 R/CropBox[0 0 612 792]/MediaBox
[0 0 612 792]/Parent 16 0 R/Resources<</ColorSpace<</CS0 137 0 R>>/ExtGState<</GS0 138 0 R>>/Font<</C0_0
143 0 R/T1_0 146 0 R/T1_1 149 0 R/T1_2 151 0 R>>/ProcSet[/PDF/Text]/Properties<</MC0<</Metadata 91 0 R>>>>/Shading
<</Sh0 153 0 R>>>>/Rotate 0/TrimBox[0 0 612 792]/Type/Page>>

8. History

  • 2012/08/25: Version 1.0. Original revision.
  • 2013/04/10: Version 1.1. Support for world regions that define comma as decimal separator.
  • 2014/03/10: Version 1.2. Fix problem related to PDF files with cross-reference streams.
  • 2015/04/02: Version 1.3. Remove error messages related to unimplemented stream compression filters.
  • 2019/06/14: Version 2.0. The software is divided into two projects, a library and a test program. Encrypted files are supported.
  • 2019/06/19: Version 2.1. Minor changes to software.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
