Home
This wiki is aimed at PDF developers and people who need to create PDF documents. Maybe you need to port this library to another language, or you are simply interested in how PDF works. I also want to document what I learned about PDF during the development of this library.
A PDF document is made up of 10 different data types:
- Null (not actually written into the PDF)
- Boolean (true | false)
- Integer (ex. 5)
- Real (ex. 5.6)
- Name (ex. /BlendMode)
- String (regular or in hexadecimal; regular strings are written in parentheses: (Hello), hexadecimal ones in angle brackets: <48656C6C6F>)
- Array (ex. [3 4 5])
- Dictionary (key => value storage - ex. << /BlendMode /Multiply >>)
- Stream (arbitrary data, images, graphics, etc.)
- Reference (to another object, using an index and a generation number)
These data types encode everything that is contained in a PDF file. While they can technically be combined in any way, the PDF spec sometimes has very peculiar rules about how the data must be structured.
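To make the list above more concrete, here is a minimal sketch of how these data types could be written out in PDF syntax. The `Name` and `Ref` classes and the `serialize` function are hypothetical helpers made up for this illustration, not part of printpdf. Streams and null are left out, and names are not escaped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """Hypothetical helper: a PDF name, stored without the leading slash."""
    value: str


@dataclass(frozen=True)
class Ref:
    """Hypothetical helper: a reference to another object."""
    obj_id: int
    generation: int = 0


def serialize(value) -> str:
    """Render a Python value in PDF syntax (simplified sketch)."""
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        # Fixed-point notation, trailing zeros removed (PDF has no exponents).
        return f"{value:.4f}".rstrip("0").rstrip(".")
    if isinstance(value, Name):
        return "/" + value.value
    if isinstance(value, Ref):
        return f"{value.obj_id} {value.generation} R"
    if isinstance(value, str):
        # Regular string in parentheses; backslashes and parentheses are escaped.
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(serialize(item) for item in value) + "]"
    if isinstance(value, dict):
        items = " ".join(f"{serialize(k)} {serialize(v)}" for k, v in value.items())
        return f"<< {items} >>"
    raise TypeError(f"cannot serialize {type(value).__name__}")


print(serialize({Name("BlendMode"): Name("Multiply")}))
print(serialize([3, 4, 5]))
```

The boolean check comes before the integer check on purpose, because Python treats `True` and `False` as integers.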
The file ends with a cross-reference table, a trailer and an %%EOF marker. Each object in the PDF is encoded separately like this:
15 0 obj
<<
/Type /ExtGState
>>
endobj
This means: this is the object with the ID 15, generation 0. It contains a dictionary (marked by the << and >> pair). Objects can be referenced using the R keyword like this:
16 0 obj
15 0 R
endobj
In this case, object 16 contains a reference to object 15. (Generations are used for versioning objects when saving.)
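The two snippets above can be reproduced with a few lines of code. This is only a sketch: `indirect_object`, `reference` and `find_references` are hypothetical names chosen for this example, and the regular expression is deliberately naive (it would also match text inside strings or streams).

```python
import re


def indirect_object(obj_id: int, body: str, generation: int = 0) -> str:
    """Wrap an already serialized body in 'N G obj ... endobj'."""
    return f"{obj_id} {generation} obj\n{body}\nendobj\n"


def reference(obj_id: int, generation: int = 0) -> str:
    """Build a reference such as '15 0 R'."""
    return f"{obj_id} {generation} R"


# Naive pattern: two integers followed by the keyword R.
REFERENCE_PATTERN = re.compile(r"(\d+) (\d+) R\b")


def find_references(body: str) -> list:
    """Return (object ID, generation) pairs for every reference in the body."""
    return [(int(obj), int(gen)) for obj, gen in REFERENCE_PATTERN.findall(body)]


print(indirect_object(15, "<<\n/Type /ExtGState\n>>"))
print(indirect_object(16, reference(15)))
```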
The end of the file contains a cross-reference table in which each entry is exactly 20 bytes long. These 20 bytes contain the byte offset of the object from the start of the file, its generation number and an in-use flag; the object ID is implied by the position of the entry in the table. The trailer that follows the table contains a reference to the "root" element. A PDF is built as a tree-like data structure, where one element can have one parent, but multiple child objects.
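The fixed 20-byte layout is easiest to see in code. The following sketch formats the entries for a table that starts at object 0; `xref_entry` and `build_xref` are hypothetical helper names, and the offsets in the usage line are made-up numbers.

```python
def xref_entry(offset: int, generation: int = 0, in_use: bool = True) -> bytes:
    """One table entry: 10-digit offset, 5-digit generation, flag, 2-byte line end."""
    flag = b"n" if in_use else b"f"
    return b"%010d %05d %s \n" % (offset, generation, flag)


def build_xref(offsets: list) -> bytes:
    """offsets[i] is the byte offset of object i + 1 from the start of the file."""
    parts = [b"xref\n", b"0 %d\n" % (len(offsets) + 1)]
    # Object 0 is never used; by convention it is a free entry with generation 65535.
    parts.append(xref_entry(0, 65535, in_use=False))
    parts.extend(xref_entry(offset) for offset in offsets)
    return b"".join(parts)


print(build_xref([9, 58]).decode("ascii"))
```

Because every entry has the same length, a reader can jump straight to the entry for object N without parsing the entries before it.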
The root element is a dictionary. It contains a reference (!) to the /Catalog dictionary. The catalog contains a /Pages dictionary, and the /Pages dictionary itself has /Kids, which is an array of PDF page objects or references to them. In a so-called linearized PDF these objects are in order in the PDF (useful for streaming a PDF over a network).
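Each PDF page must have a reference to its parent (the /Pages dictionary) as well as a /Type /Page and a /MediaBox entry, which is an array of the form [lower_left_x lower_left_y upper_right_x upper_right_y] that tells the viewer how big the page actually is. If the lower-left corner is 0 0, the last two values are simply the width and height.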
So far we have an empty page. To display anything, a /Type /Page has exactly one /Contents entry and one /Resources dictionary. The /Contents entry must be an indirect reference to a PDF stream. It holds the graphical operations that make up the page, while the /Resources dictionary holds the resources that are needed in order to display the page (fonts, images, etc.).
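Putting the pieces described so far together, the following sketch assembles a tiny one-page document: catalog, page tree, page, content stream, cross-reference table and trailer. It is a simplified illustration and not how printpdf does it. The function name `build_minimal_pdf`, the A4-sized /MediaBox and the single diagonal line in the content stream are assumptions made for this example. The /Count entry in /Pages is not mentioned above, but the PDF spec requires it.

```python
def build_minimal_pdf() -> bytes:
    # Content stream: move to (0, 0), line to (100, 100), stroke the path.
    content = b"0 0 m 100 100 l S"

    # Object bodies; the list position determines the object ID (1, 2, 3, 4).
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] "
        b"/Contents 4 0 R /Resources << >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.3\n")
    offsets = []
    for obj_id, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % obj_id + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus the free entry 0.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    # Trailer: points to the catalog and to the start of the xref table.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


if __name__ == "__main__":
    with open("minimal.pdf", "wb") as handle:
        handle.write(build_minimal_pdf())
```

The offsets have to be recorded while the file is being written, which is why the cross-reference table can only be produced at the very end.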
In order to be a valid (PDF/X-conformant) document, you'll need a few other things, too. In addition to the /MediaBox entry, you will also need a /CropBox, an /ArtBox and a /Rotate (0 | 90 | 180 | 270) entry.
If you add an empty /Contents dictionary, this is regarded as invalid syntax.
Metadata is a bit of a wild story in PDF. Back in the old days, metadata was held in the /Info dictionary (which is a direct child of the root object and must be a referenced object). The Info dictionary must contain:
Entry | Description | Example values |
---|---|---|
/GTS_PDFXVersion | PDF version identifier string | PDF/X-3:2002 |
/CreationDate | Creation date of the PDF | D:20170701120145+02'00' |
/ModDate | Last modification | D:20170701120145+02'00' |
/Trapped | If the PDF is trapped (bool) | false |
/Title | Title of the PDF document | Hello World PDF |
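The date values in the table follow a fixed pattern: D:, then year, month, day, hour, minute and second, then the UTC offset with apostrophes after the hours and minutes. A small sketch of how such a string could be produced (the function name `pdf_date` is made up for this example):

```python
from datetime import datetime, timedelta, timezone


def pdf_date(moment: datetime) -> str:
    """Format a timezone-aware datetime like D:20170701120145+02'00'."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("datetime must carry a UTC offset")
    total_minutes = int(offset.total_seconds() // 60)
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"D:{moment:%Y%m%d%H%M%S}{sign}{hours:02d}'{minutes:02d}'"


created = datetime(2017, 7, 1, 12, 1, 45, tzinfo=timezone(timedelta(hours=2)))
print(pdf_date(created))
```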
Then Adobe decided that XML was cool, and now you have to duplicate the same metadata in the /Metadata stream, which is a child element of the /Catalog dictionary. It contains a pre-defined template, filled with certain values:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c015 84.159810, 2016/09/10-02:41:30 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:pdfxid="http://www.npes.org/pdfx/ns/id/"
xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<xmp:CreateDate>D:2017-07-01T12:01:45+02'00'</xmp:CreateDate>
<xmp:ModifyDate>D:2017-07-01T12:01:45+02'00'</xmp:ModifyDate>
<xmp:MetadataDate>D:2017-07-01T12:01:45+02'00'</xmp:MetadataDate>
<dc:format>application/pdf</dc:format>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">PDF_Document_title</rdf:li>
</rdf:Alt>
</dc:title>
<xmpMM:DocumentID>uuid:BQzzHr2PL15f7JooxhxSuMdBw88poDQK</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:ps6B6Pm8c70iFjxBpP5gxJKkf5l9SQbZ</xmpMM:InstanceID>
<xmpMM:RenditionClass>default</xmpMM:RenditionClass>
<xmpMM:VersionID>1</xmpMM:VersionID>
<pdfxid:GTS_PDFXVersion>PDF/X-3:2002</pdfxid:GTS_PDFXVersion>
<pdfx:GTS_PDFXVersion>PDF/X-3:2002</pdfx:GTS_PDFXVersion>
<pdf:Trapped>False</pdf:Trapped>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
And yes, these spaces are necessary. Don't ask me why. The ID is a magic string, it's always the same.
It is important that the creation and modification dates are the same as in the /Info dictionary. You can add extra information after the end of this stream (Illustrator does this for color swatches). This stream is contained as an object like this:
<<
/Type /Metadata
/Subtype /XML
/Length (length of the XML text in bytes)
>>
stream
(XML text goes here)
endstream
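As a rough sketch, wrapping the XML text into such a stream object could look like this. The function name `metadata_stream` and the object ID in the usage line are assumptions for illustration. The stream dictionary carries a /Length entry, which counts bytes rather than characters, so the text is encoded first.

```python
def metadata_stream(obj_id: int, xmp_xml: str) -> bytes:
    """Wrap XMP text in an indirect /Metadata stream object."""
    payload = xmp_xml.encode("utf-8")  # /Length is the size of the encoded bytes
    header = (
        b"%d 0 obj\n<< /Type /Metadata /Subtype /XML /Length %d >>\nstream\n"
        % (obj_id, len(payload))
    )
    return header + payload + b"\nendstream\nendobj\n"


print(metadata_stream(7, "<x:xmpmeta/>").decode("utf-8"))
```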
The last thing you have to make sure of is that your document is color-calibrated. Embedding an ICC profile is done by simply copying the contents of the .icc file as-is into the PDF document as a stream. On the /Catalog dictionary, you can set an /OutputIntents entry, which, surprisingly, has to be an array with exactly one object in it (why? I don't know). The first element of this array has to be an object like this:
/OutputIntents[0]
<<
/S /GTS_PDFX
/OutputCondition (Commercial and special offset print according to ISO 12647-2:2004 / Amd 1, paper type 1 or 2 (matte or gloss-coated offset paper, 115 g/m2))
/DestinationOutputProfile 4 0 R
/Type /OutputIntent
/RegistryName (http://www.color.org)
/OutputConditionIdentifier (FOGRA39)
/Info (Coated FOGRA39 (ISO 12647-2:2004))
>>
The 4 0 R in this case points to an ICC profile stream. printpdf always embeds a color profile. You can get the color profiles from Adobe's website; don't get them from color.org, the latter are somehow corrupt. The /OutputCondition is not strictly required, but it helps the printing house determine what paper, etc. to use. Currently, this cannot be configured in printpdf.
The /Catalog element is also the parent of bookmarks, optional content groups and a lot of other meta-information. You can make your own entry as a child of the /Catalog and save data specific to your application; any PDF viewer will ignore it. This is how Adobe Illustrator saves information.
Translated from http://www.p2501.ch/pdf-howto/start | Translation DE-EN: Felix Schütt