Felix Schütt edited this page Jul 1, 2017 · 6 revisions

This wiki is aimed at PDF developers and people who need to create PDF documents. Maybe you need to port this library to another language, or you are interested in how PDF works. I also want to document what I learned about PDF while developing this library.

How a PDF file is built

Objects

A PDF document is made up of 10 different data types:

  • Null (the keyword null; rarely written out explicitly)
  • Boolean (true | false)
  • Integer (ex. 5)
  • Real (ex. 5.6)
  • Name (ex. /BlendMode)
  • String (literal, written in parentheses: (Hello), or hexadecimal, written in angle brackets: <48656C6C6F>)
  • Array (ex. [3 4 5])
  • Dictionary (key => value storage - ex. << /BlendMode /Multiply >>)
  • Stream (arbitrary data, images, graphics, etc.)
  • Reference (to another object, using an index and a generation number)

These data types encode everything that is contained in a PDF file. While they can technically be combined in any way, the PDF spec sometimes has very weird rules about how the data must be laid out.
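
As a rough illustration of the data types listed above, here is a minimal Python sketch that turns Python values into PDF syntax. The names Name, Ref and to_pdf are made up for this example and are not part of any library. Streams are left out because they need a length and raw bytes.

```python
# Hypothetical helpers for illustration only (not a real library API).
class Name(str):
    """A PDF name such as /BlendMode, stored without the leading slash."""

class Ref:
    """An indirect reference: object number plus generation number."""
    def __init__(self, num, gen=0):
        self.num, self.gen = num, gen

def to_pdf(value):
    """Serialize a Python value into PDF object syntax."""
    if value is None:
        return "null"
    if isinstance(value, bool):      # bool must be tested before int
        return "true" if value else "false"
    if isinstance(value, Name):      # Name must be tested before str
        return "/" + value
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        # fixed notation, because PDF has no exponent syntax for reals
        return f"{value:.4f}".rstrip("0").rstrip(".")
    if isinstance(value, str):
        # escape backslashes and parentheses inside literal strings
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, Ref):
        return f"{value.num} {value.gen} R"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(to_pdf(v) for v in value) + "]"
    if isinstance(value, dict):
        inner = " ".join(f"{to_pdf(Name(k))} {to_pdf(v)}" for k, v in value.items())
        return f"<< {inner} >>"
    raise TypeError(f"unsupported type: {type(value).__name__}")

print(to_pdf({"BlendMode": Name("Multiply"), "Kids": [Ref(15), 5, 5.6, True, "Hello"]}))
# prints: << /BlendMode /Multiply /Kids [15 0 R 5 5.6 true (Hello)] >>
```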

The file ends with a cross-reference table, a trailer and an %%EOF marker. Each object in the PDF is encoded separately like this:

15 0 obj
    <<
        /Type /ExtGState
    >>
endobj

This is the object with the ID 15, generation 0. It contains a dictionary (marked by the << and >> pair). Objects can be referenced using the R keyword like this:

16 0 obj
   15 0 R
endobj

In this case, object 16 contains a reference to object 15. (Generations are used for versioning objects when saving.)

The document structure

The end of the file contains a cross-reference table (xref) where each entry is exactly 20 bytes long. Each entry holds the byte offset of one object from the start of the file as well as its generation number; the object ID follows from the entry's position in the table. After the table comes the trailer, which contains a reference to the "root" element. A PDF is built as a tree-like data structure, where one element can have one parent, but multiple child objects.
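
To make the 20-byte entries concrete, here is a small Python sketch of how one entry of this table (the PDF specification calls it the cross-reference table) can be formatted. The function name xref_entry is made up for this example. The layout is a 10-digit offset, a space, a 5-digit generation number, a space, the flag n (in use) or f (free), and a two-byte line ending.

```python
def xref_entry(offset, generation=0, in_use=True):
    """Format one cross-reference entry; the result is always 20 bytes."""
    flag = "n" if in_use else "f"
    # 10 digits + space + 5 digits + space + flag + space + newline = 20 bytes
    return f"{offset:010d} {generation:05d} {flag} \n".encode("ascii")

entry = xref_entry(17)
print(entry)       # b'0000000017 00000 n \n'
print(len(entry))  # 20
```

The fixed width is what lets a viewer jump straight to the entry for a given object without parsing the whole table.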

The trailer is a dictionary. Its /Root entry is a reference (!) to the /Catalog dictionary. The catalog refers to a /Pages dictionary, and the /Pages dictionary itself has /Kids, which is an array of references to PDF page objects. In a so-called linearized PDF these objects appear in order in the file (useful for streaming a PDF over a network).

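Each PDF page must have a reference to its parent (the /Pages dictionary) as well as a /Type /Page and a /MediaBox entry, which is an array of the form [lower_left_x lower_left_y upper_right_x upper_right_y] that tells the viewer how big the page actually is. With the lower-left corner at 0 0, the last two numbers are simply the width and height.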
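
The structure described so far can be put together in a few lines. The following Python sketch writes a catalog, a /Pages node and one empty page, then appends the offset table, the trailer and the %%EOF marker. The function name build_pdf, the %PDF-1.3 header and the page size (roughly A4 in points) are assumptions for this example, and the result is a bare-bones file, not a PDF/X-conformant one.

```python
def build_pdf(objects):
    """Assemble a PDF from object bodies; object numbers start at 1."""
    out = bytearray(b"%PDF-1.3\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))                 # byte offset of "N 0 obj"
        out += f"{num} 0 obj\n{body}\nendobj\n".encode("ascii")
    xref_start = len(out)                        # where the offset table begins
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"              # object 0 is always a free entry
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode("ascii")
    out += f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n".encode("ascii")
    out += f"startxref\n{xref_start}\n%%EOF\n".encode("ascii")
    return bytes(out)

pdf = build_pdf([
    "<< /Type /Catalog /Pages 2 0 R >>",
    "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] >>",
])
print(pdf.decode("ascii"))
```

Note how the page points back to its parent with /Parent 2 0 R while the parent lists the page in /Kids, which is the tree structure in both directions.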

The PDF page and its dependencies

So far we have an empty page. To display anything, a /Type /Page needs a /Contents entry and a /Resources dictionary. /Contents must be an indirect reference to a PDF stream (or an array of such references). The content stream holds the graphical operations that make up the page, while /Resources holds the resources needed to display the page (fonts, images, etc.).
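
As a sketch of what such a content stream might look like, the Python snippet below wraps a few graphics operators in a stream object and shows a page dictionary that points to it. The function name stream_object and the object numbers are made up for this example. The operators used are g (set the fill gray level), re (add a rectangle: x y width height) and f (fill the path).

```python
def stream_object(num, data, gen=0):
    """Wrap raw bytes in an indirect stream object with a /Length entry."""
    header = f"{num} {gen} obj\n<< /Length {len(data)} >>\nstream\n".encode("ascii")
    return header + data + b"\nendstream\nendobj\n"

# A 50% gray rectangle, 100 x 50 pt, one inch from the lower-left corner
content = b"0.5 g\n72 72 100 50 re\nf\n"
contents_obj = stream_object(4, content)

# The page refers to the stream indirectly; /Resources may be empty here
# because this content uses no fonts or images.
page = "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] /Contents 4 0 R /Resources << >> >>"

print(contents_obj.decode("ascii"))
print(page)
```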

To be a valid (PDF/X-conformant) document, you'll need a few other things, too. In addition to the /MediaBox entry, you will also need a /CropBox, an /ArtBox and a /Rotate (0 | 90 | 180 | 270) entry. An empty /Contents dictionary is regarded as invalid syntax.

Metadata

Metadata is a bit of a wild story in PDF. Back in the old days, metadata was held in the /Info dictionary (which is a direct child of the root object and must be a referenced object). The Info dictionary must contain:

Entry             Description                    Example values
/GTS_PDFXVersion  PDF version identifier string  PDF/X-3:2002
/CreationDate     Creation date of the PDF       D:20170701120145+02'00'
/ModDate          Last modification              D:20170701120145+02'00'
/Trapped          If the PDF is trapped (bool)   false
/Title            Title of the PDF document      Hello World PDF
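
The date values in this table follow the PDF date format D:YYYYMMDDHHmmSS followed by the UTC offset as +HH'mm'. Here is a minimal Python sketch for producing such a string from a timezone-aware datetime. The function name pdf_date is made up for this example.

```python
from datetime import datetime, timedelta, timezone

def pdf_date(dt):
    """Format a timezone-aware datetime as a PDF date string."""
    offset = dt.utcoffset()
    if offset is None:
        raise ValueError("datetime must carry a timezone")
    total_minutes = int(offset.total_seconds()) // 60
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"D:{dt:%Y%m%d%H%M%S}{sign}{hours:02d}'{minutes:02d}'"

created = datetime(2017, 7, 1, 12, 1, 45, tzinfo=timezone(timedelta(hours=2)))
print(pdf_date(created))  # D:20170701120145+02'00'
```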

Then Adobe decided that XML was cool, and now you have to duplicate the same metadata in the /Metadata stream, which is a child element of the /Catalog dictionary. It contains a pre-defined text, filled with certain values:

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c015 84.159810, 2016/09/10-02:41:30        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:pdfxid="http://www.npes.org/pdfx/ns/id/"
            xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <xmp:CreateDate>2017-07-01T12:01:45+02:00</xmp:CreateDate>
         <xmp:ModifyDate>2017-07-01T12:01:45+02:00</xmp:ModifyDate>
         <xmp:MetadataDate>2017-07-01T12:01:45+02:00</xmp:MetadataDate>
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">PDF_Document_title</rdf:li>
            </rdf:Alt>
         </dc:title>
         <xmpMM:DocumentID>uuid:BQzzHr2PL15f7JooxhxSuMdBw88poDQK</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:ps6B6Pm8c70iFjxBpP5gxJKkf5l9SQbZ</xmpMM:InstanceID>
         <xmpMM:RenditionClass>default</xmpMM:RenditionClass>
         <xmpMM:VersionID>1</xmpMM:VersionID>
         <pdfxid:GTS_PDFXVersion>PDF/X-3:2002</pdfxid:GTS_PDFXVersion>
         <pdfx:GTS_PDFXVersion>PDF/X-3:2002</pdfx:GTS_PDFXVersion>
         <pdf:Trapped>False</pdf:Trapped>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                                                                                                    
                           
<?xpacket end="w"?>

And yes, these spaces are necessary. Don't ask me why. The ID is a magic string; it's always the same. It is important that the creation and modification dates are the same as in the /Info dictionary. You can add extra information after the end of this stream (Illustrator does this for color swatches). This stream is contained in an object like this:

<<
    /Type /Metadata
    /Subtype /XML
    /Length (length of the XML text in bytes)
>>
stream
(XML text goes here)
endstream
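
Putting the packet and the stream object together could look roughly like the Python sketch below. The function names xmp_packet and metadata_object are made up, and the padding of 20 lines of 100 spaces is only an approximation of the sample above, not a value taken from any specification. The /Length entry counts the bytes of the encoded packet.

```python
def xmp_packet(xmp_body, padding_lines=20, line_width=100):
    """Wrap XMP XML in the xpacket header, whitespace padding and trailer."""
    header = '<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    padding = (" " * line_width + "\n") * padding_lines
    trailer = '<?xpacket end="w"?>'
    return (header + xmp_body + "\n" + padding + trailer).encode("utf-8")

def metadata_object(num, packet):
    """Build the indirect /Metadata stream object around an encoded packet."""
    head = f"{num} 0 obj\n<< /Type /Metadata /Subtype /XML /Length {len(packet)} >>\nstream\n"
    return head.encode("ascii") + packet + b"\nendstream\nendobj\n"

# A stand-in body; a real one would carry the full rdf:RDF block
packet = xmp_packet('<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>')
obj = metadata_object(5, packet)
print(len(packet), "bytes in the packet")
```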

The last thing you have to make sure of is that your document is color-calibrated. Embedding an ICC profile is done by simply copying the contents of the .icc file as-is into the PDF document as a stream. On the /Catalog dictionary, you can set an /OutputIntents entry which, surprisingly, has to be an array with exactly one object in it (why? I don't know). The first element of this array has to be an object like this:

/OutputIntents[0]
<<
    /S /GTS_PDFX
    /OutputCondition (Commercial and special offset print according to ISO 12647-2:2004 / Amd 1, paper type 1 or 2 (matte or gloss-coated offset paper, 115 g/m2))
    /DestOutputProfile 4 0 R
    /Type /OutputIntent
    /RegistryName (http://www.color.org)
    /OutputConditionIdentifier (FOGRA39)
    /Info (Coated FOGRA39 (ISO 12647-2:2004))
>>

The 4 0 R in this case points to an ICC profile stream. printpdf always embeds a color profile. You can get the color profiles from Adobe's website; don't get them from color.org, as those are somehow corrupt. The /OutputCondition is not strictly required, but it helps the print shop determine what paper, etc. to use. Currently, this cannot be configured in printpdf.
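
Here is a hedged Python sketch of embedding the profile bytes as a stream. As far as I know, the stream dictionary of an ICC profile also wants an /N entry giving the number of color components. This example reads it from the color space signature at bytes 16 to 19 of the ICC header. The function name icc_stream_object is made up, and this is not necessarily how printpdf does it.

```python
# Color space signatures from the ICC header mapped to component counts
COMPONENTS = {b"GRAY": 1, b"RGB ": 3, b"CMYK": 4}

def icc_stream_object(num, profile):
    """Embed raw ICC profile bytes as-is in an indirect stream object."""
    color_space = profile[16:20]          # data color space field of the header
    if color_space not in COMPONENTS:
        raise ValueError(f"unsupported color space: {color_space!r}")
    n = COMPONENTS[color_space]
    head = f"{num} 0 obj\n<< /N {n} /Length {len(profile)} >>\nstream\n"
    return head.encode("ascii") + profile + b"\nendstream\nendobj\n"

# Stand-in for real file contents, e.g. open("profile.icc", "rb").read()
# (the file name is hypothetical); only the header field read above is filled in.
fake_profile = bytes(16) + b"CMYK" + bytes(108)
obj = icc_stream_object(4, fake_profile)
print(obj[:40])
```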

The /Catalog element is also the parent of bookmarks, optional content groups and a lot of other meta-information. You can make your own entry as a child of the /Catalog and save data specific to your application; PDF viewers will ignore it. This is how Adobe Illustrator saves information.
