\section{Implementing the Specification}
\label{sec:implementation}
Ideally, we would have arrived at an implementation by
%
(1) fully comprehending the Standard,
(2) finding all problems with the Standard and clarifying them,
(3) writing a formal specification of a PDF parser,
then finally (4) implementing a PDF parser.
%
In practice, the progression was the following:
%
(1) we implemented a basic PDF parser;
(2) we added features one by one (updating the design often);
(3) we became aware of parts of the PDF Standard that were unclear or
incorrect, for which a strictly correct implementation was not obvious;
(4) thus motivated, we wrote the pre-DOM specification; and finally,
(5) we used the specification as a guide to re-writing some of
the subtle parts of pre-DOM processing in our implementation.
%
Concurrently with all of the above steps, we were learning the PDF
Standard and finding problems and ambiguities in it.

Our implementation is a full PDF 2.0 parser. It (safely) ignores
Linearization data;
%
when parsing hybrid-reference PDFs, it ignores conventional
cross-reference tables provided for pre-1.5 readers.
%
It processes signatures (as incremental updates) but does not
validate them.

The primitive parsers are defined in the DaeDaLus~\cite{daedalusrepo}
language.
%
Other pre-DOM processing is implemented in Haskell.
%
The implementation could be viewed as a relatively complete reference
implementation of PDF.
However, the clarity of its code is hampered by various software
engineering requirements, such as
producing better error messages and
handling other PDF complexities (e.g., encryption).
It successfully parses a large number of PDFs found in the wild
(a high percentage of which diverge from the Standard in some manner).
In most cases, our implementation discovers more PDF errors than
most PDF tools and readers.

In future work, we think it would be valuable to use our specification more
directly in our implementation (which is currently closer to the \dsp{}
implementation);
%
see \Cref{sec:single-pass-problems} above for why this would be
desirable.
% The current specification could be extended to form a complete
% \emph{reference implementation} by extending it to include
% %
% \textbf{(1)} implementations of relatively minor features that were
% not the focus of the current work;
% %
% \textbf{(2)} implementations of all parsers referenced as primitives; and
% %
% \textbf{(3)} computation that generates the document's DOM.
%
% - Our pre-dom parser specification came about as an artifact of
% our attempt to understand what the Standard---defining the format---
% *requires* of a reasonable implementation.
% - The implementation is less interesting as
% - it is more like the D implementation
% - it is much longer as it ...
% - But the above was an aid
% - to enable us to understand the PDF Format Standard
% - to help us to understand how to write our D-ish implem.
% - its more abstract nature would allow us to build a parser/validator
% with a single code base (not done yet)