Skip to content

Latest commit

 

History

History
executable file
·
179 lines (148 loc) · 9.79 KB

readme.md

File metadata and controls

executable file
·
179 lines (148 loc) · 9.79 KB

Background

Depending on the pdf creator/engine used, a .pdf file may include content irrelevant for reading by a human. An example is the inclusion of complete sets of fonts though only a few glyphs are (or only one is) used on the "electronic paper". To print the .pdf with ghostscript again into a .pdf may reduce the file size e.g., prior to organize a publication in a reference manager.

By far, this bash script does not claim to be the first one collecting bits and bolts to address the issue. It rather serves as an aide-memoire of finds encountered earlier, and to moderate ghostscript in Linux accordingly. Within reason, the snippets were joined as provided; thus, the credit belongs to those already in the field.

Intended Use

One use consists of 1) the provision of the executable bit (chmod +x pdf_reprint.sh), and 2) to set an alias in your .bashrc file. Then, the functionality described below is available all across your system.

  • Else, to reprint the .pdf while retaining the color, run either one of the following commands

    bash ./pdf_rewrite.sh --reprint input.pdf
    bash ./pdf_rewrite.sh -r input.pdf
    bash ./pdf_rewrite.sh --colour input.pdf
    bash ./pdf_rewrite.sh --color input.pdf
    bash ./pdf_rewrite.sh -c input.pdf

    to replace the original file input.pdf by a new version of smaller file size. There will be a short note if the attempt was successful; and if so, the improvement compared to the original input.pdf is reported (percentage). Though you may process the file multiple times with this method, savings in file size often quickly converge to be insignificant in comparison to the remaining file size.

    The credit for the underlying approach and implementation belongs to Evan Langlois.1

  • Often, an additional reduction of file size may be obtained by reprinting the .pdf in gray-scale only. Either one of the following commands to the script is equivalent

    bash ./pdf_rewrite.sh --gray input.pdf
    bash ./pdf_rewrite.sh --grey input.pdf
    bash ./pdf_rewrite.sh -g input.pdf

    to replace file input.pdf by its rewritten form. The credit for this approach belongs to user slm on the Unix stackexchange.2

  • To process multiple .pdf, a for-loop in your shell could follow the pattern of

    for file in *.pdf
    do
        echo "$file"
        bash ./pdf_rewrite.sh -r "$file"
    done

    This equally provides you a brief progress report, too.

Note, this script's primary aim is to obtain a file of small file size, e.g., as an attachment of an email while retaining the text easy to read and – if present – to retain a text layer searchable. Especially the reprint in half-tones however may render illustrations less intelligible. It is up to the creators of figures to use easy discernible markers in diagrams, as well as to use color scales suitable for the color blind, and safe for this mimicked "photocopying". For the later, a service like https://colorbrewer2.org/ may guide your selection.

Keep a backup of the .pdf to be processed. Though the script may report problems while processing the data (or even crash, which may destroy the .pdf), it is not a PDF validator such as e.g., veraPDF.3

Benchmark

Initially written for Linux Xubuntu 18.04.3 LTS and ghostscript (version 9.26), the script is known working well e.g., with Debian 13/trixie (currently testing) and GPL Ghostscript (version 10.02.1 published by 2023-11-01).

  • File link2web.pdf was compiled with pdfLaTeX based on an example provided by www.texample.net. This .pdf contains a color figure and link to an external reference. Note, the simplification into half-tones (option -g) affects the document printed, depending on the pdf viewer used, the box around the link may remain colored for the display on screen.

  • The performance of the utility was tested on a couple of recent publications in chemistry. To ease a potential replication, publications used for the bench marke are – within reason – available open access/CC.

In the table below, savings computes the difference of the file size prior and after the processing with either option, then reports this change as percentage in respect to the file size of the originally submitted file after a single run of optimization.

Typically, the simple reprint -r retaining the color is the fastest approach to reduce most of the file size in one run and hence already good enough. How much file size is saved seems to vary not only by the relative amount of special (mathematical) characters, but among journals by the same publisher; see for instance the small savings for a reprint for J. Appl. Cryst. vs. Helv. Chim. Acta both published by Wiley.

source publisher original reprint -r saved % reprint -g saved %
2023ACR3640 ACS 7.0 MB 4.5 MB 35.7 3.2 MB 54.3
2023ACR3654 ACS 2.4 MB 1.6 MB 33.3 1.6 MB 33.3
2023CrystGrowthDes8469 ACS 3.7 MB 0.9 MB 75.7 0.9 MB 75.7
2024CrystGrowthDes71 ACS 10.5 MB 1.5 MB 85.7 1.4 MB 86.7
2023CRV12135 ACS 9.4 MB 6.1 MB 35.1 5.3 MB 43.6
2023CRV13291 ACS 25.5 MB 4.0 MB 84.3 3.7 MB 85.5
2023CRV13713 ACS 12.0 MB 4.5 MB 62.5 4.2 MB 65.0
2023JCE4674 ACS 2.8 MB 2.2 MB 21.4 2.1 MB 25.0
2023JCE4728 ACS 2.6 MB 1.0 MB 61.5 1.0 MB 61.5
2023JOC16679 ACS 4.7 MB 3.2 MB 31.9 3.0 MB 36.2
2023JOC16719 ACS 9.9 MB 2.4 MB 75.8 2.1 MB 78.8
2023OL9002 ACS 2.4 MB 1.2 MB 50.0 1.1 MB 54.2
2023OL9243 ACS 2.2 MB 1.4 MB 36.4 1.4 MB 36.4
2023Tetrahedron133750 Elsevier 1.2 MB 1.0 MB 16.7 0.6 MB 50.0
2024Tetrahedron133787 Elsevier 1.9 MB 1.8 MB 5.3 1.6 MB 15.8
2023TL154433 Elsevier 831 kB 721 kB 13.2 497 kB 40.2
2024TL154885 Elsevier 1.6 MB 0.9 MB 43.8 0.9 MB 43.8
2024PCCP713 RSC 1.0 MB 1.0 MB 0.0 0.5 MB 50.0
2024PCCP770 RSC 2.3 MB 2.1 MB 8.7 0.8 MB 65.2
2024TheorChemAcc4 Springer 1.7 MB 0.8 MB 52.9 0.7 MB 58.8
2023TheorChemAcc133 Springer 1.7 MB 1.0 MB 41.2 1.0 MB 41.2
2024JSulfurChem138 Taylor & Francis 469 kB 248 kB 47.1 247 kB 47.3
2023JSulfurChem269 Taylor & Francis 5.6 MB 2.9 MB 48.2 2.0 MB 64.3
2023Synthesis3777 Thieme 976 kB 936 kB 4.1 528 kB 45.9
2023Synthesis3947 Thieme 2.2 MB 2.2 MB 0.0 1.9 MB 13.6
2024ACIEe202310983 Wiley 877 kB 800 kB 8.8 507 kB 42.2
2024ACIEe202314446 Wiley 2.5 MB 2.4 MB 4.0 1.2 MB 52.0
2023HCAe202300110 Wiley 10.4 MB 5.5 MB 47.1 2.9 MB 72.1
2023HCAe202300154 Wiley 10.4 MB 5.7 MB 45.2 2.2 MB 78.8
2023JApplCryst1618 Wiley 1.1 MB 1.1 MB 0.0 0.9 MB 18.2
2023JApplCryst1639 Wiley 2.8 MB 2.7 MB 3.6 1.2 MB 57.1
link2web.pdf pdflatex 38.0 kB 9.8 kB 74.2 9.8 kB 74.2

Disclaimer

While rewriting the pdf file in question, the pdf metadata Producer (which can be an entry like LaTeX with hyperref), CreationDate, and ModDate are overwritten. Other metadata such as TITLE, SUBJECT, KEYWORDS, and AUTHOR are retained. This however does not seem to affect the retrieval of bibliographic metadata with a reference manager like zotero; presumably, their work accesses and relies on the doi string the pdf of a journal publication instead.

On occasion, typographic ligatures like fi (as in the string file), or fl (as in fluid) are not faithfully processed – the searchable text layer of the reprinted pdf might split them, or drop them altogether. Apparently, this issue depends both on the version of ghostscript installed, and font / pdf-engine of the pdf to be processed because recent journal publications (like by ACS, member of STIX project4 tend to be less frequently affected by this. This pdf-reprinter is not tested on pdf about documents predominantly written in other scripts than Latin.

Footnotes

Footnotes

  1. https://tex.stackexchange.com/questions/18987/how-to-make-the-pdfs-produced-by-pdflatex-smaller?rq=1

  2. https://unix.stackexchange.com/questions/93959/how-to-convert-a-color-pdf-to-black-white

  3. https://openpreservation.org/tools/verapdf/

  4. https://en.wikipedia.org/wiki/STIX_Fonts_project