
Optimize all PDFs #24

Open · alexarje opened this issue Aug 8, 2020 · 9 comments

alexarje commented Aug 8, 2020

In a discussion about Google Scholar indexing (#12), we found that it would be good to optimize all PDFs. It is particularly important to reduce the size of the large files, but the universal accessibility of several files is also questionable. In particular, the old PDFs could probably be improved by resaving them as newer PDF versions.

I can run a Ghostscript batch script to optimize the images, but does anyone know of a good approach to improving the other aspects of the PDF files? A batch script with Adobe Acrobat, perhaps?
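
The Ghostscript pass I have in mind is roughly the following; the output folder and the /ebook quality preset are just placeholders, not final choices:

  # Recompress images and rewrite each file as PDF 1.7 into a separate folder.
  # The /ebook preset and the output path are placeholders, not final choices.
  mkdir -p optimized
  for f in *.pdf; do
    gs -q -dNOPAUSE -dBATCH -dSAFER \
       -sDEVICE=pdfwrite \
       -dCompatibilityLevel=1.7 \
       -dPDFSETTINGS=/ebook \
       -sOutputFile="optimized/$f" "$f"
  done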

alexarje commented Aug 9, 2020

I have checked various settings in Acrobat XI now. I tried a batch action called "Optimize for web and mobile". This one reduces and compresses the images too much, I think; you can clearly see the pixels in all images afterwards. The resulting files are small, but I think the quality loss is too costly.

Some other batch processes that seem interesting to try (in this order):

  • "Optimize scanned documents" (converting content into searchable text and reducing file size)
  • "Prepare for distribution" (removing hidden information and other oddities)
  • "Archive documents" (create PDF/A compliant documents)

There is also the option "Make accessible", which would have been nice to run. But it asks for alternative text to be added to all images, which is an impossible task for 1700+ PDFs. Hopefully, some more OCR will help with readability.

I also wonder whether we should try to embed some more metadata in the files, and whether that would make them more readable (there is more info on the Adobe web page).
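
One option for the metadata would be ExifTool, which can batch-write the standard document info fields; a rough sketch (the file name and field values here are only placeholders):

  # Write basic document info fields with ExifTool (values are placeholders).
  exiftool -Title="Paper title" -Author="Author names" \
           -Subject="Proceedings of NIME" -overwrite_original nime2010_001.pdf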

@alexarje

I have been testing various things now. I have run a PDF validator (qpdf) on the entire collection, based on a discussion about checking whether PDF files are corrupted.

find . -type f -iname '*.pdf' \( -exec sh -c 'qpdf --check "{}" > /dev/null && echo "{}": OK' \; -o -exec echo "{}": FAILED \; \)

The check showed that only 794 of the files were labelled OK, while the other 934 failed. I have been unable to find any consistency among the failing or passing files. Originally, I thought there might be differences based on whether they were made in LaTeX or MS Word (or something else), the platform, and so on. But it turns out not to be that simple. This may also be because many of the files have been through several steps of updating along the way; paper chairs have added various stamps, page numbers, and so on.
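
For anyone repeating the check, a compact way to tally the two labels (passing the file name as a positional argument to sh -c avoids quoting issues with odd file names):

  # Count how many files pass or fail qpdf --check.
  find . -type f -iname '*.pdf' -exec sh -c \
    'qpdf --check "$1" > /dev/null 2>&1 && echo OK || echo FAILED' sh {} \; \
    | sort | uniq -c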

I hope that resaving all the PDFs will solve at least some of these issues.

alexarje self-assigned this Aug 12, 2020
@alexarje

I have done some testing of PDF optimization, and have written up my experience in this blog post.

Conclusion: I now have 1728 files that are about 1/4 of the original size, hopefully with no visible differences, and with better metadata. The files are currently available in a Dropbox folder.

They are not PDF/A, though. Any thoughts on how we can get there?
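
One candidate I have come across is Ghostscript's PDF/A output, but I have not verified it on our collection; the PDFA_def.ps file ships with Ghostscript and needs to point to a valid ICC profile:

  # Sketch of a PDF/A-2b conversion with Ghostscript (not yet tested on the collection).
  gs -dPDFA=2 -dBATCH -dNOPAUSE \
     -sColorConversionStrategy=UseDeviceIndependentColor \
     -dPDFACompatibilityPolicy=1 \
     -sDEVICE=pdfwrite \
     -sOutputFile=output_pdfa.pdf PDFA_def.ps input.pdf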

jacksongoode commented Sep 4, 2020

Hello,

I just wanted to add that I am currently working on an analysis toolkit for the last twenty years of NIME under Stefano F. In testing the toolkit's text extraction, I ran into just about every error that can occur when parsing the PDF format.

I'm so happy to see that you've addressed most, if not all, of the issues I've noticed in processing the archives. Many of the issues I'd noticed in the BibTeX file have also been resolved. Once I polish up the toolkit, I will create a list of any outstanding issues in the encoding of these papers.

However, now that I've downloaded the new PDFs from your Dropbox link, I've noticed that you've standardized the file names to "nime[year]_paper[number].pdf". I wanted to add to this discussion that the "url" fields in the BibTeX ought to be updated with the new names once the hosted files (at http://www.nime.org/proceedings/) have changed, since PDFs prior to 2016 were labelled "nime[year]_[number].pdf".
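
If the hosted files do get renamed, a single sed pass over the BibTeX could rewrite the old pattern; this is only a sketch, and it assumes the .bib file name and that the mapping really is a plain rename:

  # Rewrite pre-2016 URLs from nime[year]_[number].pdf to nime[year]_paper[number].pdf.
  # nime.bib is a placeholder file name; a .bak backup is kept.
  sed -E -i.bak 's/(nime20(0[0-9]|1[0-5]))_([0-9]+\.pdf)/\1_paper\3/g' nime.bib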

alexarje commented Sep 5, 2020

Great that you are working on this, @jacksongoode!

Concerning the PDFs, we discovered some issues with the number of files. I have made a new collection now and will make new zip files. We also decided to keep the original file names, since there appear to be publications that link directly to the original PDF file names. From now on, hopefully, people will use the DOIs from Zenodo instead.

alexarje commented Sep 5, 2020

I have made new zip files of everything now and have uploaded them here. It might be a good idea to upload them to Zenodo so that we get DOIs for them as well.

alexarje commented Sep 5, 2020

I am working on the upload to Zenodo now. I have made a separate NIME archive that can be used to store various types of conference-related material. This could also include web pages, concerts, etc., as we collect more of the historical material. I will upload the original zip files first and then replace them with new ones containing the compressed files. That way we can go back in case there are any problems.

@jacksongoode

I just wanted to follow up on this thread with the PDFs I've identified as having encoding issues while working on raw text analysis of the archive. This is a list of PDFs that are visually readable but either fully or partially corrupted when passed through two different text extraction methods. Most of these can be confirmed using a website like https://pdftotext.com/ (though I'm not sure which decoder that website uses).

In addition, there are three PDFs that are completely unreadable:

The bottom three might be found somewhere else and could potentially be replaced? The first list is trickier; I'm not sure anything can be done if those PDFs were poorly encoded to begin with (or at least poorly encoded for text extraction).
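
For reproducing the check locally, a quick pdftotext pass can at least flag the files with little or no extractable text (it won't catch partial garbling, and the 500-character threshold is arbitrary):

  # Flag PDFs whose extracted text is suspiciously short (threshold is arbitrary).
  for f in *.pdf; do
    chars=$(pdftotext "$f" - 2>/dev/null | tr -d '[:space:]' | wc -c)
    if [ "$chars" -lt 500 ]; then
      echo "SUSPECT ($chars chars): $f"
    fi
  done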

stefanofasciani commented Dec 14, 2020

The files included in this zip file fix the issues mentioned above.

In particular:

  • The folder 'corruptedRecovered' includes working PDF versions of the 3 completely unreadable files (from various sources via Google Scholar)

  • The folder 'encodingProblem' includes the original PDF files listed above (downloaded from nime.org) that present some text encoding problem. The encoding issue affects these files to different extents (from a few characters here and there to the whole file). To check the severity, copy the whole text and paste it into a plain text editor. The files nime2009_031.pdf, nime2009_161.pdf, and nime2009_256.pdf included in this folder have been unlocked after downloading (they were locked).

  • The folder 'encodingFixed' includes fixed versions of the files in 'encodingProblem'. These have been processed with OCRmyPDF (using the script fix.py), a tool normally used to add an OCR text layer to scanned PDF files; in this case the tool replaces the original text (roughly the command sketched below). A limitation is that the file size increases (on average about 3 times bigger), but the files can be further compressed using a third-party tool, or with the same script by adding compression options at line 16. In the generated text layer, the correct ordering of the text columns is preserved. However, if there is more than one author, the author fields for the 2nd author and above do not appear in the correct place (in the exported plain text they appear between the end of column 1 and the beginning of column 2 on the first page). Finally, text in images is also included in the text layer.
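
For reference, the processing is roughly equivalent to the following command line; the exact flags used in fix.py may differ, and --optimize 3 is one compression option that could be added:

  # Approximate command-line equivalent of the fix.py processing (exact flags may differ).
  # --force-ocr rasterizes each page and replaces the existing (broken) text layer;
  # --optimize 3 is one way to add compression and limit the size increase.
  ocrmypdf --force-ocr --optimize 3 encodingProblem/nime2009_031.pdf encodingFixed/nime2009_031.pdf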
