Optimize all PDFs #24
Comments
Have checked various settings in Acrobat XI now. Tried running a batch called "Optimize for web and mobile". This one reduces and compresses the images too much, I think; you can clearly see the pixels in all images afterwards. The files do end up small, but I think that is too costly a trade-off. Some other batch processes that seem interesting to try (in this order):
There is also the option "Make accessible", which would have been nice to run. But it asks for alternative text to be added to all images, which is an impossible task for 1700+ PDFs. Hopefully, some more OCR will help readability. I also wonder whether we should try to embed more metadata in the files, and whether that would make them more readable (more info on the Adobe web page).
Have been testing various things now, including running a PDF validator (qpdf) on the entire collection, based on a discussion about how to check whether PDF files are corrupted.
The check showed that only 794 of the files were labelled as OK, while the others (934) failed. I have been unable to find any consistency among the failing or passing files. Originally, I thought there might be differences based on whether they were made in LaTeX or MS Word (or something else), the platform, etc., but it turns out not to be that easy. This may also be because many of the files have been through several rounds of updating along the way: paper chairs have added various stamps, page numbers, and so on. Resaving all the PDFs will hopefully solve at least some of these issues.
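For reference, a rough sketch of the kind of check I ran (not the exact script): it assumes qpdf is installed and on the PATH, and that the collection sits in a local pdfs/ folder, both of which are placeholders.

```python
import subprocess
from pathlib import Path

ok, failed = [], []
for pdf in sorted(Path("pdfs").glob("*.pdf")):
    # qpdf --check exits with 0 when the file passes, non-zero otherwise
    result = subprocess.run(["qpdf", "--check", str(pdf)],
                            capture_output=True, text=True)
    (ok if result.returncode == 0 else failed).append(pdf.name)

print(f"OK: {len(ok)}, failed: {len(failed)}")
```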
I have done some testing of PDF optimization, and have written up my experience in this blog post. Conclusion: I now have 1728 files that are 1/4 of the original size, hopefully with no visible differences, and with better metadata. The files are currently available in a Dropbox folder. They are not PDF/A, though. Any thoughts on how we can get there?
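On the PDF/A question, one route I am aware of (untested on this collection) is Ghostscript's pdfwrite device with -dPDFA. A hedged sketch only: it needs the PDFA_def.ps template and an ICC profile shipped with Ghostscript, and the file names below are placeholders.

```python
import subprocess

def to_pdfa(src: str, dst: str, pdfa_def: str = "PDFA_def.ps") -> None:
    # Attempt a PDF/A-2 conversion; PDFA_def.ps must point to an ICC profile.
    subprocess.run([
        "gs", "-dPDFA=2", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-dPDFACompatibilityPolicy=1",   # drop features that break PDF/A rather than abort
        "-sColorConversionStrategy=RGB",
        f"-sOutputFile={dst}",
        pdfa_def, src,
    ], check=True)

to_pdfa("input.pdf", "input_pdfa.pdf")  # placeholder file names
```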
Hello, I just wanted to add that I am currently working on an analysis toolkit for the last twenty years of NIME under Stefano F. While testing the toolkit's text extraction I ran into just about every error that can occur when parsing the PDF format. I'm glad to see that you've addressed most, if not all, of the issues I've noticed in processing the archives. Many of the issues I had noticed in the BibTeX file have been resolved as well. Once I polish up the toolkit, I will create a list of any outstanding issues in the encoding of these papers. However, now that I've downloaded the new PDFs from your Dropbox link, I've noticed that you've standardized the PDF names to "nime[year]_paper[number].pdf". I wanted to add to this discussion that the "url" fields in the BibTeX ought to be updated as well once the hosted files (at http://www.nime.org/proceedings/) have changed, since PDFs prior to 2016 were labelled "nime[year]_[number].pdf".
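Purely as an illustration of that suggested BibTeX update (and only relevant if the renaming goes ahead), the url fields could be rewritten with something like the following; the .bib file names are placeholders.

```python
import re

with open("nime.bib", encoding="utf-8") as f:   # placeholder .bib path
    bib = f.read()

# old pattern nime[year]_[number].pdf -> new pattern nime[year]_paper[number].pdf
bib = re.sub(r"nime(\d{4})_(\d+)\.pdf", r"nime\1_paper\2.pdf", bib)

with open("nime_updated.bib", "w", encoding="utf-8") as f:
    f.write(bib)
```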
Great that you are working on this @jacksonmgoode! Concerning the PDFs, we discovered some issues with the number of files. Have made a new collection now, and will make new zip files. We also decided to leave the original file names, since there appear to be publications that link directly to the original PDF file names. From now on, hopefully, people will use the DOIs from Zenodo instead.
I have made new zip files of everything now and have uploaded them here. It might be a good idea to upload them to Zenodo so that we get DOIs for them as well.
Working on the upload to Zenodo now. Have made a separate NIME archive that can be used to store various types of conference-related material. This could also include web pages, concerts, etc., as we collect more of the historical material. I will upload the original zip files first and then replace them with new ones containing the compressed files. That way we can go back in case there are any problems.
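For anyone curious, the upload itself can be scripted against the Zenodo REST API (https://developers.zenodo.org/). A rough sketch, not the actual workflow used, with the token variable and file name as placeholders:

```python
import os
import requests

params = {"access_token": os.environ["ZENODO_TOKEN"]}  # placeholder token variable

# Create an empty deposition and grab its file bucket
deposition = requests.post("https://zenodo.org/api/deposit/depositions",
                           params=params, json={}).json()
bucket = deposition["links"]["bucket"]

# Stream one zip file into the deposition (repeat per archive)
with open("nime_proceedings.zip", "rb") as fp:          # placeholder file name
    requests.put(f"{bucket}/nime_proceedings.zip", data=fp, params=params)
```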
The files included in this zip file fix the issues mentioned above. In particular:
In a discussion about Google Scholar indexing (#12), we found that it would be good to optimize all PDFs. It is particularly important to reduce the size of large files, but the universal accessibility of several files is also questionable. The old PDFs in particular could probably be improved by resaving them as newer PDF versions.
I can run a Ghostscript batch script to optimize the images (a sketch of such a pass is below), but does anyone know a good approach to improving the other aspects of the PDF files? A batch script with Adobe Acrobat, perhaps?
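For the image side, the kind of Ghostscript pass I have in mind looks roughly like this; the folder names and the /ebook preset are placeholders, not settled choices.

```python
import subprocess
from pathlib import Path

src_dir, out_dir = Path("pdfs"), Path("optimized")
out_dir.mkdir(exist_ok=True)

for pdf in sorted(src_dir.glob("*.pdf")):
    subprocess.run([
        "gs", "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.7",   # also rewrites very old PDF versions
        "-dPDFSETTINGS=/ebook",       # moderate image downsampling
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        f"-sOutputFile={out_dir / pdf.name}", str(pdf),
    ], check=True)
```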