Merge pull request #6 from gwu-libraries/adobePDFsOCR
ocr postprocessing
DaltonAlves authored Jun 17, 2024
2 parents e7f58c2 + c63e5e2 commit db43a5e
Showing 5 changed files with 15 additions and 8 deletions.
1 change: 0 additions & 1 deletion digitization/av/av_bestpractices.md
@@ -54,7 +54,6 @@ Digitized av content is often multiple components. These components should be pa

Example SIP for a/v content:


```
ms2374_s2_c107d_f7_i1
├── ms2374_s2_c107d_f7_i1_001.mov
4 changes: 2 additions & 2 deletions digitization/av/av_records.md
@@ -7,13 +7,13 @@ grand_parent: Digitization
nav_order: 3
---

When digitizing audiovisual carriers we should update finding aids to reflect new information gained from the digitization process. Use the following fields in ArchivesSpace to hold new information.
When digitizing audiovisual carriers, we should update finding aids to reflect new information gained from the digitization process. Use the following fields in ArchivesSpace to hold new information.

## Scope and Contents
Any content notes derived during the digitization process. If we are doing a monitored transfer, we should record notes about the aboutness of the recording.

Examples:
"David Einsenhower gives speech and takes questions from the audience at the alumni association dinner in 1987."
"David Eisenhower gives speech and takes questions from the audience at the alumni association dinner in 1987."
"1991 Convocation honoring Ronald and Nancy Reagan 10 years after Ronald Reagan was shot and brought to GW hospital for treatment."

## Physical Description
2 changes: 1 addition & 1 deletion digitization/imaging/imaging_bestpractices.md
@@ -11,7 +11,7 @@ nav_order: 1

This is not policy, but rather a guiding document. For various reasons, projects might not be able to fulfill these recommendations. At present, technical debt related to storage and access systems makes these recommendations difficult to fulfill.

These specifications remain aspirational, but serious efforts should be made to adhere to them to better ensure the long-term preservation and usability of digitized audio and moving image material.
These specifications remain aspirational, but serious efforts should be made to adhere to them to better ensure the long-term preservation and usability of digitized texts and graphics.

## Documents and manuscripts (unbound)

11 changes: 10 additions & 1 deletion digitization/imaging/ocr.md
@@ -5,4 +5,13 @@ permalink: /imaging_ocr/
grand_parent: Digitization
parent: "Digitization: Imaging Text and Graphics"
---
test
# Optical Character Recognition (OCR) for Text Based Documents


## Adobe Acrobat

## [Tesseract](https://github.com/tesseract-ocr/tesseract)

### [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF)

OCRmyPDF uses Tesseract to generate a searchable [PDF/A ](https://en.wikipedia.org/?title=PDF/A) file from a PDF or images.
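
As a rough illustration of the OCRmyPDF step described above, the sketch below builds and runs one `ocrmypdf` command from Python. It assumes the `ocrmypdf` command-line tool is installed and on the PATH; the function names, the file names, and the choice of English (`eng`) as the OCR language are hypothetical and are not part of this commit.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(source: Path, destination: Path, language: str = "eng") -> list[str]:
    """Return the argument list for a single OCRmyPDF run.

    --output-type pdfa asks for PDF/A output; -l selects the Tesseract language.
    """
    return [
        "ocrmypdf",
        "--output-type", "pdfa",
        "-l", language,
        str(source),
        str(destination),
    ]


def run_ocr(source: Path, destination: Path, language: str = "eng") -> None:
    """Run OCRmyPDF, failing early if the tool is not available."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # check=True raises CalledProcessError if OCRmyPDF exits with an error.
    subprocess.run(build_ocrmypdf_command(source, destination, language), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    run_ocr(Path("scan.pdf"), Path("scan_ocr.pdf"))
```

Keeping the command construction separate from the call that runs it makes it easy to review the exact arguments before processing a batch of files.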
5 changes: 2 additions & 3 deletions managing/accessupload.md
@@ -31,9 +31,8 @@ Presently, GWU Special Collections only ingests access copies of digital collect
- You may use the [ArchivesSpace_to_InternetArchive script]() to generate metadata from the archival description for each record.
- If you use this script, you should still review the metadata before upload. The script pulls description from ancestor records (series, resource, ect.) that may not be appropriate for what is represented by the digital content.
- An example of this might be rights information. The script will pull the rights statement from the resource (collection) record. This may not apply to item-level description for the digital content.
- You may also include additional derivatives (SRT/VTT caption files, text files) in your upload. To do so, add an additional row to the CSV for each identifier. Match the identifier to the file of the additional derivative.
- Note that the Internet Archive generates OCR for text-based documents automatically. We cannot overwrite this generated OCR. We can still upload any corrected OCR, but it will not be used for full-text search results.
![CSV screenshot](/assets/images/sidecar_upload.PNG)
- You may also include additional derivatives (SRT/VTT caption files, full text txt files) in your upload. To do so, add an additional row to the CSV for each identifier. Match the identifier to the file of the additional derivative.
![CSV screenshot](/assets/images/sidecar_upload.png)
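
The extra-row pattern for additional derivatives could be sketched as follows. This is only an illustration: the column names `identifier` and `file`, the helper names, and the example identifier and file names are assumptions, since the screenshot's actual columns are not reproduced here.

```python
import csv
import io


def derivative_rows(identifier, derivative_files):
    """Build one CSV row per additional derivative, all sharing the same identifier."""
    return [{"identifier": identifier, "file": name} for name in derivative_files]


def write_rows(rows, handle, fieldnames=("identifier", "file")):
    """Write a header plus the given rows; missing columns are left blank."""
    writer = csv.DictWriter(handle, fieldnames=list(fieldnames), restval="")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical identifier and derivative file names.
    rows = derivative_rows("example_item_001", ["example_item_001.srt", "example_item_001.txt"])
    buffer = io.StringIO()
    write_rows(rows, buffer)
    print(buffer.getvalue())
```

In practice these rows would be appended to the existing metadata CSV rather than written to a new file, so that each derivative sits alongside the primary file for its identifier.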

### Uploading Content to Internet Archive
- Once both your files and metadata are prepared you are ready to upload! To do so, you can use your command line interface to start the upload. In your command line (PowerShell, Terminal, ect.) navigate to the directory with the files and your CSV spreadsheet.