Merge pull request #6 from gwu-libraries/adobePDFsOCR

ocr postprocessing
gwu-libraries · Jun 17, 2024 · db43a5e
2 parents e7f58c2 + c63e5e2
commit db43a5e

Showing 5 changed files with 15 additions and 8 deletions.
diff --git a/digitization/av/av_bestpractices.md b/digitization/av/av_bestpractices.md
@@ -54,7 +54,6 @@ Digitized av content is often multiple components. These components should be pa
 
 Example SIP for a/v content:
 
-
 ```
 ms2374_s2_c107d_f7_i1
 ├── ms2374_s2_c107d_f7_i1_001.mov

diff --git a/digitization/av/av_records.md b/digitization/av/av_records.md
@@ -7,13 +7,13 @@ grand_parent: Digitization
 nav_order: 3
 ---
 
-When digitizing audiovisual carriers we should update finding aids to reflect new information gained from the digitization process. Use the following fields in ArchivesSpace to hold new information.
+When digitizing audiovisual carriers, we should update finding aids to reflect new information gained from the digitization process. Use the following fields in ArchivesSpace to hold new information.
 
 ## Scope and Contents
 Any content notes derived during the digitization process. If we are doing a monitored transfer, we should record notes about the aboutness of the recording.
 
 Examples:
-"David Einsenhower gives speech and takes questions from the audience at the alumni association dinner in 1987."
+"David Eisenhower gives speech and takes questions from the audience at the alumni association dinner in 1987."
 "1991 Convocation honoring Ronald and Nancy Reagan 10 years after Ronald Reagan was shot and brought to GW hospital for treatment."
 
 ## Physical Description 

diff --git a/digitization/imaging/imaging_bestpractices.md b/digitization/imaging/imaging_bestpractices.md
@@ -11,7 +11,7 @@ nav_order: 1
 
 This is not policy, but rather a guiding document. For various reasons, projects might not be able to fulfill these recommendations. At present, technical debt related to storage and access systems makes these recommendations difficult to fulfill.
 
-These specifications remain aspirational, but serious efforts should be made to adhere to them to better ensure the long-term preservation and usability of digitized audio and moving image material.
+These specifications remain aspirational, but serious efforts should be made to adhere to them to better ensure the long-term preservation and usability of digitized texts and graphics.
 
 ## Documents and manuscripts (unbound)
 

diff --git a/digitization/imaging/ocr.md b/digitization/imaging/ocr.md
@@ -5,4 +5,13 @@ permalink: /imaging_ocr/
 grand_parent: Digitization
 parent:  "Digitization: Imaging Text and Graphics"
 ---
-test
+# Optical Character Recognition (OCR) for Text Based Documents
+
+
+## Adobe Acrobat
+
+## [Tesseract](https://github.com/tesseract-ocr/tesseract)
+
+### [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF)
+
+OCRmyPDF uses Tesseract to generate a searchable [PDF/A ](https://en.wikipedia.org/?title=PDF/A) file from a PDF or images.
diff --git a/managing/accessupload.md b/managing/accessupload.md
@@ -31,9 +31,8 @@ Presently, GWU Special Collections only ingests access copies of digital collect
 - You may use the [ArchivesSpace_to_InternetArchive script]() to generate metadata from the archival description for each record. 
   - If you use this script, you should still review the metadata before upload. The script pulls description from ancestor records (series, resource, ect.) that may not be appropriate for what is represented by the digital content.
     - An example of this might be rights information. The script will pull the rights statement from the resource (collection) record. This may not apply to item-level description for the digital content.
-- You may also include additional derivatives (SRT/VTT caption files, text files) in your upload. To do so, add an additional row to the CSV for each identifier. Match the identifier to the file of the additional derivative. 
-  - Note that the Internet Archive generates OCR for text-based documents automatically. We cannot overwrite this generated OCR. We can still upload any corrected OCR, but it will not be used for full-text search results.
-![CSV screenshot](/assets/images/sidecar_upload.PNG)
+- You may also include additional derivatives (SRT/VTT caption files, full text txt files) in your upload. To do so, add an additional row to the CSV for each identifier. Match the identifier to the file of the additional derivative. 
+![CSV screenshot](/assets/images/sidecar_upload.png)
 
 ### Uploading Content to Internet Archive
 - Once both your files and metadata are prepared you are ready to upload! To do so, you can use your command line interface to start the upload. In your command line (PowerShell, Terminal, ect.) navigate to the directory with the files and your CSV spreadsheet.