aws-samples/amazon-translate-pdf

Translating PDF Documents with Amazon Textract, Amazon Translate and PDFBox while Retaining the Original PDF Formatting

This repository contains a sample library and code examples showing how to use Amazon Textract and Amazon Translate to extract and translate text from documents, and how to use PDFBox to create a translated PDF that retains the original formatting.

How is the translated PDF generated

To generate a translated PDF, we use Amazon Textract to extract text from documents and then use Amazon Translate to get the translated text. The extracted translated text is added as a layer to the image in the PDF document.
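The flow above can be sketched as a small helper that swaps the text of each detected line for its translation while leaving the line's position untouched. This is only an illustrative sketch: `TextLine`, `LineTranslator`, and the `translate` function are hypothetical names, not part of this repository's library, and the call to Amazon Translate is assumed to be hidden behind the function passed in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical holder for one detected line: its text plus its normalized position on the page. */
final class TextLine {
    final String text;
    final float left;
    final float top;
    final float width;
    final float height;

    TextLine(String text, float left, float top, float width, float height) {
        this.text = text;
        this.left = left;
        this.top = top;
        this.width = width;
        this.height = height;
    }

    /** Returns a copy with new text and the same geometry. */
    TextLine withText(String newText) {
        return new TextLine(newText, left, top, width, height);
    }
}

final class LineTranslator {
    private LineTranslator() {}

    /**
     * Translates every non-blank line and keeps its geometry, so the
     * translated text can later be drawn where the original text was.
     * The translate function is an assumption standing in for a call
     * to a translation service.
     */
    static List<TextLine> translateLines(List<TextLine> lines, UnaryOperator<String> translate) {
        List<TextLine> result = new ArrayList<>(lines.size());
        for (TextLine line : lines) {
            if (line.text.trim().isEmpty()) {
                // Nothing to translate; keep the line as it is.
                result.add(line);
            } else {
                result.add(line.withText(translate.apply(line.text)));
            }
        }
        return result;
    }
}
```

Passing the translation step in as a function keeps the sketch independent of any SDK and makes it easy to test with a stand-in such as `String::toUpperCase`.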

Amazon Textract detects and analyzes text in input documents and returns information about detected items such as pages, words, lines, form data (key-value pairs), tables, and selection elements. It also provides bounding box information, which is an axis-aligned coarse representation of the location of the recognized item on the document page. We use the detected text and its bounding box information to place the translated text appropriately on the PDF page.
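As a rough illustration of how a bounding box can be turned into a position on a PDF page, the sketch below maps a Textract-style box (left, top, width, and height given as ratios of the page size, measured from the top-left corner) to PDF user space, whose origin is the bottom-left corner. `PdfRect` and `BoundingBoxMapper` are hypothetical names used only for this example; the library in this repository may do this differently.

```java
/** Hypothetical axis-aligned rectangle in PDF user space (origin at the bottom-left corner). */
final class PdfRect {
    final float x;
    final float y;
    final float width;
    final float height;

    PdfRect(float x, float y, float width, float height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

final class BoundingBoxMapper {
    private BoundingBoxMapper() {}

    /**
     * Converts a normalized bounding box into page coordinates.
     *
     * @param left       distance from the left edge, as a ratio of the page width
     * @param top        distance from the top edge, as a ratio of the page height
     * @param width      box width, as a ratio of the page width
     * @param height     box height, as a ratio of the page height
     * @param pageWidth  page width in PDF units
     * @param pageHeight page height in PDF units
     */
    static PdfRect toPdfRect(float left, float top, float width, float height,
                             float pageWidth, float pageHeight) {
        float x = left * pageWidth;
        float boxWidth = width * pageWidth;
        float boxHeight = height * pageHeight;
        // The box is measured from the top of the page, while PDF coordinates
        // grow upward from the bottom, so the vertical axis must be flipped.
        float y = pageHeight - (top * pageHeight) - boxHeight;
        return new PdfRect(x, y, boxWidth, boxHeight);
    }
}
```

For example, on a 612 x 792 page, a box with left 0.25, top 0.5, width 0.5, and height 0.25 maps to x = 153, y = 198, width = 306, height = 198.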

Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. Amazon Translate provides high quality on-demand and batch translation capabilities across more than 2970 language pairs, while decreasing your translation costs.

SampleInput.pdf is an example input document in English. SampleOutput-es.pdf is the translated pdf document in Spanish with all the formatting of the original document retained. SampleOutput-ja-min.pdf is the translated pdf document in Japanese using minimal mode and an external TrueType font file.

The PDFDocument library wraps all the logic needed to generate the translated PDF document from the output of Amazon Textract and Amazon Translate. It uses the open source Java library Apache PDFBox to create the PDF document, but similar PDF processing libraries are available in other programming languages.
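Translated text is often longer or shorter than the original, so one step such a library may need is choosing a font size that keeps a line inside its bounding box. The sketch below shows one possible approach and is not taken from the PDFDocument library: `FontFitter` is a hypothetical name, and the caller is assumed to supply the width of the text at font size 1 (with PDFBox this could be obtained as `font.getStringWidth(text) / 1000f`).

```java
final class FontFitter {
    private FontFitter() {}

    /**
     * Picks the largest font size at which the text fits the box.
     *
     * @param unitTextWidth width of the text at font size 1, in page units
     * @param boxWidth      width of the bounding box, in page units
     * @param boxHeight     height of the bounding box, in page units
     * @return a font size limited by both the box height and the box width
     */
    static float fitFontSize(float unitTextWidth, float boxWidth, float boxHeight) {
        if (unitTextWidth <= 0f) {
            // Empty or zero-width text: only the box height limits the size.
            return boxHeight;
        }
        // Text width scales linearly with font size, so this is the
        // largest size at which the text is no wider than the box.
        float widthLimitedSize = boxWidth / unitTextWidth;
        return Math.min(boxHeight, widthLimitedSize);
    }
}
```

Using the box height as the upper bound is a simplification; a real implementation would probably also account for ascent, descent, and line spacing of the chosen font.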

Code examples

Create translated PDF from pdf on local drive

...

        // Load the PDF document and process each page as an image
        PDDocument inputDocument = PDDocument.load(new File(documentName));
        PDFRenderer pdfRenderer = new PDFRenderer(inputDocument);
        for (int page = 0; page < inputDocument.getNumberOfPages(); ++page) {
            int pageNumber = page + 1;
            System.out.println("processing page " + pageNumber);

            // Render the page as an RGB image (scale 1 corresponds to 72 DPI)
            image = pdfRenderer.renderImage(page, 1, org.apache.pdfbox.rendering.ImageType.RGB);

            // Get the image bytes as JPEG
            byteArrayOutputStream = new ByteArrayOutputStream();
            ImageIOUtil.writeImage(image, "jpeg", byteArrayOutputStream);
            byteArrayOutputStream.flush();
            imageBytes = ByteBuffer.wrap(byteArrayOutputStream.toByteArray());

            // Extract the text and translate it
            lines = extractTextAndTranslate(imageBytes, sourceLanguage, destinationLanguage);

            if (retainFormatting) {
                // Add a page with the text layer and the image to the PDF document
                pdfDocument.addPageWithFormatting(image, ImageType.JPEG, lines);
            } else {
                // Add a page without retaining the original formatting
                pdfDocument.addPageWithoutFormatting(image, lines);
            }
        }

Run code examples on local machine

  1. Git clone or download and unzip this repository.
  2. If you want to use an external TrueType font file for the PDF output, place the .ttf file in src/main/resources and add the filename to the command in step 6. This demo was used to translate English to Japanese with Google Noto Sans JP: https://fonts.google.com/noto/specimen/Noto+Sans+JP
  3. Install Apache Maven if it is not already installed.
  4. In the project directory run "mvn package".
  5. Authenticate with your AWS Credentials. See the Java SDK documentation for instructions: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials-temporary.html
  6. Run the Java project with: "java -jar target/translate-pdf-3.0.jar --source --translated --font <myfont.ttf>". This generates two sample output PDFs: one with formatting and one without. The --font option is only required when using a TTF file. See https://aws.amazon.com/textract/ for the languages Textract currently supports and https://docs.aws.amazon.com/translate/latest/dg/what-is.html#what-is-languages for language codes.

Cost

These samples call Amazon Textract and Amazon Translate APIs in your AWS account when you run them. You may be charged for all the API calls made as part of the analysis.

Other Resources

[1] Large-scale document processing with Amazon Textract - reference architecture (https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing)

[2] Generating searchable PDFs from scanned documents automatically with Amazon Textract (https://aws.amazon.com/blogs/machine-learning/generating-searchable-pdfs-from-scanned-documents-automatically-with-amazon-textract/)

[3] Amazon Textract code samples (https://github.com/aws-samples/amazon-textract-code-samples)

[4] Amazon Translate code samples (https://github.com/aws-samples/amazon-translate-text-extract-sample)

License

This library is licensed under the MIT-0 License. See the LICENSE file.
