Day 33: PDF Manipulations

Instructions:

  1. Open a new Python file.
  2. The PDF file is a platform-agnostic method of sharing documents. Reports are commonly saved as PDFs so that the formatting stays consistent across print, desktop, and mobile. Python offers several packages for interacting with PDF files, which can be used to automate different tasks.
  3. The tabula package (installed from PyPI as tabula-py) reads tables from a PDF so that the program can work with the data. The output is stored as a dataframe, which allows the use of methods like .to_csv. The resulting csv is not a clean representation of the data, so the file is read back in and transformed to eliminate null columns and multi-row headers. While the tabula package makes it possible to extract data from a PDF, the output needs to be inspected and cleaned before it can be turned into a usable data product.
    from tabula import read_pdf
    import csv
    import os
    
  4. The read_pdf() function returns the tables it reads from the file as dataframes in a list object, so the table is taken from the list before .to_csv is called. The data can be cleaned without first writing it out to a csv, but a temporary csv is the easiest approach given the functions covered so far in the course. Therefore, a temporary csv is created to store the uncleaned data.
    # read_pdf() returns a list of dataframes; use the first table.
    census = read_pdf("MontcoCensus.pdf")[0]
    census.to_csv("census_temp.csv", mode="w", sep="|", index=False)
    
    csv.register_dialect("pipe-delim", delimiter="|", lineterminator="\n")
    
  5. Since only the age and census numbers are needed, the program writes the new header and then skips the rows of the old header with an if statement. The required data from the four columns are then written to the file. The resulting csv is in the proper format and the data can then be used for future analyses.
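    # newline="" lets the csv module control line endings on every platform.
    with open("census_temp.csv", "rt", newline="") as census_in, open("census_20100401.csv", "wt", newline="") as census_out:
        writer = csv.writer(census_out, delimiter="|", lineterminator="\n")
        writer.writerow(("age", "both_sexes", "male", "female"))
        for row in csv.reader(census_in, dialect="pipe-delim"):
            # Skip blank lines and both rows of the old multi-row header.
            if not row or row[0] in ("Unnamed: 0", ""):
                continue
            # Keep age, both sexes, male, and female; the column at index 1 is not needed.
            writer.writerow([row[0], row[2], row[3], row[4]])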
    
  6. Once the temporary file is no longer needed, it can be removed.
    os.remove("census_temp.csv")
    
  7. Update the log file with what you have learned today.
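
The header-skipping logic from step 5 can be tried without the PDF or the tabula package. The sketch below is a minimal illustration under stated assumptions: the SAMPLE text is a hypothetical stand-in for census_temp.csv (the labels and counts are made up, not real census figures), and clean_rows is a helper name introduced here for illustration only. It uses in-memory text buffers in place of files.

```python
import csv
import io

# Hypothetical stand-in for census_temp.csv: a two-row header, an
# unused second column, and made-up counts (not real census data).
SAMPLE = (
    "Unnamed: 0|Unnamed: 1|Both sexes|Male|Female\n"
    "||Number|Number|Number\n"
    "Under 5 years||100|51|49\n"
    "5 to 9 years||110|56|54\n"
)


def clean_rows(source):
    """Yield (age, both_sexes, male, female) from pipe-delimited text."""
    for row in csv.reader(source, delimiter="|"):
        # Skip blank lines and header rows, whose first cell is either
        # empty or the auto-generated name "Unnamed: 0".
        if not row or row[0] in ("", "Unnamed: 0"):
            continue
        yield row[0], row[2], row[3], row[4]


cleaned = list(clean_rows(io.StringIO(SAMPLE)))

# Write the new header and the cleaned rows to an in-memory buffer.
out = io.StringIO()
writer = csv.writer(out, delimiter="|", lineterminator="\n")
writer.writerow(("age", "both_sexes", "male", "female"))
writer.writerows(cleaned)
print(out.getvalue())
```

Keeping the filter in a small generator makes it easy to check the skip condition against a few sample rows before running it on the real temporary file.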