Generalize Visual FoxPro / DBF extraction code and separate it from FERC Form 1 #1984
I'd personally suggest separating the migration of the dbf -> sqlite conversion code from the […]. The former task is mostly copy/paste but will make working with the […].
Sure, I agree. Happy to do that in a small independent PR once we get everything merged into the XBRL integration.
@rousik asks:
My recollection is that the tables have names that are defined in the FoxPro DB independent of the filenames (they differed significantly from the filenames for FERC 1). Do the filenames just happen to match the names of the tables that are stored inside the database? Or are they being derived from the filenames?

I don't think changing the table names in the translation from DBF to SQLite is a great idea, since if there's any documentation out there in the world explaining what's in the old DBs, it would apply pretty directly to the translated SQLite DB. When we take the data from the translated DB and do more extensive transformations and combine it with the more recent XBRL data to create new PUDL DB tables, we'll give it a longer, more readable table name.

For FERC 1, with the DBF to SQLite translation we tried to keep all of the data without discriminating, since otherwise nobody can access it, and we weren't 100% sure what would be useful and what wouldn't. However, early on there were two unusual tables that seemed to contain binary data (like they'd jammed PDFs or Word docs into the DB or something), which also made up ~90% of the overall bulk of the database, and we skipped them in the translation to SQLite. FERC later revised all the old data archives, removing all of that mysterious data!
For FERC Form 2, it seems that the table names match the filenames (tables are lowercase, filenames all uppercase). I can leave these as-is.
I haven't inspected the data itself, but I will take a look to see what's being exported to SQLite and whether I can glean some meaning from it.
In what sense are the contents of […]? Really, all we are trying to do in this step is translate the data into a modern format that's easy to access. We happen to be able to easily get all the annual DBs into a single multi-year DB, which is nice, but all of the cleaning, reshaping, renaming etc. takes place when we extract from the FERC SQLite DB and try to integrate it into PUDL.
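Folding the per-year SQLite files into a single multi-year DB can be done entirely inside SQLite via `ATTACH`. A minimal sketch, assuming identical schemas across years (the `append_year` helper and all file/table names are illustrative, not PUDL's actual implementation):

```python
import sqlite3


def append_year(main_path: str, year_path: str, table: str) -> None:
    """Append one annual DB's table into the multi-year DB.

    Hypothetical helper: assumes both databases define `table`
    with an identical schema.
    """
    conn = sqlite3.connect(main_path)
    conn.execute("ATTACH DATABASE ? AS yr", (year_path,))
    # Create the destination table from the annual schema if it's missing,
    # copying structure only (WHERE 0 selects no rows).
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS main."{table}" AS '
        f'SELECT * FROM yr."{table}" WHERE 0'
    )
    conn.execute(f'INSERT INTO main."{table}" SELECT * FROM yr."{table}"')
    conn.commit()
    conn.execute("DETACH DATABASE yr")
    conn.close()
```

Looping `append_year` over each annual file accumulates all years into one DB without any row-level processing in Python, which is why the multi-year concatenation is cheap relative to the later cleaning steps.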
@rousik Do you feel like the tasks in here map to stuff that actually needs to get done to get us to having the additional DBF data translated to SQLite for FERC 2, 6, & 60? Is there enough work in those sub-tasks that we want to break them out into their own issues? It seems like there will probably be significant quirks to manage in the individual datasets.
Moved two quality issues into #2748, closing this issue.
We have a bunch of Visual FoxPro / DBF to SQLite extraction / conversion code that lives in the pudl.extract.ferc1 module. Really this is more analogous to the XBRL extractor code, or the ExcelExtractor that we use for the EIA data, and should be split out into its own, more generally applicable DBF extractor module. The data this module should enable us to extract includes the DBF-era FERC Forms 1, 2, 6, and 60.
Forms 1, 2, 6, and 60 are already archived and should be available for programmatic access via the pudl_datastore command. We do not anticipate any further updates to the existing DBF data that has been published; all new FERC data is being published using XBRL. We use a modified version of the dbfread library to access this type of data.
@zschira created a scoping issue in #2335
Note that the old FERC EQR data could be structured very differently from the other form data, and it might be more appropriate to extract that data to CSV or Apache Parquet if it's just a couple of very long tables. This is a stretch goal and might really be an entirely separate project so... if the work doesn't apply directly, that's fine and it should be put off.
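If EQR really is just a couple of very long tables, a streaming export avoids ever holding the data in memory. A minimal CSV sketch using only the standard library (the `records_to_csv` helper and the record shapes are illustrative; a Parquet version would swap in pyarrow):

```python
import csv
from typing import Any, Iterable, Mapping


def records_to_csv(records: Iterable[Mapping[str, Any]], out_path: str) -> int:
    """Stream dict-like records to a CSV file one row at a time.

    Hypothetical helper: field names come from the first record.
    Returns the number of data rows written.
    """
    it = iter(records)
    first = next(it, None)
    if first is None:
        return 0
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(first.keys()))
        writer.writeheader()
        writer.writerow(first)
        n = 1
        for rec in it:
            writer.writerow(rec)
            n += 1
    return n
```

Since the input is a plain iterable, the same function would accept a `dbfread.DBF` instance directly, keeping the long-table export a constant-memory pass over the file.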
Tasks