Generalize Visual FoxPro / DBF extraction code and separate it from FERC Form 1 #1984

zaneselvans · 2022-10-12T23:33:07Z

We have a bunch of Visual FoxPro / DBF to SQLite extraction / conversion code that lives in the pudl.extract.ferc1 module. Really this is more analogous to the XBRL extractor code, or the ExcelExtractor that we use for the EIA data, and should be split out into its own more generally applicable DBF extractor module.

The data this module should enable us to extract includes:

Forms 1, 2, 6, and 60 are already archived and should be available for programmatic access via the pudl_datastore command. We do not anticipate any further updates to the existing DBF data that has been published. All new FERC data is being published using XBRL.

We use a modified version of the dbfread library to access this type of data.
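
For orientation, here is a minimal, stdlib-only sketch of the DBF header layout that libraries like dbfread parse. It is not the modified dbfread code PUDL uses, and it ignores memo files and record data entirely. The function name `read_dbf_fields`, the helper `_descriptor`, and the sample column names are made up for illustration.

```python
import struct


def read_dbf_fields(raw: bytes):
    """Return (n_records, record_len, fields) parsed from a DBF header."""
    # First 12 header bytes: version, three last-update date bytes,
    # record count, header length, record length (little-endian).
    _ver, _d1, _d2, _d3, n_records, header_len, record_len = struct.unpack_from(
        "<BBBBIHH", raw, 0
    )
    fields = []
    offset = 32  # field descriptors start right after the 32-byte header
    # Each descriptor is 32 bytes; the list ends with a 0x0D terminator byte.
    while offset < header_len and raw[offset] != 0x0D:
        name, ftype, length, decimals = struct.unpack_from("<11sc4xBB", raw, offset)
        fields.append(
            (name.split(b"\x00", 1)[0].decode("ascii"), ftype.decode("ascii"), length, decimals)
        )
        offset += 32
    return n_records, record_len, fields


def _descriptor(name: str, ftype: str, length: int, decimals: int = 0) -> bytes:
    """Build one 32-byte field descriptor (hypothetical helper for the demo)."""
    return struct.pack(
        "<11sc4xBB14x", name.encode("ascii"), ftype.encode("ascii"), length, decimals
    )


# Hand-built header for a hypothetical two-column table with zero records.
header_len = 32 + 2 * 32 + 1  # header + two descriptors + terminator
record_len = 1 + 4 + 12  # deletion flag + the two field widths
sample = (
    struct.pack("<BBBBIHH20x", 0x30, 22, 10, 12, 0, header_len, record_len)
    + _descriptor("RESP_ID", "N", 4)
    + _descriptor("AMOUNT", "N", 12, 2)
    + b"\x0d"
)
parsed = read_dbf_fields(sample)
```

The field list recovered this way is what a DBF to SQLite translation would use to define the columns of the output table.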

@zschira created a scoping issue in #2335

Note that the old FERC EQR data could be structured very differently from the other form data, and it might be more appropriate to extract that data to CSV or Apache Parquet if it's just a couple of very long tables. This is a stretch goal and might really be an entirely separate project so... if the work doesn't apply directly, that's fine and it should be put off.

All years of DBF data for a given form should be extracted into a single SQLite DB.
All years of data for a particular table should be available in a single table in the SQLite DB.
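
The two requirements above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the function name `load_years_into_one_table`, the table name `f2_example`, the column names, and the use of a `report_year` column to tell years apart are all hypothetical, and the per-year rows are plain dicts standing in for records read from DBF files.

```python
import sqlite3


def load_years_into_one_table(conn, table, rows_by_year):
    """Append every year's rows to one SQLite table, tagged with report_year."""
    columns = None
    for year, rows in sorted(rows_by_year.items()):
        for row in rows:
            if columns is None:
                # Take the column list from the first row seen (assumes all
                # years share the same columns, which real data may not).
                columns = list(row)
                col_defs = ", ".join(f'"{c}"' for c in columns)
                conn.execute(
                    f'CREATE TABLE IF NOT EXISTS "{table}" '
                    f"(report_year INTEGER, {col_defs})"
                )
            col_list = ", ".join(f'"{c}"' for c in columns)
            placeholders = ", ".join("?" for _ in range(len(columns) + 1))
            conn.execute(
                f'INSERT INTO "{table}" (report_year, {col_list}) '
                f"VALUES ({placeholders})",
                [year, *(row.get(c) for c in columns)],
            )
    conn.commit()


# One in-memory DB standing in for the single per-form SQLite file.
conn = sqlite3.connect(":memory:")
load_years_into_one_table(
    conn,
    "f2_example",
    {
        2020: [{"respondent_id": 1, "amount": 12.5}],
        2019: [{"respondent_id": 1, "amount": 10.0}],
    },
)
years = [
    r[0]
    for r in conn.execute('SELECT report_year FROM "f2_example" ORDER BY report_year')
]
```

A real implementation would also have to reconcile columns that change between years, which this sketch does not attempt.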


cmgosnell · 2022-10-13T13:41:57Z

I'd personally suggest separating the migration of the dbf -> sqlite conversion code out of the pudl.extract.ferc1 module from the generalizing of the dbf extractor.

The former task is mostly copy/paste but will make working with pudl.extract.ferc1 much more manageable. I'm also excited about the latter task but know that form 2 is not on our immediate horizon.

zaneselvans · 2022-10-13T15:18:22Z

Sure, I agree. Happy to do that on a small independent PR once we get everything merged into XBRL integration.

zaneselvans · 2023-03-03T16:01:39Z

@zschira Not a lot here but related to #2335

zaneselvans · 2023-03-13T16:55:08Z

@rousik As I said to @zschira there's not a whole lot here, but this is the issue we were chatting about.

I wrote the original code and am happy to flesh this out more if that would be useful.

zaneselvans · 2023-05-02T02:58:36Z

@rousik asks:

I suspect that we might want to do some cleanup of the FERC 2 table names, see https://github.com/catalyst-cooperative/pudl/blob/rousik-ferc2/src/pudl/package_data/ferc2/table_file_map.csv - e.g. f2_117_cmpinc_hedge seems to contain a lot of excess, useless information that just clutters things. Thoughts?

My recollection is that the tables have names that are defined in the FoxPro DB independent of the filenames (they differed significantly from the filenames for FERC 1). Do the filenames just happen to match the names of the tables that are stored inside the database? Or are they being derived from the filenames? I don't think changing the table names in the translation from DBF to SQLite is a great idea, since if there's any documentation out there in the world explaining what's in the old DBs, it would apply pretty directly to the translated SQLite DB.

When we take the data from the translated DB and do more extensive transformations and combine it with the more recent XBRL data to create new PUDL DB tables, we'll give it a longer more readable table name.

For FERC 1 with the DBF to SQLite translation we tried to keep all of the data without discriminating, since otherwise nobody can access it, and we weren't 100% sure what would be useful and what wouldn't. However, early on there were two unusual tables that seemed to contain binary data (like they'd jammed PDFs or Word docs into the DB or something) which also made up ~90% of the overall bulk of the database, and we skipped them in the translation to SQLite. FERC later revised all the old data archives, removing all of that mysterious data!

rousik · 2023-05-02T15:01:48Z

@rousik asks:

I suspect that we might want to do some cleanup of the FERC 2 table names, see https://github.com/catalyst-cooperative/pudl/blob/rousik-ferc2/src/pudl/package_data/ferc2/table_file_map.csv - e.g. f2_117_cmpinc_hedge seems to contain a lot of excess, useless information that just clutters things. Thoughts?

My recollection is that the tables have names that are defined in the FoxPro DB independent of the filenames (they differed significantly from the filenames for FERC 1). Do the filenames just happen to match the names of the tables that are stored inside the database? Or are they being derived from the filenames? I don't think changing the table names in the translation from DBF to SQLite is a great idea, since if there's any documentation out there in the world explaining what's in the old DBs, it would apply pretty directly to the translated SQLite DB.

For FERC Form 2, it seems that the table names match the filenames (table names are lower case, filenames are all uppercase). I can leave them as-is.

For FERC 1 with the DBF to SQLite translation we tried to keep all of the data without discriminating, since otherwise nobody can access it, and we weren't 100% sure what would be useful and what wouldn't. However, early on there were two unusual tables that seemed to contain binary data (like they'd jammed PDFs or Word docs into the DB or something) which also made up ~90% of the overall bulk of the database, and we skipped them in the translation to SQLite. FERC later revised all the old data archives, removing all of that mysterious data!

I haven't inspected the data itself, but I will take a look to see what's being exported to sqlite and if I can glean some meaning from those.
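
The Form 2 naming convention noted above (lower-case table names, upper-case filenames) can be illustrated in a couple of lines. This is only a sketch: the helper name `table_name_from_dbf_filename` and the example filename with a `.DBF` extension are assumptions, not something taken from the PUDL code or from table_file_map.csv.

```python
from pathlib import PurePath


def table_name_from_dbf_filename(filename: str) -> str:
    """Derive a table name by dropping the extension and lower-casing the stem."""
    return PurePath(filename).stem.lower()


# Hypothetical upper-case filename for the table mentioned in this thread.
name = table_name_from_dbf_filename("F2_117_CMPINC_HEDGE.DBF")
```

Whether the names stored inside the FoxPro database always agree with a name derived this way is exactly the open question in the comment above, so a derived name should be checked against the database rather than trusted.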

zaneselvans · 2023-05-02T15:10:18Z

In what sense are the contents of f2_117_cmpinc_hedge messy?

Really all we are trying to do in this step is translate the data into a modern format that's easy to access. We happen to be able to easily get all the annual DBs into a single multi-year DB which is nice, but all of the cleaning, reshaping, renaming etc. takes place when we extract from the FERC SQLite DB and try to integrate it into PUDL.

zaneselvans · 2023-05-18T18:23:20Z

@rousik Do you feel like the tasks in here map to stuff that actually needs to get done to get us to having the additional DBF data translated to SQLite for FERC 2, 6, & 60? Is there enough work in those sub-tasks that we want to break them out into their own issues? It seems like there will probably be significant quirks to manage in the individual datasets.

e-belfer · 2023-07-24T20:18:11Z

Moved two quality issues into #2748, closing this issue.

zaneselvans added the labels ferc1 ("Anything having to do with FERC Form 1") and ferc2 ("Issues related to the FERC Form 2 dataset") Oct 12, 2022

zaneselvans mentioned this issue Oct 12, 2022

Document new transforms & organize imports/logging #1962

Merged

zaneselvans added the labels inframundo and dbf ("Data coming from FERC's old Visual FoxPro DBF database file format.") Mar 3, 2023

zaneselvans added this to Catalyst Megaproject Mar 3, 2023

github-project-automation bot moved this to 🆕 New in Catalyst Megaproject Mar 3, 2023

zaneselvans added the labels ferc6 and ferc60 Mar 3, 2023

zaneselvans added the labels ferceqr ("Data from FERC's Electric Quarterly Review (EQR)") and sqlite ("Issues related to interacting with sqlite databases") Mar 15, 2023

zaneselvans added this to the 2023Q2 milestone Mar 15, 2023

zaneselvans moved this from 🆕 New to 🔖 Backlog in Catalyst Megaproject Mar 15, 2023

zaneselvans added the label epic ("Any issue whose primary purpose is to organize other issues into a group.") Mar 15, 2023

zaneselvans mentioned this issue Mar 20, 2023

Scope FERC Forms 2 and 6 DBF extraction #2335

Closed

4 tasks

rousik self-assigned this Mar 23, 2023

zaneselvans added the label good-first-issue ("Good issues for first-time contributors. Self-contained, low context, no credentials required.") Apr 7, 2023

zaneselvans moved this from Backlog to In progress in Catalyst Megaproject Apr 24, 2023

zaneselvans linked a pull request Apr 24, 2023 that will close this issue

Initial refactoring of dbf extraction process #2536

Merged

11 tasks

This was linked to pull requests May 25, 2023

Integrate FERC Form 2 dbf formats into ferc_to_sqlite #2564

Merged

Integrate FERC Form 6 from dbf #2595

Merged

zaneselvans moved this from In progress to In review in Catalyst Megaproject Jun 15, 2023

e-belfer mentioned this issue Jul 18, 2023

Extract FERC Form 60 DBF data to SQLite #2734

Merged

7 tasks

e-belfer linked a pull request Jul 18, 2023 that will close this issue

Extract FERC Form 60 DBF data to SQLite #2734

Merged

7 tasks

e-belfer mentioned this issue Jul 24, 2023

Improve DBF extraction functionality #2748

Open

2 tasks

e-belfer closed this as completed Jul 24, 2023

github-project-automation bot moved this from In review to Done in Catalyst Megaproject Jul 24, 2023
