Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize Visual FoxPro / DBF extraction code and separate it from FERC Form 1 #1984

Closed
10 tasks done
zaneselvans opened this issue Oct 12, 2022 · 9 comments · Fixed by #2536, #2564, #2595 or #2734
Closed
10 tasks done
Assignees
Labels
dbf Data coming from FERC's old Visual FoxPro DBF database file format. epic Any issue whose primary purpose is to organize other issues into a group. ferc1 Anything having to do with FERC Form 1 ferc2 Issues related to the FERC Form 2 dataset ferc6 ferc60 ferceqr Data from FERC's Electric Quarterly Review (EQR) good-first-issue Good issues for first-time contributors. Self-contained, low context, no credentials required. inframundo sqlite Issues related to interacting with sqlite databases
Milestone

Comments

@zaneselvans
Copy link
Member

zaneselvans commented Oct 12, 2022

We have a bunch of Visual FoxPro / DBF to SQLite extraction / conversion code that lives in the pudl.extract.ferc1 module. Really this is more analogous to the XBRL extractor code, or the ExcelExtractor that we use for the EIA data, and should be split out into its own more generally applicable DBF extractor module.

The data this module should enable us to extract includes:

The forms 1, 2, 6, and 60 are already archived and should be available for programmatic access via pudl_datastore command. We do not anticipate any further updates to the existing DBF data that has been published. All new FERC data is being published using XBRL.

We use a modified version of the dbfread library to access this type of data.

@zschira created a scoping issue in #2335

Note that the old FERC EQR data could be structured very differently from the other form data, and it might be more appropriate to extract that data to CSV or Apache Parquet if it's just a couple of very long tables. This is a stretch goal and might really be an entirely separate project so... if the work doesn't apply directly, that's fine and it should be put off.

  • All years of DBF data for a given form should be extracted into a single SQLite DB.
  • All years of data for a particular table should be available in a single table in the SQLite DB.

Tasks

Preview Give feedback
  1. dbf ferc60
    e-belfer
@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 ferc2 Issues related to the FERC Form 2 dataset labels Oct 12, 2022
@cmgosnell
Copy link
Member

I'd personally suggest separating the migration of the dbf -> sqlite conversion code from the pudl.extract.ferc1 module from the generalize-ing of the dbf extractor.

The former task is mostly copy/paste but will make working with the pudl.extract.ferc1 much more manageable. I'm also excited about the later task but know that form 2 is not on our immediate horizon.

@zaneselvans
Copy link
Member Author

Sure, I agree. Happy to do that on a small independent PR once we get everything merged into XBRL integration.

@zaneselvans zaneselvans added inframundo dbf Data coming from FERC's old Visual FoxPro DBF database file format. labels Mar 3, 2023
@zaneselvans
Copy link
Member Author

@zschira Not a lot here but related to #2335

@zaneselvans
Copy link
Member Author

@rousik As I said to @zschira there's not a whole lot here, but this is the issue we were chatting about.

I wrote the original code and am happy to flesh this out more if that would be useful.

@zaneselvans zaneselvans added ferceqr Data from FERC's Electric Quarterly Review (EQR) sqlite Issues related to interacting with sqlite databases labels Mar 15, 2023
@zaneselvans zaneselvans added this to the 2023Q2 milestone Mar 15, 2023
@zaneselvans zaneselvans moved this from 🆕 New to 🔖 Backlog in Catalyst Megaproject Mar 15, 2023
@zaneselvans zaneselvans added the epic Any issue whose primary purpose is to organize other issues into a group. label Mar 15, 2023
@rousik rousik self-assigned this Mar 23, 2023
@zaneselvans zaneselvans added the good-first-issue Good issues for first-time contributors. Self-contained, low context, no credentials required. label Apr 7, 2023
@zaneselvans zaneselvans moved this from Backlog to In progress in Catalyst Megaproject Apr 24, 2023
@zaneselvans zaneselvans linked a pull request Apr 24, 2023 that will close this issue
11 tasks
@zaneselvans
Copy link
Member Author

@rousik asks:

I'm suspecting that we might want to do some cleanup of the FERC 2 table names, see https://github.com/catalyst-cooperative/pudl/blob/rousik-ferc2/src/pudl/package_data/ferc2/table_file_map.csv - e.g. f2_117_cmpinc_hedge seems to contain lot of excess useless information that just clutter the things. Thoughts?

My recollection is that the tables have names that are defined in the FoxPro DB independent of the filenames (the differed significantly from the filenames for FERC 1). Do the filenames just happen to match the names of the tables that are stored inside the database? Or are they being derived from the filenames? I don't think changing the table names in the translation from DBF to SQLite is a great idea, since if there's any documentation out there in the world explaining what's in the old DBs, it would apply pretty directly to the translated SQLite DB.

When we take the data from the translated DB and do more extensive transformations and combine it with the more recent XBRL data to create new PUDL DB tables, we'll give it a longer more readable table name.

For FERC 1 with the DBF to SQLite translation we tried to keep all of the data without discriminating, since otherwise nobody can access it, and we weren't 100% sure what would be useful and what wouldn't. However, early on there were two unusual tables that seemed to contain binary data (like they’d jammed PDFs or word docs into the DB or something) which also made up ~90% of all the overall bulk of the database, and we skipped them in the translation to SQLite. FERC later revised all the old data archives, removing all of that mysterious data!

@rousik
Copy link
Collaborator

rousik commented May 2, 2023

@rousik asks:

I'm suspecting that we might want to do some cleanup of the FERC 2 table names, see https://github.com/catalyst-cooperative/pudl/blob/rousik-ferc2/src/pudl/package_data/ferc2/table_file_map.csv - e.g. f2_117_cmpinc_hedge seems to contain lot of excess useless information that just clutter the things. Thoughts?

My recollection is that the tables have names that are defined in the FoxPro DB independent of the filenames (the differed significantly from the filenames for FERC 1). Do the filenames just happen to match the names of the tables that are stored inside the database? Or are they being derived from the filenames? I don't think changing the table names in the translation from DBF to SQLite is a great idea, since if there's any documentation out there in the world explaining what's in the old DBs, it would apply pretty directly to the translated SQLite DB.

For FERC Form 2, it seems that the table names match filenames (tables are lower case, filenames all uppercase). I cxan leave as-is.

For FERC 1 with the DBF to SQLite translation we tried to keep all of the data without discriminating, since otherwise nobody can access it, and we weren't 100% sure what would be useful and what wouldn't. However, early on there were two unusual tables that seemed to contain binary data (like they’d jammed PDFs or word docs into the DB or something) which also made up ~90% of all the overall bulk of the database, and we skipped them in the translation to SQLite. FERC later revised all the old data archives, removing all of that mysterious data!

I haven't inspected the data itself, but I will take a look to see what's being exported to sqlite and if I can glean some meaning from those.

@zaneselvans
Copy link
Member Author

In what sense are the contents of f2_117_cmpinc_hedge messy?

Really all we are trying to do in this step is translate the data into a modern format that's easy to access. We happen to be able to easily get all the annual DBs into a single multi-year DB which is nice, but all of the cleaning, reshaping, renaming etc. takes place when we extract from the FERC SQLite DB and try to integrate it into PUDL.

@zaneselvans
Copy link
Member Author

@rousik Do you feel like the tasks in here map to stuff that actually needs to get done to get us to having the additional DBF data translated to SQLite for FERC 2, 6, & 60? Is there enough work in those sub-tasks that we want to break them out into their own issues? It seems like there will probably be significant quirks to manage in the individual datasets.

@zaneselvans zaneselvans moved this from In progress to In review in Catalyst Megaproject Jun 15, 2023
@e-belfer
Copy link
Member

Moved two quality issues into #2748, closing this issue.

@github-project-automation github-project-automation bot moved this from In review to Done in Catalyst Megaproject Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dbf Data coming from FERC's old Visual FoxPro DBF database file format. epic Any issue whose primary purpose is to organize other issues into a group. ferc1 Anything having to do with FERC Form 1 ferc2 Issues related to the FERC Form 2 dataset ferc6 ferc60 ferceqr Data from FERC's Electric Quarterly Review (EQR) good-first-issue Good issues for first-time contributors. Self-contained, low context, no credentials required. inframundo sqlite Issues related to interacting with sqlite databases
Projects
Archived in project
4 participants