Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert EIA generation and fuel allocations to Dagster #2435

Closed
25 tasks done
Tracked by #1973
zaneselvans opened this issue Mar 22, 2023 · 6 comments · Fixed by #2527
Closed
25 tasks done
Tracked by #1973

Convert EIA generation and fuel allocations to Dagster #2435

zaneselvans opened this issue Mar 22, 2023 · 6 comments · Fixed by #2527
Assignees
Labels
analysis Data analysis tasks that involve actually using PUDL to figure things out, like calculating MCOE. dagster Issues related to our use of the Dagster orchestrator eia923 Anything having to do with EIA Form 923 output Exporting data from PUDL into other platforms or interchange formats. sqlite Issues related to interacting with sqlite databases
Milestone

Comments

@zaneselvans
Copy link
Member

zaneselvans commented Mar 22, 2023

Most of the code being migrated will be in pudl.analysis.allocate_net_gen

There are 3 versions of the net gen allocated table:

  • gen_fuel_by_generator_energy_source is the primary table from which the others are derived.
  • gen_fuel_by_generator_eia923 is an aggregation of the above table.
  • gen_fuel_by_generator_energy_source_owner_eia923 is an allocation of it.
  • A truncated version of gen_fuel_by_generator_eia923 is also used to provide a more complete filled version of generation_eia923.

The main functions that organize the creation of these tables are:

  • allocate_gen_fuel_by_generator_energy_source
  • aggregate_gen_fuel_by_generator

The input data is generated / handled by:

  • extract_input_tables
  • standardize_input_frequency

It looks like those 4 functions are the only ones that need to be converted to use Dagster. However they need to happen at both yearly and monthly frequency, so we probably want a factory of interconnected assets.

Tasks

Preview Give feedback

Data Problems (halp! @cmgosnell 🧠)

  • There's very low coverage of the capacity_mw and utility_id_eia column in the gen_fuel_by_gen_esc_own table, but only when it's aggregated to monthly frequency. It turns out this is because the ownership information is being merged on only annually rather than broadcast / date-merged, so in the monthly case only the January records end up with utility_id_eia and capacity_mw values. Since utility_id_eia is part of the primary key, this is invalid. So we need to either only create this particular table with annual frequency (is it only used for the FERC 1 to EIA entity matching?) or need to do a date_merge to bring in the utility ID and capacity information. For the moment I'm skipping generating this asset at monthly resolution but that feels a little weird.
  • About 0.6% of all the records are coming out with different values for energy_source_code_num between the existing PudlTabl and the new Dagster calculation. However, this only happens sometimes. The rest of the time they match up 100%. My hunch is that this is happening when there's a tie between multiple energy source codes that are being filled in, and the ordering of something determines which value ends up in slot 1 vs. 2. If that's the case then this doesn't matter (and it was probably happening before but we never noticed...). If it's something else... maybe it still doesn't matter, since it's only 0.6% of all records.
  • When testing that the net gen allocation outputs from PudlTabl were consistent with those from reading data out of the DB directly or asking Dagster to give me the table, I discovered that for the monthly allocated gen_fuel tables they were not! If not start_date or end_date are given when creating a PudlTabl object, it looks up the earliest and latest possible dates using pudl.helpers.get_working_dates_by_datasource(). The start_date and end_date are used to restrict the records that are read out of the DB. However, in the case of the monthly allocated gen_fuel tables there are some dates which are after the latest possible date (e.g. right now they run through 2022-12-01 even though the latest date we have data for is the EIA-860M which goes through 2022-09-01). I think this is because there's some monthly time series filling going on in the generation allocation process. Is this really what we want to do? It looks like all of the data for 2022 is NA because the EIA-860M doesn't contain any generation or fuel. It seems like ideally we wouldn't be writing that whole year of NA data into the DB.

Design Questions

  • The column energy_source_code_num is actually a string like energy_source_code_3 which is confusing. Maybe a name change?
  • These new table names are very long. Should we think of something else?
  • Right now only the gen_fuel_by_gen table is denormalized. Why is that? Should all of these outputs be denormalized to include plant & utility names, unit_id_pudl and other relevant IDs?
  • Use of utility_id_eia in the gen_fuel_by_gen_esc_owner table is confusing, since in this context (as in the ownership_eia860 table) the utility here is the owner, not the operator, and we may want to indicate that. Also, if we denormalize this table and bring in additional plant & utility information it'll have a name collision with the operating utility.
  • I didn't create a whole new table for the generation-only version of gen_fuel_by_gen since it's just a simple subset of the columns.
@zaneselvans zaneselvans added eia923 Anything having to do with EIA Form 923 output Exporting data from PUDL into other platforms or interchange formats. epic Any issue whose primary purpose is to organize other issues into a group. dagster Issues related to our use of the Dagster orchestrator sqlite Issues related to interacting with sqlite databases labels Mar 22, 2023
@zaneselvans zaneselvans added this to the 2023Q2 milestone Mar 22, 2023
@zaneselvans zaneselvans moved this from 🆕 New to 🔖 Backlog in Catalyst Megaproject Apr 3, 2023
@zaneselvans zaneselvans changed the title Convert derived EIA tables into Dagster assets Convert EIA generation and fuel allocations to Dagster Apr 13, 2023
@zaneselvans
Copy link
Member Author

@cmgosnell Do all of the generation_fuel_by_*_eia923 outputs only make sense when aggregated yearly or monthly? Like there's no "original" versions of them with all the records, correct?

@zaneselvans zaneselvans removed the epic Any issue whose primary purpose is to organize other issues into a group. label Apr 15, 2023
@zaneselvans zaneselvans linked a pull request Apr 15, 2023 that will close this issue
5 tasks
@zaneselvans zaneselvans self-assigned this Apr 15, 2023
@zaneselvans zaneselvans moved this from Backlog to In progress in Catalyst Megaproject Apr 15, 2023
@zaneselvans zaneselvans added the analysis Data analysis tasks that involve actually using PUDL to figure things out, like calculating MCOE. label Apr 15, 2023
@zaneselvans
Copy link
Member Author

Data Validation Test Failures

I think these are all expected, with the 1.446% difference due to the start_date / end_date issue mentioned in the issue body above. Just need to update the numbers and decide what to do with the date selection.

FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 432567 rows, expected 432570. Off by -0.001%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 30339 rows, expected 30340. Off by -0.003%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 46407 rows, expected 46408. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-fuel_cost-555119-46408] - ValueError: fuel_cost: found 46407 rows, expected 46408. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 432567 rows, expected 432570. Off by -0.001%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-mcoe-5171881-432602] - ValueError: mcoe: found 432599 rows, expected 432602. Off by -0.001%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 5096719 rows, expected 5171497. Off by -1.446%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 362369 rows, expected 362381. Off by -0.003%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 555107 rows, expected 555119. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-fuel_cost-555119-46408] - ValueError: fuel_cost: found 555107 rows, expected 555119. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 5096719 rows, expected 5171497. Off by -1.446%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-mcoe-5171881-432602] - ValueError: mcoe: found 5097103 rows, expected 5171881. Off by -1.446%, allowed margin of 0.000%

Full Integration Test Failures

These all look expected, except why is it complaining about PudlTabl.gen_fuel_by_generator_energy_source_eia923?

FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-gen_eia923-12.0-kwargs9] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-gen_fuel_by_generator_eia923-12.0-kwargs10] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-capacity_factor-12.0-kwargs15] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-mcoe-12.0-kwargs16] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_annual_eia_outputs[gen_fuel_by_generator_energy_source_eia923] - AttributeError: 'PudlTabl' object has no attribute 'gen_fuel_by_generator_energy_source_eia923'
FAILED test/integration/output_test.py::test_annual_eia_outputs[gen_fuel_by_generator_energy_source_owner_eia923] - AssertionError: Allocated net generation by owner can only be calculated annually. Got a frequency of: MS

@cmgosnell
Copy link
Member

There's very low coverage of the capacity_mw and utility_id_eia column in the gen_fuel_by_gen_esc_own table, but only when it's aggregated to monthly frequency. It turns out this is because the ownership information is being merged on only annually rather than broadcast / date-merged, so in the monthly case only the January records end up with utility_id_eia and capacity_mw values. Since utility_id_eia is part of the primary key, this is invalid. So we need to either only create this particular table with annual frequency (is it only used for the FERC 1 to EIA entity matching?) or need to do a date_merge to bring in the utility ID and capacity information. For the moment I'm skipping generating this asset at monthly resolution but that feels a little weird.

That sounds like a thing that needs a date_merge for sure!

About 0.6% of all the records are coming out with different values for energy_source_code_num between the existing PudlTabl and the new Dagster calculation. However, this only happens sometimes. The rest of the time they match up 100%. My hunch is that this is happening when there's a tie between multiple energy source codes that are being filled in, and the ordering of something determines which value ends up in slot 1 vs. 2. If that's the case then this doesn't matter (and it was probably happening before but we never noticed...). If it's something else... maybe it still doesn't matter, since it's only 0.6% of all records.

I'd need you to tell me more about what actually changed in regards to the inputs in the dagster version to give any suggestion here.

When testing that the net gen allocation outputs from PudlTabl were consistent with those from reading data out of the DB directly or asking Dagster to give me the table, I discovered that for the monthly allocated gen_fuel tables they were not! If not start_date or end_date are given when creating a PudlTabl object, it looks up the earliest and latest possible dates using pudl.helpers.get_working_dates_by_datasource(). The start_date and end_date are used to restrict the records that are read out of the DB. However, in the case of the monthly allocated gen_fuel tables there are some dates which are after the latest possible date (e.g. right now they run through 2022-12-01 even though the latest date we have data for is the EIA-860M which goes through 2022-09-01). I think this is because there's some monthly time series filling going on in the generation allocation process. Is this really what we want to do? It looks like all of the data for 2022 is NA because the EIA-860M doesn't contain any generation or fuel. It seems like ideally we wouldn't be writing that whole year of NA data into the DB.

I... I don't know. I think it is this way rn because it is not super complicated or finicky to always keep these eia860m dates. If we wanted to do something different, the end_date would be different for different tables which sounds more finicky than its worth imho.

@cmgosnell Do all of the generation_fuel_by_*_eia923 outputs only make sense when aggregated yearly or monthly? Like there's no "original" versions of them with all the records, correct?

only annual or monthly makes sense. I think I added a freq requirement before.

@zaneselvans
Copy link
Member Author

From discussion with @cmgosnell:

  • Do the date merge on the gen by owner table.
  • Don't worry about 0.6% of records with differently ordered energy_source_code_num values for now.
  • Only use the applicable overlapping years in the Net Gen Allocation process (so no null final EIA860M year)
  • Make a list of the net-gen things that seem like they should have identical names, and come back with proposal / questions.
  • Don't worry about denormalizing any of the net-gen outputs that aren't already denormalized.

@zaneselvans
Copy link
Member Author

zaneselvans commented Apr 28, 2023

The Naming of Things

The different aggregations / allocations of fuel and generation include:

  • Original (incomplete) generation by generator
    • Asset: denorm_generation_AGG_eia923
    • Method: PudlTabl.gen_original_eia923()
  • Original generation fuel, by plant, prime-mover, and energy source code.
    • Asset: denorm_generation_fuel_combined_AGG_eia923
    • Method: PudlTabl.gf_eia923()
  • Original boiler fuel, by plant, boiler ID, and fuel type.
    • Asset: denorm_boiler_fuel_AGG_eia923
    • Method: PudlTabl.bf_eia923()
  • Allocated generation and fuel by generator and energy source code.
    • Asset: generation_fuel_by_generator_energy_source_AGG_eia923
    • Method: PudlTabl.gen_fuel_by_generator_energy_source_eia923
    • Factory Function: gen_fuel_by_gen_esc()
  • Allocated generation and fuel by generator, energy source code, and owner
    • Asset: generation_fuel_by_generator_energy_source_owner_yearly_eia923
    • Method: PudlTabl.gen_fuel_by_generator_energy_source_owner_eia923()
    • Factory Function: gen_fuel_by_gen_esc_owner()
  • Allocated generation by generator
    • Asset:generation_fuel_by_generator_AGG_eia923
    • Method: PudlTabl.gen_fuel_by_generator_eia923()
    • Factory Function: gen_fuel_by_gen()

Other names:

  • In pudl.analysis.allocate_net_gen the variable gen_pm_fuel often used for dataframes reported by generator, prime mover, and fuel?
  • Also in pudl.analysis.allocate_net_gen there's both _fuel and _esc (energy source code) to indicate a dataframe that has energy_source_code as part of its PK.
  • In the Plant Parts code we find plant_prime_fuel or plant_prime_mover. Does plant_ always indicate that there is no generator ID present? Does plant_prime_fuel mean (plant_id_eia, prime_mover_code, energy_source_code)? How do we or should we differentiate between plant, prime mover, energy source vs. generator, energy source (where a combination of (report_date, plant_id_eia, generator_id) necessarily implies a particular prime_mover_code)?

@cmgosnell are there other instances of naming in the generation fuel universe that we should be thinking about? Does anything here look crazy or wrong?

Possible names

  • All of the allocated generation fuel tables are organized by generator ID, so maybe that aspect doesn't need to be in the name? Could we identify the differences only by the finest allocation granularity? We could have:
    • generation_fuel (the original, un-allocated data)
    • generation_by_generator
    • generation_fuel_by_energy_source_code
    • generation_fuel_by_owner

@zaneselvans zaneselvans moved this from In progress to In review in Catalyst Megaproject May 1, 2023
@jdangerx jdangerx moved this from In review to Backlog in Catalyst Megaproject Jun 5, 2023
@jdangerx jdangerx moved this from Backlog to In review in Catalyst Megaproject Jul 3, 2023
@zaneselvans zaneselvans moved this from In review to Done in Catalyst Megaproject Jul 17, 2023
@e-belfer
Copy link
Member

Closed by #2527.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analysis Data analysis tasks that involve actually using PUDL to figure things out, like calculating MCOE. dagster Issues related to our use of the Dagster orchestrator eia923 Anything having to do with EIA Form 923 output Exporting data from PUDL into other platforms or interchange formats. sqlite Issues related to interacting with sqlite databases
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants