Convert EIA generation and fuel allocations to Dagster #2435
Comments
@cmgosnell Do all of the |
Data Validation Test Failures
I think these are all expected, with the 1.446% difference due to the
Full Integration Test Failures
These all look expected, except why is it complaining about
|
That sounds like a thing that needs a
I'd need you to tell me more about what actually changed in regards to the inputs in the dagster version to give any suggestion here.
I... I don't know. I think it is this way rn because it is not super complicated or finicky to always keep these eia860m dates. If we wanted to do something different, the end_date would be different for different tables, which sounds more finicky than it's worth imho.
only annual or monthly makes sense. I think I added a freq requirement before. |
From discussion with @cmgosnell:
|
The Naming of Things
The different aggregations / allocations of fuel and generation include:
Other names:
@cmgosnell are there other instances of naming in the generation fuel universe that we should be thinking about? Does anything here look crazy or wrong?
Possible names
|
Closed by #2527. |
Most of the code being migrated will be in `pudl.analysis.allocate_net_gen`.

There are 3 versions of the net gen allocated table:

- `gen_fuel_by_generator_energy_source` is the primary table from which the others are derived.
- `gen_fuel_by_generator_eia923` is an aggregation of the above table.
- `gen_fuel_by_generator_energy_source_owner_eia923` is an allocation of it.

`gen_fuel_by_generator_eia923` is also used to provide a more complete filled version of `generation_eia923`.

The main functions that organize the creation of these tables are:
- `allocate_gen_fuel_by_generator_energy_source`
- `aggregate_gen_fuel_by_generator`
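As a rough illustration of how the per-generator table relates to the primary one, here is a minimal pandas sketch. It assumes the aggregation is a plain sum across energy sources within each generator and report date. The column names and toy values are illustrative assumptions, not the real PUDL schema, and this is not the actual `aggregate_gen_fuel_by_generator` implementation.

```python
import pandas as pd

# Toy stand-in for gen_fuel_by_generator_energy_source: one row per
# generator, energy source, and report date (hypothetical columns/values).
by_gen_esc = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1],
        "generator_id": ["A", "A", "B", "B"],
        "report_date": pd.to_datetime(["2020-01-01"] * 4),
        "energy_source_code": ["NG", "DFO", "NG", "DFO"],
        "net_generation_mwh": [90.0, 10.0, 45.0, 5.0],
        "fuel_consumed_mmbtu": [900.0, 100.0, 450.0, 50.0],
    }
)

# Summing across energy sources leaves one row per generator and date,
# which is the shape of the per-generator table in this sketch.
by_gen = by_gen_esc.groupby(
    ["plant_id_eia", "generator_id", "report_date"], as_index=False
)[["net_generation_mwh", "fuel_consumed_mmbtu"]].sum()
```

The owner-level table would go the other way, splitting each record across owners rather than collapsing rows.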
The input data is generated / handled by:
- `extract_input_tables`
- `standardize_input_frequency`
It looks like those 4 functions are the only ones that need to be converted to use Dagster. However they need to happen at both yearly and monthly frequency, so we probably want a factory of interconnected assets.
Tasks
Data Problems (halp! @cmgosnell 🧠)

- Null values show up in the `capacity_mw` and `utility_id_eia` columns in the `gen_fuel_by_gen_esc_own` table, but only when it's aggregated to monthly frequency. It turns out this is because the ownership information is being merged on only annually rather than broadcast / date-merged, so in the monthly case only the January records end up with `utility_id_eia` and `capacity_mw` values. Since `utility_id_eia` is part of the primary key, this is invalid. So we need to either only create this particular table with annual frequency (is it only used for the FERC 1 to EIA entity matching?) or do a `date_merge` to bring in the utility ID and capacity information. For the moment I'm skipping generating this asset at monthly resolution, but that feels a little weird.
- There are some mismatches in `energy_source_code_num` between the existing `PudlTabl` and the new Dagster calculation. However, this only happens sometimes. The rest of the time they match up 100%. My hunch is that this is happening when there's a tie between multiple energy source codes that are being filled in, and the ordering of something determines which value ends up in slot 1 vs. 2. If that's the case then this doesn't matter (and it was probably happening before but we never noticed...). If it's something else... maybe it still doesn't matter, since it's only 0.6% of all records.
- While checking whether the outputs from `PudlTabl` were consistent with those from reading data out of the DB directly or asking Dagster to give me the table, I discovered that for the monthly allocated `gen_fuel` tables they were not! If no `start_date` or `end_date` is given when creating a `PudlTabl` object, it looks up the earliest and latest possible dates using `pudl.helpers.get_working_dates_by_datasource()`. The `start_date` and `end_date` are used to restrict the records that are read out of the DB. However, in the case of the monthly allocated gen_fuel tables there are some dates which are after the latest possible date (e.g. right now they run through `2022-12-01` even though the latest date we have data for is the EIA-860M, which goes through `2022-09-01`). I think this is because there's some monthly time series filling going on in the generation allocation process. Is this really what we want to do? It looks like all of the data for 2022 is NA because the EIA-860M doesn't contain any generation or fuel. It seems like ideally we wouldn't be writing that whole year of NA data into the DB.

Design Questions
- `energy_source_code_num` is actually a string like `energy_source_code_3`, which is confusing. Maybe a name change?
- The `gen_fuel_by_gen` table is denormalized. Why is that? Should all of these outputs be denormalized to include plant & utility names, `unit_id_pudl` and other relevant IDs?
- `utility_id_eia` in the `gen_fuel_by_gen_esc_owner` table is confusing, since in this context (as in the `ownership_eia860` table) the utility here is the owner, not the operator, and we may want to indicate that. Also, if we denormalize this table and bring in additional plant & utility information it'll have a name collision with the operating utility.
- […] `gen_fuel_by_gen` since it's just a simple subset of the columns.
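To make the ownership merge problem from the Data Problems list concrete, here is a small pandas sketch. It contrasts merging annual ownership records onto monthly data by exact date with broadcasting them across the whole report year. The frames, column names, and values are hypothetical, and the year-based merge only approximates what a date merge would do; it is not PUDL's `date_merge` helper.

```python
import pandas as pd

keys = ["plant_id_eia", "generator_id"]

# Hypothetical monthly generation records for a single generator.
monthly = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1],
        "generator_id": ["A", "A", "A"],
        "report_date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
        "net_generation_mwh": [10.0, 12.0, 11.0],
    }
)

# Hypothetical annual ownership records, dated January 1 of the report year.
ownership = pd.DataFrame(
    {
        "plant_id_eia": [1, 1],
        "generator_id": ["A", "A"],
        "report_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
        "utility_id_eia": [100, 200],
        "fraction_owned": [0.75, 0.25],
    }
)

# Exact-date merge: only January finds a match, so later months get NA owners.
naive = monthly.merge(ownership, on=keys + ["report_date"], how="left")

# Broadcast merge: join on the report year so every month picks up the owners.
broadcast = (
    monthly.assign(report_year=monthly["report_date"].dt.year)
    .merge(
        ownership.assign(report_year=ownership["report_date"].dt.year).drop(
            columns="report_date"
        ),
        on=keys + ["report_year"],
        how="left",
    )
    .drop(columns="report_year")
)
```

In the naive result the February and March rows have no `utility_id_eia`, which is what makes the monthly primary key invalid; the broadcast result has one row per month per owner.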