Integrate greg M's PM code fix #2446

Merged
merged 23 commits into from
May 16, 2023

Conversation

cmgosnell
Member

@cmgosnell cmgosnell commented Mar 23, 2023

PR Overview

Final touches on @grgmiller's PR to fix the PM codes + several other bugs!

I'm going to suggest that we merge this in with the following scope and move the remaining tasks into #1113:

In scope:

Out-of-scope

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@grgmiller
Collaborator

@cmgosnell let me know what, if anything, you need from me to move this forward.

@codecov

codecov bot commented Mar 27, 2023

Codecov Report

Patch coverage: 100.0% and no project coverage change.

Comparison is base (a232c06) 86.9% compared to head (dfa3c03) 86.9%.

❗ Current head dfa3c03 differs from pull request most recent head 76037fa. Consider uploading reports for the commit 76037fa to get more accurate results

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2446   +/-   ##
=====================================
  Coverage   86.9%   86.9%           
=====================================
  Files         84      84           
  Lines       9604    9626   +22     
=====================================
+ Hits        8351    8374   +23     
+ Misses      1253    1252    -1     
Impacted Files Coverage Δ
src/pudl/analysis/allocate_net_gen.py 97.3% <100.0%> (+0.2%) ⬆️

... and 2 files with indirect coverage changes

@cmgosnell
Member Author

cmgosnell commented Mar 29, 2023

mkay... still not fully ready for review but I learned some things:

  • greg's addition of all the missing ESCs from the gf table entirely removed the need to use _allocate_unassociated_records after merging the gf table into the stacked generators in associate_generator_tables
  • i tried adding the bf table in the same fashion and it greatly reduced the number of records that needed to go through _allocate_unassociated_records. Most of those were records with 0's.
  • One thing I learned while doing this is that the Ooopsies warning (see below) is tripped by waayyyy more records when both _allocate_unassociated_records steps are removed. still need to investigate that. But all of the other metrics look better with this new setup.

(new as of 4/28) with unassociated pm steps

2023-04-28 14:59:25 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:367 The granular data tables contain 38.1% of the fuel and 38.9% of net generation in the higher-coverage generation_fuel_eia923 table.
2023-04-28 14:59:29 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1459 Distributing 0.1% annually reported records to months.
2023-04-28 15:02:22 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1459 Distributing 0.2% annually reported records to months.
2023-04-28 15:06:38 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:919 Associating and allocating 54762 (0.7%) records with unexpected prime_mover_code.
2023-04-28 15:07:07 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:919 Associating and allocating 16624 (0.2%) records with unexpected prime_mover_code.
2023-04-28 15:08:28 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1124 Ratio calc types: 
   All gens w/in generation table:  1372731#, 2.3e+08 MW
   Some gens w/in generation table: 15799#, 1.2e+06 MW
   No gens w/in generation table:   5929734#, 2.9e+08 MW
2023-04-28 15:08:37 [ WARNING] catalystcoop.pudl.analysis.allocate_net_gen:1805 Ooopsies. You got 36 records where the 'frac' column isn't adding up to 1 for each 'IDX_PM_ESC' group. Check 'make_allocation_frac()' 
2023-04-28 15:09:37 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1849 1.774% of records have are partially off from their 'IDX_PM_ESC' group
2023-04-28 15:09:38 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1861 gen v fuel table net gen diff:      38.8%
2023-04-28 15:09:38 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1865 new v fuel table net gen diff:      98.4%
2023-04-28 15:09:45 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1902 1.51% of generator records are more that 5% off from the net generation table
2023-04-28 15:09:47 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1240 Ratio calc types: 
   All gens w/in boiler fuel table:  1014430#, 1.8e+08 MW
   Some gens w/in boiler fuel table: 87665#, 1.3e+07 MW
   No gens w/in boiler fuel table:   6216169#, 3.2e+08 MW
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1960 net_generation_mwh: 1.1% of allocated plant/year's are off by more than 5%
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1971 net_generation_mwh: Min and max differnce are x-2.02 and x8.13
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1960 fuel_consumed_mmbtu: 1.1% of allocated plant/year's are off by more than 5%
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1971 fuel_consumed_mmbtu: Min and max differnce are x0.0 and x1.0
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1960 fuel_consumed_for_electricity_mmbtu: 1.1% of allocated plant/year's are off by more than 5%
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1971 fuel_consumed_for_electricity_mmbtu: Min and max differnce are x-0.65 and x1.0

with gf unassociated allocation step removed

2023-03-31 12:08:58 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:367 The granular data tables contains 38.1% of the fuel and 38.9% of net generation in the higher-coverage generation_fuel_eia923 table.
2023-03-31 12:09:01 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1448 Distributing 0.1% annually reported records to months.
2023-03-31 12:11:50 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1448 Distributing 0.2% annually reported records to months.
2023-03-31 12:16:25 [    INFO] catalystcoop.pudl.transform.classes:823 54.1% of records (8997 rows) contain only {0.0, nan, <NA>} values in required columns. Dropped these 💩💩💩 records.
2023-03-31 12:16:25 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:908 Associating and allocating 6506 (0.1%) records with unexpected energy_source_code.
2023-03-31 12:21:26 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1113 Ratio calc types: 
   All gens w/in generation table:  1372763#, 2.3e+08 MW
   Some gens w/in generation table: 15767#, 1.2e+06 MW
   No gens w/in generation table:   5984955#, 2.9e+08 MW
2023-03-31 12:22:22 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1837 1.024% of records have are partially off from their 'IDX_PM_ESC' group
2023-03-31 12:22:22 [ WARNING] catalystcoop.pudl.analysis.allocate_net_gen:1846 55221 records have no capacity or net gen
2023-03-31 12:22:22 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1849 gen v fuel table net gen diff:      38.8%
2023-03-31 12:22:22 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1853 new v fuel table net gen diff:      97.0%
2023-03-31 12:22:28 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1888 1.50% of generator records are more that 5% off from the net generation table
2023-03-31 12:22:30 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1229 Ratio calc types: 
   All gens w/in boiler fuel table:  1015372#, 1.8e+08 MW
   Some gens w/in boiler fuel table: 93778#, 1.4e+07 MW
   No gens w/in boiler fuel table:   6264335#, 3.2e+08 MW
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1946 net_generation_mwh: 1.5% of allocated plant/year's are off by more than 5%
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1957 net_generation_mwh: Min and max differnce are x-2.62 and x9.56
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1946 fuel_consumed_mmbtu: 1.5% of allocated plant/year's are off by more than 5%
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1957 fuel_consumed_mmbtu: Min and max differnce are x-0.0 and x1.0
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1946 fuel_consumed_for_electricity_mmbtu: 1.4% of allocated plant/year's are off by more than 5%
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1957 fuel_consumed_for_electricity_mmbtu: Min and max differnce are x-0.0 and x1.0

with both unassociated allocation steps removed

2023-03-29 15:14:09 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:369 The granular data tables contain 38.1% of the fuel and 38.9% of net generation in the higher-coverage generation_fuel_eia923 table.
2023-03-29 15:14:11 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1460 Distributing 0.1% annually reported records to months.
2023-03-29 15:16:20 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1460 Distributing 0.2% annually reported records to months.
2023-03-29 15:18:12 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1630 Index(['plant_id_eia', 'report_date', 'prime_mover_code', 'energy_source_code', 'num'], dtype='object')
2023-03-29 15:20:53 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1125 Ratio calc types: 
   All gens w/in generation table:  1378884#, 2.3e+08 MW
   Some gens w/in generation table: 25784#, 1.9e+06 MW
   No gens w/in generation table:   5985441#, 2.9e+08 MW
2023-03-29 15:21:00 [ WARNING] catalystcoop.pudl.analysis.allocate_net_gen:1808 Ooopsies. You got 1301 records where the 'frac' column isn't adding up to 1 for each 'IDX_PM_ESC' group. Check 'make_allocation_frac()'
2023-03-29 15:21:45 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1850 1.063% of records have are partially off from their 'IDX_PM_ESC' group
2023-03-29 15:21:45 [ WARNING] catalystcoop.pudl.analysis.allocate_net_gen:1859 71845 records have no capacity or net gen
2023-03-29 15:21:45 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1862 gen v fuel table net gen diff:      38.8%
2023-03-29 15:21:45 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1866 new v fuel table net gen diff:      97.0%
2023-03-29 15:21:51 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1901 1.49% of generator records are more that 5% off from the net generation table
2023-03-29 15:21:53 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1241 Ratio calc types: 
   All gens w/in boiler fuel table:  1030024#, 1.8e+08 MW
   Some gens w/in boiler fuel table: 89255#, 1.3e+07 MW
   No gens w/in boiler fuel table:   6270830#, 3.2e+08 MW
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1959 net_generation_mwh: 1.5% of allocated plant/year's are off by more than 5%
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1970 net_generation_mwh: Min and max differnce are x-2.62 and x9.56
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1959 fuel_consumed_mmbtu: 1.6% of allocated plant/year's are off by more than 5%
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1970 fuel_consumed_mmbtu: Min and max differnce are x-0.0 and x1.0
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1959 fuel_consumed_for_electricity_mmbtu: 1.5% of allocated plant/year's are off by more than 5%
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1970 fuel_consumed_for_electricity_mmbtu: Min and max differnce are x-0.0 and x1.0

last-ish nightly log

2023-03-24 14:27:08 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1450 Distributing 0.1% annually reported records to months.
2023-03-24 14:34:09 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1450 Distributing 0.2% annually reported records to months.
2023-03-24 14:40:54 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:910 Associating and allocating 225389 (3.2%) records with unexpected prime_mover_code.
2023-03-24 14:41:52 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:910 Associating and allocating 64345 (0.9%) records with unexpected energy_source_code.
2023-03-24 14:44:17 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1121 Ratio calc types: 
   All gens w/in generation table:  1203491#, 2e+08 MW
   Some gens w/in generation table: 18563#, 1.3e+06 MW
   No gens w/in generation table:   5611085#, 2.6e+08 MW
2023-03-24 14:44:30 [ WARNING] catalystcoop.pudl.analysis.allocate_net_gen:1724 Ooopsies. You got 26 records where the 'frac' column isn't adding up to 1 for each 'IDX_PM_ESC' group. Check 'make_allocation_frac()'
2023-03-24 14:45:36 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1766 1.949% of records have are partially off from their 'IDX_PM_ESC' group
2023-03-24 14:45:37 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1778 gen v fuel table net gen diff:      38.8%
2023-03-24 14:45:37 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1782 new v fuel table net gen diff:      96.8%
2023-03-24 14:45:48 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1817 0.90% of generator records are more that 5% off from the net generation table
2023-03-24 14:45:50 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1234 Ratio calc types: 
   All gens w/in boiler fuel table:  967980#, 1.8e+08 MW
   Some gens w/in boiler fuel table: 63060#, 7.9e+06 MW
   No gens w/in boiler fuel table:   5802099#, 2.8e+08 MW
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1875 net_generation_mwh: 2.2% of allocated plant/year's are off by more than 5%
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1886 net_generation_mwh: Min and max differnce are x-2.02 and x8.13
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1875 fuel_consumed_mmbtu: 5.9% of allocated plant/year's are off by more than 5%
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1886 fuel_consumed_mmbtu: Min and max differnce are x0.0 and x6.0
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1875 fuel_consumed_for_electricity_mmbtu: 5.9% of allocated plant/year's are off by more than 5%
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1886 fuel_consumed_for_electricity_mmbtu: Min and max differnce are x0.0 and x6.0
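
For readers following the "Ooopsies" warning in the logs above, here is a minimal sketch of the kind of check it describes: summing the 'frac' column within each group and flagging the groups that do not total 1. The column list standing in for IDX_PM_ESC, the helper name find_bad_frac_groups, the tolerance, and the toy data are all assumptions for illustration; the actual check in allocate_net_gen.py may be written differently.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the index columns behind IDX_PM_ESC.
IDX_PM_ESC = ["plant_id_eia", "report_date", "prime_mover_code", "energy_source_code"]


def find_bad_frac_groups(df: pd.DataFrame, tol: float = 1e-6) -> pd.DataFrame:
    """Return the rows that belong to a group whose 'frac' values do not sum to ~1."""
    frac_sum = df.groupby(IDX_PM_ESC)["frac"].transform("sum")
    ok = np.isclose(frac_sum.to_numpy(), 1.0, atol=tol)
    return df.loc[~ok]


# Toy data: plant 1 sums to 1.0, plant 2 only sums to 0.75.
toy = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 2],
        "report_date": ["2019-01-01"] * 4,
        "prime_mover_code": ["ST", "ST", "GT", "GT"],
        "energy_source_code": ["NG", "NG", "NG", "NG"],
        "generator_id": ["a", "b", "a", "b"],
        "frac": [0.25, 0.75, 0.5, 0.25],
    }
)
bad = find_bad_frac_groups(toy)
```

In this toy case only the two plant 2 rows are returned, which is the sort of record count the warning reports.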

Comment on lines 625 to 632
indicator=True, # used in _allocate_unassociated_records to find unassocited
)
.pipe(remove_inactive_generators)
.pipe(
_allocate_unassociated_records,
idx_cols=IDX_PM_ESC,
col_w_unexpected_codes="prime_mover_code",
data_columns=[f"{col}_gf_tbl" for col in DATA_COLUMNS],
)
.drop(columns=["_merge"]) # drop do we can do this again in the bf_summed merge
Member Author

applying the treatment in add_missing_energy_source_codes_to_gens resulted in 0 records from the gf table (which was merged in just above this chunk) being "unassociated". which is to say all of the records from gf had matching IDX_PM_ESC in the updated generators table.

I considered either: a) keeping this in here just in case future records are unassociated OR b) raising an assertion or warning if any records are unassociated.

Collaborator

I'm wondering why this is only an issue for the bf table now and not gf? I'm wondering if it would be worth keeping it just in case?

Member

I think we should probably raise an exception if we encounter unassociated records - we don't expect any to show up, now that we've added the missing ESCs, and so I'd like it to fail loudly when something does sneak through!

Member Author

after adding an assertion that failed, i added this step back in. I'm a little confused bc when I was first testing this there seemed to be no unassociated gf data after the new missing-esc treatment. but it definitely seems like there are still unassociated records, but the number dropped dramatically. so i've added the step back in!
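
The "fail loudly" idea discussed in this thread can be sketched with the same `indicator=True` merge flag used in the snippet under review. This is an illustration only: the helper names find_unassociated and check_no_unassociated, the column list standing in for IDX_PM_ESC, and the toy tables are assumptions, not the code that was tried in the PR.

```python
import pandas as pd

# Assumed stand-in for the index columns behind IDX_PM_ESC.
IDX_PM_ESC = ["plant_id_eia", "report_date", "prime_mover_code", "energy_source_code"]


def find_unassociated(gens: pd.DataFrame, gf: pd.DataFrame) -> pd.DataFrame:
    """Return the gf records that have no matching generator record."""
    merged = gens.merge(gf, on=IDX_PM_ESC, how="outer", indicator=True)
    return merged[merged["_merge"] == "right_only"]


def check_no_unassociated(gens: pd.DataFrame, gf: pd.DataFrame) -> None:
    """Raise instead of silently allocating when unassociated records show up."""
    unassociated = find_unassociated(gens, gf)
    if not unassociated.empty:
        raise ValueError(f"{len(unassociated)} gf records have no matching generator.")


gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1],
        "report_date": ["2019-01-01"] * 2,
        "prime_mover_code": ["ST", "ST"],
        "energy_source_code": ["NG", "NG"],
        "generator_id": ["a", "b"],
    }
)
gf = pd.DataFrame(
    {
        "plant_id_eia": [1, 1],
        "report_date": ["2019-01-01"] * 2,
        "prime_mover_code": ["ST", "ST"],
        "energy_source_code": ["NG", "DFO"],  # DFO has no generator match
        "net_generation_mwh_gf_tbl": [100.0, 5.0],
    }
)
unassociated = find_unassociated(gens, gf)
```

Here the DFO record is the only one flagged, so check_no_unassociated(gens, gf) would raise. As the thread notes, such a check did fail on the real data, which is why the allocation step was kept.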

Comment on lines 863 to 860
- def _allocate_unassociated_records(
+ def _allocate_unassociated_bf_records(
      gen_assoc: pd.DataFrame,
      idx_cols: list[str],
      col_w_unexpected_codes: Literal["energy_source_code", "prime_mover_code"],
      data_columns: list[str],
  ) -> pd.DataFrame:
-     """Associate unassociated gen_fuel table records on idx_cols.
+     """Associate unassociated :ref:`boiler_fuel_eia923` table records on idx_cols.
Member Author

several aspects of this function were designed for either the "energy_source_code" or the "prime_mover_code" being unassociated. But upon further reflection and in-notebook dissecting, we were using this to merge in the unassociated records from the gf table and then the bf table. I converted it to be a little more tailored to the bf table.

Comment on lines 898 to 903
.pipe(
pudl.transform.classes.drop_invalid_rows,
pudl.transform.classes.InvalidRows(
required_valid_cols=data_columns, invalid_values=[pd.NA, np.nan, 0.0]
),
)
Member Author

~ half of all of these unconnected records were 0's or nulls. this whole function is here to ensure we don't lose fuel in the association process. And 0 fuel isn't really fuel for this purpose. it was easy to reuse the drop_invalid_rows transformer we made for the ferc1 transformers

Collaborator

I'm wondering if we do want to keep zero values here. A zero is still data about what was being consumed, and I'm wondering if this could be valuable. Have you tried running it both ways to see if the outputs differ (if you keep 0s in this step)?

Member

What's the benefit to dropping the 0s vs keeping them? In past environments I've run into lots of problems when we don't distinguish between "we don't know what happened" and "we do know what happened - it was nothing." But I know much less about the actual data + use cases than either of you two.

Member Author

I removed them in this case because these are the unassociated records and later in _allocate_unassociated_data_col they will be added to the associated records. so adding a zero to a value will result in no change. _allocate_unassociated_data_col already does nothing if the unassociated value is null because you can't add a null to a value.

But I do think there is some merit in keeping the zeros in because we do the following: og_value.fillna(0) + unassociated_value so this will result in more 0's in the case of the og_value being null.
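
To make the zero-versus-null point concrete, here is a small sketch of the expression quoted above applied to toy values. Only the expression `og_value.fillna(0) + unassociated_value` comes from the comment; the series contents are made up, and the surrounding logic of _allocate_unassociated_data_col is not reproduced.

```python
import numpy as np
import pandas as pd

# Toy values: one known original value and two null originals.
og_value = pd.Series([10.0, np.nan, np.nan])
# Unassociated values: two reported zeros and one null.
unassociated_value = pd.Series([0.0, 0.0, np.nan])

# The expression quoted in the comment above.
combined = og_value.fillna(0) + unassociated_value
# combined is [10.0, 0.0, NaN]:
#   row 0: adding a zero leaves a known value unchanged
#   row 1: a null original plus a reported zero becomes a reported 0
#   row 2: a null unassociated value leaves the result null
```

Row 1 is the case at issue: keeping the zero turns "we don't know" into "it was nothing", whereas dropping zero-valued unassociated records beforehand would leave that original value null.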

Comment on lines -1156 to 1153
- frac_from_g_tbl=lambda x: x.net_generation_mwh_g_tbl_pm_fuel
- / x.net_generation_mwh_gf_tbl,
+ frac_from_g_tbl=lambda x: np.where(
+     (x.net_generation_mwh_g_tbl_pm_fuel / x.net_generation_mwh_gf_tbl) < 1,
+     (x.net_generation_mwh_g_tbl_pm_fuel / x.net_generation_mwh_gf_tbl),
+     1,
+ ),
  # for records within these mix groups that do have net gen in the
Member Author

this was a greg addition (+ the mirrored frac_from_bf_tbl version that follows). A great update: it ensures we never get a fraction over 100% to use for allocation, so we don't over-allocate.
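
A quick way to see what the capped expression does is to run it on a few toy rows. The column names and the `np.where` expression follow the diff above; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "net_generation_mwh_g_tbl_pm_fuel": [50.0, 120.0, np.nan],
        "net_generation_mwh_gf_tbl": [100.0, 100.0, 100.0],
    }
)

capped = toy.assign(
    frac_from_g_tbl=lambda x: np.where(
        (x.net_generation_mwh_g_tbl_pm_fuel / x.net_generation_mwh_gf_tbl) < 1,
        (x.net_generation_mwh_g_tbl_pm_fuel / x.net_generation_mwh_gf_tbl),
        1,
    )
)
# capped["frac_from_g_tbl"] is [0.5, 1.0, 1.0]
```

The second row shows the intended cap (1.2 becomes 1). The third row shows a side effect worth keeping in mind: a null ratio fails the `< 1` comparison, so it also falls through to 1 under this expression.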

@@ -1598,6 +1596,80 @@ def adjust_energy_source_codes(
return gens


def add_missing_energy_source_codes_to_gens(gens_at_freq, gf, bf):
Member Author

this was the core of greg's update. all in all a great simplify-er. bc generators can have many esc & sometimes the data tables have more and different esc's, this function identifies missing esc's and adds them into the gens table.
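
As a rough illustration of the idea described here (not the PR's implementation), the sketch below finds plant / prime mover / ESC combinations that appear in a data table but not in a long-format generators table, then adds them. The helper names find_missing_escs and add_missing_escs, the long-format layout, and the choice to attach each missing ESC to every generator sharing the plant and prime mover are all assumptions.

```python
import pandas as pd

KEYS = ["plant_id_eia", "prime_mover_code", "energy_source_code"]


def find_missing_escs(gens_long: pd.DataFrame, data_tbl: pd.DataFrame) -> pd.DataFrame:
    """Return key combinations present in data_tbl but absent from gens_long."""
    merged = data_tbl[KEYS].drop_duplicates().merge(
        gens_long[KEYS].drop_duplicates(), on=KEYS, how="left", indicator=True
    )
    return merged.loc[merged["_merge"] == "left_only", KEYS].reset_index(drop=True)


def add_missing_escs(gens_long: pd.DataFrame, data_tbl: pd.DataFrame) -> pd.DataFrame:
    """Append one row per missing ESC for each generator with the same plant and PM."""
    missing = find_missing_escs(gens_long, data_tbl)
    gen_ids = gens_long[
        ["plant_id_eia", "prime_mover_code", "generator_id"]
    ].drop_duplicates()
    new_rows = missing.merge(gen_ids, on=["plant_id_eia", "prime_mover_code"])
    return pd.concat([gens_long, new_rows], ignore_index=True)


# Toy tables: two ST generators that only list RC, and a gf table with DFO and SUB too.
gens_long = pd.DataFrame(
    {
        "plant_id_eia": [8023, 8023],
        "prime_mover_code": ["ST", "ST"],
        "generator_id": ["1", "2"],
        "energy_source_code": ["RC", "RC"],
    }
)
gf = pd.DataFrame(
    {
        "plant_id_eia": [8023, 8023, 8023],
        "prime_mover_code": ["ST", "ST", "ST"],
        "energy_source_code": ["DFO", "RC", "SUB"],
        "net_generation_mwh": [3369.286, 5363193.71, 10000.0],
    }
)
missing = find_missing_escs(gens_long, gf)
updated = add_missing_escs(gens_long, gf)
```

With the extra ESCs present in the generators table, a later merge on plant, prime mover and ESC has a generator record to land on for every gf record in this toy case.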

return gens_at_freq


def identify_missing_gf_escs_in_gens(gens_at_freq, gf, bf):
Member Author

the one thing I changed in here was adding the bf table. I also removed the dropping of the zeros. I don't fully understand why but removing that led to the output metrics being slightly better. I figured it didn't hurt to have more esc's in the gens table so I didn't deeply investigate why.

Collaborator

Hmm it's been a while since I've looked at this so I forget exactly why I dropped the zeros, but it might have had something to do with escs associated with proposed or retired generators? What were the output metrics that improved?

Also, with the addition of the bf table and removing the zero dropping, you might need to update the docstring and inline comments.

Member Author

@cmgosnell cmgosnell Apr 28, 2023

I could definitely imagine that a lot of these zero-data ESCs are associated with non-operating generators, but that should be covered by your remove_inactive_generators. I tried turning this on and off and it seemed like we were getting many more unassociated records when we dropped the nulls, so I dropped the dropping.

for metrics, I've mostly been looking at the indicators in the logs. I've kept a little running list of the logs for the different configs here. Although I don't think I saved the outputs from dropping these nulls vs not.

These updates overall have decreased the # of records that are outside of tolerance, which is great.

@cmgosnell cmgosnell marked this pull request as ready for review March 31, 2023 17:41
@cmgosnell cmgosnell requested review from jdangerx and grgmiller March 31, 2023 17:41
@cmgosnell
Member Author

@grgmiller and @jdangerx ! I finally got enough cycles of turning knobs and checking the outputs to feel like this is good and ready. Greg, I made a few preemptive comments on the PR to highlight where I slightly tweaked things you did. The tl;dr version is:

  • I added the boiler fuel table into identify_missing_gf_escs_in_gens (i am now realizing I should change that func name)....
  • I removed the _allocate_unassociated_records after we merge in the gf table bc it was no longer necessary
  • I converted the _allocate_unassociated_records -> _allocate_unassociated_bf_records and am now doing this on the basis of prime_mover_code

@zaneselvans
Member

zaneselvans commented Mar 31, 2023

This PR would close issue #2226 and also supersedes PR #2235 right?

  idx_cols=IDX_GENS_PM_ESC,
- col_w_unexpected_codes="energy_source_code",
+ col_w_unexpected_codes="prime_mover_code",
Collaborator

So here you are focused on matching unassociated prime movers, but it seems like the _allocate_unassociated_bf_records also has some code related to energy source codes? If this function is no longer being used to associate both prime movers and energy source codes, I'm wondering if it should be tailored to specifically associate prime mover codes in the bf table?

Member

I'm also curious about this!

Member Author

I believe some of the energy_source_code_num stuff in the function was to aid in associating the non-matching ESC record's data to only the primary energy source code record instead of sloshing it across the full plant.

But you are very right that this should be removed. I tested it a few times and tailoring this to the PM-codes only seems to work great.

Member

@jdangerx jdangerx left a comment

I have no blocking concerns! I think some more tests would be nice, but sounds like you're planning on adding them in a separate PR - which is fine, though if there's a chance you'd end up punting on those changes I'd like to see a couple more tests showing example inputs/outputs. Otherwise I'm worried we'll run into the "whoops, put this down for a couple months and the behavior's all wonky" situation again.

Before I approve, though, I'd want to hop on a call and just put the code through its paces by hand with you - I was having some trouble getting it all to run in an ipython notebook, and couldn't get the test case to hit the _allocate_unassociated_bf_records with any unassociated BF records. I was sort of banging my head against that for a bit but figured a call would be faster!

@@ -320,7 +318,7 @@ def test_missing_energy_source():
"""report_date,plant_id_eia,energy_source_code,prime_mover_code,net_generation_mwh,fuel_consumed_mmbtu,fuel_consumed_for_electricity_mmbtu
2019-01-01,8023,DFO,ST,3369.286,35566.0,35566.0
2019-01-01,8023,RC,ST,5363193.71,56777578.0,56777578.0
- 2019-01-01,8023,SUB,ST,0.0, 0.0,0.0
+ 2019-01-01,8023,SUB,ST,10000.0, 100000.0,100000.0
Member

What happens if you leave this at 0?

@jdangerx
Member

Take-aways from call with @cmgosnell last week:

  • there's some funkiness where if you have an unknown prime mover code in the boiler_fuel table, it gets dropped before we get a chance to allocate it
  • before merging this in, we should add a few more tests that basically document the current expected behavior of the code - it's easier to understand what "should" be going on with toy datasets. @cmgosnell was going to maybe take a crack at writing a few of these test cases. Let me know if you want help!

@cmgosnell cmgosnell requested a review from jdangerx April 28, 2023 19:43
Member

@jdangerx jdangerx left a comment

I'm glad you were able to pull out the ESC stuff from the _allocate_unassociated... function!

The new tests test some good behavior also, sweet! I think we can make them a bit more maintainable/clearer, so that when we pick this up again some months from now we aren't kicking ourselves. I'll spend a bit of time on some concrete suggestion code and see where that goes.

edit: here's a diff!

@@ -271,7 +271,8 @@ def test_allocated_sums_match(example_1_pudl_tabl):
 # )


-def test_missing_energy_source():
+@pytest.fixture
+def example_2_pudl_tabl():
Member

Pulling this out into a fixture makes a lot of sense!

I think the new tests make the example_1 tests basically redundant: the example_2 tests cover the allocation sums and the fuel ratios too. The only catch is that example_2 already has the bonus ESCs, so it tests a slightly less happy path. One thing we could do is use the following structure:

  • nix example_1
  • rename example_2 as one_plant_happy_path or something, and don't put in the extra ESCs or PM codes just yet
  • make new fixtures that add the extra ESCs or PM codes (extra_esc_in_gf, extra_pm_in_bf)
  • use the fixtures in a parameterized test that checks that the sums all make sense
  • use the fixtures in a parameterized test that checks that the ratios are correct for the happy path & the extra_esc_in_gf case
  • add a separate SPECIAL TEST for the weird PM code behavior we found

I'll put a bit of time into trying that structure out to see if that works!
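
As a rough sketch of what that structure might look like (not the diff that was actually shared): `allocate_by_capacity` is a hypothetical toy stand-in for the real allocation, and the table contents are made up apart from a few values borrowed from the test CSV. Plain builder functions are parameterized here instead of `@pytest.fixture` objects, which is one assumed way to get the same effect:

```python
import pytest


def allocate_by_capacity(net_gen_mwh, capacities_mw):
    """Toy stand-in for the real allocation: split generation by capacity share."""
    total_capacity = sum(capacities_mw.values())
    return {
        gen: net_gen_mwh * cap / total_capacity for gen, cap in capacities_mw.items()
    }


def one_plant_happy_path():
    # Hypothetical inputs: generation by energy source code, plus generator capacities.
    return {
        "gf": {"DFO": 3369.286, "RC": 5363193.71},
        "capacities_mw": {"gen_1": 100.0, "gen_2": 300.0},
    }


def extra_esc_in_gf():
    # Same plant, plus an energy source code that only shows up in gf.
    tables = one_plant_happy_path()
    tables["gf"]["SUB"] = 10000.0
    return tables


@pytest.mark.parametrize("make_tables", [one_plant_happy_path, extra_esc_in_gf])
def test_allocated_sums_match(make_tables):
    # Whatever the input variant, allocated generation should sum to the reported total.
    tables = make_tables()
    for net_gen in tables["gf"].values():
        allocated = allocate_by_capacity(net_gen, tables["capacities_mw"])
        assert sum(allocated.values()) == pytest.approx(net_gen)
```

The weird PM code behavior would then live in its own separately named test rather than as another parameter, so its unusual expectations stay visible.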

Member Author

these are great suggestions! I like the option you seemed to land on in the diff you shared with parameterizing the various individual tests and pulling out a special test for the PM code problem.

return ratio_bf, ratio_allocated


def test_missing_energy_source(example_2_pudl_tabl):
Member

This test is sort of testing two different things:

  • are we adding the ESCs in the data cleaning step?
  • are we using the new ESCs correctly?

IMO testing the first bullet is kind of useful but more of an implementation detail - what we care about is defining the behavior we expect from the main interface, and I don't necessarily want to write tests that will fail when we make a change that doesn't actually affect the behavior we care about.

That being said, I can see the value in calling out this specific step in tests. In any case, we should probably split this up into two different tests & maybe only keep one. Here's one way to do it...

diff --git a/test/unit/analysis/allocate_net_gen_test.py b/test/unit/analysis/allocate_net_gen_test.py
index 6565ecc1..1d4c361a 100644
--- a/test/unit/analysis/allocate_net_gen_test.py
+++ b/test/unit/analysis/allocate_net_gen_test.py
@@ -369,12 +369,15 @@ def get_ratio_from_bf_and_allocated_by_boiler(
     return ratio_bf, ratio_allocated
 
 
-def test_missing_energy_source(example_2_pudl_tabl):
-    gf, bf, bga, gens, _ = allocate_net_gen.extract_input_tables(example_2_pudl_tabl)
+def test_add_missing_energy_source(example_2_pudl_tabl):
+    gf, bf, _, gens, _ = allocate_net_gen.extract_input_tables(example_2_pudl_tabl)
     gens = allocate_net_gen.add_missing_energy_source_codes_to_gens(gens, gf, bf)
     # assert that the missing energy source code is RC
     assert gens.energy_source_code_8.unique() == "RC"
 
+
+def test_allocate_missing_energy_source(example_2_pudl_tabl):
+    _, bf, bga, _, _ = allocate_net_gen.extract_input_tables(example_2_pudl_tabl)
     allocated = allocate_net_gen.allocate_gen_fuel_by_generator_energy_source(
         example_2_pudl_tabl
     )

Member Author

I believe I added these various stages in part to make it easier to pinpoint possible errors in this allocation process. Bc the data goes through so many stages, a time-suck in working with this module is often just figuring out where something went wrong. So while I 100% agree that the most important thing to test is the overall behavior, testing the stages feels like a helpful addition imo.

assert ratio_bf == ratio_allocated


def test_missing_pm_code_in_bf(example_2_pudl_tabl):
Member

Similar comment as above - this looks like two separate tests as well. Because it's documenting some weird behavior, I'd probably keep both.

.pipe(remove_inactive_generators)
.pipe(
Member

non-blocking: I remember you mentioning that some weird data slipped through without this. What was going on, and why did double-applying the _allocate_unassociated_pm_records method fix it?

Member Author

i had originally thought the addition of greg's "add all the ESCs" methodology would possibly eliminate the need for this step of associating the unassociated records after this gf merge. I thought I had tested it and determined that this was indeed the case, but after this thread, I added an assertion back in to make sure, and sure enough there were a small number of records that were unassociated.

Bc we have all of the ESCs, I knew it couldn't be the ESCs that were mismatched. The other merge key here is the PM code, so I applied _allocate_unassociated_pm_records.
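
To illustrate that reasoning with a toy case (plain Python sets, not the module's real merge; the keys, the `GT` record, and both helper functions are hypothetical): when every ESC at the plant is already present on the generator side, any leftover record must be failing on the other merge key.

```python
# Hypothetical merge keys: (plant_id_eia, prime_mover_code, energy_source_code).
gf_keys = {(8023, "ST", "DFO"), (8023, "ST", "RC"), (8023, "GT", "DFO")}
gens_keys = {(8023, "ST", "DFO"), (8023, "ST", "RC")}


def find_unassociated(gf_keys, gens_keys):
    """Return gf keys that have no exact match among the generator keys."""
    return sorted(gf_keys - gens_keys)


def mismatch_cause(key, gens_keys):
    """Guess which merge key kept a gf record from matching any generator."""
    plant, pm_code, esc = key
    escs_at_plant = {k[2] for k in gens_keys if k[0] == plant}
    pms_at_plant = {k[1] for k in gens_keys if k[0] == plant}
    if esc not in escs_at_plant:
        return "energy_source_code"
    if pm_code not in pms_at_plant:
        return "prime_mover_code"
    return "combination"


unassociated = find_unassociated(gf_keys, gens_keys)
causes = [mismatch_cause(key, gens_keys) for key in unassociated]
```

Here the only leftover record has an ESC the generators already carry, so the check attributes it to the prime mover code.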

@cmgosnell
Member Author

thank you for these suggestions @jdangerx !! this may be a basic question, but is there a way to directly integrate your gist diffs that you shared?? I like the way you broke them out and am inclined to use your suggestions directly but i've never really played with gists.

@jdangerx
Member

jdangerx commented May 4, 2023

@cmgosnell I'm glad you like the changes! You should be able to download the file and then run git apply <file> to apply the patch.

@zaneselvans zaneselvans self-requested a review May 16, 2023 19:25
@zaneselvans zaneselvans merged commit 50d83af into dev May 16, 2023
@zaneselvans zaneselvans deleted the pm_code_fix branch May 16, 2023 19:28
Successfully merging this pull request may close these issues.

allocate_net_gen dropping data when energy source codes don't match between gens and gf
4 participants