Integrate greg M's PM code fix #2446

cmgosnell · 2023-03-23T15:52:50Z

PR Overview

Final touches on @grgmiller's PR to fix the PM codes + several other bugs!

I'm going to suggest that we merge this in with the following scope and move the remaining tasks into #1113:

In scope:


Remove the now duplicated _allocate_unassociated_records for the energy source codes

Out-of-scope

Add more tests & debug errors (will flesh this out in Track down and fix fuel consumption allocation #1113 )

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures
Make sure you've included good docstrings.
For major data coverage & analysis changes, run data validation tests
Include unit tests for new functions and classes.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.


…ode_fix

grgmiller · 2023-03-23T17:03:50Z

@cmgosnell let me know what, if anything, you need from me to move this forward.

codecov · 2023-03-27T15:24:19Z

Codecov Report

Patch coverage: 100.0% and no project coverage change.

Comparison is base (a232c06) 86.9% compared to head (dfa3c03) 86.9%.

❗ Current head dfa3c03 differs from pull request most recent head 76037fa. Consider uploading reports for the commit 76037fa to get more accurate results

Additional details and impacted files

@@          Coverage Diff          @@
##             dev   #2446   +/-   ##
=====================================
  Coverage   86.9%   86.9%           
=====================================
  Files         84      84           
  Lines       9604    9626   +22     
=====================================
+ Hits        8351    8374   +23     
+ Misses      1253    1252    -1

Impacted Files	Coverage Δ
src/pudl/analysis/allocate_net_gen.py	`97.3% <100.0%> (+0.2%)`	⬆️

... and 2 files with indirect coverage changes


…g out of missing escs

cmgosnell · 2023-03-29T21:18:48Z

mkay... still not fully ready for review but I learned some things:

greg's addition of all the missing ESCs from the gf table entirely removed the need to use _allocate_unassociated_records after merging the gf table into the stacked generators in associate_generator_tables
i tried adding the bf table in the same fashion and it greatly reduced the number of records that needed to go through _allocate_unassociated_records . Most of those were records with 0's.
One thing I learned while doing this is that the Ooopsies warning (see below) trips on waaay more records when both _allocate_unassociated_records steps are removed (1301 records vs. 26 in the nightly log). Still need to investigate that. But all of the other metrics look better with this new setup.
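For readers who don't have the module open, the kind of check behind the Ooopsies warning can be sketched roughly as follows. This is a toy illustration under assumptions, not the code in allocate_net_gen.py: the 'frac' column and the 'IDX_PM_ESC' grouping are taken from the log message, the list of index columns is a guess at what that constant contains, and the helper name find_bad_frac_groups and the tolerance are made up.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for IDX_PM_ESC; the real constant lives in allocate_net_gen.py.
IDX_PM_ESC = ["plant_id_eia", "report_date", "prime_mover_code", "energy_source_code"]


def find_bad_frac_groups(df: pd.DataFrame, tol: float = 1e-6) -> pd.DataFrame:
    """Return the groups whose 'frac' values do not sum to 1 (hypothetical helper)."""
    frac_sums = df.groupby(IDX_PM_ESC, as_index=False)["frac"].sum()
    return frac_sums[~np.isclose(frac_sums["frac"], 1.0, atol=tol)]


toy = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 2],
        "report_date": ["2019-01-01"] * 4,
        "prime_mover_code": ["ST", "ST", "GT", "GT"],
        "energy_source_code": ["NG", "NG", "NG", "NG"],
        "generator_id": ["a", "b", "a", "b"],
        "frac": [0.25, 0.75, 0.5, 0.25],  # plant 2 only sums to 0.75
    }
)
bad = find_bad_frac_groups(toy)
```

Here `bad` holds only the plant 2 group, which is the sort of record the warning is counting.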

(new as of 4/28) with unassociated pm steps

2023-04-28 14:59:25 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:367 The granular data tables contain 38.1% of the fuel and 38.9% of net generation in the higher-coverage generation_fuel_eia923 table.
2023-04-28 14:59:29 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1459 Distributing 0.1% annually reported records to months.
2023-04-28 15:02:22 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1459 Distributing 0.2% annually reported records to months.
2023-04-28 15:06:38 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:919 Associating and allocating 54762 (0.7%) records with unexpected prime_mover_code.
2023-04-28 15:07:07 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:919 Associating and allocating 16624 (0.2%) records with unexpected prime_mover_code.
2023-04-28 15:08:28 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1124 Ratio calc types: 
   All gens w/in generation table:  1372731#, 2.3e+08 MW
   Some gens w/in generation table: 15799#, 1.2e+06 MW
   No gens w/in generation table:   5929734#, 2.9e+08 MW
2023-04-28 15:08:37 [ WARNING] catalystcoop.pudl.analysis.allocate_net_gen:1805 Ooopsies. You got 36 records where the 'frac' column isn't adding up to 1 for each 'IDX_PM_ESC' group. Check 'make_allocation_frac()' 
2023-04-28 15:09:37 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1849 1.774% of records have are partially off from their 'IDX_PM_ESC' group
2023-04-28 15:09:38 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1861 gen v fuel table net gen diff:      38.8%
2023-04-28 15:09:38 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1865 new v fuel table net gen diff:      98.4%
2023-04-28 15:09:45 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1902 1.51% of generator records are more that 5% off from the net generation table
2023-04-28 15:09:47 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1240 Ratio calc types: 
   All gens w/in boiler fuel table:  1014430#, 1.8e+08 MW
   Some gens w/in boiler fuel table: 87665#, 1.3e+07 MW
   No gens w/in boiler fuel table:   6216169#, 3.2e+08 MW
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1960 net_generation_mwh: 1.1% of allocated plant/year's are off by more than 5%
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1971 net_generation_mwh: Min and max differnce are x-2.02 and x8.13
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1960 fuel_consumed_mmbtu: 1.1% of allocated plant/year's are off by more than 5%
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1971 fuel_consumed_mmbtu: Min and max differnce are x0.0 and x1.0
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1960 fuel_consumed_for_electricity_mmbtu: 1.1% of allocated plant/year's are off by more than 5%
2023-04-28 15:11:15 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1971 fuel_consumed_for_electricity_mmbtu: Min and max differnce are x-0.65 and x1.0

with gf unassociated allocation step removed

2023-03-31 12:08:58 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:367 The granular data tables contains 38.1% of the fuel and 38.9% of net generation in the higher-coverage generation_fuel_eia923 table.
2023-03-31 12:09:01 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1448 Distributing 0.1% annually reported records to months.
2023-03-31 12:11:50 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1448 Distributing 0.2% annually reported records to months.
2023-03-31 12:16:25 [    INFO] catalystcoop.pudl.transform.classes:823 54.1% of records (8997 rows) contain only {0.0, nan, <NA>} values in required columns. Dropped these 💩💩💩 records.
2023-03-31 12:16:25 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:908 Associating and allocating 6506 (0.1%) records with unexpected energy_source_code.
2023-03-31 12:21:26 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1113 Ratio calc types: 
   All gens w/in generation table:  1372763#, 2.3e+08 MW
   Some gens w/in generation table: 15767#, 1.2e+06 MW
   No gens w/in generation table:   5984955#, 2.9e+08 MW
2023-03-31 12:22:22 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1837 1.024% of records have are partially off from their 'IDX_PM_ESC' group
2023-03-31 12:22:22 [ WARNING] catalystcoop.pudl.analysis.allocate_net_gen:1846 55221 records have no capacity or net gen
2023-03-31 12:22:22 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1849 gen v fuel table net gen diff:      38.8%
2023-03-31 12:22:22 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1853 new v fuel table net gen diff:      97.0%
2023-03-31 12:22:28 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1888 1.50% of generator records are more that 5% off from the net generation table
2023-03-31 12:22:30 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1229 Ratio calc types: 
   All gens w/in boiler fuel table:  1015372#, 1.8e+08 MW
   Some gens w/in boiler fuel table: 93778#, 1.4e+07 MW
   No gens w/in boiler fuel table:   6264335#, 3.2e+08 MW
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1946 net_generation_mwh: 1.5% of allocated plant/year's are off by more than 5%
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1957 net_generation_mwh: Min and max differnce are x-2.62 and x9.56
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1946 fuel_consumed_mmbtu: 1.5% of allocated plant/year's are off by more than 5%
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1957 fuel_consumed_mmbtu: Min and max differnce are x-0.0 and x1.0
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1946 fuel_consumed_for_electricity_mmbtu: 1.4% of allocated plant/year's are off by more than 5%
2023-03-31 12:23:41 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1957 fuel_consumed_for_electricity_mmbtu: Min and max differnce are x-0.0 and x1.0

with both unassociated allocation steps removed

2023-03-29 15:14:09 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:369 The granular data tables contain 38.1% of the fuel and 38.9% of net generation in the higher-coverage generation_fuel_eia923 table.
2023-03-29 15:14:11 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1460 Distributing 0.1% annually reported records to months.
2023-03-29 15:16:20 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1460 Distributing 0.2% annually reported records to months.
2023-03-29 15:18:12 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1630 Index(['plant_id_eia', 'report_date', 'prime_mover_code', 'energy_source_code', 'num'], dtype='object')
2023-03-29 15:20:53 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1125 Ratio calc types: 
   All gens w/in generation table:  1378884#, 2.3e+08 MW
   Some gens w/in generation table: 25784#, 1.9e+06 MW
   No gens w/in generation table:   5985441#, 2.9e+08 MW
2023-03-29 15:21:00 [ WARNING] catalystcoop.pudl.analysis.allocate_net_gen:1808 Ooopsies. You got 1301 records where the 'frac' column isn't adding up to 1 for each 'IDX_PM_ESC' group. Check 'make_allocation_frac()'
2023-03-29 15:21:45 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1850 1.063% of records have are partially off from their 'IDX_PM_ESC' group
2023-03-29 15:21:45 [ WARNING] catalystcoop.pudl.analysis.allocate_net_gen:1859 71845 records have no capacity or net gen
2023-03-29 15:21:45 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1862 gen v fuel table net gen diff:      38.8%
2023-03-29 15:21:45 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1866 new v fuel table net gen diff:      97.0%
2023-03-29 15:21:51 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1901 1.49% of generator records are more that 5% off from the net generation table
2023-03-29 15:21:53 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1241 Ratio calc types: 
   All gens w/in boiler fuel table:  1030024#, 1.8e+08 MW
   Some gens w/in boiler fuel table: 89255#, 1.3e+07 MW
   No gens w/in boiler fuel table:   6270830#, 3.2e+08 MW
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1959 net_generation_mwh: 1.5% of allocated plant/year's are off by more than 5%
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1970 net_generation_mwh: Min and max differnce are x-2.62 and x9.56
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1959 fuel_consumed_mmbtu: 1.6% of allocated plant/year's are off by more than 5%
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1970 fuel_consumed_mmbtu: Min and max differnce are x-0.0 and x1.0
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1959 fuel_consumed_for_electricity_mmbtu: 1.5% of allocated plant/year's are off by more than 5%
2023-03-29 15:23:02 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1970 fuel_consumed_for_electricity_mmbtu: Min and max differnce are x-0.0 and x1.0

last-ish nightly log

2023-03-24 14:27:08 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1450 Distributing 0.1% annually reported records to months.
2023-03-24 14:34:09 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1450 Distributing 0.2% annually reported records to months.
2023-03-24 14:40:54 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:910 Associating and allocating 225389 (3.2%) records with unexpected prime_mover_code.
2023-03-24 14:41:52 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:910 Associating and allocating 64345 (0.9%) records with unexpected energy_source_code.
2023-03-24 14:44:17 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1121 Ratio calc types: 
   All gens w/in generation table:  1203491#, 2e+08 MW
   Some gens w/in generation table: 18563#, 1.3e+06 MW
   No gens w/in generation table:   5611085#, 2.6e+08 MW
2023-03-24 14:44:30 [ WARNING] catalystcoop.pudl.analysis.allocate_net_gen:1724 Ooopsies. You got 26 records where the 'frac' column isn't adding up to 1 for each 'IDX_PM_ESC' group. Check 'make_allocation_frac()'
2023-03-24 14:45:36 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1766 1.949% of records have are partially off from their 'IDX_PM_ESC' group
2023-03-24 14:45:37 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1778 gen v fuel table net gen diff:      38.8%
2023-03-24 14:45:37 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1782 new v fuel table net gen diff:      96.8%
2023-03-24 14:45:48 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1817 0.90% of generator records are more that 5% off from the net generation table
2023-03-24 14:45:50 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1234 Ratio calc types: 
   All gens w/in boiler fuel table:  967980#, 1.8e+08 MW
   Some gens w/in boiler fuel table: 63060#, 7.9e+06 MW
   No gens w/in boiler fuel table:   5802099#, 2.8e+08 MW
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1875 net_generation_mwh: 2.2% of allocated plant/year's are off by more than 5%
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1886 net_generation_mwh: Min and max differnce are x-2.02 and x8.13
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1875 fuel_consumed_mmbtu: 5.9% of allocated plant/year's are off by more than 5%
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1886 fuel_consumed_mmbtu: Min and max differnce are x0.0 and x6.0
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1875 fuel_consumed_for_electricity_mmbtu: 5.9% of allocated plant/year's are off by more than 5%
2023-03-24 14:47:32 [    INFO] catalystcoop.pudl.analysis.allocate_net_gen:1886 fuel_consumed_for_electricity_mmbtu: Min and max differnce are x0.0 and x6.0

cmgosnell · 2023-03-31T17:04:58Z

src/pudl/analysis/allocate_net_gen.py

-            indicator=True,  # used in _allocate_unassociated_records to find unassocited
        )
        .pipe(remove_inactive_generators)
-        .pipe(
-            _allocate_unassociated_records,
-            idx_cols=IDX_PM_ESC,
-            col_w_unexpected_codes="prime_mover_code",
-            data_columns=[f"{col}_gf_tbl" for col in DATA_COLUMNS],
-        )
-        .drop(columns=["_merge"])  # drop do we can do this again in the bf_summed merge


applying the treatment in add_missing_energy_source_codes_to_gens resulted in 0 records from the gf table (which was merged in just above this chunk) being "unassociated". which is to say all of the records from gf had matching IDX_PM_ESC in the updated generators table.

I considered either: a) keeping this in here just in case future records are unassociated OR b) raising an assertion or warning if any records are unassociated.

I'm wondering why this is only an issue for the bf table now and not gf? I'm wondering if it would be worth keeping it just in case?

I think we should probably raise an exception if we encounter unassociated records - we don't expect any to show up, now that we've added the missing ESCs, and so I'd like it to fail loudly when something does sneak through!

after adding an assertion that failed, i added this step back in. I'm a little confused bc when I was first testing this there seemed to be no unassociated gf data after the new missing-esc treatment. but it definitely seems like there are still unassociated records, but the number dropped dramatically. so i've added the step back in!
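The "fail loudly" idea discussed in this thread can be sketched with the merge indicator that the removed code already relied on. This is a simplified illustration, not the check that was actually added: the index columns are a cut-down stand-in for IDX_PM_ESC, and merge_and_check plus the toy tables are hypothetical.

```python
import pandas as pd

# Simplified stand-in for the real index columns.
IDX = ["plant_id_eia", "prime_mover_code", "energy_source_code"]


def merge_and_check(gens: pd.DataFrame, gf: pd.DataFrame) -> pd.DataFrame:
    """Merge gf onto gens and raise if any gf record has no matching generator."""
    merged = gens.merge(gf, on=IDX, how="outer", indicator=True)
    # "right_only" rows exist in gf but found no generator to attach to.
    unassociated = merged[merged["_merge"] == "right_only"]
    if not unassociated.empty:
        raise AssertionError(
            f"{len(unassociated)} gf records have no matching generator:\n"
            f"{unassociated[IDX]}"
        )
    return merged.drop(columns=["_merge"])


gens = pd.DataFrame(
    {
        "plant_id_eia": [1],
        "generator_id": ["a"],
        "prime_mover_code": ["ST"],
        "energy_source_code": ["NG"],
    }
)
gf = pd.DataFrame(
    {
        "plant_id_eia": [1, 1],
        "prime_mover_code": ["ST", "GT"],  # no GT generator exists at plant 1
        "energy_source_code": ["NG", "NG"],
        "net_generation_mwh_gf_tbl": [100.0, 5.0],
    }
)

merged = merge_and_check(gens, gf.iloc[[0]])  # passes: the ST/NG record matches
# merge_and_check(gens, gf) would raise, because the GT record is unassociated.
```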

cmgosnell · 2023-03-31T17:08:33Z

src/pudl/analysis/allocate_net_gen.py

-def _allocate_unassociated_records(
+def _allocate_unassociated_bf_records(
    gen_assoc: pd.DataFrame,
    idx_cols: list[str],
    col_w_unexpected_codes: Literal["energy_source_code", "prime_mover_code"],
    data_columns: list[str],
 ) -> pd.DataFrame:
-    """Associate unassociated gen_fuel table records on idx_cols.
+    """Associate unassociated :ref:`boiler_fuel_eia923` table records on idx_cols.


several aspects of this function were designed for either the "energy_source_code" or the "prime_mover_code" being unassociated. But upon further reflection and in-notebook dissecting, we were using this to merge in the unassociated records from the gf table and then the bf table. I converted it to be a little more tailored to the bf table.

cmgosnell · 2023-03-31T17:23:58Z

src/pudl/analysis/allocate_net_gen.py

+        .pipe(
+            pudl.transform.classes.drop_invalid_rows,
+            pudl.transform.classes.InvalidRows(
+                required_valid_cols=data_columns, invalid_values=[pd.NA, np.nan, 0.0]
+            ),
+        )


~ half of all of these unconnected records were 0's or nulls. this whole function is here to ensure we don't lose fuel in the association process. And 0 fuel isn't really fuel for this purpose. was easy to reuse the drop_invalid_rows transformer we made for ferc1 transformers
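As a plain-pandas illustration of what this step does to the unassociated records: the sketch below drops rows whose data columns are all zero or null. It is a stand-in, not the drop_invalid_rows transformer itself. The "all required columns" rule is inferred from the log line about records that "contain only {0.0, nan, <NA>} values in required columns", and the function and column names here are illustrative.

```python
import numpy as np
import pandas as pd


def drop_all_zero_or_null(df: pd.DataFrame, data_columns: list[str]) -> pd.DataFrame:
    """Drop rows whose data columns are all 0 or null (plain-pandas stand-in)."""
    is_invalid = df[data_columns].isna() | (df[data_columns] == 0)
    return df[~is_invalid.all(axis=1)]


toy = pd.DataFrame(
    {
        "net_generation_mwh": [100.0, 0.0, 0.0],
        "fuel_consumed_mmbtu": [10.0, np.nan, 5.0],
    }
)
kept = drop_all_zero_or_null(toy, ["net_generation_mwh", "fuel_consumed_mmbtu"])
```

Row 1 (zero and null) is dropped; row 2 survives because it still carries some fuel.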

I'm wondering if we do want to keep zero values here. A zero is still data about what was being consumed, and I'm wondering if this could be valuable. Have you tried running it both ways to see if the outputs differ (if you keep 0s in this step)?

What's the benefit to dropping the 0s vs keeping them? In past environments I've run into lots of problems when we don't distinguish between "we don't know what happened" and "we do know what happened - it was nothing." But I know much less about the actual data + use cases than either of you two.

I removed them in this case because these are the unassociated records and later in _allocate_unassociated_data_col they will be added to the associated records. so adding a zero to a value will result in no change. _allocate_unassociated_data_col already does nothing if the unassociated value is null because you can't add a null to a value.

But I do think there is some merit in keeping the zero's in because we do the following: og_value.fillna(0) + unassociated_value so this will result in more 0's in the case of the og_value being null.
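The arithmetic in that last expression is easy to check on a toy series. This only illustrates the quoted `og_value.fillna(0) + unassociated_value` expression in isolation; the values are made up and the surrounding logic of _allocate_unassociated_data_col is not reproduced.

```python
import numpy as np
import pandas as pd

og_value = pd.Series([10.0, np.nan, np.nan])
unassociated_value = pd.Series([0.0, 0.0, np.nan])

result = og_value.fillna(0) + unassociated_value
# row 0: 10 + 0 stays 10, so adding a zero changes nothing
# row 1: a null original value plus 0 becomes an explicit 0
# row 2: 0 + null is null in plain pandas arithmetic
```

Row 1 is the case where keeping the zeros matters: "no data" turns into "we know it was nothing".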

cmgosnell · 2023-03-31T17:25:13Z

src/pudl/analysis/allocate_net_gen.py

-        frac_from_g_tbl=lambda x: x.net_generation_mwh_g_tbl_pm_fuel
-        / x.net_generation_mwh_gf_tbl,
+        frac_from_g_tbl=lambda x: np.where(
+            (x.net_generation_mwh_g_tbl_pm_fuel / x.net_generation_mwh_gf_tbl) < 1,
+            (x.net_generation_mwh_g_tbl_pm_fuel / x.net_generation_mwh_gf_tbl),
+            1,
+        ),
        # for records within these mix groups that do have net gen in the


this was a greg addition (+ the next frac_from_bf_tbl mirrored version). A great update to ensure we don't over allocate. This ensures we don't get a fraction that is over 100% to use for allocation.
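A quick toy run of the same np.where pattern shows the cap in action. The column names are the ones in the diff; the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "net_generation_mwh_g_tbl_pm_fuel": [50.0, 120.0, 0.0],
        "net_generation_mwh_gf_tbl": [100.0, 100.0, 0.0],
    }
)
ratio = df.net_generation_mwh_g_tbl_pm_fuel / df.net_generation_mwh_gf_tbl
# Keep the ratio when it is below 1, otherwise fall back to 1.
df = df.assign(frac_from_g_tbl=np.where(ratio < 1, ratio, 1))
```

One side effect worth knowing about: a null ratio (0/0 in the last row) compares False against `< 1`, so it also lands on 1 here. If that is not wanted, `ratio.clip(upper=1)` would cap the values while leaving nulls as nulls.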

cmgosnell · 2023-03-31T17:28:58Z

src/pudl/analysis/allocate_net_gen.py

@@ -1598,6 +1596,80 @@ def adjust_energy_source_codes(
    return gens


+def add_missing_energy_source_codes_to_gens(gens_at_freq, gf, bf):


this was the core of greg's update. all in all a great simplify-er. bc generators can have many esc & sometimes the data tables have more and different esc's, this function identifies missing esc's and adds them into the gens table.
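The idea can be sketched on toy data: stack the generators' energy_source_code_N columns into one long column, then look for plant/PM/ESC combinations reported in gf that no generator lists. This is a simplified guess at the approach, not the body of add_missing_energy_source_codes_to_gens; the tables below are invented (loosely modeled on the plant 8023 test case) and only the gf side is handled.

```python
import pandas as pd

gens = pd.DataFrame(
    {
        "plant_id_eia": [8023, 8023],
        "generator_id": ["1", "2"],
        "prime_mover_code": ["ST", "ST"],
        "energy_source_code_1": ["SUB", "SUB"],
        "energy_source_code_2": ["DFO", None],
    }
)
gf = pd.DataFrame(
    {
        "plant_id_eia": [8023, 8023, 8023],
        "prime_mover_code": ["ST", "ST", "ST"],
        "energy_source_code": ["DFO", "RC", "SUB"],
    }
)

key = ["plant_id_eia", "prime_mover_code", "energy_source_code"]
# Stack the wide energy_source_code_N columns into one long column.
gens_escs = (
    gens.melt(
        id_vars=["plant_id_eia", "generator_id", "prime_mover_code"],
        value_name="energy_source_code",
    )
    .dropna(subset=["energy_source_code"])[key]
    .drop_duplicates()
)
# ESCs reported in gf for a plant/PM that no generator there lists.
compared = gf[key].drop_duplicates().merge(
    gens_escs, on=key, how="left", indicator=True
)
missing = compared[compared["_merge"] == "left_only"].drop(columns="_merge")
```

In this toy case `missing` contains the single RC record, which would then be added to the gens table as an extra energy source code.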

cmgosnell · 2023-03-31T17:31:33Z

src/pudl/analysis/allocate_net_gen.py

+    return gens_at_freq
+
+
+def identify_missing_gf_escs_in_gens(gens_at_freq, gf, bf):


the one thing I changed in here was adding the bf table. I also removed the dropping of the zeros. I don't fully understand why but removing that led to the output metrics being slightly better. I figured it didn't hurt to have more esc's in the gens table so I didn't deeply investigate why.

Hmm it's been a while since I've looked at this so I forget exactly why I dropped the zeros, but it might have had something to do with escs associated with proposed or retired generators? What were the output metrics that improved?

Also, with the addition of the bf table and removing the zero dropping, you might need to update the docstring and inline comments.

I could definitely imagine that a lot of these zero-data ESCs are associated with non-operating generators, but that should be covered by your remove_inactive_generators. I tried turning this on and off and it seemed like we were getting many more unassociated records when we dropped the nulls, so I dropped the dropping.

for metrics, I've mostly been looking at the indicators in the logs. I've kept a little running list of the logs for the different configs here. Although I don't think I saved the outputs from dropping these nulls vs not.

These updates overall have decreased the # of records that are outside of tolerance, which is great.

cmgosnell · 2023-03-31T17:49:40Z

@grgmiller and @jdangerx ! I finally got enough cycles of turning knobs and checking the outputs to feel like this is good and ready. Greg, I made a few preemptive comments on the PR to highlight where I slightly tweaked things you did. The tl;dr version is:

I added the boiler fuel table into identify_missing_gf_escs_in_gens (which i am now realizing I should change that func name)....
I removed the _allocate_unassociated_records after we merge in the gf table bc it was no longer necessary
I converted the _allocate_unassociated_records -> _allocate_unassociated_bf_records and am now doing this on the basis of prime_mover_code

zaneselvans · 2023-03-31T21:48:45Z

This PR would close issue #2226 and also supersedes PR #2235 right?

grgmiller · 2023-04-01T19:58:21Z

src/pudl/analysis/allocate_net_gen.py

-            indicator=True,  # used in _allocate_unassociated_records to find unassocited
        )
        .pipe(remove_inactive_generators)
-        .pipe(
-            _allocate_unassociated_records,
-            idx_cols=IDX_PM_ESC,
-            col_w_unexpected_codes="prime_mover_code",
-            data_columns=[f"{col}_gf_tbl" for col in DATA_COLUMNS],
-        )
-        .drop(columns=["_merge"])  # drop do we can do this again in the bf_summed merge


I'm wondering why this is only an issue for the bf table now and not gf? I'm wondering if it would be worth keeping it just in case?

grgmiller · 2023-04-01T20:01:06Z

src/pudl/analysis/allocate_net_gen.py

            idx_cols=IDX_GENS_PM_ESC,
-            col_w_unexpected_codes="energy_source_code",
+            col_w_unexpected_codes="prime_mover_code",


So here you are focused on matching unassociated prime movers, but it seems like the _allocate_unassociated_bf_records also has some code related to energy source codes? If this function is no longer being used to associate both prime movers and energy source codes, I'm wondering if it should be tailored to specifically associate prime mover codes in the bf table?

I'm also curious about this!

I believe some of the energy_source_code_num stuff in the function was to aid in associating the non-matching ESC record's data to only the primary energy source code record instead of sloshing it across the full plant.

But you are very right that this should be removed. I tested it a few times and tailoring this to the PM-codes only seems to work great.

grgmiller · 2023-04-01T20:04:10Z

src/pudl/analysis/allocate_net_gen.py

+        .pipe(
+            pudl.transform.classes.drop_invalid_rows,
+            pudl.transform.classes.InvalidRows(
+                required_valid_cols=data_columns, invalid_values=[pd.NA, np.nan, 0.0]
+            ),
+        )


I'm wondering if we do want to keep zero values here. A zero is still data about what was being consumed, and I'm wondering if this could be valuable. Have you tried running it both ways to see if the outputs differ (if you keep 0s in this step)?

grgmiller · 2023-04-01T20:11:26Z

src/pudl/analysis/allocate_net_gen.py

+    return gens_at_freq
+
+
+def identify_missing_gf_escs_in_gens(gens_at_freq, gf, bf):


Hmm it's been a while since I've looked at this so I forget exactly why I dropped the zeros, but it might have had something to do with escs associated with proposed or retired generators? What were the output metrics that improved?

Also, with the addition of the bf table and removing the zero dropping, you might need to update the docstring and inline comments.

jdangerx

I have no blocking concerns! I think some more tests would be nice, but sounds like you're planning on adding them in a separate PR - which is fine, though if there's a chance you'd end up punting on those changes I'd like to see a couple more tests showing example inputs/outputs. Otherwise I'm worried we'll run into the "whoops, put this down for a couple months and the behavior's all wonky" situation again.

Before I approve, though, I'd want to hop on a call and just put the code through its paces by hand with you - I was having some trouble getting it all to run in an ipython notebook, and couldn't get the test case to hit the _allocate_unassociated_bf_records with any unassociated BF records. I was sort of banging my head against that for a bit but figured a call would be faster!

docs/release_notes.rst

src/pudl/analysis/allocate_net_gen.py

jdangerx · 2023-04-04T15:57:55Z

src/pudl/analysis/allocate_net_gen.py

-            indicator=True,  # used in _allocate_unassociated_records to find unassocited
        )
        .pipe(remove_inactive_generators)
-        .pipe(
-            _allocate_unassociated_records,
-            idx_cols=IDX_PM_ESC,
-            col_w_unexpected_codes="prime_mover_code",
-            data_columns=[f"{col}_gf_tbl" for col in DATA_COLUMNS],
-        )
-        .drop(columns=["_merge"])  # drop do we can do this again in the bf_summed merge


I think we should probably raise an exception if we encounter unassociated records - we don't expect any to show up, now that we've added the missing ESCs, and so I'd like it to fail loudly when something does sneak through!

jdangerx · 2023-04-04T16:31:20Z

src/pudl/analysis/allocate_net_gen.py

+        .pipe(
+            pudl.transform.classes.drop_invalid_rows,
+            pudl.transform.classes.InvalidRows(
+                required_valid_cols=data_columns, invalid_values=[pd.NA, np.nan, 0.0]
+            ),
+        )


What's the benefit to dropping the 0s vs keeping them? In past environments I've run into lots of problems when we don't distinguish between "we don't know what happened" and "we do know what happened - it was nothing." But I know much less about the actual data + use cases than either of you two.

jdangerx · 2023-04-04T16:46:24Z

test/unit/analysis/allocate_net_gen_test.py

@@ -320,7 +318,7 @@ def test_missing_energy_source():
            """report_date,plant_id_eia,energy_source_code,prime_mover_code,net_generation_mwh,fuel_consumed_mmbtu,fuel_consumed_for_electricity_mmbtu
    2019-01-01,8023,DFO,ST,3369.286,35566.0,35566.0
    2019-01-01,8023,RC,ST,5363193.71,56777578.0,56777578.0
-    2019-01-01,8023,SUB,ST,0.0, 0.0,0.0
+    2019-01-01,8023,SUB,ST,10000.0, 100000.0,100000.0


What happens if you leave this at 0?

jdangerx · 2023-04-04T16:48:04Z

src/pudl/analysis/allocate_net_gen.py

            idx_cols=IDX_GENS_PM_ESC,
-            col_w_unexpected_codes="energy_source_code",
+            col_w_unexpected_codes="prime_mover_code",


I'm also curious about this!

jdangerx · 2023-04-17T16:51:11Z

Take-aways from call with @cmgosnell last week:

there's some funkiness where if you have an unknown prime mover code in the boiler_fuel table, it gets dropped before we get a chance to allocate it
before merging this in, we should add a few more tests that basically document the current expected behavior of the code - it's easier to understand what "should" be going on with toy datasets. @cmgosnell was going to maybe take a crack at writing a few of these test cases. Let me know if you want help!

…bf table

jdangerx

I'm glad you were able to pull out the ESC stuff from the _allocate_unassociated... function!

The new tests test some good behavior also, sweet! I think we can make them a bit more maintainable/clearer, so that when we pick this up again some months from now we aren't kicking ourselves. I'll spend a bit of time on some concrete suggestion code and see where that goes.

edit: here's a diff!

jdangerx · 2023-05-01T21:19:11Z

test/unit/analysis/allocate_net_gen_test.py

@@ -271,7 +271,8 @@ def test_allocated_sums_match(example_1_pudl_tabl):
    # )


-def test_missing_energy_source():
+@pytest.fixture
+def example_2_pudl_tabl():


Pulling this out into a fixture makes a lot of sense!

I think the new tests make the example_1 tests basically redundant - example_2 touches on the allocation sums, and they also touch on the fuel ratios too. The only thing is that example_2 already has the bonus ESCs - so it tests a slightly less happy path. One thing we could do is have the following structure:

nix example_1

rename example_2 as one_plant_happy_path or something, and don't put in the extra ESCs or PM codes just yet

make new fixtures that add the extra ESCs or PM codes (extra_esc_in_gf, extra_pm_in_bf)

use the fixtures in parameterized test that tests that the sums all make sense

use the fixtures in parameterized test that tests that the ratios are correct for the happy path & the extra_esc_in_gf case

add a separate SPECIAL TEST for the weird PM code behavior we found

I'll put a bit of time into trying that structure out to see if that works!

these are great suggestions! I like the option you seemed to land on in the diff you shared with parameterizing the various individual tests and pulling out a special test for the PM code problem.

jdangerx · 2023-05-01T21:39:57Z

test/unit/analysis/allocate_net_gen_test.py

+    return ratio_bf, ratio_allocated
+
+
+def test_missing_energy_source(example_2_pudl_tabl):


This test is sort of testing two different things:

are we adding the ESCs in the data cleaning step?

are we using the new ESCs correctly?

IMO testing the first bullet is kind of useful but more of an implementation detail - what we care about is defining the behavior we expect from the main interface, and I don't necessarily want to write tests that will fail when we make a change that doesn't actually affect the behavior we care about.

That being said, I can see the value in calling out this specific step in tests. In any case, we should probably split this up into two different tests & maybe only keep one. Here's one way to do it...

diff --git a/test/unit/analysis/allocate_net_gen_test.py b/test/unit/analysis/allocate_net_gen_test.py
index 6565ecc1..1d4c361a 100644
--- a/test/unit/analysis/allocate_net_gen_test.py
+++ b/test/unit/analysis/allocate_net_gen_test.py
@@ -369,12 +369,15 @@ def get_ratio_from_bf_and_allocated_by_boiler(
     return ratio_bf, ratio_allocated


-def test_missing_energy_source(example_2_pudl_tabl):
-    gf, bf, bga, gens, _ = allocate_net_gen.extract_input_tables(example_2_pudl_tabl)
+def test_add_missing_energy_source(example_2_pudl_tabl):
+    gf, bf, _, gens, _ = allocate_net_gen.extract_input_tables(example_2_pudl_tabl)
     gens = allocate_net_gen.add_missing_energy_source_codes_to_gens(gens, gf, bf)
     # assert that the missing energy source code is RC
     assert gens.energy_source_code_8.unique() == "RC"
+

+def test_allocate_missing_energy_source(example_2_pudl_tabl):
+    _, bf, bga, _, _ = allocate_net_gen.extract_input_tables(example_2_pudl_tabl)
     allocated = allocate_net_gen.allocate_gen_fuel_by_generator_energy_source(
         example_2_pudl_tabl
     )

I believe I added these various stages in part to have a better ability to pinpoint possible errors in this allocation process. Bc the data goes through so many stages, a time-suck in working with this module is often just figuring out where something went wrong. So while I 100% agree that the most important thing to test is the overall behavior, testing the stages feels like a helpful addition imo.
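To make the trade-off in this thread concrete, here is a minimal sketch of a stage-level test next to a behavior-level test. It uses a toy two-stage pipeline: `add_missing_codes`, `allocate`, and the data are hypothetical stand-ins for illustration, not the functions in allocate_net_gen.

```python
# Hypothetical two-stage pipeline; names and data are illustrative only.


def add_missing_codes(gen_codes, fuel_codes):
    """Stage 1: add any fuel-table energy source codes the generator list lacks."""
    return sorted(set(gen_codes) | set(fuel_codes))


def allocate(total_fuel, gen_codes, fuel_codes):
    """Main interface: split fuel evenly across the completed code list."""
    codes = add_missing_codes(gen_codes, fuel_codes)
    return {code: total_fuel / len(codes) for code in codes}


def test_add_missing_codes():
    # Stage-level test: helps pinpoint where a failure originates.
    assert add_missing_codes(["NG"], ["NG", "RC"]) == ["NG", "RC"]


def test_allocate_preserves_total():
    # Behavior-level test: keeps passing if the intermediate stage is refactored.
    allocated = allocate(100.0, ["NG"], ["NG", "RC"])
    assert set(allocated) == {"NG", "RC"}
    assert sum(allocated.values()) == 100.0
```

Keeping the two as separate tests means a failure in the first points at the cleaning stage, while a failure in only the second points at the allocation itself.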

jdangerx · 2023-05-01T21:48:27Z

test/unit/analysis/allocate_net_gen_test.py

+    assert ratio_bf == ratio_allocated
+
+
+def test_missing_pm_code_in_bf(example_2_pudl_tabl):


Similar comment as above - this looks like two separate tests as well. Because it's documenting some weird behavior, I'd probably keep both.

jdangerx · 2023-05-01T21:52:13Z

src/pudl/analysis/allocate_net_gen.py

        .pipe(remove_inactive_generators)
+        .pipe(


non-blocking: I remember you mentioning that some weird data slipped through without this. What was going on, and why did double-applying the _allocate_unassociated_pm_records method fix it?

I had originally thought that adding Greg's "add all the ESCs" methodology might eliminate the need for this step of associating the unassociated records after the gf merge. I thought I had tested it and determined that this was indeed the case, but after this thread I added an assertion back in to make sure, and sure enough a small number of records were unassociated.

Bc we have all of the ESCs, I knew it couldn't be the ESCs that were mismatched. The other merge key here is the PM code, so I applied _allocate_unassociated_pm_records.
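The reasoning above (records still unassociated even though every ESC is present, so the PM code must be the mismatched key) can be sketched in plain Python. This is an illustration on made-up records, assuming a merge key of plant, energy source code, and prime mover code; it does not reproduce the pandas merge used in the module.

```python
# Made-up records keyed by (plant_id, energy_source_code, prime_mover_code).
gens_keys = {(1, "NG", "CT"), (1, "RC", "ST")}
gf_records = [
    {"key": (1, "NG", "CT"), "fuel_mmbtu": 10.0},
    {"key": (1, "RC", "ST"), "fuel_mmbtu": 5.0},
    {"key": (1, "RC", "GT"), "fuel_mmbtu": 1.0},  # PM code absent from gens
]

# Anti-join: gf records whose full key has no match in gens.
unassociated = [rec for rec in gf_records if rec["key"] not in gens_keys]

# Of those, which would have matched on plant and energy source code alone?
gens_escs = {(plant, esc) for plant, esc, _ in gens_keys}
pm_mismatch_only = [rec for rec in unassociated if rec["key"][:2] in gens_escs]
```

If every unassociated record also appears in `pm_mismatch_only`, the energy source codes line up and only the prime mover code prevents the match, which is the situation described in the comment.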

cmgosnell · 2023-05-02T13:00:59Z

thank you for these suggestions @jdangerx !! this may be a basic question, but is there a way to directly integrate your gist diffs that you shared?? I like the way you broke them out and am inclined to use your suggestions directly but i've never really played with gists.

jdangerx · 2023-05-04T16:34:36Z

@cmgosnell I'm glad you like the changes! You should be able to download the file and then run git apply <file> to apply the patch.

grgmiller and others added 7 commits January 25, 2023 00:50

fix gen fuel allocation bugs

12d7239

[pre-commit.ci] auto fixes from pre-commit.com hooks

4d21dad

For more information, see https://pre-commit.ci

Merge branch 'dev' into oge_dev

6842fe5

improve missing esc identification speed

fbe5b4c

merge origin

faa0267

[pre-commit.ci] auto fixes from pre-commit.com hooks

167ee89

For more information, see https://pre-commit.ci

Merge branch 'oge_dev' of https://github.com/grgmiller/pudl into pm_c…

828e462

…ode_fix

cmgosnell added 2 commits March 24, 2023 14:00

make tests work w new pm code adder

1695ee7

Merge branch 'dev' into pm_code_fix

fafb750

first pass at removing the unassocaited allocation post gregs fleshin…

4fd7b4d

…g out of missing escs

cmgosnell added 4 commits March 31, 2023 11:36

settle on fuel allocation config

ab440b3

Merge branch 'dev' into pm_code_fix

2df68fd

convert unassociated mask back to _merge indicator

41dd270

add release notes for the pm code/allocaiton pr

49e5204

cmgosnell commented Mar 31, 2023

convert the unassociated merge to use the pm code

168753c

cmgosnell marked this pull request as ready for review March 31, 2023 17:41

cmgosnell requested review from jdangerx and grgmiller March 31, 2023 17:41

zaneselvans added this to the 2023Q2 milestone Mar 31, 2023

zaneselvans linked an issue Mar 31, 2023 that may be closed by this pull request

allocate_net_gen dropping data when energy source codes don't match between gens and gf #2226

Closed

grgmiller reviewed Apr 1, 2023

jdangerx reviewed Apr 4, 2023

jdangerx mentioned this pull request Apr 10, 2023

Fix allocation bugs in allocate_net_gen #2235

Closed

jdangerx assigned cmgosnell Apr 10, 2023

cmgosnell added 5 commits April 24, 2023 10:56

Merge branch 'dev' into pm_code_fix

4b316b6

Merge branch 'dev' into pm_code_fix

8b059ed

add unit test to describe quite dropping of non-matching ESCs in the …

bcc3cf0

…bf table

add back in the unassociated step for the gf table and update does

a35dc50

Merge branch 'dev' into pm_code_fix

8bcf360

cmgosnell requested a review from jdangerx April 28, 2023 19:43

lil baby docs update

dfa3c03

jdangerx requested changes May 1, 2023

cmgosnell added 2 commits May 16, 2023 07:13

integrate dazhong's unit test cleanup/standardization

b04dbcf

Merge branch 'dev' into pm_code_fix

76037fa

zaneselvans self-requested a review May 16, 2023 19:25

zaneselvans approved these changes May 16, 2023

zaneselvans merged commit 50d83af into dev May 16, 2023

zaneselvans deleted the pm_code_fix branch May 16, 2023 19:28

