-
-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate greg M's PM code fix #2446
Conversation
For more information, see https://pre-commit.ci
For more information, see https://pre-commit.ci
@cmgosnell let me know what if anything you need from me to move this forward. |
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## dev #2446 +/- ##
=====================================
Coverage 86.9% 86.9%
=====================================
Files 84 84
Lines 9604 9626 +22
=====================================
+ Hits 8351 8374 +23
+ Misses 1253 1252 -1
☔ View full report in Codecov by Sentry. |
…g out of missing escs
mkay... still not fully ready for review but I learned some things:
(new as of 4/28) with unassociated pm steps
with gf unassociated allocation step removed
with both unassociated allocation steps removed
last-ish nightly log
|
indicator=True, # used in _allocate_unassociated_records to find unassocited | ||
) | ||
.pipe(remove_inactive_generators) | ||
.pipe( | ||
_allocate_unassociated_records, | ||
idx_cols=IDX_PM_ESC, | ||
col_w_unexpected_codes="prime_mover_code", | ||
data_columns=[f"{col}_gf_tbl" for col in DATA_COLUMNS], | ||
) | ||
.drop(columns=["_merge"]) # drop do we can do this again in the bf_summed merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
applying the treatment in add_missing_energy_source_codes_to_gens
resulted in 0 records from the gf table (which was merged in just above this chunk) being "unassociated". which is to say all of the records from gf had matching IDX_PM_ESC
in the updated generators table.
I considered either: a) keeping this in here just in case future records are unassociated OR b) raising an assertion or warning if any records are unassociated/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering why this is only an issue for the bf table now and not gf? I'm wondering if it would be worth keeping it just in case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should probably raise an exception if we encounter unassociated records - we don't expect any to show up, now that we've added the missing ESCs, and so I'd like it to fail loudly when something does sneak through!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after adding an assertion that failed, i added this step back in. I'm a little confused bc when I was first testing this there seemed to be no unassociated gf data after the new missing-esc treatment. but it definitely seems like there are still unassociated records, but the number dropped dramatically. so i've added the step back in!
def _allocate_unassociated_records( | ||
def _allocate_unassociated_bf_records( | ||
gen_assoc: pd.DataFrame, | ||
idx_cols: list[str], | ||
col_w_unexpected_codes: Literal["energy_source_code", "prime_mover_code"], | ||
data_columns: list[str], | ||
) -> pd.DataFrame: | ||
"""Associate unassociated gen_fuel table records on idx_cols. | ||
"""Associate unassociated :ref:`boiler_fuel_eia923` table records on idx_cols. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
several aspects of this function were designed for either the "energy_source_code" or the "prime_mover_code" being unassociated. But upon further reflection and in-notebook dissecting, we were using this to merge in the unassocaited records from the gf table and then the bf table. I converted it to be a little more tailored to the bf table.
.pipe( | ||
pudl.transform.classes.drop_invalid_rows, | ||
pudl.transform.classes.InvalidRows( | ||
required_valid_cols=data_columns, invalid_values=[pd.NA, np.nan, 0.0] | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
~ half of all of these unconnected records were 0's or nulls. this whole function is here to ensure we don't loose fuel in the association process. And 0 fuel isn't really fuel for this purpose. was easy to reuse the drop_invalid_rows
transformer we made for ferc1 transformers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we do want to keep zero values here. A zero is still data about what was being consumed, and I'm wondering if this could be valuable. Have you tried running it both ways to see if the outputs differ (if you keep 0s in this step)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the benefit to dropping the 0s vs keeping them? In past environments I've run into lots of problems when we don't distinguish between "we don't know what happened" and "we do know what happened - it was nothing." But I know much less about the actual data + use cases than either of you two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed them in this case because these are the unassociated records and later in _allocate_unassociated_data_col they will be added to the associated records. so adding a zero to a value will result in no change. _allocate_unassociated_data_col already does nothing if the unassociated value is null because you can't add a null to a value.
But I do think there is some merit in keeping the zero's in because we do the following: og_value.fillna(0) + unassociated_value
so this will result in more 0's in the case of the og_value
being null.
frac_from_g_tbl=lambda x: x.net_generation_mwh_g_tbl_pm_fuel | ||
/ x.net_generation_mwh_gf_tbl, | ||
frac_from_g_tbl=lambda x: np.where( | ||
(x.net_generation_mwh_g_tbl_pm_fuel / x.net_generation_mwh_gf_tbl) < 1, | ||
(x.net_generation_mwh_g_tbl_pm_fuel / x.net_generation_mwh_gf_tbl), | ||
1, | ||
), | ||
# for records within these mix groups that do have net gen in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was a greg addition (+ the next frac_from_bf_tbl
mirrored version). A great update to ensure we don't over allocate. This ensures we don't get a fraction that is over 100% to use for allocation.
@@ -1598,6 +1596,80 @@ def adjust_energy_source_codes( | |||
return gens | |||
|
|||
|
|||
def add_missing_energy_source_codes_to_gens(gens_at_freq, gf, bf): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was the core of greg's update. all in all a great simplify-er. bc generators can have many esc & sometimes the data tables have more and different esc's, this function identifies missing esc's and adds them into the gens table.
return gens_at_freq | ||
|
||
|
||
def identify_missing_gf_escs_in_gens(gens_at_freq, gf, bf): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the one thing I changed in here was adding the bf
table. I also removed the dropping of the zeros. I don't fully understand why but removing that led to the output metrics being slightly better. I figured it didn't hurt to have more esc's in the gens table so I didn't deeply investigate why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm it's been a while since I've looked at this so I forget exactly why I dropped the zeros, but it might have had something to do with escs associated with proposed or retired generators? What were the output metrics that improved?
Also, with the addition of the bf table and removing the zero dropping, you might need to update the docstring and inline comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could definitely imagine that a lot of these zero-data ESCs are associated with non-operating generators, but that should be covered by your remove_inactive_generators
. I tried turning this on and off and it seemed like we were getting much more unassociated records when we dropped the nulls, so I dropped the dropping.
for metrics, I've mostly been looking at the indicators in the logs. I've kept a little running list of the logs for the different configs here. Although I don't think I saved the outputs from dropping these nulls vs not.
These updates overall have decreased the # of records that are outside of tolerance, which is great.
@grgmiller and @jdangerx ! I finally got enough cycles of turning knobs and checking the outputs to feel like this is good and ready. Greg, I made a few preemptive comments on the PR to highlight were I slightly tweaked things you did. The tl;dr version is:
|
indicator=True, # used in _allocate_unassociated_records to find unassocited | ||
) | ||
.pipe(remove_inactive_generators) | ||
.pipe( | ||
_allocate_unassociated_records, | ||
idx_cols=IDX_PM_ESC, | ||
col_w_unexpected_codes="prime_mover_code", | ||
data_columns=[f"{col}_gf_tbl" for col in DATA_COLUMNS], | ||
) | ||
.drop(columns=["_merge"]) # drop do we can do this again in the bf_summed merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering why this is only an issue for the bf table now and not gf? I'm wondering if it would be worth keeping it just in case?
idx_cols=IDX_GENS_PM_ESC, | ||
col_w_unexpected_codes="energy_source_code", | ||
col_w_unexpected_codes="prime_mover_code", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here you are focused on matching unassociated prime movers, but it seems like the _allocate_unassociated_bf_records
also has some code related to energy source codes? If this function is no longer being used to associate both prime movers and energy source codes, I'm wondering if it should be tailored to specifically associate prime mover codes in the bf table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also curious about this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe some of the energy_source_code_num stuff in the function was to aid in associating the non-matching ESC record's data to only the primary energy source code record instead of sloshing it across the full plant.
But you are very right that this should be removed. I tested it a few times and tailoring this to the PM-codes only seems to work great.
.pipe( | ||
pudl.transform.classes.drop_invalid_rows, | ||
pudl.transform.classes.InvalidRows( | ||
required_valid_cols=data_columns, invalid_values=[pd.NA, np.nan, 0.0] | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we do want to keep zero values here. A zero is still data about what was being consumed, and I'm wondering if this could be valuable. Have you tried running it both ways to see if the outputs differ (if you keep 0s in this step)?
return gens_at_freq | ||
|
||
|
||
def identify_missing_gf_escs_in_gens(gens_at_freq, gf, bf): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm it's been a while since I've looked at this so I forget exactly why I dropped the zeros, but it might have had something to do with escs associated with proposed or retired generators? What were the output metrics that improved?
Also, with the addition of the bf table and removing the zero dropping, you might need to update the docstring and inline comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no blocking concerns! I think some more tests would be nice, but sounds like you're planning on adding them in a separate PR - which is fine, though if there's a chance you'd end up punting on those changes I'd like to see a couple more tests showing example inputs/outputs. Otherwise I'm worried we'll run into the "whoops, put this down for a couple months and the behavior's all wonky" situation again.
Before I approve, though, I'd want to hop on a call and just put the code through its paces by hand with you - I was having some trouble getting it all to run in an ipython notebook, and couldn't get the test case to hit the _allocate_unassociated_bf_records
with any unassociated BF records. I was sort of banging my head against that for a bit but figured a call would be faster!
indicator=True, # used in _allocate_unassociated_records to find unassocited | ||
) | ||
.pipe(remove_inactive_generators) | ||
.pipe( | ||
_allocate_unassociated_records, | ||
idx_cols=IDX_PM_ESC, | ||
col_w_unexpected_codes="prime_mover_code", | ||
data_columns=[f"{col}_gf_tbl" for col in DATA_COLUMNS], | ||
) | ||
.drop(columns=["_merge"]) # drop do we can do this again in the bf_summed merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should probably raise an exception if we encounter unassociated records - we don't expect any to show up, now that we've added the missing ESCs, and so I'd like it to fail loudly when something does sneak through!
.pipe( | ||
pudl.transform.classes.drop_invalid_rows, | ||
pudl.transform.classes.InvalidRows( | ||
required_valid_cols=data_columns, invalid_values=[pd.NA, np.nan, 0.0] | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the benefit to dropping the 0s vs keeping them? In past environments I've run into lots of problems when we don't distinguish between "we don't know what happened" and "we do know what happened - it was nothing." But I know much less about the actual data + use cases than either of you two.
@@ -320,7 +318,7 @@ def test_missing_energy_source(): | |||
"""report_date,plant_id_eia,energy_source_code,prime_mover_code,net_generation_mwh,fuel_consumed_mmbtu,fuel_consumed_for_electricity_mmbtu | |||
2019-01-01,8023,DFO,ST,3369.286,35566.0,35566.0 | |||
2019-01-01,8023,RC,ST,5363193.71,56777578.0,56777578.0 | |||
2019-01-01,8023,SUB,ST,0.0, 0.0,0.0 | |||
2019-01-01,8023,SUB,ST,10000.0, 100000.0,100000.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you leave this at 0?
idx_cols=IDX_GENS_PM_ESC, | ||
col_w_unexpected_codes="energy_source_code", | ||
col_w_unexpected_codes="prime_mover_code", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also curious about this!
Take-aways from call with @cmgosnell last week:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad you were able to pull out the ESC stuff from the _allocate_unassociated...
function!
The new tests test some good behavior also, sweet! I think we can make them a bit more maintainable/clearer, so that when we pick this up again some months from now we aren't kicking ourselves. I'll spend a bit of time on some concrete suggestion code and see where that goes.
edit: here's a diff!
@@ -271,7 +271,8 @@ def test_allocated_sums_match(example_1_pudl_tabl): | |||
# ) | |||
|
|||
|
|||
def test_missing_energy_source(): | |||
@pytest.fixture | |||
def example_2_pudl_tabl(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulling this out into a fixture makes a lot of sense!
I think the new tests make the example_1
tests basically redundant - example_2
touches on the allocation sums, and they also touch on the fuel ratios too. The only thing is that example_2
already has the bonus ESCs - so it tests a slightly less happy path. One thing we could do is have the following structure:
- nix
example_1
- rename
example_2
asone_plant_happy_path
or something, and don't put in the extra ESCs or PM codes just yet - make new fixtures that add the extra ESCs or PM codes (
extra_esc_in_gf
,extra_pm_in_bf
) - use the fixtures in parameterized test that tests that the sums all make sense
- use the fixtures in parameterized test that tests that the ratios are correct for the happy path & the
extra_esc_in_gf
case - add a separate SPECIAL TEST for the weird PM code behavior we found
I'll put a bit of time into trying that structure out to see if that works!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are great suggestions! I like the option you seemed to land on in the diff you shared with parameterizing the various individual tests and pulling out a special test for the PM code problem.
return ratio_bf, ratio_allocated | ||
|
||
|
||
def test_missing_energy_source(example_2_pudl_tabl): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is sort of testing two different things:
- are we adding the ESCs in the data cleaning step?
- are we using the new ESCs correctly?
IMO testing the first bullet is kind of useful but more of an implementation detail - what we care about is defining the behavior we expect from the main interface, and I don't necessarily want to write tests that will fail when we make a change that doesn't actually affect the behavior we care about.
That being said, I can see the value in calling out this specific step in tests. In any case, we should probably split this up into two different tests & maybe only keep one. Here's one way to do it...
diff --git a/test/unit/analysis/allocate_net_gen_test.py b/test/unit/analysis/allocate_net_gen_test.py
index 6565ecc1..1d4c361a 100644
--- a/test/unit/analysis/allocate_net_gen_test.py
+++ b/test/unit/analysis/allocate_net_gen_test.py
@@ -369,12 +369,15 @@ def get_ratio_from_bf_and_allocated_by_boiler(
return ratio_bf, ratio_allocated
-def test_missing_energy_source(example_2_pudl_tabl):
- gf, bf, bga, gens, _ = allocate_net_gen.extract_input_tables(example_2_pudl_tabl)
+def test_add_missing_energy_source(example_2_pudl_tabl):
+ gf, bf, _, gens, _ = allocate_net_gen.extract_input_tables(example_2_pudl_tabl)
gens = allocate_net_gen.add_missing_energy_source_codes_to_gens(gens, gf, bf)
# assert that the missing energy source code is RC
assert gens.energy_source_code_8.unique() == "RC"
+
+def test_allocate_missing_energy_source(example_2_pudl_tabl):
+ _, bf, bga, _, _ = allocate_net_gen.extract_input_tables(example_2_pudl_tabl)
allocated = allocate_net_gen.allocate_gen_fuel_by_generator_energy_source(
example_2_pudl_tabl
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I added these various stages in part to have a better ability to pinpoint possible errors in this allocation process. Bc the data does through so many stages, a time-suck in working with this module is often just figuring out where something went wrong. So while I 100% agree that the most important thing to test is the overall behavior, testing the stages feels like a helper addition imo.
assert ratio_bf == ratio_allocated | ||
|
||
|
||
def test_missing_pm_code_in_bf(example_2_pudl_tabl): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment as above - this looks like two separate tests as well. Because it's documenting some weird behavior, I'd probably keep both.
.pipe(remove_inactive_generators) | ||
.pipe( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: I remember you mentioning that some weird data slipped through without this. What was going on, and why did double-applying the _allocate_unassociated_pm_records
method fix it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i had originally thought the addition of greg's "add all the ESC's" methodology would possibly eliminate the need for this step of associated the unassociated records post this gf merge. I thought i tested it and determined that this was indeed the case, but after this thread, I added back in an assertion to make sure and sure enough there were a small number of records that were unassociated.
Bc we have all of the ESC's, I knew it couldn't be the ESC that were mismatched. The other merge key here is the PM code, so I applied _allocate_unassociated_pm_records
.
thank you for these suggestions @jdangerx !! this may be a basic question, but is there a way to directly integrate your gist diffs that you shared?? I like the way you broke them out and am inclined to use your suggestions directly but i've never really played with gists. |
@cmgosnell I'm glad you like the changes! You should be able to download the file and then run |
PR Overview
Final touches on @grgmiller 's pr to fix for the pm codes + several other bugs!
I'm going to suggest that we merge this in with the following scope and move the remaining tasks into #1113:
In scope:
Out-of-scope
PR Checklist
dev
).