Check for incorrect primary key columns in _tidy_class_dfs #2638

Closed
4 tasks done
zaneselvans opened this issue Jun 6, 2023 · 2 comments · Fixed by #2648
Assignees
zaneselvans
Labels
eia861 Anything having to do with EIA Form 861

Comments


zaneselvans commented Jun 6, 2023

The function pudl.transform.eia861._tidy_class_dfs() will silently drop records if given an incomplete list of primary key columns, and the resulting reshaped dataframe will have unique values for the (incorrect) primary keys that were specified, as pointed out by @christiantfong in #2636 and fixed (for that one case) in #2637.

We should implement a check in that function that prevents this behavior, since it's used 20 times across the EIA-861 transforms, and it would be easy for another incorrect set of PK columns to slip by us.

Tables affected:

  • EIA-861 Demand Response: 40/19120 (0.21%) records with duplicated PKs
  • EIA-861 Sales: 402/417180 (0.10%) records with duplicated PKs
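
For illustration, here is a minimal, hypothetical sketch of the failure mode (the column names are made up for this example, not the exact EIA-861 schema): two rows that differ only in the omitted key column look like duplicates on the incomplete key, so one of them silently disappears.

    import pandas as pd

    # Two hypothetical records that differ only in their BA code.
    df = pd.DataFrame({
        "utility_id_eia": [1, 1],
        "state": ["CO", "CO"],
        "report_date": ["2020-01-01", "2020-01-01"],
        "balancing_authority_code_eia": ["PSCO", "WACM"],
        "sales_mwh": [100.0, 200.0],
    })

    # An incomplete PK that omits the BA code column...
    incomplete_pk = ["utility_id_eia", "state", "report_date"]

    # ...makes both rows look like duplicates, so one is silently dropped
    # and its sales quietly vanish from the output.
    deduped = df.drop_duplicates(subset=incomplete_pk)
    assert len(deduped) == 1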


zaneselvans commented Jun 7, 2023

Checking that the idx_cols are actually valid primary key cols is tricky, because NA values in the BA Code column create actual duplicate records. These NA values are filled with UNK and the duplicates are later dropped, which means we are losing data. But it's not obvious how we could actually fill in real BA codes. In sales_eia861 there are about 67 duplicate records out of several hundred thousand. Dropping the NA BA Code values fixes all of this... Grr.

If it weren't for that problem we could just do:

    if not data_cols.index.is_unique:
        # Count the rows whose primary key combination appears more than once.
        n_dupes = data_cols.reset_index().duplicated(subset=idx_cols).sum()
        raise AssertionError(f"Found {n_dupes} non-unique rows")

Maybe we can just check that it's a very small number of duplicate records? Or maybe we should actually be dropping these rows with NA values in the PK fields, since that's going to cause pain elsewhere.
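
For example, a guard along those lines might tolerate only the duplicates that can be traced back to NA primary key values. This is a sketch only, reusing the data_cols / idx_cols variables from the snippet above; the exact column handling is an assumption, not the actual PUDL code:

    # Flag every row involved in a PK collision, then excuse the ones where
    # the collision is explained by an NA value in one of the key columns.
    keys = data_cols.reset_index()
    dupes = keys.duplicated(subset=idx_cols, keep=False)
    na_keys = keys[idx_cols].isna().any(axis=1)
    unexplained = dupes & ~na_keys
    if unexplained.any():
        raise AssertionError(
            f"Found {unexplained.sum()} duplicate rows not explained by NA PK values."
        )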

@zaneselvans zaneselvans self-assigned this Jun 7, 2023
zaneselvans commented

After talking to @e-belfer about this a bit, it seems like the best short-run fix is probably to aggregate these records with NA values for the BA codes into a single record, rather than dropping them. This will at least keep the aggregate sales numbers equal to the original data, unlike the current situation where that data is dropped.

Will make this a focused fix that won't aggregate records that are duplicated for any other reason, so we don't unknowingly lump data together.
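
As a rough sketch of that approach (the function name and column handling are assumptions for illustration, not the code in the linked PR, and the frame is assumed to hold only key and data columns), the NA BA code records could be filled with UNK and summed within each remaining key, while every other row passes through untouched:

    import pandas as pd

    def aggregate_unknown_ba(
        df: pd.DataFrame, idx_cols: list[str], data_cols: list[str]
    ) -> pd.DataFrame:
        """Roll up records with unknown BA codes instead of dropping them."""
        df = df.copy()
        df["balancing_authority_code_eia"] = (
            df["balancing_authority_code_eia"].fillna("UNK")
        )
        is_unk = df["balancing_authority_code_eia"] == "UNK"
        # Only the UNK records get aggregated; all other rows are returned
        # unchanged, so duplicates arising for other reasons aren't lumped in.
        rolled_up = df[is_unk].groupby(idx_cols, as_index=False)[data_cols].sum()
        return pd.concat([df[~is_unk], rolled_up], ignore_index=True)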

@zaneselvans zaneselvans moved this from New to In review in Catalyst Megaproject Jun 8, 2023
@zaneselvans zaneselvans linked a pull request Jun 8, 2023 that will close this issue (#2648)
@github-project-automation github-project-automation bot moved this from In review to Done in Catalyst Megaproject Jun 14, 2023