Check for incorrect primary key columns in _tidy_class_dfs #2638

Closed
4 tasks done
zaneselvans opened this issue Jun 6, 2023 · 2 comments · Fixed by #2648
Assignees
zaneselvans
Labels
eia861 Anything having to do with EIA Form 861

Comments


zaneselvans commented Jun 6, 2023

The function pudl.transform.eia861._tidy_class_dfs() will silently drop records if given an incomplete list of primary key columns, and the resulting reshaped dataframe will have unique values for the (incorrect) primary keys that were specified, as pointed out by @christiantfong in #2636 and fixed (for that one case) in #2637.

We should implement a check in that function that prevents this behavior, since it's used 20 times across the EIA-861 transforms, and it would be easy for another incorrect set of PK columns to slip by us.

Tables affected:

  • EIA-861 Demand Response: 40/19120 (0.21%) records with duplicated PKs
  • EIA-861 Sales: 402/417180 (0.10%) records with duplicated PKs
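
For illustration, here is a minimal, hypothetical sketch of the failure mode (the column names are made up for this example, not the exact EIA-861 schema): two rows that differ only in the omitted key column look like duplicates on the incomplete key, so one of them silently disappears.

    import pandas as pd

    # Two hypothetical records that differ only in their BA code.
    df = pd.DataFrame({
        "utility_id_eia": [1, 1],
        "state": ["CO", "CO"],
        "report_date": ["2020-01-01", "2020-01-01"],
        "balancing_authority_code_eia": ["PSCO", "WACM"],
        "sales_mwh": [100.0, 200.0],
    })

    # An incomplete PK that omits the BA code column...
    incomplete_pk = ["utility_id_eia", "state", "report_date"]

    # ...makes both rows look like duplicates, so one is silently dropped
    # and its sales quietly vanish from the output.
    deduped = df.drop_duplicates(subset=incomplete_pk)
    assert len(deduped) == 1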


zaneselvans commented Jun 7, 2023

Checking that the idx_cols are actually valid primary key cols is tricky, because NA values in the BA Code column create actual duplicate records. These NA values are filled with UNK and the duplicates are later dropped, which means we are losing data. But it's not obvious how we could actually fill in real BA codes. In sales_eia861 there are about 67 duplicate records out of several hundred thousand. Dropping the NA BA Code values fixes all of this... Grr.

If it weren't for that problem we could just do:

    if not data_cols.index.is_unique:
        # Count the rows whose primary key combination appears more than once.
        n_dupes = data_cols.reset_index().duplicated(subset=idx_cols).sum()
        raise AssertionError(f"Found {n_dupes} non-unique rows")

Maybe we can just check that it's a very small number of duplicate records? Or maybe we should actually be dropping these rows with NA values in the PK fields, since that's going to cause pain elsewhere.
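
For example, a guard along those lines might tolerate only the duplicates that can be traced back to NA primary key values. This is a sketch only, reusing the data_cols / idx_cols variables from the snippet above; the exact column handling is an assumption, not the actual PUDL code:

    # Flag every row involved in a PK collision, then excuse the ones where
    # the collision is explained by an NA value in one of the key columns.
    keys = data_cols.reset_index()
    dupes = keys.duplicated(subset=idx_cols, keep=False)
    na_keys = keys[idx_cols].isna().any(axis=1)
    unexplained = dupes & ~na_keys
    if unexplained.any():
        raise AssertionError(
            f"Found {unexplained.sum()} duplicate rows not explained by NA PK values."
        )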

@zaneselvans zaneselvans self-assigned this Jun 7, 2023
zaneselvans commented

After talking to @e-belfer about this a bit, it seems like the best short-run fix is probably to aggregate these records with NA values for the BA codes into a single record, rather than dropping them. This will at least keep the aggregate sales numbers equal to the original data, unlike the current situation where that data is dropped.

Will make this a focused fix that won't aggregate records that are duplicated for any other reason, so we don't unknowingly lump data together.
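
As a rough sketch of that approach (the function name and column handling are assumptions for illustration, not the code in the linked PR, and the frame is assumed to hold only key and data columns), the NA BA code records could be filled with UNK and summed within each remaining key, while every other row passes through untouched:

    import pandas as pd

    def aggregate_unknown_ba(
        df: pd.DataFrame, idx_cols: list[str], data_cols: list[str]
    ) -> pd.DataFrame:
        """Roll up records with unknown BA codes instead of dropping them."""
        df = df.copy()
        df["balancing_authority_code_eia"] = (
            df["balancing_authority_code_eia"].fillna("UNK")
        )
        is_unk = df["balancing_authority_code_eia"] == "UNK"
        # Only the UNK records get aggregated; all other rows are returned
        # unchanged, so duplicates arising for other reasons aren't lumped in.
        rolled_up = df[is_unk].groupby(idx_cols, as_index=False)[data_cols].sum()
        return pd.concat([df[~is_unk], rolled_up], ignore_index=True)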

@zaneselvans zaneselvans moved this from New to In review in Catalyst Megaproject Jun 8, 2023
@zaneselvans zaneselvans linked a pull request Jun 8, 2023 that will close this issue (#2648)
@github-project-automation github-project-automation bot moved this from In review to Done in Catalyst Megaproject Jun 14, 2023