Check for incorrect primary key columns in _tidy_class_dfs #2638
Comments
Checking that the idx_cols are actually valid primary key columns is tricky, because NA values in the BA Code column create actual duplicate values. These NA values are filled with UNK and the duplicates are later dropped, which means we are losing data. But it's not obvious how we could fill in real BA codes. In sales_eia861 there are about 67 duplicate records out of several hundred thousand. Dropping the NA BA Code values fixes all this... Grr. If it weren't for that problem we could just do:

    if not data_cols.index.is_unique:
        raise AssertionError(
            f"Found {data_cols.reset_index().duplicated(idx_cols).sum()} non-unique rows"
        )

Maybe we can just check that it's a very small number of duplicate records? Or maybe we should actually be dropping these rows with NA values in the PK fields, since that's going to cause pain elsewhere.
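The "very small number of duplicates" idea could look roughly like the sketch below. The helper name `check_pk_duplicates`, the `max_dupes` parameter, and the toy column names are hypothetical and not part of PUDL; the sketch only shows how a tolerance could be attached to the uniqueness check.

```python
import pandas as pd


def check_pk_duplicates(
    df: pd.DataFrame, idx_cols: list[str], max_dupes: int = 0
) -> int:
    """Count rows repeating an earlier row's idx_cols values (hypothetical helper).

    Raises AssertionError if the count exceeds max_dupes, otherwise returns it.
    """
    # duplicated() marks every repeat after the first occurrence as True,
    # so summing the boolean mask gives the number of duplicate rows.
    n_dupes = int(df.duplicated(subset=idx_cols).sum())
    if n_dupes > max_dupes:
        raise AssertionError(
            f"Found {n_dupes} non-unique rows (allowed: {max_dupes})"
        )
    return n_dupes


# Toy data: two rows collide once their missing BA code is filled with UNK.
example = pd.DataFrame(
    {
        "utility_id_eia": [1, 1, 2, 2],
        "ba_code": ["UNK", "UNK", "PJM", "MISO"],
        "sales_mwh": [10.0, 5.0, 7.0, 3.0],
    }
)
n_dupes = check_pk_duplicates(example, ["utility_id_eia", "ba_code"], max_dupes=1)
```

With `max_dupes=0` the same call raises, which is the strict behavior of the original snippet.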
After talking to @e-belfer about this a bit, it seems like the best short-run fix is probably to aggregate these records with NA values for the BA codes into a single record, rather than dropping them. This will at least keep the aggregate sales numbers equal to the original data, unlike the current situation where that data is dropped. Will make this a focused fix that won't aggregate records that are duplicated for any other reason, so we don't unknowingly lump data together.
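A minimal sketch of that focused aggregation, under some assumptions: the helper name `aggregate_na_ba_duplicates`, the column names, and the choice to sum every non-key column are illustrative only, and the sketch assumes all non-key columns are numeric. It is not the fix that was actually merged.

```python
import pandas as pd


def aggregate_na_ba_duplicates(
    df: pd.DataFrame,
    idx_cols: list[str],
    ba_col: str = "ba_code",
    fill: str = "UNK",
) -> pd.DataFrame:
    """Sum records that collide on idx_cols only because ba_col was NA.

    Hypothetical helper. Rows duplicated for any other reason are left
    untouched so they can still be caught by a uniqueness check.
    """
    na_ba = df[ba_col].isna()
    filled = df.assign(**{ba_col: df[ba_col].fillna(fill)})
    # Restrict aggregation to rows whose BA code was missing AND which
    # collide with another row on the key columns after filling.
    to_agg = na_ba & filled.duplicated(subset=idx_cols, keep=False)
    summed = (
        filled[to_agg]
        .groupby(idx_cols, as_index=False, sort=False, dropna=False)
        .sum(numeric_only=True, min_count=1)
    )
    return pd.concat([filled[~to_agg], summed], ignore_index=True)


# Toy data: utility 1 has two NA-BA rows; utility 2 has a genuine duplicate.
sales = pd.DataFrame(
    {
        "utility_id_eia": [1, 1, 2, 2],
        "ba_code": [None, None, "PJM", "PJM"],
        "sales_mwh": [10.0, 5.0, 7.0, 3.0],
    }
)
result = aggregate_na_ba_duplicates(sales, ["utility_id_eia", "ba_code"])
```

In the toy data the two NA rows collapse into one UNK row while total sales are preserved, and the PJM duplicate is deliberately left in place.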
The function pudl.transform.eia861._tidy_class_dfs() will silently drop records if given an incomplete list of primary key columns, and the resulting reshaped dataframe will have unique values for the (incorrect) primary keys that were specified. This was pointed out by @christiantfong in #2636 and fixed (for that one case) in #2637. We should implement a check in that function which prevents this behavior, since it's used 20 times all across the EIA-861 transforms, and it would be easy for another incorrect set of PK columns to have slipped by us.
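To illustrate the failure mode in isolation, here is a toy sketch. It uses `drop_duplicates` as a stand-in for whatever step inside the real function discards the rows (an assumption, the actual mechanism may differ), and the column names and the `require_unique_keys` helper are hypothetical.

```python
import pandas as pd

# Two distinct records that differ only in a column missing from the key list.
raw = pd.DataFrame(
    {
        "utility_id_eia": [1, 1],
        "state": ["CO", "WY"],
        "report_date": ["2020-01-01", "2020-01-01"],
        "sales_mwh": [10.0, 5.0],
    }
)
incomplete_pk = ["utility_id_eia", "report_date"]
complete_pk = ["utility_id_eia", "state", "report_date"]

# Stand-in for the silent drop: deduplicating on the incomplete key.
deduped = raw.drop_duplicates(subset=incomplete_pk)
keys_look_unique = deduped.set_index(incomplete_pk).index.is_unique
rows_lost = len(raw) - len(deduped)


def require_unique_keys(df: pd.DataFrame, idx_cols: list[str]) -> None:
    """Fail loudly before any reshaping if idx_cols do not identify rows."""
    dupes = df.duplicated(subset=idx_cols, keep=False)
    if dupes.any():
        raise AssertionError(
            f"{int(dupes.sum())} rows share values for idx_cols={idx_cols}"
        )


require_unique_keys(raw, complete_pk)  # passes: the full key is unique
```

Running the guard on the input, before anything is dropped, is the point: once the rows are gone, the output keys look unique and a check on the result cannot detect the loss.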
Tables affected:
Fixes