
Add records correcting FERC 1 calculations that are off #2620

Merged
merged 9 commits from calc-correction-records into xbrl_meta_reshape on Jun 2, 2023

Conversation

zaneselvans
Member

PR Overview

  • Fix a bad reference to old XBRL Metadata asset names in our FERC 1 ETL debugging notebook.
  • Add a generic {xbrl_factoid}_correction component to all calculations in the process_xbrl_metadata() method of the FERC 1 table transformers.
  • Add a step to the check_table_calculations method that calculates the difference between the reported totals and our calculated totals and adds corrective records so that these totals match.
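The corrective-record step described above can be sketched roughly like this (column names such as `reported_value`, `calculated_value`, and `correction` are illustrative, not necessarily PUDL's actual schema, and the real reconciliation logic in `check_table_calculations` may differ):

```python
import numpy as np
import pandas as pd

def add_correction_records(df: pd.DataFrame) -> pd.DataFrame:
    """Append a correction record for each row whose reported and
    calculated totals aren't close, so the totals reconcile.

    Illustrative sketch only: PUDL's actual column names and
    correction logic may differ.
    """
    # Only rows that fail np.isclose() get corrections.
    off_df = df[~np.isclose(df["reported_value"], df["calculated_value"])].copy()
    corrections = off_df.copy()
    # The correction is whatever amount makes calculated match reported.
    corrections["correction"] = off_df["reported_value"] - off_df["calculated_value"]
    corrections["xbrl_factoid"] = off_df["xbrl_factoid"] + "_correction"
    corrections["record_id"] = pd.NA  # corrections have no source record
    corrections["xbrl_row_type"] = "correction"
    return pd.concat([df, corrections], ignore_index=True)
```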

Questions:

  • It seemed wrong to have any value of record_id in the correction records since they have no source record, so I set it to NA. Are there any downstream consequences for this?
  • It seemed wrong to say that the correction records had xbrl_row_type value of reported_value or calculated_value so I set them to correction and updated the ENUM constraint on that field.
  • What set of calculations do we want to create correction records for? Right now they're only being generated for the ones that fail the np.isclose() test (and end up in off_df), but we say "match exactly" in some places, and there are lots of records which are off by just one dollar -- none of which will get corrections in the current setup. That seems fine, even though there will technically still be some difference between the calculated and reported values.
  • Do we want to change the name of check_table_calculations to correct_table_calculations or something else that indicates we're actually altering the database tables?
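The ENUM change in the second question above might look roughly like this simplified sketch (the set-based validation shown here is illustrative and not PUDL's actual metadata machinery; only the `xbrl_row_type` values come from the discussion):

```python
# Simplified sketch of extending an ENUM constraint so the new
# "correction" row type is allowed alongside the existing values.
XBRL_ROW_TYPES: set[str] = {"reported_value", "calculated_value"}

# After the change described above:
XBRL_ROW_TYPES = XBRL_ROW_TYPES | {"correction"}

def validate_row_type(value: str) -> str:
    """Reject any xbrl_row_type not in the allowed ENUM values."""
    if value not in XBRL_ROW_TYPES:
        raise ValueError(f"invalid xbrl_row_type: {value!r}")
    return value
```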

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing (run tests locally to debug failures).
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@zaneselvans zaneselvans requested a review from cmgosnell June 2, 2023 00:50
@zaneselvans zaneselvans added this to the 2023 Spring milestone Jun 2, 2023
@zaneselvans zaneselvans added the following labels on Jun 2, 2023: ferc1 (anything having to do with FERC Form 1), data-repair (interpolating or extrapolating data that we don't actually have), metadata (anything having to do with the content, formatting, or storage of metadata; mostly datapackages), rmi, and xbrl (related to the FERC XBRL transition)
Member

@cmgosnell cmgosnell left a comment


one suggestion for fillna-ing & a nit/musing about a slight simplification. but overall this looks great

It seemed wrong to have any value of record_id in the correction records since they have no source record, so I set it to NA. Are there any downstream consequences for this?

I'm nervous about downstream PK/merge problems, even though we don't typically use the record_id as a PK or a post-transform merge key, so maybe it'll be fine 🤷🏻. I'm personally not opposed to the idea of using the record_id of the calculated value, because this is truly the source.

It seemed wrong to say that the correction records had xbrl_row_type value of reported_value or calculated_value so I set them to correction and updated the ENUM constraint on that field.

this seems like a good change.

What set of calculations we want to create correction records for? Right now they're only being generated for the ones that fail the np.isclose() test (and end up in off_df), but we say "match exactly" in some places, and there are lots of records which are off by just one dollar -- none of which will get corrections in the current setup, which seems fine, even though there will technically still be some difference between the calculated and reported values.

I am of two minds about this. I like the way you did it. But also having a correction be applied across the board no matter what seems reasonable and maybe more informative to users (i.e. getting the $1 or mostly $0 corrections).

Do we want to change the name of check_table_calculations to correct_table_calculations or something else that indicates we're actually altering the database tables?

this might be a good idea. it's a lil weird bc we are indeed mostly checking, but it's a good edit.

src/pudl/transform/ferc1.py Outdated Show resolved Hide resolved
src/pudl/transform/ferc1.py Outdated Show resolved Hide resolved
Base automatically changed from check_calcs_in_transformers to xbrl_meta_reshape June 2, 2023 13:34
@zaneselvans
Member Author

I am of two minds about this. I like the way you did it. But also having a correction be applied across the board no matter what seems reasonable and maybe more informative to users (i.e. getting the $1 or mostly $0 corrections).

If we correct these apparent rounding errors, there will be a lot of corrections, and they'll be correcting values that are off by only 1 part in a million. But they won't always show up, because there will be many cases in which nothing was off by enough to trip up np.isclose() -- and in those cases no corrections of any kind would be generated.

Unless we store these dollar values as decimal numbers (rather than floating point -- a decimal number is just an integer number of cents, with the baked-in assumption that there are 2 digits after the decimal point), the only reasonable way to compare them is with something like np.isclose(). I think correcting only those values that np.arentclose() will be clearer -- they'll be real corrections of real reporting issues, rather than a ton of fixes, most of which aren't important.
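To illustrate the tolerance question: with np.isclose()'s default relative tolerance (rtol=1e-05), a million-dollar total that is off by one dollar still counts as "close" and would never receive a correction, while the same $1 discrepancy on a small value would be flagged:

```python
import numpy as np

# A $1 discrepancy on a million-dollar total is within np.isclose()'s
# default relative tolerance (rtol=1e-05), so no correction is generated:
assert np.isclose(1_000_000.0, 1_000_001.0)

# The same $1 discrepancy on a $100 value exceeds the relative
# tolerance, so it would be flagged and corrected:
assert not np.isclose(100.0, 101.0)

# Exact float equality is too brittle for sums of many values,
# which is why the comparison uses np.isclose() in the first place:
assert 0.1 + 0.2 != 0.3
assert np.isclose(0.1 + 0.2, 0.3)
```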

I'm personally not opposed to the idea of using the record_id of the calculated value because this is truly the source.

I can see how this isn't crazy, but it could also end up creating additional tables with non-unique record_id values for reasons that have nothing to do with reshaping the data. It doesn't appear to right now though. I really want to change this to source_record_id across the board so folks don't ever try and use it as a unique ID for the tables you find it in but that's another story. I just don't want anyone to think that these correction values come from any original data source, which is kinda what having a record_id indicates to me.

Should I go ahead and rename check_table_calculations to something that indicates both checking and correction is happening?

And do we truly always want corrections to be added?

@cmgosnell
Member

using np.arentclose() as the threshold for whether to make a correction sounds grand to me

Should I go ahead and rename check_table_calculations to something that indicates both checking and correction is happening?

Sure!

And do we truly always want corrections to be added?

If we end up not wanting to we can always add another param into CheckTableCalculations

I renamed check_table_calculations / CheckTableCalculations to reconcile_table_calculations / ReconcileTableCalculations because we're not just checking the values now -- we're altering the data in many cases to include correction records, and accounting reconciliation has a more active "make these things match" connotation.

Also now using `fillna()` when generating the correction records, so that if we're correcting a calculated value that's NA we actually get a real correction (we're already treating NA values as 0.0 in the aggregations, so this is internally self-consistent).
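The `fillna()` behavior described above can be sketched like this (the series names and the subtraction are illustrative, not the actual PUDL implementation):

```python
import numpy as np
import pandas as pd

reported = pd.Series([100.0, 80.0])
calculated = pd.Series([90.0, np.nan])  # second calculation produced no value

# Without fillna, an NA calculated value propagates: 80.0 - NA -> NA,
# so the record would get no usable correction.
naive = reported - calculated

# Treating NA as 0.0 (consistent with how NAs are already handled in
# the aggregations) yields a real correction for the second record.
corrected = reported - calculated.fillna(0.0)
```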
@zaneselvans zaneselvans self-assigned this Jun 2, 2023
@codecov

codecov bot commented Jun 2, 2023

Codecov Report

Patch coverage: 94.4% and no project coverage change.

Comparison is base (fe4fc9e) 87.0% compared to head (274acad) 87.1%.

Additional details and impacted files
@@                Coverage Diff                @@
##           xbrl_meta_reshape   #2620   +/-   ##
=================================================
  Coverage               87.0%   87.1%           
=================================================
  Files                     84      84           
  Lines                   9861    9878   +17     
=================================================
+ Hits                    8588    8605   +17     
  Misses                  1273    1273           
Impacted Files Coverage Δ
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/ferc1.py 100.0% <ø> (ø)
src/pudl/transform/params/ferc1.py 100.0% <ø> (ø)
src/pudl/transform/ferc1.py 96.6% <94.4%> (-0.2%) ⬇️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.

@zaneselvans zaneselvans merged commit f96ee5b into xbrl_meta_reshape Jun 2, 2023
@zaneselvans zaneselvans deleted the calc-correction-records branch June 5, 2023 19:35