Change SEC 10-K table schemas to fix FK errors and use quarterly naming. #4046

Merged
4 commits merged into main from sec10k-schema-fixes on Feb 6, 2025

Conversation

@zaneselvans commented Feb 5, 2025

Overview

  • Nightly builds failed with FK constraint errors in the new SEC 10-K tables.
  • These are mostly due to the temporal resolution mismatch between the SEC 10-K tables (quarterly) and the core_eia860__scd_utilities table (annual).
  • In addition, the new quarterly tables didn't follow our existing naming conventions, which indicate the temporal frequency in the name.
  • However, there are a handful of FK constraint violations even when the FK constraint only includes utility_id_eia.
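
To make the temporal mismatch concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names `annual_utilities` and `quarterly_filings`, their columns, and the dates are invented for illustration and are not the real PUDL schemas. It counts child rows with no matching parent, first on `(utility_id_eia, report_date)` and then on `utility_id_eia` alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical annual parent table: one row per utility per year.
    CREATE TABLE annual_utilities (
        utility_id_eia INTEGER,
        report_date TEXT,
        PRIMARY KEY (utility_id_eia, report_date)
    );
    -- Hypothetical quarterly child table.
    CREATE TABLE quarterly_filings (
        filename TEXT PRIMARY KEY,
        utility_id_eia INTEGER,
        report_date TEXT
    );
    INSERT INTO annual_utilities VALUES (1, '2023-01-01'), (2, '2023-01-01');
    INSERT INTO quarterly_filings VALUES
        ('a.txt', 1, '2023-01-01'),
        ('b.txt', 1, '2023-04-01'),
        ('c.txt', 2, '2023-07-01');
    """
)


def count_orphans(conn, key_cols):
    """Count quarterly rows with no parent row matching on key_cols."""
    on_clause = " AND ".join(f"q.{col} = a.{col}" for col in key_cols)
    sql = f"""
        SELECT COUNT(*)
        FROM quarterly_filings AS q
        LEFT JOIN annual_utilities AS a ON {on_clause}
        WHERE a.utility_id_eia IS NULL
    """
    return conn.execute(sql).fetchone()[0]


# Dates other than January 1st have no annual counterpart.
composite_orphans = count_orphans(conn, ["utility_id_eia", "report_date"])
# Every utility ID exists in the parent table, so nothing is orphaned.
id_only_orphans = count_orphans(conn, ["utility_id_eia"])
print(composite_orphans, id_only_orphans)
```

In this toy data the composite key orphans the two non-January rows, while the ID-only key orphans none. The real tables differ in that a few records still fail the ID-only check.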

What did you change?

  • Excluded the autogenerated FK constraint between these tables as we have for many of our monthly tables that reference utilities.
  • Updated the names of the SEC 10-K core tables to reflect their quarterly frequency.
  • Dropped the 3 records with the unrecognized utility_id_eia (3579).

Documentation

Testing

  • Materialized the sec10k tables locally and ran pudl_check_fks and got the same failure as the nightly builds.
  • Updated the DB schema and FK constraints, rematerialized, and ran pudl_check_fks again and got just 3 records violating the constraint.
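
For readers unfamiliar with this kind of check, the sketch below shows one way to list FK violations in a SQLite database using the built-in `PRAGMA foreign_key_check`. This thread does not show how pudl_check_fks is implemented, so treat this only as an illustration of the idea. The tables `utilities` and `company_info` and the IDs 100 and 200 are made up; 3579 is the ID discussed in this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Leave enforcement off so the bad row can be loaded, then audited.
    PRAGMA foreign_keys = OFF;
    CREATE TABLE utilities (utility_id_eia INTEGER PRIMARY KEY);
    CREATE TABLE company_info (
        company_id TEXT PRIMARY KEY,
        utility_id_eia INTEGER REFERENCES utilities (utility_id_eia)
    );
    INSERT INTO utilities VALUES (100), (200);
    INSERT INTO company_info VALUES
        ('co_a', 100),
        ('co_b', 3579),  -- no such utility in the parent table
        ('co_c', NULL);  -- NULL keys do not violate the constraint
    """
)

# Each result row is (child table, rowid, parent table, FK index).
violations = conn.execute("PRAGMA foreign_key_check").fetchall()

# Look up the offending key values by rowid.
bad_ids = [
    conn.execute(
        "SELECT utility_id_eia FROM company_info WHERE rowid = ?", (rowid,)
    ).fetchone()[0]
    for _table, rowid, _parent, _fk_index in violations
]
print(violations, bad_ids)
```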

To-do list

@zaneselvans added the docs, sqlite, metadata, schema-change, and sec10k labels Feb 5, 2025
@zaneselvans self-assigned this Feb 5, 2025
@zaneselvans requested a review from zschira February 5, 2025 17:30
@zaneselvans

@zschira can you provide more extensive table descriptions for these new guys?

@zaneselvans

It looks like the offending utility_id_eia is 3579 aka Cirro Group, Inc in Texas.

| | report_date | filename_sec10k | company_id_sec10k | utility_id_eia | company_name_raw | location_of_incorporation | parent_company_central_index_key | files_sec10k |
|---|---|---|---|---|---|---|---|---|
| 499327 | 2010-02-26 00:00:00 | edgar/data/715957/0001193125-10-042883.txt | 0000715957_cirro group incorporated_texas | 3579 | cirro group, inc | texas | 0000715957 | False |
| 523722 | 2010-02-26 00:00:00 | edgar/data/103682/0001193125-10-042883.txt | 0000103682_cirro group incorporated_texas | 3579 | cirro group, inc | texas | 0000103682 | False |
| 963038 | 2023-02-23 00:00:00 | edgar/data/1013871/0001013871-23-000004.txt | 0001013871_cirro group incorporated_texas | 3579 | cirro group, inc | texas | 0001013871 | False |

Comment on lines +127 to +129
bad_utility_ids = [
    3579,  # Cirro Group, Inc. in Texas
]
@zaneselvans commented Feb 6, 2025

Probably this is a rogue utility_id_eia that only exists in the EIA-861 data, and was the only example of such a utility that ended up matching to an SEC company. It should go away if we remove the experimental EIA-861 ID harvesting upstream, as @katie-lamb said she intends to.
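
As a rough sketch of the filter (not the actual transform code, which this thread only shows in part), dropping the affected records amounts to removing rows whose utility_id_eia is in `bad_utility_ids`. The records and the helper name `drop_bad_utility_ids` below are hypothetical, and plain dicts are used to keep the example dependency-free.

```python
bad_utility_ids = [
    3579,  # Cirro Group, Inc. in Texas (from the diff above)
]

# Hypothetical records; only the utility_id_eia values matter here.
records = [
    {"company_id_sec10k": "hypothetical_a", "utility_id_eia": 3579},
    {"company_id_sec10k": "hypothetical_b", "utility_id_eia": 100},
    {"company_id_sec10k": "hypothetical_c", "utility_id_eia": None},
]


def drop_bad_utility_ids(rows, bad_ids):
    """Return only the rows whose utility_id_eia is not in bad_ids."""
    bad = set(bad_ids)
    return [row for row in rows if row["utility_id_eia"] not in bad]


kept = drop_bad_utility_ids(records, bad_utility_ids)
print(len(records) - len(kept), "record(s) dropped")
```

On a pandas DataFrame the equivalent would be a boolean mask built with `Series.isin`.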

Comment on lines +503 to +506
"core_sec10k__quarterly_filings",
"core_sec10k__quarterly_exhibit_21_company_ownership",
"core_sec10k__quarterly_company_information",
"out_sec10k__parents_and_subsidiaries",
@zaneselvans

Exclude these because they don't have the same (annual) time frequency as the entity table.
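
The general idea of an exclusion list can be sketched as follows. This is a hypothetical helper, not PUDL's actual metadata code: `tables_getting_fk`, `hypothetical_annual_table`, and the column lists are all invented for illustration. Only the four excluded table names come from the diff above.

```python
def tables_getting_fk(table_columns, key_cols, excluded_tables):
    """Return tables that contain all key_cols and are not excluded."""
    key = set(key_cols)
    excluded = set(excluded_tables)
    return sorted(
        name
        for name, cols in table_columns.items()
        if key <= set(cols) and name not in excluded
    )


# Column lists are illustrative, not the real PUDL schemas.
table_columns = {
    "hypothetical_annual_table": ["utility_id_eia", "report_date", "value"],
    "core_sec10k__quarterly_company_information": [
        "filename_sec10k",
        "utility_id_eia",
        "report_date",
    ],
    "out_sec10k__parents_and_subsidiaries": [
        "company_id_sec10k",
        "utility_id_eia",
        "report_date",
    ],
}

excluded = [
    "core_sec10k__quarterly_filings",
    "core_sec10k__quarterly_exhibit_21_company_ownership",
    "core_sec10k__quarterly_company_information",
    "out_sec10k__parents_and_subsidiaries",
]

fk_tables = tables_getting_fk(
    table_columns, ["utility_id_eia", "report_date"], excluded
)
print(fk_tables)
```

Only the annual table keeps the autogenerated constraint. The quarterly tables carry the same key columns, but their report dates would not line up with the annual entity table.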

- def core_sec10k__company_information() -> pd.DataFrame:
+ def core_sec10k__quarterly_company_information() -> pd.DataFrame:
@zaneselvans

Add time frequency prefix.

- "core_sec10k__filings": {
+ "core_sec10k__quarterly_filings": {
@zaneselvans

Rename these to include time frequency prefix.
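
The rename pattern in these diffs can be expressed as a small helper. `add_frequency` is a hypothetical function written for illustration, not something in the PUDL codebase. It assumes the table name has exactly the `<layer>_<source>__<rest>` shape seen above and inserts the frequency right after the double underscore.

```python
def add_frequency(table_name: str, frequency: str) -> str:
    """Insert a time frequency after the double underscore in a table name."""
    prefix, sep, rest = table_name.partition("__")
    if not sep or not rest:
        raise ValueError(f"Expected '<layer>_<source>__<name>', got {table_name!r}")
    return f"{prefix}__{frequency}_{rest}"


# The two old/new pairs visible in this PR's diffs.
renames = {
    old: add_frequency(old, "quarterly")
    for old in [
        "core_sec10k__filings",
        "core_sec10k__company_information",
    ]
}
print(renames)
```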

@zaneselvans marked this pull request as ready for review February 6, 2025 02:58
@zaneselvans

Branch build seems to have succeeded last night! (with the exception of the Zenodo Sandbox flakeout, which is unrelated)

@zschira left a comment

This looks good to me, thanks for stepping in with the foreign key fixes! I wonder if we should just merge @katie-lamb's metadata into this branch once we finish review, then merge everything into main? I guess it doesn't make too much difference if we get both PRs done today.

@zaneselvans

I think we might have more work to do on the documentation, and it'd be good to get the nightly build ETL passing with the quarterly updates in the works, so I say we merge this one and work on the docs / metadata independently.

@zaneselvans added this pull request to the merge queue Feb 6, 2025
Merged via the queue into main with commit fd4b1c6 Feb 6, 2025
17 checks passed
@zaneselvans deleted the sec10k-schema-fixes branch February 6, 2025 21:07
Labels
docs Documentation for users and contributors. metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. schema-change Used to label any PR that changes table or column names, what columns appear in a table, etc. sec10k Issues related to SEC 10K filing data. sqlite Issues related to interacting with sqlite databases
Projects
Status: Done

2 participants