Add sec10k metadata directly in PUDL #4035

Merged
merged 23 commits into from
Feb 5, 2025
Convert year_quarters to dates
zschira committed Jan 29, 2025
commit e2bca001e24e3b4184600451f0af5044fd5dd2cf
14 changes: 14 additions & 0 deletions src/pudl/analysis/pudl_models.py
Member

It feels odd that this is under the analysis subpackage given that the analysis isn't done here. It seems more like a straightforward ETL of some tables, albeit tables that we ourselves have produced elsewhere. Would it be more legible if it followed the same pattern as for other datasets, adding an sec10k module under the extract and transform subpackages?

Member Author

I agree the organization feels weird, but I don't think that would quite make sense either. We are doing some basic transformations in PUDL right now, but ideally these would all be moved back upstream so all we actually do in PUDL is read the parquet files in so they can be distributed. In that case the extract step would be creating core/out tables which doesn't feel right. Would it make sense to add this to the etl directory?

Member

Maybe this is just a new class of externally processed table? If it doesn't clearly fit into one of the other categories right now then I guess we can figure out where to put it when other tables like this show up.

The etl directory feels like a weird junk drawer to me. I still like the idea of organizing the subpackages by data source, given that different data sources have ended up having slightly different ETL journeys that don't necessarily fit neatly into E, T, and L (and the fact that L has been almost totally taken over by Dagster infra).

@@ -18,6 +18,11 @@ def _compute_fraction_owned(percent_ownership: pd.Series) -> pd.Series:
    ) / 100.0


def _year_quarter_to_date(year_quarter: pd.Series) -> pd.Series:
    """Convert year-quarter values to the timestamp at the start of each quarter."""
    return pd.PeriodIndex(year_quarter, freq="Q").to_timestamp()


@asset(
    io_manager_key="parquet_io_manager",
    group_name="pudl_models",
@@ -35,6 +40,9 @@ def core_sec10k__company_information() -> pd.DataFrame:
        }
    )

    # Get date from year quarters
    df["report_date"] = _year_quarter_to_date(df.year_quarter)

    return df


@@ -56,6 +64,9 @@ def core_sec10k__exhibit_21_company_ownership() -> pd.DataFrame:
    # Convert ownership percentage
    df["fraction_owned"] = _compute_fraction_owned(df.ownership_percentage)

    # Get date from year quarters
    df["report_date"] = _year_quarter_to_date(df.year_quarter)

    return df


@@ -73,6 +84,9 @@ def core_sec10k__filings() -> pd.DataFrame:
        }
    )

    # Get date from year quarters
    df["report_date"] = _year_quarter_to_date(df.year_quarter)

    return df
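
Note on the conversion used in the three assets above: `pd.PeriodIndex(...).to_timestamp()` returns a `DatetimeIndex` holding the first day of each quarter, and assigning it to a DataFrame column works positionally. The sketch below is a standalone illustration, not the code merged in this PR. It assumes the year-quarter values look like `2020Q1` (the actual format in the upstream tables is not shown in this diff) and wraps the result in a `pd.Series` so that it matches the annotated return type and keeps the input's index. The function name and the `report_date` series name here are chosen only for illustration.

```python
import pandas as pd


def year_quarter_to_date(year_quarter: pd.Series) -> pd.Series:
    """Map year-quarter strings to the first day of each quarter."""
    # Parse the strings as quarterly periods, then take each period's start.
    quarter_starts = pd.PeriodIndex(year_quarter, freq="Q").to_timestamp()
    # Wrap the raw values so the result keeps the input's index.
    return pd.Series(
        quarter_starts.to_numpy(), index=year_quarter.index, name="report_date"
    )


# Assumed input format; the index is deliberately non-default.
example = pd.Series(["2020Q1", "2021Q3"], index=[10, 11])
dates = year_quarter_to_date(example)
print(dates)
```

With the default calendar-year quarters, `2020Q1` maps to 2020-01-01 and `2021Q3` maps to 2021-07-01.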


4 changes: 0 additions & 4 deletions src/pudl/metadata/fields.py
@@ -5036,10 +5036,6 @@
"type": "integer",
"description": "Year the data was reported in, used for partitioning EPA CEMS.",
},
"year_quarter": {
"type": "string",
"description": "Year quarter filing applies to.",
},
"zip_code": {
"type": "string",
"description": "Five digit US Zip Code.",
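
Dropping the `year_quarter` definition is only safe once no resource schema still lists that field. A minimal sketch of such a consistency check follows. The dictionaries are hypothetical stand-ins rather than PUDL's actual metadata objects, and the table names, the `date` type, and the `report_date` description are assumptions made for the example.

```python
# Hypothetical stand-in for the field definitions after this change.
FIELDS = {
    "report_date": {"type": "date", "description": "Date reported."},
    "zip_code": {"type": "string", "description": "Five digit US Zip Code."},
}

# Hypothetical resource schemas; "stale_table" still lists the removed field.
RESOURCES = {
    "updated_table": {"schema": {"fields": ["report_date", "zip_code"]}},
    "stale_table": {"schema": {"fields": ["year_quarter", "zip_code"]}},
}


def undefined_fields(resources: dict, fields: dict) -> dict[str, list[str]]:
    """Return, per resource, the schema fields that have no definition."""
    missing = {}
    for name, resource in resources.items():
        unknown = [f for f in resource["schema"]["fields"] if f not in fields]
        if unknown:
            missing[name] = unknown
    return missing


print(undefined_fields(RESOURCES, FIELDS))
```

Here only `stale_table` is reported, because it still references `year_quarter`.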
6 changes: 3 additions & 3 deletions src/pudl/metadata/resources/sec10k.py
@@ -13,7 +13,7 @@
"sec10k_version",
"date_filed",
"exhibit_21_version",
"year_quarter",
"report_date",
],
"primary_key": [
"filename_sec10k",
@@ -32,7 +32,7 @@
"subsidiary_company_name",
"subsidiary_location",
"fraction_owned",
"year_quarter",
"report_date",
],
},
"sources": ["sec10k"],
@@ -50,7 +50,7 @@
"company_information_block_count",
"company_information_fact_name",
"company_information_fact_value",
"year_quarter",
"report_date",
],
"primary_key": [
"filename_sec10k",