Create simple SQL view assets #2445
src/pudl/output/helpers.py
Outdated
from dagster import AssetsDefinition, asset


def sql_asset_factory(
There might be a better location for this function.
This does seem like a weird place for it. How about pudl.output.sql or pudl.output.sql_views?
Do we want to think about how the assets should ultimately be organized and start populating that hierarchy with the new assets that we're creating? Something based on Dagster's recommended project layout? What would the sub-packages under pudl.assets be?
It seems like the layout becomes much more flexible once we're using assets and referring to them in the global asset namespace, rather than calling functions and passing all the dataframes around in function calls.
I think assets in the same asset group should live in the same module so that we can call load_assets_from_modules([module], group_name="group_name").
With our current project structure we could either have one large output table asset group or separate asset groups for each data source: "denorm_eia860", "denorm_ferc1". At this point in the data processing, grouping by data source might not make sense, because the data has already been harvested and we are combining tables from different data sources.
@zaneselvans how would you categorize our output table groups?
We want the IO Manager to throw an error if we haven't created resource metadata for a view, but we also don't want a table schema created for the view in the database.
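One way to picture that requirement is the sketch below. It is not the PR's IO Manager: the function name, the exception class, and the idea of passing metadata as a plain dict are all assumptions made for illustration. It only shows the two rules together: a view without resource metadata is an error, and views are left out of the set of names that get a table schema.

```python
class MissingResourceMetadataError(KeyError):
    """Hypothetical error raised when a view has no resource metadata."""


def names_needing_table_schema(resource_metadata, view_names):
    """Return the resource names that should get a table schema.

    resource_metadata: dict mapping resource name -> metadata (hypothetical shape).
    view_names: names of resources that are SQL views rather than tables.
    """
    views = set(view_names)
    # Rule 1: every view must still have resource metadata defined.
    missing = sorted(views - set(resource_metadata))
    if missing:
        raise MissingResourceMetadataError(
            f"No resource metadata defined for views: {missing}"
        )
    # Rule 2: views are excluded when table schemas are created.
    return [name for name in resource_metadata if name not in views]
```

With this shape, metadata for a view is still mandatory, but the view never reaches the schema-creation step.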
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## dev #2445 +/- ##
=====================================
Coverage 87.2% 87.2%
=====================================
Files 81 84 +3
Lines 9511 9537 +26
=====================================
+ Hits 8302 8325 +23
- Misses 1209 1212 +3
... and 1 file with indirect coverage changes
@zaneselvans I still need to make some changes but the structure is there and we have an output view and table in the database. Would appreciate some feedback on the questions I flagged.
…flesh out output table conversion notebook
Lots of comments, mostly about what feels like some proliferating brittle complexity and wondering how we are going to organize the new structures we're creating so that we can keep track of them easily.
Assuming we're going to punt the decision on how to organize the data dictionaries, I think this is ready to go.
There are only two hard problems in computer science: data types, naming things, and off-by-one errors.
Mainly:
- With the table names, should we standardize on "utilities" rather than sometimes using "utils" (even though we're preserving "utils" in the PudlTabl methods for backwards compatibility)?
- Should we / can we use enforce_dtypes() when reading tables out of the DB for use in building other tables to avoid later dtype shenanigans? If the table is in the DB then it should have metadata defined for it, right?
logger.warning(
    "pudl.output.eia860.utilities_eia860() will be deprecated in a future version of PUDL."
    " In the future, call the PudlTabl.utils_eia860() method or pull the denorm_utilities_eia table"
    " directly from the pudl.sqlite database."
)
We don't need to fix this to merge, but I think we should probably avoid directing folks to use the PudlTabl outputs since we intend to deprecate that too -- if we're giving them directions for the future, it should probably be to use the DB tables directly.
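For readers wondering what "use the DB tables directly" might look like, here is a minimal sketch using only Python's standard sqlite3 module. The helper name read_table, the in-memory database, and the column names are assumptions for illustration; only the table name denorm_utilities_eia comes from the warning message above. Against a real build you would connect to your own pudl.sqlite file instead.

```python
import sqlite3


def read_table(conn, table_name):
    """Read a whole table or view into a list of dicts (hypothetical helper)."""
    # Only allow names that exist in the database, since the name is
    # interpolated into the query text below.
    known = {
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')"
        )
    }
    if table_name not in known:
        raise KeyError(f"{table_name} is not a table or view in this database")
    cursor = conn.execute(f'SELECT * FROM "{table_name}"')
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


# Stand-in for sqlite3.connect("path/to/pudl.sqlite"); columns are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE denorm_utilities_eia (utility_id_eia INTEGER, utility_name_eia TEXT)"
)
conn.execute("INSERT INTO denorm_utilities_eia VALUES (1, 'Example Utility')")
utilities = read_table(conn, "denorm_utilities_eia")
```

Because views and tables are queried the same way, the same helper works for either kind of output.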
PR Overview
This PR contains:
- denorm_plants_utils_ferc1 which performs the simple inner join pudl.output.ferc1.plants_utils_ferc1() does in pandas. PudlTabl.pu_ferc1() has been updated to pull the SQL view from the database instead of calling plants_utils_ferc1().
- sql_asset_factory which returns an asset that sends SQL code to the db to be executed. The SQL code is stored in pudl.output.sql.
- denorm_utilities_eia860 is just a dagsterified version of pudl.output.eia860.utilities_eia860().
- denorm_utilities_eia860 and denorm_plants_utils_ferc1.
Tasks
PR Checklist
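The factory idea from the PR overview, a function that returns something which sends SQL to the database, can be sketched without Dagster as below. This is not the PR's sql_asset_factory: the name sql_view_factory, its signature, and the table and column names are assumptions, and the returned object here is a plain function rather than a Dagster asset. It only illustrates the pattern of turning a view name plus a query into a callable that creates the view.

```python
import sqlite3
from typing import Callable


def sql_view_factory(name: str, select_sql: str) -> Callable[[sqlite3.Connection], None]:
    """Return a function that (re)creates the view `name` from `select_sql`."""

    def create_view(conn: sqlite3.Connection) -> None:
        # Recreate the view so that reruns pick up changes to the query.
        conn.execute(f'DROP VIEW IF EXISTS "{name}"')
        conn.execute(f'CREATE VIEW "{name}" AS {select_sql}')
        conn.commit()

    create_view.__name__ = name
    return create_view


# Hypothetical tables and columns standing in for the real FERC 1 tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE plants_ferc1 (utility_id_ferc1 INTEGER, plant_name_ferc1 TEXT);
    CREATE TABLE utilities_ferc1 (utility_id_ferc1 INTEGER, utility_name_ferc1 TEXT);
    INSERT INTO plants_ferc1 VALUES (1, 'plant a'), (2, 'plant b');
    INSERT INTO utilities_ferc1 VALUES (1, 'utility x');
    """
)

denorm_plants_utils_ferc1 = sql_view_factory(
    "denorm_plants_utils_ferc1",
    """
    SELECT p.utility_id_ferc1, p.plant_name_ferc1, u.utility_name_ferc1
    FROM plants_ferc1 AS p
    INNER JOIN utilities_ferc1 AS u
        ON p.utility_id_ferc1 = u.utility_id_ferc1
    """,
)
denorm_plants_utils_ferc1(conn)
rows = conn.execute("SELECT * FROM denorm_plants_utils_ferc1").fetchall()
```

Keeping the query text separate from the factory mirrors the PR's choice to store the SQL in its own location, so adding another simple view only requires a new name and query.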