Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create simple SQL view assets #2445

Merged
merged 13 commits into from
Apr 6, 2023
Merged

Create simple SQL view assets #2445

merged 13 commits into from
Apr 6, 2023

Conversation

bendnorman
Copy link
Member

@bendnorman bendnorman commented Mar 23, 2023

PR Overview

This PR contains:

  • A SQL view called denorm_plants_utils_ferc1 which performs the simple inner join pudl.output.ferc1.plants_utils_ferc1() does in pandas. PudlTabl.pu_ferc1() has been updated to pull the SQL view from the database instead of calling plants_utils_ferc1().
  • A helper function calledsql_asset_factory which returns an asset that sends SQL code to the db to be executed. The SQL code is stored in pudl.output.sql.
  • An example python output table called denorm_utilities_eia860 is just a dagsterified version of pudl.output.eia860.utilities_eia860().
  • Metadata for denorm_utilities_eia860 and denorm_plants_utils_ferc1.
  • A new parameter for SQLiteIOManagers that allow you to exclude certain table schemas for being created in the database. I made this change because we want the IO manager to throw an error if we haven’t created the resource metadata for a view but we also don’t want it to create a table schema in the db for the view.
  • A notebook that documents the output table conversion process.

Tasks

Preview Give feedback

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@bendnorman bendnorman marked this pull request as draft March 23, 2023 00:45
@bendnorman bendnorman requested a review from zaneselvans March 23, 2023 00:45
src/pudl/output/ferc1.py Show resolved Hide resolved
from dagster import AssetsDefinition, asset


def sql_asset_factory(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be a better location for this function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does seem like a weird place for it. How about pudl.output.sql or pudl.output.sql_views?

Do we want to think about how the assets should ultimately be organized and start populating that hierarchy with the new assets that we're creating? Something based on Dagster's recommended project layout? What would the sub-packages under pudl.assets be?

It seems like the layout becomes much more flexible once we're using assets and referring to them in the global asset namespace, rather than calling functions and passing all the dataframes around in function calls.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think assets in the same asset group should live in the same module so that we can call load_assets_from_modules([module], group_name="group_name").

With our current project structure we could either have one large output table asset group or asset groups for each data source: "denorm_eia860", "denorm_ferc1". At this point in the data processing, it might not make sense to be grouping by datasource because the data has been harvested and we are combining tables from different datasources.

@zaneselvans how would you categorize our output table groups?

@bendnorman bendnorman linked an issue Mar 23, 2023 that may be closed by this pull request
We want the IO Manager to throw an error if we haven't created resource meatdata
for a view but we also don't want a table schema created for the view
in the database.
@codecov
Copy link

codecov bot commented Mar 24, 2023

Codecov Report

Patch coverage: 97.2% and no project coverage change.

Comparison is base (060b72d) 87.2% compared to head (2364040) 87.2%.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2445   +/-   ##
=====================================
  Coverage   87.2%   87.2%           
=====================================
  Files         81      84    +3     
  Lines       9511    9537   +26     
=====================================
+ Hits        8302    8325   +23     
- Misses      1209    1212    +3     
Impacted Files Coverage Δ
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia860.py 100.0% <ø> (ø)
src/pudl/metadata/resources/epacems.py 100.0% <ø> (ø)
src/pudl/metadata/resources/ferc1.py 100.0% <ø> (ø)
src/pudl/io_managers.py 90.0% <50.0%> (-0.5%) ⬇️
src/pudl/metadata/classes.py 86.3% <100.0%> (ø)
src/pudl/output/denorm_eia.py 100.0% <100.0%> (ø)
src/pudl/output/denorm_ferc1.py 100.0% <100.0%> (ø)
src/pudl/output/eia860.py 72.9% <100.0%> (+0.1%) ⬆️
src/pudl/output/ferc1.py 98.6% <100.0%> (+<0.1%) ⬆️
... and 2 more

... and 1 file with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member Author

@bendnorman bendnorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zaneselvans I still need to make some changes but the structure is there and we have an output view and table in the database. Would appreciate some feedback on the questions I flagged.

src/pudl/etl/denormalized_assets.py Outdated Show resolved Hide resolved
src/pudl/io_managers.py Outdated Show resolved Hide resolved
src/pudl/metadata/resources/ferc1.py Outdated Show resolved Hide resolved
src/pudl/output/ferc1.py Show resolved Hide resolved
src/pudl/output/pudltabl.py Outdated Show resolved Hide resolved
src/pudl/output/ferc1.py Show resolved Hide resolved
src/pudl/output/eia860.py Outdated Show resolved Hide resolved
@bendnorman bendnorman marked this pull request as ready for review March 24, 2023 19:26
@bendnorman bendnorman marked this pull request as draft March 24, 2023 19:27
@bendnorman bendnorman requested a review from jdangerx March 24, 2023 21:55
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of comments, mostly about what feels like some proliferating brittle complexity and wondering how we are going to organize the new structures we're creating so that we can keep track of them easily.

src/pudl/etl/__init__.py Outdated Show resolved Hide resolved
src/pudl/io_managers.py Outdated Show resolved Hide resolved
src/pudl/io_managers.py Outdated Show resolved Hide resolved
src/pudl/io_managers.py Outdated Show resolved Hide resolved
src/pudl/metadata/resources/eia860.py Outdated Show resolved Hide resolved
src/pudl/output/sql/denorm_plants_utils_ferc1.sql Outdated Show resolved Hide resolved
src/pudl/output/helpers.py Outdated Show resolved Hide resolved
src/pudl/output/pudltabl.py Outdated Show resolved Hide resolved
test/unit/io_managers_test.py Outdated Show resolved Hide resolved
test/unit/io_managers_test.py Outdated Show resolved Hide resolved
Base automatically changed from dagster-eia861 to dev March 27, 2023 19:14
@zaneselvans zaneselvans added sqlite Issues related to interacting with sqlite databases dagster Issues related to our use of the Dagster orchestrator output Exporting data from PUDL into other platforms or interchange formats. metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. labels Mar 31, 2023
@zaneselvans zaneselvans added this to the 2023Q2 milestone Apr 4, 2023
@bendnorman
Copy link
Member Author

Assuming we're going to punt the decision on how to organize the data dictionaries I think this is ready to go.

Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only two hard problems in computer science: data types, naming things, and off-by-one errors.

Mainly:

  • with the table names should we standardize on "utilities" rather than somtimes using "utils" (even though we're preserving "utils" in the PudlTabl methods for backwards consistency).
  • Should we / can we use enforce_dtypes() when reading tables out of the DB for use in building other tables to avoid later dtype shenanigans? If the table is in the DB then it should have metadata defined for it, right?

src/pudl/metadata/resources/ferc1.py Outdated Show resolved Hide resolved
src/pudl/metadata/resources/ferc1.py Outdated Show resolved Hide resolved
src/pudl/output/pudltabl.py Outdated Show resolved Hide resolved
src/pudl/output/pudltabl.py Show resolved Hide resolved
src/pudl/output/sql/denorm_plants_utils_ferc1.sql Outdated Show resolved Hide resolved
@bendnorman bendnorman marked this pull request as ready for review April 5, 2023 21:50
Comment on lines +31 to +35
logger.warning(
"pudl.output.eia860.utilities_eia860() will be deprecated in a future version of PUDL."
" In the future, call the PudlTabl.utils_eia860() method or pull the denorm_utilities_eia table"
"directly from the pudl.sqlite database."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to fix this to merge, but I think we should probably avoid directing folks to use the PudlTabl outputs since we intend to deprecate that too -- if we're giving them directions for the future, it should probably be to use the DB tables directly.

@zaneselvans zaneselvans self-requested a review April 6, 2023 01:25
@bendnorman bendnorman merged commit c8782d5 into dev Apr 6, 2023
@bendnorman bendnorman deleted the init-sql-views branch April 6, 2023 06:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dagster Issues related to our use of the Dagster orchestrator metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. output Exporting data from PUDL into other platforms or interchange formats. sqlite Issues related to interacting with sqlite databases
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Create data structure for dagster SQL views
2 participants