-
-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dagsterize EIA output tables #2490
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## dev #2490 +/- ##
=======================================
- Coverage 87.2% 86.7% -0.6%
=======================================
Files 85 84 -1
Lines 9509 9530 +21
=======================================
- Hits 8297 8263 -34
- Misses 1212 1267 +55
... and 5 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
Issues to address:
and
@zaneselvans Do you have thoughts on how to proceed here? Do we need to integrate enforce_schema() into the pudltabl functions as well? |
Yes, I think the intention is that every database table will have the necessary resource definition to be able to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots to think about in here! Thank you for making the first attempt at this process. Let me know if you want to get on a call.
src/pudl/metadata/resources/eia.py
Outdated
"pu_eia": { | ||
"description": ( | ||
"Denormalized table containing all plant and utility IDs and names from EIA." | ||
), | ||
"schema": { | ||
"fields": [ | ||
"report_date", | ||
"plant_id_eia", | ||
"plant_name_eia", | ||
"plant_id_pudl", | ||
"utility_id_eia", | ||
"utility_name_eia", | ||
"utility_id_pudl", | ||
], | ||
"primary_key": ["plant_id_eia", "report_date"], | ||
}, | ||
"field_namespace": "eia", | ||
"sources": ["eia860", "eia923"], | ||
"etl_group": "outputs", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should be persisting pu_eia
in the DB? Is it useful on its own, or is it only really used in construction of the other denormalized tables? If it's just a building block, maybe it should use the default IO Manager which would just pickle it to disk.
Or is it so widely integrated into other functions beyond the construction of the denormalized tables that getting rid of it would be annoying?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be an interim table. Use the default IO Manager.
src/pudl/metadata/resources/eia.py
Outdated
"utility_name_eia", | ||
"utility_id_pudl", | ||
], | ||
"primary_key": ["plant_id_eia", "report_date"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think pu_eia
will need to have additional primary key columns. At least one of them should refer to the utility. The primary key needs to be unique within the table.
src/pudl/output/pudltabl.py
Outdated
@@ -129,21 +129,39 @@ def __init__( | |||
# Used to persist the output tables. Returns None if they don't exist. | |||
self._dfs = defaultdict(lambda: None) | |||
|
|||
def pu_eia860(self, update=False): | |||
"""Pull a dataframe of EIA plant-utility associations. | |||
def filter_query(self, table: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we anticipating other kinds of selection happening in here? If not I think a more specific name than filter_query()
would improve readability. Maybe select_between_dates()
?
logger.warning( | ||
"pudl.output.eia860.plants_eia860() will be deprecated in a future version of PUDL." | ||
" In the future, call the PudlTabl.plants_eia860() method or pull the denorm_plants_eia table" | ||
"directly from the pudl.sqlite database." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We intend to deprecate the PudlTabl
object entirely, so we shouldn't be pointing people at that solution going forward. It's eventually gonna be a database table, period. I'm not sure if we want to mention the name of the database table though since I think we have yet to make a decision about how the tables are going to be named going forward. If we expect the denorm_
tables to be the primary points of access for users in our "Data Warehouse", then probably we'll want to ditch the prefix so the primary point of data access has the simplest most legible name.
If we put these warnings in every function I think they're going to become overwhelming. Given that we're going to deprecate PudlTabl
and mostly we read through those objects right now (and so do other people) maybe the right thing to do is put a warnings.warn()
in PudlTabl.__init__()
so that it comes up when anyone creates the object.
Why do we need to largely duplicate the contents of this function in denorm_plants_eia()
rather than converting this function into the asset directly or separating the functionality as needed between the two? This seems dangerous (hard to maintain, likely to make it ambiguous where information is coming from -- not that we want this to stick around for a long time but... famous last words).
If need be we can name the asset whatever we want, independent of the name of the function itself with additional parameters in the @asset()
decorator. But maybe there's a design constraint I'm not seeing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zane should check in with @bendnorman about whether we should be deleting the old output functions now, and assume that everyone is accessing data through the PudlTabl
class for now. Having duplicate info paths is scary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I agree we run the risk of people accessing data using the old output functions (functions in pudl.output.{datasource}) if we keep them around. I'm happy to delete them once we've confirmed the output of the new asset is the same. I added the deprecation warning in an effort to maintain backward compatibility in case folks are bypassing
PudlTabl
and calling the underlying functions. Based on the dagster retro though it sounds like we've agreed maintaining backward compatibility of our code isn't super important. - Good point about not including the table name in the deprecation warning, given we haven't settled on a naming pattern yet. We should direct folks to PudlTabl if we keep the deprecation warnings around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bendnorman and I discussed this and decided that we'll try removing the old functions after verifying that the outputs are functionally identical, but it if that ends up being a complicated mess because the intermediate situation with some functions transitioned and others not is too complex, we'll defer removing the old functions until all of the new functions are in place.
src/pudl/output/denorm_eia.py
Outdated
|
||
|
||
@asset(io_manager_key="pudl_sqlite_io_manager", compute_kind="Python") | ||
def pu_eia( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is pu_eia
really just an intermediary table that's used to build up other denormalized tables? Or does it have stand-alone use cases for which we need to distribute it in the DB? If it's just a helper for building others, maybe it would be appropriate to persist it using the default (pickled dataframe on disk) IO Manager?
I think the he integration test failure has to do with the I think this was mainly a way to persist an entire I'm concerned that for our largest tables, while having the outputs pre-computed and stored in the DB will remove large up-front computations and reduce overall memory usage, reading a very large dataframe out of SQLite will be significantly slower than reading it out of memory, and this may be noticeable and annoying in some use cases. Long term, I suspect the right solution to this problem will be using a much faster storage format (either Parquet or DuckDB once their file format stabilizes with v1.0). Short term I think we'll just have to make the "no big computations, but slower read from the DB" trade-off. It'll certainly be much better than having to do both lots of reading from SQLite and lots of compute. |
The reason I added the functionality was indeed to be able to serialize an entire Another point is that because this functionality wasn't released, we don't have much code (if any that I can think of) that depends on it, I mention this in case you want to remove it at some point soon before it slowly breaks in confusing ways. Normally I'd offer to help fix it so that it keeps working but given the plans for |
Create new dagster assets: - denorm_plants_utilities_eia which replaces pu_eia860 - denorm_ownership_eia860 which replaces ownership_eia860 / own_eia860 Both of these are getting written into the DB for now, since both of them had / have dedicated methods in the `PudlTabl` output manager class.
Convert the newly denormalized entity tables and EIA-860 output tables to use the new dynamic output methods in the PudlTabl object.
This PR adds the EIA entity output tables and associated dependencies into dagster.
Warnings about future deprecations have been added to all ouput tables. Where there are boolean options (e.g.
fill_tech_desc
, these have been set to their defaults and the boolean options deprecated.) For now, the conversions are being done through Python wrappers, though some of these transitions may eventually be replaced by SQL views.Note that I've moved all the EIA denormalized tables into the EIA metadata, rather than EIA 860, to reflect that data in the tables may be sourced from more than one EIA form. I've also added minor changes to the jupyter notebook to reflect the schema changes and other little pieces of advice for others!
Questions for reviewer
pu_eia
in the final database? Or is this just an intermediary table?fill_tech_desc
andunit_ids
be incorporated into theConfig
so they can be set for a whole ETL run?Dependencies
denorm_utilities_eia
is the dagster asset version of the existingutils_eia860
output table. It is currently generated throughpudl.output.eia860.utilities_eia860
and called inpudltabl.utils_eia860
.It depends on these normalized tables:
utilities_entity_eia
,utilities_eia860
,utilities_eia
denorm_plants_eia
is the dagster asset version of the existingplants_eia860
output table. It is currently generated throughpudl.output.eia860.plants_eia860
and called inpudltabl.plants_eia860
.It depends on these normalized tables:
plants_entity_eia
,plants_eia860
,plants_eia
denorm_generators_eia
is the dagster asset version of the existinggens_eia860
output table. It is currently generated throughpudl.output.eia860.generators_eia860
and called inpudltabl.gens_eia860
.It depends on these normalized tables:
generators_entity_eia
,generators_eia860
,plants_entity_eia
,boiler_generator_assn_eia860
,pu_eia
denorm_boilers_eia
is the dagster asset version of the existingboil_eia860
output table. It is currently generated throughpudl.output.eia860.boilers_eia860
and called inpudltabl.boil_eia860
.It depends on these normalized tables:
generators_entity_eia
,generators_eia860
,plants_entity_eia
,boiler_generator_assn_eia860
,pu_eia
PR Checklist
dev
).