Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dagsterize EIA output tables #2490

Closed
wants to merge 18 commits into from
Closed

Conversation

e-belfer
Copy link
Member

@e-belfer e-belfer commented Apr 4, 2023

This PR adds the EIA entity output tables and associated dependencies into dagster.

Warnings about future deprecations have been added to all ouput tables. Where there are boolean options (e.g. fill_tech_desc, these have been set to their defaults and the boolean options deprecated.) For now, the conversions are being done through Python wrappers, though some of these transitions may eventually be replaced by SQL views.

Note that I've moved all the EIA denormalized tables into the EIA metadata, rather than EIA 860, to reflect that data in the tables may be sourced from more than one EIA form. I've also added minor changes to the jupyter notebook to reflect the schema changes and other little pieces of advice for others!

Questions for reviewer

  • Do we want pu_eia in the final database? Or is this just an intermediary table?
  • Should output table parameters like fill_tech_desc and unit_ids be incorporated into the Config so they can be set for a whole ETL run?

Dependencies

  • denorm_utilities_eia is the dagster asset version of the existing utils_eia860 output table. It is currently generated through pudl.output.eia860.utilities_eia860 and called in pudltabl.utils_eia860.
    It depends on these normalized tables: utilities_entity_eia, utilities_eia860, utilities_eia
  • denorm_plants_eia is the dagster asset version of the existing plants_eia860 output table. It is currently generated through pudl.output.eia860.plants_eia860 and called in pudltabl.plants_eia860.
    It depends on these normalized tables: plants_entity_eia, plants_eia860, plants_eia
  • denorm_generators_eia is the dagster asset version of the existing gens_eia860 output table. It is currently generated through pudl.output.eia860.generators_eia860 and called in pudltabl.gens_eia860.
    It depends on these normalized tables: generators_entity_eia, generators_eia860, plants_entity_eia, boiler_generator_assn_eia860, pu_eia
  • denorm_boilers_eia is the dagster asset version of the existing boil_eia860 output table. It is currently generated through pudl.output.eia860.boilers_eia860 and called in pudltabl.boil_eia860.
    It depends on these normalized tables: generators_entity_eia, generators_eia860, plants_entity_eia, boiler_generator_assn_eia860, pu_eia

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@codecov
Copy link

codecov bot commented Apr 4, 2023

Codecov Report

Patch coverage: 97.3% and project coverage change: -0.6 ⚠️

Comparison is base (ecb44b1) 87.2% compared to head (40c799f) 86.7%.

❗ Current head 40c799f differs from pull request most recent head c2a2185. Consider uploading reports for the commit c2a2185 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2490     +/-   ##
=======================================
- Coverage   87.2%   86.7%   -0.6%     
=======================================
  Files         85      84      -1     
  Lines       9509    9530     +21     
=======================================
- Hits        8297    8263     -34     
- Misses      1212    1267     +55     
Impacted Files Coverage Δ
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia860.py 100.0% <ø> (ø)
src/pudl/output/eia860.py 50.9% <33.3%> (-22.1%) ⬇️
src/pudl/output/denorm_eia.py 100.0% <100.0%> (ø)
src/pudl/output/pudltabl.py 96.7% <100.0%> (+0.2%) ⬆️

... and 5 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@e-belfer
Copy link
Member Author

e-belfer commented Apr 4, 2023

Issues to address:

  1. Adding enforce_schema() to the PudlSQLiteIOManager means that when we load a denormalized table asset into Python, the schema is applied. On the other hand, if we read the table in from SQL using an updated pudltabl function, the data types are being determined from apply_data_types(). This results in differences between the treatment of categoricals for the following two lines of code:
asset_name = "denorm_plants_eia"
new_df = defs.load_asset_value(AssetKey(asset_name))

and

def plants_denorm_eia860(self) -> pd.DataFrame:
        """Pull a dataframe of plant level info reported in EIA 860.

        Returns:
            A denormalized table for interactive use.
        """
        return pd.read_sql("denorm_plants_eia", self.pudl_engine).pipe(
            apply_pudl_dtypes, group="eia"
        )
new_pudl_out_df = pudl_out.plants_denorm_eia860()

new_df.dtypes == new_pudl_out_df.dtypes yields:

column name dtypes equivalent
plant_id_eia True
plant_name_eia True
city True
county True
latitude True
longitude True
state True
street_address True
zip_code True
timezone False
report_date True
ash_impoundment True
ash_impoundment_lined True
ash_impoundment_status True
balancing_authority_code_eia True
balancing_authority_name_eia True
datum True
energy_storage True
ferc_cogen_docket_no True
ferc_cogen_status True
ferc_exempt_wholesale_generator_docket_no True
ferc_exempt_wholesale_generator True
ferc_small_power_producer_docket_no True
ferc_small_power_producer True
ferc_qualifying_facility_docket_no True
grid_voltage_1_kv True
grid_voltage_2_kv True
grid_voltage_3_kv True
iso_rto_code True
liquefied_natural_gas_storage True
natural_gas_local_distribution_company True
natural_gas_storage True
natural_gas_pipeline_name_1 True
natural_gas_pipeline_name_2 True
natural_gas_pipeline_name_3 True
nerc_region False
net_metering True
pipeline_notes True
primary_purpose_id_naics True
regulatory_status_code True
reporting_frequency_code False
sector_id_eia True
sector_name_eia True
service_area True
transmission_distribution_owner_id True
transmission_distribution_owner_name True
transmission_distribution_owner_state True
utility_id_eia True
water_source True
data_maturity True
plant_id_pudl True
utility_name_eia True
utility_id_pudl True
balancing_authority_code_eia_consistent_rate True

@zaneselvans Do you have thoughts on how to proceed here? Do we need to integrate enforce_schema() into the pudltabl functions as well?

@zaneselvans
Copy link
Member

Yes, I think the intention is that every database table will have the necessary resource definition to be able to use enforce_schema() and we should be using that both in the PudlTabl context and in the IOManager so they provide exactly the same outputs.

@e-belfer e-belfer linked an issue Apr 4, 2023 that may be closed by this pull request
@e-belfer e-belfer linked an issue Apr 4, 2023 that may be closed by this pull request
@e-belfer e-belfer changed the title WIP dagsterize plant entity output table Dagsterize EIA plant entity output tables Apr 5, 2023
@e-belfer e-belfer self-assigned this Apr 5, 2023
@e-belfer e-belfer requested a review from zaneselvans April 5, 2023 15:15
@e-belfer e-belfer marked this pull request as ready for review April 5, 2023 15:18
@zaneselvans zaneselvans added eia923 Anything having to do with EIA Form 923 eia860 Anything having to do with EIA Form 860 output Exporting data from PUDL into other platforms or interchange formats. dagster Issues related to our use of the Dagster orchestrator labels Apr 5, 2023
@zaneselvans zaneselvans added this to the 2023 Spring milestone Apr 5, 2023
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots to think about in here! Thank you for making the first attempt at this process. Let me know if you want to get on a call.

src/pudl/metadata/fields.py Show resolved Hide resolved
Comment on lines 806 to 825
"pu_eia": {
"description": (
"Denormalized table containing all plant and utility IDs and names from EIA."
),
"schema": {
"fields": [
"report_date",
"plant_id_eia",
"plant_name_eia",
"plant_id_pudl",
"utility_id_eia",
"utility_name_eia",
"utility_id_pudl",
],
"primary_key": ["plant_id_eia", "report_date"],
},
"field_namespace": "eia",
"sources": ["eia860", "eia923"],
"etl_group": "outputs",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should be persisting pu_eia in the DB? Is it useful on its own, or is it only really used in construction of the other denormalized tables? If it's just a building block, maybe it should use the default IO Manager which would just pickle it to disk.

Or is it so widely integrated into other functions beyond the construction of the denormalized tables that getting rid of it would be annoying?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be an interim table. Use the default IO Manager.

"utility_name_eia",
"utility_id_pudl",
],
"primary_key": ["plant_id_eia", "report_date"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think pu_eia will need to have additional primary key columns. At least one of them should refer to the utility. The primary key needs to be unique within the table.

src/pudl/metadata/resources/eia.py Show resolved Hide resolved
src/pudl/output/denorm_eia.py Outdated Show resolved Hide resolved
src/pudl/output/denorm_eia.py Outdated Show resolved Hide resolved
@@ -129,21 +129,39 @@ def __init__(
# Used to persist the output tables. Returns None if they don't exist.
self._dfs = defaultdict(lambda: None)

def pu_eia860(self, update=False):
"""Pull a dataframe of EIA plant-utility associations.
def filter_query(self, table: str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we anticipating other kinds of selection happening in here? If not I think a more specific name than filter_query() would improve readability. Maybe select_between_dates()?

Comment on lines +101 to +105
logger.warning(
"pudl.output.eia860.plants_eia860() will be deprecated in a future version of PUDL."
" In the future, call the PudlTabl.plants_eia860() method or pull the denorm_plants_eia table"
"directly from the pudl.sqlite database."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We intend to deprecate the PudlTabl object entirely, so we shouldn't be pointing people at that solution going forward. It's eventually gonna be a database table, period. I'm not sure if we want to mention the name of the database table though since I think we have yet to make a decision about how the tables are going to be named going forward. If we expect the denorm_ tables to be the primary points of access for users in our "Data Warehouse", then probably we'll want to ditch the prefix so the primary point of data access has the simplest most legible name.

If we put these warnings in every function I think they're going to become overwhelming. Given that we're going to deprecate PudlTabl and mostly we read through those objects right now (and so do other people) maybe the right thing to do is put a warnings.warn() in PudlTabl.__init__() so that it comes up when anyone creates the object.

Why do we need to largely duplicate the contents of this function in denorm_plants_eia() rather than converting this function into the asset directly or separating the functionality as needed between the two? This seems dangerous (hard to maintain, likely to make it ambiguous where information is coming from -- not that we want this to stick around for a long time but... famous last words).

If need be we can name the asset whatever we want, independent of the name of the function itself with additional parameters in the @asset() decorator. But maybe there's a design constraint I'm not seeing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zane should check in with @bendnorman about whether we should be deleting the old output functions now, and assume that everyone is accessing data through the PudlTabl class for now. Having duplicate info paths is scary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I agree we run the risk of people accessing data using the old output functions (functions in pudl.output.{datasource}) if we keep them around. I'm happy to delete them once we've confirmed the output of the new asset is the same. I added the deprecation warning in an effort to maintain backward compatibility in case folks are bypassing PudlTabl and calling the underlying functions. Based on the dagster retro though it sounds like we've agreed maintaining backward compatibility of our code isn't super important.
  • Good point about not including the table name in the deprecation warning, given we haven't settled on a naming pattern yet. We should direct folks to PudlTabl if we keep the deprecation warnings around.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bendnorman and I discussed this and decided that we'll try removing the old functions after verifying that the outputs are functionally identical, but it if that ends up being a complicated mess because the intermediate situation with some functions transitioned and others not is too complex, we'll defer removing the old functions until all of the new functions are in place.



@asset(io_manager_key="pudl_sqlite_io_manager", compute_kind="Python")
def pu_eia(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is pu_eia really just an intermediary table that's used to build up other denormalized tables? Or does it have stand-alone use cases for which we need to distribute it in the DB? If it's just a helper for building others, maybe it would be appropriate to persist it using the default (pickled dataframe on disk) IO Manager?

@zaneselvans
Copy link
Member

I think the he integration test failure has to do with the PudlTabl class no longer satisfying the requirements for use with pickle. This functionality was added recently by @arengel and I'm not that familiar with it.

I think this was mainly a way to persist an entire PudlTabl object to disk -- including all of the dataframes that are cached within it -- to avoid needing to re-do sometimes time-consuming calculations to regenerate them later. With the move to all of the outputs being pre-computed and stored within the PUDL DB directly, I suspect that the need for this will go away, and given that we're not going to be caching dfs inside PudlTabl any more, probably can't be maintained.

I'm concerned that for our largest tables, while having the outputs pre-computed and stored in the DB will remove large up-front computations and reduce overall memory usage, reading a very large dataframe out of SQLite will be significantly slower than reading it out of memory, and this may be noticeable and annoying in some use cases.

Long term, I suspect the right solution to this problem will be using a much faster storage format (either Parquet or DuckDB once their file format stabilizes with v1.0). Short term I think we'll just have to make the "no big computations, but slower read from the DB" trade-off. It'll certainly be much better than having to do both lots of reading from SQLite and lots of compute.

@bendnorman bendnorman mentioned this pull request Apr 5, 2023
8 tasks
@arengel
Copy link
Collaborator

arengel commented Apr 6, 2023

I think the he integration test failure has to do with the PudlTabl class no longer satisfying the requirements for use with pickle. This functionality was added recently by @arengel and I'm not that familiar with it.

I think this was mainly a way to persist an entire PudlTabl object to disk -- including all of the dataframes that are cached within it -- to avoid needing to re-do sometimes time-consuming calculations to regenerate them later. With the move to all of the outputs being pre-computed and stored within the PUDL DB directly, I suspect that the need for this will go away, and given that we're not going to be caching dfs inside PudlTabl any more, probably can't be maintained.

I'm concerned that for our largest tables, while having the outputs pre-computed and stored in the DB will remove large up-front computations and reduce overall memory usage, reading a very large dataframe out of SQLite will be significantly slower than reading it out of memory, and this may be noticeable and annoying in some use cases.

Long term, I suspect the right solution to this problem will be using a much faster storage format (either Parquet or DuckDB once their file format stabilizes with v1.0). Short term I think we'll just have to make the "no big computations, but slower read from the DB" trade-off. It'll certainly be much better than having to do both lots of reading from SQLite and lots of compute.

The reason I added the functionality was indeed to be able to serialize an entire PudlTabl object to avoid repeating time-consuming calculations. So I agree that having all the tables pre-computed and in in the SQLite DB removes the need for this functionality. To the extent DB reads become an issue for us, we can just modify our current process of caching everything as parquets to plug into the DB rather than PudlTabl.

Another point is that because this functionality wasn't released, we don't have much code (if any that I can think of) that depends on it, I mention this in case you want to remove it at some point soon before it slowly breaks in confusing ways. Normally I'd offer to help fix it so that it keeps working but given the plans for PudlTabl that doesn't seem worth it.

Base automatically changed from init-sql-views to dev April 6, 2023 06:36
@e-belfer e-belfer assigned zaneselvans and unassigned e-belfer Apr 10, 2023
@zaneselvans zaneselvans marked this pull request as draft April 10, 2023 20:58
Create new dagster assets:

- denorm_plants_utilities_eia which replaces pu_eia860
- denorm_ownership_eia860 which replaces ownership_eia860 / own_eia860

Both of these are getting written into the DB for now, since both of
them had / have dedicated methods in the `PudlTabl` output manager
class.
Convert the newly denormalized entity tables and EIA-860 output tables
to use the new dynamic output methods in the PudlTabl object.
@zaneselvans zaneselvans changed the title Dagsterize EIA plant entity output tables Dagsterize EIA output tables Apr 11, 2023
@zaneselvans zaneselvans mentioned this pull request Apr 11, 2023
8 tasks
@zaneselvans zaneselvans deleted the dagsterize-eia-entity-outputs branch April 11, 2023 03:18
@zaneselvans zaneselvans mentioned this pull request Apr 11, 2023
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dagster Issues related to our use of the Dagster orchestrator eia860 Anything having to do with EIA Form 860 eia923 Anything having to do with EIA Form 923 inframundo output Exporting data from PUDL into other platforms or interchange formats.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Convert EIA entity / annual output tables to Dagster assets Convert pu_eia860 to SQL view asset
4 participants