Dagsterize EIA output tables #2490

e-belfer · 2023-04-04T18:34:49Z

This PR adds the EIA entity output tables and associated dependencies into dagster.

Warnings about future deprecations have been added to all ouput tables. Where there are boolean options (e.g. fill_tech_desc, these have been set to their defaults and the boolean options deprecated.) For now, the conversions are being done through Python wrappers, though some of these transitions may eventually be replaced by SQL views.

Note that I've moved all the EIA denormalized tables into the EIA metadata, rather than EIA 860, to reflect that data in the tables may be sourced from more than one EIA form. I've also added minor changes to the jupyter notebook to reflect the schema changes and other little pieces of advice for others!

Questions for reviewer

Do we want pu_eia in the final database? Or is this just an intermediary table?
Should output table parameters like fill_tech_desc and unit_ids be incorporated into the Config so they can be set for a whole ETL run?

Dependencies

denorm_utilities_eia is the dagster asset version of the existing utils_eia860 output table. It is currently generated through pudl.output.eia860.utilities_eia860 and called in pudltabl.utils_eia860.
It depends on these normalized tables: utilities_entity_eia, utilities_eia860, utilities_eia
denorm_plants_eia is the dagster asset version of the existing plants_eia860 output table. It is currently generated through pudl.output.eia860.plants_eia860 and called in pudltabl.plants_eia860.
It depends on these normalized tables: plants_entity_eia, plants_eia860, plants_eia
denorm_generators_eia is the dagster asset version of the existing gens_eia860 output table. It is currently generated through pudl.output.eia860.generators_eia860 and called in pudltabl.gens_eia860.
It depends on these normalized tables: generators_entity_eia, generators_eia860, plants_entity_eia, boiler_generator_assn_eia860, pu_eia
denorm_boilers_eia is the dagster asset version of the existing boil_eia860 output table. It is currently generated through pudl.output.eia860.boilers_eia860 and called in pudltabl.boil_eia860.
It depends on these normalized tables: generators_entity_eia, generators_eia860, plants_entity_eia, boiler_generator_assn_eia860, pu_eia

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures
Make sure you've included good docstrings.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

review-notebook-app · 2023-04-04T18:34:54Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

codecov · 2023-04-04T18:36:22Z

Codecov Report

Patch coverage: 97.3% and project coverage change: -0.6 ⚠️

Comparison is base (ecb44b1) 87.2% compared to head (40c799f) 86.7%.

❗ Current head 40c799f differs from pull request most recent head c2a2185. Consider uploading reports for the commit c2a2185 to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##             dev   #2490     +/-   ##
=======================================
- Coverage   87.2%   86.7%   -0.6%     
=======================================
  Files         85      84      -1     
  Lines       9509    9530     +21     
=======================================
- Hits        8297    8263     -34     
- Misses      1212    1267     +55

Impacted Files	Coverage Δ
src/pudl/metadata/fields.py	`100.0% <ø> (ø)`
src/pudl/metadata/resources/eia.py	`100.0% <ø> (ø)`
src/pudl/metadata/resources/eia860.py	`100.0% <ø> (ø)`
src/pudl/output/eia860.py	`50.9% <33.3%> (-22.1%)`	⬇️
src/pudl/output/denorm_eia.py	`100.0% <100.0%> (ø)`
src/pudl/output/pudltabl.py	`96.7% <100.0%> (+0.2%)`	⬆️

... and 5 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

e-belfer · 2023-04-04T19:09:22Z

Issues to address:

Adding enforce_schema() to the PudlSQLiteIOManager means that when we load a denormalized table asset into Python, the schema is applied. On the other hand, if we read the table in from SQL using an updated pudltabl function, the data types are being determined from apply_data_types(). This results in differences between the treatment of categoricals for the following two lines of code:

asset_name = "denorm_plants_eia"
new_df = defs.load_asset_value(AssetKey(asset_name))

and

def plants_denorm_eia860(self) -> pd.DataFrame:
        """Pull a dataframe of plant level info reported in EIA 860.

        Returns:
            A denormalized table for interactive use.
        """
        return pd.read_sql("denorm_plants_eia", self.pudl_engine).pipe(
            apply_pudl_dtypes, group="eia"
        )
new_pudl_out_df = pudl_out.plants_denorm_eia860()

new_df.dtypes == new_pudl_out_df.dtypes yields:

column name	dtypes equivalent
plant_id_eia	True
plant_name_eia	True
city	True
county	True
latitude	True
longitude	True
state	True
street_address	True
zip_code	True
timezone	False
report_date	True
ash_impoundment	True
ash_impoundment_lined	True
ash_impoundment_status	True
balancing_authority_code_eia	True
balancing_authority_name_eia	True
datum	True
energy_storage	True
ferc_cogen_docket_no	True
ferc_cogen_status	True
ferc_exempt_wholesale_generator_docket_no	True
ferc_exempt_wholesale_generator	True
ferc_small_power_producer_docket_no	True
ferc_small_power_producer	True
ferc_qualifying_facility_docket_no	True
grid_voltage_1_kv	True
grid_voltage_2_kv	True
grid_voltage_3_kv	True
iso_rto_code	True
liquefied_natural_gas_storage	True
natural_gas_local_distribution_company	True
natural_gas_storage	True
natural_gas_pipeline_name_1	True
natural_gas_pipeline_name_2	True
natural_gas_pipeline_name_3	True
nerc_region	False
net_metering	True
pipeline_notes	True
primary_purpose_id_naics	True
regulatory_status_code	True
reporting_frequency_code	False
sector_id_eia	True
sector_name_eia	True
service_area	True
transmission_distribution_owner_id	True
transmission_distribution_owner_name	True
transmission_distribution_owner_state	True
utility_id_eia	True
water_source	True
data_maturity	True
plant_id_pudl	True
utility_name_eia	True
utility_id_pudl	True
balancing_authority_code_eia_consistent_rate	True

@zaneselvans Do you have thoughts on how to proceed here? Do we need to integrate enforce_schema() into the pudltabl functions as well?

zaneselvans · 2023-04-04T19:15:24Z

Yes, I think the intention is that every database table will have the necessary resource definition to be able to use enforce_schema() and we should be using that both in the PudlTabl context and in the IOManager so they provide exactly the same outputs.

zaneselvans

Lots to think about in here! Thank you for making the first attempt at this process. Let me know if you want to get on a call.

src/pudl/metadata/fields.py

zaneselvans · 2023-04-05T16:00:58Z

src/pudl/metadata/resources/eia.py

+    "pu_eia": {
+        "description": (
+            "Denormalized table containing all plant and utility IDs and names from EIA."
+        ),
+        "schema": {
+            "fields": [
+                "report_date",
+                "plant_id_eia",
+                "plant_name_eia",
+                "plant_id_pudl",
+                "utility_id_eia",
+                "utility_name_eia",
+                "utility_id_pudl",
+            ],
+            "primary_key": ["plant_id_eia", "report_date"],
+        },
+        "field_namespace": "eia",
+        "sources": ["eia860", "eia923"],
+        "etl_group": "outputs",
+    },


Do you think we should be persisting pu_eia in the DB? Is it useful on its own, or is it only really used in construction of the other denormalized tables? If it's just a building block, maybe it should use the default IO Manager which would just pickle it to disk.

Or is it so widely integrated into other functions beyond the construction of the denormalized tables that getting rid of it would be annoying?

Should be an interim table. Use the default IO Manager.

zaneselvans · 2023-04-05T16:02:03Z

src/pudl/metadata/resources/eia.py

+                "utility_name_eia",
+                "utility_id_pudl",
+            ],
+            "primary_key": ["plant_id_eia", "report_date"],


I think pu_eia will need to have additional primary key columns. At least one of them should refer to the utility. The primary key needs to be unique within the table.

src/pudl/metadata/resources/eia.py

src/pudl/output/denorm_eia.py

zaneselvans · 2023-04-05T19:32:51Z

src/pudl/output/pudltabl.py

@@ -129,21 +129,39 @@ def __init__(
        # Used to persist the output tables. Returns None if they don't exist.
        self._dfs = defaultdict(lambda: None)

-    def pu_eia860(self, update=False):
-        """Pull a dataframe of EIA plant-utility associations.
+    def filter_query(self, table: str):


Are we anticipating other kinds of selection happening in here? If not I think a more specific name than filter_query() would improve readability. Maybe select_between_dates()?

zaneselvans · 2023-04-05T19:40:07Z

src/pudl/output/eia860.py

+    logger.warning(
+        "pudl.output.eia860.plants_eia860() will be deprecated in a future version of PUDL."
+        " In the future, call the PudlTabl.plants_eia860() method or pull the denorm_plants_eia table"
+        "directly from the pudl.sqlite database."
+    )


We intend to deprecate the PudlTabl object entirely, so we shouldn't be pointing people at that solution going forward. It's eventually gonna be a database table, period. I'm not sure if we want to mention the name of the database table though since I think we have yet to make a decision about how the tables are going to be named going forward. If we expect the denorm_ tables to be the primary points of access for users in our "Data Warehouse", then probably we'll want to ditch the prefix so the primary point of data access has the simplest most legible name.

If we put these warnings in every function I think they're going to become overwhelming. Given that we're going to deprecate PudlTabl and mostly we read through those objects right now (and so do other people) maybe the right thing to do is put a warnings.warn() in PudlTabl.__init__() so that it comes up when anyone creates the object.

Why do we need to largely duplicate the contents of this function in denorm_plants_eia() rather than converting this function into the asset directly or separating the functionality as needed between the two? This seems dangerous (hard to maintain, likely to make it ambiguous where information is coming from -- not that we want this to stick around for a long time but... famous last words).

If need be we can name the asset whatever we want, independent of the name of the function itself with additional parameters in the @asset() decorator. But maybe there's a design constraint I'm not seeing.

Zane should check in with @bendnorman about whether we should be deleting the old output functions now, and assume that everyone is accessing data through the PudlTabl class for now. Having duplicate info paths is scary.

I agree we run the risk of people accessing data using the old output functions (functions in pudl.output.{datasource}) if we keep them around. I'm happy to delete them once we've confirmed the output of the new asset is the same. I added the deprecation warning in an effort to maintain backward compatibility in case folks are bypassing PudlTabl and calling the underlying functions. Based on the dagster retro though it sounds like we've agreed maintaining backward compatibility of our code isn't super important.

Good point about not including the table name in the deprecation warning, given we haven't settled on a naming pattern yet. We should direct folks to PudlTabl if we keep the deprecation warnings around.

@bendnorman and I discussed this and decided that we'll try removing the old functions after verifying that the outputs are functionally identical, but it if that ends up being a complicated mess because the intermediate situation with some functions transitioned and others not is too complex, we'll defer removing the old functions until all of the new functions are in place.

zaneselvans · 2023-04-05T19:49:20Z

src/pudl/output/denorm_eia.py

+
+
+@asset(io_manager_key="pudl_sqlite_io_manager", compute_kind="Python")
+def pu_eia(


Is pu_eia really just an intermediary table that's used to build up other denormalized tables? Or does it have stand-alone use cases for which we need to distribute it in the DB? If it's just a helper for building others, maybe it would be appropriate to persist it using the default (pickled dataframe on disk) IO Manager?

zaneselvans · 2023-04-05T20:02:04Z

I think the he integration test failure has to do with the PudlTabl class no longer satisfying the requirements for use with pickle. This functionality was added recently by @arengel and I'm not that familiar with it.

I think this was mainly a way to persist an entire PudlTabl object to disk -- including all of the dataframes that are cached within it -- to avoid needing to re-do sometimes time-consuming calculations to regenerate them later. With the move to all of the outputs being pre-computed and stored within the PUDL DB directly, I suspect that the need for this will go away, and given that we're not going to be caching dfs inside PudlTabl any more, probably can't be maintained.

I'm concerned that for our largest tables, while having the outputs pre-computed and stored in the DB will remove large up-front computations and reduce overall memory usage, reading a very large dataframe out of SQLite will be significantly slower than reading it out of memory, and this may be noticeable and annoying in some use cases.

Long term, I suspect the right solution to this problem will be using a much faster storage format (either Parquet or DuckDB once their file format stabilizes with v1.0). Short term I think we'll just have to make the "no big computations, but slower read from the DB" trade-off. It'll certainly be much better than having to do both lots of reading from SQLite and lots of compute.

arengel · 2023-04-06T06:02:42Z

I think the he integration test failure has to do with the PudlTabl class no longer satisfying the requirements for use with pickle. This functionality was added recently by @arengel and I'm not that familiar with it.

I think this was mainly a way to persist an entire PudlTabl object to disk -- including all of the dataframes that are cached within it -- to avoid needing to re-do sometimes time-consuming calculations to regenerate them later. With the move to all of the outputs being pre-computed and stored within the PUDL DB directly, I suspect that the need for this will go away, and given that we're not going to be caching dfs inside PudlTabl any more, probably can't be maintained.

I'm concerned that for our largest tables, while having the outputs pre-computed and stored in the DB will remove large up-front computations and reduce overall memory usage, reading a very large dataframe out of SQLite will be significantly slower than reading it out of memory, and this may be noticeable and annoying in some use cases.

Long term, I suspect the right solution to this problem will be using a much faster storage format (either Parquet or DuckDB once their file format stabilizes with v1.0). Short term I think we'll just have to make the "no big computations, but slower read from the DB" trade-off. It'll certainly be much better than having to do both lots of reading from SQLite and lots of compute.

The reason I added the functionality was indeed to be able to serialize an entire PudlTabl object to avoid repeating time-consuming calculations. So I agree that having all the tables pre-computed and in in the SQLite DB removes the need for this functionality. To the extent DB reads become an issue for us, we can just modify our current process of caching everything as parquets to plug into the DB rather than PudlTabl.

Another point is that because this functionality wasn't released, we don't have much code (if any that I can think of) that depends on it, I mention this in case you want to remove it at some point soon before it slowly breaks in confusing ways. Normally I'd offer to help fix it so that it keeps working but given the plans for PudlTabl that doesn't seem worth it.

Create new dagster assets: - denorm_plants_utilities_eia which replaces pu_eia860 - denorm_ownership_eia860 which replaces ownership_eia860 / own_eia860 Both of these are getting written into the DB for now, since both of them had / have dedicated methods in the `PudlTabl` output manager class.

Convert the newly denormalized entity tables and EIA-860 output tables to use the new dynamic output methods in the PudlTabl object.

WIP dagsterize plant entity output table

f627a40

Update schema in pudltabl

0f0deb6

e-belfer linked an issue Apr 4, 2023 that may be closed by this pull request

Convert EIA entity / annual output tables to Dagster assets #2433

Closed

e-belfer mentioned this pull request Apr 4, 2023

Convert pu_eia860 to SQL view asset #2339

Closed

e-belfer linked an issue Apr 4, 2023 that may be closed by this pull request

Convert pu_eia860 to SQL view asset #2339

Closed

e-belfer and others added 2 commits April 4, 2023 17:54

Add pu_eia and denorm_boilers_eia

6ce0304

Merge branch 'init-sql-views' into dagsterize-eia-entity-outputs

14141cc

e-belfer changed the title ~~WIP dagsterize plant entity output table~~ Dagsterize EIA plant entity output tables Apr 5, 2023

e-belfer self-assigned this Apr 5, 2023

e-belfer added the inframundo label Apr 5, 2023

Adding generators, fixing metadata

f94f5ad

e-belfer requested a review from zaneselvans April 5, 2023 15:15

e-belfer marked this pull request as ready for review April 5, 2023 15:18

e-belfer added 2 commits April 5, 2023 11:27

Undo accidental change to pu_ferc1

be87560

Undo accidental change to pu_ferc1

b86694e

zaneselvans added eia923 Anything having to do with EIA Form 923 eia860 Anything having to do with EIA Form 860 output Exporting data from PUDL into other platforms or interchange formats. dagster Issues related to our use of the Dagster orchestrator labels Apr 5, 2023

zaneselvans added this to the 2023 Spring milestone Apr 5, 2023

e-belfer added 4 commits April 5, 2023 11:39

Fix denorm_eia function descriptions

cf9a097

WIP ownership and gen table

a61afbb

Add time-filter into to PudlTabl

97256b4

Walk back changes that should be an another branch - whoops!

9c9bf7d

zaneselvans requested changes Apr 5, 2023

View reviewed changes

bendnorman mentioned this pull request Apr 5, 2023

Create simple SQL view assets #2445

Merged

8 tasks

Base automatically changed from init-sql-views to dev April 6, 2023 06:36

zaneselvans mentioned this pull request Apr 7, 2023

Create Dynamic PudlTabl methods #2498

Merged

zaneselvans added 2 commits April 7, 2023 19:39

Merge branch 'dev' into dagsterize-eia-entity-outputs

e241dde

Merge branch 'dev' into dagsterize-eia-entity-outputs

40c799f

e-belfer assigned zaneselvans and unassigned e-belfer Apr 10, 2023

zaneselvans marked this pull request as draft April 10, 2023 20:58

zaneselvans added 5 commits April 10, 2023 20:39

Define denormalized ownership_eia860 resource.

f08531b

Rename pu_eia860 denorm_plants_utilities_eia. Add utility_id_eia to PK.

6c9b4b7

Read denormalized entity + EIA-860 tables from SQLite

30ce490

Convert the newly denormalized entity tables and EIA-860 output tables to use the new dynamic output methods in the PudlTabl object.

Merge branch 'dev' into dagsterize-eia-entity-outputs

c2a2185

zaneselvans changed the title ~~Dagsterize EIA plant entity output tables~~ Dagsterize EIA output tables Apr 11, 2023

zaneselvans mentioned this pull request Apr 11, 2023

Dagsterize EIA output tables #2513

Closed

8 tasks

zaneselvans closed this Apr 11, 2023

zaneselvans deleted the dagsterize-eia-entity-outputs branch April 11, 2023 03:18

zaneselvans mentioned this pull request Apr 11, 2023

Dagsterize EIA output tables #2519

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dagsterize EIA output tables #2490

Dagsterize EIA output tables #2490

e-belfer commented Apr 4, 2023 •

edited by zaneselvans

Loading

review-notebook-app bot commented Apr 4, 2023

codecov bot commented Apr 4, 2023 •

edited

Loading

e-belfer commented Apr 4, 2023 •

edited

Loading

zaneselvans commented Apr 4, 2023

zaneselvans left a comment

zaneselvans Apr 5, 2023

zaneselvans Apr 5, 2023

zaneselvans Apr 5, 2023

zaneselvans Apr 5, 2023

zaneselvans Apr 5, 2023

zaneselvans Apr 5, 2023

bendnorman Apr 5, 2023

zaneselvans Apr 7, 2023

zaneselvans Apr 5, 2023

zaneselvans commented Apr 5, 2023

arengel commented Apr 6, 2023



		@asset(io_manager_key="pudl_sqlite_io_manager", compute_kind="Python")
		def pu_eia(

Dagsterize EIA output tables #2490

Dagsterize EIA output tables #2490

Conversation

e-belfer commented Apr 4, 2023 • edited by zaneselvans Loading

Questions for reviewer

Dependencies

PR Checklist

review-notebook-app bot commented Apr 4, 2023

codecov bot commented Apr 4, 2023 • edited Loading

Codecov Report

e-belfer commented Apr 4, 2023 • edited Loading

zaneselvans commented Apr 4, 2023

zaneselvans left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zaneselvans commented Apr 5, 2023

arengel commented Apr 6, 2023

e-belfer commented Apr 4, 2023 •

edited by zaneselvans

Loading

codecov bot commented Apr 4, 2023 •

edited

Loading

e-belfer commented Apr 4, 2023 •

edited

Loading