FERC to EIA Entity Matching Refactor #3184
Conversation
This reverts commit 4f38401.
@@ -32,6 +35,7 @@ class FeatureMatrix:
    """

    matrix: np.ndarray | scipy.sparse.csr_matrix
    index: pd.Index
I added this variable to keep track of the index of the matrix, because in the FERC to EIA context it's useful to know what record_id_eia x record_id_ferc1 pair each row in the matrix corresponds to.
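For illustration, a minimal sketch of how such an index ties matrix rows back to record pairs. This is not the PR's actual class; `FeatureMatrixSketch` and the record IDs are made up:

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass
class FeatureMatrixSketch:
    """Pair a feature matrix with the record-pair index of its rows."""

    matrix: np.ndarray
    index: pd.Index


# Each row of the matrix corresponds to one candidate
# (record_id_ferc1, record_id_eia) pair.
pairs = pd.MultiIndex.from_tuples(
    [("f1_plant_1", "eia_plant_9"), ("f1_plant_2", "eia_plant_3")],
    names=["record_id_ferc1", "record_id_eia"],
)
fm = FeatureMatrixSketch(matrix=np.zeros((2, 4)), index=pairs)

# The index lets you recover which pair a given row of features describes.
assert fm.matrix.shape[0] == len(fm.index)
```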
@@ -69,37 +73,36 @@ def as_pipeline(self):
    )


def dataframe_embedder_factory(vectorizers: dict[str, ColumnVectorizer]):
I'm pretty confused about why I was unable to use this dataframe_embedder_factory. Basically, when I created another instance of the factory in the FERC to EIA model, it created duplicate nodes with the same names as the ops in the FERC to FERC model. When I moved the two ops and the graph outside of this factory function, I was able to call the graph twice, once in each of the two different models. Maybe it should somehow be made into a class? It would be nice to bundle the two ops and graph into one container.
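The collision makes sense if node names are resolved globally within a job: each factory call emits ops with the same fixed names. A common fix is to parameterize the names. Here is a plain-Python sketch of that pattern (no dagster involved; the registry and `name_prefix` parameter are made up to mimic how a global name registry behaves):

```python
_registered_nodes: set[str] = set()


def register_node(name: str) -> None:
    """Mimic a name registry that rejects duplicate node names."""
    if name in _registered_nodes:
        raise ValueError(f"duplicate node name: {name}")
    _registered_nodes.add(name)


def embedder_factory(name_prefix: str) -> str:
    """Build 'ops' whose names are unique per factory call."""
    register_node(f"{name_prefix}_train_vectorizers")
    register_node(f"{name_prefix}_apply_vectorizers")
    return f"{name_prefix}_graph"


# With unique prefixes, two models can each get their own copy of the graph:
ferc_graph = embedder_factory("ferc_to_ferc")
eia_graph = embedder_factory("ferc_to_eia")
```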
I'll see if I can figure out why this isn't working, but for the time being I think it's OK to move forward as is.
    return embed_dataframe


@graph
def embed_dataframe_graph(df: pd.DataFrame, vectorizers) -> FeatureMatrix:
Maybe embed_dataframe_graph isn't the best name for this graph. It was previously just embed_dataframe, but now that you actually call this graph instead of just the dataframe_embedder_factory, I wanted to make sure that this graph didn't have the same name as the module (embed_dataframe).
    category_cols = df.select_dtypes(include="category").columns
    df[category_cols] = df[category_cols].astype("str")
I'm not sure why, but I was getting errors when cleaning the category-type columns, because converting nulls to the empty string adds another category value that isn't set in categories. I'm not sure why this wasn't an issue with plant_type in the FERC to FERC model. Maybe I should reset the categories themselves instead of converting all category columns to strings?
The plant and construction types are both being converted to strings in the FERC-FERC model before vectorizing, so I think this is pretty much equivalent to how that works.
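The error is consistent with how pandas categoricals work: you can only fill nulls with a value that is already a registered category. A small sketch of both workarounds, casting to string (as this PR does) or registering the new category first (the column values here are made up):

```python
import pandas as pd

s = pd.Series(["steam", None, "nuclear"], dtype="category")

# Filling directly fails because "" is not one of the categories.
try:
    s.fillna("")
except (TypeError, ValueError):
    print("cannot fill a categorical with an unregistered value")

# Workaround 1 (what this PR does): drop the categorical dtype entirely.
as_str = s.astype("str")

# Workaround 2: register "" as a category first, keeping the dtype.
filled = s.cat.add_categories("").fillna("")
```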
    if isinstance(df, pd.DataFrame) and len(df.columns) > 1:
        clean_df = pd.DataFrame()
        for col in df.columns:
            clean_df = pd.concat(
                [clean_df, df[col].apply(self.get_clean_data)], axis=1
            )
        return clean_df
This allows the name cleaner to work on a dataframe with multiple columns, so if you pass in a dataframe with plant_name_ferc1 and plant_name_eia then it will clean both columns separately and concatenate them back together.
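Roughly, the per-column loop is equivalent to applying the cleaner column by column with pandas. The `get_clean_data` here is a made-up stand-in (the real cleaner lives on the name-cleaner class and does much more):

```python
import pandas as pd


def get_clean_data(name):
    """Stand-in for the PR's name cleaner: lowercase and strip."""
    return str(name).lower().strip()


df = pd.DataFrame(
    {
        "plant_name_ferc1": [" Barry ", "Big Bend"],
        "plant_name_eia": ["BARRY", "big bend "],
    }
)

# Clean each column independently, then reassemble the frame.
clean_df = df.apply(lambda col: col.map(get_clean_data))
```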
Hey @katie-lamb if we know that the integration tests can't pass, I'm going to switch this back to a draft so that every new push doesn't cost us a few dollars in GitHub runner time.
@zschira The CI is passing locally for me, so I'm not sure why it's not passing here. Investigating!
Glad it seemed pretty straightforward to adapt to the new infrastructure! I don't have anything blocking, just some minor comments/suggestions.
@op
def get_all_pairs_df(inputs):
It would be nice to be able to use the same op for get_all_pairs_df and get_train_pairs_df since they're doing the same thing. Maybe we should pass the inputs to an op that just does the merge?
I changed this so that inputs is passed into one op that returns both all_pairs_df and train_pairs_df.
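To sketch the idea in plain pandas (without dagster, and with made-up merge logic and column names): one function can do the merge once and return both frames, since the training pairs are just a subset of all candidate pairs.

```python
import pandas as pd


def get_pairs(ferc_df, eia_df, train_ids):
    """Build all candidate pairs once; the training pairs are a subset."""
    all_pairs_df = ferc_df.merge(eia_df, how="cross")
    is_train = all_pairs_df["record_id_ferc1"].isin(train_ids)
    train_pairs_df = all_pairs_df.loc[is_train]
    return all_pairs_df, train_pairs_df


ferc = pd.DataFrame({"record_id_ferc1": ["f1", "f2"]})
eia = pd.DataFrame({"record_id_eia": ["e1", "e2", "e3"]})
all_pairs, train_pairs = get_pairs(ferc, eia, train_ids={"f1"})
```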
@asset(
    name="out_pudl__yearly_assn_eia_ferc1_plant_parts",
    io_manager_key="pudl_sqlite_io_manager",
    compute_kind="Python",
)
def out_pudl__yearly_assn_eia_ferc1_plant_parts(
    _out_pudl__yearly_assn_eia_ferc1_plant_parts: pd.DataFrame,
) -> pd.DataFrame:
    """Linkage between FERC1 plants and EIA plant parts for persistent storage.

    Args:
        out_eia__yearly_plant_parts: The EIA plant parts list.
    """
    return _out_pudl__yearly_assn_eia_ferc1_plant_parts
@zschira I found out that you can't write an output to persistent storage from a graph_asset, so I made this separate asset, which takes the intermediary output from the graph_asset _out_pudl__yearly_assn_eia_ferc1_plant_parts and basically just outputs an asset to the DB. Maybe there's a better way to do this besides having this wrapper asset, which is almost identically named? Maybe I'm not understanding how to write to the DB directly from a graph asset? It's nice to keep the graph asset, though, so you can run individual ops and not have to run the whole linking process at once.
I think the problem is that we're not setting the IO manager in the graph asset. I remember messing with this a while back for something, and I think I could only get it to work by using a graph_multi_asset because the API is a little weird. I can see if I can get it to work that way in the morning, and if not, it's probably OK as is.
Hmm, I think I tried setting it in the graph_multi_asset and didn't have success. I think graph_multi_asset doesn't take io_manager_key as a parameter, so I tried to set it within the outs parameter:

outs={"out_pudl__yearly_assn_eia_ferc1_plant_parts": AssetOut(io_manager_key="pudl_sqlite_io_manager")}

But maybe that's not the right way to go about it.
)


@op
@katie-lamb I see what you mean. I tried using the graph_multi_asset and specifying the IO manager through the outs, but it's still not actually writing to the DB for some reason. I did find that you can specify the IO manager for the final op in the graph, and that works. IDK if this is actually better than just using the second asset, but it's an option.
@op
@op(out={"out_pudl__yearly_assn_eia_ferc1_plant_parts": Out(io_manager_key="pudl_sqlite_manager_key")})
I tested this locally and I was able to remove the wrapper asset and get this to write to the DB as expected. Either way is a bit janky; it would be nice if dagster just let us specify the asset manager directly in a graph_asset.
Ya I think I like this better, just because then there's no confusion about what out_pudl__yearly... vs _out_pudl__yearly... is. Will make this change.
After comparing results of the current vs incoming matching model, there are 519 out of the 24430 matches (2% of matches) that change from the old to the new model. After spot checking these changed matches and looking at the average change in each of the "matching columns" between the new and old results, it doesn't seem like the 519 new matches are decidedly better or worse. The change likely comes from the name cleaning and the null filling that happens in this PR.
Codecov Report

@@           Coverage Diff           @@
##             main   #3184    +/-  ##
======================================
  Coverage    92.7%   92.7%
======================================
  Files         143     143
  Lines       12972   13040    +68
======================================
+ Hits        12022   12087    +65
- Misses        950     953     +3

☔ View full report in Codecov by Sentry.
Merging this manually as an admin since it's passed CI 3 times and keeps getting scooped by other PRs.
Overview
I'm not sure how to pin the issue since it's in a different repo, but it closes catalyst-cooperative/ccai-entity-matching#108 in the CCAI repo.
This PR incorporates the new vectorization infrastructure for entity matching problems into the FERC to EIA matching model.
Major Changes
Removed the recordlinkage dependency that was previously being used to vectorize the model input dataframes and replaced that vectorization process with the infrastructure that newly went in with the FERC to FERC model update.

Still need to do:
Rename the embed_dataframe module to be more descriptive.

Out of scope
faiss clustering

Testing
How did you make sure this worked? How can a reviewer verify this?
Accuracy is now at .74 on the test set (previously .73). Consistency of matches is still .74. I'm going to compare the output matches from the model in dev and this model and see if there are any changes in what got matched.

To-do list