Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FERC to EIA Entity Matching Refactor #3184

Merged
merged 33 commits into from
Jan 12, 2024
Merged

FERC to EIA Entity Matching Refactor #3184

merged 33 commits into from
Jan 12, 2024

Conversation

katie-lamb
Copy link
Member

@katie-lamb katie-lamb commented Dec 21, 2023

Overview

I'm not sure how to pin the issue since it's in a different repo, but it closes catalyst-cooperative/ccai-entity-matching#108 in the CCAI repo.

This PR incorporates the new vectorization infrastructure for entity matching problems into the FERC to EIA matching model.

Major Changes

  • Removed the recordlinkage dependency that was previously being used to vectorize the model input dataframes and replaced that vectorization process with the infrastructure that newly went in with the FERC to FERC model update.
  • Added cleaning steps to the columns to improve results.
  • Had to rework the dataframe embedding graph and ops because I couldn't get it to cooperate with a factory function (see below inline comment)

Still need to do:

  • Update the doc string in the new vectorization functions that I added to the embed_dataframe module to be more descriptive.

Out of scope

  • Experiment tracking
  • Integration of blocking with faiss clustering
  • Splink model
  • More advanced imputations of nulls in columns.

Testing

How did you make sure this worked? How can a reviewer verify this?

Accuracy is now at .74 on the test set (previously .73). Consistency of matches is still .74. I'm going to compare the output matches from the model in dev and this model and see if there are any changes in what got matched.

To-do list

Preview Give feedback

@katie-lamb katie-lamb marked this pull request as ready for review January 3, 2024 00:09
@katie-lamb katie-lamb self-assigned this Jan 3, 2024
@katie-lamb katie-lamb requested a review from zschira January 3, 2024 00:10
@@ -32,6 +35,7 @@ class FeatureMatrix:
"""

matrix: np.ndarray | scipy.sparse.csr_matrix
index: pd.Index
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this variable to keep track of the index of the matrix, because in the FERC to EIA context, it's useful to know what record_id_eia x record_id_ferc1 pair each row in the matrix corresponds to.

@@ -69,37 +73,36 @@ def as_pipeline(self):
)


def dataframe_embedder_factory(vectorizers: dict[str, ColumnVectorizer]):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty confused why I was unable to use this dataframe_embedder_factory. Basically, when I created another instance of the factory in the FERC to EIA model, it created duplicate nodes with the same name as the ops in the FERC to FERC model. When I moved the two ops and graph outside of this factory function then I was able to call the graph twice, once in each of the two different models. Maybe it should somehow be made into a class? It would be nice to bundle the two ops and graph into one container.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see if I can figure out why this isn't working, but I think for the time being I think it's ok moving forward as is

return embed_dataframe

@graph
def embed_dataframe_graph(df: pd.DataFrame, vectorizers) -> FeatureMatrix:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe embed_dataframe_graph isn't the best name for this graph. It was previously just embed_dataframe but now that you actually call this graph instead of just the dataframe_embedder_factory, I wanted to make sure that this graph did't ahve the same name as the module (embed_dataframe).

Comment on lines +156 to +157
category_cols = df.select_dtypes(include="category").columns
df[category_cols] = df[category_cols].astype("str")
Copy link
Member Author

@katie-lamb katie-lamb Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why, but I was getting errors when cleaning the category type columns because converting nulls to the empty string adds another category value that's not set in categories. I'm not sure why this wasn't an issue with plant_type in the FERC to FERC model. Maybe I should reset the categories themselves instead of converting all category columns to strings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plant and construction types are both being converted to strings in the ferc-ferc model before vectorizing, so I think this is pretty much equivalent to how that works

Comment on lines +269 to +275
if isinstance(df, pd.DataFrame) and len(df.columns) > 1:
clean_df = pd.DataFrame()
for col in df.columns:
clean_df = pd.concat(
[clean_df, df[col].apply(self.get_clean_data)], axis=1
)
return clean_df
Copy link
Member Author

@katie-lamb katie-lamb Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows the name cleaner to work on a dataframe with multiple columns, so if you pass in a dataframe with plant_name_ferc1 and plant_name_eia then it will clean both columns separately and concatenate them back together.

@katie-lamb katie-lamb marked this pull request as draft January 3, 2024 00:38
@zaneselvans
Copy link
Member

Hey @katie-lamb if we know that the integration tests can't pass, I'm going to switch this back to a draft so that every new push doesn't cost us a few dollars in GitHub runner time.

@zaneselvans zaneselvans marked this pull request as draft January 5, 2024 18:51
@katie-lamb
Copy link
Member Author

@zschira The CI is passing locally for me, so I'm not sure why it's not passing here. Investigating!

Copy link
Member

@zschira zschira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad it seemed pretty straightforward to adapt to the new infrastructure! I don't have anything blocking, just some minor comments/suggestions.

src/pudl/analysis/record_linkage/classify_plants_ferc1.py Outdated Show resolved Hide resolved
src/pudl/analysis/record_linkage/classify_plants_ferc1.py Outdated Show resolved Hide resolved
@@ -69,37 +73,36 @@ def as_pipeline(self):
)


def dataframe_embedder_factory(vectorizers: dict[str, ColumnVectorizer]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see if I can figure out why this isn't working, but I think for the time being I think it's ok moving forward as is



@op
def get_all_pairs_df(inputs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to be able to use the same op for get_all_pairs_df and get_train_pairs_df since they're doing the same thing. Maybe we should pass the inputs to an op that just does the merge?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this so that inputs is passed into one op that returns both all_pairs_df and train_pairs_df.

Comment on lines 372 to 385
@asset(
name="out_pudl__yearly_assn_eia_ferc1_plant_parts",
io_manager_key="pudl_sqlite_io_manager",
compute_kind="Python",
)
def out_pudl__yearly_assn_eia_ferc1_plant_parts(
_out_pudl__yearly_assn_eia_ferc1_plant_parts: pd.DataFrame,
) -> pd.DataFrame:
"""Linkage between FERC1 plants and EIA plant parts for persistent storage.

Args:
out_eia__yearly_plant_parts: The EIA plant parts list.
"""
return _out_pudl__yearly_assn_eia_ferc1_plant_parts
Copy link
Member Author

@katie-lamb katie-lamb Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zschira I found out that you can't write an output to persistent storage from a graph_asset so I made this seperate asset which takes the intermediary output from the graph_asset _out_pudl__yearly_assn_eia_ferc1_plant_parts and basically just outputs an asset to the DB. Maybe there's a better way to do this besides having this wrapper asset which is almost identically named? Maybe I'm not understanding how to write to the DB directly from a graph asset? It's nice to keep the graph asset though so you can run individual ops, and not have to run the whole linking process at once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is that we're not setting the io manager in the graph asset. I remember messing with this awhile back for something, and I think I could only get it to work by using a graph_multi_asset because the API is a little weird. I can see if I can get it to work that way in the morning, and if not it's probably ok as is

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think I tried setting it in the graph_multi_asset and didn't have success. I think graph_multi_asset doesn't take io_manager_key as a parameter, and so I tried to set it within the outs parameter:

outs={"out_pudl__yearly_assn_eia_ferc1_plant_parts" : AssetOut(io_manager_key="pudl_sqlite_io_manager")

But maybe that's not the right way to go about it.

)


@op
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@katie-lamb I see what you mean, I tried using the graph_multi_asset and specifying the io-manager through the outs, but it's still not actually writing to the DB for some reason. I did you find you can specify the io-manager for the final op in the graph, and that works. IDK if this is actually better than just using the second asset, but it's an option.

Suggested change
@op
@op(out={"out_pudl__yearly_assn_eia_ferc1_plant_parts": Out(io_manager_key="pudl_sqlite_manager_key"})

I tested this locally and I was able to remove the wrapper asset and get this to write to the DB as expected. Either way is a bit janky, it would be nice if dagster just let us specify the asset manager directly in a graph_asset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya I think I like this better, just because then there's no confusion with what out_pudl__yearly... vs _out_pudl__yearly... is. Will make this change.

@katie-lamb
Copy link
Member Author

After comparing results of the current vs incoming matching model, there are 519 out of the 24430 matches (2% of matches) that change from the old to the new model. After spot checking these changed matches and looking at the average change in each of the "matching columns" between the new and old results, it doesn't seem like the 519 new matches are decidedly better or worse. The change likely comes from the name cleaning and the null filling that happens in this PR.

@katie-lamb katie-lamb marked this pull request as ready for review January 11, 2024 20:22
Copy link

codecov bot commented Jan 11, 2024

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (8a2c131) 92.7% compared to head (f14d121) 92.7%.
Report is 14 commits behind head on main.

Files Patch % Lines
...rc/pudl/analysis/record_linkage/embed_dataframe.py 92.1% 5 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main   #3184   +/-   ##
=====================================
  Coverage   92.7%   92.7%           
=====================================
  Files        143     143           
  Lines      12972   13040   +68     
=====================================
+ Hits       12022   12087   +65     
- Misses       950     953    +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@katie-lamb katie-lamb enabled auto-merge (squash) January 12, 2024 03:16
@zaneselvans zaneselvans disabled auto-merge January 12, 2024 19:17
@zaneselvans zaneselvans merged commit f59a98e into main Jan 12, 2024
11 of 13 checks passed
@zaneselvans zaneselvans deleted the ferc_to_eia branch January 12, 2024 19:18
@zaneselvans
Copy link
Member

Merging this manually as an admin since it's passed CI 3 times and keeps getting scooped by other PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ccai Tasks related to CCAI grant for entity matching eia860 Anything having to do with EIA Form 860 eia923 Anything having to do with EIA Form 923 ferc1 Anything having to do with FERC Form 1
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Apply PUDL entity matching framework to FERC-EIA
3 participants