FERC to EIA Entity Matching Refactor #3184
Conversation
This reverts commit 4f38401.
@@ -32,6 +35,7 @@ class FeatureMatrix:
    """

    matrix: np.ndarray | scipy.sparse.csr_matrix
    index: pd.Index
I added this variable to keep track of the index of the matrix, because in the FERC to EIA context it's useful to know what record_id_eia x record_id_ferc1 pair each row in the matrix corresponds to.
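For illustration, a minimal sketch of how such an index ties matrix rows back to record pairs. This is not the PR's actual class; `FeatureMatrixSketch` and the record IDs are made up:

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass
class FeatureMatrixSketch:
    """Pair a feature matrix with the record-pair index of its rows."""

    matrix: np.ndarray
    index: pd.Index


# Each row of the matrix corresponds to one candidate
# (record_id_ferc1, record_id_eia) pair.
pairs = pd.MultiIndex.from_tuples(
    [("f1_plant_1", "eia_plant_9"), ("f1_plant_2", "eia_plant_3")],
    names=["record_id_ferc1", "record_id_eia"],
)
fm = FeatureMatrixSketch(matrix=np.zeros((2, 4)), index=pairs)

# The index lets you recover which pair a given row of features describes.
assert fm.matrix.shape[0] == len(fm.index)
```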
@@ -69,37 +73,36 @@ def as_pipeline(self):
    )


def dataframe_embedder_factory(vectorizers: dict[str, ColumnVectorizer]):
I'm pretty confused about why I was unable to use this dataframe_embedder_factory. Basically, when I created another instance of the factory in the FERC to EIA model, it created duplicate nodes with the same names as the ops in the FERC to FERC model. When I moved the two ops and the graph outside of this factory function, I was able to call the graph twice, once in each of the two different models. Maybe it should somehow be made into a class? It would be nice to bundle the two ops and graph into one container.
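The collision makes sense if node names are resolved globally within a job: each factory call emits ops with the same fixed names. A common fix is to parameterize the names. Here is a plain-Python sketch of that pattern (no dagster involved; the registry and `name_prefix` parameter are made up to mimic how a global name registry behaves):

```python
_registered_nodes: set[str] = set()


def register_node(name: str) -> None:
    """Mimic a name registry that rejects duplicate node names."""
    if name in _registered_nodes:
        raise ValueError(f"duplicate node name: {name}")
    _registered_nodes.add(name)


def embedder_factory(name_prefix: str) -> str:
    """Build 'ops' whose names are unique per factory call."""
    register_node(f"{name_prefix}_train_vectorizers")
    register_node(f"{name_prefix}_apply_vectorizers")
    return f"{name_prefix}_graph"


# With unique prefixes, two models can each get their own copy of the graph:
ferc_graph = embedder_factory("ferc_to_ferc")
eia_graph = embedder_factory("ferc_to_eia")
```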
I'll see if I can figure out why this isn't working, but for the time being I think it's OK to move forward as is.
    return embed_dataframe


@graph
def embed_dataframe_graph(df: pd.DataFrame, vectorizers) -> FeatureMatrix:
Maybe embed_dataframe_graph isn't the best name for this graph. It was previously just embed_dataframe, but now that you actually call this graph instead of just the dataframe_embedder_factory, I wanted to make sure that this graph didn't have the same name as the module (embed_dataframe).
    category_cols = df.select_dtypes(include="category").columns
    df[category_cols] = df[category_cols].astype("str")
I'm not sure why, but I was getting errors when cleaning the category-type columns, because converting nulls to the empty string adds another category value that isn't set in categories. I'm not sure why this wasn't an issue with plant_type in the FERC to FERC model. Maybe I should reset the categories themselves instead of converting all category columns to strings?
The plant and construction types are both being converted to strings in the FERC-FERC model before vectorizing, so I think this is pretty much equivalent to how that works.
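The error is consistent with how pandas categoricals work: you can only fill nulls with a value that is already a registered category. A small sketch of both workarounds, casting to string (as this PR does) or registering the new category first (the column values here are made up):

```python
import pandas as pd

s = pd.Series(["steam", None, "nuclear"], dtype="category")

# Filling directly fails because "" is not one of the categories.
try:
    s.fillna("")
except (TypeError, ValueError):
    print("cannot fill a categorical with an unregistered value")

# Workaround 1 (what this PR does): drop the categorical dtype entirely.
as_str = s.astype("str")

# Workaround 2: register "" as a category first, keeping the dtype.
filled = s.cat.add_categories("").fillna("")
```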
    if isinstance(df, pd.DataFrame) and len(df.columns) > 1:
        clean_df = pd.DataFrame()
        for col in df.columns:
            clean_df = pd.concat(
                [clean_df, df[col].apply(self.get_clean_data)], axis=1
            )
        return clean_df
This allows the name cleaner to work on a dataframe with multiple columns, so if you pass in a dataframe with plant_name_ferc1 and plant_name_eia then it will clean both columns separately and concatenate them back together.
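Roughly, the per-column loop is equivalent to applying the cleaner column by column with pandas. The `get_clean_data` here is a made-up stand-in (the real cleaner lives on the name-cleaner class and does much more):

```python
import pandas as pd


def get_clean_data(name):
    """Stand-in for the PR's name cleaner: lowercase and strip."""
    return str(name).lower().strip()


df = pd.DataFrame(
    {
        "plant_name_ferc1": [" Barry ", "Big Bend"],
        "plant_name_eia": ["BARRY", "big bend "],
    }
)

# Clean each column independently, then reassemble the frame.
clean_df = df.apply(lambda col: col.map(get_clean_data))
```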
Hey @katie-lamb if we know that the integration tests can't pass, I'm going to switch this back to a draft so that every new push doesn't cost us a few dollars in GitHub runner time.
@zschira The CI is passing locally for me, so I'm not sure why it's not passing here. Investigating!
Glad it seemed pretty straightforward to adapt to the new infrastructure! I don't have anything blocking, just some minor comments/suggestions.
@op
def get_all_pairs_df(inputs):
It would be nice to be able to use the same op for get_all_pairs_df and get_train_pairs_df since they're doing the same thing. Maybe we should pass the inputs to an op that just does the merge?
I changed this so that inputs is passed into one op that returns both all_pairs_df and train_pairs_df.
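To sketch the idea in plain pandas (without dagster, and with made-up merge logic and column names): one function can do the merge once and return both frames, since the training pairs are just a subset of all candidate pairs.

```python
import pandas as pd


def get_pairs(ferc_df, eia_df, train_ids):
    """Build all candidate pairs once; the training pairs are a subset."""
    all_pairs_df = ferc_df.merge(eia_df, how="cross")
    is_train = all_pairs_df["record_id_ferc1"].isin(train_ids)
    train_pairs_df = all_pairs_df.loc[is_train]
    return all_pairs_df, train_pairs_df


ferc = pd.DataFrame({"record_id_ferc1": ["f1", "f2"]})
eia = pd.DataFrame({"record_id_eia": ["e1", "e2", "e3"]})
all_pairs, train_pairs = get_pairs(ferc, eia, train_ids={"f1"})
```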
@asset(
    name="out_pudl__yearly_assn_eia_ferc1_plant_parts",
    io_manager_key="pudl_sqlite_io_manager",
    compute_kind="Python",
)
def out_pudl__yearly_assn_eia_ferc1_plant_parts(
    _out_pudl__yearly_assn_eia_ferc1_plant_parts: pd.DataFrame,
) -> pd.DataFrame:
    """Linkage between FERC1 plants and EIA plant parts for persistent storage.

    Args:
        out_eia__yearly_plant_parts: The EIA plant parts list.
    """
    return _out_pudl__yearly_assn_eia_ferc1_plant_parts
@zschira I found out that you can't write an output to persistent storage from a graph_asset, so I made this separate asset, which takes the intermediary output from the graph_asset _out_pudl__yearly_assn_eia_ferc1_plant_parts and basically just outputs an asset to the DB. Maybe there's a better way to do this besides having this wrapper asset, which is almost identically named? Maybe I'm not understanding how to write to the DB directly from a graph asset? It's nice to keep the graph asset, though, so you can run individual ops and not have to run the whole linking process at once.
I think the problem is that we're not setting the IO manager in the graph asset. I remember messing with this a while back for something, and I think I could only get it to work by using a graph_multi_asset because the API is a little weird. I can see if I can get it to work that way in the morning, and if not, it's probably OK as is.
Hmm, I think I tried setting it in the graph_multi_asset and didn't have success. I think graph_multi_asset doesn't take io_manager_key as a parameter, so I tried to set it within the outs parameter:

outs={"out_pudl__yearly_assn_eia_ferc1_plant_parts": AssetOut(io_manager_key="pudl_sqlite_io_manager")}

But maybe that's not the right way to go about it.
)


@op
@katie-lamb I see what you mean. I tried using the graph_multi_asset and specifying the IO manager through the outs, but it's still not actually writing to the DB for some reason. I did find that you can specify the IO manager for the final op in the graph, and that works. IDK if this is actually better than just using the second asset, but it's an option.
@op
@op(out={"out_pudl__yearly_assn_eia_ferc1_plant_parts": Out(io_manager_key="pudl_sqlite_manager_key")})
I tested this locally and I was able to remove the wrapper asset and get this to write to the DB as expected. Either way is a bit janky; it would be nice if dagster just let us specify the asset manager directly in a graph_asset.
Ya I think I like this better, just because then there's no confusion about what out_pudl__yearly... vs _out_pudl__yearly... is. Will make this change.
After comparing results of the current vs incoming matching model, there are 519 out of the 24430 matches (2% of matches) that change from the old to the new model. After spot checking these changed matches and looking at the average change in each of the "matching columns" between the new and old results, it doesn't seem like the 519 new matches are decidedly better or worse. The change likely comes from the name cleaning and the null filling that happens in this PR.
Codecov Report

@@           Coverage Diff           @@
##             main   #3184    +/-  ##
======================================
  Coverage    92.7%   92.7%
======================================
  Files         143     143
  Lines       12972   13040    +68
======================================
+ Hits        12022   12087    +65
- Misses        950     953     +3

☔ View full report in Codecov by Sentry.
Merging this manually as an admin since it's passed CI 3 times and keeps getting scooped by other PRs.
Overview
I'm not sure how to pin the issue since it's in a different repo, but it closes catalyst-cooperative/ccai-entity-matching#108 in the CCAI repo.
This PR incorporates the new vectorization infrastructure for entity matching problems into the FERC to EIA matching model.
Major Changes
Removed the recordlinkage dependency that was previously being used to vectorize the model input dataframes and replaced that vectorization process with the infrastructure that newly went in with the FERC to FERC model update.

Still need to do:
Rename the embed_dataframe module to be more descriptive.

Out of scope
faiss clustering

Testing
How did you make sure this worked? How can a reviewer verify this?
Accuracy is now at .74 on the test set (previously .73). Consistency of matches is still .74. I'm going to compare the output matches from the model in dev and this model and see if there are any changes in what got matched.

To-do list