Add new authors.csv, adjust published columns and tests #99

Merged: 6 commits, merged Sep 26, 2024
Changes from 3 commits
4 changes: 2 additions & 2 deletions rialto_airflow/harvest/dimensions.py
@@ -16,10 +16,10 @@
def dois_from_orcid(orcid):
logging.info(f"looking up dois for orcid {orcid}")
q = """
-    search publications where researchers.orcid_id = "{}"
+    search publications where researchers.orcid_id = "{}" and year in [2018:{}]
Contributor:
Why are we limiting to publications between 2018 and 2024? This seems like something we will forget about, and we will quietly stop getting new data.

Contributor Author:
The thought was that since Rochelle only cared about pubs after 2017, we could speed up the run time by filtering in the query, but I don't think it actually did speed things up, so I pulled it out.

return publications [doi]
limit 1000
""".format(orcid)
""".format(orcid, 2024)

# The Dimensions API can flake out sometimes, so try to catch & retry.
# TODO: Consider using retry param in query() instead
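The hunk cuts off before the retry logic that the in-code comment mentions. Purely as illustration, here is a minimal catch-and-retry sketch; the `dimcli` client usage and the backoff policy are assumptions, not the module's actual code:

```python
import logging
import time

import dimcli  # assumed: the Dimensions API client used by the harvester


def query_with_retries(dsl: dimcli.Dsl, q: str, max_retries: int = 3):
    """Run a Dimensions DSL query, retrying when the API flakes out."""
    for attempt in range(1, max_retries + 1):
        try:
            return dsl.query(q)
        except Exception as e:
            logging.warning(f"Dimensions query failed (attempt {attempt}): {e}")
            time.sleep(2**attempt)  # simple exponential backoff
    raise RuntimeError(f"Dimensions query failed after {max_retries} attempts")
```

As the TODO notes, the retry support built into `query()` could replace a hand-rolled loop like this.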
6 changes: 1 addition & 5 deletions rialto_airflow/harvest/merge_pubs.py
@@ -57,11 +57,7 @@ def dimensions_pubs_df(dimensions_pubs):
"document_type",
"funders",
"funding_section",
"linkout",
"open_access",
"publisher",
"research_orgs",
"researchers",
"title",
"type",
"year",
@@ -86,7 +82,7 @@ def openalex_pubs_df(openalex_pubs):
"publication_year",
"title",
"type",
"best_oa_location",
Contributor:
We found it useful to return this in the Easy Deposit work cycle, because we wanted to get the URLs for open-access PDFs. But I guess it can be removed. Did you find it was not needed or redundant?

Contributor Author:

My thinking was that this DAG would drive the open access dashboard, and it's not needed for that dashboard. I think we will want another DAG for a publishers data dashboard, and others for other purposes. The other option would be to get all of the fields for all dashboards in one DAG and then ignore the fields we don't need, but I recall us having concerns about the size of the file with that approach.

Contributor Author:
I'm open to either approach. Let me know if you think we should discuss.

Contributor:
It seems like we should minimize the time we have to go out and get metadata from OpenAlex and Dimensions. As long as we have the data in the interim files, there will be an opportunity to use it in other datasets, like you said.

"open_access",
),
)
df = df.rename(lambda column_name: "openalex_" + column_name)
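For context, the trimming above is a select-then-prefix pattern in Polars. A minimal sketch, assuming a CSV input; the column list here is illustrative rather than the module's exact set:

```python
import polars as pl

# Illustrative subset of OpenAlex columns kept for the open access dashboard
OPENALEX_COLS = ["doi", "publication_year", "title", "type", "open_access"]


def openalex_pubs_df(csv_path: str) -> pl.LazyFrame:
    """Keep only the needed OpenAlex columns and prefix their names so
    they don't collide with Dimensions or sul_pub columns after the merge."""
    lazy_df = pl.scan_csv(csv_path).select(OPENALEX_COLS)
    return lazy_df.rename(lambda column_name: "openalex_" + column_name)
```

The lazy scan means the column pruning happens before the CSV is fully materialized, which keeps the merge step's memory footprint small.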
1 change: 1 addition & 0 deletions rialto_airflow/harvest/openalex.py
@@ -147,6 +147,7 @@ def normalize_publication(pub) -> dict:
"id",
"ids",
"indexed_in",
"institution_assertions",
"institutions_distinct_count",
"is_authors_truncated",
"is_paratext",
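Judging from the surrounding context, `normalize_publication` keeps every harvested record aligned to one fixed key set, and `institution_assertions` is added here because OpenAlex now returns it. A hedged sketch of that pattern; the real field list is much longer than shown:

```python
# Assumed shape only; the module's actual field list is longer.
FIELDS = [
    "id",
    "ids",
    "indexed_in",
    "institution_assertions",
    "institutions_distinct_count",
    "is_authors_truncated",
    "is_paratext",
]


def normalize_publication(pub: dict) -> dict:
    """Return a dict with exactly the expected keys (missing ones become
    None) so every row written to the harvest file has the same columns."""
    return {field: pub.get(field) for field in FIELDS}
```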
13 changes: 0 additions & 13 deletions rialto_airflow/harvest/sul_pub.py
@@ -7,29 +7,16 @@
SUL_PUB_FIELDS = [
"authorship",
"title",
"abstract",
"author",
"year",
"type",
"mesh_headings",
"publisher",
"journal",
"provenance",
"doi",
"issn",
"sulpubid",
"sw_id",
"pmid",
"identifier",
"last_updated",
"pages",
"date",
"country",
"booktitle",
"edition",
"series",
"chapter",
"editor",
]


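Presumably SUL_PUB_FIELDS controls which keys of each sul_pub record survive into the harvest CSV, so trimming it shrinks the interim files. A sketch under that assumption; the writer function itself is hypothetical:

```python
import csv


def write_sul_pub_csv(records, csv_path: str) -> None:
    """Hypothetical writer: keep only SUL_PUB_FIELDS columns per record."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=SUL_PUB_FIELDS, extrasaction="ignore")
        writer.writeheader()
        for record in records:
            writer.writerow(record)  # extra keys dropped, missing keys left blank
```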
10 changes: 5 additions & 5 deletions test/harvest/test_merge_pubs.py
@@ -84,7 +84,7 @@ def openalex_pubs_csv(tmp_path):
"title",
"type",
"doi",
"best_oa_location",
"open_access",
]
writer.writerow(header)
writer.writerow(
@@ -97,7 +97,7 @@
"A Publication",
"article",
"https://doi.org/10.0000/cccc",
-    '{is_oa: true, landing_page_url: "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1398957", pdf_url: null, source: { id: "https://openalex.org/S2764455111", display_name: "PubMed Central", issn_l: null, issn: null, host_organization: "https://openalex.org/I1299303238", type: "repository" }, license: null, version: "publishedVersion"}',
+    "green",
]
)
writer.writerow(
Expand All @@ -110,7 +110,7 @@ def openalex_pubs_csv(tmp_path):
"A Research Article",
"article",
"https://doi.org/10.0000/1234",
"",
"bronze",
]
)
return fixture_file
@@ -165,7 +165,7 @@ def test_openalex_pubs_df(openalex_pubs_csv):
df = lazy_df.collect()
assert df.shape[0] == 2
assert "bogus" not in df.columns, "Unneeded columns have been dropped"
assert "openalex_best_oa_location" in df.columns
assert "openalex_open_access" in df.columns
assert df["openalex_doi"].to_list() == ["10.0000/cccc", "10.0000/1234"]


@@ -193,7 +193,7 @@ def test_merge(tmp_path, sul_pubs_csv, openalex_pubs_csv, dimensions_pubs_csv):
assert output.is_file(), "output file has been created"
df = pl.read_parquet(output)
assert df.shape[0] == 5
-    assert df.shape[1] == 25
+    assert df.shape[1] == 21
assert set(df["doi"].to_list()) == set(
["10.0000/aaaa", "10.0000/1234", "10.0000/cccc", "10.0000/dddd", "10.0000/eeee"]
)
Loading