Added openalex and dimensions publication links #98
It might be useful to include links to the article for (possibly) getting full text.
* OpenAlex best_oa_location: https://docs.openalex.org/api-entities/works/work-object#best_oa_location
* Dimensions linkout: https://docs.dimensions.ai/dsl/datasource-publications.html
@@ -127,6 +127,7 @@ def normalize_publication(pub) -> dict:
 "apc_paid",
 "best_oa_location",
 "biblio",
+"citation_normalized_percentile",
This was a new column that is now coming back from OpenAlex and needed to be accounted for in the test.
@@ -93,6 +97,7 @@ def openalex_pubs_csv(tmp_path):
 "A Publication",
 "article",
 "https://doi.org/10.0000/cccc",
+'{is_oa: true, landing_page_url: "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1398957", pdf_url: null, source: { id: "https://openalex.org/S2764455111", display_name: "PubMed Central", issn_l: null, issn: null, host_organization: "https://openalex.org/I1299303238", type: "repository" }, license: null, version: "publishedVersion"}',
The value for best_oa_location is a JSON object that will need to be unpacked further when using it in analysis.
Will this end up as a string in our Parquet file? Maybe in a future step we could extract the fields that look useful into columns.
Yes, unfortunately it will be the Python string representation of a dictionary, which we will need to eval in order to use.
I thought there might be information in the object that was worth preserving, like multiple types of URLs, license, etc., and I wasn't sure what to pick. We also have some other columns that store objects in a similar way at the moment: dim_funders, dim_open_access, dim_research_orgs, dim_researchers.
If we switched to harvesting as JSONL then Parquet could store the object natively which would be easier to work with I think?
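To illustrate the difference, here is a minimal standard-library sketch (not the project's harvesting code) comparing what happens to a nested value when it goes through CSV versus JSONL. The record contents are made up for illustration.

```python
import csv
import io
import json

# Hypothetical record; the nested dict stands in for best_oa_location.
record = {
    "doi": "10.0000/cccc",
    "best_oa_location": {"is_oa": True, "pdf_url": None, "license": None},
}

# CSV: the nested dict is written as its Python string representation.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_buf.seek(0)
csv_row = next(csv.DictReader(csv_buf))

# JSONL: one JSON document per line, so the nesting survives the round trip.
jsonl_line = json.dumps(record)
jsonl_row = json.loads(jsonl_line)

print(type(csv_row["best_oa_location"]).__name__)    # str
print(type(jsonl_row["best_oa_location"]).__name__)  # dict
```

A reader that understands nested JSON (for example polars' `read_ndjson`) could then carry the object into Parquet as a struct column rather than a string; whether that fits the harvest pipeline is still an open question here.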
For now we would have to do something clunky like this (with pandas) to get the pdf_url:
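>>> import pandas
>>> df = pandas.read_parquet('data/latest/contributions.parquet')
>>> # keep rows that have a value; copy so the assignments below modify a real frame
>>> df = df[df.openalex_best_oa_location.notna()].copy()
>>> df['openalex_best_oa_location'] = df.openalex_best_oa_location.apply(eval)
>>> # attribute-style assignment (df.pdf_url = ...) would not create a column
>>> df['pdf_url'] = df.openalex_best_oa_location.apply(lambda l: l['pdf_url'])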
Perhaps a saner alternative would be to split out the information that we would like to preserve in the Parquet file, for example by extracting:
- openalex_best_oa_location_pdf_url
- openalex_best_oa_location_url
- openalex_best_oa_license
But maybe we should come up with a mechanism to do this consistently with other columns in the Parquet files, if this is the direction we want to go in?
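One possible shape for such a mechanism, sketched with the standard library only: a declarative mapping from object column and key to output column, applied the same way to every row. The function name `extract_fields`, the `FIELD_MAP` table, and the sample row are hypothetical; the output column names follow the list above, and the key names (`pdf_url`, `landing_page_url`, `license`) come from the OpenAlex object in the test fixture. It assumes the stored strings are plain Python literals, so `ast.literal_eval` can stand in for `eval`.

```python
import ast

# Hypothetical mapping: source column -> {key in the object: new column name}.
FIELD_MAP = {
    "openalex_best_oa_location": {
        "pdf_url": "openalex_best_oa_location_pdf_url",
        "landing_page_url": "openalex_best_oa_location_url",
        "license": "openalex_best_oa_license",
    },
}


def extract_fields(row: dict, field_map: dict = FIELD_MAP) -> dict:
    """Return a copy of row with selected object fields split into columns."""
    out = dict(row)
    for source_col, keys in field_map.items():
        raw = row.get(source_col)
        # Assumption: the value is a str(dict) representation, or missing.
        obj = ast.literal_eval(raw) if isinstance(raw, str) and raw else {}
        for key, new_col in keys.items():
            out[new_col] = obj.get(key) if isinstance(obj, dict) else None
    return out


row = {
    "doi": "10.0000/cccc",
    "openalex_best_oa_location": str(
        {
            "landing_page_url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1398957",
            "pdf_url": None,
            "license": None,
        }
    ),
}
flat = extract_fields(row)
print(flat["openalex_best_oa_location_url"])
```

Covering the other object columns (dim_funders, dim_open_access, and so on) would then mean adding entries to the mapping rather than writing new code, although columns that hold lists of objects would likely need their own handling.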
@@ -186,7 +193,7 @@ def test_merge(tmp_path, sul_pubs_csv, openalex_pubs_csv, dimensions_pubs_csv):
 assert output.is_file(), "output file has been created"
 df = pl.read_parquet(output)
 assert df.shape[0] == 5
-assert df.shape[1] == 23
+assert df.shape[1] == 25
OpenAlex output includes two new columns: best_oa_location and citation_normalized_percentile.
Approved, pending seeing how memory usage goes when you run the DAG.
It looks like it ran OK, but I would like to do some follow-on work to unpack the JSON objects into separate columns.
It might be useful to include links to the article for (possibly) getting full text. These could be helpful when using the dataset as a source for testing machine learning techniques that use the full text as an input.
* OpenAlex best_oa_location: https://docs.openalex.org/api-entities/works/work-object#best_oa_location
* Dimensions linkout: https://docs.dimensions.ai/dsl/datasource-publications.html