Added openalex and dimensions publication links #98

Merged Sep 3, 2024 (1 commit)

Contributor

@edsu edsu commented Aug 29, 2024

It might be useful to include links to the article for (possibly) getting full-text. These could be helpful when using the dataset as a source for testing machine learning techniques that use the full text as an input.

* OpenAlex best_oa_location: https://docs.openalex.org/api-entities/works/work-object#best_oa_location
* Dimensions linkout: https://docs.dimensions.ai/dsl/datasource-publications.html
@@ -127,6 +127,7 @@ def normalize_publication(pub) -> dict:
"apc_paid",
"best_oa_location",
"biblio",
"citation_normalized_percentile",
Contributor Author

This was a new column that is now coming back from OpenAlex and needed to be accounted for in the test.

@@ -93,6 +97,7 @@ def openalex_pubs_csv(tmp_path):
"A Publication",
"article",
"https://doi.org/10.0000/cccc",
'{is_oa: true, landing_page_url: "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1398957", pdf_url: null, source: { id: "https://openalex.org/S2764455111", display_name: "PubMed Central", issn_l: null, issn: null, host_organization: "https://openalex.org/I1299303238", type: "repository" }, license: null, version: "publishedVersion"}',
Contributor Author

@edsu edsu Aug 29, 2024

The value for best_oa_location is a JSON object that will need to be unpacked further when using it in analysis.

Contributor

Will this end up as a string in our Parquet file? Maybe in a future step we could extract the fields that look useful into columns.

Contributor Author

@edsu edsu Aug 31, 2024

Yes, unfortunately it will be the Python string representation of a dictionary, which we will need to eval before we can use it.

I thought there might be information in the object that was worth preserving, like multiple types of URLs, license, etc. and I wasn't sure what to pick. We also have some other columns that store objects in a similar way at the moment: dim_funders, dim_open_access, dim_research_orgs, dim_researchers.

If we switched to harvesting as JSONL, then Parquet could store the object natively, which I think would be easier to work with.
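
The difference between the two harvesting formats can be seen with the standard library alone. The sketch below is illustrative and not taken from this repository: the record and its values are made up, and it only shows that a CSV round trip turns a nested object into its Python string form, while a JSONL round trip gives back a dictionary that a Parquet writer could then presumably store as a nested type.

```python
import csv
import io
import json

# Hypothetical record; the nested object mimics the shape of best_oa_location.
record = {
    "doi": "10.0000/cccc",
    "best_oa_location": {"is_oa": True, "pdf_url": None, "license": None},
}

# CSV round trip: the writer calls str() on each value, so the dict becomes text.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_buffer.seek(0)
csv_row = next(csv.DictReader(csv_buffer))

# JSONL round trip: one JSON document per line, nested structure preserved.
jsonl_line = json.dumps(record)
jsonl_row = json.loads(jsonl_line)

print(type(csv_row["best_oa_location"]).__name__)    # str
print(type(jsonl_row["best_oa_location"]).__name__)  # dict
```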

For now we would have to do something clunky like this (with pandas) to get the pdf_url:

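>>> import pandas
>>> df = pandas.read_parquet('data/latest/contributions.parquet')
>>> # keep only rows that have a best_oa_location value
>>> df = df[df.openalex_best_oa_location.notna()].copy()
>>> # parse the stringified dict (eval is only acceptable for trusted data)
>>> df['openalex_best_oa_location'] = df.openalex_best_oa_location.apply(eval)
>>> df['pdf_url'] = df.openalex_best_oa_location.apply(lambda l: l['pdf_url'])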
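
If the column really does hold Python literal syntax, `ast.literal_eval` from the standard library is a safer way to parse it than `eval`, since it only accepts literals and will not execute arbitrary expressions. A minimal sketch, where the sample string and the `parse_location` helper are hypothetical and not part of this PR:

```python
import ast
from typing import Any, Optional


def parse_location(value: Any) -> Optional[dict]:
    """Parse a stringified Python dict; return None for missing or malformed values."""
    if not isinstance(value, str):
        return None
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return None
    # Only dictionaries are useful here; reject lists, numbers, and so on.
    return parsed if isinstance(parsed, dict) else None


# Hypothetical cell value, written the way str(dict) would render it.
sample = "{'is_oa': True, 'landing_page_url': 'https://example.org/article', 'pdf_url': None}"
location = parse_location(sample)
print(location["landing_page_url"])
```

With pandas, the helper could presumably be passed to `apply` in place of `eval`.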

Perhaps a saner alternative would be to split out the information we want to preserve into separate columns in the Parquet file, for example:

  • openalex_best_oa_location_pdf_url
  • openalex_best_oa_location_url
  • openalex_best_oa_license

But maybe we should come up with a mechanism to do this consistently with other columns in the Parquet files, if this is the direction we want to go in?
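
One possible shape for such a mechanism is a single mapping from each object column to the fields worth keeping, applied uniformly to every row. The sketch below is an assumption about how that might look, not code from this repository: `FIELDS_TO_EXTRACT`, `flatten_row`, and the `<column>_<field>` naming rule are all hypothetical, and the rows are assumed to already hold parsed dictionaries.

```python
from typing import Any

# Hypothetical config: object column -> fields to promote to their own columns.
FIELDS_TO_EXTRACT = {
    "openalex_best_oa_location": ["pdf_url", "landing_page_url", "license"],
}


def flatten_row(row: dict[str, Any], spec: dict[str, list[str]]) -> dict[str, Any]:
    """Return a copy of row with '<column>_<field>' keys added for each configured field."""
    flat = dict(row)
    for column, fields in spec.items():
        obj = row.get(column)
        for field in fields:
            # Missing or non-dict objects yield None so every row gets the same columns.
            flat[f"{column}_{field}"] = obj.get(field) if isinstance(obj, dict) else None
    return flat


row = {
    "doi": "10.0000/cccc",
    "openalex_best_oa_location": {
        "landing_page_url": "https://example.org/article",
        "pdf_url": None,
        "license": None,
    },
}
flat = flatten_row(row, FIELDS_TO_EXTRACT)
print(flat["openalex_best_oa_location_landing_page_url"])
```

Columns that hold lists of objects rather than a single object would need a different rule than this one.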

@@ -186,7 +193,7 @@ def test_merge(tmp_path, sul_pubs_csv, openalex_pubs_csv, dimensions_pubs_csv):
assert output.is_file(), "output file has been created"
df = pl.read_parquet(output)
assert df.shape[0] == 5
- assert df.shape[1] == 23
+ assert df.shape[1] == 25
Contributor Author

OpenAlex output includes two new columns best_oa_location and citation_normalized_percentile.

Contributor

lwrubel commented Aug 30, 2024

Approved, pending seeing how memory usage goes when you run the DAG.

Contributor Author

edsu commented Sep 3, 2024

> Approved, pending seeing how memory usage goes when you run the DAG.

It looks like it ran OK, but I would like to do some follow-on work to unpack the JSON objects into separate columns.

@edsu edsu merged commit 6d294e8 into main Sep 3, 2024
1 check passed
@edsu edsu deleted the download-url branch September 3, 2024 13:16