Added openalex and dimensions publication links #98

edsu · 2024-08-29T21:33:04Z

It might be useful to include links to the article for (possibly) getting the full text. These could be helpful when using the dataset as a source for testing machine learning techniques that take the full text as input.

OpenAlex: best_oa_location https://docs.openalex.org/api-entities/works/work-object#best_oa_location
Dimensions: linkout https://docs.dimensions.ai/dsl/datasource-publications.html

edsu · 2024-08-29T21:37:56Z

rialto_airflow/harvest/openalex.py

@@ -127,6 +127,7 @@ def normalize_publication(pub) -> dict:
    "apc_paid",
    "best_oa_location",
    "biblio",
+    "citation_normalized_percentile",


This is a new column that OpenAlex now returns, and it needed to be accounted for in the test.

edsu · 2024-08-29T21:38:31Z

test/harvest/test_merge_pubs.py

@@ -93,6 +97,7 @@ def openalex_pubs_csv(tmp_path):
                "A Publication",
                "article",
                "https://doi.org/10.0000/cccc",
+                '{is_oa: true, landing_page_url: "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1398957", pdf_url: null, source: { id: "https://openalex.org/S2764455111", display_name: "PubMed Central", issn_l: null, issn: null, host_organization: "https://openalex.org/I1299303238", type: "repository" }, license: null, version: "publishedVersion"}',


The value for best_oa_location is a JSON object that will need to be unpacked further when using it in analysis.

Will this end up as a string in our Parquet file? Maybe in a future step we could extract the fields that look useful into columns.

Yes, unfortunately it will be the Python string representation of a dictionary, which we will need to eval in order to use.
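A minimal sketch of parsing such a value without `eval`, assuming the stored string is a valid Python literal (what `str()` produces for a dict). The sample value and the `parse_location` helper are hypothetical, modeled on the fields in the test fixture above. `ast.literal_eval` only accepts literals, so it will not run arbitrary code that happens to be embedded in the data.

```python
import ast

# Hypothetical cell value: str() of a dict, as it might appear in the Parquet file.
raw = str({
    "is_oa": True,
    "landing_page_url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1398957",
    "pdf_url": None,
    "license": None,
    "version": "publishedVersion",
})


def parse_location(value):
    """Parse a stringified dict; return None for missing or empty values."""
    if value is None or value == "":
        return None
    return ast.literal_eval(value)


location = parse_location(raw)
print(location["landing_page_url"])
```

With pandas, the same helper could be passed to `.apply()` in place of `eval`.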

I thought there might be information in the object that was worth preserving, like multiple types of URLs, license, etc. and I wasn't sure what to pick. We also have some other columns that store objects in a similar way at the moment: dim_funders, dim_open_access, dim_research_orgs, dim_researchers.

If we switched to harvesting as JSONL then Parquet could store the object natively which would be easier to work with I think?
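To illustrate why JSONL might help, here is a small standard-library sketch comparing a CSV round trip with a JSONL round trip for one hypothetical record. It only shows what happens to the nested object in the intermediate file; it does not write Parquet, and how a given Parquet writer would store the nested value is not shown here.

```python
import csv
import io
import json

# Hypothetical harvested record with a nested object.
record = {
    "doi": "10.0000/cccc",
    "best_oa_location": {"pdf_url": None, "license": None, "version": "publishedVersion"},
}

# CSV round trip: the nested dict is written via str() and comes back as a string.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_buf.seek(0)
from_csv = next(csv.DictReader(csv_buf))

# JSONL round trip: one JSON document per line keeps the nesting intact.
jsonl_buf = io.StringIO()
jsonl_buf.write(json.dumps(record) + "\n")
jsonl_buf.seek(0)
from_jsonl = json.loads(jsonl_buf.readline())

print(type(from_csv["best_oa_location"]).__name__)
print(type(from_jsonl["best_oa_location"]).__name__)
```

After the CSV round trip the value is a plain string that has to be parsed again, while the JSONL value is still a dictionary whose keys can be read directly.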

For now we would have to do something clunky like this (with pandas) to get the pdf_url:

>>> import pandas
>>> df = pandas.read_parquet('data/latest/contributions.parquet')
>>> df = df[df.openalex_best_oa_location.notna()]
>>> df.openalex_best_oa_location = df.openalex_best_oa_location.apply(eval)
>>> df['pdf_url'] = df.openalex_best_oa_location.apply(lambda l: l['pdf_url'])

Perhaps a saner alternative would be to split out the information that we would like to preserve into separate columns in the Parquet file, for example:

openalex_best_oa_location_pdf_url

openalex_best_oa_location_url

openalex_best_oa_license

But maybe we should come up with a mechanism to do this consistently with other columns in the Parquet files, if this is the direction we want to go in?
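One possible shape for such a mechanism, sketched with plain dictionaries so it is independent of pandas or polars: a single spec maps each object column to the flat columns that should be extracted from it. The `UNPACK_SPEC` name, the `unpack_row` helper, and the mapping of `openalex_best_oa_location_url` to `landing_page_url` are assumptions for illustration, not something the pipeline currently does. It also assumes each object column holds a stringified dict; columns such as dim_funders may hold lists and would need different handling.

```python
import ast

# Hypothetical spec: object column -> {new flat column name: key inside the object}.
UNPACK_SPEC = {
    "openalex_best_oa_location": {
        "openalex_best_oa_location_pdf_url": "pdf_url",
        "openalex_best_oa_location_url": "landing_page_url",  # assumed mapping
        "openalex_best_oa_license": "license",
    },
}


def unpack_row(row, spec=UNPACK_SPEC):
    """Return a copy of row with the fields named in spec added as flat columns."""
    out = dict(row)
    for column, fields in spec.items():
        raw = row.get(column)
        # Parse only non-empty strings; anything else is treated as "no object".
        obj = ast.literal_eval(raw) if isinstance(raw, str) and raw else {}
        for new_name, key in fields.items():
            out[new_name] = obj.get(key) if isinstance(obj, dict) else None
    return out


row = {
    "doi": "10.0000/cccc",
    "openalex_best_oa_location": str({
        "landing_page_url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1398957",
        "pdf_url": None,
        "license": None,
    }),
}
flat = unpack_row(row)
print(flat["openalex_best_oa_location_url"])
```

Adding another object column would then mean adding an entry to the spec rather than writing new extraction code.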

edsu · 2024-08-29T21:40:23Z

test/harvest/test_merge_pubs.py

@@ -186,7 +193,7 @@ def test_merge(tmp_path, sul_pubs_csv, openalex_pubs_csv, dimensions_pubs_csv):
    assert output.is_file(), "output file has been created"
    df = pl.read_parquet(output)
    assert df.shape[0] == 5
-    assert df.shape[1] == 23
+    assert df.shape[1] == 25


The OpenAlex output now includes two new columns: best_oa_location and citation_normalized_percentile.

lwrubel · 2024-08-30T16:30:58Z

Approved, pending seeing how memory usage goes when you run the DAG.

edsu · 2024-09-03T13:16:36Z

> Approved, pending seeing how memory usage goes when you run the DAG.

It looks like it ran OK, but I would like to do some follow-on work to unpack the JSON objects into separate columns.

edsu force-pushed the download-url branch from b011cf3 to 1083a3c on August 29, 2024 21:35

lwrubel approved these changes Aug 30, 2024

edsu merged commit 6d294e8 into main Sep 3, 2024
1 check passed

edsu deleted the download-url branch September 3, 2024 13:16
