
API endpoint /queries:run only returns the first 100 matching documents #434

Open · Tracked by #879
mbthornton-lbl opened this issue Jan 10, 2024 · 12 comments · May be fixed by #754
Labels: bug (Something isn't working)

mbthornton-lbl (Contributor) commented Jan 10, 2024

queries:run endpoint truncates data in response
The queries:run endpoint returns only the first 100 results of a query.

Query:

{
  "find": "data_generation_set",
  "filter": {"associated_studies": "nmdc:sty-11-aygzgv51"}
}

Response:
response_1704828951248.json

Note that at the end of the response JSON it reports partialResultsReturned: null, so the response gives no indication that results were truncated.

aclum added the bug label · Oct 17, 2024
aclum (Contributor) commented Oct 17, 2024

This needs prioritization. @shreddd

aclum (Contributor) commented Oct 29, 2024

@PeopleMakeCulture @dwinston please comment on the effort needed to add paging.

PeopleMakeCulture (Collaborator) commented Oct 30, 2024

TODO:
Determine the mean document size per collection. If the documents are small enough, we can increase the default limit. (A sketch for checking this follows below.)
Prior to the Berkeley schema, proteomics had some documents that were up to several MB each, which is why API requests were limited to 20 documents by default. If that is no longer the case, we can easily raise the default document limit.
@aclum do you have any insight into this?
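A quick way to check this directly in mongosh would be a sketch like the following (assumes MongoDB 4.4+ for $bsonSize; the collection name is just an example):

db.getCollection('workflow_execution_set').aggregate([
    // Compute the BSON size of each document.
    { $project: { size: { $bsonSize: '$$ROOT' } } },
    // Reduce to mean and max size across the collection.
    {
        $group: {
            _id: null,
            avgBytes: { $avg: '$size' },
            maxBytes: { $max: '$size' },
            count: { $sum: 1 }
        }
    }
]);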

dwinston (Collaborator) commented

TODO: update the collection name from "omics_processing_set" in @mbthornton-lbl's query above, to ensure the issue can be reproduced while solving it.

aclum (Contributor) commented Oct 30, 2024

I updated the description in the example to be valid for Berkeley.

Documents in workflow_execution_set are still quite large.

PeopleMakeCulture (Collaborator) commented

We are currently exploring two approaches to get this done with limited time (a sketch of the first follows below):

  1. Leverage and document Mongo's skip & limit query syntax
  2. Re-use the cursor implementation from the other find endpoints
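For approach 1, the request bodies might look like this hypothetical example (it assumes queries:run passes MongoDB's find-command options through unchanged):

{
  "find": "data_generation_set",
  "filter": {"associated_studies": "nmdc:sty-11-aygzgv51"},
  "limit": 100,
  "skip": 100
}

Repeating the call with "skip" advanced by "limit" each time (0, 100, 200, ...) until fewer than 100 documents come back would page through the full result set.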

PeopleMakeCulture (Collaborator) commented

@mbthornton-lbl @aclum The above query can be executed using the /nmdcschema/{collection_name} endpoint, as shown below. The limit can likely be increased past 100. The response contains a token at the bottom, which can be passed in the next API call to get the next set of documents. Is this sufficient for your use case?

[screenshot of an example /nmdcschema/{collection_name} request omitted]
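For reference, following the token programmatically would look roughly like this sketch (Node 18+; the max_page_size/page_token/next_page_token names and the response shape are assumptions to verify against the API docs):

// Page through /nmdcschema/{collection_name} by following next_page_token.
const base = 'https://api.microbiomedata.org/nmdcschema/data_generation_set';
const filter = JSON.stringify({ associated_studies: 'nmdc:sty-11-aygzgv51' });

async function fetchAll() {
    const docs = [];
    let pageToken = null;
    do {
        const url = new URL(base);
        url.searchParams.set('filter', filter);
        url.searchParams.set('max_page_size', '1000');   // assumed parameter name
        if (pageToken) url.searchParams.set('page_token', pageToken);
        const body = await (await fetch(url)).json();
        docs.push(...body.resources);                    // assumed response field
        pageToken = body.next_page_token;                // absent on the last page
    } while (pageToken);
    return docs;
}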

PeopleMakeCulture moved this from At bat to Injured List in Polyneme mixset · Nov 7, 2024
aclum (Contributor) commented Nov 7, 2024

We are not currently blocked, so we don't need a workaround; the request here is for the queries:run endpoint to be able to return a complete set of records. queries:run is currently the only endpoint with the flexibility to traverse collections with a query on something besides study id, and the only one where you have any control over the information returned.

aclum (Contributor) commented Nov 7, 2024

Here is an example where we only want information for biosamples if a specific version of the annotation workflow was run. There is no other way to do this right now via the API, and the number of expected records, based on a Compass query, is 536.

db.getCollection(
    'workflow_execution_set'
).aggregate(
    [
        // Select annotation records at the specific workflow version.
        {
            $match: {
                type: 'nmdc:MetagenomeAnnotation',
                version: 'v1.0.2-beta'
            }
        },
        // Join in the biosample_set documents whose id appears in has_input.
        {
            $lookup: {
                from: 'biosample_set',
                localField: 'has_input',
                foreignField: 'id',
                as: 'biosample_set'
            }
        }
    ], {
        maxTimeMS: 60000,
        allowDiskUse: true
    }
);

I'm fine if, instead of fixing queries:run, we make a new read-only endpoint with flexibility similar to nmdcschema/{collection_name} that supports aggregations, but it needs to be able to return all records.

aclum (Contributor) commented Nov 7, 2024

A real-life example use case where I've had to simplify the query and instruct Marcel to call the API tens to hundreds of times instead of once: give me all the annotation records, along with information about the input and output files, with version X for study Y that were generated at JGI, so the analysis can be pulled back into JGI. This particular example returns 86 documents, but I can't have the JGI developer implement this programmatically going forward because that won't always be true for a given study.

db.getCollection(
	'workflow_execution_set'
).aggregate(
	[{
			$match: {
				type: 'nmdc:MetagenomeAnnotation',
				version: 'v1.1.0'
			}
		},
		{
			$lookup: {
				from: 'data_generation_set',
				localField: 'was_informed_by',
				foreignField: 'id',
				as: 'data_generation_set'
			}
		},
		{
			$match: {
				'data_generation_set.associated_studies': 'nmdc:sty-11-34xj1150',
				'data_generation_set.processing_institution': 'JGI'
			}
		},
		{
			$lookup: {
				from: 'data_object_set',
				localField: 'has_input',
				foreignField: 'id',
				as: 'do_input'
			}
		},
		{
			$lookup: {
				from: 'data_object_set',
				localField: 'has_output',
				foreignField: 'id',
				as: 'do_output'
			}
		}
	], {
		maxTimeMS: 60000,
		allowDiskUse: true
	}
);

Instead, I've had to instruct him to use queries:run with the following query for each MetagenomeAnnotation id (86 queries instead of 1 in this case):

{
 "aggregate": "workflow_execution_set",
 "pipeline": [
   {
     "$match": {
       "id": "nmdc:wfmgan-11-mk6c5h53.2"
     }
   },
   {
     "$lookup": {
       "from": "data_object_set",
       "localField": "has_input",
       "foreignField": "id",
       "as": "data_object_input"
     }
   },
   {
     "$lookup": {
       "from": "data_object_set",
       "localField": "has_output",
       "foreignField": "id",
       "as": "data_object_output"
     }
   }
 ]
}
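Scripted, that workaround amounts to a loop like this sketch (Node 18+ in an async context; the endpoint URL, auth requirements, and response shape are assumptions to verify):

// One queries:run POST per MetagenomeAnnotation id (86 requests in this case),
// instead of a single $lookup aggregation.
const ids = ['nmdc:wfmgan-11-mk6c5h53.2' /* ...85 more ids */];
const results = [];
for (const id of ids) {
    const resp = await fetch('https://api.microbiomedata.org/queries:run', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' }, // plus auth, if required
        body: JSON.stringify({
            aggregate: 'workflow_execution_set',
            pipeline: [
                { $match: { id: id } },
                { $lookup: { from: 'data_object_set', localField: 'has_input',
                             foreignField: 'id', as: 'data_object_input' } },
                { $lookup: { from: 'data_object_set', localField: 'has_output',
                             foreignField: 'id', as: 'data_object_output' } }
            ]
        })
    });
    const body = await resp.json();
    results.push(...body.cursor.firstBatch); // assumed response shape
}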

aclum (Contributor) commented Jan 7, 2025

This caused truncated results when running ETL code; see microbiomedata/issues#813. It would be great to get this addressed, as it is really not intuitive to internal users that results may be truncated. cc @shreddd

The workaround is to use a different endpoint, so this is not a blocker, but it continues to cause problems and delays as staff newly encounter this issue.

dwinston moved this from Front of house to On stage in Polyneme mixset · Jan 7, 2025
shreddd (Collaborator) commented Jan 14, 2025

Consider implementing this as a new endpoint if the returned payload will change. See also, in case it is relevant:
https://codebeyondlimits.com/articles/pagination-in-mongodb-the-only-right-way-to-implement-it-and-avoid-common-mistakes
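Applied here, the article's approach would replace skip/limit with a range filter on the last seen _id; a minimal mongosh sketch (collection name and page size are illustrative):

// Range-based ("cursor") pagination: filter past the last seen _id instead of
// skipping, so each page is an indexed range scan rather than a rescan.
let lastId = null;
let page;
do {
    const filter = lastId ? { _id: { $gt: lastId } } : {};
    page = db.getCollection('data_generation_set')
        .find(filter)
        .sort({ _id: 1 })
        .limit(100)
        .toArray();
    if (page.length > 0) {
        lastId = page[page.length - 1]._id;
    }
    // ...process the page here...
} while (page.length === 100);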

dwinston linked a pull request that will close this issue · Jan 23, 2025
eecavanna changed the title from "queries:run endpoint truncating results" to "API endpoint /queries:run only returns the first 100 matching documents" · Jan 29, 2025