
API endpoint /queries:run only returns the first 100 matching documents #434

Open · Tracked by #879
mbthornton-lbl opened this issue Jan 10, 2024 · 12 comments · May be fixed by #754
Labels: bug (Something isn't working)

mbthornton-lbl (Contributor) commented Jan 10, 2024

queries:run endpoint truncates data in response
The queries:run endpoint returns only the first 100 results of a query.

Query:

{
  "find": "data_generation_set",
  "filter": {"associated_studies": "nmdc:sty-11-aygzgv51"}
}

Response:
response_1704828951248.json

Note that at the end of the response JSON it reports partialResultsReturned: null, so the response gives no indication that results were truncated.

aclum added the bug label · Oct 17, 2024
aclum (Contributor) commented Oct 17, 2024

This needs prioritization. @shreddd

aclum (Contributor) commented Oct 29, 2024

@PeopleMakeCulture @dwinston please comment on the effort needed to add paging.

PeopleMakeCulture (Collaborator) commented Oct 30, 2024

TODO:
Determine the mean document size per collection. If the documents are small enough, we can increase the default limit. (A sketch for checking this follows below.)
Prior to the Berkeley schema, proteomics had some documents that were up to several MB each, which is why API requests were limited to 20 documents by default. If that is no longer the case, we can easily raise the default document limit.
@aclum do you have any insight into this?
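A quick way to check this directly in mongosh would be a sketch like the following (assumes MongoDB 4.4+ for $bsonSize; the collection name is just an example):

db.getCollection('workflow_execution_set').aggregate([
    // Compute the BSON size of each document.
    { $project: { size: { $bsonSize: '$$ROOT' } } },
    // Reduce to mean and max size across the collection.
    {
        $group: {
            _id: null,
            avgBytes: { $avg: '$size' },
            maxBytes: { $max: '$size' },
            count: { $sum: 1 }
        }
    }
]);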

dwinston (Collaborator) commented

TODO: update the collection name from "omics_processing_set" in @mbthornton-lbl's query above, to ensure the issue can be reproduced while solving it.

aclum (Contributor) commented Oct 30, 2024

I updated the description in the example to be valid for Berkeley.

Documents in workflow_execution_set are still quite large.

PeopleMakeCulture (Collaborator) commented

We are currently exploring two approaches to get this done with limited time (a sketch of the first follows below):

  1. Leverage and document Mongo's skip & limit query syntax
  2. Re-use the cursor implementation from the other find endpoints
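For approach 1, the request bodies might look like this hypothetical example (it assumes queries:run passes MongoDB's find-command options through unchanged):

{
  "find": "data_generation_set",
  "filter": {"associated_studies": "nmdc:sty-11-aygzgv51"},
  "limit": 100,
  "skip": 100
}

Repeating the call with "skip" advanced by "limit" each time (0, 100, 200, ...) until fewer than 100 documents come back would page through the full result set.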

PeopleMakeCulture (Collaborator) commented

@mbthornton-lbl @aclum The above query can be executed using the /nmdcschema/{collection_name} endpoint, as shown below. The limit can likely be increased past 100. The response contains a token at the bottom, which can be passed in the next API call to get the next set of documents. Is this sufficient for your use case?

[screenshot of an example /nmdcschema/{collection_name} request omitted]
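For reference, following the token programmatically would look roughly like this sketch (Node 18+; the max_page_size/page_token/next_page_token names and the response shape are assumptions to verify against the API docs):

// Page through /nmdcschema/{collection_name} by following next_page_token.
const base = 'https://api.microbiomedata.org/nmdcschema/data_generation_set';
const filter = JSON.stringify({ associated_studies: 'nmdc:sty-11-aygzgv51' });

async function fetchAll() {
    const docs = [];
    let pageToken = null;
    do {
        const url = new URL(base);
        url.searchParams.set('filter', filter);
        url.searchParams.set('max_page_size', '1000');   // assumed parameter name
        if (pageToken) url.searchParams.set('page_token', pageToken);
        const body = await (await fetch(url)).json();
        docs.push(...body.resources);                    // assumed response field
        pageToken = body.next_page_token;                // absent on the last page
    } while (pageToken);
    return docs;
}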

PeopleMakeCulture moved this from At bat to Injured List in Polyneme mixset · Nov 7, 2024
aclum (Contributor) commented Nov 7, 2024

We are not currently blocked, so we don't need a workaround; the request here is for the queries:run endpoint to be able to return a complete set of records. queries:run is currently the only endpoint with the flexibility to traverse collections with a query on something besides study id, and the only one where you have any control over the information returned.

aclum (Contributor) commented Nov 7, 2024

Here is an example where we only want information for biosamples if a specific version of the annotation workflow was run. There is no other way to do this right now via the API, and the number of expected records, based on a Compass query, is 536.

db.getCollection(
    'workflow_execution_set'
).aggregate(
    [
        // Select annotation records at the specific workflow version.
        {
            $match: {
                type: 'nmdc:MetagenomeAnnotation',
                version: 'v1.0.2-beta'
            }
        },
        // Join in the biosample_set documents whose id appears in has_input.
        {
            $lookup: {
                from: 'biosample_set',
                localField: 'has_input',
                foreignField: 'id',
                as: 'biosample_set'
            }
        }
    ], {
        maxTimeMS: 60000,
        allowDiskUse: true
    }
);

I'm fine if, instead of fixing queries:run, we make a new read-only endpoint with flexibility similar to nmdcschema/{collection_name} that supports aggregations, but it needs to be able to return all records.

aclum (Contributor) commented Nov 7, 2024

A real-life example use case where I've had to simplify the query and instruct Marcel to call the API tens to hundreds of times instead of once: give me all the annotation records, along with information about the input and output files, with version X for study Y that were generated at JGI, so the analysis can be pulled back into JGI. This particular example returns 86 documents, but I can't have the JGI developer implement this programmatically going forward because that won't always be true for a given study.

db.getCollection(
	'workflow_execution_set'
).aggregate(
	[{
			$match: {
				type: 'nmdc:MetagenomeAnnotation',
				version: 'v1.1.0'
			}
		},
		{
			$lookup: {
				from: 'data_generation_set',
				localField: 'was_informed_by',
				foreignField: 'id',
				as: 'data_generation_set'
			}
		},
		{
			$match: {
				'data_generation_set.associated_studies': 'nmdc:sty-11-34xj1150',
				'data_generation_set.processing_institution': 'JGI'
			}
		},
		{
			$lookup: {
				from: 'data_object_set',
				localField: 'has_input',
				foreignField: 'id',
				as: 'do_input'
			}
		},
		{
			$lookup: {
				from: 'data_object_set',
				localField: 'has_output',
				foreignField: 'id',
				as: 'do_output'
			}
		}
	], {
		maxTimeMS: 60000,
		allowDiskUse: true
	}
);

Instead, I've had to instruct him to use queries:run with the following query for each MetagenomeAnnotation id (86 queries instead of 1 in this case):

{
 "aggregate": "workflow_execution_set",
 "pipeline": [
   {
     "$match": {
       "id": "nmdc:wfmgan-11-mk6c5h53.2"
     }
   },
   {
     "$lookup": {
       "from": "data_object_set",
       "localField": "has_input",
       "foreignField": "id",
       "as": "data_object_input"
     }
   },
   {
     "$lookup": {
       "from": "data_object_set",
       "localField": "has_output",
       "foreignField": "id",
       "as": "data_object_output"
     }
   }
 ]
}
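Scripted, that workaround amounts to a loop like this sketch (Node 18+ in an async context; the endpoint URL, auth requirements, and response shape are assumptions to verify):

// One queries:run POST per MetagenomeAnnotation id (86 requests in this case),
// instead of a single $lookup aggregation.
const ids = ['nmdc:wfmgan-11-mk6c5h53.2' /* ...85 more ids */];
const results = [];
for (const id of ids) {
    const resp = await fetch('https://api.microbiomedata.org/queries:run', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' }, // plus auth, if required
        body: JSON.stringify({
            aggregate: 'workflow_execution_set',
            pipeline: [
                { $match: { id: id } },
                { $lookup: { from: 'data_object_set', localField: 'has_input',
                             foreignField: 'id', as: 'data_object_input' } },
                { $lookup: { from: 'data_object_set', localField: 'has_output',
                             foreignField: 'id', as: 'data_object_output' } }
            ]
        })
    });
    const body = await resp.json();
    results.push(...body.cursor.firstBatch); // assumed response shape
}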

aclum (Contributor) commented Jan 7, 2025

This caused truncated results when running ETL code; see microbiomedata/issues#813. It would be great to get this addressed, as it is really not intuitive to internal users that results may be truncated. cc @shreddd

The workaround is to use a different endpoint, so this is not a blocker, but it continues to cause problems and delays as staff newly encounter this issue.

dwinston moved this from Front of house to On stage in Polyneme mixset · Jan 7, 2025
shreddd (Collaborator) commented Jan 14, 2025

Consider implementing this as a new endpoint if the returned payload will change. See also, in case it is relevant:
https://codebeyondlimits.com/articles/pagination-in-mongodb-the-only-right-way-to-implement-it-and-avoid-common-mistakes
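Applied here, the article's approach would replace skip/limit with a range filter on the last seen _id; a minimal mongosh sketch (collection name and page size are illustrative):

// Range-based ("cursor") pagination: filter past the last seen _id instead of
// skipping, so each page is an indexed range scan rather than a rescan.
let lastId = null;
let page;
do {
    const filter = lastId ? { _id: { $gt: lastId } } : {};
    page = db.getCollection('data_generation_set')
        .find(filter)
        .sort({ _id: 1 })
        .limit(100)
        .toArray();
    if (page.length > 0) {
        lastId = page[page.length - 1]._id;
    }
    // ...process the page here...
} while (page.length === 100);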

dwinston linked a pull request that will close this issue · Jan 23, 2025
eecavanna changed the title from "queries:run endpoint truncating results" to "API endpoint /queries:run only returns the first 100 matching documents" · Jan 29, 2025