API endpoint /queries:run only returns the first 100 matching documents #434
Comments
This needs prioritization, @shreddd.
@PeopleMakeCulture @dwinston please comment on the effort needed to add paging.
TODO:
TODO: Update the collection name from "omics_processing_set" in @mbthornton-lbl's relevant query, to ensure reproduction for solving this issue.
I updated the description in the example to be valid for Berkeley. Documents in workflow_execution_set are still quite large.
We are currently exploring two approaches to get this done with limited time:
@mbthornton-lbl @aclum The above query can be executed using the
We are not currently blocked, so we don't need a workaround. The request here is for the queries:run endpoint to be able to return a complete set of records. Right now, queries:run is the only endpoint flexible enough to traverse collections with a query other than study ID, and the only one that gives you any control over the information returned.
Here is an example where we only want information for biosamples if there is a specific version of the annotation workflow. There is no other way to do this right now via the API, and the number of expected records based on a Compass query is 536.
db.getCollection(
'workflow_execution_set'
).aggregate(
[
{
$match: {
type: 'nmdc:MetagenomeAnnotation',
version: 'v1.0.2-beta'
}
},
{
$lookup: {
from: 'biosample_set',
localField: 'has_input',
foreignField: 'id',
as: 'biosample_set'
}
}
], {
maxTimeMS: 60000,
allowDiskUse: true
}
);
I'm fine if, instead of fixing queries:run, we make a new read-only endpoint with flexibility similar to nmdcschema/{collection_name} that supports aggregations, but it needs to be able to return all records.
A real-life use case where I've had to simplify the query and instruct Marcel to call the API tens to hundreds of times instead of once is: give me all the annotation records, along with information about the input and output files, with version X for study Y that were generated at JGI, so the analysis can be pulled back into JGI. This particular example returns 86 documents, but I can't have the JGI developer implement this programmatically going forward because that won't always be true for a given study.
db.getCollection(
'workflow_execution_set'
).aggregate(
[{
$match: {
type: 'nmdc:MetagenomeAnnotation',
version: 'v1.1.0'
}
},
{
$lookup: {
from: 'data_generation_set',
localField: 'was_informed_by',
foreignField: 'id',
as: 'data_generation_set'
}
},
{
$match: {
'data_generation_set.associated_studies': 'nmdc:sty-11-34xj1150',
'data_generation_set.processing_institution': 'JGI'
}
},
{
$lookup: {
from: 'data_object_set',
localField: 'has_input',
foreignField: 'id',
as: 'do_input'
}
},
{
$lookup: {
from: 'data_object_set',
localField: 'has_output',
foreignField: 'id',
as: 'do_output'
}
}
], {
maxTimeMS: 60000,
allowDiskUse: true
}
);
Instead, I've had to instruct him to use queries:run with the following query for each MetagenomeAnnotation id (86 queries instead of 1 in this case):
{
"aggregate": "workflow_execution_set",
"pipeline": [
{
"$match": {
"id": "nmdc:wfmgan-11-mk6c5h53.2"
}
},
{
"$lookup": {
"from": "data_object_set",
"localField": "has_input",
"foreignField": "id",
"as": "data_object_input"
}
},
{
"$lookup": {
"from": "data_object_set",
"localField": "has_output",
"foreignField": "id",
"as": "data_object_output"
}
}
]
}
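One way to cut down the number of per-id requests, sketched here as a suggestion and not something proposed in the thread, is to match several ids at once with `$in` and cap each batch at the reported 100-document limit. The helper names `chunk` and `buildBatchedCommands` are hypothetical. The sketch assumes that `id` is unique within workflow_execution_set, so a batch of N ids yields at most N documents, and that queries:run accepts `$in` inside `$match`. It only builds request bodies and does not call the API.

```javascript
// Hypothetical helper: split a list of ids into chunks of at most `size`.
function chunk(ids, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const out = [];
  for (let i = 0; i < ids.length; i += size) {
    out.push(ids.slice(i, i + size));
  }
  return out;
}

// Hypothetical helper: build one queries:run request body per chunk of ids.
// The pipeline mirrors the per-id query above, with $in replacing the
// single-id match. maxPerRequest defaults to the 100-document cap reported
// in this issue (an assumption, not a documented limit).
function buildBatchedCommands(ids, maxPerRequest = 100) {
  return chunk(ids, maxPerRequest).map((batch) => ({
    aggregate: "workflow_execution_set",
    pipeline: [
      { $match: { id: { $in: batch } } },
      {
        $lookup: {
          from: "data_object_set",
          localField: "has_input",
          foreignField: "id",
          as: "data_object_input",
        },
      },
      {
        $lookup: {
          from: "data_object_set",
          localField: "has_output",
          foreignField: "id",
          as: "data_object_output",
        },
      },
    ],
  }));
}
```

With 86 ids this produces a single request body. Because the documents in this collection are large, a smaller `maxPerRequest` may be needed in practice.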
This caused truncated results when running ETL code; see microbiomedata/issues#813. It would be great to get this addressed, as it is really not intuitive to internal users that results may be truncated. cc @shreddd. The workaround is to use a different endpoint, so this is not a blocker, but it continues to cause problems and delays as staff newly encounter this issue.
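Until the endpoint returns complete results, client code such as an ETL job could guard against silent truncation with a simple heuristic: treat any result set that reaches the reported cap as suspect. This is a minimal sketch. `REPORTED_LIMIT`, `mayBeTruncated`, and `assertComplete` are hypothetical names, and the value 100 comes from the behavior described in this issue, not from documented API behavior. The caller is assumed to have already extracted the array of documents from the response.

```javascript
// Assumed cap, taken from the behavior reported in this issue.
const REPORTED_LIMIT = 100;

// Returns true when the result set reaches the cap and may be cut off.
// A result with exactly `limit` documents could also be complete, so this
// is a conservative heuristic, not proof of truncation.
function mayBeTruncated(docs, limit = REPORTED_LIMIT) {
  return Array.isArray(docs) && docs.length >= limit;
}

// Throws instead of letting a possibly truncated result flow downstream.
function assertComplete(docs, limit = REPORTED_LIMIT) {
  if (mayBeTruncated(docs, limit)) {
    throw new Error(
      `Got ${docs.length} documents; results may be truncated at ${limit}`
    );
  }
  return docs;
}
```

Failing loudly is a deliberate choice here: a false alarm on exactly 100 documents costs less than unknowingly loading incomplete data.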
Consider implementing as a new endpoint if the returned payload will change. See also:
/queries:run only returns the first 100 matching documents
queries:run endpoint truncates data in response
The queries:run endpoint is only returning the first 100 results of a query
Query:
Response:
response_1704828951248.json
Note at the end of the response JSON that it claims partialResultsReturned: null