DRS paging #325

Open
briandoconnor opened this issue Jul 13, 2020 · 7 comments

Comments

@briandoconnor
Contributor

briandoconnor commented Jul 13, 2020

Ideas:
Can we include...

  • has_more

  • next (URL to the next page)

  • previous (URL to the previous page)

  • page_count: number of pages (can we make this optional?)

  • items_per_page: number of bundle items per page

These would be present in every response.
Also consider adding a requested page size.
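A minimal sketch of a response carrying the fields listed above, in Python. The function name `build_page`, the `contents` key, the `page` and `items_per_page` query parameters, and the URL layout are assumptions for illustration only; none of them are defined by DRS.

```python
from math import ceil
from typing import Any, Dict, List, Optional


def build_page(items: List[Any], page: int, items_per_page: int,
               base_url: str) -> Dict[str, Any]:
    """Return one page of `items` plus the proposed paging fields.

    Pages are 1-based. `base_url` and the query parameter names are
    hypothetical; the issue does not fix a URL layout.
    """
    if page < 1 or items_per_page < 1:
        raise ValueError("page and items_per_page must be >= 1")

    # An empty bundle is treated as having one (empty) page.
    page_count = max(1, ceil(len(items) / items_per_page))
    start = (page - 1) * items_per_page
    contents = items[start:start + items_per_page]

    def url(p: int) -> Optional[str]:
        # No link is returned for pages outside 1..page_count.
        if p < 1 or p > page_count:
            return None
        return f"{base_url}?page={p}&items_per_page={items_per_page}"

    return {
        "contents": contents,
        "has_more": page < page_count,
        "next": url(page + 1),
        "previous": url(page - 1),
        "page_count": page_count,
        "items_per_page": items_per_page,
    }
```

In this sketch `next` and `previous` are `None` at the ends of the range, so `has_more` is redundant with `next`; a real design would probably keep only one of the two.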

dG: see also https://cloud.google.com/apis/design/design_patterns#list_pagination for a slightly simpler pagination API style guide.

Next steps

  • Common approach, see link above

  • Check with TASC Force: is GA4GH doing this in a consistent way? Is there a common pattern we should use across all our APIs?

  • Who will implement this? U. Chicago? Other groups? We need at least one driver.
    U. Chicago implementer
    CRDC, GDC/IDC
    TOPMed
    EMBL imaging site?

@ghost

ghost commented Jul 13, 2020

A provision for an index is something that would be generally useful. A somewhat flexible structure that mapped a key to a position in the bundle. I imagine for large bundles it is inefficient to jump from page to page looking for something.
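One way to picture such an index, as a rough Python sketch: a mapping from a key to the item's position in the bundle, from which the page holding that item can be computed directly. The names `build_index` and `locate` are hypothetical, and the sketch assumes a fixed page size and a stable item order.

```python
from typing import Dict, List, Tuple


def build_index(keys: List[str]) -> Dict[str, int]:
    """Map each key to its 0-based position in the bundle."""
    return {key: pos for pos, key in enumerate(keys)}


def locate(index: Dict[str, int], key: str,
           items_per_page: int) -> Tuple[int, int]:
    """Return (1-based page number, 0-based offset within that page)."""
    pos = index[key]
    return pos // items_per_page + 1, pos % items_per_page
```

With such an index a client could request the right page in one call instead of walking pages until the item turns up.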

@pgrosu

pgrosu commented Jul 13, 2020

Besides the offset-with-limit approach Kaushik mentioned, you probably want a sort feature so you can perform a seek-filter (i.e. retrieve after some id) with a limit.
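A minimal sketch of the seek approach in Python, assuming the ids are already held in ascending sorted order. The function name `seek_page` and the `after_id` parameter are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional


def seek_page(sorted_ids: List[str], after_id: Optional[str],
              limit: int) -> List[str]:
    """Return up to `limit` ids that sort strictly after `after_id`.

    `sorted_ids` must already be sorted ascending; `after_id=None`
    starts from the beginning.
    """
    start = 0 if after_id is None else bisect_right(sorted_ids, after_id)
    return sorted_ids[start:start + limit]
```

Unlike offset-with-limit, the client passes the last id it received rather than a position, so the result does not shift if items are inserted earlier in the ordering between requests.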

@ianfore

ianfore commented Mar 22, 2021

Discussed within CRDC Imaging and Data Commons framework. Determined that 'standard' pagination capability is required.
'Standard' pagination capability would amount to what was suggested at the top of the issue. Copied and annotated below.
However, rather than specifying all of those from scratch, I suggest that:

  • There are a number of standard approaches already in use, both in and out of GA4GH. I suggest adopting one of these rather than coming up with a new one.

  • A common approach to pagination would be desirable across GA4GH APIs unless a genuinely different capability is required.

  • If the previous will take a long time to resolve, a DRS approach should be taken as there are immediate needs.

  • has_more
    or

  • page n of page_count

  • page_count: number of pages (can we make this optional?)
    What does its absence mean? If it means there is only one page, is it not easier to require page_count and set the value to 1 in that case?

  • Navigation
    next (URL to the next page)
    or to a specific page
    previous (URL to the previous page) - seems less of a case for this

  • items_per_page: number of bundle items per page

  • Allow page size to be part of the request
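As a rough illustration of the last few points, here is a Python sketch in which `page_count` is always present (and is 1 when everything fits on one page) and the client may request a page size that the server caps. The constants and function names are hypothetical; the thread does not propose specific defaults or limits.

```python
from math import ceil
from typing import Optional

DEFAULT_PAGE_SIZE = 100   # hypothetical server default
MAX_PAGE_SIZE = 1000      # hypothetical server-side cap


def effective_page_size(requested: Optional[int]) -> int:
    """Resolve the page size actually used for a request."""
    if requested is None:
        return DEFAULT_PAGE_SIZE
    if requested < 1:
        raise ValueError("page size must be >= 1")
    return min(requested, MAX_PAGE_SIZE)


def page_count(total_items: int, page_size: int) -> int:
    """Always present; 1 when everything fits on a single page."""
    return max(1, ceil(total_items / page_size))
```

Because the server may reduce the requested size, the response would still need to report `items_per_page` so the client knows what it actually got.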

@bcli4d

bcli4d commented Apr 7, 2021

In my opinion, the Google design pattern which David references above is more than adequate for expected Imaging Data Commons uses. Specifically, I don't see any value in being able to request an arbitrary page, a capability which the Google design pattern appears to support. I think, for our purposes, it would be adequate to just get the next page until all data has been received.
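For reference, a minimal Python sketch of the "just get the next page" style. The field names `page_size`, `page_token` and `next_page_token` follow the linked Google guide; everything else (`list_contents`, `iterate_all`, the `contents` key, and encoding an offset in the token) is an assumption made for illustration.

```python
import base64
from typing import Any, Dict, Iterator, List


def _encode(offset: int) -> str:
    # The token is opaque to clients; here it simply wraps an offset.
    return base64.urlsafe_b64encode(str(offset).encode()).decode()


def _decode(token: str) -> int:
    return int(base64.urlsafe_b64decode(token.encode()).decode()) if token else 0


def list_contents(items: List[Any], page_size: int,
                  page_token: str = "") -> Dict[str, Any]:
    """Server side: return one page and a token for the next one."""
    start = _decode(page_token)
    end = start + page_size
    # An empty next_page_token signals that there are no more pages.
    next_token = _encode(end) if end < len(items) else ""
    return {"contents": items[start:end], "next_page_token": next_token}


def iterate_all(items: List[Any], page_size: int) -> Iterator[Any]:
    """Client side: keep requesting pages until the token is empty."""
    token = ""
    while True:
        response = list_contents(items, page_size, token)
        yield from response["contents"]
        token = response["next_page_token"]
        if not token:
            break
```

In this sketch the client never constructs a token itself, it only echoes back the one it was given, so it walks forward page by page.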
I'm trying to imagine a scenario in which random page access could be useful. Ian talked about viewing pyramidal, tiled images. It seems that this would require that the client knows the server access pattern, e.g. how it walks a bundle hierarchy (a bundle of bundles) and in what order results are then returned. I'm guessing that specifying this might be difficult at best for some implementations.

@BinamB

BinamB commented Apr 8, 2021

I fully agree with what @bcli4d mentioned. I believe we should keep bundles as simple as possible. In an expanded bundle with many file objects, servers need to respond with large payloads quickly and clients will need to be able to receive and parse large response objects. Implementing the Google design pattern solves for this and enables clients to recursively expand large bundles by just iterating through each page.

When bundles are nested and the expand=True flag is set, pagination calculations will change, and we will need to discuss how the client will handle this.
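A small Python sketch of why the calculations change. The nested dict shape (`id` plus an optional `contents` list) is only loosely modelled on a bundle's contents, and the depth-first ordering in `expanded` is an assumption; a server could walk the hierarchy in a different order.

```python
from typing import Any, Dict, Iterator, List


def top_level(bundle: Dict[str, Any]) -> List[str]:
    """Ids of the direct children only (no expansion)."""
    return [child["id"] for child in bundle.get("contents", [])]


def expanded(bundle: Dict[str, Any]) -> Iterator[str]:
    """Ids of all descendants, walked depth-first (assumed order)."""
    for child in bundle.get("contents", []):
        yield child["id"]
        yield from expanded(child)


bundle = {
    "id": "root",
    "contents": [
        {"id": "a", "contents": [{"id": "a1"}, {"id": "a2"}]},
        {"id": "b"},
    ],
}
```

Here the unexpanded listing has two items and the expanded one has four, so the same page size gives a different number of pages and different page boundaries depending on the flag.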

@pgrosu

pgrosu commented Apr 8, 2021

Just curious, what would be the upper limit of the number of pages and page sizes? Imagine, if you will, this scenario :) You have your DICOM images, but they are too big, so you create patches out of them, which become a multiple of the original set. Upon these patches you can perform searches. So you make that another repository of patched datasets, in addition to your original one. Then you decide that you want to perform pairwise or some other segmentation analysis of any incoming patches. So you build a generative model using a simple variational autoencoder (VAE). Sure, it helps you with image synthesis, but it can do much more than that, as it can find similarities quickly among any of the above datasets or new incoming data with refinement filters. This alternate representation of the data now becomes another addition to the query engine. In fact, it can speed up searches without much computational overhead under specific query design criteria. Since this could be helpful to others as well, you keep adding the derived data to your Cloud repositories/databases. Since some queries are based on generative models, the query results could become some power of the original set. So we started with whole images, then we went to patched images, and then to VAE representations of images. All of these make it faster for clinicians/researchers to gain interesting insights, while still remaining computationally manageable given the above data preparation.

So I come back to my original question, what might be the upper limit of the number of query result pages and page sizes? Would the upper limit of the number of pages be 5000, 10000, 100k, 500k or something more? At some point you might spend more time parsing through search results, rather than gathering useful scientific insight.

Hope it helps,
Paul

@ianfore

ianfore commented Apr 12, 2021

In getting back into the discussion, I'll focus on Binam's "keep bundles as simple as possible" and, to that end, on pagination, as that is the subject of this ticket.

The discussions of other things here aren't about pagination - there are other tickets for those.
Paul - there's a lot in your comment which we could discuss in those other tickets. I'm working up a DICOM using existing DRS ids from the IDC which Bill and Binam work on. But will comment on one of your points here.
"You got your DICOM images, but they are too big so you create patches out of them". Creating patches out of them amounts to the multiple request referred to above - i.e. you make a request consisting of multiple DRS ids that you decide. That patching/slicing/bundling is not predetermined, you - the client author - decide it according to the need of your application and request the 'patch' you want.
