GSIP 69 Use Cases
The following use cases, identified in the GeoServer code base, should cover most situations where the main scalability and/or performance bottleneck is in the Catalog’s client code and not in the Catalog’s ability to serve large amounts of configuration objects.
- Secure Catalog Decorator: a full scan of Catalog resources is performed on each get*():List request and a separate list is built for the current user’s accessible objects, even if the Catalog returns an immutable list, affecting both memory consumption and processing time.
- Wicket User Interface: the Home page gets the whole list of workspaces, stores, and layers only to get their size. Catalog resource list pages (e.g. LayerPage, StorePage, etc.) do so to a) return the iterator for the current page of data, b) obtain the full list of objects, c) obtain the filtered list of objects, d) obtain the total number of objects, e) obtain the filtered number of objects.
- WMS GetCapabilities: generation of a WMS Capabilities document implies fetching the full list of layers multiple times, in order to a) filter the layer list based on the request’s NAMESPACE parameter, b) calculate the layer list’s aggregated bounds, c) figure out a common CRS to all the layers, and d) build an in-memory layer tree in order to nest layers based on the LayerInfo’s wms “path” attribute.
org.geoserver.security.SecureCatalogImpl is a decorator around org.geoserver.catalog.Catalog that applies in-process filtering of restricted catalog resources based on the configured data security policies.
Whenever the current list of some concrete catalog resource is requested (e.g. getLayers():List<LayerInfo>), this in-process filtering consists of the following steps:
- Obtain the original list of catalog objects from the decorated Catalog;
- For each catalog object:
  - Check if the catalog object is accessible to the current user
  - Create a security decorator for the catalog object, if accessible
  - If accessible, add the security decorator to the return list
- Return the filtered list of catalog objects
The problem with this approach is that a full scan of Catalog
resources is performed on each request and a separate list is built for
the accessible objects, even if the Catalog
returns an immutable
list, affecting both memory consumption and processing time.
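The filtering steps above can be sketched as follows. This is a simplified model and not the actual SecureCatalogImpl code: the Layer and SecuredLayer classes, the getLayers signature, and the access policy are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Simplified, hypothetical model of the eager filtering; not the real SecureCatalogImpl. */
class EagerSecureCatalog {

    /** Stand-in for a catalog object such as LayerInfo. */
    static final class Layer {
        final String name;
        Layer(String name) { this.name = name; }
    }

    /** Stand-in for the security decorator wrapped around an accessible object. */
    static final class SecuredLayer {
        final Layer delegate;
        SecuredLayer(Layer delegate) { this.delegate = delegate; }
    }

    /** Scans the whole list and builds a second list on every call. */
    static List<SecuredLayer> getLayers(List<Layer> all, Predicate<Layer> accessible) {
        List<SecuredLayer> result = new ArrayList<>();
        for (Layer layer : all) {                     // every object is visited
            if (accessible.test(layer)) {             // per-object access check
                result.add(new SecuredLayer(layer));  // one decorator per accessible object
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Layer> all = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            all.add(new Layer("layer" + i));
        }
        // Hypothetical policy: "layer3" is restricted for the current user.
        List<SecuredLayer> visible = getLayers(all, l -> !l.name.equals("layer3"));
        System.out.println(visible.size() + " of " + all.size() + " layers are accessible");
    }
}
```

Both the loop and the result list grow with the total number of catalog objects, regardless of how many of them the caller ends up reading.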
A more scalable approach would be:
- Create a query predicate that matches catalog objects based on current user’s credentials
- Query the decorated Catalog with that filter predicate
- Obtain the filtered and immutable list of Catalog objects
- Return a list decorator that applies a secured decorator to each returned object on demand.
In turn, the Catalog backend, besides returning only the list of matching objects, could be capable of translating the query predicate, or part of it, into the backend’s native query language.
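A minimal sketch of this approach is shown below, assuming a hypothetical query method on the backend and a hypothetical securedView helper; neither name comes from the GeoServer API. The in-memory query here still scans its input, which is exactly the part a real backend could replace with a native query.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical sketch of predicate-based querying plus an on-demand security decorator. */
class LazySecureCatalog {

    /** Stand-in for the backend: returns only the matching objects, as an immutable list. */
    static <T> List<T> query(List<T> stored, Predicate<T> filter) {
        List<T> matches = new ArrayList<>();
        for (T item : stored) {
            if (filter.test(item)) {
                matches.add(item);
            }
        }
        return Collections.unmodifiableList(matches);
    }

    /** List decorator: the security wrapper is applied only when an element is read. */
    static <T, S> List<S> securedView(List<T> matches, Function<T, S> wrapper) {
        return new AbstractList<S>() {
            @Override
            public S get(int index) {
                return wrapper.apply(matches.get(index));
            }

            @Override
            public int size() {
                return matches.size(); // no wrapping needed to answer size()
            }
        };
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("roads", "restricted", "rivers");
        // Hypothetical predicate derived from the current user's credentials.
        List<String> matches = query(stored, name -> !name.equals("restricted"));
        List<String> secured = securedView(matches, name -> "secured(" + name + ")");
        System.out.println(secured.size() + " accessible, first: " + secured.get(0));
    }
}
```

With this shape, asking for the size of the list creates no security decorators at all, and iterating over a single page creates only as many as are actually read.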
- The GeoServer Home Page presents the number of available workspaces, stores, and layers. To do so, it asks the Catalog for the list of each of those resources and calls the list’s size():int method (e.g. getCatalog().getLayers().size()). With a large number of resources (say, layers), this means going through the SecureCatalog back to the actual Catalog, each of which returns a safe copy of the actual catalog objects, just to finally get the number of objects in the list.
- The Catalog resource list pages (e.g. org.geoserver.web.data.layer.LayerPage, org.geoserver.web.data.store.StorePage) present an even more challenging use case:
They present the full list of catalog objects of a given type in a paged list, allowing for sorting and filtering based on direct or computed properties. They also display the total number of objects, as well as the number of objects that match the current filter, if any.
To do so, the GeoServer Wicket “framework” relies, through GeoServerDataProvider, on the following API and default behavior. Being a “template method” class, GeoServerDataProvider also provides the hooks to optimize and avoid loading everything into memory:
    abstract class GeoServerDataProvider<T> extends org.apache.wicket.extensions.markup.html.repeater.util.SortableDataProvider {

        /** @return iterator capable of iterating
         *  over {first, first+count} items */
        @Override
        public Iterator<T> iterator(int first, int count) {
            List<T> items = getFilteredItems();
            // global sorting
            Comparator<T> comparator = getComparator(getSort());
            if (comparator != null) {
                Collections.sort(items, comparator);
            }
            // in memory paging
            int last = first + count;
            if (last > items.size())
                last = items.size();
            return items.subList(first, last).iterator();
        }

        /** @return the size of the filtered item collection */
        @Override
        public int size() {
            return getFilteredItems().size();
        }

        /** @return a non filtered list of all
         *  the items the provider must return
         */
        protected abstract List<T> getItems();

        /** @return the global size of the collection,
         *  without filtering it
         */
        public int fullSize() {
            return getItems().size();
        }

        /** @return a filtered list of items. Subclasses can
         *  override if they have a more efficient way of filtering
         *  than in memory keyword comparison
         */
        protected List<T> getFilteredItems() {
            List<T> items = getItems();
            // if needed, filter
            if (keywords != null && keywords.length > 0) {
                return filterByKeywords(items);
            } else {
                // make a deep copy anyways, the catalog
                // does not do that for us
                return new ArrayList<T>(items);
            }
        }
        .....
    }
Then, concrete resource list pages (e.g. LayerPage) use specializations of GeoServerDataProvider to fill in a GeoServerTablePanel, which in turn relies on the GeoServerDataProvider’s public API (size():int, fullSize():int, iterator(int, int):Iterator).
The protected List<T> getItems() method is implemented by concrete data providers such as:
    class LayerProvider extends GeoServerDataProvider<LayerInfo> {
        @Override
        protected List<LayerInfo> getItems() {
            return getCatalog().getLayers();
        }
        ....
    }
So in this case, a full scan and a defensive copy of the catalog resources are performed for each of:
- Getting the total number of objects
- Getting the filtered number of objects
- Getting the filtered list of objects to return an Iterator
This makes a catalog object list page very resource intensive (in both memory and processing), to the point of being impractical as the number of resources grows.
An approach that scales better should allow for:
- Getting the number of matching objects, given a query predicate, directly from the Catalog with no need to traverse a list of results
- Obtaining an iterator directly from the Catalog for the objects that match a query predicate
- Specifying a sort order and getting the results sorted directly from the catalog
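The three requirements above could translate into an API shaped roughly like the sketch below. The QueryableCatalog class and its count and list methods are hypothetical names used only for illustration, not an existing GeoServer interface. The in-memory stream pipeline is a stand-in as well: a real backend would be expected to push the predicate, sort order, and paging down to its storage.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Stream;

/** Hypothetical query-style catalog facade; names are illustrative only. */
class QueryableCatalog<T> {

    private final List<T> stored;

    QueryableCatalog(List<T> stored) {
        this.stored = stored;
    }

    /** Number of matching objects, without handing a list to the caller. */
    long count(Predicate<T> filter) {
        return stored.stream().filter(filter).count();
    }

    /** One page of matching objects; sortOrder may be null for backend order. */
    Iterator<T> list(Predicate<T> filter, int offset, int limit, Comparator<T> sortOrder) {
        Stream<T> matches = stored.stream().filter(filter);
        if (sortOrder != null) {
            matches = matches.sorted(sortOrder);
        }
        return matches.skip(offset).limit(limit).iterator();
    }

    public static void main(String[] args) {
        QueryableCatalog<String> catalog =
                new QueryableCatalog<>(Arrays.asList("roads", "rivers", "parcels", "rails"));
        Predicate<String> startsWithR = name -> name.startsWith("r");
        System.out.println("matching: " + catalog.count(startsWithR));
        Iterator<String> page = catalog.list(startsWithR, 0, 2, Comparator.naturalOrder());
        while (page.hasNext()) {
            System.out.println(page.next());
        }
    }
}
```

A data provider built on such an API could answer size(), fullSize(), and iterator(int, int) with one targeted call each, instead of copying the full list three times per page render.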
Performing a WMS GetCapabilities request when the number of layers is large enough (tested with 10,000 and 100,000) quickly becomes impractical, as the Capabilities_1_3_0_Translator fetches the full list of layers multiple times for some post processing:
- To filter the layer list based on the request’s NAMESPACE parameter, if present;
- To calculate the layer list’s aggregated bounds;
- To build an in-memory layer tree in order to nest layers based on the LayerInfo’s wms “path” attribute;
As a result, a capabilities request can potentially make GeoServer run out of memory, at least in cases where the catalog storage is off heap. The creation of a separate list of LayerInfo objects when there’s a namespace filter, and the creation of the in-memory LayerTree object holding all layers, can also lead to problems even if the catalog is fully in memory.
Although an argument can be made that a GetCapabilities response with tens or hundreds of thousands of layers would be totally impractical for almost any client (except perhaps a crawler), such an operation should not bring GeoServer down.
A possible solution does not depend only on a better (or streaming) Catalog API, but also on improving the logic of the GetCapabilities translator itself:
- The namespace filter should be passed back to the catalog backend, so that the translator gets only the matching layers with no need for in-process filtering
- Building an in-memory tree of all the layers for the rather rare case of layers configured with the wms path attribute is impractical. It would be better if all the layers that are not nested through the wms path attribute were encoded in a streamed way, while the in-memory tree is built only for those that do have the wms path set.
- Furthermore, it would be possible/desirable to:
  - Do a single pass over the resulting list of layers, encoding the layers to a temporary resource while at the same time building the aggregated bounds;
  - Cache the results of the whole operation somewhere in order to return the cached document as long as some change indicator, such as the updateSequence, has not changed. Some thought should be put into this, though, to account for the cases where the configuration changes behind GeoServer’s back, such as a third process modifying the backend storage directly (it could even be another GeoServer in a load-balanced setup; real clustering should take care of a shared updateSequence). This may become even more complex as we’d need to take authentication/authorization into account. If done at all, caching for the anonymous user would perhaps bring the highest benefit at the lowest cost.
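The single-pass idea from the list above can be illustrated with the following sketch. Everything in it is an assumption made for illustration: the Layer class with a plain bounding box, the encode method, and the simplified XML output (no escaping, no nesting, no CRS handling) are not taken from the GetCapabilities translator.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical single-pass encoder: writes layers while aggregating their bounds. */
class SinglePassEncoder {

    /** Stand-in for a layer with a bounding box; not the real LayerInfo. */
    static final class Layer {
        final String name;
        final double minX, minY, maxX, maxY;

        Layer(String name, double minX, double minY, double maxX, double maxY) {
            this.name = name;
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }
    }

    /**
     * Encodes each layer as it arrives and returns the aggregated bounds as
     * {minX, minY, maxX, maxY}, or null if there were no layers.
     */
    static double[] encode(Iterator<Layer> layers, Writer out) throws IOException {
        double[] bounds = null;
        while (layers.hasNext()) {
            Layer layer = layers.next();
            // Simplified output: a real encoder would escape names and add more elements.
            out.write("<Layer><Name>" + layer.name + "</Name></Layer>\n");
            if (bounds == null) {
                bounds = new double[] {layer.minX, layer.minY, layer.maxX, layer.maxY};
            } else {
                bounds[0] = Math.min(bounds[0], layer.minX);
                bounds[1] = Math.min(bounds[1], layer.minY);
                bounds[2] = Math.max(bounds[2], layer.maxX);
                bounds[3] = Math.max(bounds[3], layer.maxY);
            }
        }
        return bounds;
    }

    public static void main(String[] args) throws IOException {
        Iterator<Layer> layers = Arrays.asList(
                new Layer("roads", -10, -5, 10, 5),
                new Layer("rivers", 0, 0, 20, 15)).iterator();
        StringWriter out = new StringWriter(); // stands in for the temporary resource
        double[] bounds = encode(layers, out);
        System.out.print(out);
        System.out.println(Arrays.toString(bounds));
    }
}
```

Because the encoder consumes an Iterator and keeps only four numbers of state, its memory use does not depend on the number of layers, as long as the Writer targets a file or another streamed resource rather than an in-memory buffer.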
Return to the main proposal page: GSIP 69 - Catalog scalability enhancements