
Make human-friendly names available and allow multi-grid downloads #932

Merged: abyrd merged 12 commits into dev from multi-grid-download on Mar 28, 2024.

Conversation

@abyrd (Member) commented Feb 16, 2024

The backend returns download URLs as small JSON objects with a url property. In this PR, these JSON responses are extended to also have a humanName property, which is an alternate filename that is more human readable but not guaranteed to be unique. This enables #673 (as discussed in #673 (comment)) but should be backward compatible with existing UI as responses still have the same old url property. The new humanName property can be used in various ways such as the download attribute of an HTML link or as the attachment name in a content-disposition header.
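For illustration, a wrapped download-link response with the new field might look like this (all values here are hypothetical):

```json
{
  "url": "https://bucket.example.com/results/65cf3a2b8f1d4e0012ab34cd_P50_C30.geotiff",
  "humanName": "Morning_peak_access_P50_C30.geotiff"
}
```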

Regional result CSV download links seem to be the odd one out: they return text/plain instead of a JSON wrapped URL. Switching them over to use JSON wrapped URLs like everything else would require an accompanying change on the UI side but would also enable the CSV downloads to use human readable names. The updated code to return the JSON wrapped URL with humanName is already present in RegionalAnalysisController.getCsvResults but commented out for the time being.

This PR also adds an endpoint at /api/regional/:_id/all that will generate all available GeoTIFF rasters for a given regional analysis and store them in a persistent ZIP file, returning one of these new-style JSON wrapped download links, including a human-readable name. An attempt is also made to replace any UUIDs in the names of files inside this ZIP with relevant human-readable strings.
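As a rough sketch of that substitution (a minimal illustration assuming a java.util.Map of replacements built from database lookups; the method name is hypothetical, not the merged code):

```java
// Hypothetical sketch: rewrite a ZIP entry name by substituting each known
// UUID with its sanitized human-readable equivalent.
static String humanizeEntryName (String entryName, Map<String, String> friendlyReplacements) {
    for (Map.Entry<String, String> replacement : friendlyReplacements.entrySet()) {
        entryName = entryName.replace(replacement.getKey(), replacement.getValue());
    }
    return entryName;
}
```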

Many of the changes here are just small changes to return types, MIME types, etc. There is one big added/removed block in RegionalAnalysisController, which mostly factors the extraction of a single grid out of the HTTP controller so that it can be called once by the existing grid-retrieval HTTP controller, and multiple times in a loop by the new bulk-zip-creating HTTP controller. The smaller targeted changes there are somewhat camouflaged and exaggerated by the indentation and movement of that big block.

abyrd added 7 commits January 26, 2024 23:55. Commit messages include:
- this is not particularly efficient but it's only used in local operation
- This enables #673, but should be backward compatible with existing UI. JSON responses containing a URL have an accompanying humanName field. This can be used as the download attribute of an HTML link, or as the attachment name in a content-disposition header. CSV download still returns text/plain (requires JSON parsing on UI side)
@ansoncfit (Member) left a comment

I tested a few things with the UI locally, ensuring that pre-existing download options for opportunity datasets and regional analyses continue to work.

I would like to consider including the last six characters of UUIDs in the human-readable names, which would be a consistent way for users to distinguish between inputs when names are not unique. And pro tip: entering part of a UUID in most of the selectors in the UI works for filtering.

```java
package com.conveyal.file;

/**
 * Combines a url for downloading a file, which might include a globally unique but human-annoying UUID, with a
```
@ansoncfit (Member)

> globally unique but human-annoying

😅

Comment on lines -40 to +48:

```diff
-    String extension = filename.substring(filename.lastIndexOf(".") + 1);
-    return FileStorageFormat.valueOf(extension.toUpperCase());
+    String extension = filename.substring(filename.lastIndexOf(".") + 1).toLowerCase(Locale.ROOT);
+    for (FileStorageFormat format : FileStorageFormat.values()) {
+        if (format.extension.equals(extension)) {
+            return format;
+        }
+    }
+    return null;
```
@ansoncfit (Member)

What motivated this change?

@abyrd (Member, Author)

The previous version did not work correctly, but it was only used for retrieving pre-existing data in the local file server implementation. (valueOf throws unless the uppercased extension exactly matches an enum constant name; the new loop matches each format's declared extension field and returns null for unrecognized extensions.)


```diff
 // Get some path parameters out of the URL.
 // The UUID of the regional analysis for which we want the output data
 final String regionalAnalysisId = req.params("_id");
 // The response file format: PNG, TIFF, or GRID
-final String fileFormatExtension = req.params("format");
+final String fileFormatExtension = req.params("format"); // FIXME this may be case sensitive (could make multiple files for different requests)
```
@ansoncfit (Member)

To discuss

@abyrd (Member, Author)

The PR is now performing all necessary case conversions. We rearranged some code and comments to make this clearer.

```diff
     }

-    private String getCsvResults (Request req, Response res) {
+    private Object getCsvResults (Request req, Response res) {
```
@ansoncfit (Member)

Should the return type stay String until the changes below are uncommented?

@abyrd (Member, Author)

I'd usually say yes, but in this particular case a reference to this method is being used as a @FunctionalInterface spark.Route, and the actual type of the method in that interface is Object handle(Request request, Response response).
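For reference, Spark's Route interface is declared essentially as follows, which is why any method used as a Route must return Object:

```java
@FunctionalInterface
public interface Route {
    Object handle (Request request, Response response) throws Exception;
}
```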

```java
int secondUnderscore = singleCutoffFileStorageKey.path.indexOf('_', firstUnderscore + 1);
int lastDot = singleCutoffFileStorageKey.path.lastIndexOf('.');
String humanName = analysis.name +
        singleCutoffFileStorageKey.path.substring(firstUnderscore, firstUnderscore + 6) +
```
@ansoncfit (Member)

The first six characters of the IDs of spatial datasets created in quick succession will be the same. Use the last six characters instead?

@abyrd (Member, Author)

Good point, I'll change it to use the last 5 characters. Looks like this was using the first 6 including the underscore.

@abyrd (Member, Author)

https://www.mongodb.com/docs/manual/reference/method/ObjectId/ The first four and middle five bytes are slow-changing and would not disambiguate between data sets. Only the least significant digits of the counter at the end are sure to change.
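Concretely, an ObjectId is 12 bytes: a 4-byte timestamp, a 5-byte random value, and a 3-byte incrementing counter, so only the trailing hex digits change quickly between objects created close together. A minimal sketch of taking that fast-changing tail (the ID value is made up):

```java
String id = "65cf3a2b8f1d4e0012ab34cd";         // hypothetical ObjectId hex string
String shortId = id.substring(id.length() - 5); // "b34cd", digits of the trailing counter
```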

Comment on lines 425 to 426:

```java
int firstUnderscore = singleCutoffFileStorageKey.path.indexOf('_');
int secondUnderscore = singleCutoffFileStorageKey.path.indexOf('_', firstUnderscore + 1);
```
@ansoncfit (Member)

This parses out the destination id, cutoff, and percentile from singleCutoffFileStorageKey.path.

Is there a reason for not using cutoffIndex, destinationPointSetId, percentile directly? Perhaps to avoid replicating the logic for the annotations (C, P)?

@abyrd (Member, Author) commented Mar 20, 2024

Right, I was trying to avoid replicating the logic that produces a name containing the C and P cutoff and percentile values. Especially the cutoff: the method getSingleCutoffGrid needs the cutoffIndex to extract one grid from the multi-grid file. It also needs the cutoff in minutes to substitute into the singleGridCutoffKey. So it uses the index to look up the cutoffMinutes in the RegionalAnalysis object, avoiding passing redundant index and value parameters into getSingleCutoffGrid. This means the caller doesn't have the cutoff in minutes to substitute into the human-readable name.

However, it's quite inelegant to parse these out of the filename, and I was hoping to change it. This code was revised several times, and at some point in the past something was more complicated and the string slicing seemed preferable. But at present, repeating the step of getting the cutoff value out of the array (even with the null check and fallback to a singular cutoffMinutes) is less confusing and error-prone than this string slicing.
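A minimal sketch of that repeated lookup, assuming field names cutoffsMinutes (the array on newer analyses) and cutoffMinutes (the legacy single value):

```java
// Hypothetical sketch: resolve the cutoff in minutes, falling back to the
// singular legacy field for old analyses that predate multiple cutoffs.
static int resolveCutoffMinutes (RegionalAnalysis analysis, int cutoffIndex) {
    if (analysis.cutoffsMinutes != null) {
        return analysis.cutoffsMinutes[cutoffIndex];
    }
    return analysis.cutoffMinutes;
}
```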

Note that all of this is just to maintain backward compatibility with old analyses that have a single cutoff instead of multiple cutoffs. It may be nearing time to deprecate some of these older formats and remove all these fallbacks from the code.

I am refactoring this to just produce the human readable name totally separately from the storage key.

Comment on lines 275 to 285:

```java
// Replace each destination point set ID with its name.
int d = 0;
for (String destinationPointSetId : analysis.destinationPointSetIds) {
    OpportunityDataset opportunityDataset = Persistence.opportunityDatasets.findByIdIfPermitted(
            destinationPointSetId,
            userPermissions
    );
    checkNotNull(opportunityDataset, "Opportunity dataset could not be found in database.");
    // Avoid name collisions, prepend an integer.
    String friendlyDestinationName = "D%d_%s".formatted(d++, filenameCleanString(opportunityDataset.name));
    friendlyReplacements.put(destinationPointSetId, friendlyDestinationName);
```
@ansoncfit (Member)

The destination name is used in this new method here, and (part of the) UUID is used in the revised getRegionalResults below.

Would a unified method, with the last six characters of the UUID appended to the name, be useful in both places?

@abyrd (Member, Author)

I think the reason the two use different naming schemes is that including the user-specified name of the destination set means cascading N database lookups to find those names. I was willing to introduce this potential source of lag or bugs in the multi-grid case. It needs to zip things and move files around so might be expected to lag a little. The multi-grid case was also new and intended to be manually triggered.

On the other hand, I was trying to keep the existing single-cutoff codepath more stable, so didn't want to cascade a new database operation there.

In the multi-grid case, I don't really like this UUID-to-name string substitution approach. Practically speaking it works fine, and the principle behind it is even reasonable (a one-way mapping from totally unique ID strings to specific sanitized human-readable strings). But it feels like a workaround, avoiding producing the human-readable names in the loop and associating them with the keys.

Assuming we're willing to cascade the database fetch in the single-grid case, what might make sense is to factor out not only a function that generates the short dataset name, but one that creates the whole human-readable name of a single-cutoff grid, then reuse both of these functions in both the single and multi-grid use cases. There's a slight complication in how the extension of the human-readable filename is kept separate in case the raw human readable name is truncated. Maybe this calls for a new data type combining or associating a FileStorageKey with a separate human-readable name and extension, or that truncates and combines the raw human readable name and extension in its constructor.

```diff
@@ -550,10 +674,12 @@ public void registerEndpoints (spark.Service sparkService) {
         });
         sparkService.path("/api/regional", () -> {
             // For grids, no transformer is supplied: render raw bytes or input stream rather than transforming to JSON.
             // Wait, do we have any endpoints that write grids into responses? It looks like they all return URLs now.
```
@ansoncfit (Member)

See TimeGridWriter.writeGeotiff?

@abyrd (Member, Author)

Looks like that's indeed used by AnalysisWorkerController#handleSinglePoint to generate a binary Spark response. But in this comment I was referring to RegionalAnalysisController, and specifically this URL path /api/regional where there's a comment about no transformer being supplied.

@abyrd (Member, Author)

We looked at this today, and RegionalAnalysis inherits a toString method that produces JSON. Leaving as-is.

@abyrd (Member, Author) commented Mar 20, 2024

Added one commit to address some of the review comments. Further changes will require some decisions about where it's appropriate to make database calls.

@abyrd (Member, Author) commented Mar 21, 2024

Recent commits factor out a humanNameForEntity method that creates names in the format User_assigned_name_a1b2c3 for any Model instance, including RegionalAnalysis and OpportunityDataset. This is used by a revised getSingleCutoffGrid method, which returns a new record type (FileStorageKey storageKey, String humanName). The getSingleCutoffGrid method now also receives the cutoff in minutes and looks up the index into the cutoffs array as needed. The same approach to generating human-readable names for grid files is used everywhere, in both the bulk zip download and the retrieval of individual files.
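A minimal sketch of the shapes described above (the record name, field access, and helper body are assumptions based on this summary, not the exact merged code):

```java
// Hypothetical record pairing a storage key with a human-readable filename.
public record HumanKey (FileStorageKey storageKey, String humanName) { }

// Hypothetical sketch of humanNameForEntity, yielding e.g. "User_assigned_name_a1b2c3".
public static String humanNameForEntity (Model entity) {
    String id = entity._id;                          // assumed to be the ObjectId hex string
    String shortId = id.substring(id.length() - 6);  // fast-changing tail of the ObjectId
    return filenameCleanString(entity.name) + "_" + shortId;
}
```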

@abyrd merged commit 953757c into dev on Mar 28, 2024; 3 checks passed.
@abyrd deleted the multi-grid-download branch on March 28, 2024 at 02:35.
@abyrd (Member, Author) commented Apr 11, 2024

Although this has been merged, I'm adding follow-up observations here to keep them together with the original discussion. On today's call we identified a problem with the multi-file ZIP download. In a regional analysis with millions of origin points and, say, 40 different cutoff/destination/percentile combinations, creating the zip file can be quite slow, taking over 2 minutes. It's not appropriate to stall an HTTP request for that long with no indication of progress. In the absence of feedback, users are likely to re-initiate the creation/download process multiple times from a command line, script, or UI. This could lead to multiple stalled HTTP handler threads all creating the same grid and zip files and overwriting each other's results.

In initial testing I had not noticed this slowness, so the first step was to identify which part of the process is slow. This could be extracting the single-cutoff values from the multi-cutoff regional results file, encoding the geotiff file (or the grid file, which is similarly slow), or transferring the resulting files to S3. We don't have enough log messages to immediately see the timing of each step. But by comparing file modification times (with ls --full-time for high time resolution) and log messages, about 13 seconds elapse between the log message about starting to extract the single-cutoff grid and the geotiff file appearing on the filesystem. The file is timestamped one second later on S3. So apparently S3 transfer is not the culprit, and it's either extracting the single-cutoff data or encoding/compressing the geotiff and grid files that is slow. These files are only about 3-6 MB, so this is not expected to be slow at all; we need to investigate. Maybe we're using some unbuffered file IO?

Assuming creating these files remains "slow" (more than a few seconds), the solution will probably involve a mechanism like the one we use in network building, where the HTTP response is always immediate. If the object exists, it's immediately returned; otherwise the background process is started and we respond with a 202 ACCEPTED (see AnalysisWorkerController.handleSinglePoint L68). But the AsyncLoader mechanism used for networks is not very appropriate for this situation: the results of the multi-file ZIP operation are not held as an object in memory, and though AsyncLoader aims to be reusable/generic, it is a caching mechanism incorporating some network-build-specific ideas.
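A minimal sketch of such a handler, with all names hypothetical; it also deduplicates concurrent builds, since overlapping requests were the failure mode described above:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import spark.Request;
import spark.Response;

public class AllResultsEndpoint {
    // Track in-progress builds so repeated requests don't duplicate work.
    private final ConcurrentHashMap<String, Future<?>> inProgress = new ConcurrentHashMap<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public Object handle (Request req, Response res) {
        String analysisId = req.params("_id");
        if (zipExists(analysisId)) {
            return downloadLinkJson(analysisId); // url + humanName wrapper, as elsewhere in this PR
        }
        // computeIfAbsent starts at most one background build per analysis ID
        // (a real implementation would also clear entries once builds finish).
        inProgress.computeIfAbsent(analysisId, id -> executor.submit(() -> buildZip(id)));
        res.status(202); // ACCEPTED: creation in progress, the client should retry
        return "Building ZIP of regional results, retry shortly.";
    }

    // Hypothetical helpers with details omitted:
    private boolean zipExists (String id) { return false; /* check FileStorage */ }
    private String downloadLinkJson (String id) { return "{\"url\":\"...\",\"humanName\":\"...\"}"; }
    private void buildZip (String id) { /* extract grids, zip, upload to S3 */ }
}
```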

Aside: as noted in AnalysisWorker#handleAndSerializeOneSinglePointTask, "Ideally we'd stall briefly using something like Future.get(timeout) in case loading finishes quickly" so not all requests for newly created entities result in a 202 and retry, but that's another topic.

@abyrd (Member, Author) commented Apr 11, 2024

Looking into the speed of individual multi-cutoff to single-cutoff conversions: I added some log statements around the lines that produce the single-cutoff files and ran a regional analysis for Portland (57k origins). It takes around 200 ms to derive a single-cutoff file. This region is 52x smaller than the original culprit regions, which have around 3M origins; 52 * 200 ms gives about 10 s, roughly aligning with the ~13 s observed on those larger regions.

Wrapping input and output with buffered streams makes little to no difference, probably because input and output both pass through gzip streams, which contain their own buffers. Sampling CPU in VisualVM shows InflaterInputStream.read as the most time-consuming call. Reading time is about double writing time.

Unzipping the source .access file and removing the gunzip input stream halves the time to 100 ms instead of 200 ms. Proportions are reversed, with 2/3 of time now in writing instead of 2/3 in reading. Removing the gzip output stream (in addition to the gunzip input stream) still produces files usable by the UI. This drops the conversion time to around 30 ms, so an 85% decrease relative to the original 200 ms. The input and output files are observed to be about 3x bigger without compression.

Using command line tools, gunzipping and re-gzipping the entire multi-cutoff source .access file takes 10 and 50 ms respectively. This is much faster than the roughly 170 ms that compression seems to contribute in the Java case.

Initial conclusion is that streaming compression/decompression is the main culprit, perhaps in how it interacts with the byte-by-byte little-endian DataInputStream. It would be interesting to see how much faster this would be if performed with NIO byte buffers.
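For example, such an experiment might decode the decompressed values through a little-endian ByteBuffer instead of per-byte stream reads (a sketch that assumes 32-bit int values and a pre-decompressed input file; not the current code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferDecodeExperiment {
    public static void main (String[] args) throws Exception {
        // Read the whole decompressed file at once, then decode with bulk,
        // bounds-checked buffer reads instead of byte-by-byte stream calls.
        byte[] raw = Files.readAllBytes(Path.of("results.access")); // hypothetical decompressed input
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        long checksum = 0;
        while (buf.remaining() >= Integer.BYTES) {
            checksum += buf.getInt(); // one bulk read per value instead of four single-byte reads
        }
        System.out.println("checksum: " + checksum);
    }
}
```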

It's also worth considering whether files need to be compressed at rest, and whether we should just derive all the grid and tif files as part of the regional analysis process, instead of extracting them one by one from the .access file. Compression at creation time is expected to be fast and cheap, though; we don't want to incur compression on every HTTP download, rather than once when the file is created, simply to work around the gzip streamed compression being slow.

I also tried increasing the read buffer size with new GZIPInputStream(rawInput, 65536) and it didn't seem to make any difference.
