Make human-friendly names available and allow multi-grid downloads #932
this is not particularly efficient but it's only used in local operation
This enables #673, but should be backward compatible with existing UI. JSON responses containing a URL now have an accompanying humanName field. This can be used as the download attribute of an HTML link, or as the attachment name in a content-disposition header. CSV download still returns text/plain (switching it to a JSON-wrapped URL would require JSON parsing on the UI side).
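For instance, a file-serving handler could use the humanName roughly like this. This is a minimal sketch assuming Spark's Request/Response API; the class, method, name, and MIME type are illustrative, not code from this PR.

import spark.Request;
import spark.Response;

public class AttachmentNameSketch {
    // Sketch only: send a download with a human-readable attachment name via the
    // Content-Disposition header. The name value here is made up for illustration.
    static Object serveDownload (Request req, Response res) {
        String humanName = "Access_Portland_C60_P50.geotiff";
        res.type("application/octet-stream");
        res.header("Content-Disposition", "attachment; filename=\"" + humanName + "\"");
        return ""; // the body would normally be the file contents or a redirect to its URL
    }
}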
I tested a few things with the UI locally, ensuring that pre-existing download options for opportunity datasets and regional analyses continue to work.
I would like to consider including the last six characters of UUIDs in the human-readable names, which would give users a consistent way to distinguish between inputs when names are not unique. And protip: entering part of a UUID in most of the selectors in the UI works for filtering.
package com.conveyal.file;

/**
 * Combines a url for downloading a file, which might include a globally unique but human-annoying UUID, with a
globally unique but human-annoying
😅
String extension = filename.substring(filename.lastIndexOf(".") + 1);
return FileStorageFormat.valueOf(extension.toUpperCase());
String extension = filename.substring(filename.lastIndexOf(".") + 1).toLowerCase(Locale.ROOT);
for (FileStorageFormat format : FileStorageFormat.values()) {
    if (format.extension.equals(extension)) {
        return format;
    }
}
return null;
What motivated this change?
The previous version did not work correctly, but it was only used for retrieving pre-existing data in the local file server implementation.
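For context, here is a minimal self-contained sketch of why a valueOf-based lookup can fail. The enum and its constants are hypothetical stand-ins, not the actual FileStorageFormat: valueOf matches enum constant names, so when a constant's name differs from its file extension, valueOf(extension.toUpperCase()) throws IllegalArgumentException, while matching on an explicit extension field (as in the new loop) does not.

import java.util.Locale;

// Hypothetical enum for illustration only.
enum FormatSketch {
    GEOTIFF("tif"), GRID("grid");

    final String extension;

    FormatSketch (String extension) { this.extension = extension; }

    // valueOf("TIF") would throw here because no constant is named TIF;
    // comparing the extension field finds GEOTIFF instead.
    static FormatSketch fromFilename (String filename) {
        String extension = filename.substring(filename.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
        for (FormatSketch format : values()) {
            if (format.extension.equals(extension)) return format;
        }
        return null;
    }
}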
// Get some path parameters out of the URL.
// The UUID of the regional analysis for which we want the output data
final String regionalAnalysisId = req.params("_id");
// The response file format: PNG, TIFF, or GRID
final String fileFormatExtension = req.params("format");
final String fileFormatExtension = req.params("format"); // FIXME this may be case sensitive (could make multiple files for different requests)
To discuss
The PR is now performing all necessary case conversions. We rearranged some code and comments to make this more clear.
}

private String getCsvResults (Request req, Response res) {
private Object getCsvResults (Request req, Response res) {
Should the return type stay String until the changes below are uncommented?
I'd usually say yes, but in this particular case a reference to this method is being used as a @FunctionalInterface spark.Route, and the actual type of the method in that interface is Object handle(Request request, Response response).
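A small sketch of what that means in practice. The class, path, and body are illustrative, but the shape of the Route functional interface is as described above: spark.Route declares Object handle(Request request, Response response), so a handler registered as a method reference is typically declared to return Object.

import spark.Request;
import spark.Response;
import spark.Service;

public class RouteReturnTypeSketch {
    // Declared to return Object because that is the return type of Route#handle.
    private Object getCsvResults (Request req, Response res) {
        res.type("text/plain");
        return "origin,destination,time"; // illustrative body
    }

    public void registerEndpoints (Service sparkService) {
        // The method reference is adapted to the Route functional interface.
        sparkService.get("/example/csv", this::getCsvResults);
    }
}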
int secondUnderscore = singleCutoffFileStorageKey.path.indexOf('_', firstUnderscore + 1);
int lastDot = singleCutoffFileStorageKey.path.lastIndexOf('.');
String humanName = analysis.name +
        singleCutoffFileStorageKey.path.substring(firstUnderscore, firstUnderscore + 6) +
The first six characters of the IDs of spatial datasets created in quick succession will be the same. Use the last six characters instead?
Good point, I'll change it to use the last 5 characters. Looks like this was using the first 6 including the underscore.
https://www.mongodb.com/docs/manual/reference/method/ObjectId/ The first four and middle five bytes are slow-changing and would not disambiguate between data sets. Only the least significant digits of the counter at the end are sure to change.
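A tiny sketch of taking the trailing characters for disambiguation; the method name and example ID are made up for illustration.

public class IdSuffixSketch {
    // Per the linked docs, an ObjectId is a 4-byte timestamp, a 5-byte per-process random
    // value, and a 3-byte counter, so only the trailing hex digits reliably differ between
    // documents created in quick succession.
    static String idSuffix (String id, int length) {
        return id.length() <= length ? id : id.substring(id.length() - length);
    }

    public static void main (String[] args) {
        System.out.println(idSuffix("65a1b2c3d4e5f60718293a4b", 5)); // prints 93a4b
    }
}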
int firstUnderscore = singleCutoffFileStorageKey.path.indexOf('_');
int secondUnderscore = singleCutoffFileStorageKey.path.indexOf('_', firstUnderscore + 1);
This parses out the destination id, cutoff, and percentile from singleCutoffFileStorageKey.path. Is there a reason for not using cutoffIndex, destinationPointSetId, and percentile directly? Perhaps to avoid replicating the logic for the annotations (C, P)?
Right, I was trying to avoid replicating the logic to produce a name containing the C and P cutoff and percentile values. Especially the cutoff: the method getSingleCutoffGrid needs the cutoffIndex to extract one grid from the multi-grid file. It also needs the cutoff in minutes to substitute into the singleGridCutoffKey. So it uses the index to look up the cutoffMinutes in the RegionalAnalysis object, avoiding passing redundant index and value parameters into getSingleCutoffGrid. This means the caller doesn't have the cutoff in minutes to substitute into the human-readable name.
However, it's quite inelegant to parse these out of the filename and I was hoping to change it. This code was revised several times, and at some point in the past something was more complicated and the string slicing seemed preferable. But at present, repeating the step of getting the cutoff value out of the array (even with the null check and fallback to a singular cutoffMinutes) is less confusing and error-prone than this string slicing.
Note that all of this is just to maintain backward compatibility with old analyses that have a single cutoff instead of multiple cutoffs. It may be nearing time to deprecate some of these older formats and remove all these fallbacks from the code.
I am refactoring this to just produce the human readable name totally separately from the storage key.
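A hedged sketch of that direction follows. The helper name and its exact parameters are hypothetical, but the C/P annotation pattern and the fallback to a singular legacy cutoff mirror the discussion above.

public class HumanNameSketch {
    // Hypothetical helper: build the human-readable name directly from analysis fields
    // rather than slicing values back out of the storage key path.
    static String humanNameForGrid (String analysisName, String destinationName,
                                    int[] cutoffsMinutes, int cutoffIndex, int legacySingleCutoffMinutes,
                                    int percentile, String extension) {
        // Fall back to the singular cutoff for old analyses that predate multiple cutoffs.
        int cutoffMinutes = (cutoffsMinutes != null) ? cutoffsMinutes[cutoffIndex] : legacySingleCutoffMinutes;
        return String.format("%s_%s_C%d_P%d.%s", analysisName, destinationName, cutoffMinutes, percentile, extension);
    }
}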
// Replace each destination point set ID with its name.
int d = 0;
for (String destinationPointSetId : analysis.destinationPointSetIds) {
    OpportunityDataset opportunityDataset = Persistence.opportunityDatasets.findByIdIfPermitted(
            destinationPointSetId,
            userPermissions
    );
    checkNotNull(opportunityDataset, "Opportunity dataset could not be found in database.");
    // Avoid name collisions, prepend an integer.
    String friendlyDestinationName = "D%d_%s".formatted(d++, filenameCleanString(opportunityDataset.name));
    friendlyReplacements.put(destinationPointSetId, friendlyDestinationName);
The destination name is used in this new method here, and (part of the) UUID is used in the revised getRegionalResults below. Would a unified method, with the last six characters of the UUID appended to the name, be useful in both places?
I think the reason the two use different naming schemes is that including the user-specified name of the destination set means cascading N database lookups to find those names. I was willing to introduce this potential source of lag or bugs in the multi-grid case: it needs to zip things and move files around, so it might be expected to lag a little. The multi-grid case was also new and intended to be manually triggered.
On the other hand, I was trying to keep the existing single-cutoff codepath more stable, so didn't want to cascade a new database operation there.
In the multi-grid case, I don't really like this UUID-to-name string substitution approach. Practically speaking it works fine. The principle behind it is even reasonable (a one-way mapping from totally unique ID strings to specific sanitized human-readable strings). But it feels like a workaround, avoiding producing the human-readable names in the loop and associating them with the keys.
Assuming we're willing to cascade the database fetch in the single-grid case, what might make sense is to factor out not only a function that generates the short dataset name, but also one that creates the whole human-readable name of a single-cutoff grid, then reuse both of these functions in both the single and multi-grid use cases. There's a slight complication in how the extension of the human-readable filename is kept separate in case the raw human-readable name is truncated. Maybe this calls for a new data type combining or associating a FileStorageKey with a separate human-readable name and extension, or one that truncates and combines the raw human-readable name and extension in its constructor.
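A hypothetical sketch of such a data type. The class name, constructor, and truncation rule are assumptions made for illustration; FileStorageKey is the existing type referenced above.

import com.conveyal.file.FileStorageKey;

// Hypothetical: pairs a storage key with a human-readable download name, truncating the
// raw name in the constructor while keeping the extension intact.
public class HumanKeySketch {
    public final FileStorageKey storageKey;
    public final String humanName;

    public HumanKeySketch (FileStorageKey storageKey, String rawHumanName, String extension, int maxBaseLength) {
        this.storageKey = storageKey;
        String base = rawHumanName.length() > maxBaseLength ? rawHumanName.substring(0, maxBaseLength) : rawHumanName;
        this.humanName = base + "." + extension;
    }
}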
@@ -550,10 +674,12 @@ public void registerEndpoints (spark.Service sparkService) {
});
sparkService.path("/api/regional", () -> {
    // For grids, no transformer is supplied: render raw bytes or input stream rather than transforming to JSON.
    // Wait, do we have any endpoints that write grids into responses? It looks like they all return URLs now.
See TimeGridWriter.writeGeotiff?
Looks like that's indeed used by AnalysisWorkerController#handleSinglePoint to generate a binary Spark response. But in this comment I was referring to RegionalAnalysisController, and specifically this URL path /api/regional where there's a comment about no transformer being supplied.
We looked at this today, and RegionalAnalysis inherits a toString method that produces JSON. Leaving as-is.
also use last instead of first five chars of UUID
Added one commit to address some of the review comments. Further changes will require some decisions about where it's appropriate to make database calls.
Eliminates poorly named intermediate string variable and need for explanation in outdated comments.
Recent commits factor out
Although this has been merged, I'm adding follow-up observations here to keep them together with the original discussion. On today's call we identified a problem with the multi-file ZIP download. In a regional analysis with millions of origin points and say 40 different cutoff/destination/percentile combinations, creating the zip file can be quite slow, taking over 2 minutes. It's not appropriate to stall an HTTP request for so long with no indication of progress. In the absence of feedback, users are likely to re-initiate the creation/download process multiple times from a command line, script, or UI. This could lead to multiple stalled HTTP handler threads all creating the same grid and zip files and overwriting each other's results.

In initial testing I had not noticed this slowness, so the first step was to identify which part of the process is slow. This could be extracting the single-cutoff values from the multi-cutoff regional results file, encoding the geotiff file (or grid file, which is similarly slow), or transferring these resulting files to S3. We don't have enough log messages to immediately see the timing of each step. But by comparing file modification times (with

Assuming creating these files remains "slow" (more than a few seconds), the solution will probably involve a mechanism like the one we use in network building, where an HTTP response is always immediate. If the object exists, it's immediately returned. Otherwise the background process is started and we respond with a 202.

Aside: as noted in AnalysisWorker#handleAndSerializeOneSinglePointTask, "Ideally we'd stall briefly using something like Future.get(timeout) in case loading finishes quickly" so not all requests for newly created entities result in a 202 and retry, but that's another topic.
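A rough sketch of that respond-immediately mechanism, under stated assumptions: the class, the zipExists/downloadUrlFor/buildZip placeholders, and the single-threaded executor are illustrative, not the r5 implementation.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import spark.Response;

public class SlowDownloadSketch {
    private final Set<String> inProgress = ConcurrentHashMap.newKeySet();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Respond immediately: return the URL if the zip exists, otherwise start (at most one)
    // background build and return 202 so the client knows to poll again later.
    Object handle (String zipId, Response res) {
        if (zipExists(zipId)) {
            return downloadUrlFor(zipId);
        }
        if (inProgress.add(zipId)) {
            executor.submit(() -> {
                try { buildZip(zipId); } finally { inProgress.remove(zipId); }
            });
        }
        res.status(202);
        return "";
    }

    // Placeholders standing in for the real file storage and zip-building steps.
    boolean zipExists (String zipId) { return false; }
    String downloadUrlFor (String zipId) { return "https://example.com/" + zipId + ".zip"; }
    void buildZip (String zipId) { /* derive grids, create zip, upload */ }
}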
Looking into the speed of individual multi-cutoff to single-cutoff conversions. I added some log statements around the lines that produce the single-cutoff files and ran a regional analysis for Portland (57k origins). It takes around 200ms to derive a single-cutoff file. This is 52x smaller than the original culprit regions with around 3M origins. 52 * 200 ms gives 10s, roughly aligning with experience on those larger regions.

Wrapping input and output with buffered streams makes little to no difference, probably because input and output are both passing through gzip streams which contain their own buffers. Sampling CPU in VisualVM shows

Unzipping the source

Using command line tools, gunzipping and re-gzipping the entire multi-cutoff source

Initial conclusion is that streaming compression/decompression is the main culprit, perhaps in how it interacts with the byte-by-byte little-endian DataInputStream. It would be interesting to see how much faster this would be if performed with NIO byte buffers. It's also worth considering whether files need to be compressed at rest, and whether we should just derive all the

I also tried increasing the read buffer size with
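For what the NIO alternative could look like, a purely illustrative sketch (not the r5 grid reader): decompress once into memory, then read little-endian values from a ByteBuffer rather than byte-by-byte through stream wrappers.

import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.GZIPInputStream;

public class LittleEndianReadSketch {
    // Decompress the gzipped input fully, then pull little-endian ints out of a ByteBuffer.
    static int[] readInts (InputStream gzippedInput, int count) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(gzippedInput)) {
            ByteBuffer buffer = ByteBuffer.wrap(gzip.readAllBytes()).order(ByteOrder.LITTLE_ENDIAN);
            int[] values = new int[count];
            for (int i = 0; i < count; i++) values[i] = buffer.getInt();
            return values;
        }
    }
}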
The backend returns download URLs as small JSON objects with a url property. In this PR, these JSON responses are extended to also have a humanName property, which is an alternate filename that is more human readable but not guaranteed to be unique. This enables #673 (as discussed in #673 (comment)) but should be backward compatible with existing UI as responses still have the same old url property. The new humanName property can be used in various ways, such as the download attribute of an HTML link or as the attachment name in a content-disposition header.

Regional result CSV download links seem to be the odd one out: they return text/plain instead of a JSON wrapped URL. Switching them over to use JSON wrapped URLs like everything else would require an accompanying change on the UI side but would also enable the CSV downloads to use human readable names. The updated code to return the JSON wrapped URL with humanName is already present in RegionalAnalysisController.getCsvResults but commented out for the time being.
This PR also adds an endpoint at /api/regional/:_id/all that will generate all available GeoTIFF rasters for a given regional analysis and store them in a persistent ZIP file, returning one of these new-style JSON wrapped download links including a human-readable name. An attempt is also made to replace any UUIDs in the names of files inside this ZIP with relevant human-readable strings.

Many of the changes here are just small changes to return types, MIME types etc. There is one big added/removed block in RegionalAnalysisController, which is mostly factoring the action of extracting a single grid out of the HTTP controller, so it can be called once by the existing grid-retrieval HTTP controller, but multiple times in a loop by the new bulk-zip-creating HTTP controller. The smaller targeted changes there are somewhat camouflaged/exaggerated by the indentation and movement of that big block.
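As a rough illustration of the bulk-download step only: the method, map, and file paths below are stand-ins, since the real controller derives the grids and their names from the regional analysis itself.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipAllGridsSketch {
    // Write each derived single-cutoff raster into one zip, using the human-readable
    // name (with any UUIDs already replaced) as the entry name.
    static void zipGrids (Map<String, Path> humanNameToFile, OutputStream out) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Map.Entry<String, Path> entry : humanNameToFile.entrySet()) {
                zip.putNextEntry(new ZipEntry(entry.getKey()));
                Files.copy(entry.getValue(), zip);
                zip.closeEntry();
            }
        }
    }
}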