Turn into an 'Interject' service that runs format ID & conversions #3
Comments
Some notes, from a 'good' and a 'bad' ePub. A good one looks like:
the 'bad' one?
Interestingly, Siegfried doesn't mind, and uses the container signature from PRONOM:
So, if we could run Siegfried/Fido on the remote ZIP, we could ID the file pretty effectively.
Found httpio, which provides a file-like API to any HTTP resource that supports range requests, so this could be used to support fast, full format identification via a modified version of Fido. It doesn't look like
Assuming Fido can correctly identify these ePubs...
...which it can, then this project could be modified into a more generic Interject service that checks formats and fixes up https://github.com/ukwa/ukwa-pywb/blob/eec2b802213783395890311106aea09ca1630191/config.yaml#L25-L44
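To illustrate why range requests make remote identification cheap, here is a minimal sketch of a seekable file object that fetches only the byte ranges a reader asks for. It is not httpio's implementation; `RangeReader` and `http_fetcher` are hypothetical names, and the sketch assumes the server honours `Range` headers and that the total size is known up front (e.g. from `Content-Length`).

```python
import io
import urllib.request
import zipfile


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object backed by a fetch(start, end) callable.

    fetch must return the bytes in the half-open range [start, end).
    """

    def __init__(self, fetch, size):
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError("unsupported whence: %r" % whence)
        self._pos = max(0, pos)
        return self._pos

    def readinto(self, buf):
        # Clamp the request to the end of the resource; 0 signals EOF.
        end = min(self._pos + len(buf), self._size)
        if end <= self._pos:
            return 0
        data = self._fetch(self._pos, end)
        buf[:len(data)] = data
        self._pos += len(data)
        return len(data)


def http_fetcher(url):
    """Return a fetch(start, end) callable that issues HTTP range requests."""
    def fetch(start, end):
        # HTTP byte ranges are inclusive, hence end - 1.
        req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end - 1)})
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    return fetch
```

Because `zipfile` locates entries via the central directory at the end of the archive, `zipfile.ZipFile(RangeReader(http_fetcher(url), size)).namelist()` should only need a few small requests rather than the whole file. A format identification tool that accepts a file-like object could be pointed at the same reader.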
One important note on this - the current system avoids direct downloads via URL hacking, because you can only go via ukwa-pywb which proxies requests, and the 'raw' service is only visible on the server side. Pushing the format blocking upstream might allow that to be circumvented. That said, we primarily rely on the use of secure PCs or the NPLD Player to provide content security, so it's not a critical issue.
Okay, need to split this into short-term (#4) and longer-term.
There are some malformed ePubs that do not have the right file signatures.
And this caused problems during ingest, which means the download has this content type:
Whereas for well-formed ePubs, we see:
The system therefore can't invoke the ePub reader, so they end up as just a download, as they are marked as text/plain (which is also the failover mode, see epub-streamer/streamer.py, lines 48 to 51 in 05aa44a).
Then the format-based download-blocking logic in ukwa-pywb fails to block the download (because text/plain is always allowed). It seems the browser realises it's not text and converts it into a download ZIP.
It's not clear how best to resolve this. First issue is that we need to block downloads - fixing things up so the borked ePubs work is a secondary issue.
This service acts as a proxy for all DLS content, meaning PDFs and ePubs mostly. So, when we pass over the source content-type, perhaps we could intervene if we detect text/plain? Or possibly assume it's supposed to be an ePub and repair the interaction by fixing the MIME type?
Looking at the problematic file, the MIME type is there, as an uncompressed stream, but is the second entry rather than the first (as the standard requires). Therefore, files like this can still be detected, but require a 'relaxed'/custom signature.
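As a rough sketch of what a 'relaxed' signature could look like, the following compares a strict check (the mimetype entry must be the first thing in the ZIP) with a relaxed one (a stored mimetype entry anywhere in the archive), and uses the relaxed check to repair a text/plain content type. All function names here are hypothetical, and the strict check assumes the first local file header carries no extra field.

```python
import io
import zipfile

EPUB_MIME = b"application/epub+zip"


def is_strict_epub(fp):
    """True if the ZIP starts with a stored 'mimetype' entry, as the standard requires."""
    fp.seek(0)
    head = fp.read(58)
    # Local file header: 4-byte signature, file name at offset 30, and the
    # uncompressed data directly after the name (assuming no extra field).
    return (
        head[:4] == b"PK\x03\x04"
        and head[30:38] == b"mimetype"
        and head[38:58] == EPUB_MIME
    )


def is_relaxed_epub(fp):
    """True if any entry is an uncompressed 'mimetype' holding the ePub MIME type."""
    fp.seek(0)
    try:
        with zipfile.ZipFile(fp) as zf:
            for info in zf.infolist():
                if info.filename == "mimetype" and info.compress_type == zipfile.ZIP_STORED:
                    return zf.read(info).strip() == EPUB_MIME
    except zipfile.BadZipFile:
        return False
    return False


def effective_content_type(source_type, fp):
    """Replace a text/plain label with the ePub type when the payload looks like an ePub."""
    if source_type == "text/plain" and is_relaxed_epub(fp):
        return EPUB_MIME.decode("ascii")
    return source_type


def build_zip(names):
    """Demo helper: build an in-memory ZIP with stored entries in the given order."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name in names:
            zf.writestr(name, EPUB_MIME if name == "mimetype" else b"<container/>")
    return buf


good = build_zip(["mimetype", "META-INF/container.xml"])
bad = build_zip(["META-INF/container.xml", "mimetype"])
print(is_strict_epub(good), is_relaxed_epub(good))
print(is_strict_epub(bad), is_relaxed_epub(bad))
print(effective_content_type("text/plain", bad))
```

With the corrected type passed downstream, the existing format-based blocking in ukwa-pywb would see an ePub rather than text/plain. Whether to repair the type or simply block such responses is a separate policy decision.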