Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make raw substream data available to their corresponding user types? #5

Closed
ams-tschoening opened this issue Jan 6, 2017 · 8 comments

Comments

@ams-tschoening
Copy link
Contributor

Consider the following KSY:

meta:
  id:             fmt_clt_recs
  endian:         le

seq:
  - id:     records
    type:   fmt_clt_rec
    repeat: eos

types:
  fmt_clt_rec:
    seq:
      - id:   length
        type: u1
      - id:   oms_rec
        size: oms_rec_length
        type: fmt_oms_rec
      [...]

    instances:
      oms_rec_length:
        value:  length - 6

The corresponding _read method is the following:

    private void _read() {
        this.length = this._io.readU1();
        this._raw_omsRec = this._io.readBytes(omsRecLength());
        KaitaiStream _io__raw_omsRec = new KaitaiStream(_raw_omsRec);
        this.omsRec = new FmtOmsRec(_io__raw_omsRec);
        [...]
    }

fmt_oms_rec is an opaque type from another file and I not only need to parse that type and access whatever I've described in the corresponding KSY, but in some circumstances I simply need to take the whole byte[] of such an instance and send it somewhere or store it in a database or such. It's only raw/low level read access to the data.

The problem now is that while KaitaiStream is available to each instance based on a substream, it doesn't provide access to the raw data. One can only seek, read some bytes etc., but accessing the backing byte[] of the ByteBuffer is not possible, even though ByteBuffer itself would allow it.

The only way to access the raw data currently is to access the parent of an instance of fmt_oms_rec and use _raw_omsRec from there. But that seems not that obvious to me and in case of an opaque type like mine, one has to cast to get the parent type and such.

So, is it by design that the raw data is only accessible using the parent? Wouldn't it make sense to make it accessible using KaitaiStream as well? This could even be considered an implementation detail of the Java runtime and not necessarily be part of the API, like size, eof or such. Because if/how the raw data is available to a user highly depends already on how KaitaiStream is implemented in each runtime.

The only thing to consider I see is if one wants to make the raw data available always or only in case of substreams. But I think making it always available and let the user decide if/how to use is the easiest.

Additionally, I would suggest simply making asReadOnlyBuffer available, using that one would have access to the underluying byte[] and can't break things too easily.

@ams-tschoening
Copy link
Contributor Author

I've missed one important use case: My fmt_oms_rec initially comes as a part of a larger structure into my software, that's why I have fmt_clt_recs etc., and I store each instance of fmt_oms_rec in a database. Later I sometimes only need to parse those instances of fmt_oms_rec because I already have their raw binary data from the database, so calling directly the following in Java on my own:

FmtOmsRec parsedOms = new FmtOmsRec(new KaitaiStream(oms));

In such a case there's no parent element I could retrieve the underluying byte[] from, but I still need to forward it somewhere else. So I either need to keep that on my own somewhere or, which would be better of cause, simply ask KaitaiStream to provide it back to me as I provided it before.

@ams-tschoening
Copy link
Contributor Author

ams-tschoening commented Jan 6, 2017

The only thing to consider I see is if one wants to make the raw data available always or only in case of substreams. But I think making it always available and let the user decide if/how to use is the easiest.

One could consider working with different instances of KaitaiStream for all user types to parse, much like is the case for substreams currently. KaitaiStream would need to be enhanced to work on top of another KaitaiStream and provide some housekeeping of current start index and such. All reads to the new, inner instance need to advance the current index of the underluying stream and if one user type has successfully been parsed the newly created stream is at the end of the data of that stream and has its start and could extract all the data belonging to the current type all the time. Much like ByteBuffer.duplicate() works and this could be adopted to all runtimes and languages in future.

private void _read() {
    this.length = this._io.readU1();
    KaitaiStream _io__raw_omsRec = new KaitaiStream(this._io);
    this.omsRec = new FmtOmsRec(_io__raw_omsRec);
    [...]
}

public KaitaiStream(KaitaiStream io) {
    fc = null;
    bb = io.asReadOnlyBuffer().duplicate();
}

asReadOnlyBuffer would need another implementation for this case to not provide the complete array backing the original KaitaiStream, but only extract the data relevant for the current type. Maybe using ByteBuffer.get and aching the result in the instance or such.

This way all user types would always get their associated data without the need to manually consider substreams. Would even make the compilers per language a bit easier because creating the KaitaiStream would now be done always and only the arg changes, either it's some byte[] or the parent stream.

@GreyCat
Copy link
Member

GreyCat commented Jan 6, 2017

Ok, let me elaborate on this a little bit.

First of all, KaitaiStream in its general sense (across the different languages and platforms) is a stream, and it's not really guaranteed to have all the bytes in memory (or even memory-mapped to some local file). Java implementation indeed uses a mmaped approach, which keeps things somewhat relatively simple, but other languages/platforms might not have that, so it's generally wise to think of KaitaiStream as a stream. If you want raw byte contents out of it — there's only one true way to do it — read it as bytes. If you have a KaitaiStruct-based object, this approach would always guaranteed to work in all supported languages (of course, syntax would change):

FmtOmsRec omsRec;
// omsRec somehow gets populated
omsRec._io().seek(0);
byte[] omsRecRaw = omsRec._io().readBytesFull();

Of course, this might be an overkill, but it's the only way that would definitely work everywhere. That said, in Java we can cut some corners and I don't see any harm in providing access to underlying ByteBuffer, especially if it can be done in read-only maneer.

Last, but not least, there's been a long-going initiative to actually reduce copying of raw byte buffers around and implementing proper substreams: kaitai-io/kaitai_struct#44 — that's probably what you'd also want to have.

@ams-tschoening
Copy link
Contributor Author

With your example and kaitai-io/kaitai_struct#44 in mind: Should all runtimes "officially" provide a way to get the bytes of their associated type? If yes, I'm going to close this issue and create one in kaitai_struct like kaitai-io/kaitai_struct#44 for all runtimes as a remainder and link all these 3 issues. If no, I'll only close here.

In Java it was easy to add as a workaround, in other runtimes it might not even be a problem currently because of private fields don't exist anyway or such. But in the end, in my point of view it's a design decision one needs to make and provide a proper API, maybe with your (unoptimized) default implementation in the worst case. A name like "asRoBuffer" simply wouldn't hold if a raw byte[] is returned and such.

In my opinion the runtimes should provide access to the raw data, simply because I needed it already. :-)

@GreyCat
Copy link
Member

GreyCat commented Jan 9, 2017

Runtimes in general do not have raw byte data, thus there can't be easy generic way to provide access to it. The only generic way is to do the actual "seek to 0; read all bytes; return".

It's possible to add exactly that to all runtimes, but I'm not sure if that's worth the effort, plus, it could encourage suboptimal programming style?

Java-only methods, given possible byte buffer implementation, are perfectly ok from my point of view.

@ams-tschoening
Copy link
Contributor Author

ams-tschoening commented Jan 9, 2017

[...]plus, it could encourage suboptimal programming style?

I don't see that, either users need all bytes, like me in some cases, or they don't care. It's perfectly fine to argue that this might be a rare case not worth adding an "API" to all runtimes and therefore users need to deal with this on their own, though.

Is this worth documenting? I could add a little paragraph to pitfalls, linking this discussion and such.

@ams-tschoening
Copy link
Contributor Author

Additionally, I would suggest simply making asReadOnlyBuffer available, using that one would have access to the underluying byte[] and can't break things too easily.

Just for the records: _io().asRoBuffer().array() doesn't work, a ReadOnlyBufferException gets thrown, which is documented. So one needs to read from the provided buffer using its API in any case and can't shortcircuit to the underlying raw data, its only that one doesn't necessarily need to copy the whole underlying byte[]. All depends on what the data is needed for and if a user accepts a ByteBuffer already, like e.g. many Java API is.

@ams-tschoening
Copy link
Contributor Author

I'm closing this in favour to some other statement, in the end I have my Java-runtime enhancement already. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants