Support "clamping" of stream start in KaitaiStream #579

Closed
pachoo opened this issue May 15, 2019 · 5 comments

@pachoo

pachoo commented May 15, 2019

Currently it's challenging to parse non-trivial structures from a stream if the position is not at the beginning. As an example:

Sample ksy

meta:
  id: Foo
  endian: le
seq:
  - id: offset
    type: u4
  - id: size
    type: u4
instances:
  body:
    pos: offset
    size: size

Sample Python Code demonstrating issue

import io

foo_data = b'\x08\x00\x00\x00\x04\x00\x00\x00\xAA\xBB\xCC\xDD'
foo1 = Foo.from_io(io.BytesIO(foo_data))
foo1.body
# Expected body
b'\xaa\xbb\xcc\xdd'

# Create a stream with 20 bytes of arbitrary data we'll skip over
stream_data = bytes(range(0x41, 0x41+20)) + foo_data
ios = io.BytesIO(stream_data)
ios.seek(20)
foo2 = Foo.from_io(ios)
foo2.body
# Actual
b'IJKL'
# Expected
# b'\xaa\xbb\xcc\xdd'

To support this, I'd request the ability to optionally "clamp" the stream in the KaitaiStream constructor, so that the original position is saved and a few operations such as seek(), pos() and size() work relative to it. Additionally, it'd be handy to have KaitaiStruct.from_io() support this parameter as well.

My kludge (for Python) is currently:

class ClampedKaitaiStream(KaitaiStream):
    def __init__(self, stream, clamp=False):
        super().__init__(stream)
        if clamp:
            self._orig_pos = stream.tell()
        else:
            self._orig_pos = None

    def seek(self, n):
        if self._orig_pos is not None:
            n += self._orig_pos
        super().seek(n)

    def size(self):
        full_size = super().size()
        if self._orig_pos is not None:
            full_size -= self._orig_pos
        return full_size

    def pos(self):
        p = self._io.tell()
        if self._orig_pos is not None:
            p -= self._orig_pos
        return p

@GreyCat
Member

GreyCat commented May 23, 2019

Um, can you elaborate how it's different from current substreams implementation, and proposed newer, more efficient substreams implementation (#44)?

@pachoo
Author

pachoo commented May 23, 2019

It's likely I'm missing something simple, but with the current KaitaiStream implementation (at least in Python), if I wanted to begin parsing an arbitrary structure mid-stream (as in the original example) I believe I'd have to read in the entire stream's data from the starting point on, then pass that. e.g.:

# Craft a stream with 20 bytes of junk data then our struct data
foo_data = b'\x08\x00\x00\x00\x04\x00\x00\x00\xAA\xBB\xCC\xDD'
junk_data = bytes(range(0x41, 0x41+20))
ios = io.BytesIO(junk_data + foo_data)
# Seek to the start of the structure in the stream
ios.seek(20)
# Read in the rest of the data, not ideal if this is large
remaining_stream = io.BytesIO(ios.read())
remaining_ks = KaitaiStream(remaining_stream)
foo = Foo(remaining_ks)
# Alternately what I'd usually do...
# foo = Foo.from_io(remaining_stream)
foo.body
# Expected body
b'\xaa\xbb\xcc\xdd'

Without "clamping" or creating a new stream with the start of the structure at position 0, the Foo structure cannot be parsed correctly mid-stream, since the body is located with an absolute seek (as opposed to a relative seek).
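
The absolute-seek behaviour can be reproduced without Kaitai at all. Below is a minimal sketch using only `io` and `struct`; `read_body` and its `base` parameter are hypothetical stand-ins for the generated `body` property, not part of any Kaitai API:

```python
import io
import struct


def read_body(stream, base=0):
    """Hypothetical stand-in for the generated `body` property.

    Reads the little-endian (offset, size) header at the current position,
    then seeks to base + offset. base=0 mimics the absolute seek.
    """
    offset, size = struct.unpack("<II", stream.read(8))
    saved = stream.tell()
    stream.seek(base + offset)
    body = stream.read(size)
    stream.seek(saved)  # restore position, as the generated code does
    return body


foo_data = b"\x08\x00\x00\x00\x04\x00\x00\x00\xAA\xBB\xCC\xDD"
stream = io.BytesIO(bytes(range(0x41, 0x41 + 20)) + foo_data)

stream.seek(20)
absolute = read_body(stream)           # seeks to 8, inside the junk data
stream.seek(20)
relative = read_body(stream, base=20)  # seeks to 28, the real body
```

Here `absolute` comes out as the junk bytes at positions 8..11, while `relative` is the intended body.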

Here's the python generated code for Foo.ksy in KaitaiStruct 0.8:

class Foo(KaitaiStruct):
    def __init__(self, _io, _parent=None, _root=None):
        self._io = _io
        self._parent = _parent
        self._root = _root if _root else self
        self._read()

    def _read(self):
        self.offset = self._io.read_u4le()
        self.size = self._io.read_u4le()

    @property
    def body(self):
        if hasattr(self, '_m_body'):
            return self._m_body if hasattr(self, '_m_body') else None
        # Seek to the stream's absolute position 'offset'
        _pos = self._io.pos()
        self._io.seek(self.offset)
        self._m_body = self._io.read_bytes(self.size)
        self._io.seek(_pos)
        return self._m_body if hasattr(self, '_m_body') else None

And here's some relevant KaitaiStream code:

class KaitaiStream(object):
    def __init__(self, io):
        self._io = io
        self.align_to_byte()

    def seek(self, n):
        self._io.seek(n)

    def pos(self):
        return self._io.tell()

    def read_bytes(self, n):
        if n < 0:
            raise ValueError(
                "requested invalid %d amount of bytes" %
                (n,)
            )
        r = self._io.read(n)
        if len(r) < n:
            raise EOFError(
                "requested %d bytes, but got only %d bytes" %
                (n, len(r))
            )
        return r

In my case at least (I haven't run the unit tests), using my ClampedKaitaiStream (or alternately adding that code to KaitaiStream) works, because it records the original stream position and makes seek(), pos() and size() operate relative to it.

This lets me parse the Foo structure mid-stream without reading the entire rest of the stream into memory, while the body's offset still points to the right place. In the example below, I'm imagining that the KaitaiStream class had code similar to ClampedKaitaiStream.

# Craft a stream with 20 bytes of junk data then our struct data
foo_data = b'\x08\x00\x00\x00\x04\x00\x00\x00\xAA\xBB\xCC\xDD'
junk_data = bytes(range(0x41, 0x41+20))
ios = io.BytesIO(junk_data + foo_data)
# Seek to the start of the structure in the stream
ios.seek(20)
# Just pass in the stream, the 'clamping' will save off the current position in the stream.
kios = KaitaiStream(ios, clamp=True)
foo = Foo(kios)
foo.body
# Expected body
b'\xaa\xbb\xcc\xdd'

Again, it's likely I'm missing something simple in the current implementation, but this is the difference that I see. With my proposed solution I could parse this structure which has an absolute offset from the middle of a stream without reading in the rest of the stream as data to make a new substream.

I haven't looked at the larger problem of improving substream efficiency (as well as handling all the related functionality and edge cases) as outlined in #44.

@GreyCat
Member

GreyCat commented May 23, 2019

It looks like you do in your sample code exactly what ksc will generate when given something like:

instances:
  my_foo:
    pos: 20
    size-eos: true
    type: foo

i.e. in Python, it will generate this:

self._io.seek(20) # seek
self._raw__m_my_foo = self._io.read_bytes_full() # read bytes till the end of stream
_io__raw__m_my_foo = KaitaiStream(BytesIO(self._raw__m_my_foo)) # create new KaitaiStream
self._m_my_foo = self._root.Foo(_io__raw__m_my_foo, self, self._root) # pass it to Foo

Basically, what you want (i.e. substreams "clamped" to the position and size of part of the original stream) is already supported in ksy, though not very efficiently. #44 addresses that and proposes a new, cleaner interface. So far it looks very close to what you propose: instead of several operations (seek + read bytes + create a new BytesIO out of the bytes + create a new KaitaiStream wrapping that BytesIO), it should be exactly one call, something like

io = self._io.substream(20, -1) # or just (20), if we're talking about substream to end of current stream
self._m_my_foo = self._root.Foo(io, self, self._root)
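
To illustrate the general idea of a substream that maps positions onto a window of the parent stream without copying, here is a rough standard-library sketch. `WindowStream` and its methods are hypothetical names chosen for this example; this is not the interface proposed in #44 and not part of the Kaitai runtime:

```python
import io


class WindowStream:
    """Hypothetical read-only window over a seekable stream (illustration only)."""

    def __init__(self, base, start, size=-1):
        self._base = base
        self._start = start
        if size < 0:
            # Window extends to the end of the underlying stream
            size = base.seek(0, io.SEEK_END) - start
        self._size = size
        self._pos = 0  # position relative to the window start

    def seek(self, n):
        self._pos = n

    def tell(self):
        return self._pos

    def size(self):
        return self._size

    def read(self, n=-1):
        remaining = max(self._size - self._pos, 0)
        if n < 0 or n > remaining:
            n = remaining  # never read past the end of the window
        self._base.seek(self._start + self._pos)
        data = self._base.read(n)
        self._pos += len(data)
        return data


foo_data = b"\x08\x00\x00\x00\x04\x00\x00\x00\xAA\xBB\xCC\xDD"
base = io.BytesIO(bytes(range(0x41, 0x41 + 20)) + foo_data)
win = WindowStream(base, 20)
win.seek(8)         # offset 8 inside the window is offset 28 in base
body = win.read(4)
```

Because each read re-seeks the underlying stream, the window keeps its own position and no bytes are copied up front; a real implementation would also need to handle writes, bit-level reads and nested windows.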

@pachoo
Author

pachoo commented May 23, 2019

I'll keep my eye out for #44, thanks for the feedback.

The actual use case that prompted this was parsing MachO executables ( https://github.com/kaitai-io/kaitai_struct_formats/blob/master/executable/mach_o.ksy ) out of Fat files, and I solved it in a way similar to what you describe: I created a new ksy file for the Fat format.

Going back to the initial use case proposed here, it sounds like if I did want to handle this case efficiently with the new substream changes, something like the following would work.

ios = io.BytesIO(junk_data + foo_data)
# ...
# Mucking around we find that we've got an interesting Foo structure at position 20...
kios = KaitaiStream(ios)
foo = Foo(kios.substream(20))
foo.body
b'\xaa\xbb\xcc\xdd'

Feel free to close as duplicate of #44 :)

@GreyCat
Member

GreyCat commented May 23, 2019

Thanks for confirming! Let's continue the discussion in #44. I see that you already have an implementation for Python; please consider contributing it as part of that effort.

GreyCat closed this as completed May 23, 2019