read_bytes(): add extra length check #61

armijnhemel · 2021-10-31T17:50:34Z

read_bytes() could benefit from an extra length check. I am using Kaitai Struct for parsing, and certain file types produce plenty of false positives, so I sometimes end up reading a lot of extra data. A check that the number of bytes to be read does not exceed the stream size or file size would avoid this. The current implementation:

    def read_bytes(self, n):
        if n < 0:
            raise ValueError(
                "requested invalid %d amount of bytes" %
                (n,)
            )
        r = self._io.read(n)
        if len(r) < n:
            raise EOFError(
                "requested %d bytes, but got only %d bytes" %
                (n, len(r))
            )
        return r

could, for example, be rewritten to something like:

    def read_bytes(self, n):
        if n < 0:
            raise ValueError(
                "requested invalid %d amount of bytes" %
                (n,)
            )
        if n > self.size():
            raise ValueError(
                "requested to read %d bytes, but only %d available" %
                (n, self.size())
            )
        r = self._io.read(n)
        if len(r) < n:
            raise EOFError(
                "requested %d bytes, but got only %d bytes" %
                (n, len(r))
            )
        return r

or something similar.
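
One caveat with comparing against `self.size()` alone: it only catches requests larger than the whole stream, not requests that run past the end from the current position. A minimal sketch of a check against the remaining bytes follows. `MiniStream` is a hypothetical stand-in for the runtime's stream class (not its actual code), and it assumes an in-memory buffer whose size is known up front:

```python
import io


class MiniStream:
    """Hypothetical stand-in for the runtime's stream class (illustration only)."""

    def __init__(self, data):
        self._io = io.BytesIO(data)
        self._size = len(data)  # known up front for an in-memory buffer

    def pos(self):
        return self._io.tell()

    def size(self):
        return self._size

    def read_bytes(self, n):
        if n < 0:
            raise ValueError("requested invalid %d amount of bytes" % (n,))
        available = self.size() - self.pos()
        if n > available:
            # Fail before reading anything, so the position stays unchanged.
            raise EOFError(
                "requested %d bytes, but only %d bytes available" % (n, available)
            )
        return self._io.read(n)


stream = MiniStream(b"abcdef")
first = stream.read_bytes(4)
try:
    stream.read_bytes(4)  # only 2 bytes are left at this point
    error = None
except EOFError as exc:
    error = str(exc)
```

With a plain `n > self.size()` check, the second call above would pass the check (4 is not larger than 6) and only fail after the short read.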

Right now I am trying to work around this by adding extra boundary checks to the .ksy files that look at the size of the file, but that's a rather ugly hack.

generalmimon · 2021-10-31T18:51:07Z

kaitai-io/kaitai_struct_cpp_stl_runtime#46 does just that, only in C++, not Python.

armijnhemel · 2021-10-31T18:58:37Z

> kaitai-io/kaitai_struct_cpp_stl_runtime#46 does just that, only in C++, not Python.

I read some of the comments, and the one about infinite streams makes sense. For me personally this is not relevant at all, but I guess it might be relevant for others. Would it perhaps be possible to add a variant and make it conditional on, for example, an environment variable, with the current implementation being the default?

generalmimon · 2021-10-31T20:21:06Z

> Would it perhaps be possible to add a variant and making it conditional upon for example an environment variable, with the current implementation being the default?

I think we're actually looking for io.IOBase.seekable() here in Python:

> class io.IOBase
>
> ...
>
> seekable()
>
> Return True if the stream supports random access. If False, seek(), tell() and truncate() will raise OSError.
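
A minimal sketch of how `seekable()` could gate the check, so that pipe-like or infinite streams keep the existing read-then-verify behavior. The names `NonSeekable`, `bytes_available` and `read_checked` are hypothetical helpers for illustration, not part of the runtime:

```python
import io


class NonSeekable(io.RawIOBase):
    """Hypothetical pipe-like stream: readable, but without random access."""

    def __init__(self, data):
        super().__init__()
        self._buf = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        chunk = self._buf.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)


def bytes_available(stream):
    """Return the remaining byte count, or None when the stream cannot seek."""
    if not stream.seekable():
        return None
    cur = stream.tell()
    end = stream.seek(0, io.SEEK_END)
    stream.seek(cur)  # restore the original position
    return end - cur


def read_checked(stream, n):
    available = bytes_available(stream)
    if available is not None and n > available:
        raise EOFError(
            "requested %d bytes, but only %d bytes available" % (n, available)
        )
    data = stream.read(n)
    if len(data) < n:
        # Still needed for non-seekable streams, where no pre-check is possible.
        raise EOFError(
            "requested %d bytes, but got only %d bytes" % (n, len(data))
        )
    return data


print(bytes_available(io.BytesIO(b"abc")))
print(bytes_available(NonSeekable(b"abc")))
```

This avoids the need for an environment variable: the stream itself tells us whether a size check is possible.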

generalmimon · 2022-04-22T20:02:58Z

@dgelessus @armijnhemel

I've implemented this in 349a861, but I wonder if this won't slow down the overall parsing significantly - the size() method has to call both _io.tell() and _io.seek() twice (first to the end of stream and then restore the current position) and pos() means another _io.tell(). The method self.read_bytes() that I modified here is used even by small primitive types like u1, s4 etc., and this additional overhead of seek()+tell() may be significant here.

Even just calling seekable() may contribute to the issue (but does not have to), because the docstring in IOBase.seekable() in my Python 3.9 installation says "This method may need to do a test seek()." (%LOCALAPPDATA%\Programs\Python\Python39\Lib\_pyio.py:435):

    def seekable(self):
        """Return a bool indicating whether object supports random access.

        If False, seek(), tell() and truncate() will raise OSError.
        This method may need to do a test seek().
        """
        return False

But this potential issue just randomly occurred to me and is not confirmed. Any chance you can do a benchmark? If it's really an issue, I think we can introduce a threshold on the number of bytes requested, above which it's actually cheaper to call _io.size() - _io.pos() first than to call _io.read() unconditionally and only then realize that fewer bytes were received than requested, making the entire read unnecessary. However, this threshold should ideally be based on benchmark results, not set arbitrarily.
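
A rough sketch of what such a benchmark could look like, using `timeit` on an in-memory stream. The function names are hypothetical, and `io.BytesIO` is an assumption made for reproducibility: seeks on real files or buffered readers may cost considerably more, so numbers from this sketch would only be a lower bound on the overhead:

```python
import io
import timeit

DATA = bytes(1024 * 1024)  # 1 MiB of zero bytes as sample input


def read_unchecked(stream, n):
    """Previous behavior: read first, detect a short read afterwards."""
    data = stream.read(n)
    if len(data) < n:
        raise EOFError(
            "requested %d bytes, but got only %d bytes" % (n, len(data))
        )
    return data


def read_prechecked(stream, n):
    """Pre-check variant: tell() and seek() twice for size, tell() for pos."""
    cur = stream.tell()
    stream.seek(0, io.SEEK_END)
    full_size = stream.tell()
    stream.seek(cur)
    pos = stream.tell()
    if n > full_size - pos:
        raise EOFError(
            "requested %d bytes, but only %d bytes available"
            % (n, full_size - pos)
        )
    return read_unchecked(stream, n)


def time_reads(reader, n, count=10000):
    """Time `count` consecutive n-byte reads from a fresh in-memory stream."""
    def run():
        stream = io.BytesIO(DATA)
        for _ in range(count):
            reader(stream, n)
    # Take the best of three runs to reduce noise.
    return min(timeit.repeat(run, number=1, repeat=3))


for n in (1, 4, 64):
    plain = time_reads(read_unchecked, n)
    checked = time_reads(read_prechecked, n)
    print("n=%d: unchecked %.4fs, prechecked %.4fs" % (n, plain, checked))
```

Running the same loop against a file opened with `open(path, 'rb')` would give a more realistic picture for choosing a threshold.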

armijnhemel · 2022-04-23T12:48:20Z

I still need to dig into this deeper but one call to tell() is superfluous according to https://docs.python.org/3/library/io.html#io.IOBase.seek :

> Change the stream position to the given byte offset. offset is interpreted relative to the position indicated by whence. The default value for whence is SEEK_SET. Values for whence are:
>
> [...]
>
> Return the new absolute position.

This can be easily verified:

    >>> import os
    >>> bla = open('/bin/ls', 'rb')
    >>> bla.seek(0, os.SEEK_END)
    137912

which is the correct size of the binary:

    $ ls -l /bin/ls
    -rwxr-xr-x 1 root root 137912 Jul  7  2021 /bin/ls

So this:

        # Seek to the end of the File object
        io.seek(0, SEEK_END)
        # Remember position, which is equal to the full length
        full_size = io.tell()

can be turned into:

        # Seek to the end of the File object and store the full length
        full_size = io.seek(0, SEEK_END)

See #61 (comment)
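
Put together, a size helper built on the return value of `seek()` might look like the sketch below. `stream_size` is a hypothetical name, and the shortcut assumes Python 3 `io` streams. Python 2's built-in `file.seek()` returns `None`, so it would not work there:

```python
import io
import os


def stream_size(stream):
    """Return the total length of a seekable Python 3 binary stream."""
    cur = stream.tell()
    try:
        # seek() returns the new absolute position, i.e. the full length.
        return stream.seek(0, os.SEEK_END)
    finally:
        # Always restore the original position, even if seek() raised.
        stream.seek(cur)


stream = io.BytesIO(b"0123456789")
stream.seek(3)
size = stream_size(stream)
```

Here `size` is the full length of the buffer, and the stream position is left where it was before the call.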

generalmimon · 2022-04-23T14:09:59Z

@armijnhemel I applied your suggestion, thanks for looking into this.

generalmimon closed this as completed in 349a861 Apr 8, 2022

generalmimon added a commit that referenced this issue Apr 23, 2022

size(): remove redundant io.tell() (suggested by @armijnhemel)

255f5b7

See #61 (comment)

dgelessus mentioned this issue Jul 6, 2022

read_bytes(): use previous implementation again for small reads #68

Merged

generalmimon mentioned this issue Nov 11, 2022

size() does not work in Python 2 in a KaitaiStream backed by a 'file' object #72

Closed

generalmimon mentioned this issue Jul 28, 2023

Ensure that the stream state after an EOF error is the same as before kaitai-io/kaitai_struct#1061

Open
