Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_bytes(): add extra length check #61

Closed
armijnhemel opened this issue Oct 31, 2021 · 6 comments
Closed

read_bytes(): add extra length check #61

armijnhemel opened this issue Oct 31, 2021 · 6 comments

Comments

@armijnhemel
Copy link

armijnhemel commented Oct 31, 2021

read_bytes() could benefit from an extra length check. I am using kaitai struct for parsing and there are plenty of false positives for certain file types. I sometimes end up reading a lot of extra data. With an extra check to see if the amount of bytes to be read is smaller than the stream size or file size this would be avoided:

    def read_bytes(self, n):
        if n < 0:
            raise ValueError(
                "requested invalid %d amount of bytes" %
                (n,)
            )
        r = self._io.read(n)
        if len(r) < n:
            raise EOFError(
                "requested %d bytes, but got only %d bytes" %
                (n, len(r))
            )
        return r

for example could be rewritten to something like:

    def read_bytes(self, n):
        if n < 0:
            raise ValueError(
                "requested invalid %d amount of bytes" %
                (n,)
            )
        if n > self.size():
            raise ValueError(
                 "requested to read %d bytes, but only %d available" %
                 (n, self.size()))
        r = self._io.read(n)
        if len(r) < n:
            raise EOFError(
                "requested %d bytes, but got only %d bytes" %
                (n, len(r))
            )
        return r

or something similar.

Right now I am trying to work around this by adding extra boundary checks in the .ksy files that looks at the size of the file, but that's a rather ugly hack.

@generalmimon
Copy link
Member

kaitai-io/kaitai_struct_cpp_stl_runtime#46 does just that, only in C++, not Python.

@armijnhemel
Copy link
Author

kaitai-io/kaitai_struct_cpp_stl_runtime#46 does just that, only in C++, not Python.

I read some of the comments and the one about infinite streams makes sense. For me personally this is not relevant at all, but I guess it might be relevant for others. Would it perhaps be possible to add a variant and making it conditional upon for example an environment variable, with the current implementation being the default?

@generalmimon
Copy link
Member

generalmimon commented Oct 31, 2021

Would it perhaps be possible to add a variant and making it conditional upon for example an environment variable, with the current implementation being the default?

I think we're actually looking for io.IOBase.seekable() here in Python:

class io.IOBase
...
seekable()
Return True if the stream supports random access. If False, seek(), tell() and truncate() will raise OSError.

@generalmimon
Copy link
Member

@dgelessus @armijnhemel

I've implemented this in 349a861, but I wonder if this won't slow down the overall parsing significantly - the size() method has to call both _io.tell() and _io.seek() twice (first to the end of stream and then restore the current position) and pos() means another _io.tell(). The method self.read_bytes() that I modified here is used even by small primitive types like u1, s4 etc., and this additional overhead of seek()+tell() may be significant here.

Even just calling seekable() may contribute to the issue (but does not have to), because the docstring in IOBase.seekable() in my Python 3.9 installation says "This method may need to do a test seek()." (%LOCALAPPDATA%\Programs\Python\Python39\Lib\_pyio.py:435):

    def seekable(self):
        """Return a bool indicating whether object supports random access.

        If False, seek(), tell() and truncate() will raise OSError.
        This method may need to do a test seek().
        """
        return False

But this potential issue just randomly occurred to me, it's not confirmed, any chance you can do a benchmark? If it's really an issue, I think we can introduce a threshold on the number of bytes requested where it's actually cheaper to call _io.size() - _io.pos() first than to call _io.read() unconditionally and then realize that the number of bytes actually received is less than requested, therefore the entire read was unnecessary. However, this threshold should ideally be based on benchmark results, not set arbitrarily.

@armijnhemel
Copy link
Author

@dgelessus @armijnhemel

I've implemented this in 349a861, but I wonder if this won't slow down the overall parsing significantly - the size() method has to call both _io.tell() and _io.seek() twice (first to the end of stream and then restore the current position) and pos() means another _io.tell(). The method self.read_bytes() that I modified here is used even by small primitive types like u1, s4 etc., and this additional overhead of seek()+tell() may be significant here.

Even just calling seekable() may contribute to the issue (but does not have to), because the docstring in IOBase.seekable() in my Python 3.9 installation says "This method may need to do a test seek()." (%LOCALAPPDATA%\Programs\Python\Python39\Lib\_pyio.py:435):

    def seekable(self):
        """Return a bool indicating whether object supports random access.

        If False, seek(), tell() and truncate() will raise OSError.
        This method may need to do a test seek().
        """
        return False

But this potential issue just randomly occurred to me, it's not confirmed, any chance you can do a benchmark? If it's really an issue, I think we can introduce a threshold on the number of bytes requested where it's actually cheaper to call _io.size() - _io.pos() first than to call _io.read() unconditionally and then realize that the number of bytes actually received is less than requested, therefore the entire read was unnecessary. However, this threshold should ideally be based on benchmark results, not set arbitrarily.

I still need to dig into this deeper but one call to tell() is superfluous according to https://docs.python.org/3/library/io.html#io.IOBase.seek :

Change the stream position to the given byte offset. offset is interpreted relative to the position indicated by whence. The default value for whence is SEEK_SET. Values for whence are:

[...]

Return the new absolute position.

This can be easily verified:

>>> bla = open('/bin/ls', 'rb')
>>> bla.seek(0, os.SEEK_END)
137912

which is the correct size of the binary:

$ ls -l /bin/ls
-rwxr-xr-x 1 root root 137912 Jul  7  2021 /bin/ls

So this:

        # Seek to the end of the File object
        io.seek(0, SEEK_END)
        # Remember position, which is equal to the full length
        full_size = io.tell()

can be turned into:

        # Seek to the end of the File object and store the full length
        full_size = io.seek(0, SEEK_END)

@generalmimon
Copy link
Member

@armijnhemel I applied your suggestion, thanks for looking into this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants