
data_length makes less sense when data is a nested dictionary rather than a json string #7030

Open
zachliu opened this issue Jun 24, 2024 · 1 comment


zachliu commented Jun 24, 2024

Issue Summary

Before this PR #6687, the data returned by query runners was a JSON string, so the data_length computed by len(data) made sense:

logger.info(
    "job=execute_query query_hash=%s ds_id=%d data_length=%s error=[%s]",
    self.query_hash,
    self.data_source_id,
    data and len(data),
    error,
)

But after #6687, data is a nested dictionary, and len(data) only gives the number of top-level keys. In most cases there are just two, "columns" and "rows", so data_length is almost always 2 and tells us nothing useful. A minimal sketch of the mismatch follows.
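
For illustration, a rough sketch of the problem (the exact shape of the post-#6687 result dict is assumed here):

import json

# Hypothetical query result in the post-#6687 shape: a dict, not a JSON string.
data = {
    "columns": [{"name": "id", "type": "integer"}],
    "rows": [{"id": 1}, {"id": 2}, {"id": 3}],
}

print(len(data))              # 2 -- just the number of top-level keys
print(len(json.dumps(data)))  # length of the serialized string, i.e. the old metric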

Steps to Reproduce

Search for data_length= in your logs.
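
Given the format string above, a matching log line looks roughly like this (values are illustrative):

job=execute_query query_hash=abc123 ds_id=1 data_length=2 error=[None]

Note that data_length is 2 regardless of how many rows the query returned.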

Technical details:

  • Redash Version: 24.06.0-dev

zachliu commented Jun 27, 2024

I replaced len(data) with:

import sys
from collections import deque


def _get_size_iterative(dict_obj):
    """Iteratively finds size of objects in bytes"""
    seen = set()  # ids of objects already counted, so shared references aren't double-counted
    size = 0
    objects = deque([dict_obj])

    while objects:
        current = objects.popleft()
        if id(current) in seen:
            continue
        seen.add(id(current))
        size += sys.getsizeof(current)

        if isinstance(current, dict):
            # Walk both keys and values of nested dicts
            objects.extend(current.keys())
            objects.extend(current.values())
        elif hasattr(current, "__dict__"):
            # Instances: walk their attribute dict
            objects.append(current.__dict__)
        elif hasattr(current, "__iter__") and not isinstance(current, (str, bytes, bytearray)):
            # Other iterables (lists, tuples, sets), excluding string-like leaves
            objects.extend(current)

    return size

It works fine. The in-memory size of a dictionary is usually much larger than its on-disk size (e.g. as a CSV file) because of Python's per-object storage overhead, but it at least gives a relative measure. That is especially useful for me because I feed data_length into a DataDog dashboard to monitor the size of users' query results. A sketch of the call site follows.
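
For context, a minimal sketch of the replacement at the logging call site (the surrounding code is assumed to match the snippet from the issue summary):

logger.info(
    "job=execute_query query_hash=%s ds_id=%d data_length=%s error=[%s]",
    self.query_hash,
    self.data_source_id,
    data and _get_size_iterative(data),  # deep in-memory size in bytes instead of len(data)
    error,
)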
