Excellent tool. Any way to see the keys available on a large json? #1
I only just spotted this, sorry. Is this essentially what you want:
and have that dump the keys as a list to stdout? Is this massive data an array or an object at the top level?
Hey! Yes. Something like this, but being a
Should give:
Basically, a way to inspect all of the possible JSON prefixes that I can feed to json2nd. In my case it is a massive object at the top level.
You sort of can't specify these as paths to json2nd because the array gets in the way. You can do this kind of thing in jq (albeit a bit more slowly), though it struggles with massive data. json2nd can't go through an array to find another one. So from:
You can have this in jq, though, since you can jump through your arrays:
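The jq snippets didn't survive here, but the "jump through arrays" idea can be sketched in plain Python. The key names `batches` and `events` below are invented for illustration; the point is that you iterate the outer array to reach each inner array and emit its elements one per line:

```python
import json

# Hypothetical input: an object whose outer array elements themselves
# contain the arrays we actually want (key names are made up).
doc = json.loads("""
{"batches": [
  {"id": 1, "events": [{"kind": "a"}, {"kind": "b"}]},
  {"id": 2, "events": [{"kind": "c"}]}
]}
""")

# jq's `.batches[].events[]` jumps through the outer array;
# the same walk written out by hand:
for batch in doc["batches"]:
    for event in batch["events"]:
        print(json.dumps(event))
```

json2nd's path syntax can't express this walk because the outer array sits between the root and the target arrays.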
But is what you really want just a manual map of the massive JSON? So more like:
? I guess it gets confusing if the records aren't uniform like:
The original keys thing is sort of okay if you just stop at arrays :)
That makes sense! Yeah, I understand json2nd can't look into arrays. Then it would be more like:
I think this is the original keys thing that you mentioned, then, but possibly inside nested objects (so I don't have to run it multiple times).
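That "keys inside nested objects, stopping at arrays" behaviour can be sketched in stdlib Python (the sample document below is made up; a real implementation would stream rather than load the whole file):

```python
import json

def keys(node, prefix=""):
    """Yield dotted key paths, recursing into objects
    but stopping at arrays and scalars."""
    if isinstance(node, dict):
        for k, v in node.items():
            path = f"{prefix}.{k}" if prefix else k
            yield path
            yield from keys(v, path)

doc = json.loads('{"meta": {"version": 1}, "records": [1, 2, 3]}')
print(list(keys(doc)))
```

Here `records` is reported as a prefix but its array elements are not descended into, which is what makes each reported path usable as a json2nd target.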
Try this?
Yes, this looks like what I need! I'll let you know how it goes.
Just used it - it is exactly what I was asking for. It beats Python ijson, but somehow it is ~15 times slower than json2nd. Thank you for jumping straight on implementing this, I didn't expect that. It is a good tool to complement json2nd!
Unlike json2nd, it uses the Go parser, so it's doing heaps more work. I'd like to abstract the JSON scanner logic out of json2nd so I can use it here as well. However, it should make it through large JSON files at least. I'll leave this issue open for the day when I get an opportunity to do that, and you can benchmark it against this version :)
Makes sense!
This tool is amazing. I was having trouble ingesting JSON data with DuckDB because the object sizes were too large. With this I can just pick the large arrays apart into separate objects and DuckDB ingests without trouble! (.jsonl files are much more efficient for DuckDB ingestion because instead of holding whole arrays in memory it instantiates individual objects.)
If you or anyone could recommend a tool to see the top-level keys of a JSON file (or the keys available under a given path), that would make my life even better. If you know such a tool, please recommend it :) I'm dealing with ~100+ GB files and the only approach I can think of is the slow Python ijson module. Thankfully most files follow a schema, so I usually know the keys.
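For a small file, the top-level-keys question is a one-liner in stdlib Python; the sketch below is only illustrative, because `json.loads` reads the entire document into memory, which is exactly what breaks down at ~100 GB. A streaming parser such as ijson does the same walk incrementally:

```python
import json

def top_level_keys(text):
    """Return the top-level keys of a JSON document.
    Note: this parses the whole document in memory, so it is
    only viable for small inputs; huge files need a streaming
    parser (e.g. ijson) doing the equivalent walk."""
    doc = json.loads(text)
    return list(doc) if isinstance(doc, dict) else []

print(top_level_keys('{"users": [], "meta": {"n": 0}}'))
```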