Excellent tool. Any way to see the keys available on a large json? #1

Open
felix-hh opened this issue Oct 2, 2023 · 9 comments

@felix-hh

felix-hh commented Oct 2, 2023

This tool is amazing. I was having trouble ingesting JSON data using DuckDB because the objects were too large. With this I can just pick the large arrays apart into separate objects, and DuckDB ingests them without trouble! (.jsonl files are much more efficient for DuckDB ingestion because, instead of holding whole arrays in memory, you instantiate individual objects.)

If you or anyone else could recommend a tool to see the top-level keys of a JSON file (or the keys available under a given path), that would make my life even better. If you know of such a tool, please recommend it :) I'm dealing with files of 100+ GB, and the only approach I can think of is the slow Python ijson module. Thankfully most of them follow a schema, so I usually know the keys.

@draxil
Owner

draxil commented Jan 26, 2024

I only just spotted this, sorry.

Is this essentially what you want:

jsonkeys massive.json

and have that dump the keys as a list to stdout?

Is this massive data an array or object at the top level?

@felix-hh
Author

Hey! Yes. Something like this, but for a massive.json that does not fit in memory:

// myjson.json
{
  "a": "value1",
  "b": "value2",
  "c": [
    {"a2": [{"a3": 1}, {"a3": 2}], "b2": "value3"},
    {"a2": [{"a3": 3}, {"a3": 4}], "b2": "value4"}
  ]
}

Should give

> jsonkeys myjson.json
a
b
c
c.a2
c.a2.a3
c.b2

Basically, a way to inspect all of the possible json prefixes that I can feed to json2nd. In my case it is a massive object at the top level.

@draxil
Owner

draxil commented Jan 26, 2024

You sort of can't specify these as paths to json2nd, because the array gets in the way. You can do this kind of thing in jq (albeit a bit more slowly), though it struggles with massive data.

json2nd can't go through an array to find another one. So from:

{
    "a": "one",
    "b": "two",
    "c": [
	{
	    "a2": [{"a3": 1}]
	}
    ]
}

You can have
json2nd -path c sample.json
or
json2nd -path d.e sample.json
but not
json2nd -path c.a2 sample.json

In jq, though, you can jump through your arrays:

jq '.c[] | .a2[] | .a3' sample.json
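
For anyone more at home in Python than jq, here is roughly what that filter does, as a minimal sketch. It assumes the sample is small enough to load with the standard `json` module, so it is only an illustration of the traversal and not something to point at a 100 GB file. The function name `a3_values` is made up for this example.

```python
import json

# The small sample from above, parsed fully into memory (fine for a toy input).
sample = json.loads("""
{
    "a": "one",
    "b": "two",
    "c": [
        {"a2": [{"a3": 1}]}
    ]
}
""")

def a3_values(doc):
    """Yield every a3 value by stepping through both array levels."""
    for record in doc["c"]:          # iterate the outer array under "c"
        for inner in record["a2"]:   # iterate the inner array under "a2"
            yield inner["a3"]        # pick the "a3" field of each element

print(list(a3_values(sample)))
```

Each `for` loop corresponds to one array iteration in the jq filter, which is exactly the step json2nd does not take when it resolves a path.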

But is what you really want just a manual map of the massive JSON?

So more like

jsonmap sample.json
a
b
c[].a2[].a3
d.e

?

I guess it gets confusing if the records aren't uniform, like:

{
    "c": [
	{
	    "a2": [{"a3": 1}]
	},
	{
	    "x": {"y":2}
	}
    ]
}

The original keys thing is sort of okay if you just stop at arrays :)
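
One way to picture such a map for non-uniform records is to take the union of every path seen, writing `[]` for each array level. The following is a minimal in-memory sketch of that idea in Python, assuming the data fits in memory; `json_map` is a hypothetical name and this is not how json2nd or any existing tool is implemented.

```python
import json

def json_map(node, prefix=""):
    """Return the set of leaf paths under node, writing [] for each array level."""
    if isinstance(node, dict):
        paths = set()
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else key
            paths |= json_map(value, child)
        # An empty object is reported as a leaf at its own path.
        return paths or ({prefix} if prefix else set())
    if isinstance(node, list):
        paths = set()
        for item in node:
            paths |= json_map(item, prefix + "[]")
        # An empty array is still reported, marked with [].
        return paths or {prefix + "[]"}
    return {prefix}  # scalar leaf

doc = json.loads('{"c": [{"a2": [{"a3": 1}]}, {"x": {"y": 2}}]}')
for path in sorted(json_map(doc)):
    print(path)
```

With the non-uniform sample above this lists both `c[].a2[].a3` and `c[].x.y`, so differing records simply contribute extra paths and nothing gets lost.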

@felix-hh
Author

That makes sense! Yeah, I understand json2nd can't look into arrays. Then it would be more like:

{
    "a": "one",
    "b": "two",
    "c": [
	{
	    "a2": [{"a3": 1}]
	}
    ],
    "d": {"e": [...array], "f": [...array]}
}

a
b
c
d
d.e
d.f

I think this is the original keys thing that you mentioned then, but possibly inside nested objects (to not have to run it multiple times).
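
To pin down the behaviour being asked for, here is a minimal Python sketch that lists dotted key paths, descends into nested objects, and stops at arrays. It loads the whole document with the standard `json` module, so it only illustrates the expected output on a small sample and says nothing about how a streaming tool would do it. The name `key_paths` and the array contents under `d.e` and `d.f` are placeholders invented for the example.

```python
import json

def key_paths(node, prefix=""):
    """List dotted key paths, descending into objects but never into arrays."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else key
            paths.append(child)
            # Arrays and scalars return [] here, so traversal stops at them.
            paths.extend(key_paths(value, child))
    return paths

sample = json.loads("""
{
    "a": "one",
    "b": "two",
    "c": [{"a2": [{"a3": 1}]}],
    "d": {"e": [1, 2], "f": [3]}
}
""")

print("\n".join(key_paths(sample)))
```

On this sample it prints a, b, c, d, d.e and d.f, one per line, and never reports c.a2, because that key sits behind an array.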

@draxil
Owner

draxil commented Jan 27, 2024

Try this?
https://github.com/draxil/jsonkeys

@felix-hh
Author

Yes, this looks like what I need! I'll let you know how it goes.

@felix-hh
Author

felix-hh commented Jan 29, 2024

Just used it, and it is exactly what I was asking for. It beats Python's ijson, but somehow it is ~15 times slower than json2nd. Thank you for jumping on implementing this; I didn't expect that. It is a good tool to complement json2nd!

@draxil
Owner

draxil commented Jan 30, 2024

Unlike json2nd, it uses the Go parser, so it's doing heaps more work.

I'd like to abstract the JSON scanner logic of json2nd so I can use it here as well.

However, it should at least make it through large JSON files.

I'll leave this issue open for the day when I get an opportunity to do that and you can benchmark it against this version :)

@felix-hh
Author

Makes sense!
