Maximum call stack size exceeded on a large string, increasing the process' stack size works tho. #101

klnvnv · 2018-11-25T10:28:33Z

Hey man, great lib!

The stack blows up when trying to parse() some large strings, around a million characters or so.

I didn't look into how you're parsing the HTML string, so I don't know exactly what the reason is: too long of a string, too many HTML elements, or too many levels in the tree. The files I've got are between one and two MB, ~5-6k elements, and the elements should be less than ten levels nested.

I'm on a Mac, where the default stack size is 8k; when I increase the stack for the node process to 64k, it works OK.
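
To see how a given stack limit translates into usable recursion depth, a small probe like the one below may help. It is only a sketch: `measureMaxDepth` is a made-up helper name, and the number it reports depends on the Node version, the frame size of the function being called, and the stack limit in effect.

```javascript
// Hypothetical helper: counts how many nested calls succeed before V8
// throws "Maximum call stack size exceeded" (a RangeError).
function measureMaxDepth() {
  let depth = 0;

  function recurse() {
    depth++;
    recurse();
  }

  try {
    recurse();
  } catch (error) {
    // Stack exhaustion surfaces as a RangeError; rethrow anything else.
    if (!(error instanceof RangeError)) throw error;
  }
  return depth;
}

console.log(`recursion depth reached: ${measureMaxDepth()}`);
```

Running the same file with V8's `--stack-size` flag (value in KB, for example `node --stack-size=2000 probe.js`, where `probe.js` is a placeholder file name) should report a larger depth. Setting that flag higher than the OS stack limit allows can crash the process instead of throwing, so the OS limit may need raising first.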

andrejewski · 2018-11-25T18:58:34Z

A reproducible test case with data would be really helpful!

There are a few functions where recursion is used. From a quick reading, they all relate to nesting levels. I am surprised files this large are not more than 10 levels deep; a stack overflow at such shallow nesting might indicate a parsing bug.
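
As a generic illustration of why recursion tied to nesting levels can exhaust the stack, and how an explicit stack avoids it, here is a minimal sketch. It is not Himalaya's code; the `{ children: [...] }` node shape and all function names are assumptions made for the example.

```javascript
// Hypothetical node shape: { children: [...] }. Not Himalaya's internals.

// Recursive version: uses one call-stack frame per nesting level.
function maxDepthRecursive(node) {
  let deepest = 0;
  for (const child of node.children) {
    deepest = Math.max(deepest, maxDepthRecursive(child));
  }
  return 1 + deepest;
}

// Iterative version: pending work lives in a heap-allocated array,
// so call-stack use stays constant however deep the tree is.
function maxDepthIterative(root) {
  let deepest = 0;
  const pending = [{ node: root, depth: 1 }];
  while (pending.length > 0) {
    const { node, depth } = pending.pop();
    deepest = Math.max(deepest, depth);
    for (const child of node.children) {
      pending.push({ node: child, depth: depth + 1 });
    }
  }
  return deepest;
}

// Builds a single chain nested `levels` deep, without recursion.
function buildChain(levels) {
  const root = { children: [] };
  let current = root;
  for (let i = 1; i < levels; i++) {
    const child = { children: [] };
    current.children.push(child);
    current = child;
  }
  return root;
}
```

Both functions agree on small trees; only the iterative one is safe on a chain hundreds of thousands of levels deep, because its memory comes from the heap rather than the call stack.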

I'm glad you found a workaround. I think optimizing these recursion problems is worth some time.

It's also been a goal of mine for quite some time to make Himalaya stream-able such that the whole string doesn't need to be loaded into memory at once. That's a much larger change, so maybe fixing these recursions can buy me some more time.

klnvnv · 2018-11-25T20:37:32Z

Sorry man, I wish I could share it, but it's some private user data. I've been trying for about an hour to make some files that reproduce it. A million bare divs nested a dozen levels deep parse fine. Some random HTML and text repeated 1M times doesn't make it hiccup either. All I manage to do is run out of heap. I could parse a 100 MB string without a hitch. Good on you!!!

In the original files there's a lot of text and a lot of links, and the tags have a lot of attributes. The deepest nesting goes up to 13 levels, and there are a lot of tables with nested tags and text. I can't easily recreate the structure in an automatic way, sorry. Maybe you could give it a try?
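
One way to approximate that structure automatically is a small generator for nested tables with links, attributes, and text. This is only a guess at what the private files look like; every tag, attribute, and function name below is invented for illustration.

```javascript
// Builds one table row whose cells carry several attributes, a link, and text.
function makeRow(rowIndex, cellCount) {
  let cells = '';
  for (let c = 0; c < cellCount; c++) {
    cells +=
      `<td class="cell c${c}" data-row="${rowIndex}" align="left">` +
      `<a href="/item/${rowIndex}/${c}" title="Item ${rowIndex}-${c}">link ${c}</a>` +
      ` some text</td>`;
  }
  return `<tr id="row-${rowIndex}">${cells}</tr>`;
}

// Builds a table; `inner` (if given) is placed in an extra cell at the end.
function makeTable(rowCount, cellCount, inner = '') {
  let rows = '';
  for (let r = 0; r < rowCount; r++) {
    rows += makeRow(r, cellCount);
  }
  const nested = inner ? `<tr><td>${inner}</td></tr>` : '';
  return `<table border="0" cellpadding="2"><tbody>${rows}${nested}</tbody></table>`;
}

// Wraps tables inside each other `depth` times, building from the inside out.
// Each level adds four levels of nesting: table > tbody > tr > td.
function makeNestedTables(depth, rowCount, cellCount) {
  let html = '';
  for (let d = 0; d < depth; d++) {
    html = makeTable(rowCount, cellCount, html);
  }
  return html;
}
```

The resulting string can be handed to Himalaya's `parse`, for example `parse(makeNestedTables(3, 500, 8))`, and the three parameters varied until the overflow shows up, if it does.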

andrejewski · 2018-11-25T21:44:06Z

Unfortunately I don't think I can reproduce it on my own. I also played around with some nested divs and it handled that fine.

If I had to guess, since you mention tables, there could be a parsing bug when dealing with your specific tables that is causing the parser to not unwind its stack correctly. What I recommend is segmenting those tables into smaller strings, parsing those, and picking out abnormalities.

Tables are pretty nasty so there's a better chance something obscure is happening there.
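
The segmenting approach could be sketched roughly as follows. The helper names are invented, the splitter is a naive tag scan rather than a real HTML parser (it will be confused by `<table` inside comments or scripts), and `parseFn` stands in for whatever parser is under test, such as Himalaya's `parse`.

```javascript
// Returns the substring of each top-level <table>...</table>, keeping
// nested tables inside their outermost parent.
function splitTopLevelTables(html) {
  const tokens = /<table\b|<\/table\s*>/gi;
  const segments = [];
  let depth = 0;
  let start = -1;
  let match;
  while ((match = tokens.exec(html)) !== null) {
    const isClose = match[0][1] === '/';
    if (!isClose) {
      if (depth === 0) start = match.index;
      depth++;
    } else if (depth > 0) {
      depth--;
      if (depth === 0) {
        segments.push(html.slice(start, match.index + match[0].length));
      }
    }
  }
  return segments;
}

// Parses each segment on its own and records the ones that throw,
// including stack overflows, which surface as a catchable RangeError.
function findFailingSegments(segments, parseFn) {
  const failures = [];
  segments.forEach((segment, index) => {
    try {
      parseFn(segment);
    } catch (error) {
      failures.push({ index, length: segment.length, message: error.message });
    }
  });
  return failures;
}
```

A segment that fails alone can then be split further, for example by rows, to narrow down the markup that triggers the problem.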
