Allow for context-sensitive parsing #158
Apologies if what I am asking for isn't related to this issue. Is it possible to get something similar to a regex capture group? Examples (syntax can be anything): Using
In the above, the scope of a captured rule needs to be considered. I am not that familiar with parsing, so I am not sure if I would be of any help regarding this. However, if you need further clarification, or any input, feel free to ask. I will do what I can. :-) Edit: One more thing needs to be considered: to use the same rule for multiple captures in the same scope, for the above syntax, the capture index can be mentioned. Reason: Like I mentioned in #215 [1], I am trying to write a grammar for Markdown. I am stuck at writing a rule for fenced code blocks. From CommonMark spec (above example-124 [2]):
Without a capture group/node, if I am right, the only way I can think of to solve this is to use global state. That would make this more complicated, as I would have to save the source string of some parsed non-code-blocks, and re-parse some code blocks. I haven't implemented this, so I am not sure how feasible this is. If I am doing this the wrong way, or over-complicating this, please do say so! :-) [1] #215 (comment) |
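To make the fenced-code-block problem concrete, here is a minimal hand-written sketch in Python (not Ohm). It assumes the CommonMark rule that a closing fence must use the same character as the opening fence and be at least as long, and it deliberately ignores other details of the spec (for example, restrictions on the info string). The function name `parse_fenced_block` is hypothetical. The point is that the closing pattern is built from what the opener matched, which is exactly the "capture" being asked for.

```python
import re

# Opening fence: up to 3 spaces of indentation, then 3+ backticks or 3+ tildes.
OPEN_FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def parse_fenced_block(lines, start):
    """Try to parse a fenced code block beginning at lines[start].

    Returns (info_string, code_lines, next_index), or None if lines[start]
    is not an opening fence. Simplified sketch, not a full CommonMark parser.
    """
    m = OPEN_FENCE.match(lines[start])
    if m is None:
        return None
    fence, info = m.group(1), m.group(2).strip()
    # The "capture": the closing fence depends on the text the opener matched
    # (same character, at least as many repetitions).
    close = re.compile(
        r"^ {0,3}" + re.escape(fence[0]) + "{" + str(len(fence)) + r",}[ \t]*$"
    )
    body = []
    i = start + 1
    while i < len(lines):
        if close.match(lines[i]):
            return info, body, i + 1
        body.append(lines[i])
        i += 1
    # An unclosed fence runs to the end of the document.
    return info, body, i
```

With a four-backtick opener, a three-backtick line inside the block is treated as content rather than as the closing fence, which a context-free rule cannot express without some form of captured state.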
I think the above will also work nicely with indentation-sensitive languages. Simple example:
For this to work, a capture's scope should be such that a rule is able to access captures in its parent scope/rules, like functions in most programming languages, which can access variables in their outer scope. Therefore, in the above grammar, A random thought. Hope this makes sense! :-) |
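The scoping idea (an inner rule can read captures made by its enclosing rules, but not the other way around) can be sketched with Python's standard `collections.ChainMap`. This is only an illustration of the lookup behaviour; the helper `enter_rule` and the capture names `indent` and `fence` are made up for the example and are not part of Ohm.

```python
from collections import ChainMap


def enter_rule(scope):
    """Open a nested capture scope for a child rule (hypothetical helper)."""
    return scope.new_child()


# Capture made by the outer rule.
outer = ChainMap({"indent": "    "})

# The inner rule gets its own scope layered on top of the outer one.
inner = enter_rule(outer)
inner["fence"] = "~~~"  # capture made by the inner rule

# The inner rule can read the outer capture ...
outer_indent_seen_from_inner = inner["indent"]
# ... but the outer rule never sees captures made inside the inner rule.
fence_visible_in_outer = "fence" in outer
```

When the inner rule finishes, its scope is simply discarded, so captures behave like local variables in a function call.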
@MuhammedZakir Thanks for contributing your thoughts here! That's an interesting suggestion, and maybe something like this could work for Ohm. I still need some time to think through all of the details, but this has inspired me to see what I could find in the literature for solving this problem. Here's what I found after a quick search:
I'd like to take the time to read these papers and think through the implications for Ohm. Maybe you're interested in reading these as well! |
A Symbol-Based Extension of Parsing Expression Grammars and Context-Sensitive Packrat Parsing Paper: https://dl.acm.org/doi/10.1145/3136014.3136025
All symbol operations in SPEG:
-- Edit: Might be helpful (haven't read this): Is stateful packrat parsing really linear in practice? a counter-example, an improved grammar, and its parsing algorithms |
Any updates/thoughts? 👀 |
@MuhammedZakir Unfortunately I haven't had the time to investigate this deeply yet. But, the SPEG approach seems nice and I could imagine it fitting well into Ohm. I do have some more time in the coming weeks/months so this may be something that I can dig into soon. |
Glad to hear that! :-) FYI: Pest supports context-sensitive parsing: https://docs.rs/pest_derive/latest/pest_derive/#push-pop-drop-and-peek. |
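For readers unfamiliar with the push/peek/pop style, here is a small Python sketch of the general idea: matched text is pushed onto a stack and later matched again literally. This is not Pest's API; the class `StackMatcher` and the function `match_raw_string` are hypothetical names used only for illustration. The example recognises raw-string literals such as `r##"..."##`, where the number of closing hashes must equal the number of opening hashes.

```python
import re


class StackMatcher:
    """Tiny matcher with an explicit stack of captured strings (hypothetical)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.stack = []

    def push(self, pattern):
        """Match a regex at the current position and push the matched text."""
        m = re.compile(pattern).match(self.text, self.pos)
        if m is None:
            return False
        self.stack.append(m.group(0))
        self.pos = m.end()
        return True

    def peek(self):
        """Match the text on top of the stack without popping it."""
        top = self.stack[-1]
        if self.text.startswith(top, self.pos):
            self.pos += len(top)
            return True
        return False

    def pop(self):
        """Match the text on top of the stack and pop it on success."""
        if self.peek():
            self.stack.pop()
            return True
        return False


def match_raw_string(text):
    """Return the body of an r#"..."# style literal at the start of text, or None."""
    m = StackMatcher(text)
    if not text.startswith("r"):
        return None
    m.pos = 1
    m.push(r"#*")  # remember the opening hashes (possibly none)
    if not text.startswith('"', m.pos):
        return None
    m.pos += 1
    start = m.pos
    while m.pos < len(text):
        if text[m.pos] == '"':
            end = m.pos
            m.pos = end + 1
            if m.pop():  # closing hashes must equal the captured opener
                return text[start:end]
            m.pos = end  # not the real terminator; keep scanning
        m.pos += 1
    return None
```

A quote followed by too few hashes is treated as part of the body, so the terminator is determined entirely by what was pushed at the start.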
I'm trying to extend the grammar of Ohm to include some indentation-specific operators, as specified in the paper Indentation-sensitive parsing for Parsec, also linked above. This will change the following part of the Ohm grammar:
- Seq = Iter*
-
+ Seq = IterWithIndentation*
+
+ IndentationRel = Eq | Ge | Gt | Any
+
+ Eq = "@="
+ Ge = "@>="
+ Gt = "@>"
+ Any = "@*"
+
+ IterWithIndentation
+ = Iter Ge -- indent_ge
+ | Iter Gt -- indent_gt
+ | Iter Eq -- indent_eq
+ | Iter Any -- indent_any
+ | Iter
With these changes, we can write the following grammar for (one of the productions of) the while statement in Python:
We can now write
while a < 100:
a = a + 1
a = a + 4
while a < 10: a=a+1
b = """e
eee
1
"""
In order to be able to handle this logic, Ohm would need to make two properties available at each parse:
Each parse needs these properties and produces a new set of properties for the next parse. Side note: The suggested operators are two characters long, but I believe they read well |
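As a rough illustration of what the four relation operators would check, here is a Python sketch. The operator spellings (`@=`, `@>=`, `@>`, `@*`) come from the proposal above; the meaning given to each (comparing a line's indentation against a reference indentation) is an assumption based on the relation names, and the helpers `indent_of`, `satisfies`, and `block_body` are hypothetical. Tabs and blank lines are not handled.

```python
import operator

# Relation operators from the proposal, mapped to comparisons between a
# line's indentation and a reference indentation (assumed semantics).
RELATIONS = {
    "@=": operator.eq,
    "@>=": operator.ge,
    "@>": operator.gt,
    "@*": lambda column, reference: True,
}


def indent_of(line):
    """Number of leading spaces (tabs are not handled in this sketch)."""
    return len(line) - len(line.lstrip(" "))


def satisfies(rel, line, reference_indent):
    """Check whether the line's indentation stands in relation rel to the reference."""
    return RELATIONS[rel](indent_of(line), reference_indent)


def block_body(lines, header_index, rel="@>"):
    """Collect the lines after the header whose indentation satisfies rel."""
    header_indent = indent_of(lines[header_index])
    body = []
    for line in lines[header_index + 1:]:
        if not satisfies(rel, line, header_indent):
            break
        body.append(line)
    return body
```

For the `while` example, calling `block_body` on the header line with `@>` collects the two indented assignments and stops at the first line that returns to the header's indentation.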
Wouldn't a general method/operator such as the one I mentioned above [1] solve this? [1] #158 (comment) |
@haikyuu Thanks for taking a go at this! This is definitely an interesting experiment. I've been thinking a bit in the past few weeks about how we can handle indentation-sensitive languages, and other context-sensitive languages. I'll try to share some more substantial thoughts in the next day or two. For now, I can share my initial thoughts/impressions:
|
@pdubroy I have watched the SPEG video and I find it a very powerful and readable syntax.
|
@haikyuu Sounds great! Btw, if I were to experiment with this in Ohm, I'd initially try to do this without changing any syntax at all. I'd create a "dummy" grammar with empty rules. Once it's instantiated in JS, I'd replace the rule bodies with a custom subclass of PExpr. Something like this:
Then of course you'd have to implement the Maybe that's helpful in case you or anyone else ends up experimenting with this. |
Any progress? |
No, I didn't make any progress. Feel free to jump on it if you will 🙏 |
It would be helpful to support state during parsing to enable such things as the off-side rule. This seems to be already called out as a planned extension. From the MSA paper: