Auto generate ast expression nodes #16285

Glyphack · 2025-02-20T19:13:11Z

Summary

Auto generate AST nodes using definitions in ast.toml. I added similar fields to Field in ASDL to hold field information

Test Plan

Nothing outside the ruff_python_ast package should change.

MichaReiser · 2025-02-20T19:26:36Z

Uhh, exciting.

I do feel a bit conflicted about using toml to define our grammar over e.g. something like ungrammar but maybe it's the right choice and doesn't really matter?

github-actions · 2025-02-20T19:34:14Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Glyphack · 2025-02-20T19:35:05Z

@MichaReiser I will also use ungrammar on a separate branch to see the difference I just know the name nothing more. I was also unsure if I should extend the ast.toml file or not.

I first started by re-using the ASDL parser by python and then realized the ASDL is not offering more functionality here for generating Nodes and Enums. I decided to continue with toml file since we can add custom fields there as well. For example the fields order or other payload can be added to the toml.

MichaReiser · 2025-02-20T20:45:48Z

RustPython used to use ASDL and, uff, that was painful. It's probably different because we actually parsed the official python grammar and derived the AST from it. The problem with that is that our AST diverged in many places.

dcreager

I do feel a bit conflicted about using toml to define our grammar over e.g. something like ungrammar but maybe it's the right choice and doesn't really matter?

To be honest I don't have a strong opinion between our hand-rolled TOML, ASDL, and ungrammar. To me, the main concerns are:

Is the AST definition easy to understand and maintain?
Is it easy to parse in our code generator?

Our hand-rolled TOML was great for the first proof-of-concept, since we could rely on Python's builtin tomllib to do the bulk of the parsing.

I thought ASDL could be nice primarily because we could reuse a lot of the parsing logic from CPython, since (at least at the moment) we are using a Python script to parse the AST definition and generate our (Rust) code. (At first I hoped it would also have the benefit of letting us reuse the grammar itself from CPython, but as @MichaReiser points out, our AST has diverged from CPython's representation.)

RustPython used to use ASDL and, uff, that was painful.

What part was painful about it? Was the syntax itself too limited? Or was it cumbersome to parse and generate code from?

dcreager · 2025-02-20T21:16:06Z

crates/ruff_python_ast/ast.toml

+ExprBoolOp = { fields = [
+    { name = "op", type = "BoolOp" },
+    { name = "values", type = "Expr", seq = true }
+]}


Is the AST definition easy to understand and maintain?

Given this concern, this syntax does start to look a bit unwieldy.

Is it easy to parse in our code generator?

But on the other hand, continuing to use TOML still makes it easier to iterate on the code generator itself.

So given all of this, I'd still lean towards using this TOML syntax. We can always revisit the syntax later if we feel there's something that e.g. ungrammar would give us that the TOML doesn't.

crates/ruff_python_ast/generate.py

crates/ruff_python_ast/src/nodes.rs

MichaReiser · 2025-02-20T22:02:12Z

What part was painful about it? Was the syntax itself too limited? Or was it cumbersome to parse and generate code from?

The main pain point was that it parsed python's official AST grammar and then did some hacky overrides in code for where we wanted to diverge.

My other concern with using ASDL is that it isn't easy for us to extend if we need to and I always found it hard to read (e.g. could we extract field documentation?)

Glyphack · 2025-02-21T22:25:54Z

good news 🎉 I was able to complete the generation for all expression nodes. There are a few hacky things I need to fix.

We need a way to provide the derive attribute to structs. Right now I just added a derives field that adds additional derives for nodes that have the Default derive.

ExprNoneLiteral = { fields = [], derives = ["Default"]}

I initially thought only knowing a field is a sequence or not would be enough to insert it in a Vec but I found in ExprCompare one of the types is Box<[crate::CmpOp]>.

ExprCompare = { fields = [
    { name = "ops", type = "Box<[crate::CmpOp]>"},
    { name = "comparators", type = "Box<[Expr]>"}
]}

I added a rule to automatically box if a field type is the group of the node it belongs to so we box every Expr but this might be confusing because for Box<str> we need to box explicitly. Or for [Expr] type the auto boxing won't work. I'm also not interested in making it more complicated but I'm not sure what way would be better.
Also it won't work in this example:

{ name = "parameters", type = "Box<crate::Parameters>", optional = true },

I'm not sure if we refer to Name just by it's name and import it in the file or like this:

{ name = "id", type = "name::Name" },

The current code also uses crate:: before every type. This is wrong because if a type is bool it should not become crate::bool but removing that requires every type to contain the crate word which is redundant mostly. We can solve this by having a map of types that should not have crate:: I don't know what would be the easiest solution.

I'm reading about ungrammar but I feel because of the extensibility we need in the generator using a custom file would be easier because we can add stuff without hacking it on top of something else. But I have to read it first.

dcreager

good news 🎉 I was able to complete the generation for all expression nodes. There are a few hacky things I need to fix.

This is starting to come together very nicely! Thank you for tackling this.

I'm reading about ungrammar but I feel because of the extensibility we need in the generator using a custom file would be easier because we can add stuff without hacking it on top of something else. But I have to read it first.

Given some of my style nits about cleaning up the TOML, I think it's fine to proceed with this PR without also looking for a way to migrate to ungrammar. I think that can be a separate experiment/PR — and in all honesty, a lower priority one, since this seems to be shaping up nicely as it currently is.

dcreager · 2025-02-25T14:43:52Z

crates/ruff_python_ast/ast.toml

+ExprUnaryOp = { fields = [
+    { name = "op", type = "UnaryOp" },
+    { name = "operand", type = "Expr" }
+]}


as a style nit, the following are equivalent TOML (i.e., it shouldn't require changes to the generator) that I think might be a bit easier to read:

[Expr.nodes.ExprUnaryOp] fields = [ { name = "op", type = "UnaryOp" }, { name = "operand", type = "Expr" } ]

or

[[Expr.nodes.ExprUnaryOp.fields]] name = "op" type = "UnaryOp" [[Expr.nodes.ExprUnaryOp.fields]] name = "operand" type = "Expr"

I have a slight preference for the last one, since it eliminates the nested maps and lists from the TOML. What do you think?

Sounds good.

crates/ruff_python_ast/ast.toml

crates/ruff_python_ast/generate.py

dcreager · 2025-02-25T14:52:05Z

crates/ruff_python_ast/generate.py

+                inner = f"crate::{field.ty}"
+                if field.ty == "bool" or field.ty.startswith("Box"):
+                    inner = field.ty


I like your suggestion of adding a hard-coded list of types (defined near the top of this file) that don't require a crate:: qualifier

dcreager · 2025-02-25T14:53:50Z

crates/ruff_python_ast/generate.py

+                if field.ty == group.name and field.seq is False:
+                    inner = f"Box<{inner}>"


We will want to do this if the field type is any group, not just the current group. Most of the Stmt nodes, for instance, contain expressions, which are all Box<Expr> even though that's a different node group:

ruff/crates/ruff_python_ast/src/nodes.rs

Lines 178 to 183 in 5c007db

pub struct StmtAugAssign {

pub range: TextRange,

pub target: Box<Expr>,

pub op: Operator,

pub value: Box<Expr>,

}

I added a rule to automatically box if a field type is the group of the node it belongs to so we box every Expr but this might be confusing because for Box<str> we need to box explicitly. Or for [Expr] type the auto boxing won't work. I'm also not interested in making it more complicated but I'm not sure what way would be better.

I think this suggestion also makes this concern less concerning: auto-boxing for singleton node groups is great, and that doesn't mean we have to find a way to auto-box every place that Box appears.

crates/ruff_python_ast/src/nodes.rs

crates/ruff_python_ast/src/relocate.rs

dcreager · 2025-02-25T15:07:02Z

crates/ruff_python_ast/ast.toml

+    { name = "ctx", type = "ExprContext" }
+]}
+ExprName = { fields = [
+    { name = "id", type = "name::Name" },


I'm not sure if we refer to Name just by it's name and import it in the file or like this:

{ name = "id", type = "name::Name" },

I think it's worth ensuring that we can use just Name here in the TOML, and importing it at the top of the generated Rust file is a perfectly fine way to do that

I defined the imports at the top of the generate.py.

One additional improvement is to define the imports in a way we can check if a name is imported and not use crate prefix for it.

For now I just used the global list of types we don't want to prefix with crate(the other comment) instead. Because we only have one import and it might be too soon to generalize.

dcreager · 2025-02-25T15:09:06Z

crates/ruff_python_ast/ast.toml

+    { name = "ops", type = "Box<[crate::CmpOp]>"},
+    { name = "comparators", type = "Box<[Expr]>"}


I initially thought only knowing a field is a sequence or not would be enough to insert it in a Vec but I found in ExprCompare one of the types is Box<[crate::CmpOp]>

Doing this as a hard-coded type without a seq is just fine as a special case. We might also check whether they need to be a box-of-slice instead of a vec. The only benefit I can see is that we save one word of storage since we only need pointer+length instead of pointer+length+capacity. And if that's worth doing, why aren't we doing that for all of the other vecs? But we don't have to tackle that in this PR, this is fine as you have it.

crates/ruff_python_ast/ast.toml

Co-authored-by: Douglas Creager <[email protected]>

MichaReiser

I like where this is going.

I do find the TOML somewhat hard to parse because of how verbose fields is. I'd suggest exploring using regular lists with inline tables to make the format more compact. We can still use the "full-table" layout in cases where it is necessary (because inline-tables need to be single line)

MichaReiser · 2025-02-26T07:40:41Z

crates/ruff_python_ast/ast.toml

+[[Expr.nodes.ExprBoolOp.fields]]
+name = "op"
+type = "BoolOp"


Should we use inline arrays? I find the syntax very verbose to read and find it hard to get a sense of a nodes fields

Suggested change

[[Expr.nodes.ExprBoolOp.fields]]

name = "op"

type = "BoolOp"

fields = [

{ name = "op", type = "BoolOp" },

{ name = "values", type = "Expr", seq = true },

...

}

The downside of this is that inline tables can't be multiline (this can be a problem with comments)

In fairness to @Glyphack I had suggested the syntax he's using here #16285 (comment)!

Another possibility, since name is required and unique, would be

[Expr.nodes.ExprBoolOp] fields = { op = { type = "BoolOp" }, values = { type = "Expr", seq = true } }

Oh, I didn't see that one.

The second doesn't work, I think, because toml inline tables don't allow line breaks

yes unfortunately it's not possible to have line break inside an inline table.
https://toml.io/en/v1.0.0#inline-table

To make that work we need to have a table [Expr.nodes.ExprBoolOp.fields.op] which takes it more verbose I think?

But I think instead of the inline table using an inline array would solve the problem of having all the fields in one section and we only need to write an extra "name = ..." does sound good to me.

But one thing to keep in mind is that if we use inline table for the field attributes and have tables inside attributes(for example if we apply this comment #16285 (comment)) it might get very messy example:

[Expr.nodes.ExprBoolOp] fields = [ { name = "values", type = { kind = "sequence", element-type = "Expr" } } ]

I feel we end up making type a more complex field in the end.

MichaReiser · 2025-02-26T07:40:59Z

crates/ruff_python_ast/ast.toml

+seq = true
+
+[Expr.nodes.ExprNamed]
+rustdoc = "/// See also [NamedExpr](https://docs.python.org/3/library/ast.html#ast.NamedExpr)"


Nit: Just doc?

rustdoc is consistent with the existing per-group comments. If we use doc here we should change the other one to be consistent.

I also prefer doc it's shorter I'll change all to doc.

MichaReiser · 2025-02-26T07:48:49Z

crates/ruff_python_ast/ast.toml

+
+[[Expr.nodes.ExprCompare.fields]]
+name = "comparators"
+type = "Box<[Expr]>"


Having to explicitly use Box<[Expr]> here has the downside that it still requires manually updating all types when changing from one representation to another.

I also find that type = Expr + seq = true reads somewhat strangely. It also allows for combination that aren't allowed. E.g. is seq = true + optional = true supported? I wonder if we should do something like:

type = "Expr" // single item or escape hatch type = { kind: "sequence", element-type: "Expr", slice: true } type = { kind: "required", value: "Expr" } type = { kind: "optiona", value: "Expr" }

I'm not very convinced it is better. Maybe you have ideas to improve it further?

Yeah the current format has no restrictions and definitely has some problems like You can use type Vec<Expr> and also set the sequence. Honestly I was not sure how specific should I make the types in this stage.
I think your suggestion is needed in the end. For example when we want to provide functions to access data without accessing the fields the exact types are useful.

My feeling is that if you are also unsure we can leave this undecided to when it's needed?

MichaReiser · 2025-02-26T07:50:40Z

crates/ruff_python_ast/ast.toml

+rustdoc = """/// An AST node that represents either a single-part f-string literal
+/// or an implicitly concatenated f-string literal.
+///
+/// This type differs from the original Python AST ([JoinedStr]) in that it
+/// doesn't join the implicitly concatenated parts into a single string. Instead,
+/// it keeps them separate and provide various methods to access the parts.
+///
+/// [JoinedStr]: https://docs.python.org/3/library/ast.html#ast.JoinedStr"""


Nit: We could also use a list here and join the elements?

Suggested change

rustdoc = """/// An AST node that represents either a single-part f-string literal

/// or an implicitly concatenated f-string literal.

///

/// This type differs from the original Python AST ([JoinedStr]) in that it

/// doesn't join the implicitly concatenated parts into a single string. Instead,

/// it keeps them separate and provide various methods to access the parts.

///

/// [JoinedStr]: https://docs.python.org/3/library/ast.html#ast.JoinedStr"""

rustdoc = [

"An AST node that represents either a single-part f-string literal",

"or an implicitly concatenated f-string literal.",

"",

"This type differs from the original Python AST ([JoinedStr]) in that it",

"doesn't join the implicitly concatenated parts into a single string. Instead,",

"it keeps them separate and provide various methods to access the parts.",

"",

"[JoinedStr]: https://docs.python.org/3/library/ast.html#ast.JoinedStr"

]

But I must admit it looks worse with the empty lines... hmm

I would definetely remove the need for /// They're unnecessary noise when authoring the TOML file and they can easily be inserted by the script.

That's a good point that the script can insert the ///s. I think removing those, but keeping the multi-line string, is likely to be a good compromise:

Suggested change

rustdoc = """/// An AST node that represents either a single-part f-string literal

/// or an implicitly concatenated f-string literal.

///

/// This type differs from the original Python AST ([JoinedStr]) in that it

/// doesn't join the implicitly concatenated parts into a single string. Instead,

/// it keeps them separate and provide various methods to access the parts.

///

/// [JoinedStr]: https://docs.python.org/3/library/ast.html#ast.JoinedStr"""

rustdoc = """

An AST node that represents either a single-part f-string literal

or an implicitly concatenated f-string literal.

This type differs from the original Python AST ([JoinedStr]) in that it

doesn't join the implicitly concatenated parts into a single string. Instead,

it keeps them separate and provide various methods to access the parts.

[JoinedStr]: https://docs.python.org/3/library/ast.html#ast.JoinedStr"""

Good point I'm going with the second version with multi line strings.

Co-authored-by: Douglas Creager <[email protected]>

Glyphack force-pushed the autogen-ast branch 2 times, most recently from 63f4777 to fd8c7cd Compare February 20, 2025 19:24

Auto generate ast expression nodes

85c9400

Glyphack force-pushed the autogen-ast branch from fd8c7cd to 85c9400 Compare February 20, 2025 19:25

MichaReiser requested a review from dcreager February 20, 2025 19:25

AlexWaygood added the internal An internal refactor or improvement label Feb 20, 2025

dcreager reviewed Feb 20, 2025

View reviewed changes

Auto generate all expression AST nodes

18f64ea

Fix linter issues

cfad8f4

MichaReiser requested a review from dcreager February 25, 2025 08:13

dcreager reviewed Feb 25, 2025

View reviewed changes

Glyphack and others added 2 commits February 25, 2025 19:08

Update crates/ruff_python_ast/generate.py

8b9bbb6

Co-authored-by: Douglas Creager <[email protected]>

Apply review suggestions

ce4cf6f

Glyphack force-pushed the autogen-ast branch from 177cfed to ce4cf6f Compare February 25, 2025 19:30

Move node comments into rustdoc

afba5fb

Glyphack requested a review from dcreager February 26, 2025 07:15

Glyphack marked this pull request as ready for review February 26, 2025 07:15

MichaReiser reviewed Feb 26, 2025

View reviewed changes

Glyphack and others added 2 commits February 26, 2025 12:10

Update crates/ruff_python_ast/ast.toml

4951013

Co-authored-by: Douglas Creager <[email protected]>

Refactor rustdoc

6f28c81

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto generate ast expression nodes #16285

Auto generate ast expression nodes #16285

Glyphack commented Feb 20, 2025 •

edited

Loading

MichaReiser commented Feb 20, 2025

github-actions bot commented Feb 20, 2025 •

edited

Loading

Glyphack commented Feb 20, 2025

MichaReiser commented Feb 20, 2025

dcreager left a comment

dcreager Feb 20, 2025

MichaReiser commented Feb 20, 2025

Glyphack commented Feb 21, 2025 •

edited

Loading

dcreager left a comment

dcreager Feb 25, 2025

Glyphack Feb 25, 2025

dcreager Feb 25, 2025

dcreager Feb 25, 2025

dcreager Feb 25, 2025

Glyphack Feb 25, 2025 •

edited

Loading

dcreager Feb 25, 2025

MichaReiser left a comment

MichaReiser Feb 26, 2025

dcreager Feb 26, 2025

MichaReiser Feb 26, 2025

Glyphack Feb 26, 2025 •

edited

Loading

MichaReiser Feb 26, 2025

dcreager Feb 26, 2025

Glyphack Feb 26, 2025

MichaReiser Feb 26, 2025

Glyphack Feb 26, 2025

MichaReiser Feb 26, 2025

dcreager Feb 26, 2025

Glyphack Feb 26, 2025

		if field.ty == group.name and field.seq is False:
		inner = f"Box<{inner}>"

	pub struct StmtAugAssign {
	pub range: TextRange,
	pub target: Box<Expr>,
	pub op: Operator,
	pub value: Box<Expr>,
	}

		{ name = "ops", type = "Box<[crate::CmpOp]>"},
		{ name = "comparators", type = "Box<[Expr]>"}

-rustdoc = """/// An AST node that represents either a single-part f-string literal
-/// or an implicitly concatenated f-string literal.
-///
-/// This type differs from the original Python AST ([JoinedStr]) in that it
-/// doesn't join the implicitly concatenated parts into a single string. Instead,
-/// it keeps them separate and provide various methods to access the parts.
-///
-/// [JoinedStr]: https://docs.python.org/3/library/ast.html#ast.JoinedStr"""
+rustdoc = [
+	"An AST node that represents either a single-part f-string literal",
+	"or an implicitly concatenated f-string literal.",
+	"",
+	"This type differs from the original Python AST ([JoinedStr]) in that it",
+	"doesn't join the implicitly concatenated parts into a single string. Instead,",
+	"it keeps them separate and provide various methods to access the parts.",
+	"",
+	"[JoinedStr]: https://docs.python.org/3/library/ast.html#ast.JoinedStr"
+]

Auto generate ast expression nodes #16285

Are you sure you want to change the base?

Auto generate ast expression nodes #16285

Conversation

Glyphack commented Feb 20, 2025 • edited Loading

Summary

Test Plan

MichaReiser commented Feb 20, 2025

github-actions bot commented Feb 20, 2025 • edited Loading

ruff-ecosystem results

Linter (stable)

Linter (preview)

Formatter (stable)

Formatter (preview)

Glyphack commented Feb 20, 2025

MichaReiser commented Feb 20, 2025

dcreager left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaReiser commented Feb 20, 2025

Glyphack commented Feb 21, 2025 • edited Loading

dcreager left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Glyphack Feb 25, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaReiser left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Glyphack Feb 26, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Glyphack commented Feb 20, 2025 •

edited

Loading

github-actions bot commented Feb 20, 2025 •

edited

Loading

`ruff-ecosystem` results

Glyphack commented Feb 21, 2025 •

edited

Loading

Glyphack Feb 25, 2025 •

edited

Loading

Glyphack Feb 26, 2025 •

edited

Loading