Tolk v0.8: preparation for structures; indexed access var.0 #1503

Open: tolk-vm wants to merge 5 commits into testnet

tolk-vm (Contributor) commented Jan 27, 2025

Now that Tolk v0.7 with its AST-based semantic kernel is done, we're starting the path towards structures with automatic packing to/from cells. This will take several steps, each publicly released; this is the first one.

Notable changes in Tolk v0.8

  1. Syntax tensorVar.0 and tupleVar.0, both for reading and writing
  2. Allow cell, slice, etc. to be valid identifiers

Syntax tensorVar.0 and tupleVar.0

A brief reminder of what tensors and tuples are in FunC/Tolk.

A tensor of N parts is actually N distinct variables on a TVM stack.

var a: (int, builder) = (5, beginCell());   // 2 variables on a stack
var b: (int, slice, bool) = ...;            // 3 variables on a stack
var c: (int, (slice, bool)) = ...;          // also 3 (1 + 2)

A tuple of N parts is a single variable on a TVM stack. Tuples can be typed and untyped:

var a: [int, builder] = [5, beginCell()];   // 1 variable (TVM tuple) with 2 items
var b: [int, slice, bool] = ...;            // 1 tuple with 3 items
var c: [int, [slice, bool]] = ...;          // 1 tuple with 2 items (int + nested tuple)

var untypedT = createEmptyTuple();          // untyped tuple, compiler is not aware of its contents
untypedT.tuplePush(5);                      // now it has 1 item

Currently, the only thing you can do with tensors or typed tuples is unpack them into separate variables. Unpacking copies the values, so modifying these variables won't modify the original tensor/tuple.

var t: (int, slice, builder);   // 3 stack slots
var (i, s, b) = t;              // 3 variables, each 1 slot
var (i, _, _) = t;              // if you need just 0-th
i = 100500;                     // does NOT change t

var t: [int, slice, builder];   // 1 stack slot (TVM tuple) with 3 items
var [i, s, b] = t;              // read tuple items via INDEX asm command
i = 100500;                     // does NOT change 0-th item of t

Since Tolk v0.8, you can access tensors/tuples by indices without unpacking them

Use tensorVar.{i} to access i-th component of a tensor. Modifying it will change the tensor.

var t = (5, someSlice, someBuilder);   // 3 stack slots
t.0                     // 5
t.0 = 10;               // t is now (10, ...)
t.0 += 1;               // t is now (11, ...)
increment(mutate t.0);  // t is now (12, ...)
t.0.increment();        // t is now (13, ...)

t.1         // slice
t.100500    // compilation error

Use tupleVar.{i} to access i-th element of a tuple (does asm INDEX under the hood). Modifying it will change the tuple (does SETINDEX under the hood).

var t = [5, someSlice, someBuilder];   // 1 tuple on a stack with 3 items
t.0                     // "0 INDEX", reads 5
t.0 = 10;               // "0 SETINDEX", t is now [10, ...]
t.0 += 1;               // also works: "0 INDEX" to read 10, "0 SETINDEX" to write 11
increment(mutate t.0);  // also, the same way
t.0.increment();        // also, the same way

t.1         // "1 INDEX", it's slice
t.100500    // compilation error
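
As a rough mental model (a Python sketch, not compiler code): TVM tuples are immutable values, so "i SETINDEX" yields a new tuple that differs only at position i, and the compiler stores it back into the variable. The helper names `index` and `set_index` are hypothetical and exist only for this illustration.

```python
def index(t, i):
    """Model of "i INDEX": read the i-th item, failing if i is out of range."""
    if not 0 <= i < len(t):
        raise IndexError(f"index {i} out of range")
    return t[i]


def set_index(t, i, value):
    """Model of "i SETINDEX": return a new tuple differing from t only at i."""
    if not 0 <= i < len(t):
        raise IndexError(f"index {i} out of range")
    return t[:i] + (value,) + t[i + 1:]


t = (5, "someSlice", "someBuilder")
t = set_index(t, 0, 10)                # t.0 = 10
t = set_index(t, 0, index(t, 0) + 1)   # t.0 += 1: read, add, write back
```

In this model `t.0 += 1` is visibly a read followed by a write, which matches the "0 INDEX" then "0 SETINDEX" pair described above.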

It also works for untyped tuples, though the compiler can't guarantee index correctness.

var t = createEmptyTuple();
t.tuplePush(5);
t.0                     // will read 5
t.0 = 10;               // t will be [10]
t.100500                // will fail at runtime

It works for nesting var.{i}.{j}. It works for nested tensors, nested tuples, tuples nested into tensors. It works for mutate. It works for globals.

Just a couple of examples:

var t: (int, (slice, builder, int));
t.0          // int
t.1          // (slice, builder, int)
t.1.2        // int
t.1.2 = 5    // of course, t and t.1 will be transparently affected

var t: (int, [slice, builder, int])
t.1.2    // "2 INDEX" of 1-th component of tensor t

var t: [int, [slice, builder, int]]
t.1.2    // "1 INDEX" of t + "2 INDEX" of that tuple

Nested tuples will even work for writing:

t.1.2 = 10     // "1 INDEX" + "2 SETINDEX" + "1 SETINDEX"
t.1.2 += 10    // "1 INDEX" + "2 INDEX" + sum + "2 SETINDEX" + "1 SETINDEX"

globalTuple.1.2 += 10;  // "GETGLOB" + ... + "SETGLOB"
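
The instruction sequences in these comments follow a simple pattern: read every parent tuple on the way down, write the leaf, then write each parent back on the way up. A minimal Python sketch of that pattern follows; the function names are hypothetical, and stack shuffling as well as GETGLOB/SETGLOB are deliberately ignored.

```python
def write_ops(path):
    """Instruction names for `t.<path> = value` on nested tuples."""
    *parents, leaf = path
    reads = [f"{i} INDEX" for i in parents]           # fetch parent tuples
    writes = [f"{leaf} SETINDEX"]                     # write the leaf
    writes += [f"{i} SETINDEX" for i in reversed(parents)]  # store parents back
    return reads + writes


def compound_ops(path, op):
    """Instruction names for a compound assignment such as `t.<path> += value`."""
    *parents, leaf = path
    reads = [f"{i} INDEX" for i in parents]
    writes = [f"{leaf} SETINDEX"]
    writes += [f"{i} SETINDEX" for i in reversed(parents)]
    # Unlike plain assignment, the leaf itself is read before `op` is applied.
    return reads + [f"{leaf} INDEX", op] + writes


print(" + ".join(write_ops([1, 2])))
print(" + ".join(compound_ops([1, 2], "sum")))
```

Note that a plain `t.0 = rhs` produces only "0 SETINDEX" in this model: the leaf is never read.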

So, the compiler is smart enough to handle all cases, even these:

global gTensor: (int, int);
(gTensor.0, gTensor.1) = (5, 6);   // does GETGLOB/SETGLOB only once

var t: [..., [int, int]]
(t.2.0, t.2.1) = (5, 6);  // sub-tuple t.2 is read/updated only once

Also, the compiler can now detect "one variable modified twice in the same expression" and reports a compilation error:

(a, a) = rhs                    // error
f(mutate t.1.0, mutate t.1.0)   // error
(t.1.0, t.1) = rhs              // also error: t.1 both modified and read

Why is this essential?

In the future, we'll have structures, declared like this:

struct User {
    id: int;
    name: slice;
}

Structures will be stored like tensors on a stack:

var u: User = { id: 5, name: "" };
// u is actually 2 slots on a stack, the same as
var u: (int, slice) = (5, "");

fun getUser(): User { ... }
// on a stack, the same as
fun getUser(): (int, slice) { ... }

This means that obj.{field} is exactly the same as tensorVar.{i}:

var u: User = ...;       // u: (int, slice) = ...
u.id;                    // u.0
u.id = 10;               // u.0 = 10

Same goes for nested objects:

struct Storage {
    lastUpdated: int;
    owner: User;
}

s.lastUpdated            // s.0
s.owner.id               // s.1.0
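
To make the field-to-index mapping concrete, here is a small Python sketch that resolves a chain of field names into the equivalent index chain. The `STRUCTS` table and the function name are assumptions made for this illustration; structures are not implemented yet, so this is not how the compiler does it.

```python
# Hypothetical declarations mirroring the User / Storage examples above:
# each struct is an ordered list of (field name, field type).
STRUCTS = {
    "User": [("id", "int"), ("name", "slice")],
    "Storage": [("lastUpdated", "int"), ("owner", "User")],
}


def field_path_to_indices(struct_name, fields):
    """Translate e.g. Storage + ["owner", "id"] into the index path [1, 0]."""
    indices = []
    current = struct_name
    for field in fields:
        names = [name for name, _ in STRUCTS[current]]
        i = names.index(field)          # ValueError if there is no such field
        indices.append(i)
        current = STRUCTS[current][i][1]  # descend into the field's type
    return indices


print(field_path_to_indices("Storage", ["owner", "id"]))
```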

A structure might eventually get an annotation that changes its layout from a tensor to a tuple. Accessing/modifying fields of such an object would then result in "INDEX" / "SETINDEX", exactly as is done for tuples now. Note that global tensors (and, in the future, global objects) are actually stored as TVM tuples.

So, implementing all the above is a direct step towards structures.

Allow cell, slice, etc. to be valid identifiers

Previously, int / cell / builder / slice / tuple were keywords, so variables could not have such names.

Now, these names are allowed:

var cell = ...;
var cell: cell = ...;
var cell: slice = ...;  // don't do this :)

In both TypeScript and Rust, names of types are also valid identifiers (var number = ... in TS is okay). Moreover, with the introduction of structures, this code should reasonably be valid, right?

struct a { ... }

var a = ...;
var a: a = { ... };

As a consequence, struct fields will also be allowed to be named cell, slice, or after any other type that exists now or will be added later.

Implementation details: Ops on a stack refactoring

The necessary first step is to allow direct access to tensor vars. In FunC (and in Tolk before), tensor vars were represented as a single var in terms of Ops:

var a = (1, 2);
LET (_i) = (_1, _2)

Now, every tensor of N stack slots is represented as N IR vars, and later IR analysis has been changed accordingly. This became possible because all types are now inferred in advance.

LET (_i, _j) = (_1, _2)

The TmpVar now represents a single stack slot. LocalVarData now contains an array of stack slots (1 for primitives, N for tensors).

Stack comments in Fift output have also changed. First, they contain tensor components now:

fun getFirst(t: (int, (slice, int))) {
    return t.0;
}

Results in (pay attention to comments):

  getFirst PROC:<{
    //  t.0 t.1.0 t.1.1
    2DROP    //  t.0
  }>

It's very handy, and will be even more so once objects and fields are implemented.
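
The slot names in that stack comment can be reproduced by flattening the variable's type. Below is a minimal Python sketch under an assumed encoding (not the compiler's TypeData): a primitive is a string, a tensor is a Python tuple of types, and a TVM tuple is a Python list, which always occupies one slot.

```python
def slot_names(name, ty):
    """List the stack-comment names for every slot occupied by `name`."""
    if isinstance(ty, tuple):               # tensor: one name per nested slot
        names = []
        for i, part in enumerate(ty):
            names.extend(slot_names(f"{name}.{i}", part))
        return names
    return [name]                           # primitive or TVM tuple: one slot


# t: (int, (slice, int)) from the getFirst example above
print(" ".join(slot_names("t", ("int", ("slice", "int")))))
```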

Next, I've decided to use the notation '0 in stack comments instead of _0, which FunC has always used. The new notation can't be confused with identifiers and indices:

    ...
    2 ADDCONST  //  '6
    RANDU256    //  '6 '7

Implementation details: indexed access and non-trivial lvalues

With the LET Op refactored as above, making tensorVar.0 work for reading and writing becomes quite trivial. It's just accessing stack slots by offset, depending on inferred types. Every TypeData can calculate its own width on a stack, so accessing the i-th component means accessing W[i] slots at an offset equal to the sum of W[0..i-1].
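
The width/offset arithmetic can be sketched in a few lines of Python. The type encoding is an assumption for illustration only (primitive = string, tensor = Python tuple, TVM tuple = Python list occupying a single slot), and `width` / `slot_range` are hypothetical names, not the compiler's API.

```python
def width(ty):
    """Number of stack slots a value of type `ty` occupies."""
    if isinstance(ty, tuple):               # tensor: sum of component widths
        return sum(width(part) for part in ty)
    return 1                                # primitive or TVM tuple


def slot_range(ty, path):
    """Return (offset, width) of the tensor component reached via `path`."""
    offset = 0
    for i in path:
        if not isinstance(ty, tuple):
            raise TypeError("indexed slot access is only modelled for tensors")
        if not 0 <= i < len(ty):
            raise IndexError(f"index {i} out of range")  # compile-time error in Tolk
        offset += sum(width(part) for part in ty[:i])    # sum of W[0..i-1]
        ty = ty[i]
    return offset, width(ty)


T = ("int", ("slice", "builder", "int"))    # var t: (int, (slice, builder, int))
print(slot_range(T, [1]))                   # slots occupied by t.1
print(slot_range(T, [1, 2]))                # the single slot of t.1.2
```

Nested access falls out of the same loop: each step adds the widths of the preceding siblings and descends one level.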

Nesting tensorVar.0.1.2 works automatically. Holding tuples inside tensors at IR level makes no difference if we can handle tuple vars in general.

Making tupleVar.0 work for writing is not so trivial:

tupleVar.0 = 10;   // should actually do: tupleVar + "10 PUSHINT" + "0 SETINDEX"

To achieve this, a special LValContext was introduced. Its purpose is to handle non-primitive lvalues. At the IR level, an ordinary local variable exists, but changing it must trigger something non-trivial.

  • example: globalVar = 9 actually does Const '5 = 9 + Let '6 = '5 + SetGlob "globalVar" = '6
  • example: tupleVar.0 = 9 actually does Const '5 = 9 + Let '6 = '5 + Const '7 = 0 + Call tupleSetAt('4, '6, '7)

Of course, mixing globals with tuples should also be supported. To achieve this, treat tupleObj inside tupleObj.i as "rvalue inside lvalue". For instance, globalTuple.0 = 9 reads global (like rvalue), assigns 9 to tmp var, modifies tuple, writes global.

Nested tuples are handled with care. Remember that t.0 = rhs should NOT read the 0-th item, only write it. But a nested t.0.1 = rhs does read the t.0 tuple (still not t.0.1) and updates t.0 afterwards. This is also done using the same LValContext, where t.0.1 is an lval and t.0 is an "rval inside lval".

A challenging thing is handling "unique" parts, to be read/updated only once.

  • example: f(mutate globalTensor.0, mutate globalTensor.1), then globalTensor should be read/written once
  • example: (t.0.0, t.0.1) = rhs (t is [[int, int]]), then t.0 should be read/updated once

Detecting such "common parts" is done via calculating hashes of AST nodes of every lvalue and "rvalue inside lvalue" in LValContext.

By the way, this automatically makes it possible to detect and report "multiple writes inside expression", like (a, a) = rhs / [t.0, (t.0.1, c)] = rhs.
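
A simplified Python model of both checks follows. It is an assumption-laden sketch: lvalues are encoded as access paths such as ("t", 2, 0) instead of AST nodes, and Python's built-in tuple hashing (via dict keys) stands in for the AST-node hashes; the function names are hypothetical.

```python
def is_prefix(a, b):
    """True if access path `a` equals `b` or is an ancestor of `b`."""
    return len(a) <= len(b) and b[:len(a)] == a


def find_conflict(lvalues):
    """Return the first pair of lvalues that write overlapping storage, or None."""
    seen = []
    for lv in lvalues:
        for prev in seen:
            if is_prefix(prev, lv) or is_prefix(lv, prev):
                return (prev, lv)
        seen.append(lv)
    return None


def shared_parents(lvalues):
    """Parents ("rvalue inside lvalue") used by more than one lvalue."""
    counts = {}                     # dict keyed by hashable path, insertion-ordered
    for lv in lvalues:
        for n in range(1, len(lv)):
            counts[lv[:n]] = counts.get(lv[:n], 0) + 1
    return [parent for parent, count in counts.items() if count > 1]


print(find_conflict([("a",), ("a",)]))              # (a, a) = rhs
print(find_conflict([("t", 1, 0), ("t", 1)]))       # (t.1.0, t.1) = rhs
print(shared_parents([("t", 2, 0), ("t", 2, 1)]))   # (t.2.0, t.2.1) = rhs
```

Every path returned by `shared_parents` is one that should be read and written back only once for the whole expression.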

Internals: built-in __expect_type() for testing purposes

Currently, the Tolk tester framework can test various "outputs" of the compiler: pass input and check output, validate Fift codegen, etc. But it cannot test compiler internals or the AST representation.

I've added the ability to have special functions that check/expose internal compiler state. The first (and currently the only one) is:

__expect_type(some_expr, "<type>");

Such a call has special treatment in a compilation process: compilation fails if this expression doesn't have the requested type.

It's intended to be used in tests only. Not present in stdlib.
