About the `operator` attribute in `mathics.builtin.Builtin` classes, and about the Textual input and output processing. #52

mmatera · 2022-09-19T12:41:22Z

mmatera
Sep 19, 2022
Maintainer

Regarding the use of the attribute operator, and the connection with MathicsScanner, I was thinking about it, and I guess that the best option is to store in it the WL-Unicode representation.

I will try to explain here how this works in WMA, based on the description in WR (https://reference.wolfram.com/language/tutorial/TextualInputAndOutput.html), and how I think we can implement it in Mathics.
The main problem with this kind of description is that several (non-equivalent) steps share the same name (for example, "parsing"), and that I do not always know the precise technical terms, but I am sure you can fill in the gaps.

[Input]

The first step in the evaluation process is the user input, in a certain front-end. In a text-based front-end, the input is a String. In the graphics interface of WMA, the input could also be some kind of rich text, represented in a WL boxed structure.

In WMA, the following step is then to apply the MakeExpression rule over the input. This should convert a String or a BoxExpression into a WL evaluable Expression. As our front-ends just allow "1D" inputs, let's consider just this case.

Unlike WMA, instead of applying MakeExpression rules, we send the string directly to the Mathics parser (which is the basic rule for MakeExpression too).
The Mathics parser now lives inside mathics-core, but it uses the tokenizer implemented in MathicsScanner. This tokenizer then uses the tables defined in that package to produce a sequence of tokens, which are then parsed. At that point, we can tell the tokenizer the encoding of the input interface, so that it interprets different characters accordingly.
At the output of the parser, we have then a mathics.core.Expression (or mathics.core.Atom) (let's refer to it as expr) with String elements encoded with WL-Unicode characters. In the rest of the evaluation, (except in the evaluation of expressions like MakeExpression[...] or ToString[...]) there is no need to look at the encoding or to use any table.
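To make the input step concrete, here is a minimal sketch of normalizing an input string to WL-Unicode before parsing. It is not the MathicsScanner API: `NAMED_TO_WL`, `UNICODE_TO_WL` and `to_wl_unicode` are hypothetical names, and the two table entries are illustrative stand-ins for the real tables.

```python
import re

# Hypothetical excerpt of a named-character table. The real tables live in
# MathicsScanner and are much larger; these entries are only illustrative.
NAMED_TO_WL = {
    "LeftVector": "\u21bc",     # same code point as standard Unicode
    "DifferentialD": "\uf74c",  # private-use code point in WL-Unicode
}

# Hypothetical map from standard-Unicode input characters to WL-Unicode.
UNICODE_TO_WL = {
    "\u2146": "\uf74c",  # a standard-Unicode "differential d"
}

# Matches named characters written as \[Name].
NAMED_CHAR = re.compile(r"\\\[([A-Za-z][A-Za-z0-9]*)\]")


def to_wl_unicode(text: str, encoding: str = "UTF-8") -> str:
    """Return text with known characters rewritten as WL-Unicode."""

    def replace_named(match):
        # Unknown names are left untouched rather than guessed at.
        return NAMED_TO_WL.get(match.group(1), match.group(0))

    text = NAMED_CHAR.sub(replace_named, text)
    if encoding.upper() != "ASCII":
        # Only non-ASCII front ends can send standard-Unicode operators.
        text = "".join(UNICODE_TO_WL.get(ch, ch) for ch in text)
    return text
```

After this step, nothing downstream needs to know which encoding the front end used.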

[Processing]

Now the expression is evaluated by calling the method expr.evaluate(evaluation). The result of this step is another expression, result. To get something that we can show in the front-end, we first have to pass result through the formatter.

[Output]

The formatter must convert result into a mathics.builtin.core.BoxMixin (mathics.builtin.base.BoxExpression) object (let's call it boxed_result). In Mathics, to do this, the expression MakeBoxes[result, StandardForm] is evaluated. Inside the MakeBoxes rules, an evaluation of Format[result] is done. This applies the formatting rules associated with the elements in the expression. At this point, an expression like
Equivalent[a,b,c]
is converted into
Infix[{"a","b","c"}, op] where op is a String (in the WL-Unicode encoding). Let's call the result of applying Format formattedResult.

Then, MakeBoxes rules are applied over formattedResult to produce a String or a BoxExpression (let's call it boxedResult). Here, String elements are still encoded as WL-Unicode.
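The two passes above can be sketched with a toy model. This is not mathics-core code: expressions are plain tuples, `FORMAT_OPERATORS`, `apply_format` and `make_boxes` are hypothetical names, and the code point used for the operator is illustrative.

```python
# Toy model: an expression is a (head, *elements) tuple and an atom is a str.
# FORMAT_OPERATORS is a hypothetical stand-in for the Format rules; the value
# is the operator as a WL-Unicode string (illustrative code point).
FORMAT_OPERATORS = {"Equivalent": "\u29e6"}


def apply_format(expr):
    """First pass: rewrite known operator heads into an Infix form."""
    if isinstance(expr, str):
        return expr
    head, *elements = expr
    elements = [apply_format(e) for e in elements]
    op = FORMAT_OPERATORS.get(head)
    if op is None:
        return (head, *elements)
    return ("Infix", elements, op)


def make_boxes(formatted):
    """Second pass: turn a formatted expression into a flat string "box"."""
    if isinstance(formatted, str):
        return formatted
    if formatted[0] == "Infix":
        _, elements, op = formatted
        return f" {op} ".join(make_boxes(e) for e in elements)
    head, *elements = formatted
    return f"{head}[{', '.join(make_boxes(e) for e in elements)}]"
```

Note that neither pass looks at the front-end encoding: the operator string stays in WL-Unicode until the very last step.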

Notice that here it could be important to distinguish regular evaluation from the application of a specific sequence of replacement rules. Format[expr] is not evaluated, in the sense that the process is not equivalent to Expression(SymbolFormat, expr).evaluate(evaluation). In formatting, Format rules are applied in a specific order over the Expression. A similar process happens in MakeBoxes.

(I have run several experiments to check this behavior. I am going to organize them and put them below, as separate comments, to avoid making this presentation much longer than it already is.)

[Interpret the formatted output on the front-end]

In WMA, the boxedResult is processed by the front end to produce the textual/graphical output. In Mathics, this is done by calling the boxedResult.boxes_to_format() method. In a text-only output, the result of this method should be a (Python) str encoded with the front-end encoding. This can be done by passing to the method the encoding as a parameter. Notice that in this last stage, we do not need the mathics.core.Definitions object anymore, but the tables in MathicsScanner.

Comment aside: The resulting str object should be equivalent to the value of the String object obtained from ToString[expression, CharacterEncoding->encoding].

@rocky, regarding the comment at the end of #43 (this one)
So, in your example, if ToExpression["\"\\[LeftVector]\""] is introduced in the interpreter, then, the apply method of ToExpression parses the argument using mathics_scanner, to produce the String ↼, (encoded as a WL-unicode character) and returns that String as output.

Now, the expression is formatted (now, it consists of adding double quotes to the string value). Then, MakeBoxes rules are applied (in this case, they are trivial) and produce the string "↼".

The following step is to call the boxes_to_format method. This is the point at which the (system) encoding matters: if the encoding is "ASCII", then (the String) "↼" should be translated to a (Python) str, with the WL-Unicode character replaced by "\[LeftVector]", or by an ASCII equivalent. If the encoding is UTF-8, then a (standard) UTF-8 equivalent should replace the character. And if there is another encoding, another replacement rule should be applied here.
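As a rough illustration of this last stage, the sketch below translates a WL-Unicode string for a given front-end encoding. It is not the actual boxes_to_format implementation: `WL_CHARS` and `render` are hypothetical names, the table entries are illustrative, and the hexadecimal fallback for characters missing from the table is my own assumption.

```python
# Hypothetical per-character data; the real tables live in MathicsScanner.
# Each WL-Unicode character maps to (name, standard-Unicode equivalent).
WL_CHARS = {
    "\u21bc": ("LeftVector", "\u21bc"),
    "\uf74c": ("DifferentialD", "\u2146"),
}


def render(text: str, encoding: str) -> str:
    """Translate a WL-Unicode string for a front end using encoding."""
    ascii_only = encoding.upper() in ("ASCII", "US-ASCII")
    out = []
    for ch in text:
        if ch in WL_CHARS:
            name, unicode_equivalent = WL_CHARS[ch]
            out.append(f"\\[{name}]" if ascii_only else unicode_equivalent)
        elif ascii_only and ord(ch) > 127:
            # Assumed fallback for characters without a table entry:
            # a four-digit hexadecimal escape (enough for the BMP only).
            out.append(f"\\:{ord(ch):04x}")
        else:
            out.append(ch)
    return "".join(out)
```

For example, `render('"\u21bc"', "ASCII")` gives the quoted named form, while the same call with "UTF-8" returns the character itself.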

rocky · 2022-09-19T13:59:06Z

rocky
Sep 19, 2022
Maintainer

Regarding the use of the attribute operator, and the connection with MathicsScanner, I was thinking about it, and I guess that the best option is to store in it the WL-Unicode representation. ...

While all of this discussion is useful, especially with regard to how WMA works, I don't see that any of this is related to either the Mathics scanner or to character tables. (I bundled the two in one project, but they could have been separated into two projects, and maybe in the future that would be a good idea).

Character tables have no notion of MakeBoxes, or formatting or any of that. The only thing they care about is whether the properties of characters are represented. The scanner's job is just to produce tokens for a parser. Again this has nothing to do with specific built-in functions like MakeBoxes, Infix, or Expressions or formatting. Right now all of this happens in mathics-core.

The second thing I'd say about "best option to store" is that it really doesn't matter. You could use anything that uniquely describes the operator. So the operator name, its ASCII sequence, or the WL Unicode would all work. Standard Unicode would probably work too, but we'd have to make sure that two operators don't map to the same Unicode. I think that's the case, but it is best to just avoid the problem altogether. We know, for example, that the operator names and ASCII sequences have to be unique.
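The uniqueness requirement is easy to check mechanically. A minimal sketch, assuming a hypothetical table that lists each operator's candidate representations (the names `OPERATORS` and `find_collisions`, and the entries, are made up for illustration):

```python
from collections import defaultdict

# Hypothetical operator table: operator name -> candidate representations.
OPERATORS = {
    "Equivalent": {"ascii": "\\[Equivalent]", "wl_unicode": "\u29e6"},
    "LeftVector": {"ascii": "\\[LeftVector]", "wl_unicode": "\u21bc"},
    "Plus": {"ascii": "+", "wl_unicode": "+"},
}


def find_collisions(table, key):
    """Return {representation: [names]} for every representation that is
    shared by more than one operator under the given key."""
    seen = defaultdict(list)
    for name, forms in table.items():
        seen[forms[key]].append(name)
    return {rep: names for rep, names in seen.items() if len(names) > 1}
```

An empty result for a given key means that representation could serve as the value of the operator attribute.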

Since the code currently uses the ASCII sequence, I don't see a problem in keeping that. If you want to reduce vagueness around this name and make massive changes (which is what would have to be done with using the WL unicode), then change the name "operator" to "ascii_operator_string".

1 reply

mmatera Sep 19, 2022
Maintainer Author

Exactly, in the end, we can choose any representation, as long as each operator has a unique interpretation. The WL-Unicode choice just avoids making a translation. But we could also code the attribute as the ASCII CharacterName and then, when the Builtin is loaded, generate the WL-Unicode representation.

rocky · 2022-09-19T14:03:53Z

rocky
Sep 19, 2022
Maintainer

Under "[Input]": for simplicity, let's start out assuming only character string input. We have to learn to walk before we can run.

This too has been a pervasive problem: complicating the problem so that things become harder at the outset.

I don't think any generality in a solution is lost if we start out with string input only. For other kinds of input, other kinds of scanners and parsers can be written.

2 replies

mmatera Sep 19, 2022
Maintainer Author

Yes, I put this in the description just to avoid being "myopic", but the subsequent description assumes that the input is just a String.

rocky Sep 19, 2022
Maintainer

No no, NO!

There is a big difference between seeing a problem and deciding that this can be a self-contained item that can be separable, and not understanding that there is a bigger picture.

An often used teaching technique is to deliberately simplify a problem or tell "lies". For example you may say conservation of energy when it should include interchange of mass and energy. Here you understand the full situation, and the lie is a rough approximation, but in order to make (pedagogical) progress you deliberately remove that knowing that it can be added later without having to drastically change things.

Yes, I am glad you see the full picture with respect to different kinds of input. Thanks for pointing that out.

In the past, on other things we needed to make drastic changes because of a lack of understanding of the problem.

rocky · 2022-09-19T14:12:25Z

rocky
Sep 19, 2022
Maintainer

So, in your example, if ToExpression["\"\\[LeftVector]\""] is introduced in the interpreter, then, the apply method of ToExpression parses the argument using mathics_scanner, to produce the String ↼, (encoded as a WL-unicode character) and returns that String as output.

Right - and what I am saying is that it is a gross inefficiency to have to go back to the scanner just to do a table lookup to convert a name like "LeftVector" into a particular character string. And the ToExpression above doesn't take the $CharacterEncoding parameter, which this built-in function would take, even if only optionally, at the Built-in level. On the Python side though, it would be explicit, which is clearer and facilitates caching.

Instead, define a built-in function to do this, unless there is already one in WMA (which I doubt).
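On the Python side, such a lookup could look roughly like the sketch below. The function name, the table, and the encoding labels ("ASCII", "WL-Unicode", anything else treated as standard Unicode) are all hypothetical; only `functools.lru_cache` is a real library call.

```python
from functools import lru_cache

# Hypothetical table keyed by character name; the entries are illustrative.
CHARACTERS = {
    "LeftVector": {"wl_unicode": "\u21bc", "unicode": "\u21bc"},
    "DifferentialD": {"wl_unicode": "\uf74c", "unicode": "\u2146"},
}


@lru_cache(maxsize=None)
def character_for_name(name: str, encoding: str) -> str:
    """Return the string for a character name in an explicit encoding.

    Raises KeyError for a name that is not in the table.
    """
    entry = CHARACTERS[name]
    if encoding == "ASCII":
        return f"\\[{name}]"
    if encoding == "WL-Unicode":
        return entry["wl_unicode"]
    # Any other encoding is treated as able to carry standard Unicode.
    return entry["unicode"]
```

Because the encoding is an explicit argument, it is part of the cache key, so repeated lookups never touch the scanner.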

2 replies

mmatera Sep 19, 2022
Maintainer Author

OK, that is a good point: for some reason, ToExpression does not have that parameter in WMA. In any case, what I was suggesting was to have a Python function (in MathicsScanner, or in MathicsCore) ensuring that the input to the parser (and the tokenizer) is codified in a canonical way. And yes, it should be explicitly called.

rocky Sep 19, 2022
Maintainer

for some reason, ToExpression does not have that parameter in WMA

I would imagine that is because WMA defines and wants to control its entire environment. Therefore, instead of using standard Unicode it defined new symbols in the "user-defined" area. To be fair, it is possible or likely that at the time certain symbols were defined, there were not appropriate ones in Standard Unicode, but this came later. But then WMA didn't go back and remove these when they did appear. I suppose you can't fault them too hard for that.

However, what is good for and right for WMA isn't necessarily what is good and right for Mathics. We live in a world where WMA Unicode is not the standard, but "Standard" Unicode is, and sometimes not even that but plain ASCII. Or encodings that are somewhere in between.

ensuring that the input to the parser (and the tokenizer) is codified in a canonical way

For the tokenizer, no. On input we allow several different Unicode representations of operators as well as their ASCII equivalents. If you want to also allow WMA's nonstandard Unicode input, as we just did for DifferentialD, well, okay.

The tokens currently are standard - we use the operator name. When you run mathics --full-form you won't see a single Unicode or ASCII representation of an operator, but rather the operator name. (--full-form doesn't change the behavior, it just shows what is getting done: parsing uses FullForm input.)
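A toy version of that idea: every accepted spelling of an operator maps to the same operator name, and only the name is carried in the token. This is not the Mathics scanner; `SPELLINGS` and `operator_tokens` are hypothetical names and the code points are illustrative.

```python
# Hypothetical spellings table: each accepted input form of an operator maps
# to the same operator name, which is all that the token carries.
SPELLINGS = {
    "->": "Rule",      # ASCII sequence
    "\u2192": "Rule",  # a standard-Unicode arrow
    "\uf522": "Rule",  # a WL-Unicode private-use character
    "+": "Plus",
}


def operator_tokens(text):
    """Return the operator name for each known operator spelling in text."""
    spellings = sorted(SPELLINGS, key=len, reverse=True)  # longest match first
    tokens = []
    i = 0
    while i < len(text):
        for spelling in spellings:
            if text.startswith(spelling, i):
                tokens.append(SPELLINGS[spelling])
                i += len(spelling)
                break
        else:
            # Not an operator; a real scanner would emit other token kinds.
            i += 1
    return tokens
```

Whatever the input spelling, the parser only ever sees "Rule" or "Plus".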

Right now I don't see an upside to what you suggest. The downside I see is that whereas right now we don't have to change the operator attribute on an Operator, if we went to something else we'd have to do a massive change to add this, and the value. And there is a downside in that all of our terminals don't support those altered characters either. So debugging this would add an annoyance where there isn't one right now.

It doesn't sound to me like you understand that what we have is purely an output (formatting, or boxing) problem, not an input problem. Again, this has nothing to do with either the scanner or the parser. After mathics-core has worked out a way to handle this, there might be new or different translation tables that it would be convenient to produce here. But this is more a second-order or side-effect kind of thing.
