Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental parsing produces two attributes when only one expected. #64

Open
grokys opened this issue Oct 23, 2024 · 4 comments
Open

Incremental parsing produces two attributes when only one expected. #64

grokys opened this issue Oct 23, 2024 · 4 comments

Comments

@grokys
Copy link
Contributor

grokys commented Oct 23, 2024

This is a bit of an obscure one, but if:

  • One has an incomplete attribute with a name of more than 4 characters, but no value (i.e. the user is typing the attribute name)
  • And then adds a dot after it
  • And incrementally parses
  • Then two attributes are created instead of 1

The following test reproduces the behavior:

        [Fact]
        public void IncrementalParsingIsSameAsFullParsing_DotAfterAttributeName()
        {
            const string xml = "<A><B Names/></A>";
            var full = Parser.ParseText(xml);

            // Attribute name needs to be >= 5 characters to trigger the bug.
            var textToEdit = "Names";
            var insertIndex = xml.IndexOf(textToEdit) + textToEdit.Length;
            var newXml = xml.Insert(insertIndex, ".");
            var change = new TextChangeRange(new TextSpan(insertIndex, 0), 1);
            var incremental = Parser.ParseIncremental(newXml, new[] { change }, full);
            var fullAfterModification = Parser.ParseText(newXml);

            Assert.Single(fullAfterModification.Root.Elements.Single().Attributes);
            Assert.Single(incremental.Root.Elements.Single().Attributes);
        }
@grokys
Copy link
Contributor Author

grokys commented Oct 23, 2024

The reason this only happens when the attribute has > 4 characters seems to be that Scanner.MaxTokensLookAheadBeyondEOL is 4 characters. This causes _affectedRanges to include a span which prevents the previous Name attribute from being recycled, the span does not filter out the attribute when it's called "Names".

https://github.com/KirillOsenkov/XmlParser/blob/main/src/Microsoft.Language.Xml/Blender.cs#L96

@grokys
Copy link
Contributor Author

grokys commented Oct 23, 2024

Actually, this is more serious than originally thought. The inserted character does not need to be a . - it can be any valid attribute name character. So editing "Names" -> "Namesx" will cause two attributes to be produced, "Names" and "x":

        [Fact]
        public void IncrementalParsingIsSameAsFullParsing_AppendingToAttributeName()
        {
            const string xml = "<A><B Change/></A>";
            var full = Parser.ParseText(xml);

            // Attribute name needs to be >= 5 characters to trigger the bug.
            var textToEdit = "Change";
            var insertIndex = xml.IndexOf(textToEdit) + textToEdit.Length;
            var newXml = xml.Insert(insertIndex, "s");
            var change = new TextChangeRange(new TextSpan(insertIndex, 0), 1);
            var incremental = Parser.ParseIncremental(newXml, new[] { change }, full);
            var fullAfterModification = Parser.ParseText(newXml);

            Assert.Single(fullAfterModification.Root.Elements.Single().Attributes);
            Assert.Single(incremental.Root.Elements.Single().Attributes); // <-- FAILS
        }

@grokys
Copy link
Contributor Author

grokys commented Oct 23, 2024

This seems to be a problem in marking the affected ranges dirty:

  • Blender.ExpandToParentChain calls root.FindNode(change.Span.Start, includeTrivia: false);
  • This returns the /> of B
  • It marks <B> as dirty
  • But it doesn't mark the Change attribute as dirty

A simple fix is to change:

https://github.com/KirillOsenkov/XmlParser/blob/main/src/Microsoft.Language.Xml/Blender.cs#L45

To

var node = root.FindNode(change.Span.Start - 1, includeTrivia: false);

i.e. mark the node of the character before the change dirty. This fixes the issue, and all other tests still pass but I'm not convinced it's the correct thing to do. I'm not sure I know enough about how this was intended to work to go much further (and there aren't enough incremental parsing tests to allow me to be confident I'll not be breaking something else)

@KirillOsenkov
Copy link
Owner

KirillOsenkov commented Oct 23, 2024

I'd say I wouldn't worry about breaking stuff, because if we break something later we'll deal with it then.

Honestly I know about as much about incremental parsing as you do.

One thing we could do later is fuzz to randomly mutate a simple XML file, do a full parse and an incremental parse, and compare the results. If they differ, we found a bug.

I usually start with this file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<X>
  <n:T></n:T>
  <X/>
  <A.B></A.B>
  <A B="a"></A>
  <A>&#x03C0;</A>
  <A>a &lt;</A>
  <A><![CDATA[bar]]></A>
  <!-- comment -->
</X>

and then we can select a series of edits, either delete a char or insert a char (and favor chars such as < > " = and /).

For now since you've already built some understanding about this bug, just send a PR and we can merge and move on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants