Refactor markdown parsing to correctly support nested data #112

Closed
taylorhadden opened this issue Nov 15, 2024 · 3 comments
Labels
tangent-electron Issues relating to the Tangent Application itself.

Comments

@taylorhadden
Contributor

Tangent's markdown parser is a seat-of-my-pants bespoke hack. It (mostly) works! However, there are multiple gaps in its functionality. Some of these have open issues:

There are other gaps in parsing that have put limitations on what Tangent does, e.g. tables, footnotes, link ids.

A potential solution is to adopt an existing parser and add the customizations we want. @lezer/markdown looks like a good place to start.

The downstream effects of this will be:

  • Ensuring that the structure system still gets the information it needs from the parsed data.
  • A significant/complete rewrite of how parsed data integrates with typewriter.
    • Bonus: if this fails, CodeMirror uses the lezer syntax tree to do its thing.
  • Optional: Convert the grammar parsing of the query system to also use lezer, simplifying the stack.
@taylorhadden taylorhadden added the tangent-electron Issues relating to the Tangent Application itself. label Nov 15, 2024
@taylorhadden taylorhadden added this to the Tangent v0.9.x milestone Nov 15, 2024
@taylorhadden taylorhadden moved this to In progress in Tangent Development Nov 15, 2024
@taylorhadden
Contributor Author

Started digging into @lezer/markdown. The primary engine is a nearly 2k-line file. I immediately noticed that some of the key departures Tangent makes from markdown (e.g. indentation, not joining lines as blocks) may be problematic.

I think using lezer is the right direction, but I'm not convinced that this particular package is something I want to start from. At the very least, I'd be digging into the internals and modifying them in the same way that I'm using typewriter. A few thoughts:

  • I don't expect @lezer/markdown to change much. Therefore, the value in forking it is low, as I'm not expecting to integrate upstream changes.
  • I expect that Tangent's divergences from markdown will continue to grow. This makes me want to fully own and understand that parser.

Experimentation will continue in the direction of making my own lezer-based parser.

@taylorhadden
Contributor Author

I've walked away from creating a traditional syntax tree. I've pivoted to a smaller change that allows individual bits of parsing logic to control what kind of parsing can happen based on the text they encounter. This should allow some level of nested & contextual logic without invoking a truly massive refactor.
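
As a rough illustration of that idea (not Tangent's actual code), here is a minimal sketch in which each rule declares which other rules may run inside it, so the text a rule matches determines what parsing is allowed next. Every name here (`Rule`, `ParsedNode`, `rules`, `parse`) is hypothetical and invented for the example, and the three inline rules are a toy subset.

```typescript
// Hypothetical sketch: none of these names come from Tangent's codebase.
interface Rule {
  name: string
  open: string      // text that starts this rule's context
  close: string     // text that ends it
  allowed: string[] // names of the rules permitted inside this context
}

interface ParsedNode {
  type: string
  children: (ParsedNode | string)[]
}

// Toy rule set: bold and italic may nest, code allows nothing inside.
const rules: Rule[] = [
  { name: 'bold', open: '**', close: '**', allowed: ['italic', 'code'] },
  { name: 'italic', open: '_', close: '_', allowed: ['bold', 'code'] },
  { name: 'code', open: '`', close: '`', allowed: [] },
]

function parse(text: string): ParsedNode {
  const root: ParsedNode = { type: 'root', children: [] }
  // Each entry pairs an open node with the rule that created it.
  const stack: { node: ParsedNode; rule: Rule | null }[] = [{ node: root, rule: null }]
  let buffer = ''
  let i = 0

  // Move any pending plain text into the innermost open node.
  const flush = () => {
    if (buffer) {
      stack[stack.length - 1].node.children.push(buffer)
      buffer = ''
    }
  }

  while (i < text.length) {
    const top = stack[stack.length - 1]
    const active = top.rule

    // 1. The active rule gets the first chance to close its own context.
    if (active && text.startsWith(active.close, i)) {
      flush()
      stack.pop()
      i += active.close.length
      continue
    }

    // 2. Only the rules the active context allows may open here.
    const candidates = active
      ? rules.filter(r => active.allowed.indexOf(r.name) !== -1)
      : rules
    const match = candidates.find(r => text.startsWith(r.open, i))
    if (match) {
      flush()
      const node: ParsedNode = { type: match.name, children: [] }
      top.node.children.push(node)
      stack.push({ node, rule: match })
      i += match.open.length
      continue
    }

    // 3. Anything else is plain text.
    buffer += text[i]
    i++
  }

  // Contexts left unclosed at the end simply stay in the tree as they are.
  flush()
  return root
}

console.log(JSON.stringify(parse('a **b `_c_` d** e'), null, 2))
```

In this sketch the `_c_` inside the code span stays literal text because the `code` rule allows no inner rules, while the same underscores inside bold would open an `italic` node. That gives a degree of nesting and context sensitivity without building a full grammar-driven syntax tree.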

@taylorhadden taylorhadden moved this from In progress to Pending Release in Tangent Development Dec 22, 2024
@taylorhadden taylorhadden moved this from Pending Release to In review in Tangent Development Dec 22, 2024
@taylorhadden
Contributor Author

This is implemented in v0.9.0-alpha.1 with some fixes in alpha 2 & 3. Any further issues with the conversion should be new bugs.

@github-project-automation github-project-automation bot moved this from In review to Done in Tangent Development Dec 30, 2024
Projects
Status: Done