Refactor markdown parsing to correctly support nested data #112

taylorhadden · 2024-11-15T15:08:36Z

Tangent's markdown parser is a seat-of-my-pants bespoke hack. It (mostly) works! However, there are multiple gaps in its functionality. Some of these have open issues:

- Links don't support inline styling (#100): inline styles don't work inside links because parsing isn't recursive.
- RGB color indicators in style attributes of raw HTML are processed as tags (#68): there is no support for HTML parsing at all.

There are other gaps in parsing that have put limitations on what Tangent does, e.g. tables, footnotes, link ids.

A potential solution is to adopt an existing parser and add the customizations we want. @lezer/markdown looks like a good place to start.

The downstream effects of this will be:

- Ensuring that the structure system still gets the information it needs from the parsed data.
- A significant/complete rewrite of how parsed data integrates with typewriter.
  - Bonus: if this fails, CodeMirror uses the lezer syntax tree to do its thing.
- Optional: Converting the grammar parsing of the query system to also use lezer, to simplify the stack.


taylorhadden · 2024-11-16T14:54:03Z

Started digging into @lezer/markdown. The primary engine is a single file of nearly 2,000 lines. I immediately noticed that some of the key departures Tangent makes from markdown (e.g. indentation, not joining lines as blocks) may be problematic.

I think using lezer is the right direction, but I'm not convinced that this particular package is something I want to start from. At the very least, I'd be digging into the internals and modifying them in the same way that I'm using typewriter. A few thoughts:

- I don't expect @lezer/markdown to change much. Therefore, the value in forking it is low, as I'm not expecting to integrate upstream changes.
- I expect that Tangent's divergences from markdown will continue to grow. This makes me want to fully own and understand that parser.

Experimentation will continue in the direction of making my own lezer-based parser.

taylorhadden · 2024-12-21T20:03:31Z

I've walked away from creating a traditional syntax tree. I've pivoted to a smaller change that allows individual bits of parsing logic to control what kind of parsing can happen based on the text they encounter. This should allow some level of nested & contextual logic without invoking a truly massive refactor.
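As a rough illustration of that idea (a minimal sketch, not Tangent's actual implementation), each rule below declares which other rules may run on the text it captures. Every name here (`Rule`, `allowedInside`, `parseInline`, the specific patterns) is hypothetical and chosen only for the example. It shows how link text can pick up inline styles through recursion while inline code stays opaque.

```typescript
// Hypothetical sketch: these names are illustrative, not Tangent's actual API.
type RuleName = "code" | "link" | "bold" | "italic";

interface ParsedNode {
  type: RuleName | "text";
  text?: string;           // set on "text" nodes
  href?: string;           // set on "link" nodes
  children?: ParsedNode[]; // set on every non-text node
}

interface Rule {
  name: RuleName;
  // Anchored at the current position; capture group 1 is the inner text.
  pattern: RegExp;
  // The rules that are allowed to run on the inner text.
  allowedInside: RuleName[];
  // Capture group holding a link target, if the rule has one.
  hrefGroup?: number;
}

const RULES: Rule[] = [
  // Inline code is opaque: nothing inside it is parsed further.
  { name: "code", pattern: /^`([^`]+)`/, allowedInside: [] },
  // Link text may carry inline styles, but not another link.
  { name: "link", pattern: /^\[([^\]]+)\]\(([^)\s]+)\)/, allowedInside: ["code", "bold", "italic"], hrefGroup: 2 },
  { name: "bold", pattern: /^\*\*(.+?)\*\*/, allowedInside: ["code", "link", "italic"] },
  { name: "italic", pattern: /^_(.+?)_/, allowedInside: ["code", "link", "bold"] },
];

const ALL_RULES: RuleName[] = ["code", "link", "bold", "italic"];

function parseInline(text: string, allowed: RuleName[]): ParsedNode[] {
  const nodes: ParsedNode[] = [];
  let pos = 0;
  let plainStart = 0; // start of the pending run of plain text

  const flushPlain = (end: number): void => {
    if (end > plainStart) {
      nodes.push({ type: "text", text: text.slice(plainStart, end) });
    }
  };

  while (pos < text.length) {
    const rest = text.slice(pos);
    let consumed = 0;
    for (const rule of RULES) {
      if (allowed.indexOf(rule.name) === -1) continue; // not valid in this context
      const m = rule.pattern.exec(rest);
      if (!m) continue;
      flushPlain(pos);
      // The matching rule decides what may be parsed inside its content.
      const node: ParsedNode = {
        type: rule.name,
        children: parseInline(m[1], rule.allowedInside),
      };
      if (rule.hrefGroup !== undefined) node.href = m[rule.hrefGroup];
      nodes.push(node);
      consumed = m[0].length;
      break;
    }
    if (consumed > 0) {
      pos += consumed;
      plainStart = pos;
    } else {
      pos += 1; // no rule matched here; keep scanning as plain text
    }
  }
  flushPlain(text.length);
  return nodes;
}

// Usage: bold text inside a link is parsed; markup inside inline code is not.
console.log(JSON.stringify(parseInline("[a **b**](u) and `**raw**`", ALL_RULES), null, 2));
```

The point of the design is that nesting falls out of the per-rule `allowedInside` list rather than from a full syntax tree, which keeps the change small while still giving contextual control.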

taylorhadden · 2024-12-30T16:41:12Z

This is implemented in v0.9.0-alpha.1 with some fixes in alpha 2 & 3. Any further issues with the conversion should be new bugs.

taylorhadden added the tangent-electron label (Issues relating to the Tangent Application itself.) Nov 15, 2024

taylorhadden added this to the Tangent v0.9.x milestone Nov 15, 2024

taylorhadden moved this to In progress in Tangent Development Nov 15, 2024

taylorhadden added this to Tangent Development Nov 15, 2024

taylorhadden moved this from In progress to Pending Release in Tangent Development Dec 22, 2024

taylorhadden moved this from Pending Release to In review in Tangent Development Dec 22, 2024

taylorhadden closed this as completed Dec 30, 2024

github-project-automation bot moved this from In review to Done in Tangent Development Dec 30, 2024
