Abstract Syntax Tree as JSON
In wtf_wikipedia, an Abstract Syntax Tree (AST) is a tree representation of the MediaWiki syntax of the source text, i.e. the wiki markup (wikitext). wtf_wikipedia stores the data of a wiki article in a JSON structure, which is valuable for managing the extracted content elements. The AST can be represented in JSON as well: each node of the tree denotes a content element (e.g. paragraph, header/title, image, mathematical expression) occurring in the source text downloaded via the MediaWiki API with wtf.fetch(...). The syntax tree is "abstract" in the sense that it does not represent any particular output format in detail (e.g. HTML, LaTeX, Markdown). The AST nodes encode the elements of the wiki markup, and the tree structure can be used to generate different output formats by applying the appropriate node handler (for titles, images, sentences, and so on) to the AST.
As with programming languages, an abstract syntax tree can be derived from a concrete syntax tree, which is traditionally generated by parsing a given string that complies with a defined grammar.
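To illustrate the difference, the following minimal sketch parses a single wikitext heading line into a concrete syntax tree that keeps the `=` delimiter tokens, and then derives an abstract node from it. The function names `parseHeadingConcrete` and `toAbstract`, the node shapes, and the `level` field are assumptions made for this illustration; they are not part of wtf_wikipedia.

```javascript
// Hypothetical helpers for illustration only; not part of wtf_wikipedia.

// Parse one wikitext heading line, e.g. "== My Title ==", into a concrete
// syntax tree that keeps every token of the source, including delimiters.
function parseHeadingConcrete(line) {
  const match = /^(={1,6})\s*(.*?)\s*(={1,6})\s*$/.exec(line);
  if (!match || match[1].length !== match[3].length) {
    return null; // not a well-formed heading line
  }
  return {
    type: "heading",
    children: [
      { type: "open", value: match[1] },
      { type: "text", value: match[2] },
      { type: "close", value: match[3] }
    ]
  };
}

// Derive an abstract node: the delimiter tokens disappear and only their
// meaning (the heading level) and the heading text remain.
function toAbstract(concrete) {
  const open = concrete.children.find((child) => child.type === "open");
  const text = concrete.children.find((child) => child.type === "text");
  return {
    type: "title",
    level: open.value.length, // assumed extra field, not in the example AST
    value: text.value,
    children: []
  };
}

const concreteHeading = parseHeadingConcrete("== My Title ==");
console.log(JSON.stringify(toAbstract(concreteHeading)));
```

The concrete tree can reproduce the original string exactly, while the abstract node only records what an export handler needs to know.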
The following example is not currently an available feature of wtf_wikipedia. It could serve as a basis for generating export formats that will never be implemented in wtf_wikipedia itself.
Wiki Page >-> wtf_wikipedia.js >-> AST >-> ast2odf.js >-> Open Document Format
{
  language: "en",
  domain: "wikiversity",
  article: "Water",
  ast: [
    {
      type: "paragraph",
      value: "",
      children: [
        {
          type: "sentence",
          value: "My first sentence.",
          children: []
        },
        {
          type: "sentence",
          value: "My second sentence.",
          children: []
        }
      ]
    },
    {
      type: "title",
      value: "My Title",
      children: []
    },
    {
      type: "math",
      value: "\\sum_{k=1}^{n} k^2",
      children: []
    }
  ]
}
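The node-handler idea can be sketched as follows. This is a minimal, hypothetical renderer and not part of wtf_wikipedia: the names `handlers`, `renderNode` and `renderAst`, and the choice of HTML and LaTeX as target formats, are assumptions made for illustration. Each output format supplies one handler per node type, and a generic walker applies them to the tree from the example above.

```javascript
// Hypothetical sketch; none of these names exist in wtf_wikipedia.

// Escape the characters that are special in HTML text content.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// One handler per output format and node type.
// Each handler receives the node and its already rendered children.
const handlers = {
  html: {
    paragraph: (node, kids) => "<p>" + kids.join(" ") + "</p>",
    sentence: (node) => escapeHtml(node.value),
    title: (node) => "<h2>" + escapeHtml(node.value) + "</h2>",
    math: (node) => "<code>" + escapeHtml(node.value) + "</code>"
  },
  // LaTeX special characters in plain text are not escaped in this sketch.
  latex: {
    paragraph: (node, kids) => kids.join(" ") + "\n",
    sentence: (node) => node.value,
    title: (node) => "\\section{" + node.value + "}",
    math: (node) => "\\[" + node.value + "\\]"
  }
};

// Render the children first, then hand them to the handler of this node.
function renderNode(node, format) {
  const handler = handlers[format] && handlers[format][node.type];
  if (!handler) {
    throw new Error("No " + format + " handler for node type: " + node.type);
  }
  // A missing or null children field is treated like an empty array.
  const kids = (node.children || []).map((child) => renderNode(child, format));
  return handler(node, kids);
}

// Render all top-level nodes of a document object with an `ast` array.
function renderAst(doc, format) {
  return doc.ast.map((node) => renderNode(node, format)).join("\n");
}

// Usage with the example document.
const example = {
  language: "en",
  domain: "wikiversity",
  article: "Water",
  ast: [
    {
      type: "paragraph",
      value: "",
      children: [
        { type: "sentence", value: "My first sentence.", children: [] },
        { type: "sentence", value: "My second sentence.", children: [] }
      ]
    },
    { type: "title", value: "My Title", children: [] },
    { type: "math", value: "\\sum_{k=1}^{n} k^2", children: [] }
  ]
};

console.log(renderAst(example, "html"));
console.log(renderAst(example, "latex"));
```

Keeping the walker separate from the handler tables means that a further export format, such as the Open Document Format mentioned above, would only need its own table of handlers.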
- Parsing: the concepts are based on Parsoid - https://www.mediawiki.org/wiki/Parsoid
- Output: the concepts are based on Pandoc, the "swiss-army knife" of document conversion developed by John MacFarlane - https://www.pandoc.org