ContentList Parsing
The ContentList is the most generic element of the Abstract Syntax Tree (AST). This document explains how wiki source is parsed into a list of content elements.
On an abstract level, ContentList is a branching element in the AST that preserves the sequence in which the content elements appear in the wiki source text.
Parsing content elements into a ContentList can be performed on different levels of detail in the AST:
- on block level, as described below.
- on sentence level, decomposing a sentence into sentence parts, inline citations, inline math, inline icons, and so on. In this sense a sentence is a concatenation of content elements. Citations in particular have a specific location in the text, and even in plain text a citation could be inserted as [3] to reference the bibliography.
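The sentence-level idea can be sketched as follows. This is only an illustration under assumptions: the function name parseSentence, the node types Text, InlineMath and Citation, and the plain-array result are hypothetical and not part of wtf_wikipedia.

```javascript
// Sketch: decompose a sentence into a sequence of content elements.
// Recognized here: inline math (<math>...</math>) and numeric citations ([3]).
function parseSentence(sentence) {
  const pattern = /<math>(.*?)<\/math>|\[(\d+)\]/g;
  const parts = [];
  let last = 0; // end position of the previous match
  let match;
  while ((match = pattern.exec(sentence)) !== null) {
    if (match.index > last) {
      // plain text between two recognized elements
      parts.push({ type: "Text", value: sentence.substring(last, match.index) });
    }
    if (match[1] !== undefined) {
      parts.push({ type: "InlineMath", value: match[1] });
    } else {
      parts.push({ type: "Citation", value: match[2] });
    }
    last = pattern.lastIndex;
  }
  if (last < sentence.length) {
    // remaining text behind the last recognized element
    parts.push({ type: "Text", value: sentence.substring(last) });
  }
  return parts;
}

const sentenceParts = parseSentence("The area is <math>k^2</math> as shown in [3].");
console.log(sentenceParts.map((part) => part.type));
```

The order of the returned elements is the order in the source sentence, which is the property a ContentList is meant to preserve.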
The following description is not implemented in wtf_wikipedia 5.0 and serves as a basis for the software design process. Please feel free to adapt the content of this GitHub wiki and improve the description prior to implementation of the solution. In the sense of Agile Software Development, some code is inserted here to check the basic concept, but the code should be regarded more or less as pseudo code for understanding the proposal. The code is not meant to be a copy-and-paste resource for the implementation.
The following list describes the block level parsing of ContentList. Assume we have wiki source like this:
this is a line 1 of a TextBlock
this is a line 2 of a TextBlock
this is a line 3 of a TextBlock
* this is a line 1 of a BulletList
* this is a line 2 of a BulletList
this is a line 1 of the next TextBlock
this is a line 2 of the next TextBlock with inline math <math>k^2</math>
# this is a line 1 of a EnumList
# this is a line 2 of a EnumList
the following lines define a MathBlock as separate rendered line
:<math>
\sum_{k=1}^{n} k^2
+ n^3
</math>
this is the last line of the document
The parseBlock() method decomposes the wiki source into the following AST tree nodes. The key strategy for splitting into blocks is the first character of a line. This implies that the parseBlock() method needs a line split first, performed by:
let split = wiki.split(/\r?\n/);
The variable split is an array of lines, while wiki.split() is the split method available for strings (do not confuse the two). Now we have to iterate over the lines of the array split and check the first character. The first character of the line stored in split[i] can be accessed by:
split[i].charAt(0)
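As a minimal sketch of these two steps, the following snippet splits a small wiki string into lines and collects the first character of each line. The sample text and the variable names sampleWiki, sampleLines and firstChars are made up for illustration.

```javascript
// Hypothetical sample input, not taken from a real article.
const sampleWiki = "plain text line\r\n* bullet item\n# enum item\n:indented line\n";

// Split into lines, accepting both Windows (\r\n) and Unix (\n) line ends.
const sampleLines = sampleWiki.split(/\r?\n/);

// Collect the first character of every line; charAt(0) returns ""
// for an empty line instead of throwing an error.
const firstChars = sampleLines.map((line) => line.charAt(0));

console.log(firstChars);
```

Note that the trailing line break produces an empty last line, so a block parser has to handle lines whose first character is the empty string.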
Now we expand the idea mentioned above with a while loop over the array of lines split. Furthermore we define an object parser_pos that stores the current line index, because the parsing methods increment the index while testing for the block end and aggregating lines in line_array. After a block end has been detected, the line_array is joined with
let text_block = line_array.join("\n");
- a TextBlock will be decomposed into sentences (see the sentence parser in /src/sentence).
- a BulletList will be decomposed into its items, wrapped with an opening and closing AST tree node. The code in /src/section/list.js can be used for that, especially its regular expressions.
- an EnumList will be decomposed into its items, wrapped with an opening and closing AST tree node.
- ...
The parser position mentioned above is passed to the parse methods as a parameter (see parser_pos in the following code):
let contentlist = new ContentList();
let i = 0;
let parser_pos = {
  "index": i,
  "lines": split,
  "line_array": []
};
// line_array is populated with the lines of the current block
while (i < split.length) {
  switch (split[i].charAt(0)) {
    case ":":
      // CALL: parseIndentation(parser_pos, contentlist, options)
      // creates a tree node Indentation or a tree node MathBlock
      break;
    case "*":
      // CALL: parseBulletList(parser_pos, contentlist, options)
      // creates a tree node BulletList (in 5.0 performed in /src/list)
      break;
    case "#":
      // CALL: parseEnumList(parser_pos, contentlist, options)
      // creates a tree node EnumList (in 5.0 performed in /src/list)
      break;
    // ... further cases for other block types
    default:
      // (optional) split paragraphs if the TextBlock contains double \n
  }
  // every parse method must move parser_pos.index behind its block;
  // as long as the calls above are placeholders, step one line forward
  // to avoid an endless loop
  if (parser_pos.index <= i) {
    parser_pos.index = i + 1;
  }
  // parsing incremented the line index, so update i
  i = parser_pos.index;
  // tree node is parsed, so reset the line_array
  parser_pos.line_array = [];
}
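To illustrate how a parse method could cooperate with parser_pos, here is a minimal sketch of parseBulletList(). It is an assumption about the design, not the wtf_wikipedia implementation: the contentlist is a plain array here, and the node layout with type and items is hypothetical.

```javascript
// Sketch: consume consecutive "*" lines starting at parser_pos.index.
function parseBulletList(parser_pos, contentlist, options) {
  const lines = parser_pos.lines;
  let i = parser_pos.index;
  // aggregate lines until the block ends
  while (i < lines.length && lines[i].charAt(0) === "*") {
    parser_pos.line_array.push(lines[i]);
    i++;
  }
  // strip the leading "*" characters and blanks from each item
  // (the nesting depth is ignored in this sketch)
  const items = parser_pos.line_array.map((line) => line.replace(/^\*+\s*/, ""));
  contentlist.push({ type: "BulletList", items: items });
  // report the new position back to the caller
  parser_pos.index = i;
}

const demoPos = {
  index: 1,
  lines: ["text", "* one", "* two", "more text"],
  line_array: []
};
const demoList = [];
parseBulletList(demoPos, demoList, {});
console.log(demoPos.index, demoList[0].items);
```

Because the method writes the new index back into parser_pos, the calling loop can continue with the first line behind the block.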
In wtf_wikipedia 5.0 the AST nodes for EnumList and BulletList are handled in /src/list. This can still be handled by the existing methods. The only requirement is that the generated tree nodes in ContentList can be distinguished by the type of tree node, i.e. EnumList and BulletList, which is necessary for the output generation in different formats.
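Why the type matters for output generation can be shown with a small HTML renderer. This is a sketch under assumptions: the function listToHTML and the node layout with type and items (plain strings) are hypothetical, and no HTML escaping is performed.

```javascript
// Sketch: the node type decides which HTML list element is generated.
function listToHTML(node) {
  let tag;
  switch (node.type) {
    case "BulletList":
      tag = "ul"; // unordered list
      break;
    case "EnumList":
      tag = "ol"; // ordered list
      break;
    default:
      throw new Error("not a list node: " + node.type);
  }
  const items = node.items.map((item) => "<li>" + item + "</li>").join("");
  return "<" + tag + ">" + items + "</" + tag + ">";
}

const enumHTML = listToHTML({ type: "EnumList", items: ["a", "b"] });
console.log(enumHTML);
```

A LaTeX or Markdown generator would branch on the same type attribute, only with different output strings.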
The result of the parsing process is more or less an array of the following tree nodes of the Abstract Syntax Tree (AST):
- AST Type: TextBlock
  this is a line 1 of a TextBlock
  this is a line 2 of a TextBlock
  this is a line 3 of a TextBlock
- AST Type: BulletList
  * this is a line 1 of a BulletList
  * this is a line 2 of a BulletList
- AST Type: TextBlock
  this is a line 1 of the next TextBlock
  this is a line 2 of the next TextBlock with inline math <math>k^2</math>
- AST Type: EnumList
  # this is a line 1 of a EnumList
  # this is a line 2 of a EnumList
- AST Type: TextBlock
  the following lines define a MathBlock as separate rendered line
- AST Type: MathBlock
  \sum_{k=1}^{n} k^2
  + n^3
- AST Type: TextBlock
  this is the last line of the document
The parseBlock() method of ContentList splits those blocks and calls the appropriate parse method for each AST tree node type.
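One possible way to call the appropriate parse method per node type is a lookup table. The following is a sketch only: the table parseMethods, the function parseByType and the returned node layouts are hypothetical names chosen for this example.

```javascript
// Hypothetical parse methods, one per AST node type.
const parseMethods = new Map([
  ["TextBlock", (text) => ({ type: "TextBlock", lines: text.split("\n") })],
  ["BulletList", (text) => ({
    type: "BulletList",
    items: text.split("\n").map((line) => line.replace(/^\*\s*/, ""))
  })],
  ["EnumList", (text) => ({
    type: "EnumList",
    items: text.split("\n").map((line) => line.replace(/^#\s*/, ""))
  })]
]);

// Look up the parse method for a block type; unknown types are
// treated as TextBlock in this sketch.
function parseByType(type, text) {
  const method = parseMethods.get(type) || parseMethods.get("TextBlock");
  return method(text);
}

const enumNode = parseByType("EnumList", "# first\n# second");
console.log(enumNode.items);
```

Compared with a long switch statement, a table like this lets new node types be registered without touching the dispatch code.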
The method createNode4AST() creates a very simple node for the Abstract Syntax Tree (AST) by returning a hash with just the type attribute. This AST node can be populated with additional attributes that may be relevant for the generation of output formats. E.g. a Section needs a title and a depth attribute, as implemented in /src/section/index.js l.40 with the function splitSections().
const createNode4AST = function(nodeid) {
  return {
    "type": nodeid
  };
};
A tree node for Section will be created with createNode4AST("Section") and populated with more content parsed in the section body. Open question: is the contentlist equivalent to the templates attribute in the section hash, or are these really templates?
A node-specific constructor could use a switch statement to add type-specific additional attributes.
const createNode4AST = function(nodeid) {
  let ast_node = {
    "type": nodeid
  };
  switch (nodeid) {
    // stacked case labels share one branch (fall-through)
    case "Paragraph":
    case "BulletList":
    case "EnumList":
    case "TextBlock":
    case "Sentence":
      ast_node.contentlist = new ContentList();
      break;
    case "Section":
      ast_node.title = "";
      ast_node.depth = -1;
      ast_node.templates = [];
      ast_node.contentlist = new ContentList();
      break;
    default:
      // all other node types carry only the type attribute
  }
  return ast_node;
};
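One JavaScript detail matters when several node types share a branch of the switch: a case label with a comma-separated list of strings is valid syntax, but the comma operator evaluates to its last operand, so only the last string is compared. The two helper functions below are hypothetical and exist only to demonstrate the difference.

```javascript
// A comma-separated case label compares against the LAST value only.
function matchesWithComma(nodeid) {
  switch (nodeid) {
    case "Paragraph", "Sentence": // comma operator: equivalent to case "Sentence"
      return true;
    default:
      return false;
  }
}

// Stacked case labels (fall-through) match every listed value.
function matchesWithFallThrough(nodeid) {
  switch (nodeid) {
    case "Paragraph":
    case "Sentence":
      return true;
    default:
      return false;
  }
}

console.log(matchesWithComma("Paragraph"), matchesWithFallThrough("Paragraph"));
```

For a constructor like createNode4AST() this means that every node type that should receive a contentlist needs its own case label.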
- The parsing of a wiki document is based on the line split of the wiki source.
- An iteration over the lines populates an array of lines line_array until a specific content element ends (e.g. a TextBlock, a List (e.g. BulletList, EnumList), InfoBox, Table, Indentation or mathematical expression, ...).
- If the next line indicates that a new content element starts (e.g. a TextBlock ends and a BulletList starts), then the previously populated array of lines is concatenated and sent to the appropriate method for parsing.
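The three steps above can be sketched in a compact form. The following is a simplified illustration, not the proposed implementation: the helper blockTypeOf, the function splitBlocks and the node layout with type and text are assumptions, tables and infoboxes are ignored, and a MathBlock keeps its raw lines including the math tags.

```javascript
// Map the start of a line to a block type (assumed type names).
function blockTypeOf(line) {
  if (/^:+<math>/.test(line)) return "MathBlock";
  switch (line.charAt(0)) {
    case "*": return "BulletList";
    case "#": return "EnumList";
    case ":": return "Indentation";
    default: return "TextBlock";
  }
}

function splitBlocks(wiki) {
  const lines = wiki.split(/\r?\n/);
  const nodes = [];
  let i = 0;
  while (i < lines.length) {
    if (lines[i].length === 0) { i++; continue; } // skip empty lines
    const type = blockTypeOf(lines[i]);
    const line_array = [];
    if (type === "MathBlock") {
      // aggregate up to and including the line with the closing tag
      while (i < lines.length) {
        line_array.push(lines[i]);
        i++;
        if (line_array[line_array.length - 1].indexOf("</math>") >= 0) break;
      }
    } else {
      // aggregate while the lines belong to the same block type
      while (i < lines.length && lines[i].length > 0 && blockTypeOf(lines[i]) === type) {
        line_array.push(lines[i]);
        i++;
      }
    }
    nodes.push({ type: type, text: line_array.join("\n") });
  }
  return nodes;
}

const blockSample = [
  "this is a line 1 of a TextBlock",
  "this is a line 2 of a TextBlock",
  "* this is a line 1 of a BulletList",
  "* this is a line 2 of a BulletList",
  "this is a line 1 of the next TextBlock",
  "# this is a line 1 of a EnumList",
  ":<math>",
  "\\sum_{k=1}^{n} k^2",
  "</math>",
  "this is the last line of the document"
].join("\n");
const blockNodes = splitBlocks(blockSample);
console.log(blockNodes.map((node) => node.type));
```

Each node still holds the raw wiki text of its block, so a type-specific parse method can be applied in a second step.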
let contentlist = new ContentList();
let split = wiki.split(/\r?\n/);
let i = 0;
let line_array = [];
function block_end_check(split, i, pCondition) {
  // split is an array of lines and i the current index in the block parse.
  // pCondition is a function (line) => boolean that defines the block end;
  // it is evaluated for the current line on every call.
  // The returned boolean is used in a while statement,
  // so it is true as long as the block did not end.
  return (i < split.length) && (split[i].length > 0) && !pCondition(split[i]);
}
while (i < split.length) {
  let first_char = split[i].charAt(0) || "A";
  // "A" or any other character A-Z or a-z means:
  // it is a line of a TextBlock
  switch (first_char) {
    case ":":
      // check if line starts with ":<math>", "::<math>", ":::<math>", ...
      // If "YES" aggregate lines until the closing math tag </math> appears in a line
      if (/^:+<math>/.test(split[i])) {
        // MATHBLOCK: handle block math expression - define as function parseMathBlock()
        while (block_end_check(split, i, (line) => line.indexOf("</math>") >= 0)) {
          line_array.push(split[i]);
          i++;
        }
        // the loop stops at the line containing </math> (or at an empty line / the end)
        if (i < split.length && split[i].indexOf("</math>") >= 0) {
          // remove blanks, tabs, ... at end of line
          let last_line = split[i].replace(/\s+$/, "");
          // last_line = "some formula </math> text after math expression"
          let math_end = last_line.indexOf("</math>") + 7; // 7 = length of "</math>"
          // add "some formula </math>" as last line of the math block
          line_array.push(last_line.substring(0, math_end));
          if (last_line.length > math_end) {
            // replace the current line with the rest of the line behind </math>
            split[i] = last_line.substring(math_end);
            // split[i] = " text after math expression"
            // i is not incremented, so the rest is parsed as the next block
          } else {
            i++;
          }
        }
        // CALL: parse line_array.join("\n") as MathBlock and push the node to contentlist
      } else {
        // aggregate all indentation as block
        while (block_end_check(split, i, (line) => line.charAt(0) !== ":")) {
          line_array.push(split[i]);
          i++;
        }
        let wiki_block = line_array.join("\n");
        // subtree of indentation could contain e.g. bullet lists
        /*
        : before bullet list
        :* bullet list starts
        :* next bullet list item
        :* bullet list ends
        : after bullet list indented TextBlock start
        */
        // CALL: parse indentation (section and options come from the calling context)
        let indent = parseIndentation(section, wiki_block, options);
        // parsing removes the preceding ":" per line and
        // parses the subtree of the AST, i.e. parses the bullet list
        // and the preceding and appended TextBlocks
        // consisting of one line only here.
        /*
        before bullet list
        * bullet list starts
        * next bullet list item
        * bullet list ends
        after bullet list indented TextBlock start
        */
        // indent is itself a contentlist branching the Abstract Syntax Tree AST
        contentlist.push(indent);
      }
      break;
    default:
      // Handle lines of TextBlock or (optional) split paragraphs by double "\n"
      i++;
  }
  // both ":" branches already moved i behind their block
  line_array = [];
}
...
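The first step of parsing an indentation block, removing one level of ":" per line, can be sketched as follows. The function name stripIndentation is hypothetical; the real parseIndentation() would afterwards parse the returned text as a subtree of the AST.

```javascript
// Sketch: remove exactly one leading ":" (and one optional blank behind it)
// from every line of an indentation block.
function stripIndentation(wiki_block) {
  return wiki_block
    .split("\n")
    .map((line) => line.replace(/^: ?/, ""))
    .join("\n");
}

const indentedBlock = ": before bullet list\n:* bullet list starts\n:: deeper indentation";
const strippedBlock = stripIndentation(indentedBlock);
console.log(strippedBlock);
```

Only one level is removed per call, so a line starting with "::" still starts with ":" afterwards and can be parsed as a nested indentation.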
- Parsing Concepts are based on Parsoid - https://www.mediawiki.org/wiki/Parsoid
- Output: based on concepts of Pandoc, the swiss-army knife of document conversion developed by John MacFarlane - https://www.pandoc.org