ContentList Parsing

The ContentList is the most generic element of the Abstract Syntax Tree (AST). This document explains how wiki source text is parsed into a list of content elements.

On an abstract level, the ContentList is a branching element in the AST that preserves the sequence in which the content elements appear in the wiki source text.

Parsing content elements into a ContentList can be performed on different levels of detail in the AST:

on block level, as described below.
on sentence level, by decomposing a sentence into sentence parts, inline citations, inline math, inline icons, ... . In this sense a sentence is a concatenation of content elements. Citations in particular have a specific location in the text; even in plain text a citation could be inserted as [3] to reference the bibliography.
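The sentence level decomposition can be illustrated with a minimal sketch. The function name splitSentenceParts and the node types SentencePart, InlineMath and Citation are assumptions made for this example and are not part of wtf_wikipedia. It is also assumed that a citation appears as <ref>...</ref> in the wiki source.

```javascript
// Hypothetical helper: decompose a sentence into a sequence of content elements.
// The node types "SentencePart", "InlineMath" and "Citation" are illustrative names only.
function splitSentenceParts(sentence) {
  const parts = [];
  // match inline math or inline citations (non-greedy, within one line)
  const regex = /<math>(.*?)<\/math>|<ref>(.*?)<\/ref>/g;
  let last = 0;
  let match;
  while ((match = regex.exec(sentence)) !== null) {
    if (match.index > last) {
      // plain text in front of the inline element
      parts.push({ type: "SentencePart", text: sentence.substring(last, match.index) });
    }
    if (match[1] !== undefined) {
      parts.push({ type: "InlineMath", text: match[1] });
    } else {
      parts.push({ type: "Citation", text: match[2] });
    }
    last = regex.lastIndex;
  }
  if (last < sentence.length) {
    // remaining plain text behind the last inline element
    parts.push({ type: "SentencePart", text: sentence.substring(last) });
  }
  return parts;
}

const parts = splitSentenceParts("The sum <math>k^2</math> grows fast<ref>sample source</ref>.");
console.log(parts.map((p) => p.type).join(", "));
// -> SentencePart, InlineMath, SentencePart, Citation, SentencePart
```

The order of the returned array is the order of appearance in the sentence, which is the property the ContentList has to preserve.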

Remark on Refactoring

The following description is not implemented in wtf_wikipedia 5.0 and serves as a basis for the software design process. Please feel free to adapt the content of this GitHub wiki and improve the description prior to implementation of the solution. In the sense of Agile Software Development, some code is inserted here for checking the basic concept, but the code should be regarded more or less as pseudo code that helps to understand the proposal. The code is not meant to be a copy-and-paste resource for the implementation.

Introduction to Block Level Parsing - ContentList

This section describes the block level parsing of ContentList. Assume we have wiki source like this:

this is a line 1 of a TextBlock 
this is a line 2 of a TextBlock
this is a line 3 of a TextBlock
* this is a line 1 of a BulletList
* this is a line 2 of a BulletList
this is a line 1 of the next TextBlock
this is a line 2 of the next TextBlock with inline math <math>k^2</math>
# this is a line 1 of a EnumList
# this is a line 2 of a EnumList
the following lines define a MathBlock as separate rendered line
:<math>
  \sum_{k=1}^{n} k^2
  + n^3  
</math> 
this is the last line of the document

The parseBlock() method decomposes the wiki source into the following AST tree nodes. The key strategy for splitting the source into blocks is to look at the first character of a line. This implies that the parseBlock() method needs a line split first, performed by:

let split = wiki.split(/\r?\n/);

The variable split is an array of lines, while wiki.split() is a method available for strings (don't get confused). Now we have to iterate over the lines of the array split and check the first character. The first character of the line stored in split[i] can be accessed by:

split[i].charAt(0)
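The line split and the first-character test can be tried out in isolation. The following is a minimal sketch; the helper name blockTypeOfLine and the sample text are assumptions made for this example.

```javascript
// Hypothetical helper: map the first character of a line to a block type.
function blockTypeOfLine(line) {
  switch (line.charAt(0)) {
    case ":":
      return "Indentation"; // or MathBlock, decided by a later check
    case "*":
      return "BulletList";
    case "#":
      return "EnumList";
    default:
      return "TextBlock"; // also covers empty lines, charAt(0) is "" then
  }
}

// the regular expression handles Unix (\n) and Windows (\r\n) line ends
const sample_wiki = "first line\r\n* item\n# item\n:indented";
const sample_lines = sample_wiki.split(/\r?\n/);
console.log(sample_lines.map((line) => blockTypeOfLine(line)).join(", "));
// -> TextBlock, BulletList, EnumList, Indentation
```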

Now we expand the idea mentioned above by a while iteration over the array of lines split. Furthermore we define an object parser_pos that stores the current line index, because the parsing methods increment the index during the block end test and aggregate the lines of a block in line_array. After the end of a block is detected, the line_array is joined with

let text_block = line_array.join("\n");

The joined block is then parsed according to its type:

a TextBlock will be decomposed into sentences (see the Sentence parser in /src/sentence).
a BulletList will be decomposed into its items, wrapped with an opening and closing AST tree node. The code in /src/section/list.js can be used for that, especially its regular expressions.
an EnumList will be decomposed into its items, wrapped with an opening and closing AST tree node.
...
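Joining the aggregated lines and decomposing a list block into its items could look like the following sketch. The helper parseListItems and the shape of the returned node are assumptions made for illustration; the sketch does not reproduce the existing code in /src/section/list.js and ignores nested lists.

```javascript
// Hypothetical helper: split a joined list block into its items.
// marker is "*" for a BulletList and "#" for an EnumList.
function parseListItems(text_block, marker) {
  const type = (marker === "#") ? "EnumList" : "BulletList";
  const items = text_block
    .split("\n")
    .filter((line) => line.charAt(0) === marker) // keep list lines only
    .map((line) => line.substring(1).trim());    // remove marker and blanks
  return { type: type, items: items };
}

const line_array = ["* this is a line 1 of a BulletList", "* this is a line 2 of a BulletList"];
const text_block = line_array.join("\n");
const list_node = parseListItems(text_block, "*");
console.log(list_node.type + ": " + list_node.items.length + " items");
// -> BulletList: 2 items
```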

The parser position mentioned above is passed to the parse methods as a parameter, as in the following pseudo code:

let contentlist = new ContentList();
let i = 0;

let parser_pos = {
   "index": i,
   "lines": split,
   "line_array": []
};
// line_array is populated with the lines of the current block.
// Every parse method has to advance parser_pos.index past the block it parsed,
// otherwise the loop does not terminate.
while (i < split.length) {
  switch (split[i].charAt(0)) {
      case ":":
        //CALL: parseIndentation(parser_pos,contentlist,options)
        // creates a tree node Indentation or a tree node MathBlock
      break;
      case "*":
        //CALL: parseBulletList(parser_pos,contentlist,options)
        // creates a tree node BulletList (in 5.0 performed in /src/list)
      break;
      case "#":
        //CALL: parseEnumList(parser_pos,contentlist,options)
        // creates a tree node EnumList (in 5.0 performed in /src/list)
      break;
      // ...
      default:
      // (optional) split paragraphs if the TextBlock contains double \n
  }
  // parsing incremented the line index, so update i
  i = parser_pos.index;
  // tree node is parsed, so reset the line_array
  parser_pos.line_array = [];
}

In wtf_wikipedia 5.0 the AST nodes for EnumList and BulletList are handled in /src/list. They can still be handled by the existing methods. The only requirement is that the generated tree nodes in the ContentList can be distinguished by their type, i.e. EnumList and BulletList, which is necessary for generating output in different formats.

The result of the parsing process is more or less an array of the following tree nodes of the Abstract Syntax Tree (AST):

AST Type: TextBlock

this is a line 1 of a TextBlock 
this is a line 2 of a TextBlock
this is a line 3 of a TextBlock

AST Type: BulletList

* this is a line 1 of a BulletList
* this is a line 2 of a BulletList

AST Type: TextBlock

this is a line 1 of the next TextBlock
this is a line 2 of the next TextBlock with inline math <math>k^2</math>

AST Type: EnumList

# this is a line 1 of a EnumList
# this is a line 2 of a EnumList

AST Type: TextBlock

the following lines define a MathBlock as separate rendered line

AST Type: MathBlock

  \sum_{k=1}^{n} k^2
  + n^3

AST Type: TextBlock

this is the last line of the document

The parseBlock() method of ContentList splits the source into those blocks and calls the appropriate parse method for each AST tree node type.
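The decomposition listed above can be reproduced with a small runnable sketch. It is only an illustration under assumptions and not the wtf_wikipedia implementation: a plain array stands in for ContentList, every node is reduced to a type and the raw text, the helper collectWhile is a hypothetical name, and text behind a closing </math> tag is dropped.

```javascript
// Collects consecutive lines while test(line) holds and advances parser_pos.index.
function collectWhile(parser_pos, test) {
  const lines = [];
  while (parser_pos.index < parser_pos.lines.length &&
         test(parser_pos.lines[parser_pos.index])) {
    lines.push(parser_pos.lines[parser_pos.index]);
    parser_pos.index++;
  }
  return lines;
}

function parseBlock(wiki) {
  const contentlist = []; // plain array standing in for ContentList
  const parser_pos = { index: 0, lines: wiki.split(/\r?\n/) };
  const startsWith = (c) => ((l) => l.charAt(0) === c);
  const isText = (l) => l.length > 0 && ":*#".indexOf(l.charAt(0)) < 0;
  while (parser_pos.index < parser_pos.lines.length) {
    const line = parser_pos.lines[parser_pos.index];
    let type;
    let lines;
    if (line.length === 0) {
      parser_pos.index++; // empty line separates blocks, nothing to store
      continue;
    }
    if (/^:+<math>/.test(line)) {
      type = "MathBlock";
      // collect all lines in front of the line with the closing tag
      lines = collectWhile(parser_pos, (l) => l.indexOf("</math>") < 0);
      if (parser_pos.index < parser_pos.lines.length) {
        lines.push(parser_pos.lines[parser_pos.index]); // line with </math>
        parser_pos.index++;
      }
    } else if (line.charAt(0) === ":") {
      type = "Indentation";
      lines = collectWhile(parser_pos, startsWith(":"));
    } else if (line.charAt(0) === "*") {
      type = "BulletList";
      lines = collectWhile(parser_pos, startsWith("*"));
    } else if (line.charAt(0) === "#") {
      type = "EnumList";
      lines = collectWhile(parser_pos, startsWith("#"));
    } else {
      type = "TextBlock";
      lines = collectWhile(parser_pos, isText);
    }
    let text = lines.join("\n");
    if (type === "MathBlock") {
      // keep only the formula between the opening and the closing tag
      text = text.replace(/^:+<math>\n?/, "").replace(/\n?<\/math>[\s\S]*$/, "");
    }
    contentlist.push({ type: type, text: text });
  }
  return contentlist;
}

const sample = [
  "this is a line 1 of a TextBlock",
  "* this is a line 1 of a BulletList",
  "this is a line 1 of the next TextBlock",
  "# this is a line 1 of a EnumList",
  ":<math>",
  "  \\sum_{k=1}^{n} k^2",
  "</math>",
  "this is the last line of the document"
].join("\n");
console.log(parseBlock(sample).map((node) => node.type).join(", "));
// -> TextBlock, BulletList, TextBlock, EnumList, MathBlock, TextBlock
```

Every branch consumes at least one line, so the outer loop always terminates. In a real implementation each branch would hand its lines to the type-specific parse method instead of storing the raw text.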

Create a Node for the AST

The method createNode4AST() creates a very simple node for the Abstract Syntax Tree (AST) by returning a hash with just the type attribute. This AST node can be populated with additional attributes that may be relevant for the generation of output formats. E.g. a Section needs a title and a depth attribute, as implemented in /src/section/index.js l.40 with the function splitSections().

  const createNode4AST = function(nodeid) {
    return {
      "type": nodeid
    };
  };

A tree node for Section will be created with createNode4AST("Section") and populated with more content parsed in the section body. Open question: is the contentlist equivalent to the templates attribute in the section hash, or are these really templates?

A node-specific constructor could use a switch statement to add type-specific attributes.
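Populating the generic node for a Section could look like the following sketch. The title and depth values are made-up sample data, and the variable name section_node is an assumption made for this example.

```javascript
const createNode4AST = function(nodeid) {
  return {
    "type": nodeid
  };
};

// create the generic node and add the Section-specific attributes
const section_node = createNode4AST("Section");
section_node.title = "History"; // sample title
section_node.depth = 1;         // sample heading depth
console.log(JSON.stringify(section_node));
// -> {"type":"Section","title":"History","depth":1}
```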

  const createNode4AST = function(nodeid) {
     let ast_node = {
               "type":nodeid
           };
     switch (nodeid) {
        // stacked case labels fall through to the same branch
        case "Paragraph":
        case "BulletList":
        case "EnumList":
        case "TextBlock":
        case "Sentence":
           ast_node.contentlist = new ContentList();
        break;
        case "Section":
           ast_node.title = "";
           ast_node.depth = -1;
           ast_node.templates = [];
           ast_node.contentlist = new ContentList();
        break;
        default:
           // no type-specific attributes
     }
     return ast_node;
  };
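The constructor above relies on a ContentList class that is not defined in this document. A minimal stand-in, written purely as an assumption for experimenting with the proposal, could keep the content elements in source order like this:

```javascript
// Hypothetical minimal ContentList: keeps content elements in source order.
class ContentList {
  constructor() {
    this.nodes = [];
  }
  push(node) {
    this.nodes.push(node);
    return this; // allow chaining of push() calls
  }
  get length() {
    return this.nodes.length;
  }
}

// Nested use: a Section node whose contentlist holds a TextBlock and a BulletList.
const section = {
  type: "Section",
  title: "Example",
  depth: 0,
  contentlist: new ContentList()
};
section.contentlist
  .push({ type: "TextBlock", contentlist: new ContentList() })
  .push({ type: "BulletList", contentlist: new ContentList() });
console.log(section.contentlist.nodes.map((n) => n.type).join(", "));
// -> TextBlock, BulletList
```

Because every branching node carries its own contentlist, the AST is built by nesting these lists.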

Main Parsing Steps for ContentList

The parsing of a wiki document is based on the line split of the wiki source.
An iteration over the lines populates an array of lines line_array until a specific content element ends (e.g. a TextBlock, a list (e.g. BulletList, EnumList), an InfoBox, a Table, an Indentation or a mathematical expression, ...).
If the next line indicates that a new content element starts (e.g. a TextBlock ends and a BulletList starts), then the previously populated array of lines is concatenated and sent to the appropriate method for parsing.

let contentlist = new ContentList();
let split = wiki.split(/\r?\n/);
let i = 0;
let line_array = [];

function block_end_check(split, i, pCondition) {
  // split is an array of lines and i the current index in the block parse
  // parameter pCondition is a function that receives the current line
  // and returns true if that line ends the block
  // the returned boolean is used in a while statement,
  // so it is true as long as the block did not end.
  return (i < split.length) && (split[i].length > 0) && !pCondition(split[i]);
}

while (i < split.length) {
  let first_char = split[i].charAt(0) || "A";
  // "A" or any other character A-Z or a-z means:
  // it is a line of a TextBlock
  switch (first_char) {
      case ":":
         // check if line starts with ":<math>", "::<math>", ":::<math>", ...
         // If "YES" aggregate lines until the closing math tag </math> appears in a line
         if (split[i].match(/^:+<math>/)) {
            //MATHBLOCK: handle block math expression - define as function parseMathBlock()
            while (block_end_check(split, i, (line) => line.indexOf("</math>") >= 0)) {
              line_array.push(split[i]);
              i++;
            }
            // the loop stops in front of the line with </math>,
            // which still belongs to the block
            if ((i < split.length) && (split[i].indexOf("</math>") >= 0)) {
               // remove blanks, tabs, ... at end of line
               let last_line = split[i].replace(/\s+$/, "");
               // last_line = "some formula  </math> text after math expression"
               let math_end = last_line.indexOf("</math>") + 7; // 7 = length of "</math>"
               // append "some formula  </math>" to the block
               line_array.push(last_line.substring(0, math_end));
               if (last_line.length > math_end) {
                  // replace the line with the rest of the line behind </math>
                  split[i] = last_line.substring(math_end);
                  // split[i] = " text after math expression"
                  // i is not incremented, so the rest is parsed as the next block
               } else {
                  i++;
               }
            }
            // line_array now contains all lines of the MathBlock
         } else {
            // aggregate all indented lines as one block
            while (block_end_check(split, i, (line) => line.charAt(0) != ":")) {
              line_array.push(split[i]);
              i++;
            }
            let wiki_block = line_array.join("\n");
            // subtree of indentation could contain e.g. bullet lists
            /*
            : before bullet list
            :* bullet list starts
            :* next bullet list item
            :* bullet list ends
            : after bullet list indented TextBlock start
            */
            let indent = parseIndentation(section, wiki_block, options);
            // CALL: parse indentation
            // parsing removes the preceding ":" per line and
            // parses the subtree of the AST, i.e. parses the bullet list
            // and the preceding and appended TextBlocks
            // consisting of one line only here.
            /*
            before bullet list
            * bullet list starts
            * next bullet list item
            * bullet list ends
            after bullet list indented TextBlock start
            */
            // indent is itself a contentlist branching the Abstract Syntax Tree AST
            contentlist.push(indent);
         }
      break;
      default:
         // Handle lines of TextBlock or (optional) split paragraphs by double "\n"
         i++; // every branch has to consume at least one line
  }
  // the block is parsed, so reset the line_array; i already points to the next block
  line_array = [];
}
...
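The removal of one indentation level, as shown in the two comment blocks above, can be sketched as follows. The helper name removeIndentation is an assumption made for this example; the document only names parseIndentation() without defining it.

```javascript
// Hypothetical helper: remove one level of indentation from every line of a block.
// A single blank directly behind the ":" is removed as well.
function removeIndentation(wiki_block) {
  return wiki_block
    .split("\n")
    .map((line) => line.replace(/^: ?/, ""))
    .join("\n");
}

const indented_block = [
  ": before bullet list",
  ":* bullet list starts",
  ":* bullet list ends",
  ": after bullet list"
].join("\n");
console.log(removeIndentation(indented_block));
// before bullet list
// * bullet list starts
// * bullet list ends
// after bullet list
```

The unindented block could then be passed to the block level parser again, so that the bullet list becomes a subtree of the Indentation node.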

Parsing: concepts are based on Parsoid - https://www.mediawiki.org/wiki/Parsoid
Output: based on concepts of Pandoc, the swiss-army knife of document conversion developed by John MacFarlane - https://www.pandoc.org
