SLAXML is a pure-Lua SAX-like streaming XML parser. It is more robust than
many (simpler) pattern-based parsers that exist (such as mine), properly
supporting code like <expr test="5 > 7" />, CDATA nodes, comments, namespaces,
and processing instructions.
It is currently not a truly valid XML parser, however, as it allows certain XML that is syntactically-invalid (not well-formed) to be parsed without reporting an error.
- Pure Lua in a single file (two files if you use the DOM parser).
- Streaming parser does a single pass through the input and reports what it sees along the way.
- Supports processing instructions (<?foo bar?>).
- Supports comments (<!-- hello world -->).
- Supports CDATA sections (<![CDATA[ whoa <xml> & other content as text ]]>).
- Supports namespaces, resolving prefixes to the proper namespace URI (<foo xmlns="bar"> and <wrap xmlns:bar="bar"><bar:kittens/></wrap>).
- Supports unescaped greater-than symbols in attribute content (a common failing for simpler pattern-based parsers).
- Unescapes named XML entities (&lt; &gt; &amp; &quot; &apos;) and numeric entities (e.g. &#10;) in attributes and text nodes (but, properly, not in comments or CDATA). Properly handles edge cases like &amp;amp;.
- Optionally ignore whitespace-only text nodes (as appear when indenting XML markup).
- Includes a DOM parser that is both a convenient way to pull in XML to use and a nice example of using the streaming parser.
- Does not add any keys to the global namespace.
local SLAXML = require 'slaxml'
local myxml = io.open('my.xml'):read('*all') -- read the whole file, not just the first line
-- Specify as many/few of these as you like
local parser = SLAXML:parser{
  startElement = function(name,nsURI) end, -- when "<foo" or "<x:foo" is seen
  attribute = function(name,value,nsURI) end, -- attribute found on current element
  closeElement = function(name,nsURI) end, -- when "</foo>" or "</x:foo>" or "/>" is seen
  text = function(text) end, -- text and CDATA nodes
  comment = function(content) end, -- comments
  pi = function(target,content) end, -- processing instructions e.g. "<?yes mon?>"
}
-- Ignore whitespace-only text nodes and strip leading/trailing whitespace from text
-- (does not strip leading/trailing whitespace from CDATA)
parser:parse(myxml,{stripWhitespace=true})
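As a minimal sketch of how the callbacks can carry state, the following tallies element names as they stream past. The helper name makeCounter and the sample XML are assumptions made for illustration, not part of SLAXML. When slaxml.lua is not on the path, the sketch simulates the startElement calls the parser would be expected to make for the same input.

```lua
-- makeCounter is a hypothetical helper: it returns a callbacks table
-- plus the table that the callbacks fill with per-name counts.
local function makeCounter()
  local counts = {}
  local callbacks = {
    startElement = function(name, nsURI)
      counts[name] = (counts[name] or 0) + 1
    end,
  }
  return callbacks, counts
end

local callbacks, counts = makeCounter()

local ok, SLAXML = pcall(require, 'slaxml')
if ok then
  -- Real parser: same call pattern as the usage shown above.
  SLAXML:parser(callbacks):parse('<r><a/><a/><b/></r>', {stripWhitespace=true})
else
  -- Library not available: replay the calls expected for the same input.
  for _, name in ipairs{'r', 'a', 'a', 'b'} do
    callbacks.startElement(name, nil)
  end
end

print(counts.a, counts.b)
```

Only the callbacks you supply are needed; here just startElement is specified and the rest are left out.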
If you just want to see if it will parse your document correctly, you can simply do:
local SLAXML = require 'slaxml'
SLAXML:parse(myxml)
…which will cause SLAXML to use its built-in callbacks that print the results as seen.
If you simply want to build tables from your XML, you can alternatively:
local SLAXML = require 'slaxdom' -- also requires slaxml.lua; be sure to copy both files
local doc = SLAXML:dom(myxml)
The returned table is a 'document' comprised of tables for elements, attributes, text nodes, comments, and processing instructions. See the following documentation for what each supports.
- Document - the root table returned from the SLAXML:dom() method.
  - doc.type: the string "document"
  - doc.name: the string "#doc"
  - doc.kids: an array table of child processing instructions, the root element, and comment nodes.
  - doc.root: the root element for the document
- Element
  - someEl.type: the string "element"
  - someEl.name: the string name of the element (without any namespace prefix)
  - someEl.nsURI: the namespace URI for this element; nil if no namespace is applied
  - someEl.attr: a table of attributes, indexed by name and index
    - local value = someEl.attr['attribute-name']: any namespace prefix of the attribute is not part of the name
    - local someAttr = someEl.attr[1]: a single attribute table (see below); useful for iterating all attributes of an element, or for disambiguating attributes with the same name in different namespaces
  - someEl.kids: an array table of child elements, text nodes, comment nodes, and processing instructions
  - someEl.el: an array table of child elements only
  - someEl.parent: reference to the parent element or document table
- Attribute
  - someAttr.type: the string "attribute"
  - someAttr.name: the name of the attribute (without any namespace prefix)
  - someAttr.value: the string value of the attribute (with XML and numeric entities unescaped)
  - someAttr.nsURI: the namespace URI for the attribute; nil if no namespace is applied
  - someAttr.parent: reference to the parent element table
- Text - for both CDATA and normal text nodes
  - someText.type: the string "text"
  - someText.name: the string "#text"
  - someText.value: the string content of the text node (with XML and numeric entities unescaped for non-CDATA nodes)
  - someText.parent: reference to the parent element table
- Comment
  - someComment.type: the string "comment"
  - someComment.name: the string "#comment"
  - someComment.value: the string content of the comment
  - someComment.parent: reference to the parent element or document table
- Processing Instruction
  - somePI.type: the string "pi"
  - somePI.name: the string name of the PI, e.g. <?foo …?> has a name of "foo"
  - somePI.value: the string content of the PI, i.e. everything but the name
  - somePI.parent: reference to the parent element or document table
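The node fields listed above are enough to write small traversal helpers. The sketch below collects every descendant element with a given name. The helper name findAll is hypothetical, and the tables are a hand-built stand-in that mirrors the documented fields rather than real SLAXML:dom() output (parent references are omitted for brevity).

```lua
-- findAll is a hypothetical helper: depth-first search that relies only on
-- the documented .type, .name and .kids fields.
local function findAll(node, name, found)
  found = found or {}
  for _, kid in ipairs(node.kids or {}) do
    if kid.type == 'element' then
      if kid.name == name then found[#found + 1] = kid end
      findAll(kid, name, found)
    end
  end
  return found
end

-- Hand-built stand-in for the document
--   <list><item id="a"/><group><item id="b"/></group></list>
local itemA = {type='element', name='item', kids={},
               attr={ {type='attribute', name='id', value='a'}, id='a' }}
local itemB = {type='element', name='item', kids={},
               attr={ {type='attribute', name='id', value='b'}, id='b' }}
local group = {type='element', name='group', kids={itemB}, attr={}}
local root  = {type='element', name='list', kids={itemA, group}, attr={}}
local doc   = {type='document', name='#doc', kids={root}, root=root}

local items = findAll(doc, 'item')
for _, item in ipairs(items) do
  print(item.attr.id)  -- attribute value looked up by name
end
```

With a real document, the same call would take the table returned by SLAXML:dom() in place of the hand-built doc.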
The following function can be used to calculate the "inner text" for an element:
function elementText(el)
  local pieces = {}
  for _,n in ipairs(el.kids) do
    if n.type=='element' then pieces[#pieces+1] = elementText(n)
    elseif n.type=='text' then pieces[#pieces+1] = n.value
    end
  end
  return table.concat(pieces)
end
local xml = [[<p>Hello <em>you crazy <b>World</b></em>!</p>]]
local para = SLAXML:dom(xml).root
print(elementText(para)) --> "Hello you crazy World!"
If you want the DOM tables to be simpler to serialize, you can supply the simple option via:
local dom = SLAXML:dom(myxml,{ simple=true })
In this case no table will have a parent attribute, elements will not have the el collection, and the attr collection will be a simple array (without values accessible directly via attribute name). In short, the output will be a strict hierarchy with no internal references to other tables, and all data represented in exactly one spot.
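Because simple mode drops the by-name attribute lookup, code that needs a particular attribute has to scan the array. The following is a rough sketch of such a lookup. The helper name getAttr is hypothetical, and the img table is a hand-written guess at the shape of a simple-mode element, based only on the description above.

```lua
-- getAttr is a hypothetical helper: linear scan of the attr array.
-- Pass nsURI to disambiguate same-named attributes in different namespaces.
local function getAttr(el, name, nsURI)
  for _, a in ipairs(el.attr) do
    if a.name == name and (nsURI == nil or a.nsURI == nsURI) then
      return a.value
    end
  end
  return nil
end

-- Assumed simple-mode shape for <img src="cat.png" alt="A cat"/>:
-- no parent reference, no el collection, attr is a plain array.
local img = {
  type = 'element', name = 'img',
  attr = {
    {type='attribute', name='src', value='cat.png'},
    {type='attribute', name='alt', value='A cat'},
  },
}

print(getAttr(img, 'alt'))
print(getAttr(img, 'title'))
```

The trade-off is a linear scan per lookup in exchange for tables that contain no cycles or duplicated data.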
- Does not require or enforce well-formed XML. Certain syntax errors are silently ignored and consumed. For example:
  - foo="yes & no" is seen as a valid attribute
  - <foo></bar> invokes startElement("foo") followed by closeElement("bar")
  - <foo> 5 < 6 </foo> is seen as valid text contents
- No support for custom entity expansion other than the standard XML entities (&lt; &gt; &quot; &apos; &amp;) and numeric ASCII entities (e.g. &#10;)
- XML Declarations (<?xml version="1.x"?>) are incorrectly reported as Processing Instructions
- No support for DTDs
- No support for extended (Unicode) characters in element/attribute names
- No support for charset
- No support for XInclude
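Since the streaming parser reports <foo></bar> without complaint, callers that use only the streaming callbacks and care about element pairing can track it themselves. Below is a minimal sketch using a stack. The helper name makePairingChecker is hypothetical, it compares names only (not namespace URIs), and the calls at the end simulate what the parser is described as emitting for <foo></bar>.

```lua
-- makePairingChecker is a hypothetical helper, not part of SLAXML.
-- It returns callbacks that raise an error when a close does not match
-- the most recently opened element.
local function makePairingChecker()
  local stack = {}
  local callbacks = {
    startElement = function(name, nsURI)
      stack[#stack + 1] = name
    end,
    closeElement = function(name, nsURI)
      local open = table.remove(stack)
      if open ~= name then
        error(("mismatched close: expected </%s>, got </%s>")
              :format(tostring(open), name), 0)
      end
    end,
  }
  return callbacks, stack
end

local callbacks = makePairingChecker()

-- Simulate the callback sequence described above for <foo></bar>.
callbacks.startElement('foo')
local ok, err = pcall(callbacks.closeElement, 'bar')
print(ok, err)
```

In real use the callbacks table would be passed to SLAXML:parser, and any other callbacks you need can be added to the same table.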
- Lua 5.2 compatible
- Parser now errors if it finishes without finding a root element, or if there are unclosed elements at the end. (Proper element pairing is not enforced by the parser, but is—as in previous releases—enforced by the DOM builder.)
- <foo xmlns="bar"> now directly generates startElement("foo","bar") with no post callback for namespace required.
- Use the local SLAXML=require 'slaxml' pattern to prevent any pollution of the global namespace.
- Bugfix to allow empty attributes, i.e. foo=""
- closeElement no longer includes namespace prefix in the name, includes the nsURI
- DOM adds .parent references
- SLAXML.ignoreWhitespace is now :parse(xml,{stripWhitespace=true})
- "simple" mode for DOM parsing
- Support namespaces for elements and attributes
  - <foo xmlns="barURI"> will call startElement("foo",nil) followed by namespace("barURI") (and then attribute("xmlns","barURI",nil)); you must apply the namespace to your element after creation.
  - Child elements without a namespace prefix that inherit a namespace will receive startElement("child","barURI")
  - <xy:foo> will call startElement("foo","uri-for-xy")
  - <foo xy:bar="yay"> will call attribute("bar","yay","uri-for-xy")
- Runtime errors are generated for any namespace prefix that cannot be resolved
- Add (optional) DOM parser that validates hierarchy and supports namespaces
- Supports expanding numeric entities e.g. &#34; -> "
- Utility functions are local to parsing (not spamming the global namespace)
- Option to ignore whitespace-only text nodes
- Supports unescaped > in attributes
- Supports CDATA
- Supports Comments
- Supports Processing Instructions
Copyright © 2013 Gavin Kistner
Licensed under the MIT License. See LICENSE.txt for more details.