XML parser for well-formed XML streams in Lua. This module is a fork of the xml2lua library by @manoelcampos. Like the original library, it is available under the MIT license.
The parser provides a partially object-oriented API with its functionality split into tokeniser and handler components.
The handler instance from xml.handlers is passed to the tokeniser via xml.parser and receives callbacks for each XML element processed (if a suitable handler function is defined). The API is conceptually similar to the SAX API but implemented differently.
XML data is passed to the parser instance through the XMLParser:parse method. Note that the parser currently accepts only a single string, so the whole document must be supplied in one call.
The default XML handler is xml.handlers.DOM, due to its ability to nondestructively parse any XML (representing comments, text nodes and mixed content appropriately). The module provides a serialiser supporting XML DOM root tables at xml.serialise, which has a compatibility layer for XML tree root tables.
If your application involves bidirectional parsing of data, such as the contents of templates using Wikia's infobox component, the xml.handlers.DOM handler is recommended. When creating XML configuration files for use in Lua modules, it is recommended to use the xml.handlers.Tree handler which allows for easier node traversal and data extraction.
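The handler/tokeniser split described above can be illustrated with a minimal sketch. It does not use the module itself: `TraceHandler` and `emit` are hypothetical names, `emit` merely stands in for the tokeniser, and the shape of the `tag` argument (a table with a `name` field) is an assumption rather than the documented callback signature. The point is only that a callback fires when the handler defines a function of that name, and is skipped otherwise.

```lua
-- Hypothetical handler that records the events it receives.
-- Callback names follow the Handler section of this page; the `tag.name`
-- field is an assumption made for illustration.
local TraceHandler = {}
TraceHandler.__index = TraceHandler

function TraceHandler:new()
    return setmetatable({ events = {} }, self)
end

function TraceHandler:starttag(tag)
    table.insert(self.events, 'start:' .. tag.name)
end

function TraceHandler:endtag(tag)
    table.insert(self.events, 'end:' .. tag.name)
end

function TraceHandler:text(text)
    table.insert(self.events, 'text:' .. text)
end

-- Stand-in for the tokeniser: invokes a callback only if the handler defines it.
local function emit(handler, event, ...)
    local callback = handler[event]
    if type(callback) == 'function' then
        callback(handler, ...)
    end
end

local handler = TraceHandler:new()
emit(handler, 'starttag', { name = 'book' })
emit(handler, 'text', 'Dune')
emit(handler, 'comment', 'skipped: TraceHandler defines no comment callback')
emit(handler, 'endtag', { name = 'book' })

local trace = table.concat(handler.events, ' ')
```

Here `trace` ends up as `start:book text:Dune end:book`; the comment event is dropped because the handler has no `comment` function.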
Features
- Tokenises well-formed XML (relatively robustly)
- Flexible handler-based event API (see xml.handlers documentation).
- Parses all XML infoset elements:
  - Tags
  - Text
  - Comments
  - CDATA
  - XML declarations
  - Processing instructions
  - DOCTYPE declarations
- Provides limited well-formedness checking (checks for basic syntax & balanced tags only)
- Flexible whitespace handling (optional)
- Entity handling (optional)
Limitations
- Shallow well-formedness checking only (fails to detect most semantic errors)
- Non-validating
- No charset handling
- No namespace support
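To make "shallow well-formedness checking" concrete, the following is a rough sketch of a stack-based balanced-tag check. It is not the module's implementation: `tags_balanced` is a hypothetical helper, and the pattern is deliberately naive (it ignores comments, CDATA sections and `>` characters inside attribute values).

```lua
-- Returns true if every start tag is closed by a matching end tag in order.
-- Hypothetical helper for illustration only; not part of the module.
local function tags_balanced(str)
    local stack = {}
    -- Captures: optional leading '/', tag name, optional trailing '/'.
    for closing, name, selfclosing in str:gmatch('<(/?)([%w_:%.%-]+)[^>]-(/?)>') do
        if closing == '/' then
            -- An end tag must match the most recently opened element.
            if stack[#stack] ~= name then
                return false, 'unexpected </' .. name .. '>'
            end
            table.remove(stack)
        elseif selfclosing ~= '/' then
            table.insert(stack, name)
        end
    end
    if #stack > 0 then
        return false, 'unclosed <' .. stack[#stack] .. '>'
    end
    return true
end

local ok_nested = tags_balanced('<a><b/><c>text</c></a>')
local ok_crossed, crossed_err = tags_balanced('<a><b></a></b>')
local ok_open, open_err = tags_balanced('<a><b></b>')
-- Duplicate attributes are an error in XML, yet this check cannot see them.
local ok_duplicate = tags_balanced('<a x="1" x="2"></a>')
```

A check of this kind catches crossed or unclosed tags but accepts documents with other defects, which is the kind of gap the limitations above describe.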
Usage
local xml = require('Dev:XML')
local inspect = require('Module:Inspect')
local options = { indent = ' ' }
-------- Uses a handler that converts the XML to a Lua table. --------
-- Pass the handler by name; omitting it would select the default "DOM" handler.
local handler = 'Tree'
----------------------- Books XML parse code. ------------------------
mw.log('books.xml')
local books_root = xml.parse(xml.load('Dev:XML/testcases/books'), handler)
mw.log(inspect(books_root, options))
----------------------- People XML parse code. -----------------------
mw.log('people.xml')
local people_root = xml.parse(xml.load('Dev:XML/testcases/people'), handler)
mw.log(inspect(people_root, options))
Documentation
Package items
xml.parse(str, handler, parser_opts, handler_opts) (function)
- Parses an XML string into an abstract syntax tree or event trace. This function includes logic to attach a handler to the XML parser, making it much more convenient than xml.parser.
- Parameters:
  - str: XML string to be parsed. (string)
  - handler: Handler to use, given as a handler name or a handler table. Default: "DOM". (string|table)
  - parser_opts: Parser configuration options. Defaults are listed in xml.parser options. (table; optional)
  - handler_opts: Handler configuration options. Defaults are listed in xml.handler options. (table; optional)
- Error: 'XML handler "$handler" not found' (line 688)
- Returns: Lua representation of XML root structure. (table)
xml.serialise(tbl, level) (function)
- Converts a Lua XML DOM tree to an XML string representation.
- Parameters:
  - tbl: DOM or tree root for XML conversion. This parameter is the root table generated by an xml.handlers.DOM or xml.handlers.Tree parser instance. (table)
  - level: Only used internally, when the function is called recursively to print indentation. (number; optional)
- Error: 'cannot serialise this value. Are you using a handler other than "xml.handlers.DOM" and "xml.handlers.Tree"?' (line 739)
- Returns: XML string representation for table. (string)
xml.load(filepath) (function)
- Loads an XML file from a specified path. If the file is in the Module namespace, the loader assumes the page is a Lua module returning a string. Otherwise, the loader will fetch the page's raw text, removing any leading non-XML comment/shebang.
- Parameter:
  - filepath: XML file target path (including namespace). (string)
- Error: 'file "$filepath" does not contain XML' (line 784)
  - The page filepath does not exist.
  - The module filepath does not exist or does not export a string.
- Returns: The contents of the XML file. (string)
xml.parser(handler, options) (function)
- Instantiates an XmlParser object to parse an XML string.
- Parameters:
  - handler: Handler object to be used to convert the XML string to another format, usually from xml.handlers. (table)
  - options: Options for parsing XML. (table; optional)
  - options.stripWS: Strip non-significant whitespace (leading or trailing) and do not generate events for empty text elements. Default: true. (boolean; optional)
  - options.expandEntities: Expand entities. Currently only standard entities and single-character numeric entities are supported; this could be extended at runtime if a suitable DTD parser added elements to the table (see XMLParser._ENTITIES). It may also be possible to expand multibyte entities for UTF-8 only. Default: true. (boolean; optional)
  - options.errorHandler: Custom error handler function. (function; optional)
- Returns: An XML parser instance used to parse the XML.
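As a rough illustration of what the stripWS and expandEntities options describe, the sketch below trims a text node and expands the five predefined XML entities plus ASCII numeric references. It is not the module's code: `ENTITIES`, `strip_ws` and `expand_entities` are hypothetical names, and the restriction to code points below 128 is an assumption chosen to keep the example encoding-independent.

```lua
-- Hypothetical table of the five predefined XML entities.
local ENTITIES = { lt = '<', gt = '>', amp = '&', quot = '"', apos = "'" }

-- Trims leading/trailing whitespace; returns nil for whitespace-only text,
-- mirroring "do not generate events for empty text elements".
local function strip_ws(text)
    local trimmed = text:match('^%s*(.-)%s*$')
    if trimmed == '' then
        return nil
    end
    return trimmed
end

-- Expands named and numeric entities in a single pass, so the output of
-- one expansion is never expanded again.
local function expand_entities(text)
    local expanded = text:gsub('&(#?)(%w+);', function(hash, body)
        if hash == '#' then
            local code
            if body:sub(1, 1) == 'x' then
                code = tonumber(body:sub(2), 16) -- hexadecimal reference
            else
                code = tonumber(body, 10)        -- decimal reference
            end
            if code and code < 128 then
                return string.char(code)
            end
            return nil -- leave non-ASCII or malformed references untouched
        end
        return ENTITIES[body] -- nil leaves unknown entities untouched
    end)
    return expanded
end

local stripped = strip_ws('  \n  Dune \t')
local blank = strip_ws(' \n ')
local expanded = expand_entities('a &lt; b &amp;&amp; c &#65;&#x42; &unknown;')
local escaped = expand_entities('&amp;lt;')
```

With these inputs, `stripped` is `Dune`, `blank` is `nil`, `expanded` is `a < b && c AB &unknown;`, and `escaped` is `&lt;` because expansion runs only once.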
xml.handlers (table)
- XML handlers for conversion logic in the XML parser.
xml.handlers.DOM (table)
- Handler to generate a DOM-like node tree structure. The tree structure has a single ROOT node parent, and is capable of representing any valid XML document. Each node is a table comprising the fields below:
  - _name: element name (string)
  - _type: any of 'ROOT', 'ELEMENT', 'TEXT', 'COMMENT', 'PI', 'DECL', 'DTD' (string)
    - PI: XML processing instruction tag.
    - DECL: XML declaration tag.
  - _attr: node attributes - see callback API (table)
  - _parent: parent node (table)
  - _children: child nodes (table)
xml.handlers.DOM:new(options) (function • constructor)
- Instantiates a new DOM handler.
- Parameters:
  - options: Handler options for parsing. (table)
  - options.commentNode: Whether to include comment nodes. Default: true. (boolean; optional)
  - options.piNode: Whether to include processing instruction nodes. Default: true. (boolean; optional)
  - options.dtdNode: Whether to include DTD declaration nodes. Default: true. (boolean; optional)
  - options.declNode: Whether to include XML declaration nodes. Default: true. (boolean; optional)
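The node fields listed for xml.handlers.DOM lend themselves to a simple recursive walk. The sketch below builds a small DOM-like table by hand instead of calling the parser, so it runs anywhere. The `_text` field used for text and comment content is an assumption (it is not among the fields documented above), and `walk` is a hypothetical helper.

```lua
-- Hand-built stand-in for the DOM root of:
--   <book id="b1"><!-- first edition -->Dune</book>
-- `_text` is an assumed field name for text and comment content.
local root = { _type = 'ROOT', _children = {} }
local book = {
    _type = 'ELEMENT', _name = 'book', _attr = { id = 'b1' },
    _parent = root, _children = {},
}
root._children[1] = book
book._children[1] = { _type = 'COMMENT', _text = ' first edition ', _parent = book }
book._children[2] = { _type = 'TEXT', _text = 'Dune', _parent = book }

-- Depth-first traversal: visit the node, then each child in document order.
local function walk(node, visit)
    visit(node)
    for _, child in ipairs(node._children or {}) do
        walk(child, visit)
    end
end

local types, text = {}, {}
walk(root, function(node)
    table.insert(types, node._type)
    if node._type == 'TEXT' then
        table.insert(text, node._text)
    end
end)

local type_trace = table.concat(types, ' ')
local text_content = table.concat(text)
```

Because comments stay in the tree as their own nodes, a walk like this can skip or preserve them as needed, which is what makes the DOM form suitable for round-tripping.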
xml.handlers.Tree (table)
- Handler to generate a natural table-based tree. This handler supports many XML formats. The XML structure tree is mapped into a recursive map of node names to child elements (as a string representing text, or a table of values).
- Where there is only a single child element, it is inserted as a named key. If there are multiple elements, these are inserted as array elements (in some cases it may be preferable to always insert elements as an array element, which can be specified on a per-element basis in the options). Attributes are inserted as a child element with a key of '_attr'.
- In general, this format is relatively useful, despite the following limitations:
  - Tag/text & CDATA elements are processed - all others are ignored.
  - Mixed-content XML behaves unpredictably.
  - If a leaf element has both a text element and attributes, the text must be accessed through an array element (to provide a container for the attribute).
xml.handlers.Tree:new(options) (function • constructor)
- Instantiates a new tree handler.
- Parameters:
- Returns: Tree handler instance. (Handler)
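The mapping rules for the Tree handler are easier to see on a concrete table. The following is a hand-written approximation of the shape described above, not actual parser output, so the exact layout is an assumption. `as_list` is a hypothetical helper that smooths over the single-child versus multiple-children difference using a simple heuristic.

```lua
-- Approximate Tree root for:
--   <people>
--     <person type="natural"><name>Ana</name></person>
--     <person type="legal"><name>Acme</name></person>
--     <note lang="en">draft</note>
--   </people>
local root = {
    people = {
        -- Repeated elements become an array under the element name.
        person = {
            { _attr = { type = 'natural' }, name = 'Ana' },
            { _attr = { type = 'legal' }, name = 'Acme' },
        },
        -- A leaf with both attributes and text keeps the text at index 1.
        note = { _attr = { lang = 'en' }, 'draft' },
    },
}

-- Heuristic: a table whose first array slot is itself a table is treated as
-- a list of elements; anything else is wrapped into a one-item list.
local function as_list(value)
    if value == nil then
        return {}
    end
    if type(value) == 'table' and type(value[1]) == 'table' then
        return value
    end
    return { value }
end

local names = {}
for _, person in ipairs(as_list(root.people.person)) do
    table.insert(names, person.name)
end
local joined = table.concat(names, ',')
local note_text = root.people.note[1]
local note_lang = root.people.note._attr.lang
```

Code that must handle both one and many children can call a helper like `as_list` instead of checking the shape at every access; the per-element array option mentioned above removes the ambiguity at the source.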
xml.handlers.Print (table)
- Handler to generate simple event tracing during parsing. Outputs messages to the Scribunto console during the parse process, usually for debugging purposes.
xml.handlers.Print:new(options) (function • constructor)
- Instantiates a new Print handler.
- Parameters:
  - options: Handler options for parsing. (table)
  - options.commentNode: Whether to include comment nodes. Default: true. (boolean; optional)
  - options.piNode: Whether to include processing instruction nodes. Default: true. (boolean; optional)
  - options.dtdNode: Whether to include DTD declaration nodes. Default: true. (boolean; optional)
  - options.declNode: Whether to include XML declaration nodes. Default: true. (boolean; optional)
XMLParser
Class providing the actual XML parser.
XmlParser.new(_handler, _options) (function)
- Instantiates an XmlParser object.
- Parameters:
  - _handler: Handler object to be used to convert the XML string to another format. See the available handlers at xml.handlers. (table)
  - _options: Options for this XmlParser instance, defined in xml.parser.
XmlParser:parse(str, parseAttributes) (function)
- Main function, which starts the XML parsing process.
- Parameters:
Handler
Handler object, used to generate parser output.
Handler:new(options) (function)
- Instantiates a new handler object. Each instance can handle a single XML string. By using such a constructor, you can parse multiple XML files in the same application.
- Parameter:
  - options: Handler configuration options. (table; optional)
- Returns: Handler object instance. (Handler)
- Note: This method is not available in xml.handlers.Print.
Handler:starttag(tag, tag1, tag2, s, e) (function)
- Parses a start tag.
- Parameters:
Handler:endtag(tag, tag1, tag2, s, e) (function)
- Parses an end tag.
- Parameters:
Handler:text(text, s, e) (function)
- Parses the text content of a tag.
- Parameters:
Handler:comment(text, s, e) (function)
- Parses a comment tag.
- Parameters:
Handler:pi(tag, tag1, tag2, s, e) (function)
- Parses an XML processing instruction (PI) tag.
- Parameters:
Handler:decl(tag, tag1, tag2, s, e) (function)
- Parses the XML declaration line (indicating the XML version).
- Parameters:
Handler:dtd(tag, tag1, tag2, s, e) (function)
- Parses a DTD tag.
- Parameters:
Handler:cdata(text, s, e) (function)
- Parses a CDATA section.
- Parameters: