Parsing XML with libxml2-wasm

libxml2-wasm supports parsing XML from either a string or a buffer.

Here’s an example:

import fs from 'node:fs';
import { XmlDocument } from 'libxml2-wasm';

const doc1 = XmlDocument.fromString('<note><to>Tove</to></note>');
const doc2 = XmlDocument.fromBuffer(fs.readFileSync('doc.xml'));
doc1.dispose();
doc2.dispose();

The underlying libxml2 library processes MBCS (mostly UTF-8) only. Therefore, when working with UTF-16 strings in JavaScript, an additional conversion step is required. Consequently, XmlDocument.fromBuffer is significantly faster than XmlDocument.fromString. For more information, refer to the benchmark.

XInclude 1.0 is now supported.

When libxml2 encounters the <xi:include> tag and requires the content of another XML, it utilizes callbacks to read the data.

With xmlRegisterInputProvider, an XmlInputProvider object with a set of four callbacks can be registered.

These 4 callbacks are

  • match: This callback is invoked with the URL of the included XML. If it returns true, the subsequent three callbacks will be used to retrieve the XML content. Otherwise, other callbacks will be considered.
  • open: This callback is invoked when the parser starts reading the included XML.
  • read: This callback is invoked while reading the XML content.
  • close: This callback is invoked when the parser has finished reading the included XML.

Sometimes, the href attribute of the <xi:include> tag contains a relative path. In such cases, an initial URL can be provided to the parsing function so that libxml can calculate the actual URL of the included XML.

For instance, if the href attribute is sub.xml, and the parent XML is parsed in the following call:

const doc = XmlDocument.fromBuffer(
await fs.readFile('/path/to/doc.xml'),
{ url: 'file:///path/to/doc.xml' },
);
doc.dispose();

The registered callbacks will be invoked with the file name file:///path/to/sub.xml.

For the scenario where the included XMLs are stored in memory buffers, two sets of helper functions are provided:

  • Lower level: Similar to fs.open, fs.read, and fs.close, openBuffer opens a buffer and returns a file descriptor that can be used by readBuffer and closeBuffer to read and close the buffer, respectively. You can call these functions when implementing your own provider callbacks.

  • Higher level: If all your sources are memory buffers and each has a name specified in the container XML, you can simply use XmlBufferInputProvider which implements XmlInputProvider and manages buffers by their names.

Serialize an XML

XmlDocument.toBuffer gradually dumps the content of the XML DOM tree into a buffer and calls the XmlOutputBufferHandler to process the data.

Please note that UTF-8 is the only supported encoding at this time.

Based on toBuffer, XmlDocument.toString is a convenience functions to get an XML string.

For instance, to save an XML as a compact string, use:

xml.toString({ format: false });

Node.js

For Node.js users who require callbacks for accessing local files, the module nodejs predefines convenience helper functions.

fsInputProviders is the callback implementation reading local files using node:fs module, which supports both file paths and file URLs. To enable this feature, either register the provider or simply call xmlRegisterFsInputProviders:

import { XmlDocument } from 'libxml2-wasm';
import { xmlRegisterFsInputProviders } from 'libxml2-wasm/lib/nodejs.mjs';

xmlRegisterFsInputProviders();

const doc = XmlDocument.fromBuffer(
await fs.readFile('path/to/doc.xml'),
{ url: 'path/to/doc.xml' },
);
doc.dispose();

To save an XML to a file in a Node.js environment, open a file and use saveDocSync:

import fs from 'node:fs';
import { saveDocSync } from 'libxml2-wasm/lib/nodejs.mjs';

const fd = fs.openSync('doc.xml', 'w');
saveDocSync(xml, fd);

saveDocSync uses XmlDocument.toBuffer, which is faster than XmlDocument.toString because similar to XmlDocument.fromBuffer and XmlDocument.fromString, it doesn't need to convert UTF-8 to UTF-16 and then convert it back.