simdxml

SIMD-accelerated XML parser with full XPath 1.0 support for Python.

simdxml parses XML into flat arrays instead of a DOM tree, then evaluates XPath expressions against those arrays. The approach adapts simdjson's structural indexing architecture to XML: SIMD instructions classify structural characters in parallel, producing a compact index that supports all 13 XPath 1.0 axes via array operations.

Installation

pip install simdxml

Pre-built wheels for Linux (x86_64, aarch64), macOS (arm64, x86_64), and Windows.

Quick start

import simdxml

doc = simdxml.parse(b"<library><book><title>Rust</title></book></library>")
titles = doc.xpath_text("//title")
assert titles == ["Rust"]

API

Native API

The native API gives you direct access to the SIMD-accelerated engine:

import simdxml

# Parse bytes or str
doc = simdxml.parse(xml_bytes)

# XPath queries
doc.xpath_text("//title")          # -> list[str] (direct child text)
doc.xpath_string("//title")        # -> list[str] (all descendant text, like XPath string())
doc.xpath("//book[@lang='en']")    # -> list[Element | str]

# Element traversal
root = doc.root
root.tag                           # "library"
root.text                          # direct text content or None
root.attrib                        # {"lang": "en", ...}
root.get("lang")                   # "en"
root[0]                            # first child element
len(root)                          # number of child elements
list(root)                         # all child elements

# Navigation (lxml-compatible)
elem.getparent()                   # parent element or None
elem.getnext()                     # next sibling or None
elem.getprevious()                 # previous sibling or None

# XPath from any element
elem.xpath(".//title")             # context-node evaluation
elem.xpath_text("author")         # text extraction from context

# Batch APIs (single FFI call, interned strings)
root.child_tags()                  # -> list[str] of child tag names
root.descendant_tags("item")       # -> list[str] filtered by tag

# Compiled XPath (like re.compile)
expr = simdxml.compile("//title")
expr.eval_text(doc)                # -> list[str]
expr.eval_count(doc)               # -> int
expr.eval_exists(doc)              # -> bool
expr.eval(doc)                     # -> list[Element]

# Batch: process many documents in one call
docs = [open(f).read() for f in xml_files]
expr = simdxml.compile("//title")
simdxml.batch_xpath_text(docs, expr)           # bloom prefilter
simdxml.batch_xpath_text_parallel(docs, expr)  # multithreaded

ElementTree drop-in (read-only)

Full read-only drop-in replacement for xml.etree.ElementTree. Every read-only Element method and module function is supported:

from simdxml.etree import ElementTree as ET

tree = ET.parse("books.xml")
root = tree.getroot()

# All stdlib Element methods work
root.tag, root.text, root.tail, root.attrib
root.find(".//title")              # first match
root.findall(".//book[@lang]")     # all matches
root.findtext(".//title")          # text of first match
root.iterfind(".//author")         # iterator
root.iter("title")                 # descendant iterator
root.itertext()                    # text iterator
root.get("key"), root.keys(), root.items()
len(root), root[0], list(root)

# All stdlib module functions work
ET.parse(file), ET.fromstring(text), ET.tostring(element)
ET.iterparse(file, events=("start", "end"))
ET.canonicalize(xml), ET.dump(element), ET.iselement(obj)
ET.XMLPullParser(events=("end",)), ET.XMLParser(), ET.TreeBuilder()
ET.fromstringlist(seq), ET.tostringlist(elem)
ET.QName(uri, tag), ET.XMLID(text)

# Plus full XPath 1.0 (lxml-compatible extension)
root.xpath("//book[contains(title, 'XML')]")

Mutation operations (append, remove, set, SubElement, indent, etc.) raise TypeError with a helpful message pointing to stdlib.

Read-only by design

simdxml Elements are immutable views into the structural index. Mutation operations raise TypeError with a helpful message:

root.text = "new"  # TypeError: simdxml Elements are read-only.
                    #   Use xml.etree.ElementTree for XML construction.

XPath 1.0 support

Full conformance with XPath 1.0:

327/327 libxml2 conformance tests (100%)
1015/1023 pugixml conformance tests (99.2%)
All 13 axes: child, descendant, parent, ancestor, following-sibling, preceding-sibling, following, preceding, self, attribute, namespace, descendant-or-self, ancestor-or-self
All 25 functions: string(), contains(), count(), position(), last(), starts-with(), substring(), concat(), normalize-space(), etc.
Operators: and, or, =, !=, <, >, +, -, *, div, mod, |
Predicates: positional [1], [last()], boolean [@attr='val'], nested

Benchmarks

Apple Silicon, Python 3.14, lxml 6.0. GC disabled, 3 warmup + 20 timed iterations, median reported. 100K-element catalog (5.6 MB). Run yourself: uv run python bench/bench_parse.py

Faster than lxml on every operation. Faster than stdlib on 11 of 14.

Operation	simdxml	lxml	stdlib	vs lxml	vs stdlib
`parse()`	10 ms	33 ms	55 ms	3x	5x
`find("item")`	<1 us	1 us	<1 us	faster	tied
`find(".//name")`	<1 us	1 us	1 us	faster	faster
`findall("item")`	0.23 ms	4.8 ms	0.89 ms	21x	4x
`findall(".//item")`	0.15 ms	6.2 ms	3.0 ms	42x	20x
`findall(predicate)`	1.5 ms	12 ms	4.9 ms	8x	3x
`findtext(".//name")`	<1 us	1 us	1 us	faster	faster
`xpath_text("//name")`	2.1 ms	19 ms	4.4 ms	9x	2x
`iter()`	9.2 ms	15 ms	1.3 ms	2x	0.14x
`iter("item")` filtered	4.5 ms	5.9 ms	1.9 ms	1.3x	0.4x
`itertext()`	2.6 ms	33 ms	1.4 ms	13x	0.5x
`child_tags()`	0.40 ms	6.2 ms	1.5 ms	16x	4x
`iterparse()`	51 ms	66 ms	70 ms	1.3x	1.4x
`canonicalize()`	1.8 ms	4.7 ms	4.6 ms	3x	3x

The three operations where stdlib is faster (iter, itertext, iter filtered) involve creating per-element Python objects. The batch alternatives (child_tags(), xpath_text()) beat both lxml and stdlib for those workloads.

Batch processing (multiple documents)

batch_xpath_text uses a bloom filter to skip non-matching documents at ~10 GiB/s. batch_xpath_text_parallel spreads parse + eval across threads. Both return all results in a single FFI call — zero per-document Python overhead.

Workload	Python loop	bloom batch	parallel batch
1K small docs	1.1 ms	0.37 ms (3x)	12 ms
100x 31KB docs	7.9 ms	8.2 ms	2.6 ms (3x)

Use bloom batch when many documents won't match the query (ETL filtering). Use parallel batch when documents are large (>10KB) and most will match.

How it works

Instead of building a DOM tree with heap-allocated nodes and pointer-chasing, simdxml represents XML structure as parallel arrays (struct-of-arrays layout). Each tag gets an entry in flat arrays for starts, ends, types, names, depths, and parents -- all indexed by the same position.

~16 bytes per tag vs ~35 bytes per DOM node
O(1) ancestor/descendant checks via pre/post-order numbering
O(1) child enumeration via CSR (Compressed Sparse Row) indices
SIMD-accelerated structural parsing (NEON on ARM, AVX2 on x86)
Parse eagerly builds all indices (CSR, name posting, parent map) so subsequent queries pay zero index construction cost

Platform support

Platform	SIMD Backend	Status
aarch64 (Apple Silicon, ARM)	NEON 128-bit	Production
x86_64	AVX2 256-bit / SSE4.2	Production
Other	Scalar (memchr-accelerated)	Working

Development

git clone https://github.com/simdxml/simdxml-python
cd simdxml-python

make dev        # build extension (debug mode)
make test       # run tests
make lint       # ruff check + format
make typecheck  # pyright

Requires Rust toolchain and Python 3.9+.

License

MIT OR Apache-2.0 (same as the simdxml Rust crate)

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.github/workflows		.github/workflows
bench		bench
python/simdxml		python/simdxml
src		src
tests		tests
.gitignore		.gitignore
.python-version		.python-version
Cargo.toml		Cargo.toml
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

simdxml

Installation

Quick start

API

Native API

ElementTree drop-in (read-only)

Read-only by design

XPath 1.0 support

Benchmarks

Batch processing (multiple documents)

How it works

Platform support

Development

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

simdxml

Installation

Quick start

API

Native API

ElementTree drop-in (read-only)

Read-only by design

XPath 1.0 support

Benchmarks

Batch processing (multiple documents)

How it works

Platform support

Development

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages