Porting an HTML5 Parser to Swift

How I built swift-justhtml with Claude - from 0% to 100% HTML spec test compliance, finding crash bugs with fuzzing, and the performance optimization needed to match JavaScript's speed in Swift

Emil Stenström spent months building justhtml, a Python HTML5 parser that achieves 100% compliance with the html5lib test suite. He wrote about the process - it involved starting from scratch multiple times, pivoting strategies when things weren't working, and iterating with AI coding agents until every edge case was handled.

Simon Willison then ported it to JavaScript in 4.5 hours. He had GPT-5.2 inspect the Python codebase, create a specification, and then just told it "do the rest, commit and push often" while he decorated for Christmas.

Looking around after reading those posts, I saw this on the GitHub page:

Parser          Spec Compliant   Pure Python   Speed
JustHTML        ✅ 100%          ✅ Yes        ⚡ Fast
BeautifulSoup   🔴 4%            ✅ Yes        🐢 Slow

I have some projects which use BeautifulSoup, and I know how slow it can be. So I switched some of my Python projects over to Emil Stenström's new justhtml library and loved it.

After that I started wishing I had a similar "known good", stable, pure Swift HTML parsing library for use in my Swift projects (Swift being the main language I actually write and use day to day).
Then I realised: I could probably just have Claude make one, the same way Emil Stenström made the library originally. The tests!

So I built swift-justhtml, a port of the justhtml API to Swift which also passes 100% of the html5lib-tests, like the original Python version.


The Agentic Feedback Loop

LLMs are amazing at writing mostly correct code. But only mostly. Coding agents like Claude Code and Codex are so powerful because they allow for instant feedback and iteration: the models can try to compile or run the code they just wrote, see the error, and fix it. Repeatedly, if needed.
So any programming task that fits that loop is perfect: fast iteration, quick feedback on errors soon after they are made, clear signal on whether an error has actually been fixed, and tools the agent can run to inspect, debug and test in different ways.

This project, with a fully spec'd-out set of tests via the html5lib test suite, is perfect.
Both the input and output are text, something LLMs understand.
The task is to transform one into the other, via a CLI tool the coding agent can run itself to check its work.
And the fully spec'd test suite gives detailed feedback, including on regressions - things broken by accident while making another change.

So I downloaded the tests onto my home dev server, created an empty Swift library/package directory structure, and set up the CLI tools needed. I also downloaded the Python and JS implementations alongside the tests, then told Claude Code with Opus 4.5 (which I still find produces the most reliable code, even if Codex's GPT-5 is smarter at debugging errors) to look at the public API the Python and JS versions offer and create a rough spec for a similar Swift API. The goal was to give the model room to adapt the API to Swift's type system and idioms, rather than insisting the public-facing API be identical.
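
The sort of shape I had in mind looked roughly like this - hypothetical names, sketched here for illustration rather than swift-justhtml's actual interface:

// Hypothetical API sketch, mirroring the Python/JS justhtml style:
let doc = try JustHTML.parse("<p>Hello <b>world</b></p>")
print(doc.root.children.count)

// CSS-selector queries, as the Python and JS versions offer:
for node in doc.query("p b") {
    print(node.textContent)
}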

I then told it to set up the Swift library tests to load and run the html5lib tests and report how many pass/fail, then to implement the HTML parser and run them. From there: work on fixing the failing tests, re-run the suite, and repeat until 100% of the tests passed.
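
For flavour, here's a rough sketch of what such a harness involves - not the repo's actual code. html5lib's tree-construction tests are .dat files where each case starts with #data and the expected tree follows #document as lines prefixed with "| ":

import Foundation

struct TreeTest {
    var data = ""          // the HTML input from the #data section
    var expectedTree = ""  // the "| "-prefixed tree from #document
}

func loadTests(from path: String) throws -> [TreeTest] {
    let text = try String(contentsOfFile: path, encoding: .utf8)
    var tests: [TreeTest] = []
    var current: TreeTest? = nil
    var section = ""
    for line in text.components(separatedBy: "\n") {
        if line == "#data" {
            if let done = current { tests.append(done) }
            current = TreeTest()
            section = "data"
        } else if line.hasPrefix("#") {
            section = String(line.dropFirst())  // "errors", "document", ...
        } else if var test = current {
            if section == "data" {
                test.data += (test.data.isEmpty ? "" : "\n") + line
            } else if section == "document", !line.isEmpty {
                test.expectedTree += line + "\n"
            }
            current = test
        }
    }
    if let done = current { tests.append(done) }
    return tests
}

From there, a test runner just parses each case's data, serializes the resulting tree in the same "| " format, and compares.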

It took two days of intensive work, with Claude working away in the background iterating on the tests, sometimes trying to cheat (we'll get back to that), and then a lot more work on performance tuning than I thought would be needed! Here's how it actually went.

Racing to 100%

The initial commit was a straightforward smoke test: parse a simple valid HTML string, <p>Hello</p>, and verify the DOM tree. I remembered Emil's blog post mentioning that he started this way before diving into edge cases and invalid input, and that sounded very wise, so I made sure to start from a known-good point too.

Then came the core implementation. HTML5 parsing is deceptively complex. The specification defines 67 distinct tokenizer states, 23 tree builder insertion modes, and algorithms with names like "adoption agency" and "Noah's Ark clause."

This is the Noah's Ark clause. But with three per family instead of two.
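
In code, the clause comes down to something like this - a hedged sketch rather than the library's actual implementation (the real clause also compares namespaces):

enum FormattingEntry {
    case marker
    case element(tagName: String, attributes: [String: String])
}

func pushWithNoahsArk(_ new: FormattingEntry, into list: inout [FormattingEntry]) {
    guard case let .element(tag, attrs) = new else {
        list.append(new)
        return
    }
    // Walk back to the last marker, collecting indices of matching entries.
    var matches: [Int] = []
    for i in stride(from: list.count - 1, through: 0, by: -1) {
        if case .marker = list[i] { break }
        if case let .element(t, a) = list[i], t == tag, a == attrs {
            matches.append(i)
        }
    }
    // Three per family, not two: drop the earliest match if already full.
    if matches.count >= 3 {
        list.remove(at: matches.last!)
    }
    list.append(new)
}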

The first real implementation got to 53% test compliance - just over half the html5lib tests passing.

From there, it was a steady but slow climb. Each commit tackled another piece of the spec:

  • Foreign content handling (SVG and MathML have their own rules)
  • Foster parenting (table elements need special handling - see the sketch after this list)
  • The full adoption agency algorithm (for misnested formatting elements)
  • Template element handling
  • Entity parsing with partial matches
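
Foster parenting, for instance, boils down to computing where misplaced table content should really go. A minimal sketch, assuming a simple illustrative Element type rather than the library's internals:

// Minimal Element type for illustration.
final class Element {
    let tagName: String
    weak var parent: Element?
    init(_ tagName: String, parent: Element? = nil) {
        self.tagName = tagName
        self.parent = parent
    }
}

// Content that would land directly inside a <table> gets redirected
// ("fostered") to just before that table instead.
func fosterParentTarget(openElements: [Element]) -> (parent: Element, before: Element?) {
    if let i = openElements.lastIndex(where: { $0.tagName == "table" }) {
        let table = openElements[i]
        if let realParent = table.parent {
            return (realParent, table)  // insert immediately before the table
        }
        // Fragment case: the table has no parent node, so use the element
        // below it on the stack (the stack root guarantees i > 0 here).
        return (openElements[i - 1], nil)
    }
    return (openElements[0], nil)  // no table open: insert into the root
}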

After a few hours of this, we were at 97.3% compliance. Then progress slowed dramatically.

At 99.6% (1763/1770 tests), the model hit a wall. The remaining 7 tests were genuinely hard edge cases - complex template/table/SVG interactions, multiply-nested table foster parenting, and a new HTML feature called <selectedcontent> that I'd never heard of before. The model chose instead to document them as "remaining edge cases" and moved on to adding features like CSS selectors and streaming parsing, skipping these entirely.

When I noticed, it was actually quite difficult to make the agent go back and include them; it really wanted to just mark them as skipped and consider that a pass.

Eventually I stopped the agent, had a second fresh agent clear away and delete all the skipping code, and then restarted the first agent, telling it to resume working through the failing tests - which now included the ones it had tried to skip.

The fixes turned out to be surprisingly small. The select mode marker handling needed a single check. The nested table problem needed explicit handling in two insertion modes. The final commit that achieved 100% compliance added just 34 lines to the tree builder:

// Select mode marker handling (webkit02.dat test 49):
// Insert marker when entering select mode to prevent reconstruction
// of formatting elements from outside select
case .select:
    self.activeFormattingElements.append(.marker)

Eventually: 1,770/1,770 tree construction tests passing.

Performance Reality Check

Having achieved 100% test compliance, I thought I was now done (minus setting up some documentation and API examples).

I ran benchmarks comparing the Swift, JavaScript, and Python implementations parsing the same HTML files, since I already had all three downloaded to my computer:

Implementation   Parse Time
Swift            308ms
Python           417ms
JavaScript       108ms

Swift was only 1.4x faster than Python. JavaScript was 2.9x faster than Swift.

This was not the result I expected from a compiled language. I knew Python was generally slow, but I also expected JS to be slow, so I thought the JavaScript version might be around the same speed as Swift, or a bit faster.

I wasn't expecting the JS version to be nearly 4x faster than the Python version, or the Swift version to be barely faster than Python.

If you remember, one of the lines from Swift's introduction (which has been shown to be untrue in practice over and over again since then) was "Python-like code, with speed faster than C".

However, I then remembered: Swift's strings are famously "spec correct" and slow. And HTML parsing is mostly string processing.

The Fuzzer Finds a Bug

Before investigating performance, I wanted to check that my library was reliable and stable. So I ran the test files for the crash scenarios Emil's fuzzer had found in his implementation against my implementation.

Fortunately, none of those crashed my library - but I also might just have been lucky. So I had Claude Code write and run a Swift fuzzer of our own, generating millions of random and malformed HTML documents to test parser robustness.

After hundreds of thousands of fuzzer runs, it found one crash. I had told the coding agent not to fix any fuzzer crashes it found, but to investigate and create a test file which reproduced the crash reliably.
Once we had that test file set up, I had it fix the issue and resume fuzzing the library.

Input: <table></table><li><table></table>
Context: select fragment
Result: SIGSEGV (segmentation fault)

The parser was hitting infinite recursion. When parsing table-related tags inside a select fragment context, popUntil("select") had no effect (select wasn't on the stack, just the context), and resetInsertionMode() would restore select mode, causing an infinite loop when reprocessing the tag.

Interestingly, simpler variants didn't crash:

  • <table></table> in select - fine
  • <table></table><li> in select - fine
  • <li><table></table> in select - fine
  • <table></table><li><table></table> in select - crash

It was the specific sequence that triggered infinite recursion. The fix was checking if select was context-only and, if so, clearing the context and switching directly to inBody mode.
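
The shape of that fix, heavily paraphrased (these names are illustrative, not the library's actual internals):

// Inside the select-mode handling for table-related tags:
if !openElements.contains(where: { $0.tagName == "select" }) {
    // "select" exists only as the fragment context: popUntil("select")
    // is a no-op, and resetInsertionMode() would drop us straight back
    // into select mode - infinite recursion. Bail out to inBody instead.
    fragmentContext = nil
    insertionMode = .inBody
    reprocessCurrentToken()
}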

This is why fuzzing is useful, especially for testing input handling - the html5lib tests don't cover every possible fragment parsing scenario.

The Performance Hunt

With the crash fixed, all the html5lib tests passing, none of the test files from the other repos causing crashes or failures, and my own fuzzer failing to find any issues, I turned back to performance. The 2.9x gap with JavaScript was troubling me.

I now had a reliable, stable Swift implementation of the HTML5 spec, with a justhtml-style API to parse and process it. So I didn't want to break that, or make the code an unrecognisable mess, in pursuit of performance.

So I created a turbo branch and started profiling. The initial breakdown showed roughly 50/50 time split between tokenizer and tree builder. Both needed work to speed this up.

Swift Strings to raw UTF-8 bytes

Unfortunately, Swift strings are slow, and if you want to go faster, one of the most straightforward decisions is to avoid them.

So I gave the agent some benchmarking tools, sample code, and guidance on how to research what was costing the time, and told it to carefully profile, experiment, benchmark, then re-run the tests and fix any regressions it caused. If it found a faster solution which still passed all the tests, it was to commit it and start again: profile to find the new slowest part and experiment with ways to optimise that.

The first optimization was switching from Swift's String.Index to raw UTF-8 bytes. Swift's string handling is Unicode-correct but expensive: integer-offset access is O(n), and even simple iteration has to compute grapheme cluster boundaries.

// Before: String.Index iteration
for ch in html { ... }  // grapheme-cluster work on every step, no O(1) indexing

// After: Byte-level access
let bytes = ContiguousArray<UInt8>(html.utf8)
for i in 0..<bytes.count {
    let byte = bytes[i]  // O(1) access
}

Converting to ContiguousArray<UInt8> gave immediate gains:

302ms → 261ms (14% faster)

Batch text insertion

Next was batch text insertion. The tree builder was creating a new text node for every character token from the tokenizer. Instead, it now accumulates incoming raw bytes in a buffer until it reaches a character it needs to switch on, and only at that point converts the accumulated buffer into a single String. Rather than turning each character into a string and appending it to a pending buffer string every time, we pay the object creation and conversion cost once.
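
Roughly, the buffering works like this - a minimal sketch assuming the byte-based tokenizer from the previous section and simple Node/TextNode types (not the library's exact code):

var pendingText: [UInt8] = []

// Called when the tokenizer produces anything other than a character
// token (a tag, a comment, end-of-file) - the point we "switch" on.
func flushPendingText(into parent: Node) {
    guard !pendingText.isEmpty else { return }
    // One String conversion and one text node for the whole run,
    // instead of one per character token.
    parent.appendChild(TextNode(String(decoding: pendingText, as: UTF8.self)))
    pendingText.removeAll(keepingCapacity: true)
}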

Coalescing consecutive characters before insertion:

261ms → 182ms (30% faster)

Avoiding Memory Allocation

Then I directed it to look into memory allocation and ways to reduce how many objects we were creating in the loops. The agent found inline array literals in hot paths:

// This creates a temporary array on EVERY tag
if ["td", "th", "tr"].contains(name) { ... }

Moving these to module-level Set<String> constants eliminated thousands of allocations per document:

182ms → 172ms (6% faster)

Batch scanning for tag names

The biggest single win came from batch scanning for tag names. Instead of building names character-by-character with string concatenation, scan ahead to find the delimiter and extract the whole chunk at once, again paying the string creation and modification cost only once.
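
A minimal sketch of the idea, assuming the bytes array and a pos cursor as in the earlier snippet:

var end = pos
while end < bytes.count {
    let b = bytes[end]
    // Tag names end at whitespace, '/' or '>'.
    if b == 0x09 || b == 0x0A || b == 0x0C || b == 0x20 || b == 0x2F || b == 0x3E { break }
    end += 1
}
// One slice, one String conversion for the whole name.
let name = String(decoding: bytes[pos..<end], as: UTF8.self)
pos = end

With that change: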

159ms → 118ms (26% faster)

Removing the other inline array literals

The final optimizations involved hunting down every remaining inline array literal in the tree builder. Seven more were found in processStartTagInBody, each creating an array on every single tag. For a "span" tag (common), the code was:

  1. Create and search a 10-element array (head tags) - no match
  2. Create and search a 26-element array (block tags) - no match
  3. Create and search an 11-element array (table tags) - no match
  4. Keep searching...

Converting these to static Sets:

118ms → 98ms (17% faster)

Final Result

One more pass with buffer reuse - clearing collections in place instead of reassigning them to a new empty [:] - brought the final improvement. In Swift, that's the difference between keeping the allocation and throwing it away (a sketch; the buffer names are illustrative):
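
// removeAll(keepingCapacity: true) keeps the existing storage
// instead of allocating fresh on every use.
attributes.removeAll(keepingCapacity: true)   // instead of: attributes = [:]
pendingText.removeAll(keepingCapacity: true)  // instead of: pendingText = []

The final numbers: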

302ms → 97ms - a 3.1x speedup.

The Swift implementation now matched or just barely beat the JavaScript. But it took ~20 optimization commits, detailed profiling, and completely throwing out Swift's string APIs in favor of raw byte manipulation. The JavaScript version does none of this - it just uses simple, readable character-by-character processing with array.push().

More on that in a follow-up post about V8's performance.

Merging and Polish

With the turbo branch proving out, I merged it back to main. The rest of the work was setting up documentation and examples:

  • DocC documentation with GitHub Pages deployment
  • Example CLI tools (htmltool, html2md, extractlinks, fetchpage)
  • A Swift Playground for interactive experimentation
  • Memory usage benchmarking across the three language implementations
  • README improvements

The Final Result

8,953 tests passing:

  • Tree construction: 1,831
  • Tokenizer: 6,810
  • Serializer: 230
  • Encoding: 82

Performance: 97ms to parse 2.5MB of HTML across 5 Wikipedia articles. 4x faster than Python, matching JavaScript.

Zero dependencies. Pure Swift using only Foundation (and even that lightly, now that String is barely used).

Fuzz tested. Millions of malformed documents without crashes.

How Do Other Swift Libraries Compare?

After finishing, I was curious how existing Swift HTML libraries fared against the same test suite. So I downloaded and benchmarked the ones which support Linux. The results:

Library          Pass Rate           Notes
swift-justhtml   100% (1831/1831)    None
Kanna            94.4% (1542/1633)   Uses libxml2 (HTML 4.01 parser)
SwiftSoup        87.9% (1436/1633)   Infinite loop on 197 tests
LilHTML          47.4% (775/1634)    Crashes on 52% of tests

Kanna and LilHTML both use libxml2 under the hood. libxml2 is a C library that implements HTML 4.01 parsing, not the WHATWG HTML5 specification that browsers actually use today. It's fast (native C code) but won't handle modern HTML correctly.

SwiftSoup is a port of Java's Jsoup library. It hit infinite loops on all 197 tests in tests16.dat - edge cases involving script tags.

LilHTML was the most surprising. It wraps libxml2 but crashes on over half the test inputs due to unhandled NULL returns. I'm not sure why it differs so much from Kanna (it's possible I'm doing something wrong in the test setup for one of them; I didn't look into it that closely).

What I Learned

HTML5 parsing is harder than it looks. It looks like simple XML parsing, but that went out the window when XHTML was abandoned; now the specification is thousands of pages, and the edge cases interact in surprising ways.

Test suites are invaluable. The html5lib tests gave concrete pass/fail feedback for every change. Without them, LLM coding agents would not be able to tackle a task of this size!

Given tools and feedback, LLM agents are now incredibly powerful! The key is packaging your task in a format which lets the LLM check how it is going, and get feedback on what it did right or wrong, so it can iterate its way to a solution.

Fuzzing finds real bugs. The select fragment crash would never have been caught by the standard test suite alone.

Swift's performance isn't automatic. Being a compiled language doesn't guarantee speed, unfortunately. Understanding memory allocation, string handling, and avoiding unnecessary work matters a lot if you want fast Swift code. Otherwise Swift can quite easily end up as slow to run as it is to compile.

V8 is WAY faster than I thought! Even the super-optimised version of swift-justhtml - minimal String use, raw byte processing, no inline arrays, nothing else that could cause allocations - still only draws level with the straightforward default JS implementation running on V8 in Node.

AI coding agents are transformative for this kind of work. To reiterate: The tight feedback loop - run tests, see failures, propose fix, repeat - is exactly what they're good at. I spent my time on architecture decisions and code review rather than typing out implementations.

Which is basically the same conclusion Emil reached.

Links