Deus Lex Machina
Today, I am excited to announce the alpha release of a brand new compacting Zig tokenizer! You may find it in the following repository:
https://github.com/Validark/Accelerated-Zig-Parser
Give it a star and come back!
Please note it is not ready for prime-time just yet, since there are still more optimizations to be had, as well as support for more architectures. At the moment, only AMD64 machines with AVX-512 instructions are supported.
That being said, the new implementation can tokenize up to 2.75x faster than the mainline implementation, currently at 1.4GB/s on a single core on my laptop, with a lot of improvements coming soon!
All Your Tokens Are Belong to Us
The above repository benchmarks 3 Zig tokenizers:
- The tokenizer in the Zig Standard Library used by the 0.14 compiler.
- A heat-seeking tokenizer (talk 1, talk 2). (I had to temporarily remove it, but will add it back by July.)
- A compacting tokenizer (talk 3)
Benchmark results on my laptop with a Ryzen AI 9 HX 370:
Read in files in 26.479ms (1775.63 MB/s) and used 47.018545MB memory with 3504899722 lines across 3253 files
Legacy Tokenizing took 91.419ms (0.51 GB/s, 38.34B loc/s) and used 40.07934MB memory
Tokenizing with compression took 33.301ms (1.41 GB/s, 105.25B loc/s) and used 16.209284MB memory
That's 2.75x faster and 2.47x less memory than the mainline implementation!
Headline features
The new compacting tokenizer processes an entire 64-byte chunk of source code at once (soon to be 512!). It includes:
- A fully SIMDized UTF-8 validator ported from simdjson/simdutf.
- A fully branchless bit-manipulation routine to determine which characters are escaped by backslashes. (Thanks to John Keiser (@jkeiser) for designing this algorithm for simdjson.)
- A loop which parses all lines in a given chunk in parallel for strings/comments/character_literals/line_strings. Characters inside these constructs must be exempted from forming their own tokens.
- A multi-purpose vectorized table lookup. It performs vectorized classification (letting us produce a bitstring of any group of characters we care about in one additional instruction), and it maps every character that may appear in a multi-char symbol into the range [0, 15] while everything else is mapped into the range [128, 255]. Cognoscenti will recognize this as matching the semantics of a vpshufb instruction (see the sketch after this list). The result is also itself a mostly-correct vector of kinds fields (one of the pieces of information we need to output for each token).
- A mini non-deterministic finite state machine for multi-char-symbol matching, implemented using vpshufb instructions on the result of the vectorized table lookup, plus an effectively-branchless (for real code) reconciliation loop which accepts bitstrings indicating where valid 2- and 3-character multi-char symbols end and deletes the ones that are impossible due to overlap.
- A SIMD hasher of up to 16 multi-char symbols at once, which works across chunk boundaries to produce the proper kind of each multi-char symbol.
- A fully vectorized cross-chunk keyword hasher and validator that pulls from a common superstring holding all keywords in a total of 4 64-byte vectors. No vpgather instructions are used.
- Token-matching logic implemented almost entirely using bit manipulation and SIMD operations, the things CPUs are fastest at.
- Logic void of almost all branches; branches remain only where they provide a performance benefit.
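To ground the vectorized-classification bullet above, here is a minimal scalar model of what a single 16-byte lane of vpshufb does. This is a sketch of the instruction’s semantics only; the pshufb16 name and the loop are illustrative, not the repository’s actual lookup code:

// Scalar model of one 16-byte lane of x86's vpshufb: for each byte of
// `index`, a set high bit produces 0; otherwise the low nibble of that
// byte selects an entry from `table`.
fn pshufb16(table: [16]u8, index: [16]u8) [16]u8 {
    var out: [16]u8 = undefined;
    for (index, 0..) |idx, i| {
        out[i] = if (idx & 0x80 != 0) 0 else table[idx & 0x0F];
    }
    return out;
}

Classification in the simdjson style runs two such lookups per input byte, one keyed on the low nibble and one on the high nibble, and ANDs the results, which is what lets a whole chunk be classified with a handful of instructions.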
Simplified Explanation
ELI5: How do we accomplish such speed? By processing entire 64-byte chunks (soon 512-byte chunks!) at once. Here’s a basic implementation:
First, we produce a few bitstrings where each bit tells us a piece of information about the corresponding byte in the 64-byte chunk we process at once:
const V = @Vector(64, u8);
const chunk: V = ptr[0..64].*;

// Each vector comparison yields a @Vector(64, bool), which bit-casts
// to a u64 bitstring: bit i describes byte i of the chunk.
const alphas_lower: u64 =
    @as(u64, @bitCast(chunk >= @as(V, @splat('a')))) &
    @as(u64, @bitCast(chunk <= @as(V, @splat('z'))));
const alphas_upper: u64 =
    @as(u64, @bitCast(chunk >= @as(V, @splat('A')))) &
    @as(u64, @bitCast(chunk <= @as(V, @splat('Z'))));
const underscores: u64 =
    @bitCast(chunk == @as(V, @splat('_')));
const numerics: u64 =
    @as(u64, @bitCast(chunk >= @as(V, @splat('0')))) &
    @as(u64, @bitCast(chunk <= @as(V, @splat('9'))));

const alpha_underscores = alphas_lower | alphas_upper | underscores;
const alpha_numeric_underscores = alpha_underscores | numerics;
Next, we do a series of bit manipulations to figure out the start and end positions of all tokens within our chunk. To find the starts and ends of identifiers, we do something similar to the following:
const identifier_starts = alpha_underscores & ~(alpha_numeric_underscores << 1);
const identifier_ends = alpha_numeric_underscores & ~(alpha_numeric_underscores >> 1);
Let’s walk through, very slowly, how identifier_starts is computed:

1. alpha_numeric_underscores << 1: Shifting left by 1 semantically moves our bits in the rightwards direction in our source file. This is because the code is written from a little-endian perspective, and every bit in alpha_numeric_underscores corresponds to a byte of chunk, so the bit order inherits the byte order. This might seem like an indictment of little-endian machines, but little-endian is actually better for this kind of processing, because when we do a subtraction we want the carry bits to propagate in the direction of the last byte (rather than the reverse, as on a big-endian machine). If you don’t understand that right now, that’s okay; just take my word for it. The point is, alpha_numeric_underscores << 1 produces a bitstring that indicates which positions in chunk had an alphanumeric/underscore before them. We could express this as the regular expression /(?<=\w)./g (regexr link).

2. We take the inverse of the previous bitstring. Overall, we now have ~(alpha_numeric_underscores << 1). This produces a bitstring that indicates which positions in chunk had a NON-alphanumeric/underscore before them. Note that the start of the chunk is also considered to be preceded by a NON-alphanumeric/underscore: in the first bit position we shift in a 0, then unconditionally invert it to a 1. ~(alpha_numeric_underscores << 1) effectively matches the regular expression /(?<=^|\W)./g (regexr link).

3. Next, we AND the previous expression with alpha_underscores. This leaves us with a bitstring that tells us where we have an alpha/underscore character that was preceded by a NON-alphanumeric/underscore character. As a regular expression, this would be /(?<=^|\W)[a-zA-Z_]/g (regexr link).
identifier_ends is computed in much the same way, but in the opposite direction.
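To make the bit twiddling concrete, here is a worked toy example on a hypothetical 8-byte mini-chunk (the real code uses 64-byte chunks, but the logic is identical):

const std = @import("std");

// Mini-chunk "ab 12 c_": bit i corresponds to byte i, least-significant bit first.
test "identifier starts/ends on an 8-byte mini-chunk" {
    const alpha_underscores: u8 = 0b11000011; // 'a', 'b', 'c', '_'
    const alpha_numeric_underscores: u8 = 0b11011011; // plus '1', '2'

    const identifier_starts = alpha_underscores & ~(alpha_numeric_underscores << 1);
    const identifier_ends = alpha_numeric_underscores & ~(alpha_numeric_underscores >> 1);

    // Starts: 'a' (bit 0) and 'c' (bit 6). The digit run "12" contributes
    // no start, because '1' is not in alpha_underscores.
    try std.testing.expectEqual(@as(u8, 0b01000001), identifier_starts);

    // Ends: 'b' (bit 1), '2' (bit 4), '_' (bit 7). Ends of numeric runs
    // show up here too; the full tokenizer handles numeric tokens separately.
    try std.testing.expectEqual(@as(u8, 0b10010010), identifier_ends);
}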
With these two bitstrings in hand, we can pass each of them into a vector compaction operation (called a “vector compression” on AVX-512 and RISC-V) to figure out where all identifiers in a chunk start and end simultaneously. A vector compaction accepts a bitstring and a vector, and keeps each element of the vector whose corresponding bit in the bitstring is 1. The “kept” elements are concentrated at the front of the resulting vector, and the rest are discarded. In our case, we pass a vector which counts from 0 to 63 inclusive, so we can determine the position within the chunk at which each token begins and ends. See this animation as an example. For an exploration of how to do a vector compaction on ARM, see my article on the subject.
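If it helps to see those semantics in code, here is a scalar sketch of a vector compaction. On AVX-512 this is a single vpcompressb instruction; the compress64 name and the scalar loop are purely illustrative:

const std = @import("std");

// Keep the bytes of `vec` whose corresponding bit in `mask` is set,
// packed to the front of the result. Returns the packed bytes and count.
fn compress64(mask: u64, vec: [64]u8) struct { data: [64]u8, len: u7 } {
    var out = [_]u8{0} ** 64;
    var n: u7 = 0;
    var bits = mask;
    while (bits != 0) : (bits &= bits - 1) { // clear the lowest set bit
        out[n] = vec[@ctz(bits)];
        n += 1;
    }
    return .{ .data = out, .len = n };
}

Compacting an index vector that counts 0 through 63 (e.g. std.simd.iota(u8, 64)) with identifier_starts as the mask yields the in-chunk offset of every identifier start at once.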
Future plans
There is still work to do, and several optimizations to be had.
I have done a lot of work intended to enable processing 512 bytes at once, since we can fit 512-bit bitstrings in an AVX-512 vector.

- I wrote custom lowerings for u512 shl/shr, borrowed a u512 sub implementation, and even wrote a ~70-line compiler for convenient vpternlogq optimization.
- The way that loop-carried variables are handled could probably be more efficient, either through packing them all into a single register (less likely to win) or using 2 or 3 enums instead of so many individual carried bits (more likely to win). Luckily, I wrote carry.get/carry.set methods, so it shouldn’t hurt too badly to swap the implementations out.
- I intend to support multiple comptime switches for different ways of consuming the tokenizer. Some might prefer comments to be emitted; others prefer them to be omitted. Some people should use my idea of having a token be a 2-byte len+tag (with a 0 in the len indicating we need more than a byte, in which case the next 4 bytes hold the length), while others might want a more conventional 4-byte start_index plus a 4-byte end_index or len, and a 1-byte tag; a sketch of the 2-byte encoding follows this list. Either way, iterators will be provided which abstract over the memory differences/implications.
- I have currently disabled the code which expands the len of keyword/symbol tokens to include surrounding whitespace+comments. This will come back under a flag soon.
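As a rough illustration of that 2-byte token idea, here is a hypothetical layout. The type and field names are mine for the sketch, not the repository’s:

const std = @import("std");

// Hypothetical 2-byte token: a length byte plus a tag byte. len == 0
// signals that the real length did not fit in one byte and is stored
// in the next 4 bytes of the token stream instead.
const Token = extern struct {
    len: u8,
    tag: Tag,

    const Tag = enum(u8) { identifier, number_literal, string_literal, keyword, symbol };
};

comptime {
    std.debug.assert(@sizeOf(Token) == 2);
}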
I also intend to give the best talk I have ever given on how all of the components work together this July at Utah Zig.
Running the Benchmark
Want to run the benchmark?
Well, unfortunately, at the moment, only x86-64 machines supporting the AVX-512 instruction set are supported. That means you need one of the last two generations of AMD hardware or a beefy Intel server.
If you do have a qualifying machine: first, clone my repository, then clone some Zig projects into src/files_to_parse. E.g.:
git clone https://github.com/Validark/Accelerated-Zig-Parser.git --depth 1
cd Accelerated-Zig-Parser
cd src/files_to_parse
git clone https://github.com/ziglang/zig.git --depth 1
git clone https://github.com/tigerbeetle/tigerbeetle.git --depth 1
git clone https://github.com/oven-sh/bun.git --depth 1
# Whatever else you want
Next, make sure you have Zig 0.14 installed. Here is the one-off script from Webi to download and install Zig:
curl -sS https://webi.sh/zig@0.14 | sh; \
source ~/.config/envman/PATH.env
Personally, I use Webi’s helper script, which can be installed like so:
curl -sS https://webi.sh/webi | sh; \
source ~/.config/envman/PATH.env
The helper script reduces the noise:
webi zig@0.14
# could also try @latest or @stable
Then build it and execute!
zig build -Doptimize=ReleaseFast
./zig-out/bin/exe
If you are on Linux, you can enable performance mode like so:
sudo cpupower frequency-set -g performance
I typically bind to a single core, which can help with reliability, especially when testing on devices with separate performance and efficiency cores:
taskset -c 0 ./zig-out/bin/exe