<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Validark&apos;s Blog</title><description>dev things</description><link>https://validark.dev/</link><language>en</language><item><title>Deus Lex Machina</title><link>https://validark.dev/posts/deus-lex-machina/</link><guid isPermaLink="true">https://validark.dev/posts/deus-lex-machina/</guid><description>A new compaction-based tokenizer for the Zig programming language</description><pubDate>Wed, 16 Apr 2025 06:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&amp;lt;!-- Scanic Sanic --&amp;gt;
&amp;lt;!-- I Think, Therefore I Scan --&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;!-- The Vector Compact --&amp;gt;
&amp;lt;!-- Strength of Character --&amp;gt;
&amp;lt;!-- Lex Machina --&amp;gt;
&amp;lt;!-- A token of appreciation --&amp;gt;
&amp;lt;!-- Breaking character --&amp;gt;
&amp;lt;!-- In the fast lane --&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;!-- Token gesture --&amp;gt;
&amp;lt;!-- Scan and deliver --&amp;gt;
&amp;lt;!-- Vector Lexicon --&amp;gt;
&amp;lt;!-- Compress to Impress --&amp;gt;
&amp;lt;!-- Packed and ready --&amp;gt;
&amp;lt;!-- All Your Tokens Are Belong to Us --&amp;gt;
&amp;lt;!-- A parsing glance --&amp;gt;&lt;/p&gt;
&lt;p&gt;Today, I am excited to announce the alpha release of a brand new compacting Zig tokenizer! You may find it in the following repository:&lt;/p&gt;
&lt;p&gt;::github{repo=&quot;Validark/Accelerated-Zig-Parser&quot; license=&quot;MIT&quot;}&lt;/p&gt;
&lt;p&gt;If you want to help motivate me to keep working on this, give it a star and come back!&lt;/p&gt;
&lt;p&gt;Please note it is not ready for prime-time just yet, since there are still more optimizations to be had, as well as support for more architectures. &lt;strong&gt;At the moment, only AMD64 machines with AVX-512 instructions are supported.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That being said, the new implementation can tokenize up to &lt;strong&gt;2.75x faster&lt;/strong&gt; than the mainline implementation, currently at &lt;strong&gt;1.4GB/s&lt;/strong&gt; on a single core on my laptop, with a lot of improvements coming soon!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;src/assets/images/Zero_2.svg&quot; alt=&quot;Zero the Ziguana&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;All Your Tokens Are Belong to Us&lt;/h2&gt;
&lt;p&gt;The above repository benchmarks 3 Zig tokenizers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The tokenizer in the Zig Standard Library used by the 0.14 compiler.&lt;/li&gt;
&lt;li&gt;&lt;s&gt;A heat-seeking tokenizer (&lt;a href=&quot;https://www.youtube.com/watch?v=oN8LDpWuPWw&amp;amp;t=530s&quot;&gt;talk 1&lt;/a&gt;, &lt;a href=&quot;https://www.youtube.com/live/FDiUKafPs0U&amp;amp;t=210s&quot;&gt;talk 2&lt;/a&gt;)&lt;/s&gt; (had to temporarily remove it, but will add back by July)&lt;/li&gt;
&lt;li&gt;A &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L1912&quot;&gt;compacting tokenizer&lt;/a&gt; (&lt;a href=&quot;https://www.youtube.com/watch?v=NM1FNB5nagk&quot;&gt;talk 3&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Benchmark results on my laptop with a Ryzen AI 9 HX 370:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;       Read in files in 26.479ms (1775.63 MB/s) and used 47.018545MB memory with 3504899722 lines across 3253 files
Legacy Tokenizing took              91.419ms (0.51 GB/s, 38.34B loc/s) and used 40.07934MB memory
Tokenizing with compression took    33.301ms (1.41 GB/s, 105.25B loc/s) and used 16.209284MB memory
       That&apos;s 2.75x faster and 2.47x less memory than the mainline implementation!
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Headline features&lt;/h2&gt;
&lt;p&gt;The new &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L1912&quot;&gt;compacting tokenizer&lt;/a&gt; processes an entire 64-byte chunk of source code at once (soon to be 512!). It includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A fully SIMDized &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L3160&quot;&gt;UTF-8 validator&lt;/a&gt; ported from &lt;a href=&quot;https://github.com/simdjson/simdjson/&quot;&gt;simdjson&lt;/a&gt;/&lt;a href=&quot;https://github.com/simdutf/simdutf/&quot;&gt;simdutf&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;A fully branchless bit-manipulation routine to &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L1012&quot;&gt;determine which characters are escaped by backslashes&lt;/a&gt;. (Thanks to John Keiser (@jkeiser) for &lt;a href=&quot;https://github.com/simdjson/simdjson/pull/2042&quot;&gt;designing this algorithm for simdjson&lt;/a&gt;.)&lt;/li&gt;
&lt;li&gt;A loop which &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L2007&quot;&gt;parses all lines in a given chunk in parallel for strings/comments/character_literals/line_strings&lt;/a&gt;. Characters inside of these constructs must be exempted from comprising their own tokens.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L1302&quot;&gt;A multi-purpose vectorized table-lookup&lt;/a&gt;. It performs Vectorized Classification (letting us produce a bitstring of any group of characters we care about in one additional instruction), as well as mapping all characters that may appear in a multi-char symbol into the range [0, 15] while everything else is mapped into the range [128, 255]. Cognoscenti would recognize this as matching the semantics of a &lt;code&gt;vpshufb&lt;/code&gt; instruction. It is also itself a mostly-correct vector of &lt;code&gt;kinds&lt;/code&gt; fields (one of the pieces of information we need to output for each token).&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L698&quot;&gt;A mini non-deterministic finite state machine&lt;/a&gt; implemented using &lt;code&gt;vpshufb&lt;/code&gt; instructions on the result of the vectorized table-lookup for multi-char-symbol matching, as well as an effectively-branchless (for real code) &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L810&quot;&gt;reconciliation loop&lt;/a&gt; which accepts bitstrings indicating where valid 2- and 3-character multi-char symbols end and deletes the ones that are impossible due to overlap.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L1518&quot;&gt;SIMD hasher of up to 16 multi-char symbols at once&lt;/a&gt;, which works across chunk boundaries to produce the proper &lt;code&gt;kind&lt;/code&gt; of multi-char symbols.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L1370&quot;&gt;A fully vectorized cross-chunk keyword hasher and validator&lt;/a&gt; that pulls from a &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L490&quot;&gt;common superstring&lt;/a&gt; holding all keywords in a total of 4 64-byte vectors. &lt;code&gt;vpgather&lt;/code&gt; instructions are not used.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L2196&quot;&gt;Token-matching logic implemented almost entirely using bit-manipulation and SIMD operations&lt;/a&gt;, the things CPUs are fastest at.&lt;/li&gt;
&lt;li&gt;Logic devoid of almost all branches; they are used only where they provide a performance benefit.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Simplified Explanation&lt;/h2&gt;
&lt;p&gt;ELI5: How do we accomplish such speed? By processing entire 64-byte chunks (soon 512-byte chunks!) at once. Here&apos;s a basic implementation:&lt;/p&gt;
&lt;p&gt;First, we produce a few bitstrings where each bit tells us a piece of information about a corresponding byte in the 64-byte &lt;code&gt;chunk&lt;/code&gt; we process at once:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const V = @Vector(64, u8);
const chunk: V = ptr[0..64].*;

const alphas_lower: u64 =
    @as(u64, @bitCast(chunk &amp;gt;= @as(V, @splat(&apos;a&apos;)))) &amp;amp;
    @as(u64, @bitCast(chunk &amp;lt;= @as(V, @splat(&apos;z&apos;))));

const alphas_upper: u64 =
    @as(u64, @bitCast(chunk &amp;gt;= @as(V, @splat(&apos;A&apos;)))) &amp;amp;
    @as(u64, @bitCast(chunk &amp;lt;= @as(V, @splat(&apos;Z&apos;))));

const underscores: u64 =
    @bitCast(chunk == @as(V, @splat(&apos;_&apos;)));

const numerics: u64 =
    @as(u64, @bitCast(chunk &amp;gt;= @as(V, @splat(&apos;0&apos;)))) &amp;amp;
    @as(u64, @bitCast(chunk &amp;lt;= @as(V, @splat(&apos;9&apos;))));

const alpha_underscores = alphas_lower | alphas_upper | underscores;
const alpha_numeric_underscores = alpha_underscores | numerics;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we do a series of bit manipulations to figure out the start and end positions of all tokens within our &lt;code&gt;chunk&lt;/code&gt;. To find the starts and ends of identifiers, we do something similar to the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const identifier_starts = alpha_underscores &amp;amp; ~(alpha_numeric_underscores &amp;lt;&amp;lt; 1);
const identifier_ends = alpha_numeric_underscores &amp;amp; ~(alpha_numeric_underscores &amp;gt;&amp;gt; 1);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let&apos;s walk through, very slowly, how &lt;code&gt;identifier_starts&lt;/code&gt; is computed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;alpha_numeric_underscores &amp;lt;&amp;lt; 1&lt;/code&gt; Shifting left by &lt;code&gt;1&lt;/code&gt; semantically moves our bits rightwards through our source file. This is because the code is written from a little-endian perspective, and every bit in &lt;code&gt;alpha_numeric_underscores&lt;/code&gt; corresponds to a byte of &lt;code&gt;chunk&lt;/code&gt;, so the bit order inherits the byte order. This might seem like an indictment of little-endian machines, but little-endian is actually better for this kind of processing: when we do a subtraction, we want the carry bits to propagate in the direction of the last byte (rather than the reverse, as on a big-endian machine). If you don&apos;t understand that right now, that&apos;s okay; just take my word for it. The point is, &lt;code&gt;alpha_numeric_underscores &amp;lt;&amp;lt; 1&lt;/code&gt; produces a bitstring that indicates which positions in &lt;code&gt;chunk&lt;/code&gt; had an alphanumeric/underscore &lt;strong&gt;before&lt;/strong&gt; them. We could express this as a regular expression like &lt;code&gt;/(?&amp;lt;=\w)./g&lt;/code&gt; (&lt;a href=&quot;https://regexr.com/8e4r5&quot;&gt;regexr link&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We take the inverse of the previous bitstring. Overall we have &lt;code&gt;~(alpha_numeric_underscores &amp;lt;&amp;lt; 1)&lt;/code&gt;. This produces a bitstring that indicates which positions in &lt;code&gt;chunk&lt;/code&gt; had a NON-alphanumeric/underscore before it. Also note that the start of the &lt;code&gt;chunk&lt;/code&gt; is considered a NON-alphanumeric/underscore. This is because, in the first bit position, we shift in a 0, then unconditionally invert that to a &lt;code&gt;1&lt;/code&gt;. &lt;code&gt;~(alpha_numeric_underscores &amp;lt;&amp;lt; 1)&lt;/code&gt; effectively matches the regular expression &lt;code&gt;/(?&amp;lt;=^|\W)./g&lt;/code&gt; (&lt;a href=&quot;https://regexr.com/8e4qs&quot;&gt;regexr link&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Next, we AND the last expression with &lt;code&gt;alpha_underscores&lt;/code&gt;. This leaves us with a bitstring that tells us where we have an alpha/underscore character that was preceded by a NON-alpha_numeric_underscore character. As a regular expression, this would be &lt;code&gt;/(?&amp;lt;=^|\W)[a-zA-Z_]/g&lt;/code&gt; (&lt;a href=&quot;https://regexr.com/8e4rq&quot;&gt;regexr link&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;code&gt;identifier_ends&lt;/code&gt; is computed in much the same way but in the opposite direction.&lt;/p&gt;
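&lt;p&gt;To make this concrete, here is a hand-worked example on a small 8-byte chunk (a sketch; real chunks are 64 bytes, but the logic is identical). Bit &lt;code&gt;i&lt;/code&gt; corresponds to byte &lt;code&gt;i&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// chunk = &quot;a1 = bc;&quot; (bytes 0..7)
const alpha_underscores: u8 = 0b01100001; // &apos;a&apos;, &apos;b&apos;, &apos;c&apos; at positions 0, 5, 6
const alpha_numeric_underscores: u8 = 0b01100011; // plus &apos;1&apos; at position 1

const identifier_starts = alpha_underscores &amp;amp; ~(alpha_numeric_underscores &amp;lt;&amp;lt; 1);
// == 0b00100001: identifiers begin at &apos;a&apos; (position 0) and &apos;b&apos; (position 5)

const identifier_ends = alpha_numeric_underscores &amp;amp; ~(alpha_numeric_underscores &amp;gt;&amp;gt; 1);
// == 0b01000010: identifiers end at &apos;1&apos; (position 1) and &apos;c&apos; (position 6)
&lt;/code&gt;&lt;/pre&gt;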
&lt;p&gt;With these two bitstrings in hand, we can pass each of them into a &lt;em&gt;vector compaction&lt;/em&gt; operation (called a &quot;vector compression&quot; on AVX-512 and RISC-V) to figure out where &lt;strong&gt;all&lt;/strong&gt; identifiers in a chunk start and end simultaneously. A vector compaction accepts a bitstring and a vector, and keeps each element of the vector whose corresponding bit in the bitstring is 1. The &quot;kept&quot; elements are concentrated at the front of the resulting vector, and the rest are discarded. In our case, we want to pass a vector which counts from 0 to 63 inclusive, so we can determine at which positions in the chunk all tokens began and ended. See &lt;a href=&quot;https://validark.dev/presentations/simd-intro#keywords-lookup&quot;&gt;this animation&lt;/a&gt; as an example. For an exploration of how to do a vector compaction on ARM, see &lt;a href=&quot;/posts/vector-compression-in-interleaved-space-on-arm/&quot;&gt;my article on the subject&lt;/a&gt;.&lt;/p&gt;
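&lt;p&gt;On AVX-512 this compaction is a single &lt;code&gt;vpcompressb&lt;/code&gt; instruction, but its semantics can be sketched with a scalar reference implementation (illustrative only; the names here are mine, not the real tokenizer&apos;s):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/// Keeps vec[i] wherever bit i of mask is 1, packing survivors to the front.
fn compress(mask: u64, vec: [64]u8) struct { data: [64]u8, len: u8 } {
    var out: [64]u8 = [_]u8{0} ** 64;
    var n: u8 = 0;
    var m = mask;
    while (m != 0) : (m &amp;amp;= m - 1) { // clear the lowest set bit each iteration
        out[n] = vec[@ctz(m)]; // @ctz gives the index of the lowest set bit
        n += 1;
    }
    return .{ .data = out, .len = n };
}

// Passing the iota vector {0, 1, ..., 63} as `vec` with `identifier_starts`
// as `mask` yields the positions at which identifiers begin, in order.
&lt;/code&gt;&lt;/pre&gt;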
&lt;h2&gt;Future plans&lt;/h2&gt;
&lt;p&gt;There is still work to do, and several optimizations to be had.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;I have done a lot of work toward processing 512 bytes at once, since we can fit a 512-bit bitstring in a single AVX-512 vector.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I wrote custom lowerings for u512 &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L1770&quot;&gt;shl&lt;/a&gt;/&lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L1781&quot;&gt;shr&lt;/a&gt;, borrowed a u512 &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L1802&quot;&gt;sub&lt;/a&gt; implementation, and even wrote a &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L3878-L3946&quot;&gt;~70 line compiler&lt;/a&gt; for convenient &lt;a href=&quot;http://0x80.pl/notesen/2015-03-22-avx512-ternary-functions.html&quot;&gt;vpternlogq&lt;/a&gt; optimization.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The way that &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L2455&quot;&gt;loop-carried variables are handled could probably be more efficient&lt;/a&gt;, either through &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L2471&quot;&gt;packing them all into a single register&lt;/a&gt; (less likely to win) or using 2 or 3 enums instead of so many individual carried-bits (more likely to win). Luckily I wrote &lt;code&gt;carry.get&lt;/code&gt;/&lt;code&gt;carry.set&lt;/code&gt; methods so it shouldn&apos;t hurt too badly to swap the implementations out.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I intend to support multiple &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser/blob/7fa80343eb177c0220276492d467cdf6d3dabb73/src/main.zig#L317&quot;&gt;comptime-switches&lt;/a&gt; for different ways of consuming the tokenizer. Some might prefer comments to be emitted; others prefer them omitted. Some should use my idea of having a token be a 2-byte &lt;code&gt;len+tag&lt;/code&gt; (with a 0 in the &lt;code&gt;len&lt;/code&gt; indicating the length does not fit in one byte, in which case the next 4 bytes hold it); others might want a more conventional 4-byte &lt;code&gt;start_index&lt;/code&gt; + 4-byte &lt;code&gt;end_index&lt;/code&gt; or &lt;code&gt;len&lt;/code&gt;, and a 1-byte &lt;code&gt;tag&lt;/code&gt;. Either way, iterators will be provided which abstract over the memory differences/implications.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I have temporarily disabled the code which expands the &lt;code&gt;len&lt;/code&gt; of keyword/symbol tokens to include surrounding whitespace and comments. This will come back under a flag soon.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
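&lt;p&gt;As a sketch, the 2-byte token layout described in item 3 might look something like this (the field and tag names are hypothetical, not the actual ones in the repository):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const Token = packed struct(u16) {
    // 0 means the length did not fit in a byte;
    // in that case the next 4 bytes hold the real length.
    len: u8,
    tag: Tag,

    const Tag = enum(u8) { identifier, keyword, number, string, symbol, eof };
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because most tokens are short, storing a 1-byte length instead of 4-byte indices keeps the token stream small, which is presumably where much of the memory savings in the benchmark comes from.&lt;/p&gt;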
&lt;p&gt;I also intend to give the best talk I have ever given on how all of the components work together this July at &lt;a href=&quot;https://www.youtube.com/@UtahZig&quot;&gt;Utah Zig&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Running the Benchmark&lt;/h2&gt;
&lt;p&gt;Want to run the benchmark?&lt;/p&gt;
&lt;p&gt;Well, unfortunately, at the moment, &lt;strong&gt;only x86-64 machines supporting the AVX-512 instruction set are supported&lt;/strong&gt;. That means you need one of the last two generations of AMD hardware or a beefy Intel server.&lt;/p&gt;
&lt;p&gt;If you do have a qualifying machine: First, clone my repository and then clone some Zig projects into &lt;code&gt;src/files_to_parse&lt;/code&gt;. E.g.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/Validark/Accelerated-Zig-Parser.git --depth 1
cd Accelerated-Zig-Parser
cd src/files_to_parse
git clone https://github.com/ziglang/zig.git --depth 1
git clone https://github.com/tigerbeetle/tigerbeetle.git --depth 1
git clone https://github.com/oven-sh/bun.git --depth 1
# Whatever else you want
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, make sure you have Zig 0.14 installed. Here is the one-off script from &lt;a href=&quot;https://webinstall.dev/webi/&quot;&gt;Webi&lt;/a&gt; to download and install Zig:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -sS https://webi.sh/zig@0.14 | sh; \
source ~/.config/envman/PATH.env
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Personally, I use &lt;a href=&quot;https://webinstall.dev/webi/&quot;&gt;Webi&lt;/a&gt;&apos;s helper script, which can be installed like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -sS https://webi.sh/webi | sh; \
source ~/.config/envman/PATH.env
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The helper script reduces the noise:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;webi zig@0.14
# could also try @latest or @stable
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then build it and execute!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;zig build -Doptimize=ReleaseFast
./zig-out/bin/exe
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you are on Linux, you can enable performance mode like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo cpupower frequency-set -g performance
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I typically bind to a single core, which can help with reliability, especially when testing on devices with separate performance and efficiency cores:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;taskset -c 0 ./zig-out/bin/exe
&lt;/code&gt;&lt;/pre&gt;
&lt;h6&gt;(Pinning to a single core should also prevent the OS from migrating your running process to another core in the middle of a benchmark.)&lt;/h6&gt;
</content:encoded></item><item><title>Local Compiler Explorer Setup for Zig</title><link>https://validark.dev/posts/local-zig-compiler-explorer-how-to/</link><guid isPermaLink="true">https://validark.dev/posts/local-zig-compiler-explorer-how-to/</guid><description>How to setup a local compiler explorer instance for Zig</description><pubDate>Tue, 04 Feb 2025 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Compiler Explorer is known for its ability to automatically handle all the necessary setup to view source code side-by-side with assembly code. One simply navigates to &lt;a href=&quot;https://godbo.lt/&quot;&gt;godbo.lt&lt;/a&gt;, and, after the download completes, everything just works.&lt;/p&gt;
&lt;p&gt;However, oftentimes I prefer to use a local setup because it is faster and does not require sending a request to an external server to compile my code.&lt;/p&gt;
&lt;p&gt;This article will walk through how to set up Compiler Explorer on a local machine.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;First, open a terminal on a machine with &lt;a href=&quot;https://nodejs.org/en&quot;&gt;Node.js&lt;/a&gt; and &lt;a href=&quot;https://ziglang.org/&quot;&gt;Zig&lt;/a&gt; installed. If you already have both, skip to the &lt;a href=&quot;#setup&quot;&gt;setup section of this article&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you are unsure whether you have &lt;a href=&quot;https://nodejs.org/en&quot;&gt;Node.js&lt;/a&gt;, you can always check by pasting the following command into your terminal and hitting enter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;node -v
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If &lt;a href=&quot;https://nodejs.org/en&quot;&gt;Node.js&lt;/a&gt; exists on your system, you will see some version number like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;v22.13.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you see something like &quot;command not found&quot;, you will have to install it. You may install it via any package manager, but I prefer using &lt;a href=&quot;https://webinstall.dev/webi/&quot;&gt;Webi&lt;/a&gt;. Simply paste this into your terminal and hit enter.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -sS https://webi.sh/node | sh; \
source ~/.config/envman/PATH.env
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you are unsure whether you have &lt;a href=&quot;https://ziglang.org/&quot;&gt;Zig&lt;/a&gt;, you can always check by pasting the following command into your terminal and hitting enter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;zig version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If it says something like &quot;command not found&quot;, you can install it via &lt;a href=&quot;https://webinstall.dev/webi/&quot;&gt;Webi&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -sS https://webi.sh/zig | sh; \
source ~/.config/envman/PATH.env
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&quot;https://webinstall.dev/webi/&quot;&gt;Webi&lt;/a&gt; also conveniently comes with a little helper script that will allow you to update or switch versions via &lt;code&gt;webi node@&amp;lt;tag&amp;gt;&lt;/code&gt; (you can use &lt;code&gt;@lts&lt;/code&gt; for long-term support, &lt;code&gt;@beta&lt;/code&gt; for pre-releases, or &lt;code&gt;@x.y.z&lt;/code&gt; for a specific version).&lt;/p&gt;
&lt;h2&gt;Setup&lt;/h2&gt;
&lt;p&gt;Next, navigate in the terminal to the folder where you want the eventual &quot;compiler-explorer&quot; folder to live. I usually install GitHub stuff in &lt;code&gt;Documents/github&lt;/code&gt;. So I would do:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd Documents/github
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, &lt;code&gt;git clone&lt;/code&gt; the Compiler Explorer and &lt;code&gt;cd&lt;/code&gt; into it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/compiler-explorer/compiler-explorer.git --depth 1
cd compiler-explorer/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we want to download all the dependencies of &lt;code&gt;Compiler Explorer&lt;/code&gt;. Luckily, they have a very easy-to-use helper:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git pull origin main
make prebuild EXTRA_ARGS=&apos;--language zig&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can re-run that command any time you want to pull the latest changes from GitHub and rebuild locally.&lt;/p&gt;
&lt;p&gt;(I use &lt;code&gt;EXTRA_ARGS=&apos;--language zig&apos;&lt;/code&gt; here because I only care about the dependencies necessary for using Godbolt with the Zig compiler.)&lt;/p&gt;
&lt;p&gt;Next, we are going to download a little script I wrote (disclosure: AI wrote the first draft) and name it &lt;code&gt;zig.sh&lt;/code&gt;. Then we give it permission to be &quot;executable&quot;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;curl -o zig.sh https://raw.githubusercontent.com/Validark/Zig-Compiler-Explorer-Shim/refs/heads/main/zig.sh
chmod +x zig.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Usage&lt;/h2&gt;
&lt;p&gt;Now we can start the Compiler Explorer Server by simply running:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./zig.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This script will try to automatically find all of your Zig compilers that are installed in the same place as the one in your PATH and emit a &lt;code&gt;zig.local.properties&lt;/code&gt; file which lists each one for Compiler Explorer. If it doesn&apos;t find them, feel free to &lt;a href=&quot;https://github.com/Validark/Zig-Compiler-Explorer-Shim/issues&quot;&gt;open an issue here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If everything worked properly, it should open on a local port you can open in your browser.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./listening_on_localhost.png&quot; alt=&quot;A screenshot of the output that shows it opened a port on localhost&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For me, I can access it by navigating to http://localhost:10240/&lt;/p&gt;
&lt;h2&gt;Configuration&lt;/h2&gt;
&lt;p&gt;I recommend changing some of the default settings of Compiler Explorer. In your browser, click &quot;More&quot; and then &quot;Settings&quot;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./select_settings_small.png&quot; alt=&quot;A screenshot of where the &amp;quot;More&amp;quot; and &amp;quot;Settings&amp;quot; buttons are located&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then navigate to the &quot;Compilation&quot; tab, untick &quot;Use automatic delay before compiling&quot;, and move the slider below it all the way to 0.25s. This makes the compiler feel a lot snappier.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./change_delay_before_compiling.png&quot; alt=&quot;A screenshot of how to follow the previous steps&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Now you can enjoy Compiler Explorer locally!&lt;/p&gt;
&lt;h2&gt;Changing the Compiler Target&lt;/h2&gt;
&lt;p&gt;In the &quot;Compiler Options&quot; field you might want to try different targets.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./compiler_options.png&quot; alt=&quot;A screenshot showing the &amp;quot;Compiler Options&amp;quot; field&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The format is &lt;code&gt;-target &amp;lt;arch&amp;gt;&amp;lt;sub&amp;gt;-&amp;lt;os&amp;gt;-&amp;lt;abi&amp;gt;&lt;/code&gt; (ABI is optional). Examples:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-target aarch64-macos
# OR
-target x86_64-windows
# OR
-target riscv64-linux
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also add an &lt;code&gt;-mcpu=&lt;/code&gt; flag; if you click the &quot;Output&quot; button at the bottom, it will show you a list of options for a given architecture.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./list_cpu_arches.png&quot; alt=&quot;A screenshot showing how to enter &amp;quot;-mcpu=&amp;quot; in the right spot and clicking &amp;quot;Output&amp;quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I personally use &lt;code&gt;-target x86_64-linux -mcpu=znver5&lt;/code&gt;. If you want to target an M-series MacBook, you could use &lt;code&gt;-target aarch64-macos -mcpu=apple_m3&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;My &lt;code&gt;zig.sh&lt;/code&gt; script automatically sets the compiler to use &lt;code&gt;-O ReleaseFast&lt;/code&gt;. You can override this in the same &quot;Compiler Options&quot; field if you want. You could try:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-O ReleaseSafe
# OR
-O ReleaseSmall
# OR
-O Debug
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that, for some reason, source-mapping does not work with &lt;code&gt;ReleaseSmall&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Have fun!&lt;/p&gt;
&lt;p&gt;‒ Validark&lt;/p&gt;
</content:encoded></item><item><title>Eine Kleine Vectorized Classification</title><link>https://validark.dev/posts/eine-kleine-vectorized-classification/</link><guid isPermaLink="true">https://validark.dev/posts/eine-kleine-vectorized-classification/</guid><description>A simple technique for vectorized classification</description><pubDate>Sun, 29 Sep 2024 09:35:00 GMT</pubDate><content:encoded>&lt;p&gt;For the new version of my SIMD Zig parser &lt;a href=&quot;https://www.youtube.com/watch?v=NM1FNB5nagk&quot;&gt;I gave a talk about&lt;/a&gt; on October 10, I came up with a slightly better technique for &lt;em&gt;Vectorized Classification&lt;/em&gt; than the one used by Langdale and Lemire (&lt;a href=&quot;https://arxiv.org/pdf/1902.08318&quot;&gt;2019&lt;/a&gt;) for &lt;a href=&quot;https://github.com/simdjson/simdjson&quot;&gt;simdjson&lt;/a&gt;, in that it stacks slightly better.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Vectorized Classification&lt;/em&gt; solves the problem of quickly mapping some bytes to some sets. For my use case, I just want to figure out which characters in a vector called &lt;code&gt;chunk&lt;/code&gt; match any of the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const single_char_ops = [_]u8{ &apos;~&apos;, &apos;:&apos;, &apos;;&apos;, &apos;[&apos;, &apos;]&apos;, &apos;?&apos;, &apos;(&apos;, &apos;)&apos;, &apos;{&apos;, &apos;}&apos;, &apos;,&apos; };
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To do this, I create a shuffle vector that we will pass into the &lt;code&gt;table&lt;/code&gt; parameter of &lt;code&gt;vpshufb&lt;/code&gt;. &lt;code&gt;vpshufb&lt;/code&gt; is an x86-64 instruction that takes a &lt;code&gt;table&lt;/code&gt; vector and an &lt;code&gt;indices&lt;/code&gt; vector, and returns a vector where the value at position &lt;code&gt;i&lt;/code&gt; becomes &lt;code&gt;table[indices[i]]&lt;/code&gt; for each 16-byte section of the &lt;code&gt;table&lt;/code&gt; and &lt;code&gt;indices&lt;/code&gt;. Depending on how new a machine is, this allows us to look up 32 or 64 bytes simultaneously into a 16-byte lookup table (one could also use a different 16-byte table for each 16-byte chunk, but typically we duplicate the same 16-byte table for each chunk). Here is how it is depicted on &lt;a href=&quot;https://www.officedaytime.com/simd512e/simdimg/si.php?f=pshufb&quot;&gt;officedaytime.com&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./pshufb_3.png&quot; alt=&quot;An image depicting a vpshufb instruction. It is shown looking up 32 indices into a 32 byte vector. Each half of these vectors are operated on separately. That means the first 16 byte indices only look at the first 16 bytes in the table vector, and the second 16 byte indices only look at the second 16 bytes in the table vector.&quot; /&gt;&lt;/p&gt;
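&lt;p&gt;The per-lane semantics of &lt;code&gt;vpshufb&lt;/code&gt; can be sketched as a scalar reference in Zig (illustrative only, not the actual intrinsic):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fn pshufb(table: [64]u8, indices: [64]u8) [64]u8 {
    var result: [64]u8 = undefined;
    for (indices, 0..) |index, i| {
        const lane_start = (i / 16) * 16; // each 16-byte lane is independent
        result[i] = if (index &amp;gt;= 0x80)
            0 // high bit set: the result byte is zeroed
        else
            table[lane_start + (index &amp;amp; 0xF)]; // only the low nibble indexes the lane
    }
    return result;
}
&lt;/code&gt;&lt;/pre&gt;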
&lt;p&gt;Here is the &lt;code&gt;table&lt;/code&gt; generator:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;comptime var table: @Vector(16, u8) = @splat(0);
inline for (single_char_ops) |c|
	table[c &amp;amp; 0xF] |= 1 &amp;lt;&amp;lt; (c &amp;gt;&amp;gt; 4);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, the index we store data at is &lt;code&gt;c &amp;amp; 0xF&lt;/code&gt; where &lt;code&gt;c&lt;/code&gt; is each of &lt;code&gt;{ &apos;~&apos;, &apos;:&apos;, &apos;;&apos;, &apos;[&apos;, &apos;]&apos;, &apos;?&apos;, &apos;(&apos;, &apos;)&apos;, &apos;{&apos;, &apos;}&apos;, &apos;,&apos; }&lt;/code&gt;. The data we associate with the low nibble given by &lt;code&gt;c &amp;amp; 0xF&lt;/code&gt; is &lt;code&gt;1 &amp;lt;&amp;lt; (c &amp;gt;&amp;gt; 4)&lt;/code&gt;.
This takes the upper nibble, and then shifts &lt;code&gt;1&lt;/code&gt; left by that amount. This allows us to store up to 8 valid upper nibbles (corresponding to the number of bits in a byte), in the range &lt;span style=&quot;white-space: nowrap&quot;&gt;$\footnotesize \left[0,\ 7\right]$&lt;/span&gt;. This isn&apos;t quite &lt;span style=&quot;white-space: nowrap&quot;&gt;$\footnotesize \left[0,\ 15\right]$&lt;/span&gt;, the actual range of a nibble (4 bits), but for our use-case we are only matching ASCII characters, so this limitation does not affect us. Then we just have to do the same transform &lt;code&gt;1 &amp;lt;&amp;lt; (c &amp;gt;&amp;gt; 4)&lt;/code&gt; on the data in &lt;code&gt;chunk&lt;/code&gt; and do a bitwise &lt;code&gt;&amp;amp;&lt;/code&gt; to test if the upper nibble we found matches one of the valid options.&lt;/p&gt;
&lt;p&gt;E.g. &lt;code&gt;;&lt;/code&gt; is &lt;code&gt;0x3B&lt;/code&gt; in hex, so we do &lt;code&gt;table[0x3B &amp;amp; 0xF] |= 1 &amp;lt;&amp;lt; (0x3B &amp;gt;&amp;gt; 4);&lt;/code&gt;, which reduces to &lt;code&gt;table[0xB] |= 1 &amp;lt;&amp;lt; 0x3;&lt;/code&gt;, which becomes &lt;code&gt;table[0xB] |= 0b00001000;&lt;/code&gt;. &lt;code&gt;[&lt;/code&gt; is &lt;code&gt;0x5B&lt;/code&gt;, so we do &lt;code&gt;table[0xB] |= 0b00100000;&lt;/code&gt;. &lt;code&gt;{&lt;/code&gt; is &lt;code&gt;0x7B&lt;/code&gt;, so we do &lt;code&gt;table[0xB] |= 0b10000000;&lt;/code&gt;. In the end, &lt;code&gt;table[0xB]&lt;/code&gt; is set to &lt;code&gt;0b10101000&lt;/code&gt;. This tells us that &lt;code&gt;3&lt;/code&gt;, &lt;code&gt;5&lt;/code&gt;, and &lt;code&gt;7&lt;/code&gt; are the valid upper nibbles corresponding to a lower nibble of &lt;code&gt;0xB&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We can query &lt;code&gt;table&lt;/code&gt; for each value in &lt;code&gt;chunk&lt;/code&gt; like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;vpshufb(table, chunk);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Just ~2 cycles later, we will have completed 32 or 64 lookups simultaneously! Note that we don&apos;t have to take the lower 4 bits via &lt;code&gt;&amp;amp; 0xF&lt;/code&gt;, because &lt;code&gt;vpshufb&lt;/code&gt; does that automatically unless the upper nibble is 8 or above, i.e. when the byte is 128 or higher. For those bytes, &lt;code&gt;vpshufb&lt;/code&gt; will zero the result, regardless of what&apos;s in the table. However, we already said we don&apos;t care about non-ASCII bytes for this problem, so we are fine with those being zeroed out.&lt;/p&gt;
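That zeroing rule is easy to model in scalar code. Here is a hedged Python emulation of the per-byte behavior of `vpshufb` (within a single 16-byte lane):

```python
# Scalar model of vpshufb's per-byte behavior: if the index byte's high
# bit is set the result is zeroed, otherwise the low nibble selects an
# entry from the 16-byte table (per 128-bit lane on real hardware).
def pshufb(table, indices):
    return [0 if b & 0x80 else table[b & 0xF] for b in indices]

demo_table = list(range(16))
print(pshufb(demo_table, [0x3B, 0x5B, 0x80, 0xFF]))  # [11, 11, 0, 0]
```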
&lt;p&gt;Now all we need to do is verify that the upper nibble matches the data we stored in &lt;code&gt;table&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To do so, we can produce a vector like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const upper_nibbles_as_bit_pos = @as(@TypeOf(chunk), @splat(1)) &amp;lt;&amp;lt; (chunk &amp;gt;&amp;gt; @splat(4));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unfortunately, at the moment, LLVM gives us a pretty expensive assembly routine for the above line of code (&lt;a href=&quot;https://zig.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYgAzKVpMGoAF55gpJfWQE8Ayo3QBhVLQCuLBiABMpJwAyeAyYAHKeAEaYxPoAbKQADqgKhPYMrh5evonJqQJBIeEsUTF68daYtmlCBEzEBBme3n4VVQI1dQQFYZHRcVa19Y1ZLYNdwT3FfWUAlFao7sTI7BxoDAoEANQb6JsApHoAIpsAAngsSfUQez4%2BOzc%2BMwcAQnsaAIJrG5sR7nR2DH2R1O50uBGut1%2B/2CDyeeleHze70wqjBmyogP4qAgyAQ7gYAGsQKcAGqVIjECB6PybdwADhmM1OABUAJ4JTAAeSoOLxhKZewA7Aj3psxZtiJgCItAScmAoICc2Rzubz8QS5qcFAkjOCuIygc4Ds5TgRiPjRARMGr%2BS8kULDvaPhj0ahUD4bUTSeSSFSafSDUr2Vyebj1QLhUjxZsvltahF6EDjmgLnY2K7sTsAHQpFjoLOSjlMcEnFKmEOK5UVsP8zXZ3P5%2By1CD00ibLixAV6I3drU64sQSSMuEi6OS6XEQEANwSCjxVAiEHj9DbNYJQOwB2wfd1g8ZdsRgsdiOd09n88Xy/Ym2MrIIwbbwXwywUxKDKp5V8DVdVX/2kY%2BaM8CoTZFWCVxU3Oa0DSFUdo3FKc6glTAFHcWgCDfH8eSfPAXy7Y58SwGgQnQA9RXgsV%2BGIUCNCzLMTgiQghDwctVXfCscLwpkAHpNgZf9nDwIUTVgqMKPg2NNifFEkykhhnxQvYAFZniEpTHXhMTxPFSVUPQ5TVOU44DmOYDQOk1QNxMzYNFUOkNCZDRNkwWglE2K8DIs/YfCU0CGKYliK3Y38mATTADV4hkjLI7SHSdciKPHGVkL0ggYvFOKTwSmMBG%2BNgCAQDAFFkjZzVsf84IolErUndFZQeWhaCnFgszs2IsyYKdVCUrgfCzc93CoLMIizHqfAeRUyVsX1YmkWkGTbE4popCBZrbAMmSWn1KTW%2BaRy0%2BDquiTF6tuRrmtaul2s61Q%2BoGoaIgmrbpspal1oW70Xr9d7A2W303r29KqtUGqTtOBqmpatqcwUJQ9H6udBuGrNerpJ6/spDsfsWjGICxvbNtx/GAyB/8NJFA6ktqhQAHdCFxPysKXUL6BggDsujZ6Vt2gMky3KSQJxVBIPTbNmTqYApUu9qqEwYtFkwIQpQACXlCAoXQ4Is2QBJ3CzWX5d0tsOq6saIhpg18sK9AFHoiGLuhm6xoR%2BdkbGiar0feTcJQpkXLck4UwSOgcGIYhfQeZxlDkdyJalTYjGQAlitQhI0SozYZ0Rhd3duOYDs53GAd5kz%2BbMoWRcwbYCHzcXiElghpf1uWJ0VlW1Y1gFtd15vDZQ42bseJkraKu2zshpvB5dpHHtuZmwq9hSFD91yq8D4Xg/obAw4j24o5jwYG4Tpgk5T9w05ILYM6zi8fNiWFSAL8Uud9Yn%2BNLvRt3LoO0yrsW48btDA2rclYEFVgqTuWsdZ62AQrBQxtYZw0tlKa2tsTj2yhldGGcNp4PRRj4NGc9PZySXivAOQcQ7b3DpSSO0dY713jonZO2xz7pxIJne6ERUYPyfmKf2VcP7bnXhcShO9KTZioCwRuP8oLKGIMEcEDwACSDBEK0DwLsCWnhGBxmDJsBI8olC7CIBw7OERiSiSPMpZwDAHjG1gsFT8LNwoOkZI/QCGUrGaUPIcDgcxaCcCUrwbw3BeCoE4AALQsNsBYSwBE%2BD
0HoXgGEOBaDcQgOWWAYgQDmESJStEuB0h8BoQUiS9BcEFFwJSkgDABI4JIXgLAJAaA0KQEJWhSDhI4LwV8rSUlpNIHAWASAKH0DIBQCum8%2BimA0FwPQrSaDoWiK%2BdWmheCMWYMQVknAeCkA2XUVknIIjaHJDs3gKY2CCE5AwWg2zUm8CwL8YAzgxCuTOaQLALBjDAHEPcj5eBJRVCnChNZ/hVCVHcFad5CiXKgvUREYgBzXBYFBWac47zgXEAiMkTAhxMBfJMOokway5hUCMMABQJI8CYBppyDkITdn8EECIMQ7ApAyEEIoFQ6g/m6C4IYb5IBzCWHha%2BSAcxUAJDsLlTgABaTkmwABKLk5ZKAAGLyi2LKw%2B8c2oAH1ZqyqJe4SysqWDQJMqYVR0QkmdMxfIrAYqclWBVW0bwEAnDDG8PywIEwiglAkDkFI0r0huCaIGpIwa0jdH9X0flrQQ0dCGGGrI8bXWJrGDG3oMR41jC9YGjYnQs1TBzXMBQsTlj6H8YE4JoKumbGFTZFG8MnIQFwIQdhNxEkzGSSSuYGSmBZMoLkkAkgACcWY6QVLHT4WIc6il0kkEpOkSlDCcAaaQJpXAWltLrZwHpIA%2Bl9rXRwHwta/ldN7fctxmKUgOEkEAA%3D%3D&quot;&gt;Godbolt link&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&amp;lt;div id=&quot;issue-dump&quot; style=&quot;display: flex; flex-direction: column; align-items: center; align-self: flex-start&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;/div&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;script&amp;gt;
document.getElementById(&quot;issue-dump&quot;).innerHTML = [110317].map(i =&amp;gt; `&amp;lt;div class=&quot;individual-issue&quot;&amp;gt;
&amp;lt;div&amp;gt;&amp;lt;svg id=&quot;issue-indicator-${i}&quot; class=&quot;issue-indicator issue-indicator-unknown&quot; viewBox=&quot;0 0 16 16&quot; version=&quot;1.1&quot; width=&quot;20&quot; height=&quot;20&quot; aria-hidden=&quot;true&quot;&amp;gt;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#59636e&quot; d=&quot;M8 0a8 8 0 1 1 0 16A8 8 0 0 1 8 0ZM1.5 8a6.5 6.5 0 1 0 13 0 6.5 6.5 0 0 0-13 0Zm9.78-2.22-5.5 5.5a.749.749 0 0 1-1.275-.326.749.749 0 0 1 .215-.734l5.5-5.5a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&amp;lt;/svg&amp;gt;&amp;lt;/div&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;div&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;p&amp;gt;&amp;lt;a href=&quot;https://github.com/llvm/llvm-project/issues/${i}&quot;&amp;gt;llvm/llvm-project#${i}&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;`).join(&quot;\n\n&quot;);
&amp;lt;/script&amp;gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.LCPI0_1:
        .zero   32,16
.LCPI0_2:
        .zero   32,252
.LCPI0_3:
        .zero   32,224
.LCPI0_4:
        .byte   1
foo:
        vpsllw  ymm0, ymm0, 5
        vpbroadcastb    ymm1, byte ptr [rip + .LCPI0_4]
        vpblendvb       ymm1, ymm1, ymmword ptr [rip + .LCPI0_1], ymm0
        vpand   ymm0, ymm0, ymmword ptr [rip + .LCPI0_3]
        vpsllw  ymm2, ymm1, 2
        vpand   ymm2, ymm2, ymmword ptr [rip + .LCPI0_2]
        vpaddb  ymm0, ymm0, ymm0
        vpblendvb       ymm1, ymm1, ymm2, ymm0
        vpaddb  ymm0, ymm0, ymm0
        vpaddb  ymm2, ymm1, ymm1
        vpblendvb       ymm0, ymm1, ymm2, ymm0
        ret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Luckily, there is a very easy way to map nibbles to a byte. Use &lt;code&gt;vpshufb&lt;/code&gt; again! Actually this works out better for us because we can map upper nibbles in the range $\footnotesize \left[8,\ 15\right]$ to &lt;code&gt;0xFF&lt;/code&gt;. We&apos;ll see why later.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;comptime var powers_of_2_up_to_128: [16]u8 = undefined;
inline for (&amp;amp;powers_of_2_up_to_128, 0..) |*slot, i| slot.* = if (i &amp;lt; 8) @as(u8, 1) &amp;lt;&amp;lt; i else 0xFF;

const upper_nibbles_as_bit_pos = vpshufb(powers_of_2_up_to_128, chunk &amp;gt;&amp;gt; @splat(4));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Much better emit! (&lt;a href=&quot;https://zig.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYgAzKVpMGoAF55gpJfWQE8Ayo3QBhVLQCuLBiABMpJwAyeAyYAHKeAEaYxBJ%2BAA6oCoT2DK4eXr6kCUl2AkEh4SxRMVx%2B1pi2KUIETMQEaZ7eZZg2uQzVtQT5YZHRsVY1dQ0ZZYNdwT1FfaUAlFao7sTI7BxoDAoEANQb6JsApHoAIpsAAngsCXUQez4%2BOzc%2BMwcAQnsaAIJrG5sR7nR2DH2R1O50uBGut1%2B/2CDyeeleHze7yogPccTi0QA%2Bgw8BEIvQFJimISIoRMdlsUw8AA3TAQZAIdwMADWIFOADUKkRiBA9H5Nu4ABwzGanAAqAE8MQB5Kj0xks0V7ADsCPemw1m2ImAIi0BJ2JEBOkplcoZTOZc1OCjiRnBXBFQOcB2cpwIxCZogIdPNipeSJVhwDH0wqjBmxRAvRWJxeIJRJJZOy8otbJOnNsJF5/KFjuNUswspTiv2qqRms2aAudjYm2ptU2CQA7tFCagqJifJi0ZiiJjSoK2XsAKzPLgANhHhyFQOOTKwNBC6H9HwrwVoEwjJE2EPHzdbmPbne7cV7qH7PkFpE2GgAdLelcrnAAqBS0VAEa94FWut8f2/PrOmx4FQO54E6mzCqchpCteDpOi6wGbC0Sg3qoABi6EruqmrarqxCAtScQKIyVARBA%2B7EG2HZdj2fYDtevrMkC2AHNg1q2kw4KSCK2GBsGyKEcRpHkTU%2BLsJsxgSgQBZfgw%2BDLAobJSTJGKiiBO4nEkpiFnK%2BamhAwQKZgCiOmx1p4DpRb6bpEBifQIqijZRZGXgimiihmDigWRb2Zgj5qhWXxbEIcgAOJhdgQhitghyYuy2DOJiQgAJIAFrYEOejoVWcQ1pgmLBFsBzHDst5JCw6Dle4wDACZBAZtyASMMABAIOhJDOHE7gQLBPx/LQAK3sg3Vwmq5aahpRrBK41bnHSorbqFEVRTFcUJUlqUZbOJWbAw7i0LQAUTRWGr1sQWomQdBBpiatmue5QHzpgi6YMu8Inad/AXRAd63icpLCJZtnOXKD0mY6AD0kGPs435PqWgWncjwXAfJoZAeDCgjs837DkGH2rsjp3ago1043jxy7VNRkY2xu0aKogoaKKGjIbQqF%2BRT6OqPsPjDppgNCMD1l3b5TDiVDMNTthxOlkGiJE8jeF6pdZODbLmr8YrOEalNWki3pYtg/JbkQ0qejsQbVlGz5cp%2BY5iOfRWKsEXWwnuGREBlRVVXahiXFGtpIPG4ZpvuVaDtycZpmaxq2vvJ9%2BvB6Ldt2RLDkW%2Bxy2RdFsXxYlyXpdgt4HOhx1K6d52Np0Sn7KO1sh2nDuitDOerfnG1FxlpfZVO6Zclm7d5%2BthdbSXZfXrmT3o6971I8j66bt9O43HutfXn9sOvu%2Bn7AT%2BTuV3Lf4EABQFESRnvkT75xVaG7pMLY6fiV%2BmyAcPa0F5txe9%2Bh14f53MeP8y5WhvpVW899iCP3BFjV%2B79wq50/l3cev9/4II7qPb%2BPcQFjWdlrZUCtdYkx1KrAGhBnDEnBHEWuuDK4J0%2BqjNgbUMAKCAhsD0thD5EIrPfaIgJIwnAeIdakLBbxM3HLeJg1JVDDlKLeD2VBbwRFvLInwDwjSNSzOOaQAphTXgHpmHk2ip7CicpooxOjcxx1Orwt2AihG0BEWIwUEipGqB8PIy%2BiiIjqIMdybMJirR%2BKzHyQJZjB48lCbo2h3DNS2P4fqBxTjxHlQUEoPQnjSJKNvAOXx5iIATkCfo/JhTonhMMQU8cgTrH8UJonSurtAQKCbIQBkml
Q4ty4XLYJFjAk7UtsBUC9JUBzVrGVMUtQ6qnxSVQTAXFFiYCEDqAAEoaKEg1gjDW6reWZ8zSbXkkdI1REQmyOiYQgFh/0kmiJSW41RmTPbZNUeovy0czamXZqhE4uU6A4GIMQLMDxnDKDkJsQYUzNhGGQMyVhZN0QkC2CvC%2BIlnm3DmHgjUPSAnRP6exKauV8rbAIFVCZxApnOIkbs/CiyVlrIGkNEa7gdlzOpQoA5bjHiinOZcwRtxhE3JcYc9xDzvEvIzpgN5j1PKnB%2BfQbA/zAW3GBaC8FOpIWPxhdsNE4YkUKIiPzScaLSAYo5BEypfSSrmXxSMvK80iUksmTqClzK9k0oIKshQEB1kMu2VShZbLNipPSWcnUFz0AKCuXyxxAqJFpPSSK7JuTbjP3oJKiGnyvLfJtb8%2BVAKeRApBWCx1WwoWarhTq7cyKr5JseMao%2BcSOZeUtQMrNFwc0Kp5GVKgLBT4EvmsoYgRUIQ%2BBSgwesG5diTM8IwLYqkvLULSW9MFqB3ZeIiEOMsBCRzOAYA8A5KpnigxTf5QMIo61ENqeNAhHA5i0E4MOXg3gOBaFIKgTgaULDbAWEsJtPg9B6F4DdZ9N65gIDmVgGIEA5ismHHeLggofAaGVP%2BvQXBlRcGHJIAwd6OCSF4CwCQGgNCkCfS%2Bt9HBeBKWI0BrQcw4CwCQLK6I5BKBMZiKYDQXA9DEZoINVslB13AdIKSZgxAJScB4MJ4ItQJTSgiNoLkEneBVjYIIaUDBaDiaE1gX4wAKGHSUtwXgWAWDGGAOIbTeBtSVFpIZl9oYKjuG9Ep8gggWiaF4BuCIUCxOuCwB50g7pzgudpMQCIiRMCHEwKZkwG4TAebmFQIwwAFDsjwJgJs0oMRPsk/wQQIgxDsCkDIQQigVDqCE7oLghgzMgHMJYLzSlIBzFQLagQhmAC00pNgACUWhzKUOhShmwOuqq2OIzE2iOtxfcLzDrLBGUlVMGO6IAHX2hcHVgJrUGrD9cqA4CAThhhNGI4ECYhRij6GkNkZIAhju%2BGIzdto3QLt9D0NIco%2B32hjHu8qXbrQqhjBe70GIBgNidF%2BwMTowOpig7mAob9yx9C3vvY%2BgL5HNj1ZvDkjJbMIC4EINuG4/6ZiAYS6B8DfQduskkAATlvIKNDtOfDjlZwhwUkhhyCmHIYTgeHSAEa4ERkj6POCUZANR8nvOOA%2BDR0J8jZPgNntC0kBwkggA&quot;&gt;Godbolt link&lt;/a&gt;)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.LCPI0_0:
        ...
.LCPI0_2:
        ...
foo2:
        vpsrlw  ymm0, ymm0, 4
        vpand   ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
        vpbroadcastb    ymm1, byte ptr [rip + .LCPI0_2]
        vpshufb ymm0, ymm1, ymm0
        ret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we simply bitwise &lt;code&gt;&amp;amp;&lt;/code&gt; the two together, and check if it is non-0 on AVX-512 targets, else we check it for equality against the &lt;code&gt;upper_nibbles_as_bit_pos&lt;/code&gt; bitstring. I wrote a helper function for this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fn intersect_byte_halves(a: anytype, b: anytype) std.meta.Int(.unsigned, @typeInfo(@TypeOf(a, b)).vector.len) {
	return @bitCast(if (comptime std.Target.x86.featureSetHas(builtin.cpu.features, .avx512bw))
		@as(@TypeOf(a, b), @splat(0)) != (a &amp;amp; b)
	else
		a == (a &amp;amp; b));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives us better emit on AVX-512 because we have &lt;code&gt;vptestmb&lt;/code&gt;, which does the whole &lt;code&gt;@as(@TypeOf(a, b), @splat(0)) != (a &amp;amp; b)&lt;/code&gt; in one instruction, without even using a vector of zeroes! On AVX2 targets, we have to use a bitwise &lt;code&gt;&amp;amp;&lt;/code&gt; either way, and to do a vectorized not-equal we have to use &lt;code&gt;vpcmpeqb&lt;/code&gt;+&lt;code&gt;not&lt;/code&gt;. Hence, we avoid that extra &lt;code&gt;not&lt;/code&gt; by instead checking &lt;code&gt;a == (a &amp;amp; b)&lt;/code&gt;, where &lt;code&gt;a&lt;/code&gt; has to be the bitstring that has only a single bit set, which is &lt;code&gt;upper_nibbles_as_bit_pos&lt;/code&gt; for our problem.&lt;/p&gt;
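The equivalence this relies on (checking `a == (a & b)` instead of `0 != (a & b)` when `a` has exactly one bit set) can be sanity-checked exhaustively in a few lines of Python:

```python
# When a has exactly one bit set, (a & b) is either a or 0, so the
# cheaper equality test matches the "intersection is non-empty" test.
for bit in range(8):
    a = 1 << bit
    for b in range(256):
        assert (a == (a & b)) == (0 != (a & b))
print("equivalence holds for every single-bit a and every byte b")
```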
&lt;p&gt;So here is everything put together:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const single_char_ops = [_]u8{ &apos;~&apos;, &apos;:&apos;, &apos;;&apos;, &apos;[&apos;, &apos;]&apos;, &apos;?&apos;, &apos;(&apos;, &apos;)&apos;, &apos;{&apos;, &apos;}&apos;, &apos;,&apos; };
comptime var table: @Vector(16, u8) = @splat(0);
inline for (single_char_ops) |c| table[c &amp;amp; 0xF] |= 1 &amp;lt;&amp;lt; (c &amp;gt;&amp;gt; 4);
comptime var powers_of_2_up_to_128: [16]u8 = undefined;
inline for (&amp;amp;powers_of_2_up_to_128, 0..) |*slot, i| slot.* = if (i &amp;lt; 8) @as(u8, 1) &amp;lt;&amp;lt; i else 0xFF;

const upper_nibbles_as_bit_pos = vpshufb(powers_of_2_up_to_128, chunk &amp;gt;&amp;gt; @splat(4));
const symbol_mask = intersect_byte_halves(upper_nibbles_as_bit_pos, vpshufb(table, chunk));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;:::note
This properly produces a &lt;code&gt;0&lt;/code&gt; corresponding to bytes in &lt;code&gt;chunk&lt;/code&gt; in the range $\footnotesize \left[\mathrm{0x80},\ \mathrm{0xFF}\right]$. This is because &lt;code&gt;vpshufb(table, chunk)&lt;/code&gt; will produce a &lt;code&gt;0&lt;/code&gt; for bytes in &lt;code&gt;chunk&lt;/code&gt; in the range $\footnotesize \left[\mathrm{0x80},\ \mathrm{0xFF}\right]$ and &lt;code&gt;vpshufb(powers_of_2_up_to_128, chunk &amp;gt;&amp;gt; @splat(4))&lt;/code&gt; will produce &lt;code&gt;0xFF&lt;/code&gt; for them. &lt;code&gt;intersect_byte_halves&lt;/code&gt; on AVX-512 will do &lt;code&gt;0 != (a &amp;amp; b)&lt;/code&gt;, where &lt;code&gt;a&lt;/code&gt; is &lt;code&gt;0xFF&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; is &lt;code&gt;0&lt;/code&gt;, which will reduce to &lt;code&gt;0 != 0&lt;/code&gt;, which is &lt;code&gt;false&lt;/code&gt;. On non-AVX-512 targets, &lt;code&gt;intersect_byte_halves&lt;/code&gt; will do &lt;code&gt;a == (a &amp;amp; b)&lt;/code&gt;. Substituting the same values for &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, we get &lt;code&gt;0xFF == (0xFF &amp;amp; 0)&lt;/code&gt;. This properly produces &lt;code&gt;false&lt;/code&gt; as well.
:::&lt;/p&gt;
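Putting the two shuffles and the intersection test together, a scalar Python sketch (again, just an emulation of the vector code above) can verify the classifier against the character set for every possible byte value:

```python
SINGLE_CHAR_OPS = b"~:;[]?(){},"

# Low-nibble table: valid upper nibbles encoded as bit positions.
low_table = [0] * 16
for c in SINGLE_CHAR_OPS:
    low_table[c & 0xF] |= 1 << (c >> 4)

# Upper-nibble table: a single bit for nibbles 0..7, 0xFF for 8..15.
high_table = [(1 << i) if i < 8 else 0xFF for i in range(16)]

def in_set(b):
    hi = high_table[b >> 4]                     # vpshufb(powers_of_2_up_to_128, chunk >> 4)
    lo = 0 if b & 0x80 else low_table[b & 0xF]  # vpshufb zeroes indices with the high bit set
    return hi == (hi & lo)                      # the non-AVX-512 intersection test

assert all(in_set(b) == (b in SINGLE_CHAR_OPS) for b in range(256))
print("classifier agrees with the character set for all 256 byte values")
```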
&lt;p&gt;Compiled for Zen 3, we get (&lt;a href=&quot;https://zig.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYgAzKVpMGoAF55gpJfWQE8Ayo3QBhVLQCuLBiABMAVlInABk8BkwAOU8AI0xiXw1SAAdUBUJ7BlcPL18A5NS7ARCwyJYYuJ8E60xbdKECJmICTM9vaSqagTqGgiKI6Nj4q3rG5uy24Z7QvtKBioBKK1R3YmR2DjQGBQIAai30bYBSPQARbYABPBZkxogDnx89u585o4AhA40AQQ2t7aj3Oh2BiHE7nS7XAi3e7/QGhJ4vPTvL4/HZMBRKRog057AB0WH%2BwBxaIxBDeH0%2B5JR2yEcgA4rTsEIACrYY4AfQAathnGyhABJABa2CxuwI6BxqRY4oU7mAwEwWw51SIxCCjGABAQADESM5Eu4IO4ABykP4A2hAnHIfUIpHfAS/ZwIdwMADWIrOStsJAgNPpjJZ7K5PP5QpxRy1puNtvJ5Mwqgh2yowJoDBcCAaCj5DCEmEhyGdbpA5y9Kogeh8UaNczmovFbHqOOzkJxLtSwDC6FNZwIAE9Ephs/wIGcmf3MAB5KgQAsu101nEAN2VJBx9AYtYOAHY7ds99sqakTPQ2QWGmzUIkFCKDn5Xmzb8djdvXtswBwAH7v03v8C8N8cG834AberzAe%2Bj7gYBehalBEBQXMUEvkhW7HFBpDvocqFkl8%2B4HqgVx2Gw2yLg02y0KgqCuu4iRsvUUTiCWK7EBAXAAGxVpuoJnAoiRGJCGgxrh%2B6hLQUxJiQ2wQEewAnmexAXlem5bs4yDbs45GUdRtH0fQoHIIcPhsdsGiqFqj5Yc4RynFwIJWXoGkziC2BHMKkhCZ8eFoIRlyYCRZHJAA7rECgXlQbI%2BGyNF0agbJcD4RrFqB7GPsaIoulgqaYOgOGeSJDBiWEEnEFJdxsUFIVhRFUU6bF8UmiZOI4spzgAFQKBRBCmng6m7J1OKtSKeBUFJeB2ds1bnGihoNVwXH2RpY2YLQSgmWZ5mIrGwl7lSNEDgpDB4FEDEKmyaJslEhBsnkIqLlezpUFEEAVcQoWoOFkXRUQcUJaas5us5rnnLx/EQO5HleQ6OwKL2pRuGyLBou61nbKEBAhcqF29ujbIZrQy4KIaiT7Wyh3HfQoXnZdBDXSkpp3QoD1PRRVFfUwJ1/YW84Q/uxB5sswIw3DtAI0juXbscW2fMmqOCBjthYzjeMExATDFsY2PjqaUTqwwmsDrWuINkwTaCBArabBYnbdn2A5DqgI5jgOU6q9rC7Lt6xBrowyl2h8ACcxKxJCuKSuKaAugQTLEO4CojskiSuJHqtcZL3HTaO44u0wbvdiDTCQnNNZYijPbjvbjtZ9OOd/O7zHexu4tfAHfMEAL5zU84aKQsNUneYkRF%2BbiTINPKBA4qoRpsTiVCYAXyyYLmBAABLTTCFqhFa%2Boz3Pbd8woppEouqh%2BPFUSBTWW3%2BwHZwZ07k7V7nwN8QXECCbWYBgCjquGcZUQvM3DQ/tlpKCvgHJgJdQQ/zKrXDyEspYywZkzCAul2DbA1rbTA3U0x4FWAoXW%2BtMC1l7iOVIpgH6V2dtOUI%2BA8HFyBjxPA5CXaZyoSg9m9Aay1lYRQmhuCFS1hAX5HhLtUG%2B3JHhEhFwMgEQHr5CA9CdwSLwvuUiJV97uAtMWER1CcF0PSmmTAWUcqbW2io/gJU35NTONTIQTCKE6IgHwuhtYAD0E0Wo9RUlhXcKi/FUhofGIaeiFSgR6n4NOvi/F4Q0RaMJFkUYkMCaoQGoJTJGkEiZbYQjtioLCYYlJdw/BSRsYQOxzDpyOLEW4jxj5crRKwpLQBeU/Gt3brE0kpiWmNKlpIk
apD7EsPvi7ZxAiuLCkYRUyhFDqniLMTE/mxBgRIPcI9aSYoJSXHFHzAcr9JkOOGbo2hYzTRiOwcchQPM9zwOaX0kpZCDlV3YSdehehhR%2BgZMyVknJuS8kFNgcMME5ndLwmo7YiRuj4MOHefZQynnVO2O4j5AZvnBj%2BWGCMj5PTMV9HST5gYfkhn%2BYCyM2xowGMylMExUSVGiXEhY0qRkIWNAPo1Zqll2qdW6r1F8yiGl7g6qgceg0UYrLWaHLZOJ4wEGIEwWwzz6DdW2INZFXygy/NDACiMppVUErRZqklCw6ybKlFK1QMq5U9xCaysaKq8UovVUSjFMEdX2rVYS9FWqgVXJUTckFvNFnAlKU0buz1IU%2Br9Xy/CmwdgNgQBga8KMtgx1sD4qNeFpWxBTEGp4tB8YsAnlPI%2BJ94o4nuqsnEUQcSnx8E8EcpYfRsWkGS6s3YG0sSbZxbh7aICdpbT6jN5qs1JhzfcPNi4C2T2nkwY%2BPgy2MwrVEOt2LPblkrP2ttOKKxdqYqu7d/b6nRMzUskd5xc35sLdPdESg9DzoepWnE9Vl09vYpxTdq7X39u7Tiz90ZD3XOwl0qNbST0KECoQAsJTDkKqIcChpK6yx9vJdZIGJD%2B6D2NSPYgY9L073nnzJeq9Cbr0tNadweG94KkPjOktPhz7FzjQmnEZxz0TtwzRmtd7F3VvinW1B5z%2BGXOyStYR/c6A4GIMQH0TxnDKDkLk0eeZyJytdNeGUxMSA7AZWKqINb4SkHTfuBDPp93IZOKh/p6HfKYcU%2BPKdFGF6EbXuaUj29Z74ao9sYtzxayMfQAoZjrHJ1Fpo3O8tVBK18Y4Vg2WFzBEifOGJ%2Bg2BJPSfuLJ%2BTwwx7KeQKp3Ye1NPFRIuFqI/g2L6cM3uYzLFf2TRQ281GlnZEYeHrZ3D7nKNOeIy5zeZGHP70PtezAegGN5njf5wLY6L32aG7e8LD6n33BgwJ/ROSzhJYk1JliMm5MKew0powuW1MFcxNpkri3ngGfmfuHJ9WJkbZS1t9Z4oqAsHHlZtgyhiBoyhD4bMpExL7FHp4RgOxMHgqDvsIgxWF2PSSko7CfhnAMCeIfF8VTosvFQjWK73T4FAdQhwBYtBOB%2BF4N4DgWhSCoE4AKCwuwlgrD8ncPQeheAEE0EThYCA55YDiAo0groQBsVvZIPwbF/ZcC3BoDQRo/D3B8FwQwnBJC8BYBIGXpAKdU5pxwXg%2BCEgc8p0T0gcBYAwEQCgWR4myAUBnNb%2BgcRTAyrnHwQEIVKA62N6QS6zBiC9k4DwH3oQGi9gnFEbQypA%2B8G8mwQQE4CoB%2B9/iWUXc834O4LwLAiNjxrCp/gPmNQCac8CKoao7h0bR/IHLEn3uxJRFlf71wWAS8ysuFX5cxAogpEwMcTAOfZKhFAMbhYVAjDAAUByPAmBAoTgHBToP/BBAiDEOwKQMhBCKBUOob3uhKxGBMCAcwlh6/4MgAsS8BRNicAALQTm2AAJWWnPJQWpu7bBv1lpTU62RNpv4VdwFJG/FgMjayUwBgTvNnanTvb7LAM/AXdoK/RwNMUYbwJXYIKYEoMoCQJIFINIAQVAnAvIfAhgXoLAgYJXRA2oCYQgyg5/DoHMCYMg/oOISgmgtwFoHArYboZgmYVghYBQRnVYfQYnUncnEvXXbYY/XJGOAGCAXAQgSSFnEbdnTnHHHnJgPnSgBYIXPQaePwSQPQWXHwf2SXSQLcCw/2DiWvVXUgdXPwBIbXXgXXfXEAQ3NQ03C3CAJADbW3SgXwo/F3N0N3C0D3CAL3KnX3UPKvKI/3cPSPWwKvWPUHBPWgJPfPTAAkNPFaKvbPYwWSPPLPPAQvOwYvb3eMcvSvTPavdGWvKnevRvXsZvQo0gNvdXaozvbvJQPvAfQqYfLQUfcfSfafWfefKvJfYQUQcQdfCYrfNQEvXQJXA/MwCwQwI6eAi/ORB0W/e
/J/egNETAN/X4T/WzbYH/P/AAoAkA/UMAiA2IWw1AGAvAOA%2BAAQ%2BgpAiAJwWgwINMXg7ApXYgq/b4wE9IP4igqwd46g7ob4qgzoJgzAlgrg9grINAoYHghEvgiQAQoQ9gEbZXDgMnLXCQzgKQ%2BnII90eQ/AFUQyVnOYVQkfbnXnAYAXGwtXEABwok73FwqwNw1otQnQ/QfQww4w0wqQCwrcKw/EqApw6nTgekgY0QjgHwcQrkuUvkhk0gTvVIBwSQIAA%3D%3D%3D&quot;&gt;Godbolt link&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&amp;lt;div id=&quot;issue-dump2&quot; style=&quot;display: flex; flex-direction: column; align-items: center; align-self: flex-start&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;/div&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;script&amp;gt;
document.getElementById(&quot;issue-dump2&quot;).innerHTML = [110305].map(i =&amp;gt; `&amp;lt;div class=&quot;individual-issue&quot;&amp;gt;
&amp;lt;div&amp;gt;&amp;lt;svg id=&quot;issue-indicator-${i}&quot; class=&quot;issue-indicator issue-indicator-unknown&quot; viewBox=&quot;0 0 16 16&quot; version=&quot;1.1&quot; width=&quot;20&quot; height=&quot;20&quot; aria-hidden=&quot;true&quot;&amp;gt;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#59636e&quot; d=&quot;M8 0a8 8 0 1 1 0 16A8 8 0 0 1 8 0ZM1.5 8a6.5 6.5 0 1 0 13 0 6.5 6.5 0 0 0-13 0Zm9.78-2.22-5.5 5.5a.749.749 0 0 1-1.275-.326.749.749 0 0 1 .215-.734l5.5-5.5a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&amp;lt;/svg&amp;gt;&amp;lt;/div&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;div&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;p&amp;gt;&amp;lt;a href=&quot;https://github.com/llvm/llvm-project/issues/${i}&quot;&amp;gt;llvm/llvm-project#${i}&amp;lt;/a&amp;gt; gives us some dead data, which I manually removed&amp;lt;/p&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;`).join(&quot;\n\n&quot;);
&amp;lt;/script&amp;gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.LCPI0_0:
        .zero   32,15
.LCPI0_3:
        ...
.LCPI0_4:
        ...
findCharsInSet:
        vpsrlw  ymm1, ymm0, 4
        vbroadcasti128  ymm3, xmmword ptr [rip + .LCPI0_3]
        vpand   ymm1, ymm1, ymmword ptr [rip + .LCPI0_0]
        vbroadcasti128  ymm2, xmmword ptr [rip + .LCPI0_4]
        vpshufb ymm1, ymm2, ymm1
        vpshufb ymm0, ymm3, ymm0
        vpand   ymm0, ymm0, ymm1
        vpcmpeqb        ymm0, ymm0, ymm1
        vpmovmskb       eax, ymm0
        vzeroupper
        ret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&amp;lt;script is:inline&amp;gt;
{
const open = &apos;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#1a7f37&quot; d=&quot;M8 9.5a1.5 1.5 0 1 0 0-3 1.5 1.5 0 0 0 0 3Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#1a7f37&quot; d=&quot;M8 0a8 8 0 1 1 0 16A8 8 0 0 1 8 0ZM1.5 8a6.5 6.5 0 1 0 13 0 6.5 6.5 0 0 0-13 0Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&apos;;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const closed_completed = &apos;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#8250df&quot; d=&quot;M11.28 6.78a.75.75 0 0 0-1.06-1.06L7.25 8.69 5.78 7.22a.75.75 0 0 0-1.06 1.06l2 2a.75.75 0 0 0 1.06 0l3.5-3.5Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#8250df&quot; d=&quot;M16 8A8 8 0 1 1 0 8a8 8 0 0 1 16 0Zm-1.5 0a6.5 6.5 0 1 0-13 0 6.5 6.5 0 0 0 13 0Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&apos;;

const closed_not_planned = &apos;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#59636e&quot; d=&quot;M8 0a8 8 0 1 1 0 16A8 8 0 0 1 8 0ZM1.5 8a6.5 6.5 0 1 0 13 0 6.5 6.5 0 0 0-13 0Zm9.78-2.22-5.5 5.5a.749.749 0 0 1-1.275-.326.749.749 0 0 1 .215-.734l5.5-5.5a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&apos;;
for (const issue_id of [110317, 110305]) {
    fetch(`https://api.github.com/repos/llvm/llvm-project/issues/${issue_id}`)
        .then(e =&amp;gt; e.json())
        .then(e =&amp;gt; {
            const svg = e.state === &quot;open&quot; ? open : e.state_reason === &quot;completed&quot; ? closed_completed : closed_not_planned;

            for (const e of document.getElementsByClassName(&quot;issue-indicator&quot;)) {
                if (e.id === `issue-indicator-${issue_id}`) {
                    e.classList.remove(&quot;issue-indicator-unknown&quot;);
                    e.innerHTML = svg;
                }
            }
        })
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;}
&amp;lt;/script&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;style&amp;gt;
.issue-indicator-unknown {
transform: rotate(45deg);
}&lt;/p&gt;
&lt;p&gt;.individual-issue {
display: flex; flex-direction: row; align-items: center; align-self: flex-start;
}&lt;/p&gt;
&lt;p&gt;.individual-issue &amp;gt; div {
margin-right: 0.5em;
}&lt;/p&gt;
&lt;p&gt;.individual-issue &amp;gt; div + div {
height: 2.3em;
}&lt;/p&gt;
&lt;p&gt;.individual-issue &amp;gt; div + div &amp;gt; p {
margin-top: 0;
margin-bottom: 0;
line-height: 1.5;
white-space: nowrap;
}&lt;/p&gt;
&lt;p&gt;p + div#issue-dump {
margin-bottom: 0.5em;
margin-top: -1.5em;
}
&amp;lt;/style&amp;gt;&lt;/p&gt;
&lt;p&gt;Compiled for Zen 4, we get (&lt;a href=&quot;https://zig.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYgAzKVpMGoAF55gpJfWQE8Ayo3QBhVLQCuLBhI0AmUk4AMngMmAByngBGmMQSBgAOqAqE9gyuHl4%2B/onJdgLBoREs0bFcBtaYtqlCBEzEBOme3lx%2BVpg2eQw1dQQF4VExcVa19Y2ZLf4KI70h/cWDZQCUVqjuxMjsHGgMUwDUU%2Bi7AKR6ACK7AAJ4LIn1EEe%2BvgcPvosnAEJHGgCC23uR7jodgYxzOl2utwI90eAKBIRebz0nx%2BfwIuyYCiU9VB5wOADosADgHiMViCB8vt9KajdkI5ABxenYIQAFWwpwA%2BgA1bDODlCACSAC1sDj9gR0HjkixJQp3MBgJgplzKkRiIFGMACAgAGIkZzxdwQdwADlIu1htGBeOQhsRyN%2BAj2zgQ7gYAGsxRcVbYSBA6YzmWzOTy%2BYKRXiTjrzab7ZTKZhVJDdlQQTQGC4EHUFAKupgochXR6QJcfWqIHp/LtY4txZK2LU8bmoXi3clgKF0OaLgQAJ7xTC5/gQC4s/uYADyVAghbd7sWizxADdVSQ8fQGLWjgB2B27fe7GnJEz0DmFuoc1DxBRio4AVneHPvp1NO/euzAHAAfp/zZ/wLwH4cB8v5Afe7ygZ%2Bz6QcBeg6jBEAwYsMFvih26nDBpCfsc6EUj8B6HqgNx2GwuxLnUuy0KgqDuu48QcrUkTiKWq7EBAXAAGwxiaW5ghcCjxEYUIaHG%2BEHiEtCzCmJC7BAx7AKe57EJe15btuzjIDuziUdRtH0Yx9Dgcgxy%2BBxuwaKoOrPjhzgnOcXCgrZejaTOoLYCcoqSKJ3wEWgxHXJgZEUYkADuMQKJeVAcr4HJ0QxqAclwvgmiW4Gcc%2Bppim6WDppg6B4T54kMJJoTScQskPBxoXhZF0WxfpCVJWa5l4nianOAAVAoVEEOaeBafsPV4h1Yp4FQsl4I5uw8ZcGLGs1XC8U52mTe0SjmZZVlIvGYn7jSdEDspDB4JETFKhyGIcpEhAcjkYpLterpUJEEDVcQEWoFFMVxUQiXJeas4em5HmXAJQkQF53m%2BU6aIKL2xRuByLAYp6dm7CEBDhaqV29pjHJZrQK4KMa8SHRyx2nfQEWXddBC3Uk5oPQoT0vVRNE/UwZ0A0W85QwexD5msIJwwjtBIyjBU7qcO3fKm6OCFjtg43jBNExATAlsYuPjuakSaww2sDrW%2BINkwTaCBArY7BYnbdn2A5DqgI5jgOU7q7rC7Lqx66MGpDpfAAnKSMRQvi0qSmgboECyxDuEqI6JPErhR%2BrvHS3xc2juObtMB73Zg0wUKLQuOJoz246O872fTrnFqeyuvrED7m6Sz8gcCwQQuXLTzgYlCY2yX58QkYF%2BIsnUioEHiqgmhxeJUJghdrJgQj5gAEnNlrWra7jz4vncCwo5okkuqh3klkQhQuO0B4HFyZy7k413noOCYXEAibWYBgGj6smWZkQ3htw0AHNamAb6ByYKXMEf9Kp128lLGWcsmYswgAZdg6IDb20wH1DMeANgKH1obTAtYB4jmSKYJ%2BVdXbThCPgAhJcQb8TwJQt2WcaFoM5vQBctZ2FULofgpUtYwGXEfm7dBftKQETIVcNIRFh4BQgIw3cUiCIHnIuVQ%2B7grQlj4W7ARDCsoZkwLlfK21dpqP4OVD%2BrULi0yECwqhejaF4IYbWAA9NNdq/V1I4T3GogJNI6GJlGq4pU4F%2Bp3nTv4gJBEtFWgidZNGZDgmqGBmCCyJoRLmV2CI9BETjFpIeHeWSdjCAONYdOZxnCzol08TxZ8BVYk4WlsAwq
ASO5d3ieScx7SWky2keNchji2FiJcfQoRvFRTMMqdQqhEjlExLUZ04gIIUHuGenJCUUpriSgFgOd%2BMynFjIgAYyZ5oJG4ImQoPm%2B5EFtMGaUihxzq41O4VM2kDImSsnZNyXk/JhTYEjHBSRFiCIaN2PEHohDjgPiOaM15CzdieIDN84MfywyAuBVZKJ3pWL%2Bi%2BUGX5oYAURijNxNO1ZjGmKabEiSUkrEVVMlC%2BoR8WptRsl1HqfUBpvlUc0/c3VUBTxGmjdZmyw67LxImAgxAmC2DeTg9GuwRqoqJSGf54YgXks%2BYGH5GrMVkpBeaSVMppWqFlfK/uYS2WTVVYS/VGLSXarguaNVjqSVauxdfXpzT7l9LiYLVZ3dCC9ymK9aFtz%2BkPIPDSBsCAMA3jRlMWOtg/H8rUTKmIaYQQXBeLQQmLBp6zxPmfJKeJHobLxJEPE59fAvBHGWP0HFpDVh4t2JtbEW0Ut4Z2iA3a21RoIlm4Ncs82PALUuItM855MFPr4CtzMq2RAbXixuFYqyxg7fiysPaWLrt3YO2lASR05suPmwtxa56YiUHoRdT1q14iaquvtnEKXbvXW%2Bwdvb8VftjMe/pvSM0rOFiFQghZSknKRXysFB413lgHbGaBooyFDxHnWPE49iCTyvXvJeAtV4EA3sTLeIQbSGjwwfJUx851lt8JfEu8bE14nHb4Sd06S20brfe5dtakoNvQVcwRNzcm0HWhcIedAcDEGIH6F4zhlByF2CMSelF5XuhvHKUmJA0SMvFZEOtCJSAZoIvBv0h6kN2RBqh%2BR6Gx4T3zLhhe%2BGV7r03oCK0ZGd6UeXmy0trxaxMfQAoFjF6p24dowuytVBq0Ca4Uqs5ImRESfkVJ7AMm5OPAU0plT%2BY1PIA0/sA6OmypkWi5EXwd4OJGZM3B19XFB3IfRkMtDAUMNYZwzOnzBG3MkY89vCjzmqN%2BZvZgPQjH8wJuC6Fidl6uujbvdFx9z7HiKqE4Y5Lkn6Dpdk2xeTinlMObREYArmnivYj0%2BVlbrxjOwf3CIqzehplbek7trZkoqAsCnq1tgyhiAY2hL4XM5FJKHAnp4RgaJsGQuDocIgZWl3PVSio3Cd5nAMBeMfN81SJFSwXLdvpiCgPoQ4MsWgnA7y8G8BwLQpBUCcCFBYfYqx1iBQeHoPQvACCaFJ8sBAi8sCxCUaQd0IAOJ3skFVgOXBtwaA0CaO8jxfBcEMJwSQvAWA%2BA0KQantP6ccF4IQ7X3Oaek9IHAWAMBEAoFS/QMgFAZy28GKYWVc4%2BBAnCpQPWpvSDXWYMQXsnAeC%2B5CHUXsE5IjaFVEH3gfk2CCAnMVQPPvCTyl7gWwh3BeBYGRieTYtP8ACyqETHnARVCVHcJjGP5AFbk595JSIcqA%2BuCwKX2V1xq8rmIJEJImBTiYFzwpEIoBTfLCoEYYACguR4EwCFCcA5qfB/4IIEQYh2BSBkIIRQKh1A%2B90P4IwJgQDmEsA3whkBlhXk6JngAtBOXYAAldoi8lA6j7rsG/uW0Qzo5C2m/JV3A0kb8WAd47JTAGAu9Oc6cu9/ssBz9hcKgqgHAIAnAxhmgAgMw%2BgigSgJBSAcgUgBA0DcD8DOgsCBhSg2gOhqhpgiCVdEDOhuh6gyD5gKCpgehaDhgehmCcDFoVg1gNh9AycKcqdS99ddgT9lNY4gYIBcBCAZJ2dxsuced8d%2BcmBBdKBlhRc9A547xJA9B5dfAA5pdJBtxTCA4uI691dSBNc7xtdddeB9dDcQBjdlDzcrcIAkAXt7dKAvDj9XcPR3crRPcIBvdac/cw9q9wiA8I8o9bBq849IdE9aBk8C9MAiR08xNq8c9jAFJ89s88Ai87AS8fdEwK8q8s8a9MY69acG8m9ewW88jSB29NcKiu8e8lB%2B9B8SoR8tAx8J8p8Z858F9q9l9hBRBxAN9Rjt81BS9dAVdD8z
ALBDATp4DL8FEnROA79H9n8MRMA389hP8jtdgf8/8ACgCQDDQwCICYgrDUAYC8A4D4Blh6DUhHAMwOCghZhsCFg8CkgCC5EMh0CSDUhuCfiXiBBGCGg3AmhcDwSuhphQTWCaDoTxhOCmCvjyCJBniWcBDxtVcOBKcddRDOBxCmd/DPQZD8A1QTIOdFglDR8%2BcBdBhhdLCNcQBbCiSfdHCrBnCmjlDND9AdC9CDCjCpBTDtxzD8SoD7C6dOB6TeihCOBfARCuS5S%2BSGTSAu9kgHBJAgA&quot;&gt;Godbolt link&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.LCPI0_3:
        ...
.LCPI0_4:
        ...
.LCPI0_5:
        ...
findCharsInSet:
        vgf2p8affineqb  ymm1, ymm0, qword ptr [rip + .LCPI0_3]{1to4}, 0
        vbroadcasti128  ymm2, xmmword ptr [rip + .LCPI0_4]
        vbroadcasti128  ymm3, xmmword ptr [rip + .LCPI0_5]
        vpshufb ymm0, ymm3, ymm0
        vpshufb ymm1, ymm2, ymm1
        vptestmb        k0, ymm0, ymm1
        kmovd   eax, k0
        vzeroupper
        ret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The advantage of this strategy over the one used by Langdale and Lemire (&lt;a href=&quot;https://arxiv.org/pdf/1902.08318&quot;&gt;2019&lt;/a&gt;) for &lt;a href=&quot;https://github.com/simdjson/simdjson&quot;&gt;simdjson&lt;/a&gt; is that we can reuse the vector containing the upper nibbles as a bit position (&lt;code&gt;1 &amp;lt;&amp;lt; (c &amp;gt;&amp;gt; 4)&lt;/code&gt;) if we want to do more &lt;em&gt;Vectorized Classification&lt;/em&gt;. That means we can add more &lt;em&gt;Vectorized Classification&lt;/em&gt; routines and only have to pay the cost for the lower nibbles, avoiding the need for an additional &lt;code&gt;vbroadcasti128+vpshufb&lt;/code&gt; pair for the upper nibbles. To add another table, the additional overhead for Zen 3 is just &lt;code&gt;vbroadcasti128+vpshufb+vpand+vpcmpeqb+vpmovmskb&lt;/code&gt;. For Zen 4, it&apos;s just &lt;code&gt;vbroadcasti128+vpshufb+vptestmb+kmovd&lt;/code&gt;.&lt;/p&gt;
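The reuse can be sketched in scalar form: once the upper-nibble bit position is computed, each additional class only needs its own low-nibble table. (The second character set below is purely hypothetical, chosen for illustration.)

```python
def build_low_table(chars):
    table = [0] * 16
    for c in chars:
        table[c & 0xF] |= 1 << (c >> 4)
    return table

# First class: the single-character operators from the article.
# Second class: a hypothetical extra set, just to show the sharing.
table_a = build_low_table(b"~:;[]?(){},")
table_b = build_low_table(b"+-*/%")

def classify(b):
    # The upper-nibble bit position is computed once and shared.
    hi = 0xFF if b & 0x80 else 1 << (b >> 4)
    lo_a = 0 if b & 0x80 else table_a[b & 0xF]
    lo_b = 0 if b & 0x80 else table_b[b & 0xF]
    return hi == (hi & lo_a), hi == (hi & lo_b)

print(classify(ord(";")), classify(ord("+")))  # (True, False) (False, True)
```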
&lt;p&gt;Dare I say this is...&lt;/p&gt;
&lt;p&gt;&amp;lt;div alt=&quot;Blazingly Fast&quot; class=&quot;blazingly-fast&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;/div&amp;gt;&lt;/p&gt;
&lt;p&gt;‒ Validark&lt;/p&gt;
&lt;p&gt;:::note[Note from &lt;a href=&quot;https://www.linkedin.com/feed/update/urn:li:activity:7246181875826733058?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7246181875826733058%2C7246455526627147776%29&amp;amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287246455526627147776%2Curn%3Ali%3Aactivity%3A7246181875826733058%29&quot;&gt;Geoff Langdale&lt;/a&gt;:]
&amp;lt;span style=&quot;font-size: smaller; line-height: 0&quot;&amp;gt;The &lt;a href=&quot;https://github.com/simdjson/simdjson&quot;&gt;simdjson&lt;/a&gt; PSHUFB lookup is essentially borrowed from &lt;a href=&quot;https://github.com/intel/hyperscan/&quot;&gt;Hyperscan&lt;/a&gt;&apos;s own &lt;a href=&quot;https://github.com/intel/hyperscan/blob/master/src/nfa/shufti.c&quot;&gt;shufti&lt;/a&gt;/Teddy (&lt;a href=&quot;https://github.com/intel/hyperscan/blob/master/src/nfa/shufti.c&quot;&gt;shufti&lt;/a&gt; is an acceleration technique for NFA/DFA execution, while Teddy is a full-on string matcher, but both use similar techniques). The code in question is in https://github.com/intel/hyperscan/blob/master/src/nfa/shufti.c and https://github.com/intel/hyperscan/blob/master/src/nfa/truffle.c albeit kind of difficult to read (since there&apos;s a lot of extra magic for all the various platforms etc). Shufti is a 2-PSHUFB thing that is used &quot;usually&quot;, truffle is &quot;this will always work&quot; and uses a technique kind of similar to yours (albeit for different reasons).&amp;lt;/span&amp;gt;
:::&lt;/p&gt;
</content:encoded></item><item><title>Eliminating Shifted Geometric Recursion</title><link>https://validark.dev/posts/eliminating-shifted-geometric-recursion/</link><guid isPermaLink="true">https://validark.dev/posts/eliminating-shifted-geometric-recursion/</guid><description>How to eliminate a shifted geometric math loop using an approximation method</description><pubDate>Sat, 21 Sep 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A while back, I noticed &lt;a href=&quot;https://github.com/ziglang/zig/blob/91b4729962ddec96d1ee60d742326da828dae94a/lib/std/array_list.zig#L374-L377&quot;&gt;this code&lt;/a&gt; in Zig&apos;s &lt;a href=&quot;https://ziglang.org/documentation/master/std/#std.array_list.ArrayListAlignedUnmanaged.ensureTotalCapacity&quot;&gt;ArrayList.ensureTotalCapacity&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;var better_capacity = self.capacity;
while (true) {
    better_capacity +|= better_capacity / 2 + 8;
    if (better_capacity &amp;gt;= new_capacity) break;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In assembly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ensureTotalCapacity:
        mov     rcx, -1
        mov     rax, rdi
.LBB0_1:
        mov     rdx, rax
        shr     rdx
        add     rdx, 8
        add     rax, rdx
        cmovb   rax, rcx
        cmp     rax, rsi
        jb      .LBB0_1
        ret
&lt;/code&gt;&lt;/pre&gt;
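As a reference point, a direct Python transcription of the loop (ignoring the 64-bit saturating add `+|=`, which only matters near overflow) behaves like this:

```python
# Direct transcription of the growth loop: repeatedly apply
# c -> c + c // 2 + 8 until the result reaches new_capacity.
def ensure_total_capacity(capacity, new_capacity):
    better = capacity
    while True:
        better += better // 2 + 8
        if better >= new_capacity:
            return better

print(ensure_total_capacity(0, 100))  # 8 -> 20 -> 38 -> 65 -> 105
```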
&lt;p&gt;Just for fun, in this article I will investigate replacing this with a branchless approximation.&lt;/p&gt;
&lt;p&gt;First, I will temporarily disregard the fact that we are dealing with 64-bit integers, that integer division by 2 floors the true quotient by 0.5 when the dividend is odd, and that the addition saturates, and just write this in terms of a recursive sequence.&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 100&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE
\begin{equation}
\begin{split}
U_0 &amp;amp;= \texttt{capacity}\
U_n &amp;amp;= U_{n-1} \times 1.5  + 8\
\end{split}
\end{equation}
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
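&lt;p&gt;To make this concrete before diving into the math, here is a small Python model (illustrative only, not part of any real implementation) of the ideal real-valued recursion next to the integer loop from the top of the article:&lt;/p&gt;

```python
def ideal(c, n):
    # The ideal real-valued recursion: U_0 = c, U_n = U_{n-1} * 1.5 + 8.
    u = float(c)
    for _ in range(n):
        u = u * 1.5 + 8
    return u

def loop(c, new_capacity):
    # Mirrors the Zig while loop above (minus 64-bit saturation),
    # returning the resulting capacity and the iteration count.
    better = c
    steps = 0
    while True:
        better += better // 2 + 8
        steps += 1
        if better >= new_capacity:
            return better, steps

cap, steps = loop(100, 1000)
print(cap, steps)           # capacity reached by the integer loop
print(ideal(100, steps))    # the real-valued sequence after the same number of steps
```

&lt;p&gt;The integer loop lands slightly below the real-valued sequence because the integer division throws away halves; this gap is what the approximation later has to account for.&lt;/p&gt;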
&lt;p&gt;If you remember your pre-calculus class, this recursive sequence is called &quot;shifted geometric&quot;, because it has a multiply that is being shifted by an addition. For &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize U_0 = c$,&amp;lt;/span&amp;gt; the expansion of this recursive sequence looks like:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 250&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE
\begin{equation}
\begin{split}
U_1 = c \times 1.5 + 8\\
U_2 = (c \times 1.5 + 8) \times 1.5 + 8\\
U_3 = ((c \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8\\
U_4 = (((c \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8\\
U_5 = ((((c \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8\\
\end{split}
\end{equation}
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;To get the general equation, let&apos;s replace $\footnotesize 1.5$ with $\footnotesize r$ and $\footnotesize 8$ with &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize d$:&amp;lt;/span&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 310&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE
\begin{equation}
\begin{split}
U_0 = c \\
U_1 = c \times r + d\\
U_2 = (c \times r + d) \times r + d\\
U_3 = ((c \times r + d) \times r + d) \times r + d\\
U_4 = (((c \times r + d) \times r + d) \times r + d) \times r + d\\
U_5 = ((((c \times r + d) \times r + d) \times r + d) \times r + d) \times r + d\\
\end{split}
\end{equation}
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;Let&apos;s apply the distributive property of multiplication:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 260&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE \begin{equation}
\begin{split}
U_1 = cr^1 + dr^0\\
U_2 = cr^2 + dr^1 + dr^0\\
U_3 = cr^3 + dr^2 + dr^1 + dr^0\\
U_4 = cr^4 + dr^3 + dr^2 + dr^1 + dr^0\\
U_5 = cr^5 + dr^4 + dr^3 + dr^2 + dr^1 + dr^0\\
\end{split}
\end{equation}
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;The pattern here is pretty obvious. We can express it using $\footnotesize \Sigma$ notation:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 115&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE U_n = cr^n + \sum_{i=1}^{n} dr^{i-1}
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;You may notice that the $\footnotesize \Sigma$ term is the &quot;sum of a finite geometric sequence&quot;. Replacing that term with the well-known formula for that allows us to write an explicit function:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 5 940 100&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE f(n) = cr^n + d \left(\frac{1 - r^n}{1 - r}\right)
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
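&lt;p&gt;Before substituting the constants back in, we can sanity-check this closed form numerically against the recursion (a quick Python sketch; the tolerances and sample inputs are mine):&lt;/p&gt;

```python
import math

def recurrence(c, r, d, n):
    # U_0 = c, U_n = U_{n-1} * r + d
    u = float(c)
    for _ in range(n):
        u = u * r + d
    return u

def closed_form(c, r, d, n):
    # f(n) = c*r^n + d*(1 - r^n)/(1 - r)
    return c * r ** n + d * (1 - r ** n) / (1 - r)

for c, r, d in ((0, 1.5, 8), (100, 1.5, 8), (7, 1.25, 3)):
    for n in range(10):
        assert math.isclose(recurrence(c, r, d, n), closed_form(c, r, d, n),
                            rel_tol=1e-12, abs_tol=1e-9)
print("closed form matches the recursion")
```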
&lt;p&gt;Let&apos;s put $\footnotesize 1.5$ back in for $\footnotesize r$ and $\footnotesize 8$ back in for $\footnotesize d$ and assess the damage:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 5 940 100&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE f(n) = c \times 1.5^n + 8 \left(\frac{1 - 1.5^n}{1 - 1.5}\right)
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;Luckily, we can simplify $\footnotesize (1 - 1.5)$ to &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize -0.5$.&amp;lt;/span&amp;gt; Dividing by $\footnotesize -0.5$ is equivalent to multiplying by &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize -2$,&amp;lt;/span&amp;gt; which we can combine with the $\footnotesize 8$ term to get &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize -16$:&amp;lt;/span&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 55&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE f(n) = c \times 1.5^n + -16 (1 - 1.5^n)
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;We could stop here, but let&apos;s distribute the &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize -16$:&amp;lt;/span&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 55&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE f(n) = c \times 1.5^n  - 16 + 16 \times 1.5^n
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;Since we have two terms being added which each are multiplied by &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize 1.5^n$,&amp;lt;/span&amp;gt; we can factor it out like so:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 55&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE f(n) = (c+16) \times 1.5^n - 16
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;This looks like what we probably expected, and it is relatively easy to deal with. Now let&apos;s try to apply this to our original problem. The first thing we want to do is find an $\footnotesize n$ for which &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize x \ge f(n)$,&amp;lt;/span&amp;gt; where $\footnotesize x$ is the requested &lt;code&gt;new_capacity&lt;/code&gt;. To find &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize n$,&amp;lt;/span&amp;gt; we have to isolate it on the right-hand side:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 830&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE \begin{equation}
\begin{split}
x &amp;amp;\ge (c+16) \times 1.5^n - 16          \\
\small \texttt{(+16 to both sides)}       \\
x + 16 &amp;amp;\ge (c+16) \times 1.5^n          \\
\small \texttt{(divide by (c+16) on both sides)}   \\
\frac{x + 16}{c+16} &amp;amp;\ge 1.5^n            \\
\small \texttt{(take the log of both sides)}   \\
\log{\left(\frac{x + 16}{c+16}\right)} &amp;amp;\ge \log{(1.5^n)}            \\
\small \texttt{(property of logarithms on the right-hand side)}   \\
\log{\left(\frac{x + 16}{c+16}\right)} &amp;amp;\ge n\log{(1.5)}            \\
\small \texttt{(divide each side by log(1.5))}   \\
\frac{ \log{\left(\frac{x + 16}{c+16}\right)}}{\log{(1.5)}} &amp;amp;\ge n            \\
\small \texttt{(property of logarithms on the left-hand side)}   \\
\log_{1.5}{\left(\frac{x + 16}{c+16}\right)} &amp;amp;\ge n            \\
\small \texttt{(property of logarithms on the left-hand side)}   \\
\log_{1.5}{(x + 16)} - \log_{1.5}{(c + 16)} &amp;amp;\ge n            \\
\end{split}
\end{equation}
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;Now this is usable for our problem. We can compute $\footnotesize n$ by doing &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize \lceil\log_{1.5}{(x + 16)} - \log_{1.5}{(c + 16)}\rceil$,&amp;lt;/span&amp;gt; then plug that in to $\footnotesize n$ in &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize f(n) = (c+16) \times 1.5^n - 16$.&amp;lt;/span&amp;gt; Together, that&apos;s:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 60&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE (c+16) \times 1.5^{\lceil(\log_{1.5}{(x + 16)} - \log_{1.5}{(c + 16)})\rceil} - 16
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;For those of you who skipped ahead, $\footnotesize c$ is &lt;code&gt;self.capacity&lt;/code&gt; and $\footnotesize x$ is &lt;code&gt;new_capacity&lt;/code&gt;, and this formula gives you the &lt;code&gt;better_capacity&lt;/code&gt;. Note that this formula will give numbers a bit higher than the original while loop, because the original while loop loses some 0.5&apos;s when dividing an odd number by 2.&lt;/p&gt;
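&lt;p&gt;This is easy to check numerically. Here is an illustrative Python comparison showing that, for a few sample inputs, the formula satisfies the request and comes out at or slightly above what the loop computes:&lt;/p&gt;

```python
import math

def loop(c, x):
    # The original integer while loop (minus 64-bit saturation).
    better = c
    while True:
        better += better // 2 + 8
        if better >= x:
            return better

def formula(c, x):
    # (c+16) * 1.5^ceil(log_1.5(x+16) - log_1.5(c+16)) - 16
    n = math.ceil(math.log(x + 16, 1.5) - math.log(c + 16, 1.5))
    return (c + 16) * 1.5 ** n - 16

for c, x in ((0, 100), (100, 1000), (5, 1_000_000)):
    assert formula(c, x) >= x
    print(c, x, loop(c, x), formula(c, x))
```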
&lt;hr /&gt;
&lt;p&gt;Now, the remaining question is how to compute the previous expression, or rather, an approximation of it, efficiently.&lt;/p&gt;
&lt;p&gt;Sadly, there is no cheap way to compute the base $\footnotesize 1.5$ logarithm of an integer. If we were allowed to change the original problem such that we could use the base $\footnotesize 2$ logarithm, that would be much easier to compute: that&apos;s just &lt;code&gt;@typeInfo(@TypeOf(c)).int.bits - 1 - @clz(c)&lt;/code&gt; (of course, this gives an integer, so we should be careful about how the flooring of the true answer affects rounding error). Let&apos;s use this information to make an approximation. Using the change of base property of logarithms, we can rewrite the equation like so:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 95&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE \frac{\log_2{(x + 16)}}{\log_2{1.5}} - \frac{\log_2{(c + 16)}}{\log_2{1.5}}
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;Equivalently:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 95&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE (\log_2{(x + 16)} - \log_2{(c + 16)}) \times \frac{1}{\log_2{1.5}}
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize \frac{1}{\log_2{1.5}} \approx 1.7095112913514547$,&amp;lt;/span&amp;gt; so we can approximate the above expression like so:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 55&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE (\log_2{(x + 16)} - \log_2{(c + 16)}) \times 1.7095112913514547
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;As hinted at earlier, we can find $\footnotesize \lceil\log_2{(x + 16)}\rceil - \lceil\log_2{(c + 16)}\rceil$ by doing &lt;code&gt;@clz(c + 15) - @clz(x + 15)&lt;/code&gt;. Note that the terms are now in reverse order because the answer returned by &lt;code&gt;@clz(b)&lt;/code&gt; is actually &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize 63 - \lfloor\log_2{b}\rfloor$.&amp;lt;/span&amp;gt; We also subtracted $\footnotesize 1$ from $\footnotesize 16$ because we probably want the ceil base $\footnotesize 2$ logarithm instead, and the algorithm for that is &lt;code&gt;64 - @clz(x - 1)&lt;/code&gt;. &lt;code&gt;(64 - @clz((x + 16) - 1)) - (64 - @clz((c + 16) - 1))&lt;/code&gt; reduces to &lt;code&gt;@clz(c + 15) - @clz(x + 15)&lt;/code&gt;. That&apos;s slightly different from what we want, which is to ceil only after multiplying by &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize 1.7095112913514547$,&amp;lt;/span&amp;gt; but if we&apos;re careful about which way the rounding works, we should be fine.&lt;/p&gt;
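&lt;p&gt;These &lt;code&gt;@clz&lt;/code&gt; identities are easy to check in Python, using &lt;code&gt;int.bit_length&lt;/code&gt; to emulate a 64-bit count-leading-zeros (an illustrative sketch):&lt;/p&gt;

```python
import math

def clz64(v):
    # Emulates @clz(v) for a nonzero u64.
    assert 0 < v < 1 << 64
    return 64 - v.bit_length()

def ceil_log2(v):
    # 64 - @clz(v - 1), the ceiling of log2(v) for v >= 2.
    return 64 - clz64(v - 1)

for v in range(2, 5000):
    assert ceil_log2(v) == math.ceil(math.log2(v))

# @clz(c + 15) - @clz(x + 15) == ceil(log2(x + 16)) - ceil(log2(c + 16))
for c in range(0, 200):
    for x in range(c, 200):
        assert clz64(c + 15) - clz64(x + 15) == ceil_log2(x + 16) - ceil_log2(c + 16)
print("identities hold")
```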
&lt;p&gt;We can change $\footnotesize 1.7095112913514547$ to a nicer number like $\footnotesize 2$ by working backwards. To make it so we would multiply by $\footnotesize 2$ instead, we would change our recursive sequence to:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 110&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE \begin{equation}
\begin{split}
U_0 = \texttt{capacity}\\
U_n = U_{n-1} \times \sqrt 2  + 8\\
\end{split}
\end{equation}
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;This works because $\footnotesize \frac{1}{\log_2{\sqrt 2}}$ is &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize 2$.&amp;lt;/span&amp;gt; This is still pretty close to our original formula, as $\footnotesize \sqrt 2 \approx 1.41421$ and &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize 1.41421 \approx 1.5$.&amp;lt;/span&amp;gt; If we did the same steps as before, $\footnotesize -\frac{8}{1 - \sqrt 2} \approx 19.313708498984756$ would be in all the places where we had $\footnotesize 16$ in our original equations. Let&apos;s round that up to $\footnotesize 20$ this time, since we rounded $\footnotesize 1.5$ down to &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize \sqrt 2$.&amp;lt;/span&amp;gt; To do that, we change the common difference of $\footnotesize 8$ to &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize -20 (1 - \sqrt 2)$,&amp;lt;/span&amp;gt; which is about &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize 8.2842712474619$.&amp;lt;/span&amp;gt; Reminder: the point here is that when we divide this value by &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize (1 - \sqrt 2)$,&amp;lt;/span&amp;gt; we get $\footnotesize -20$ rather than the $\footnotesize -16$ we had earlier.&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 160&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE \begin{equation}
\begin{split}
U_0 &amp;amp;= \texttt{capacity}\\
U_n &amp;amp;= U_{n-1} \times \sqrt 2 - 20 (1 - \sqrt 2)\\
U_n &amp;amp;\approx U_{n-1} \times 1.41421 + 8.2842712474619\\
\end{split}
\end{equation}
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;By the same steps shown above, this gives us the coveted:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 65&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE (c+20) \times \sqrt 2^{\lceil 2(\log_2{(x + 20)} - \log_2{(c + 20)})\rceil} - 20
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
&lt;p&gt;I.e.:&lt;/p&gt;
&lt;p&gt;&amp;lt;svg version=&quot;1.1&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 10 940 65&quot;&amp;gt;
&amp;lt;foreignObject width=&quot;100%&quot; height=&quot;100%&quot;&amp;gt;
&amp;lt;div xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;$$
\LARGE (c+20) \times \sqrt 2^{\lceil \log_{\sqrt 2}{(x + 20)} - \log_{\sqrt 2}{(c + 20)}\rceil} - 20
$$
&amp;lt;/div&amp;gt;
&amp;lt;/foreignObject&amp;gt;
&amp;lt;/svg&amp;gt;&lt;/p&gt;
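&lt;p&gt;Again, a quick Python sanity check (illustrative) that the adjusted recursion really does produce this closed form:&lt;/p&gt;

```python
import math

SQRT2 = math.sqrt(2)

def recurrence(c, n):
    # U_0 = c, U_n = U_{n-1} * sqrt(2) - 20*(1 - sqrt(2))
    u = float(c)
    for _ in range(n):
        u = u * SQRT2 - 20 * (1 - SQRT2)
    return u

def closed_form(c, n):
    # (c + 20) * sqrt(2)^n - 20
    return (c + 20) * SQRT2 ** n - 20

for c in (0, 7, 100, 1000):
    for n in range(12):
        assert math.isclose(recurrence(c, n), closed_form(c, n),
                            rel_tol=1e-12, abs_tol=1e-9)
print("closed form matches")
```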
&lt;p&gt;As mentioned before, we can find $\footnotesize \lceil\log_2{(x + 20)}\rceil - \lceil\log_2{(c + 20)}\rceil$ by doing &lt;code&gt;@clz(c + 19) - @clz(x + 19)&lt;/code&gt;. However, this is not close enough to $\footnotesize \lceil \log_{\sqrt 2}{(x + 20)} - \log_{\sqrt 2}{(c + 20)}\rceil$ for our use-case because we need at least the granularity of a $\footnotesize \log_{\sqrt 2}{}$ either way (ideally, we could use even more precision in some cases). This could be accomplished via a lookup table, or via another approximation. As an approximation, we could pretend that each odd power of $\footnotesize \sqrt 2$ is half-way between powers of $\footnotesize 2$ that fall on even powers of &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize \sqrt 2$.&amp;lt;/span&amp;gt; If you think about it, this is kind of semantically in line with what we are doing when we subtract the &lt;code&gt;@clz&lt;/code&gt; of two numbers, now with slightly more granularity. Here is how we could accomplish that:&lt;/p&gt;
&lt;p&gt;&amp;lt;!-- By AND&apos;ing the bit directly under the most significant bit with the most significant bit, then moving it to the 1&apos;s place, we can add it with double the bit index of the highest set bit: --&amp;gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Basically @clz but with double the normal granularity
fn log_sqrt_2_int(x: u64) u64 {
    assert(x != 0);
    const fls = 63 - @as(u64, @clz(x));
    const is_bit_under_most_significant_bit_set = (x &amp;amp; (x &amp;lt;&amp;lt; 1)) &amp;gt;&amp;gt; @intCast(fls);
    return (fls * 2) | is_bit_under_most_significant_bit_set;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is kind of what we are looking for, with a bit more accuracy than before. We can also scale this up even more if desired:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Kinda an approximation of 16 log2(x). Will be divided by 8 to approximate 2 log2(x).
export fn log_approx_helper(y: usize) usize {
    const COMPLEMENT = @typeInfo(usize).int.bits - 1;
    const BITS_TO_PRESERVE = @as(comptime_int, COMPLEMENT - @clz(@as(usize, 20)));

    const x = y +| 20;
    const fls: std.math.Log2Int(usize) = @intCast(COMPLEMENT - @clz(x)); // [4, 63]

    const pack_bits_under_old_msb = switch (builtin.cpu.arch) {
        // the `btc` instruction saves us a cycle on x86
        .x86, .x86_64 =&amp;gt; (x ^ (@as(usize, 1) &amp;lt;&amp;lt; fls)) &amp;gt;&amp;gt; (fls - BITS_TO_PRESERVE),
        else =&amp;gt; @as(std.meta.Int(.unsigned, BITS_TO_PRESERVE), @truncate((x &amp;gt;&amp;gt; (fls - BITS_TO_PRESERVE)))),
    };
    return (@as(usize, fls) &amp;lt;&amp;lt; BITS_TO_PRESERVE) | pack_bits_under_old_msb; // [16, 1023] on 64-bit
}

// usage (note: the helper already adds 20 internally, so we pass the raw values):
const n = 1 + (log_approx_helper(x) - log_approx_helper(c)) / 8;
// i.e.:
const n = 1 + (log_approx_helper(new_capacity) - log_approx_helper(self.capacity)) / 8;
&lt;/code&gt;&lt;/pre&gt;
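&lt;p&gt;Here is a Python port (illustrative) of the smaller &lt;code&gt;log_sqrt_2_int&lt;/code&gt; function above, checked against the true base-$\footnotesize \sqrt 2$ logarithm; the error bound in the last assertion is my own, following from the fact that the estimate only ever undershoots, and by less than $\footnotesize 2\log_2{1.5}$:&lt;/p&gt;

```python
import math

def log_sqrt_2_int(x):
    # Port of the Zig function: floor(log2(x)) doubled, plus the bit
    # directly below the most significant bit for extra granularity.
    assert x != 0
    fls = x.bit_length() - 1              # 63 - @clz(x) on a u64
    under = (x & (x << 1)) >> fls         # the bit just below the MSB
    return fls * 2 | under

for x in range(1, 1 << 16):
    true_log = math.log(x, math.sqrt(2))
    assert log_sqrt_2_int(x) <= true_log + 1e-9
    assert log_sqrt_2_int(x) > true_log - 2 * math.log2(1.5) - 1e-6
print("estimate stays within bounds")
```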
&lt;p&gt;Now that we have calculated &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize n$,&amp;lt;/span&amp;gt; the last problem is approximating &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize \sqrt 2^n$.&amp;lt;/span&amp;gt; Again, this can be done with a lookup table, or we could pretend once more that odd powers of $\footnotesize \sqrt 2$ are directly in the middle of powers of &amp;lt;span style=&quot;white-space: nowrap&quot;&amp;gt;$\footnotesize 2$.&amp;lt;/span&amp;gt; Let&apos;s try that.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fn approx_sqrt_2_pow(y: u7) u64 {
    // y is basically a fixed point integer, with the 1&apos;s place being after the decimal point
    const shift: u6 = @intCast(y &amp;gt;&amp;gt; 1);
    return (@as(u64, 1) &amp;lt;&amp;lt; shift) | (@as(u64, y &amp;amp; 1) &amp;lt;&amp;lt; (shift -| 1));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here are the estimates versus what we would get from &lt;code&gt;std.math.pow(f64, std.math.sqrt2, n)&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pow: est vs double  &amp;lt;- format
√2^1: 1 vs 1.4142135623730951
√2^3: 3 vs 2.8284271247461907
√2^5: 6 vs 5.656854249492383
√2^7: 12 vs 11.313708498984768
√2^9: 24 vs 22.627416997969544
√2^11: 48 vs 45.254833995939094
√2^13: 96 vs 90.50966799187822
√2^15: 192 vs 181.01933598375646
√2^17: 384 vs 362.038671967513
√2^19: 768 vs 724.0773439350261
√2^21: 1536 vs 1448.1546878700526
√2^23: 3072 vs 2896.3093757401057
√2^25: 6144 vs 5792.618751480213
√2^27: 12288 vs 11585.237502960428
√2^29: 24576 vs 23170.475005920864
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&amp;lt;details&amp;gt;
&amp;lt;summary&amp;gt;More&amp;lt;/summary&amp;gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;√2^31: 49152 vs 46340.950011841735
√2^33: 98304 vs 92681.9000236835
√2^35: 196608 vs 185363.80004736703
√2^37: 393216 vs 370727.60009473417
√2^39: 786432 vs 741455.2001894685
√2^41: 1572864 vs 1482910.4003789374
√2^43: 3145728 vs 2965820.800757875
√2^45: 6291456 vs 5931641.601515752
√2^47: 12582912 vs 11863283.203031506
√2^49: 25165824 vs 23726566.406063017
√2^51: 50331648 vs 47453132.81212604
√2^53: 100663296 vs 94906265.62425211
√2^55: 201326592 vs 189812531.24850425
√2^57: 402653184 vs 379625062.4970086
√2^59: 805306368 vs 759250124.9940174
√2^61: 1610612736 vs 1518500249.9880352
√2^63: 3221225472 vs 3037000499.976071
√2^65: 6442450944 vs 6074000999.952143
√2^67: 12884901888 vs 12148001999.904287
√2^69: 25769803776 vs 24296003999.808582
√2^71: 51539607552 vs 48592007999.61717
√2^73: 103079215104 vs 97184015999.23438
√2^75: 206158430208 vs 194368031998.46878
√2^77: 412316860416 vs 388736063996.9377
√2^79: 824633720832 vs 777472127993.8755
√2^81: 1649267441664 vs 1554944255987.7512
√2^83: 3298534883328 vs 3109888511975.503
√2^85: 6597069766656 vs 6219777023951.008
√2^87: 13194139533312 vs 12439554047902.018
√2^89: 26388279066624 vs 24879108095804.043
√2^91: 52776558133248 vs 49758216191608.09
√2^93: 105553116266496 vs 99516432383216.22
√2^95: 211106232532992 vs 199032864766432.47
√2^97: 422212465065984 vs 398065729532865.06
√2^99: 844424930131968 vs 796131459065730.2
√2^101: 1688849860263936 vs 1592262918131461
√2^103: 3377699720527872 vs 3184525836262922.5
√2^105: 6755399441055744 vs 6369051672525847
√2^107: 13510798882111488 vs 12738103345051696
√2^109: 27021597764222976 vs 25476206690103400
√2^111: 54043195528445952 vs 50952413380206810
√2^113: 108086391056891904 vs 101904826760413630
√2^115: 216172782113783808 vs 203809653520827300
√2^117: 432345564227567616 vs 407619307041654700
√2^119: 864691128455135232 vs 815238614083309600
√2^121: 1729382256910270464 vs 1630477228166619600
√2^123: 3458764513820540928 vs 3260954456333240000
√2^125: 6917529027641081856 vs 6521908912666482000
√2^127: 13835058055282163712 vs 13043817825332965000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&amp;lt;/details&amp;gt;&lt;/p&gt;
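&lt;p&gt;A Python port (illustrative) of &lt;code&gt;approx_sqrt_2_pow&lt;/code&gt; reproduces the estimate column of the table above:&lt;/p&gt;

```python
def approx_sqrt_2_pow(y):
    # Port of the Zig function: even powers of sqrt(2) are exact powers of 2,
    # and odd powers set one extra bit to land (bitwise) between neighboring
    # powers of 2. max() emulates Zig's saturating `-|`.
    shift = y >> 1
    return (1 << shift) | ((y & 1) << max(shift - 1, 0))

for n, expected in ((1, 1), (3, 3), (5, 6), (7, 12), (9, 24), (29, 24576)):
    assert approx_sqrt_2_pow(n) == expected
print("matches the table")
```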
&lt;p&gt;With a little polishing, this is the code I ended up with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const std = @import(&quot;std&quot;);

// Kinda an approximation of 16 log2(x). Will be divided by 8 to approximate 2 log2(x).
fn log_approx_helper(y: usize) usize {
    const COMPLEMENT = @typeInfo(usize).int.bits - 1;
    const BITS_TO_PRESERVE = @as(comptime_int, COMPLEMENT - @clz(@as(usize, 20)));

    const x = y +| 20;
    const fls: std.math.Log2Int(usize) = @intCast(COMPLEMENT - @clz(x)); // [4, 63]
    const pack_bits_under_old_msb: std.meta.Int(.unsigned, BITS_TO_PRESERVE) = @truncate(x &amp;gt;&amp;gt; (fls - BITS_TO_PRESERVE));
    return (@as(usize, fls) &amp;lt;&amp;lt; BITS_TO_PRESERVE) | pack_bits_under_old_msb; // [16, 1023] on 64-bit
}

/// Modify the array so that it can hold at least `new_capacity` items.
/// Invalidates pointers if additional memory is needed.
export fn ensureTotalCapacity(capacity: usize, new_capacity: usize) usize {
    const power = 1 + (log_approx_helper(new_capacity) -| log_approx_helper(capacity)) / 8;
    const shift: std.math.Log2Int(usize) = @intCast(power &amp;gt;&amp;gt; 1);
    const approx_sqrt_2_power = (@as(usize, 1) &amp;lt;&amp;lt; shift) | (@as(usize, power &amp;amp; 1) &amp;lt;&amp;lt; (shift -| 1));
    return @max(capacity +| (capacity / 2 + 8), (capacity +| 20) *| approx_sqrt_2_power - 20);
}

// side note: I decided to just always use 20 instead of 19 where applicable, because it is a mostly trivial difference
// and we can reuse `capacity +| 20` in 2 locations.
&lt;/code&gt;&lt;/pre&gt;
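&lt;p&gt;For anyone who wants to experiment without a Zig toolchain, here is an illustrative Python port of the final code (Python integers cannot overflow, so the saturating operators are emulated with &lt;code&gt;min&lt;/code&gt;/&lt;code&gt;max&lt;/code&gt; or omitted):&lt;/p&gt;

```python
U64_MAX = (1 << 64) - 1

def log_approx_helper(y):
    # Port of the Zig helper above: roughly 16 * log2(y + 20).
    BITS_TO_PRESERVE = 4              # 63 - @clz(@as(usize, 20)) on 64-bit
    x = min(y + 20, U64_MAX)          # `y +| 20` (saturating add)
    fls = x.bit_length() - 1          # 63 - @clz(x), i.e. floor(log2(x))
    pack = (x >> (fls - BITS_TO_PRESERVE)) & 15
    return (fls << BITS_TO_PRESERVE) | pack

def ensure_total_capacity(capacity, new_capacity):
    # `-|` (saturating subtraction) becomes max(..., 0); the saturating
    # multiply/add are omitted since these Python ints cannot overflow.
    power = 1 + max(log_approx_helper(new_capacity) - log_approx_helper(capacity), 0) // 8
    shift = power >> 1
    approx_sqrt_2_power = (1 << shift) | ((power & 1) << max(shift - 1, 0))
    return max(capacity + capacity // 2 + 8, (capacity + 20) * approx_sqrt_2_power - 20)

# e.g. growing from empty to hold 100 items, or from 1000 to hold 1600:
print(ensure_total_capacity(0, 100))
print(ensure_total_capacity(1000, 1600))
```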
&lt;p&gt;Here is the &lt;a href=&quot;https://zig.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYgAzKVpMGoAF55gpJfWQE8Ayo3QBhVLQCuLBiABMpJwAyeAyYAHKeAEaYxL5cpAAOqAqE9gyuHl6%2BCUkpAkEh4SxRMT5x1pi2qUIETMQE6Z7efuWVAtW1BPlhkdGxVjV1DZnNA53B3UW9pQCUVqjuxMjsHGgMCgQA1OvoGwCkegAiGwACeCyJdRC7Pj7b1z7T%2BwBCuxoAgq9vAPRfGwDSwXQTA2xhB8XixFQqjOTDsAg2qCoGy4ADYNrRUMAfBBVNMAHQbADqdFoGyiG3wADc8FgdhEAJ4bAAcGyIYIhUJhBEwGx86Mx2NxeM%2BVAY/OAAH0mODIaoJQhMLR4tEIPSQBt3MlTJhphqtTzdgB2F7vDZmjardYbZwAeQAssoAtg7dhQgAVPaHE4EenKgCSDH4EE1eG1%2BOCBDxEUICg2AFpkc9PuaLQIrU8/W6hBK3TaJcoAErYITYAsANWwnqOxyYCggaHOdjYEojpGt9sdztdHoTx2QtFMEBrdZD2rbPg00ynSfeyfNls2qirG0Z1xehucvI0M7eKYXGyotAU6u2eJYsIQeICAoDBGD%2Bt1%2B2rEectbvtodTpd7vjJ37g9xR49CeDYfj2ABWJ5pA2FE9F2cCDjnM193iJhkAAawlaMCAUCV3AYLBiAlNx0AlFgFAiE8CHQM9MBqPFbwgPF8OSYAQnQNsMyzHM80LYtSwrR8vWOAhiHw0RuRxT1sH2StqCPX8uOzXN8yLEty2wadgKQjZiDohYxSHWt71DTA20PBQhOcfZNyUnjVP4jTHw3DZUIwrCYzwgjomI2hSPIiJnlA354KeVE2y4DQfDghCETFFFJDjbDPiNRDZ3eH5fjtDA8CoRkCAVEFiGIJhGQUVBWQQWENkIC1QQQEiQU2eg3z2FENBCAB3CVRDcwh6V2dqau5cjhQysCA0pMQaVhTBY0SCNoljXKQXQfA4WYUk2BYEhGTwWMQkwWkxreTBVAuTZRQ2RgFAWTA3VQGpaFfPqfXraU0P69VRzMjYup6j7kC%2BvVTN1H69mNHSUNQTromXLg9h8ECIAxSVpQ5OUFSVFV/t6z6fV1OMjU3VGpRlKF5UVZViHe176SnUDmR3Pc002BQEFyggqJo88CqvG9BBMsNl1OQRX3WCBElh4hpNk5EgJNXd51Z9lZQlBQAEc6glHwJSluGnw2IyR31CKrJsrYOaoAhnM3Y2hd%2B/WZeuNEuHNvQ7fZzn42J%2BWFZ0vSCAMk5z1UWn8dXJHffDoGfUZvk12ZWYjbx2PI/XTcJ11AAqX30bVzXtd1p3fyzndUo4WZaE4cDeG8bheFQTgAC0LC2eZFgNaK9F4LmOC0KdSAVJhCMoWZ0JASQNDxABOFEZ9RHwZ5n8DwJnyRDS4cDDE4SReBYCQNA0Uh660Ugm44Xhj2PvuB9IOBYBgKAn4gJAG3iOhonISh38/mJTFEvhdCfA6DcmIMeCAlF%2B68GjMwYgaoG6kFgbUekNoIjaAqH3HgpAGxsEEDaBgtAEFnywBEdwwBXy0CPJwbBWBzwmHENA0g%2BA9KVEpHNTQvAzoVHcNyGhvBFrVyYbQPAEQSrwNcFgThpBRJnH4aQdhxAIhJEwAcTA9DgAiJMJw2Yh4mDAAUGWPAmBOo2mVPXbB/BBAiDEOwKQMhBCKBUOoJhug4hGG0eYSwIiIjHkgLMVA8QNrHg4HGG0GwCyKkwLWTAAAxVqcYBjADohsVQTIUQSgSnGLR7glxxhYMgeI7gnymAYIove59FHEBpBw/xVgomtG8BAJwQxvBxECOMQoxQJDZGSBtVpPTEh9NSF0LpvQygNI2
u0QYbhGg9JaFM0YoyegxDKKMAZayOjLMmKs2Y5UFhLH0FXGuddpEXw2F41kYkGDoSNrgQgJBEZ6D0NMXuOjZjD1HhAceIBwLHyERUg%2BfyT5nM4FfEAN8dH3xfkgIgbhv4QFqAYsFpAkUKGUMYRUQgGqdQsbwX%2B9BiChFYMsS5gCblcPWiQGk%2BgHHCFEOIexVj5BKDUNItxpAdrMDQPcogxAAAStYniYEYG8XhqAzGMCOvIrlaFUC8pIJKvFnKBByuqPgZV2xggoq1QwDFIRaDYphsquFtA/ToBAFQykLB8kSWucAzqJV4j8OORwWuIKmEXytTagpwIyX2ruVS523c2yuHOH/J5bs3nQMHp83o3yd4cEBb84%2Bp9G46ohTI95pAJ7AqET3D1Z8L7Rrvg/Z%2B8BX4oFQOGwlCKCW9AAfakBtAwEQKgWfZB8D5GdtQegzB8jcGMAIAQoh0jSHkModQxBdDjCaOWCQvArC7DsJCWfbhyBeHLGwYI6RPjxH0kkfO3u1SD6IMUcopQaiNFaNADGvgRgDFGJMUq%2BRzKbGMukMypxbLXFZA8WYCwhhRF%2BITYE4JnAwkRKiTE%2BJVpElIpSUwWoyAEBZJyXkgpRSnzo3oGRAtqAqk1JA3syZqRHAEQ2f4Ai2zulxCGbkNIszMh0ZyBtGj4z6k2EWR0SjCyqhLM6Ss%2BZ6ymNtP6FswTOyJB7I7ocl5ib3VpvPpwC5bdyW3IgAq4NzzXlZpjR86JXyfl5t3vvFNhb02XysJm2%2BldoUVthageFFBEXEGRaEzDxTDg4cwPkgtaL9VYpxUp%2BtRKSWcH9UAylDzqkWoMG%2BhldjP2yG/S4s%2Bug/Cyp5UGwVChhWivFUq6ViCsvyqDS%2Bkrqq0Dqu1Yg3VOrqLBEC4a4L8jTXmstbQa1trYRNtEiKmlx9HXShdQp05nrODep66p4AVygGBpi08gw1pq0f0JU8h4JbK4GZHvG11ybgVKeLdZyF%2Bmc3mfzeNotKLbOD1zTPPEGg9AaCZHoQ0R9JBT3Aj4JkiafBmciqm0FVnbuuv%2BxZ5TIPs2KOSA4SQQA&quot;&gt;godbolt link&lt;/a&gt;. According to llvm-mca, on a Zen 4 system this &lt;code&gt;ensureTotalCapacity&lt;/code&gt; function would take ~22 cycles to execute. The &lt;a href=&quot;https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/software-optimization-guides/57647.zip&quot;&gt;Zen 4 optimization manual&lt;/a&gt; says&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The branch misprediction penalty is in the range from 11 to 18 cycles, depending on the type of mispredicted branch and whether or not the instructions are being fed from the Op Cache. The common case penalty is 13 cycles.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That means if we expect our original loop to run for at least 10 cycles, plus a branch mispredict penalty, this approximation would be faster (while reducing the overall pressure on the branch predictor). I&apos;m not sure we actually expect this loop to branch backwards anyway, but if we did, this could be a good alternative.&lt;/p&gt;
&lt;p&gt;Before:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ensureTotalCapacity:
        mov     rcx, -1
        mov     rax, rdi
.LBB0_1:
        mov     rdx, rax
        shr     rdx
        add     rdx, 8
        add     rax, rdx
        cmovb   rax, rcx
        cmp     rax, rsi
        jb      .LBB0_1
        ret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ensureTotalCapacity:
        add     rsi, 20
        mov     rcx, -1
        mov     dl, 59
        mov     r8b, 59
        cmovb   rsi, rcx
        lzcnt   rax, rsi
        sub     dl, al
        shl     eax, 4
        shrx    rdx, rsi, rdx
        and     edx, 15
        or      rdx, rax
        mov     rax, rdi
        xor     rdx, 1008
        add     rax, 20
        cmovb   rax, rcx
        lzcnt   rsi, rax
        sub     r8b, sil
        shl     esi, 4
        shrx    r8, rax, r8
        and     r8d, 15
        or      r8, rsi
        xor     esi, esi
        xor     r8, 1008
        sub     rdx, r8
        mov     r8d, 0
        cmovae  r8, rdx
        shr     r8d, 3
        inc     r8
        mov     edx, r8d
        and     r8d, 1
        shr     dl
        mov     r9d, edx
        and     r9b, 63
        sub     r9b, 1
        movzx   r9d, r9b
        cmovb   r9d, esi
        shlx    rsi, r8, r9
        mov     r9, rdi
        shr     r9
        bts     rsi, rdx
        add     r9, 8
        add     r9, rdi
        cmovb   r9, rcx
        mul     rsi
        cmovo   rax, rcx
        add     rax, -20
        cmp     r9, rax
        cmova   rax, r9
        ret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The original version of this article was posted &lt;a href=&quot;https://github.com/ziglang/zig/issues/15574&quot;&gt;here&lt;/a&gt;. Please don&apos;t comment on that issue, but feel free to read the comments there.&lt;/p&gt;
&lt;p&gt;Anyway, if you made it this far, thanks for reading!&lt;/p&gt;
&lt;p&gt;‒ Validark&lt;/p&gt;
</content:encoded></item><item><title>Vector Compression in Interleaved Space on ARM</title><link>https://validark.dev/posts/vector-compression-in-interleaved-space-on-arm/</link><guid isPermaLink="true">https://validark.dev/posts/vector-compression-in-interleaved-space-on-arm/</guid><description>How to emulate a VPCOMPRESSB on interleaved vectors</description><pubDate>Mon, 09 Sep 2024 12:56:30 GMT</pubDate><content:encoded>&lt;p&gt;In my last article, &lt;a href=&quot;../interleaved-vectors-on-arm/&quot;&gt;&lt;em&gt;Use interleaved vectors for parsing on ARM&lt;/em&gt;&lt;/a&gt;, I covered the three main algorithms we need to support on interleaved vectors for high performance parsing of &lt;a href=&quot;https://github.com/simdutf/simdutf/issues/428&quot;&gt;utf8&lt;/a&gt;, &lt;a href=&quot;https://github.com/simdjson/simdjson&quot;&gt;JSON&lt;/a&gt;, or &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser&quot;&gt;Zig&lt;/a&gt; for aarch64/ARM architectures. Namely, the &lt;a href=&quot;../interleaved-vectors-on-arm/#movemask&quot;&gt;movemask&lt;/a&gt; and &lt;a href=&quot;../interleaved-vectors-on-arm/#unmovemask&quot;&gt;unmovemask&lt;/a&gt; routines, as well as the &lt;a href=&quot;../interleaved-vectors-on-arm/#elementwise-shifts&quot;&gt;elementwise shift&lt;/a&gt; replacement, all of which are more efficient when performed on interleaved vectors (except when shifting by a multiple of 16). I also briefly explained that we can perform prefix sum operations using these elementwise shifts, but I did not explain why we might want to. The answer is vector compression.&lt;/p&gt;
&lt;h2&gt;Vector Compression&lt;/h2&gt;
&lt;p&gt;In high performance parsers, we sometimes produce a vector bytemask and/or a bitmask which indicates certain elements we want to keep, and the rest should be thrown out. On the latest and greatest x86-64 hardware (with &lt;a href=&quot;https://en.wikipedia.org/wiki/AVX-512&quot;&gt;AVX-512&lt;/a&gt;), we have the &lt;code&gt;VPCOMPRESSB&lt;/code&gt; instruction, which can extract just those bytes from a 64-byte vector corresponding to a bitmask we pass in.&lt;/p&gt;
&lt;p&gt;On Arm, however, there is no &lt;code&gt;VPCOMPRESSB&lt;/code&gt;, nor is there an interleaved equivalent. So, per usual, we have to roll our own.&lt;/p&gt;
&lt;p&gt;Luckily, we can reproduce the semantics of &lt;code&gt;VPCOMPRESSB&lt;/code&gt; by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Finding the (exclusive) prefix sum of the bytemask.&lt;/li&gt;
&lt;li&gt;Adding the prefix sum to the identity counting vector.&lt;/li&gt;
&lt;li&gt;Simultaneously shifting each element by its corresponding amount, calculated in step 2.&lt;/li&gt;
&lt;/ol&gt;
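&lt;p&gt;As a scalar sketch (plain Python, not the actual SIMD code; the names are mine), the three steps look like this:&lt;/p&gt;

```python
def compress_model(data, keep):
    # Step 1: exclusive prefix sum of a mask holding -1 per kept byte.
    prefix, acc = [], 0
    for k in keep:
        prefix.append(acc)
        acc -= k
    # Step 2: add the identity counting vector 0, 1, 2, ...; this gives
    # each kept byte's leftward travel distance (= discarded bytes before it).
    travel = [i + p for i, p in enumerate(prefix)]
    # Step 3: shift each kept byte left by its travel distance.
    out = [0] * len(data)
    for i, k in enumerate(keep):
        if k:
            out[i - travel[i]] = data[i]
    return out[:sum(keep)]

print(compress_model([10, 20, 30, 40], [1, 0, 1, 1]))  # [10, 30, 40]
```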
&lt;h2&gt;Prefix Sum&lt;/h2&gt;
&lt;p&gt;As in the previous article, the prefix-sum of a vector can be computed like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fn prefixSum(vec_: @Vector(64, u8)) @Vector(64, u8) {
    var vec = vec_;
    inline for (0..6) |i| { // iterates from 0 to 5
        vec += std.simd.shiftElementsRight(vec, 1 &amp;lt;&amp;lt; i, 0);
    }
    return vec;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And, as first shown in the previous article, we can abstract away the &lt;a href=&quot;../interleaved-vectors-on-arm/#elementwise-shifts&quot;&gt;elementwise vector shifts&lt;/a&gt; with a helper function. That way, our interleaved version looks like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fn prefixSum(vec_: @Vector(64, u8)) @Vector(64, u8) {
    var vec = vec_;
    inline for (0..6) |i| { // iterates from 0 to 5
        vec += shiftInterleavedElementsRight(vec, 1 &amp;lt;&amp;lt; i, 0);
    }
    return vec;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It looks the same!&lt;/p&gt;
&lt;p&gt;Since we abstracted away the interleaved shift, we can think of it as though it&apos;s just a normal-ordered vector. So here is what a prefix sum looks like for a normally-ordered vector:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://raw.githubusercontent.com/Validark/validark.github.io/main/src/content/posts/vector-compression-in-interleaved-space-on-arm/prefix_sum.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;As depicted above, we first shift the vector right by one, then add that to itself. Then we shift the result of that addition to the right by two, and add that to our previous result. If we continue this pattern, every column ends up holding the sum of itself and all the bytes that came before it.&lt;/p&gt;
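&lt;p&gt;A scalar Python model of this shift-and-add pattern (illustrative only) makes the log-step structure explicit:&lt;/p&gt;

```python
def prefix_sum_log(vec):
    # Shift the vector right by 1, 2, 4, ... elements (filling with 0)
    # and accumulate; after log2(n) rounds each slot holds the sum of
    # itself and every element before it (an inclusive prefix sum).
    n, out, shift = len(vec), list(vec), 1
    while shift < n:
        out = [out[i] + (out[i - shift] if i >= shift else 0)
               for i in range(n)]
        shift *= 2
    return out

print(prefix_sum_log([1, 2, 3, 4]))  # [1, 3, 6, 10]
```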
&lt;p&gt;To apply this to vector compression, start with a &lt;code&gt;-1&lt;/code&gt; in each slot you intend to keep, and a &lt;code&gt;0&lt;/code&gt; otherwise:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./prefix_sum2.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Next, shift our result to the right by 1 because we want the exclusive prefix sum (if possible, it&apos;s more efficient to do this beforehand). Then add the final result to the identity counting vector.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./prefix_sum3.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This resulting &lt;code&gt;travel_distances&lt;/code&gt; vector contains, for the (red) elements we care about, how many slots they each need to be shifted leftwards.&lt;/p&gt;
&lt;h2&gt;Compression via prefix-sum&lt;/h2&gt;
&lt;p&gt;Next, we shift each element left by the &lt;code&gt;travel_distances&lt;/code&gt; we calculated from the &lt;code&gt;prefix_sums&lt;/code&gt; vector. To accomplish this, we shift the vector by successive powers of 2, each time keeping only values whose binary representation has a 1 bit in the place value corresponding to the current power of 2, otherwise keeping the previous value. E.g., if we want to shift an element by a total of &lt;code&gt;5&lt;/code&gt; slots, we will shift it during the 1-shift and 4-shift stage, because the binary representation of &lt;code&gt;5&lt;/code&gt; is &lt;code&gt;0b101&lt;/code&gt; (i.e. 1+4).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./compress.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Compresses the elements in `data` corresponding to the non-zero bytes of the condition
// vector from which `prefix_sums` was computed. Note this returns a compressed 64-byte vector in
// interleaved space, meaning that if you want to write it out to memory, you need the `st4` instruction.
fn compress(data: @Vector(64, u8), prefix_sums: @Vector(64, u8)) @Vector(64, u8) {
    const indices = comptime @as(@Vector(64, u8), @bitCast(std.simd.deinterlace(4, std.simd.iota(u8, 64))));
    var travel_distances = indices +% prefix_sums;
    var compressed_data = data;

    inline for (0..6) |x| {
        const i = 1 &amp;lt;&amp;lt; x;
        const shifted_travel_distances = shiftInterleavedElementsLeft(travel_distances, i, 0);
        const shifted_compressed_data = shiftInterleavedElementsLeft(compressed_data, i, 0);
        const selector = cmtst(shifted_travel_distances, @splat(i));
        travel_distances = bsl(selector, shifted_travel_distances, travel_distances);
        compressed_data = bsl(selector, shifted_compressed_data, compressed_data);
    }

    return compressed_data;
}

fn cmtst(a: anytype, comptime b: @TypeOf(a)) @TypeOf(a) {
    return @select(u8, (a &amp;amp; b) != @as(@TypeOf(a), @splat(0)), @as(@TypeOf(a), @splat(0xff)), @as(@TypeOf(a), @splat(0)));
}

fn bsl(selector: anytype, a: @TypeOf(selector), b: @TypeOf(selector)) @TypeOf(selector) {
    return (a &amp;amp; selector) | (b &amp;amp; ~selector);
}
&lt;/code&gt;&lt;/pre&gt;
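&lt;p&gt;To make the selection logic concrete, here is a scalar Python model of the loop above (illustrative; the real code operates on interleaved Neon vectors). At each power-of-two round, a lane accepts the shifted-in value only if that value still needs to move by the current power of two:&lt;/p&gt;

```python
def compress_by_shifts(data, travel):
    # `travel[i]` is how many lanes element i must move left.
    # Shift data and travel distances left by 1, 2, 4, ... lanes; take
    # the shifted-in value wherever its travel distance has the current
    # bit set (the cmtst + bsl pair). Trailing lanes hold garbage,
    # just like in the SIMD version.
    n = len(data)
    data, travel = list(data), list(travel)
    step = 1
    while step < n:
        shifted_t = travel[step:] + [0] * step  # elementwise left shift
        shifted_d = data[step:] + [0] * step
        for i in range(n):
            if shifted_t[i] & step:
                travel[i] = shifted_t[i]
                data[i] = shifted_d[i]
        step *= 2
    return data

# Travel distances for keeping lanes 0, 2, 3 of a 4-lane vector:
print(compress_by_shifts([10, 20, 30, 40], [0, 0, 1, 1])[:3])  # [10, 30, 40]
```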
&lt;p&gt;Unfortunately, this method requires a logarithmic number of steps, both in the prefix sum and the compression stage. It also has a strict serial dependency chain from start to finish.&lt;/p&gt;
&lt;p&gt;We can do better by finding the prefix sum of each group of 8 or 16, and writing out to memory 8 or 4 times instead of once.&lt;/p&gt;
&lt;h2&gt;Breaking it up&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;./vector-compress-2.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In order to prefix-sum and vector-compress groups of 8, 16 or 32, we need to be careful that our &lt;code&gt;shiftInterleavedElementsLeft&lt;/code&gt; and &lt;code&gt;shiftInterleavedElementsRight&lt;/code&gt; functions do not cross the boundaries we set. If we want to vector-compress groups of 8, our elementwise shift emulation must not add element 7 to element 8, 15 to 16, 23 to 24, etc. Luckily, we have instructions that put barriers between these groups: the &lt;code&gt;shl&lt;/code&gt;/&lt;code&gt;shr&lt;/code&gt; instructions! For groups of 8, we use 2-byte-granularity shifts; for groups of 16, 4-byte-granularity shifts; and for groups of 32, 8-byte-granularity shifts.&lt;/p&gt;
&lt;p&gt;I define a custom &lt;code&gt;shiftElementsLeft&lt;/code&gt; upon which &lt;code&gt;shiftInterleavedElementsLeft&lt;/code&gt; is built (see definition of &lt;code&gt;shiftInterleavedElementsRight&lt;/code&gt; &lt;a href=&quot;../interleaved-vectors-on-arm/#shiftInterleavedElementsRight&quot;&gt;here&lt;/a&gt;), which has a comptime &lt;code&gt;boundary&lt;/code&gt; parameter. When the &lt;code&gt;boundary&lt;/code&gt; is smaller than a &lt;code&gt;u128&lt;/code&gt;, we will use a 16-, 32-, or 64-bit shift:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fn shiftElementsLeft(vec: @Vector(16, u8), comptime amount: std.simd.VectorCount(@Vector(64, u8)), comptime boundary: type) @Vector(16, u8) {
    return if (boundary == u128)
        std.simd.shiftElementsLeft(vec, amount, 0)
    else
        @bitCast(@as(@Vector(16 / @sizeOf(boundary), boundary), @bitCast(vec)) &amp;gt;&amp;gt; @splat(8*amount));
}
&lt;/code&gt;&lt;/pre&gt;
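&lt;p&gt;To see why a word-granularity shift respects group boundaries, consider this scalar Python model of a 64-bit value holding 8 one-byte lanes (the function name is mine). A plain integer shift applied per word moves lanes within a group but never across one:&lt;/p&gt;

```python
def shift_lanes_left_in_groups(x, group_bytes, amount):
    # Split the 64-bit value into words of `group_bytes` bytes and
    # shift each word independently: lanes move toward lane 0, and
    # zeroes (not a neighboring group's lanes) fill in at the boundary.
    mask = (1 << (group_bytes * 8)) - 1
    out = 0
    for g in range(8 // group_bytes):
        word = (x >> (g * group_bytes * 8)) & mask
        out |= ((word >> (8 * amount)) & mask) << (g * group_bytes * 8)
    return out

# Lanes 0..7 hold 0x01..0x08; shift left by 1 within 2-byte groups:
x = 0x0807060504030201
print(hex(shift_lanes_left_in_groups(x, 2, 1)))
```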
&lt;p&gt;This allows us to get our &lt;code&gt;st4&lt;/code&gt; instructions started earlier, with a lot more parallelism:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;inline for (0..64 / WIDTH) |i| {
    st4(
        dest[if (i == 0) 0 else prefix_sum_of_offsets[i*(WIDTH / 8) - 1]..],
        shiftInterleavedElementsLeft(compressed_data, WIDTH*i, u128)
    );
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where &lt;code&gt;prefix_sum_of_offsets&lt;/code&gt; is defined like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;comptime var prefix_sum_multiplier = 0;
inline for (0..64 / WIDTH) |i| prefix_sum_multiplier |= 1 &amp;lt;&amp;lt; i*WIDTH;
const prefix_sum_of_offsets: [8]u8 = @bitCast(
@as([2]u64, @bitCast(
    uzp2(
        neg(
            @as([4]@Vector(16, u8), @bitCast(prefix_sums))[3]
        )
    )
))[0] *% prefix_sum_multiplier);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This takes the bottommost prefix-sums vector, corresponding to what was originally &lt;code&gt;3, 7, 11, 15, 19&lt;/code&gt;, etc, then takes the arithmetic negative of each element, then extracts the odd bytes into a &lt;code&gt;u64&lt;/code&gt;, then multiplies by a constant &lt;code&gt;prefix_sum_multiplier&lt;/code&gt; that has every &lt;code&gt;WIDTH&lt;/code&gt; bits set to 1. E.g. when &lt;code&gt;WIDTH&lt;/code&gt; is 8, it will multiply by &lt;code&gt;0x0101010101010101&lt;/code&gt;, which will compute the byte-wise prefix-sum. When &lt;code&gt;WIDTH&lt;/code&gt; is 16, it will multiply by &lt;code&gt;0x0001000100010001&lt;/code&gt;, computing the 2-byte-wise prefix-sum. Quickly producing the &lt;code&gt;prefix_sum_of_offsets&lt;/code&gt; with a multiply instead of a serial dependency chain of add instructions allows us to calculate the destination pointers in parallel.&lt;/p&gt;
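&lt;p&gt;The multiply trick is easy to check with plain integers. With 8 byte-sized counts packed into a &lt;code&gt;u64&lt;/code&gt; (little-endian), multiplying by &lt;code&gt;0x0101010101010101&lt;/code&gt; makes byte &lt;code&gt;k&lt;/code&gt; of the product the sum of bytes &lt;code&gt;0..k&lt;/code&gt;, i.e. a byte-wise inclusive prefix sum, so long as no partial sum overflows a byte:&lt;/p&gt;

```python
counts = [3, 1, 4, 1, 5, 9, 2, 6]
packed = int.from_bytes(bytes(counts), "little")
# One multiply computes all 8 running sums at once (truncated to 64 bits).
product = (packed * 0x0101010101010101) & ((1 << 64) - 1)
prefix = list(product.to_bytes(8, "little"))
print(prefix)  # [3, 4, 8, 9, 14, 23, 25, 31]
```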
&lt;p&gt;Putting it all together: (&lt;a href=&quot;https://zig.godbolt.org/z/6enM4KxK6&quot;&gt;full code here&lt;/a&gt;)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const WIDTH = 16; // How many elements to operate on at once

/// Compresses the elements in `data` corresponding to the `condition` vector.
/// Writes to `dest`, including a number of undefined bytes.
/// In total, this expression gives the number of bytes written past `dest`:
/// switch (WIDTH) {
///    8, 16, 32 =&amp;gt; (64 - WIDTH) + 32,
///    64 =&amp;gt; 64,
/// }
export fn compress(data: @Vector(64, u8), condition: @Vector(64, u8), dest: [*]u8) u8 {
    const U = std.meta.Int(.unsigned, WIDTH*2);
    const indices = comptime @as(@Vector(64, u8), @bitCast(std.simd.deinterlace(4, std.simd.iota(u8, 64) &amp;amp; @as(@Vector(64, u8), @splat(WIDTH - 1)))));

    var prefix_sums = @select(u8, condition != @as(@Vector(64, u8), @splat(0)),
        @as(@Vector(64, u8), @splat(255)),
        @as(@Vector(64, u8), @splat(0)),
    );

    // Next, shift elements right by 1, 2, 4, 8, 16, and 32, and accumulate at each step
    inline for (0..std.math.log2(WIDTH)) |i| {
        prefix_sums +%= shiftInterleavedElementsRight(prefix_sums, 1 &amp;lt;&amp;lt; i, U);
    }

    comptime var prefix_sum_multiplier = 0;
    inline for (0..64 / WIDTH) |i| prefix_sum_multiplier |= 1 &amp;lt;&amp;lt; i*WIDTH;
    const prefix_sum_of_offsets: [8]u8 = @bitCast(
    @as([2]u64, @bitCast(
        uzp2(
            neg(
                @as([4]@Vector(16, u8), @bitCast(prefix_sums))[3]
            )
        )
    ))[0] *% prefix_sum_multiplier);

    // Now take the identity indices and add it to the prefix_sums.
    // This value tells us how far each value should be left-shifted
    var travel_distances = indices +% shiftInterleavedElementsRight(prefix_sums, 1, U);
    var compressed_data = data;

    inline for (0..std.math.log2(WIDTH)) |x| {
        const i = 1 &amp;lt;&amp;lt; x;
        const shifted_left = shiftInterleavedElementsLeft(travel_distances, i, U);
        const shifted_compressed_data = shiftInterleavedElementsLeft(compressed_data, i, U);
        const selector = cmtst(shifted_left, @splat(i));
        travel_distances = bsl(selector, shifted_left, travel_distances);
        compressed_data = bsl(selector, shifted_compressed_data, compressed_data);
    }

    inline for (0..64 / WIDTH) |i| {
        (if (WIDTH == 64) st4 else st4_first_32)(
            dest[if (i == 0) 0 else prefix_sum_of_offsets[i*(WIDTH / 8) - 1]..],
            shiftInterleavedElementsLeft(compressed_data, WIDTH*i, u128)
        );
    }

    return prefix_sum_of_offsets[7];
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Subject to the following issues:&lt;/p&gt;
&lt;p&gt;&amp;lt;div id=&quot;issue-dump&quot; style=&quot;display: flex; flex-direction: column; align-items: center; align-self: flex-start&quot;&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;/div&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;script&amp;gt;
document.getElementById(&quot;issue-dump&quot;).innerHTML = [107438, 107423, 107404, 107243, 107099, 107093, 107088].map(i =&amp;gt; `&amp;lt;div class=&quot;individual-issue&quot;&amp;gt;
&amp;lt;div&amp;gt;&amp;lt;svg id=&quot;issue-indicator-${i}&quot; class=&quot;issue-indicator issue-indicator-unknown&quot; viewBox=&quot;0 0 16 16&quot; version=&quot;1.1&quot; width=&quot;20&quot; height=&quot;20&quot; aria-hidden=&quot;true&quot;&amp;gt;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#59636e&quot; d=&quot;M8 0a8 8 0 1 1 0 16A8 8 0 0 1 8 0ZM1.5 8a6.5 6.5 0 1 0 13 0 6.5 6.5 0 0 0-13 0Zm9.78-2.22-5.5 5.5a.749.749 0 0 1-1.275-.326.749.749 0 0 1 .215-.734l5.5-5.5a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&amp;lt;/svg&amp;gt;&amp;lt;/div&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;div&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;p&amp;gt;&amp;lt;a href=&quot;https://github.com/llvm/llvm-project/issues/${i}&quot;&amp;gt;llvm/llvm-project#${i}&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;`).join(&quot;\n\n&quot;);
&amp;lt;/script&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;script is:inline&amp;gt;
{
const open = &apos;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#1a7f37&quot; d=&quot;M8 9.5a1.5 1.5 0 1 0 0-3 1.5 1.5 0 0 0 0 3Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#1a7f37&quot; d=&quot;M8 0a8 8 0 1 1 0 16A8 8 0 0 1 8 0ZM1.5 8a6.5 6.5 0 1 0 13 0 6.5 6.5 0 0 0-13 0Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&apos;;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const closed_completed = &apos;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#8250df&quot; d=&quot;M11.28 6.78a.75.75 0 0 0-1.06-1.06L7.25 8.69 5.78 7.22a.75.75 0 0 0-1.06 1.06l2 2a.75.75 0 0 0 1.06 0l3.5-3.5Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#8250df&quot; d=&quot;M16 8A8 8 0 1 1 0 8a8 8 0 0 1 16 0Zm-1.5 0a6.5 6.5 0 1 0-13 0 6.5 6.5 0 0 0 13 0Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&apos;;

const closed_not_planned = &apos;&amp;lt;path stroke=&quot;none&quot; fill=&quot;#59636e&quot; d=&quot;M8 0a8 8 0 1 1 0 16A8 8 0 0 1 8 0ZM1.5 8a6.5 6.5 0 1 0 13 0 6.5 6.5 0 0 0-13 0Zm9.78-2.22-5.5 5.5a.749.749 0 0 1-1.275-.326.749.749 0 0 1 .215-.734l5.5-5.5a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042Z&quot;&amp;gt;&amp;lt;/path&amp;gt;&apos;;
for (const issue_id of [107438, 107423, 107404, 107243, 107099, 107093, 107088]) {
    fetch(`https://api.github.com/repos/llvm/llvm-project/issues/${issue_id}`)
        .then(e =&amp;gt; e.json())
        .then(e =&amp;gt; {
            const svg = e.state === &quot;open&quot; ? open : e.state_reason === &quot;completed&quot; ? closed_completed : closed_not_planned;

            for (const e of document.getElementsByClassName(&quot;issue-indicator&quot;)) {
                if (e.id === `issue-indicator-${issue_id}`) {
                    e.classList.remove(&quot;issue-indicator-unknown&quot;);
                    e.innerHTML = svg;
                }
            }
        })
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;}
&amp;lt;/script&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;style&amp;gt;
.issue-indicator-unknown {
transform: &quot;rotate(45)&quot;;
}&lt;/p&gt;
&lt;p&gt;.individual-issue {
display: flex; flex-direction: row; align-items: center; align-self: flex-start;
}&lt;/p&gt;
&lt;p&gt;.individual-issue &amp;gt; div {
margin-right: 0.5em;
}&lt;/p&gt;
&lt;p&gt;.individual-issue &amp;gt; div + div {
height: 2.3em;
}&lt;/p&gt;
&lt;p&gt;.individual-issue &amp;gt; div + div &amp;gt; p {
margin-top: 0;
margin-bottom: 0;
line-height: 1.5;
white-space: nowrap;
}&lt;/p&gt;
&lt;p&gt;p + div#issue-dump {
margin-bottom: 0.5em;
}
&amp;lt;/style&amp;gt;&lt;/p&gt;
&lt;h2&gt;Technique II: Lookup table&lt;/h2&gt;
&lt;p&gt;Unfortunately, it seems that even with those issues fixed, it&apos;s still going to be more efficient to use a lookup table, if you can afford to consume 2KiB of your precious cache. (&lt;a href=&quot;https://zig.godbolt.org/z/rKhEzdnPf&quot;&gt;Godbolt link&lt;/a&gt;)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export fn compress(interleaved_data: @Vector(64, u8), bitstring: u64, dest: [*]u8) void {
    comptime var lookups: [256]@Vector(8, u8) = undefined;
    comptime {
        @setEvalBranchQuota(100000);
        for (&amp;amp;lookups, 0..) |*slot, i| {
            var pos: u8 = 0;
            for (0..8) |j| {
                const bit: u1 = @truncate(i &amp;gt;&amp;gt; j);
                slot[pos] = j / 4 + (j &amp;amp; 3) * 16;
                pos += bit;
            }

            for (pos..8) |j| {
                slot[j] = 255;
            }
        }
    }

    const chunks: [4]@Vector(16, u8) = @bitCast(interleaved_data);

    const prefix_sum_of_popcounts =
        @as(u64, @bitCast(@as(@Vector(8, u8), @popCount(@as(@Vector(8, u8), @bitCast(bitstring))))))
            *% 0x0101010101010101;

    inline for (@as([8]u8, @bitCast(bitstring)), @as([8]u8, @bitCast(prefix_sum_of_popcounts)), 0..)
    |byte, pos, i| {
        dest[pos..][0..8].* = tbl4(
            chunks[0],
            chunks[1],
            chunks[2],
            chunks[3],
            lookups[byte] +| @as(@Vector(8, u8), @splat(2*i))
        );
    }
}

fn tbl4(
    table_part_1: @Vector(16, u8),
    table_part_2: @Vector(16, u8),
    table_part_3: @Vector(16, u8),
    table_part_4: @Vector(16, u8),
    indices: @Vector(8, u8)
) @TypeOf(indices) {
    return struct {
        extern fn @&quot;llvm.aarch64.neon.tbl4&quot;(@TypeOf(table_part_1), @TypeOf(table_part_2), @TypeOf(table_part_3), @TypeOf(table_part_4), @TypeOf(indices)) @TypeOf(indices);
    }.@&quot;llvm.aarch64.neon.tbl4&quot;(table_part_1, table_part_2, table_part_3, table_part_4, indices);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This technique also has to use &lt;code&gt;tbl4&lt;/code&gt; because it is deinterleaving the data at the same time as compressing. In normal space, you would just use &lt;code&gt;tbl1&lt;/code&gt;. But hey, as long as there is no serial dependency, you only eat the latency once.&lt;/p&gt;
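&lt;p&gt;The shape of the table is easy to see with a scalar model. Here it is for normal (non-interleaved) order: entry &lt;code&gt;m&lt;/code&gt; lists the positions of the set bits of the byte &lt;code&gt;m&lt;/code&gt;, padded with 255 (an out-of-range &lt;code&gt;tbl&lt;/code&gt; index, which yields 0):&lt;/p&gt;

```python
lookup = []
for m in range(256):
    # Positions of the set bits of m, in order, padded to 8 entries.
    slot = [j for j in range(8) if (m >> j) & 1]
    lookup.append(slot + [255] * (8 - len(slot)))

print(lookup[0b00001011])  # [0, 1, 3, 255, 255, 255, 255, 255]
```

&lt;p&gt;The interleaved version above stores &lt;code&gt;j / 4 + (j &amp;amp; 3) * 16&lt;/code&gt; instead of &lt;code&gt;j&lt;/code&gt;, so the same lookup deinterleaves as it compresses.&lt;/p&gt;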
&lt;p&gt;Now go compress your interleaved vectors, you glorious vectorizers!&lt;/p&gt;
&lt;p&gt;‒ Validark&lt;/p&gt;
</content:encoded></item><item><title>Use interleaved vectors for parsing on ARM</title><link>https://validark.dev/posts/interleaved-vectors-on-arm/</link><guid isPermaLink="true">https://validark.dev/posts/interleaved-vectors-on-arm/</guid><description>Why using interleaved vectors via LD4 results in efficiency gains</description><pubDate>Tue, 03 Sep 2024 13:15:30 GMT</pubDate><content:encoded>&lt;p&gt;&amp;lt;!-- translate3d(6px, 115px, 0px) scale3d(6.4,6.4,6.4); --&amp;gt;&lt;/p&gt;
&lt;p&gt;When parsing, we can take advantage of data-level parallelism by operating on more than one byte at a time.
Modern CPU architectures provide us with &lt;a href=&quot;https://en.wikipedia.org/wiki/Single_instruction,_multiple_data&quot;&gt;SIMD (single-instruction, multiple-data)&lt;/a&gt; instructions which allow us to do some operation on all bytes in a vector simultaneously.
On the latest and greatest x86-64 hardware (with &lt;a href=&quot;https://en.wikipedia.org/wiki/AVX-512&quot;&gt;AVX-512&lt;/a&gt;), we have hardware support for 64-byte vectors.
This is convenient because we often want to do a &lt;a href=&quot;#movemask&quot;&gt;movemask&lt;/a&gt; operation which reduces the 64-byte vector into a 64-bit bitstring. Each bit in this &quot;mask&quot; corresponds to a byte in the vector, and can tell us some piece of information like &quot;is it a whitespace?&quot;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./bitstring-reduction-dark.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Because modern hardware also has nice bitstring-query operations, we can efficiently do, e.g., a &lt;a href=&quot;https://en.wikipedia.org/wiki/Find_first_set&quot;&gt;count-trailing-zeroes&lt;/a&gt; operation in 1 or 2 cycles to answer questions like &quot;how many non-whitespace characters are there at the start of the vector?&quot;&lt;/p&gt;
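&lt;p&gt;A scalar Python model of this movemask-plus-query pattern (illustrative only; the real code uses SIMD compares):&lt;/p&gt;

```python
def movemask_spaces(chunk: bytes) -> int:
    # Bit i of the result is 1 iff byte i of the chunk is a space.
    mask = 0
    for i, b in enumerate(chunk):
        if b == ord(" "):
            mask |= 1 << i
    return mask

chunk = b"const x" + b" " * 57  # 64 bytes total
mask = movemask_spaces(chunk)
# Count trailing zeros of the mask: the number of non-space bytes
# at the start of the chunk.
leading_non_spaces = (mask & -mask).bit_length() - 1
print(leading_non_spaces)  # 5 ("const" is 5 bytes)
```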
&lt;h2&gt;Enter ARM Neon&lt;/h2&gt;
&lt;p&gt;On CPUs sporting the ARM Neon instruction set, like the popular Apple M-series chips, we do not have direct hardware support for 64-byte vectors.
Instead, we have to use 4 vectors of 16 bytes each to emulate 64-byte width.&lt;/p&gt;
&lt;p&gt;This leaves us with a choice. We could load in 4 vectors in normal order, like so: (in &lt;a href=&quot;https://ziglang.org/&quot;&gt;Zig&lt;/a&gt;, we can load in 64 bytes and let the compiler load in 4 vectors for us)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export fn load64Bytes(ptr: [*]u8) @Vector(64, u8) {
    return ptr[0..64].*;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Aarch64 Assembly emit:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ldp     q0, q1, [x0]; loads two vectors from pointer `x0`
ldp     q2, q3, [x0, #32]; loads two vectors from `x0+32`
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or in interleaved order, like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export fn load64BytesInterleaved(ptr: [*]u8) @Vector(64, u8) {
    return @bitCast(std.simd.deinterlace(4, ptr[0..64].*));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Aarch64 Assembly emit (we have an instruction for that!):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ld4     { v0.16b, v1.16b, v2.16b, v3.16b }, [x0]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, interleaved order, as the name suggests, does not load data in normal order. It loads every 4th byte, with the first vector starting at byte 0, the second vector starting at byte 1, and so on:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./interleaved-vector-dark.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This strategy is nice when you have an array of 4-byte structs, where each byte is a separate field.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const rgba = struct { r: u8, g: u8, b: u8, a: u8 };
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;E.g. if you have an array of &lt;code&gt;rgba&lt;/code&gt;s, &lt;code&gt;ld4&lt;/code&gt; will return your &lt;code&gt;r&lt;/code&gt;, &lt;code&gt;g&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, and &lt;code&gt;a&lt;/code&gt; values in separate vectors.&lt;/p&gt;
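&lt;p&gt;A quick scalar model of that field separation (plain Python, not Neon):&lt;/p&gt;

```python
# Three rgba pixels laid out contiguously in memory:
pixels = bytes([1, 2, 3, 4,   5, 6, 7, 8,   9, 10, 11, 12])
# ld4 on this data would hand back each field in its own vector:
r, g, b, a = (pixels[i::4] for i in range(4))
print(list(r), list(a))  # [1, 5, 9] [4, 8, 12]
```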
&lt;p&gt;However, even when reordering is not what we&apos;re going for, we can still see efficiency gains by using this facility when &lt;a href=&quot;#movemask&quot;&gt;movemasking&lt;/a&gt;, when &lt;a href=&quot;#unmovemask&quot;&gt;unmovemasking&lt;/a&gt;, or when doing &lt;a href=&quot;#elementwise-shifts&quot;&gt;vector element shifts&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Movemask&lt;/h2&gt;
&lt;p&gt;If you want to produce a 64-bit bitstring that tells you where the space characters are in a 64-byte chunk, you might try the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export fn checkWhitespace(ptr: [*]u8) u64 {
    return @bitCast(
        @as(@Vector(64, u8), ptr[0..64].*) == @as(@Vector(64, u8), @splat(&apos; &apos;))
    );
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At the time of writing, LLVM will give you this emit: (&lt;a href=&quot;https://zig.godbolt.org/z/nP44MrMWx&quot;&gt;Check what it is today&lt;/a&gt;)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.LCPI0_0:
    .byte   1
    .byte   2
    .byte   4
    .byte   8
    .byte   16
    .byte   32
    .byte   64
    .byte   128
    .byte   1
    .byte   2
    .byte   4
    .byte   8
    .byte   16
    .byte   32
    .byte   64
    .byte   128
checkWhitespace:
    ldp     q0, q1, [x0]
    ldp     q2, q3, [x0, #32]
    movi    v4.16b, #32
    cmeq    v3.16b, v3.16b, v4.16b
    adrp    x8, .LCPI0_0
    ldr     q5, [x8, :lo12:.LCPI0_0]
    and     v3.16b, v3.16b, v5.16b
    ext     v6.16b, v3.16b, v3.16b, #8
    zip1    v3.16b, v3.16b, v6.16b
    addv    h3, v3.8h
    fmov    w8, s3
    cmeq    v2.16b, v2.16b, v4.16b
    and     v2.16b, v2.16b, v5.16b
    ext     v3.16b, v2.16b, v2.16b, #8
    zip1    v2.16b, v2.16b, v3.16b
    addv    h2, v2.8h
    fmov    w9, s2
    bfi     w9, w8, #16, #16
    cmeq    v1.16b, v1.16b, v4.16b
    and     v1.16b, v1.16b, v5.16b
    ext     v2.16b, v1.16b, v1.16b, #8
    zip1    v1.16b, v1.16b, v2.16b
    addv    h1, v1.8h
    fmov    w8, s1
    cmeq    v0.16b, v0.16b, v4.16b
    and     v0.16b, v0.16b, v5.16b
    ext     v1.16b, v0.16b, v0.16b, #8
    zip1    v0.16b, v0.16b, v1.16b
    addv    h0, v0.8h
    fmov    w10, s0
    bfi     w10, w8, #16, #16
    orr     x0, x10, x9, lsl #32
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unfortunately, LLVM has not yet seen the light of interleaved vectors. Here is the x86-64 emit for Zen 4, for reference:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.LCPI0_0:
    .zero   64,32
checkWhitespace:
    vmovdqu64       zmm0, zmmword ptr [rdi]
    vpcmpeqb        k0, zmm0, zmmword ptr [rip + .LCPI0_0]
    kmovq   rax, k0
    vzeroupper
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This paints Arm/Aarch64 in an unnecessarily bad light. With interleaved vectors, and telling the compiler exactly what we want to do, we can do a lot better.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export fn checkWhitespace(ptr: [*]u8) u64 {
    const vec: @Vector(64, u8) = @bitCast(std.simd.deinterlace(4, ptr[0..64].*));
    const spaces = @select(u8, vec == @as(@Vector(64, u8), @splat(&apos; &apos;)),
        @as(@Vector(64, u8), @splat(0xFF)),
        @as(@Vector(64, u8), @splat(0)));
    return vmovmaskq_u8(spaces);
}

fn vmovmaskq_u8(vec: @Vector(64, u8)) u64 {
    const chunks: [4]@Vector(16, u8) = @bitCast(vec);
    const t0 = vsriq_n_u8(chunks[1], chunks[0], 1);
    const t1 = vsriq_n_u8(chunks[3], chunks[2], 1);
    const t2 = vsriq_n_u8(t1, t0, 2);
    const t3 = vsriq_n_u8(t2, t2, 4);
    const t4 = vshrn_n_u16(@bitCast(t3), 4);
    return @bitCast(t4);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives us the following assembly (&lt;a href=&quot;https://zig.godbolt.org/z/76zEMee5K&quot;&gt;See full Zig code and assembly here&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;checkWhitespace:
    movi    v0.16b, #32
    ld4     { v1.16b, v2.16b, v3.16b, v4.16b }, [x0]
    cmeq    v5.16b, v3.16b, v0.16b
    cmeq    v6.16b, v4.16b, v0.16b
    cmeq    v7.16b, v1.16b, v0.16b
    cmeq    v0.16b, v2.16b, v0.16b
    sri     v0.16b, v7.16b, #1
    sri     v6.16b, v5.16b, #1
    sri     v6.16b, v0.16b, #2
    sri     v6.16b, v6.16b, #4
    shrn    v0.8b, v6.8h, #4
    fmov    x0, d0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a lot cleaner! I show and explain how this routine works &lt;a href=&quot;https://www.youtube.com/live/FDiUKafPs0U?t=2245&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h6&gt;Note: if you check LLVM-mca, you will find that this is expected to run slower than the previous version &lt;a href=&quot;https://zig.godbolt.org/z/K9nYn1e46&quot;&gt;if you only care to do a single movemask&lt;/a&gt;. However, if you want to do several movemasks simultaneously, &lt;a href=&quot;https://zig.godbolt.org/z/69qYnzjo9&quot;&gt;this version will be faster&lt;/a&gt;. LLVM&apos;s cost model is also conservative; the &lt;code&gt;shrn&lt;/code&gt; instruction has 3 cycles of latency on Apple M3&apos;s performance core, but 4 cycles of latency on their efficiency core. LLVM treats it as a 4-cycle operation.&lt;/h6&gt;
&lt;h2&gt;Unmovemask&lt;/h2&gt;
&lt;p&gt;Sometimes, we want to go the other way. This may be because we did a movemask, then did some bit manipulation on the mask, and now we want to turn our mask back into a vector. Here is the routine for normal vectors on ARM64:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export fn unmovemask64(x: u64) @Vector(64, u8) {
    const bit_positions = @as(@Vector(64, u8), @splat(1)) &amp;lt;&amp;lt; @truncate(std.simd.iota(u8, 64));
    const v0 = std.simd.join(@as(@Vector(8, u8), @splat(0)), @as(@Vector(8, u8), @splat(1)));
    const v1 = std.simd.join(@as(@Vector(8, u8), @splat(2)), @as(@Vector(8, u8), @splat(3)));
    const v2 = std.simd.join(@as(@Vector(8, u8), @splat(4)), @as(@Vector(8, u8), @splat(5)));
    const v3 = std.simd.join(@as(@Vector(8, u8), @splat(6)), @as(@Vector(8, u8), @splat(7)));

    const v = std.simd.join(@as(@Vector(8, u8), @bitCast(x)), @as(@Vector(8, u8), @splat(undefined)));

    const final: @Vector(64, u8) = @bitCast([4]@Vector(16, u8){ tbl(v, v0), tbl(v, v1), tbl(v, v2), tbl(v, v3) });

    return @select(u8, (final &amp;amp; bit_positions) == bit_positions,
        @as(@Vector(64, u8), @splat(0xFF)),
        @as(@Vector(64, u8), @splat(0)),
    );
}

fn tbl(table: @Vector(16, u8), indices: anytype) @TypeOf(indices) {
    switch (@TypeOf(indices)) {
        @Vector(8, u8), @Vector(16, u8) =&amp;gt; {},
        else =&amp;gt; @compileError(&quot;[tbl] Invalid type for indices&quot;),
    }
    return struct {
        extern fn @&quot;llvm.aarch64.neon.tbl1&quot;(@TypeOf(table), @TypeOf(indices)) @TypeOf(indices);
    }.@&quot;llvm.aarch64.neon.tbl1&quot;(table, indices);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the corresponding assembly: (I reordered the instructions and removed C ABI stuff)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.LCPI1_0:
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
.LCPI1_1:
        .byte   2
        .byte   2
        .byte   2
        .byte   2
        .byte   2
        .byte   2
        .byte   2
        .byte   2
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3
.LCPI1_2:
        .byte   4
        .byte   4
        .byte   4
        .byte   4
        .byte   4
        .byte   4
        .byte   4
        .byte   4
        .byte   5
        .byte   5
        .byte   5
        .byte   5
        .byte   5
        .byte   5
        .byte   5
        .byte   5
.LCPI1_3:
        .byte   6
        .byte   6
        .byte   6
        .byte   6
        .byte   6
        .byte   6
        .byte   6
        .byte   6
        .byte   7
        .byte   7
        .byte   7
        .byte   7
        .byte   7
        .byte   7
        .byte   7
        .byte   7
.LCPI1_4:
        .byte   1
        .byte   2
        .byte   4
        .byte   8
        .byte   16
        .byte   32
        .byte   64
        .byte   128
        .byte   1
        .byte   2
        .byte   4
        .byte   8
        .byte   16
        .byte   32
        .byte   64
        .byte   128
unmovemask64:
        adrp    x9, .LCPI1_0; load the 5 vectors above into registers
        ldr     q1, [x9, :lo12:.LCPI1_0]
        adrp    x9, .LCPI1_1
        ldr     q2, [x9, :lo12:.LCPI1_1]
        adrp    x9, .LCPI1_2
        ldr     q3, [x9, :lo12:.LCPI1_2]
        adrp    x9, .LCPI1_3
        ldr     q4, [x9, :lo12:.LCPI1_3]
        adrp    x9, .LCPI1_4
        ldr     q5, [x9, :lo12:.LCPI1_4]
        fmov    d0, x0; move data from a scalar register to a vector register
        tbl     v1.16b, { v0.16b }, v1.16b; broadcast byte 0 to bytes 0-7, byte 1 to bytes 8-15
        tbl     v2.16b, { v0.16b }, v2.16b; broadcast byte 2 to bytes 0-7, byte 3 to bytes 8-15
        tbl     v3.16b, { v0.16b }, v3.16b; broadcast byte 4 to bytes 0-7, byte 5 to bytes 8-15
        tbl     v4.16b, { v0.16b }, v4.16b; broadcast byte 6 to bytes 0-7, byte 7 to bytes 8-15
        cmtst   v1.16b, v1.16b, v5.16b; turn each unique bit position into a byte of all 0xFF or 0
        cmtst   v2.16b, v2.16b, v5.16b; turn each unique bit position into a byte of all 0xFF or 0
        cmtst   v3.16b, v3.16b, v5.16b; turn each unique bit position into a byte of all 0xFF or 0
        cmtst   v4.16b, v4.16b, v5.16b; turn each unique bit position into a byte of all 0xFF or 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In interleaved space, we can do better:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export fn unmovemask(x: u64) @Vector(64, u8) {
    const vec = @as(@Vector(8, u8), @bitCast(x));
    const interlaced_vec = std.simd.interlace(.{ vec, vec });

    return std.simd.join(
        std.simd.join(cmtst(interlaced_vec, @bitCast(@as(@Vector(8, u16), @splat(@as(u16, @bitCast([2]u8{ 1 &amp;lt;&amp;lt; 0, 1 &amp;lt;&amp;lt; 4 })))))),
                      cmtst(interlaced_vec, @bitCast(@as(@Vector(8, u16), @splat(@as(u16, @bitCast([2]u8{ 1 &amp;lt;&amp;lt; 1, 1 &amp;lt;&amp;lt; 5 }))))))),
        std.simd.join(cmtst(interlaced_vec, @bitCast(@as(@Vector(8, u16), @splat(@as(u16, @bitCast([2]u8{ 1 &amp;lt;&amp;lt; 2, 1 &amp;lt;&amp;lt; 6 })))))),
                      cmtst(interlaced_vec, @bitCast(@as(@Vector(8, u16), @splat(@as(u16, @bitCast([2]u8{ 1 &amp;lt;&amp;lt; 3, 1 &amp;lt;&amp;lt; 7 })))))))
    );
}

fn cmtst(a: anytype, comptime b: @TypeOf(a)) @TypeOf(a) {
    return @select(u8, (a &amp;amp; b) != @as(@TypeOf(a), @splat(0)), @as(@TypeOf(a), @splat(0xff)), @as(@TypeOf(a), @splat(0)));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In assembly: (after &lt;a href=&quot;https://github.com/llvm/llvm-project/issues/107243&quot;&gt;llvm/llvm-project#107243&lt;/a&gt;, reordering, and removing C ABI ceremony)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;unmovemask_interleaved:
        mov     w9, #0x400; load 4 constants into vectors
        dup     v0.8h, w9
        mov     w9, #0x501
        dup     v1.8h, w9
        mov     w9, #0x602
        dup     v2.8h, w9
        mov     w9, #0x703
        dup     v3.8h, w9
        fmov    d4, x0; move data from a scalar register to a vector register
        zip1    v4.16b, v4.16b, v4.16b; interleave input with itself -&amp;gt; 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
        cmtst   v0.16b, v0.16b, v4.16b; match bit positions: 0,4,0,4,0,4,0,4,0,4,0,4,0,4,0,4
        cmtst   v1.16b, v1.16b, v4.16b; match bit positions: 1,5,1,5,1,5,1,5,1,5,1,5,1,5,1,5
        cmtst   v2.16b, v2.16b, v4.16b; match bit positions: 2,6,2,6,2,6,2,6,2,6,2,6,2,6,2,6
        cmtst   v3.16b, v3.16b, v4.16b; match bit positions: 3,7,3,7,3,7,3,7,3,7,3,7,3,7,3,7
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, using interleaved vectors eliminated the memory accesses and reduced 4 &lt;code&gt;tbl&lt;/code&gt; instructions to a single &lt;code&gt;zip1&lt;/code&gt; instruction!&lt;/p&gt;
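&lt;p&gt;To make the bit-to-byte semantics concrete, here is a scalar model in Python (my own sketch, not code from the repository) of what &lt;code&gt;unmovemask&lt;/code&gt; computes, together with the &lt;code&gt;movemask&lt;/code&gt; operation it inverts:&lt;/p&gt;

```python
# Scalar model of unmovemask: bit i of a 64-bit mask becomes byte i of a
# 64-byte vector, 0xFF if the bit is set and 0 otherwise.
def unmovemask(x):
    return [0xFF if (x >> i) % 2 else 0 for i in range(64)]

# The inverse: gather the high bit of every byte back into a 64-bit mask.
def movemask(bytes64):
    out = 0
    for i, b in enumerate(bytes64):
        out |= (b >> 7) << i
    return out

mask = 0x0123456789ABCDEF
assert movemask(unmovemask(mask)) == mask  # the two operations round-trip
```

&lt;p&gt;The vector routines above compute exactly this, just 16 lanes at a time; the interleaved variant merely stores the resulting bytes in &lt;code&gt;ld4&lt;/code&gt; order.&lt;/p&gt;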
&lt;h2&gt;Elementwise Shifts&lt;/h2&gt;
&lt;p&gt;Initially, I believed we could only do algorithms on interleaved vectors where order didn&apos;t change. However, while working on the UTF8 validator for my &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser&quot;&gt;Accelerated-Zig-Parser&lt;/a&gt;, I realized we can emulate element shifts in interleaved space.&lt;/p&gt;
&lt;p&gt;Let&apos;s say we have a vector where each byte contains its own index. Next, we shift the elements right by one, shifting in &lt;code&gt;-1&lt;/code&gt;. On normal vectors, this looks like so (only showing first 16 bytes of the vector due to space constraints):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;shift-interleaved-dark.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This aligns the bytes such that each pair of contiguous bytes is now aligned column-wise. This allows us to validate 2-byte UTF8 codepoints efficiently. To properly validate 3- and 4-byte codepoints, we will need two more such shifts, shifting in &lt;code&gt;-2&lt;/code&gt; and &lt;code&gt;-3&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;shift-interleaved-2-dark.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In this article, I will refer to the main vectors as &lt;code&gt;prev0&lt;/code&gt; (the top vector in the previous diagram), and the vectors containing the previous bytes, relative to &lt;code&gt;prev0&lt;/code&gt;, as &lt;code&gt;prev1&lt;/code&gt;, &lt;code&gt;prev2&lt;/code&gt;, and &lt;code&gt;prev3&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For UTF8 validation, we can match the 4th byte of a 4-byte sequence in &lt;code&gt;prev0&lt;/code&gt;, the 3rd byte in &lt;code&gt;prev1&lt;/code&gt;, the 2nd byte in &lt;code&gt;prev2&lt;/code&gt;, and the 1st byte in &lt;code&gt;prev3&lt;/code&gt; (because the byte-order increases by 1 as we move up a column).&lt;/p&gt;
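&lt;p&gt;A tiny scalar sketch (my own, with a placeholder fill byte) of why this alignment works: in the column of a sequence&apos;s final byte, &lt;code&gt;prevK&lt;/code&gt; holds the byte K positions earlier, so the leading byte of a 4-byte codepoint lands in &lt;code&gt;prev3&lt;/code&gt;:&lt;/p&gt;

```python
# "a" + a 4-byte codepoint (U+1D11E, encoded F0 9D 84 9E) + "b"
data = list(bytes([0x61, 0xF0, 0x9D, 0x84, 0x9E, 0x62]))
FILL = 0x00  # placeholder for whatever byte was shifted in

# prev[k][i] is the byte k positions before position i.
prev = [data[:]]
for k in range(1, 4):
    prev.append([FILL] * k + data[:len(data) - k])

i = 4  # position of the sequence's 4th (final) byte
assert prev[0][i] == 0x9E  # 4th byte of the sequence in prev0
assert prev[1][i] == 0x84  # 3rd byte in prev1
assert prev[2][i] == 0x9D  # 2nd byte in prev2
assert prev[3][i] == 0xF0  # leading byte in prev3
```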
&lt;p&gt;Now, looking again at the interleaved vectors given by &lt;code&gt;ld4&lt;/code&gt;, think about how we might find the relative &lt;code&gt;prev1&lt;/code&gt;, &lt;code&gt;prev2&lt;/code&gt;, and &lt;code&gt;prev3&lt;/code&gt; of each of the following vectors (each one is a &lt;code&gt;prev0&lt;/code&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./interleaved-vector-dark.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Here are the &lt;code&gt;prev1&lt;/code&gt; vectors relative to each of the vectors above:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./interleaved-vector-2-dark.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Hopefully it is obvious that all we did was subtract one from each index, from the perspective of byte-order. As you can see, the lower 3 vectors were already created by &lt;code&gt;ld4&lt;/code&gt;! It&apos;s only the uppermost vector which needs to be computed, by shifting &lt;code&gt;-1&lt;/code&gt; into the vector that starts with &lt;code&gt;3&lt;/code&gt;, &lt;code&gt;7&lt;/code&gt;, &lt;code&gt;11&lt;/code&gt;, &lt;code&gt;15&lt;/code&gt;, etc.&lt;/p&gt;
&lt;p&gt;That means we can get the semantics of a 64-byte shift by 1 by only shifting a single 16-byte vector!&lt;/p&gt;
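&lt;p&gt;That claim is easy to check with a scalar model (my own sketch) of &lt;code&gt;ld4&lt;/code&gt;&apos;s deinterleave: shifting only the fourth vector and rotating the four vectors reproduces a flat 64-byte shift by 1:&lt;/p&gt;

```python
# Model ld4: vector j holds bytes j, j+4, j+8, ... of the 64-byte input.
def ld4(data):
    return [data[j::4] for j in range(4)]

# Elementwise shift right by one slot, filling the vacated front slot.
def shift_right(vec, fill):
    return [fill] + vec[:15]

data = list(range(64))
vecs = ld4(data)

# In interleaved space: shift only vecs[3], then rotate the four vectors.
shifted = [shift_right(vecs[3], 0xFF), vecs[0], vecs[1], vecs[2]]

# This matches deinterleaving the flat input shifted right by one byte.
assert shifted == ld4([0xFF] + data[:63])
```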
&lt;p&gt;Let&apos;s see how this extends to the &lt;code&gt;prev2&lt;/code&gt; vectors:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;./interleaved-vectors-4-dark.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Same deal as before, the bottom three vectors in this diagram we already had when we computed &lt;code&gt;prev1&lt;/code&gt;. Again, we just need to compute the uppermost vector in this diagram by shifting &lt;code&gt;-2&lt;/code&gt; into the vector that starts with &lt;code&gt;2&lt;/code&gt;, &lt;code&gt;6&lt;/code&gt;, &lt;code&gt;10&lt;/code&gt;, &lt;code&gt;14&lt;/code&gt;, etc.&lt;/p&gt;
&lt;p&gt;We can do the same thing to produce the &lt;code&gt;prev3&lt;/code&gt; vectors. The only additional computation needed is that we need to shift in &lt;code&gt;-3&lt;/code&gt; to the vector that starts with &lt;code&gt;1&lt;/code&gt;, &lt;code&gt;5&lt;/code&gt;, &lt;code&gt;9&lt;/code&gt;, &lt;code&gt;13&lt;/code&gt;, etc. Once we do that, we will have the following vectors:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;interleaved-vectors-5-dark.svg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In the above diagram, each of the bottom 4 vectors is a &lt;code&gt;prev0&lt;/code&gt; vector. For each &lt;code&gt;prev0&lt;/code&gt; vector, its &lt;code&gt;prev1&lt;/code&gt; is the vector directly above it, its &lt;code&gt;prev2&lt;/code&gt; is the one 2 rows above, and its &lt;code&gt;prev3&lt;/code&gt; is the one 3 rows above.&lt;/p&gt;
&lt;p&gt;With only 3 vector shifts, and a total of 7 vectors in play, we can operate on all the shifted vectors we need for a 64-byte chunk. Compare this to needing to produce a separate &lt;code&gt;prev1&lt;/code&gt;, &lt;code&gt;prev2&lt;/code&gt;, and &lt;code&gt;prev3&lt;/code&gt; for each 16-byte vector, which takes a total of 12 vector shifts for 64 bytes.&lt;/p&gt;
&lt;p&gt;Using this intuition, we can write a function for interleaved vectors which emulates the semantics of a vector shift on normally-ordered vectors, by any compile-time-known amount! This removes the restriction of only using this vector-interleaving trick in circumstances where order does not change&lt;span id=&quot;shiftInterleavedElementsRight&quot;&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fn shiftInterleavedElementsRight(
    vecs: [4]@Vector(16, u8),
    comptime amount: std.simd.VectorCount(@Vector(64, u8)),
    shift_in: std.meta.Child(@Vector(64, u8))
) [4]@Vector(16, u8) {
    var new_vecs = vecs;

    if ((amount &amp;amp; 1) == 1) {
        const n = std.simd.shiftElementsRight(new_vecs[3], 1, shift_in);
        new_vecs[3] = new_vecs[2];
        new_vecs[2] = new_vecs[1];
        new_vecs[1] = new_vecs[0];
        new_vecs[0] = n;
    }

    if ((amount &amp;amp; 2) == 2) {
        const n1 = std.simd.shiftElementsRight(new_vecs[3], 1, shift_in);
        const n0 = std.simd.shiftElementsRight(new_vecs[2], 1, shift_in);
        new_vecs[3] = new_vecs[1];
        new_vecs[2] = new_vecs[0];
        new_vecs[1] = n1;
        new_vecs[0] = n0;
    }

    const leftover_amt = amount &amp;gt;&amp;gt; 2;

    if (leftover_amt &amp;gt; 0) {
        new_vecs = .{
            std.simd.shiftElementsRight(new_vecs[0], leftover_amt, shift_in),
            std.simd.shiftElementsRight(new_vecs[1], leftover_amt, shift_in),
            std.simd.shiftElementsRight(new_vecs[2], leftover_amt, shift_in),
            std.simd.shiftElementsRight(new_vecs[3], leftover_amt, shift_in)
        };
    }

    return new_vecs;
}
&lt;/code&gt;&lt;/pre&gt;
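&lt;p&gt;As a sanity check, here is a scalar Python mirror of the function above (my own sketch, not a guaranteed translation): for every shift amount, shifting in interleaved space agrees with shifting the flat 64-byte sequence and re-deinterleaving it:&lt;/p&gt;

```python
def ld4(data):  # vector j holds bytes j, j+4, j+8, ... of the input
    return [list(data[j::4]) for j in range(4)]

def shift_right(vec, n, fill):  # elementwise shift right by n slots
    return [fill] * n + vec[:len(vec) - n]

def shift_interleaved_right(vecs, amount, fill):
    v = [list(x) for x in vecs]
    if amount % 2 == 1:         # low bit: one elementwise shift + rotate by 1
        v = [shift_right(v[3], 1, fill), v[0], v[1], v[2]]
    if (amount // 2) % 2 == 1:  # next bit: two elementwise shifts + rotate by 2
        v = [shift_right(v[2], 1, fill), shift_right(v[3], 1, fill), v[0], v[1]]
    leftover = amount // 4      # multiples of 4 shift every vector elementwise
    if leftover:
        v = [shift_right(x, leftover, fill) for x in v]
    return v

data = list(range(64))
for amount in range(16):
    flat = [0xFF] * amount + data[:64 - amount]
    assert shift_interleaved_right(ld4(data), amount, 0xFF) == ld4(flat)
```

&lt;p&gt;The two modulo tests correspond to the two low-bit branches in the Zig code above, and the final loop corresponds to the &lt;code&gt;leftover_amt&lt;/code&gt; step.&lt;/p&gt;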
&lt;h2&gt;Prefix sums&lt;/h2&gt;
&lt;p&gt;While interleaved vectors are a clear win for reducing the number of vector shifts required in our UTF8 use-case, even emulating a 64-byte vector shift with only a single 16-byte shift (!!), the technique does not always work out favorably.&lt;/p&gt;
&lt;p&gt;The prefix-sum algorithm is one case where interleaved vectors perform worse than normally-ordered vectors:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fn prefixSum(vec_: @Vector(64, u8)) @Vector(64, u8) {
    var vec = vec_ + std.simd.shiftElementsRight(vec_, 1, 0);
    vec += std.simd.shiftElementsRight(vec, 2, 0);
    vec += std.simd.shiftElementsRight(vec, 4, 0);
    vec += std.simd.shiftElementsRight(vec, 8, 0);
    vec += std.simd.shiftElementsRight(vec, 16, 0);
    return vec + std.simd.shiftElementsRight(vec, 32, 0);
}
&lt;/code&gt;&lt;/pre&gt;
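&lt;p&gt;For reference, the doubling trick in &lt;code&gt;prefixSum&lt;/code&gt; can be modelled in scalar Python (my sketch; sums are taken modulo 256 to keep values in byte range):&lt;/p&gt;

```python
# Log-step prefix sum: after rounds of adding a copy shifted right by
# 1, 2, 4, ..., 32, each slot holds the sum of all slots up to and
# including itself.
def prefix_sum(vec):
    v = list(vec)
    shift = 1
    while len(v) > shift:
        v = [(v[i] + (v[i - shift] if i >= shift else 0)) % 256
             for i in range(len(v))]
        shift *= 2
    return v

data = [1] * 64
assert prefix_sum(data) == [i + 1 for i in range(64)]
```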
&lt;p&gt;To compute the last two lines, where we want to shift by 16 and 32 respectively, each of those will require 4 vector shifts and 4 adds to simulate over interleaved vectors. &lt;a href=&quot;https://zig.godbolt.org/z/xWEoKq5cz&quot;&gt;However, with normally-ordered vectors, 0 instructions are necessary to shift by multiples of 16 (the vector length), and only 3 and 2 adds are needed respectively&lt;/a&gt; (because adding a vector of all zeroes is optimized away). See &lt;a href=&quot;https://zig.godbolt.org/z/nq99edxf9&quot;&gt;here&lt;/a&gt; for a playground showing the prefix-sum performed in interleaved space versus normal space.&lt;/p&gt;
&lt;p&gt;If we consider that each line of the &lt;code&gt;prefixSum&lt;/code&gt; function &quot;should&quot; take 4 vector shift (&lt;code&gt;ext&lt;/code&gt;) and 4 add instructions, the interleaved variant saves 5 vector shift instructions on the first two lines, and the normal version saves 8 vector shift instructions and 3 adds on the last two lines. That means the normal version takes 6 fewer instructions per iteration.&lt;/p&gt;
&lt;p&gt;However, using interleaved vectors has instruction-level-parallelism advantages that almost even it out. &lt;a href=&quot;https://zig.godbolt.org/z/f45WE4eTe&quot;&gt;According to LLVM-mca&lt;/a&gt;, the Apple M3 can do a prefix sum in interleaved space in ~14.86 cycles, whereas it can do it in normal (non-interleaved) space in ~12.87 cycles, a difference of ~1.99 cycles, despite the difference of 6 instructions.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;As shown above, the &lt;a href=&quot;../interleaved-vectors-on-arm/#movemask&quot;&gt;movemask&lt;/a&gt; and &lt;a href=&quot;../interleaved-vectors-on-arm/#unmovemask&quot;&gt;unmovemask&lt;/a&gt; routines can not only be emulated in interleaved space, but are actually more efficient there than their counterparts in normal space. &lt;a href=&quot;../interleaved-vectors-on-arm/#elementwise-shifts&quot;&gt;Elementwise-shifts&lt;/a&gt; are also more efficient when shifting only 1-3 slots left or right, but are less efficient when shifting by a multiple of 16.&lt;/p&gt;
&lt;p&gt;So next time you want to parse &lt;a href=&quot;https://github.com/simdutf/simdutf/issues/428&quot;&gt;utf8&lt;/a&gt;, &lt;a href=&quot;https://github.com/simdjson/simdjson&quot;&gt;JSON&lt;/a&gt;, or &lt;a href=&quot;https://github.com/Validark/Accelerated-Zig-Parser&quot;&gt;Zig&lt;/a&gt;, be sure to use interleaved vectors!&lt;/p&gt;
&lt;p&gt;‒ Validark&lt;/p&gt;
</content:encoded></item></channel></rss>