Byte-Level Transforms

WebP lossless inverse transforms using u8x16 generics for pixel prediction

Not all SIMD is floating-point. Image codecs often manipulate raw bytes — pixel prediction, delta decoding, channel extraction. This example from zenwebp shows u8x16 generics for WebP's lossless inverse transforms: adding a green channel to red and blue, and applying left-pixel prediction.

Green channel addition

WebP lossless stores images with a "subtract green" transform: R and B are stored relative to G. The inverse adds G back to both channels. With RGBA pixels packed as [R, G, B, A, R, G, B, A, ...], we need to extract every G byte, then add it to its neighboring R and B bytes.

use magetypes::simd::backends::U8x16Backend;
use magetypes::simd::generic::u8x16;

/// Add green to red and blue channels.
/// Processes 4 RGBA pixels (16 bytes) per iteration.
#[inline(always)]
fn add_green_portable<T: U8x16Backend>(
    token: T,
    data: &mut [u8],
    range: core::ops::Range<usize>,
) {
    // Shuffle mask: broadcast G to R and B positions within each pixel.
    // RGBA layout: [R0, G0, B0, A0, R1, G1, B1, A1, ...]
    // We want:     [G0, 0,  G0, 0,  G1, 0,  G1, 0,  ...]
    let green_mask = u8x16::from_array(token, [
        1, 0, 1, 0,   // pixel 0: G to R pos, 0 to G pos, G to B pos, 0 to A pos
        5, 0, 5, 0,   // pixel 1
        9, 0, 9, 0,   // pixel 2
        13, 0, 13, 0, // pixel 3
    ]);

    let zero = u8x16::zero(token);

    let mut offset = range.start;
    while offset + 16 <= range.end {
        let chunk: &mut [u8; 16] = data[offset..offset + 16]
            .try_into().unwrap();
        let pixels = u8x16::load(token, chunk);

        // Extract green channels via shuffle, zero out G and A positions
        let greens = pixels.shuffle(green_mask);

        // Wrapping byte addition: R += G, B += G, G += 0, A += 0
        let result = pixels.wrapping_add(greens);
        result.store(chunk);

        offset += 16;
    }

    // Scalar tail: handle remaining pixels
    while offset + 4 <= range.end {
        let g = data[offset + 1];
        data[offset] = data[offset].wrapping_add(g);     // R += G
        data[offset + 2] = data[offset + 2].wrapping_add(g); // B += G
        offset += 4;
    }
}

Left-pixel prediction (predictor add)

WebP's predictor 1 stores each pixel as a delta from the pixel to its left. The inverse accumulates left-to-right: pixel[x] += pixel[x-1]. This is inherently serial per row (each pixel depends on the previous), but we can process 4 pixels at a time using a parallel prefix sum within the SIMD register.

/// Apply left-pixel prediction (predictor 1) to 4 RGBA pixels.
/// Each pixel is the sum of itself and all pixels to its left.
#[inline(always)]
fn predictor_add_left<T: U8x16Backend>(
    token: T,
    data: &mut [u8],
    start: usize,
    prev_pixel: [u8; 4],
) {
    let chunk: &mut [u8; 16] = data[start..start + 16]
        .try_into().unwrap();
    let deltas = u8x16::load(token, chunk);

    // Parallel prefix sum across 4 pixels:
    // Step 1: Add pixel 0 to pixel 1 (shift by 4 bytes and add)
    let shifted1 = deltas.shift_left_bytes::<4>();
    let sum1 = deltas.wrapping_add(shifted1);

    // Step 2: Add pixels 0-1 to pixels 2-3 (shift by 8 bytes and add)
    let shifted2 = sum1.shift_left_bytes::<8>();
    let sum2 = sum1.wrapping_add(shifted2);

    // Step 3: Add the previous pixel (from the end of the prior chunk)
    let prev = u8x16::from_array(token, [
        prev_pixel[0], prev_pixel[1], prev_pixel[2], prev_pixel[3],
        prev_pixel[0], prev_pixel[1], prev_pixel[2], prev_pixel[3],
        prev_pixel[0], prev_pixel[1], prev_pixel[2], prev_pixel[3],
        prev_pixel[0], prev_pixel[1], prev_pixel[2], prev_pixel[3],
    ]);
    let result = sum2.wrapping_add(prev);
    result.store(chunk);
}

Top-pixel prediction (batch add)

Predictors that reference the row above (top, top-right, top-left) don't have the serial dependency within a row — each pixel depends only on the row above and itself. These parallelize fully:

/// Apply top-pixel prediction: pixel[x] += top_row[x].
/// Processes 4 pixels (16 bytes) at a time.
#[inline(always)]
fn predictor_add_top<T: U8x16Backend>(
    token: T,
    data: &mut [u8],
    offset: usize,
    top_row: &[u8],
    top_offset: usize,
) {
    let chunk: &mut [u8; 16] = data[offset..offset + 16]
        .try_into().unwrap();
    let top_chunk: &[u8; 16] = top_row[top_offset..top_offset + 16]
        .try_into().unwrap();

    let delta = u8x16::load(token, chunk);
    let top = u8x16::load(token, top_chunk);
    delta.wrapping_add(top).store(chunk);
}

Dispatch pattern

The entry point uses #[arcane] with a concrete token, then calls the generic inner functions:

#[cfg(target_arch = "x86_64")]
#[arcane]
pub fn add_green_sse2_entry(token: X64V1Token, data: &mut [u8]) {
    add_green_portable(token, data, 0..data.len());
}

// NEON and WASM128 use the same generic function:
#[cfg(target_arch = "aarch64")]
#[arcane]
pub fn add_green_neon_entry(token: NeonToken, data: &mut [u8]) {
    add_green_portable(token, data, 0..data.len());
}

Integer vs float generics

The u8x16 generic pattern is identical to f32x8 — same trait bound approach, same partition_slice, same load/store. The available operations differ:

`f32x8`	`u8x16`
`+`, `-`, `*`, `/`	`wrapping_add`, `wrapping_sub`, `saturating_add`
`mul_add`, `sqrt`, `abs`	`shuffle`, `shift_left_bytes`, `shift_right_bytes`
`simd_ge`, `blend`	`simd_eq`, `bitand`, `bitor`
`reduce_add`, `reduce_min`	`any_nonzero`, `count_ones`

The byte types shine for data movement (shuffles, shifts, masks) where float types shine for arithmetic. Choose the type that matches your data, and the generic pattern works the same.

Found an error or it needs a clarification? Open an issue on GitHub.

Substantiated corrections will be incorporated with attribution.

Found a typo? Fork, modify and open a PR.