Convolution Kernel

Horizontal image filter with f32x4 accumulator and channel-count specialization

Image resizing is a weighted sum (convolution) of neighboring pixels. This example from zenresize shows two patterns: a 4-channel specialization that uses f32x4 for natural RGBA accumulation, and a dispatch wrapper that selects the right path.

4-channel horizontal convolution

Each output pixel is the weighted sum of max_taps input pixels. With 4-channel RGBA data, one pixel is exactly one f32x4 register. The accumulator stays in SIMD the entire time — no lane extraction needed.

use magetypes::simd::backends::F32x4Backend;
use magetypes::simd::generic::f32x4;

/// 4-channel horizontal f32 convolution using f32x4.
///
/// Uses * + += instead of mul_add because on WASM (no FMA),
/// mul_add goes through a branch to (a * b) + c anyway.
/// Direct * + += gives LLVM a cleaner expression to schedule.
#[inline(always)]
fn filter_h_4ch<T: F32x4Backend>(
    token: T,
    input: &[f32],
    output: &mut [f32],
    weights: &F32WeightTable,
) {
    let (in_pixels, _) = input.as_chunks::<4>();
    let (out_pixels, _) = output.as_chunks_mut::<4>();

    for out_x in 0..weights.len() {
        let left = weights.left[out_x] as usize;
        let w = weights.weights(out_x);
        let mut acc = f32x4::zero(token);

        for (t, &weight) in w.iter().enumerate() {
            // Each input pixel is a [f32; 4] — loads directly into f32x4
            acc += f32x4::from_array(token, in_pixels[left + t])
                * f32x4::splat(token, weight);
        }

        out_pixels[out_x] = acc.to_array();
    }
}

Key insight: as_chunks::<4>() gives &[[f32; 4]] — the pixel boundaries are in the type system. from_array loads a [f32; 4] directly into a SIMD register with zero overhead (it's a transmute on aligned data). The accumulator acc stays as f32x4 for all taps, only converting back to an array at the store.

Channel-count dispatch

The SIMD benefit only applies to 4-channel data (or any channel count matching the SIMD width). For 3 channels, the data isn't aligned to any SIMD boundary — scalar code auto-vectorizes well enough. The #[magetypes] macro generates the dispatch function, and the inner match selects the right path:

#[magetypes(neon, wasm128)]
#[inline(always)]
pub fn filter_h_row_f32(
    token: Token,
    input: &[f32],
    output: &mut [f32],
    weights: &F32WeightTable,
    channels: usize,
) {
    match channels {
        4 => filter_h_4ch(token, input, output, weights),
        3 => filter_h_3ch(input, output, weights),     // scalar
        _ => filter_h_generic(input, output, weights, channels), // scalar
    }
}

The 3-channel and generic paths don't use SIMD at all — they're plain scalar loops that LLVM auto-vectorizes. The SIMD effort goes where it matters: the 4-channel path that handles ~90% of real-world image data.

Why `f32x4` and not `f32x8`?

For convolution, the accumulation happens across taps (weights), not across pixels. Each tap multiplies one pixel (4 values) by one scalar weight. With f32x4, the weight broadcasts naturally to all 4 channels via splat.

With f32x8, you'd pack two pixels into one register, but then you'd need the same weight for lanes 0-3 and a different weight for lanes 4-7. That means either loading two weights and shuffling, or accepting that both pixels use the same weight (which only works for very specific kernel structures). For general-purpose convolution, f32x4 per pixel is simpler and equally fast.

The weight table

The F32WeightTable stores pre-computed filter weights with zero-padding to max_taps:

struct F32WeightTable {
    left: Vec<u32>,        // left-most input pixel for each output pixel
    data: Vec<f32>,        // flat: max_taps entries per output pixel
    max_taps: usize,
}

impl F32WeightTable {
    fn weights(&self, out_x: usize) -> &[f32] {
        let start = out_x * self.max_taps;
        &self.data[start..start + self.max_taps]
    }
}

Zero-padding to max_taps means the inner loop always iterates the same number of times — no branch on the actual tap count. The zero weights produce zero contributions via multiply, which the CPU handles with no branch prediction overhead.

Found an error or it needs a clarification? Open an issue on GitHub.

Substantiated corrections will be incorporated with attribution.

Found a typo? Fork, modify and open a PR.

Convolution Kernel

4-channel horizontal convolution

Channel-count dispatch

Why f32x4 and not f32x8?

The weight table

ON THIS PAGE

Why `f32x4` and not `f32x8`?