Interleaved Data

Image pixels, audio samples, and sensor data are often interleaved: RGBARGBA... or LRLRLR... SIMD works best on separate channels, so you deinterleave before processing and reinterleave after.

4-Channel Deinterleave (RGBA)

Separate interleaved RGBA data into individual channels:

use magetypes::simd::{
    generic::f32x4,
    backends::F32x4Backend,
};

#[inline(always)]
fn deinterleave_rgba<T: F32x4Backend>(
    token: T,
    r0: [f32; 4], g0: [f32; 4], b0: [f32; 4], a0: [f32; 4],
    r1: [f32; 4], g1: [f32; 4], b1: [f32; 4], a1: [f32; 4],
    r2: [f32; 4], g2: [f32; 4], b2: [f32; 4], a2: [f32; 4],
    r3: [f32; 4], g3: [f32; 4], b3: [f32; 4], a3: [f32; 4],
) {
    // Input: 4 f32x4 vectors of interleaved RGBA data
    let input = [
        f32x4::<T>::from_array(token, [r0[0], g0[0], b0[0], a0[0]]),
        f32x4::<T>::from_array(token, [r1[0], g1[0], b1[0], a1[0]]),
        f32x4::<T>::from_array(token, [r2[0], g2[0], b2[0], a2[0]]),
        f32x4::<T>::from_array(token, [r3[0], g3[0], b3[0], a3[0]]),
    ];

    let [r, g, b, a] = f32x4::<T>::deinterleave_4ch(input);
    // r = [R0, R1, R2, R3]
    // g = [G0, G1, G2, G3]
    // b = [B0, B1, B2, B3]
    // a = [A0, A1, A2, A3]
}

Process Channels

Once deinterleaved, operations on each channel are straightforward:

use magetypes::simd::{
    generic::f32x4,
    backends::{F32x4Backend, F32x4Convert},
};

#[inline(always)]
fn process_channels<T: F32x4Backend + F32x4Convert>(token: T, r: f32x4<T>, g: f32x4<T>) {
    // Brighten the red channel
    let r_bright = r + f32x4::<T>::splat(token, 0.1);

    // Apply gamma to green
    let g_gamma = g.pow_midp(1.0 / 2.2);
}

4-Channel Reinterleave

Pack channels back into interleaved format:

// given r_bright, g_gamma, b, a: f32x4<T>
let output = f32x4::<T>::interleave_4ch([r_bright, g_gamma, b, a]);
// output[0..4] contain RGBARGBA... interleaved data

Low/High Interleave

For simpler cases, interleave two vectors element by element:

use magetypes::simd::{
    generic::f32x4,
    backends::F32x4Backend,
};

#[inline(always)]
fn interleave_example<T: F32x4Backend>(token: T) {
    let a = f32x4::<T>::from_array(token, [1.0, 2.0, 3.0, 4.0]);
    let b = f32x4::<T>::from_array(token, [5.0, 6.0, 7.0, 8.0]);

    let lo = a.interleave_lo(b);  // [1.0, 5.0, 2.0, 6.0]
    let hi = a.interleave_hi(b);  // [3.0, 7.0, 4.0, 8.0]
}

interleave_lo and interleave_hi are available on f32x4 across all platforms. They interleave the lower or upper halves of two vectors.

Transpose

For matrix-like data, transpose 4x4 blocks:

use magetypes::simd::{
    generic::f32x4,
    backends::F32x4Backend,
};

#[inline(always)]
fn transpose_example<T: F32x4Backend>(token: T, row0: f32x4<T>, row1: f32x4<T>, row2: f32x4<T>, row3: f32x4<T>) {
    let mut rows = [row0, row1, row2, row3];
    f32x4::<T>::transpose_4x4(&mut rows);
    // rows are now transposed in-place

    // Or use the non-mutating version:
    let [r0, r1, r2, r3] = f32x4::<T>::transpose_4x4_copy([row0, row1, row2, row3]);
}

Platform Notes

x86-64: Uses vunpcklps/vunpckhps and shuffle instructions
AArch64: Uses native vzip1q/vzip2q
WASM: Uses i32x4_shuffle

The API is identical across platforms. Performance is comparable.

Found an error or it needs a clarification? Open an issue on GitHub.

Substantiated corrections will be incorporated with attribution.

Found a typo? Fork, modify and open a PR.