Polyfills

Magetypes lets you write code using wider types (like f32x8) even on hardware with narrower registers (128-bit NEON or WASM SIMD128). The polyfill layer handles this transparently.

How It Works

On x86-64 with AVX2, f32x8 maps directly to a single 256-bit __m256 register. On AArch64 NEON (128-bit registers), the same f32x8 type is implemented as two f32x4 operations internally. Every method — +, reduce_add(), splat(), etc. — works identically regardless of the underlying implementation.

The generic type f32x8<T> is parameterized by a backend token. Different backends produce different implementations:

use magetypes::simd::{
    generic::f32x8,
    backends::{x64v3, neon, scalar},
};

// x86-64 AVX2: maps to a single __m256 register
let a = f32x8::<x64v3>::splat(token, 1.0);

// AArch64 NEON: internally two f32x4 NEON operations
let a = f32x8::<neon>::splat(token, 1.0);

// Scalar fallback: eight individual f32 values
let a = f32x8::<scalar>::splat(token, 1.0);

Write your code once using a generic backend bound — the right implementation is selected at the call site:

use magetypes::simd::{
    generic::f32x8,
    backends::F32x8Backend,
};

#[inline(always)]
fn example<T: F32x8Backend>(token: T) {
    let a = f32x8::<T>::splat(token, 1.0);
    let b = f32x8::<T>::splat(token, 2.0);
    let c = a + b;       // Two vaddq_f32 on NEON, one vaddps on AVX2
    let sum = c.reduce_add();
}

Pick the Right Size for Your Algorithm

The polyfill approach means you pick the vector width that matches your algorithm, not your hardware:

Processing 8 floats at a time? Use f32x8. On ARM, it's two NEON ops — still faster than scalar.
Processing 4 floats at a time? Use f32x4. Native on all platforms.
Processing 16 floats at a time? Use f32x16 (requires avx512 feature). On ARM, it's four NEON ops.

Wider polyfills have overhead (2x or 4x the instruction count) but the overhead is constant and predictable. For data-parallel workloads, using f32x8 on ARM is still substantially faster than scalar f32 code.

implementation_name()

Every magetypes vector has an implementation_name() associated function that returns a string identifying the actual implementation. It's an associated function, not a method — call it on the type, not on a value:

use magetypes::simd::{
    generic::f32x8,
    backends::{x64v3, neon},
};

println!("{}", f32x8::<x64v3>::implementation_name());
// "x86::v3::f32x8"

println!("{}", f32x8::<neon>::implementation_name());
// "polyfill::neon::f32x8"

Platform	f32x8 implementation_name
x86-64 (AVX2)	`"x86::v3::f32x8"`
AArch64 (NEON)	`"polyfill::neon::f32x8"`
WASM	`"polyfill::wasm128::f32x8"`

For native-width types, the prefix reflects the platform directly:

Platform	f32x4 implementation_name
x86-64	`"x86::v2::f32x4"`
AArch64	`"arm::neon::f32x4"`
WASM	`"wasm::wasm128::f32x4"`

This is useful for debugging and logging — you can verify which code path is actually running.

Polyfill Tiers

Width	x86-64	AArch64	WASM
128-bit (f32x4, i32x4, ...)	Native (SSE/AVX)	Native (NEON)	Native (SIMD128)
256-bit (f32x8, i32x8, ...)	Native (AVX2)	Polyfill (2x NEON)	Polyfill (2x SIMD128)
512-bit (f32x16, i32x16, ...)	Native (AVX-512)*	Polyfill (4x NEON)	Polyfill (4x SIMD128)

*512-bit types require the avx512 feature flag.

What's Polyfilled, What's Not

The polyfill layer covers:

All arithmetic operators (+, -, *, /, negation)
FMA (mul_add, mul_sub)
Comparisons (simd_lt, simd_eq, etc.)
Reductions (reduce_add, reduce_max, reduce_min)
Construction and extraction (from_array, to_array, splat, etc.)
Transcendentals (exp, log2, ln, etc.)
Bitwise operations
Conversions

The API is identical. The only difference is the number of hardware instructions emitted.

Performance: Generic = Concrete Inside `#[arcane]` (When Inlined)

A common concern: does using f32x8::<T> with a generic T: F32x8Backend produce worse code than f32x8::<x64v3>? No — if the generic function inlines into the #[arcane] caller. The backend methods are all #[inline(always)], and once the generic function body is inside the #[target_feature] region, LLVM emits the same SIMD instructions.

Pattern	Time
`f32x8::<T>` generic `#[inline(always)]` inside `#[arcane]`	1.35 ns
`f32x8::<x64v3>` concrete inside `#[arcane]`	1.16 ns
`f32x8::<T>` generic `#[inline(never)]` inside `#[arcane]`	23.7 ns (18x — inlining is required!)
`f32x8::<T>` generic without `#[arcane]`	24.7 ns (18x slower)

The generic function has no #[target_feature] of its own. It inherits the caller's features through inlining. Mark generic SIMD helpers #[inline(always)] to guarantee this. Without inlining, intrinsics become function calls — even inside #[arcane].

Found an error or it needs a clarification? Open an issue on GitHub.

Substantiated corrections will be incorporated with attribution.

Found a typo? Fork, modify and open a PR.