Behavioral Differences

Magetypes maintains full API parity across x86-64, AArch64, and WASM — the same methods exist everywhere. But a few operations have different semantics between architectures. These are not bugs; they reflect real hardware differences.

Bitwise Operators

Syntax                                 x86-64                        AArch64       WASM
a & b, a | b, a ^ b, !a                Trait impls (operators work)  Methods only  Methods only
a.and(b), a.or(b), a.xor(b), a.not()   Works                         Works         Works

Portable choice: Always use .and(), .or(), .xor(), .not() methods. They work on all platforms.

use magetypes::simd::{
    generic::f32x8,
    backends::F32x8Backend,
};

#[inline(always)]
fn bitwise_example<T: F32x8Backend>(a: f32x8<T>, b: f32x8<T>) {
    // Portable — works on all platforms
    let result = a.and(b);

    // x86-64 only — won't compile on ARM/WASM
    // let result = a & b;
}

Shift Right for Signed Integers

Operation                            x86-64               AArch64                   WASM
shr on signed types (i8, i16, i32)   Logical (zero-fill)  Arithmetic (sign-extend)  Arithmetic (sign-extend)
shr_arithmetic                       Sign-extending       Sign-extending            Sign-extending

Portable choice: Use shr_arithmetic when you want sign-extending behavior; it behaves the same on every platform. On signed types, shr zero-fills only on x86-64 and sign-extends on ARM/WASM, so do not rely on it for portable zero-filling.

use magetypes::simd::{
    generic::i32x4,
    backends::I32x4Backend,
};

#[inline(always)]
fn shift_example<T: I32x4Backend>(token: T) {
    let v = i32x4::<T>::splat(token, -8);

    // shr: behavior differs by platform
    let shifted = v.shr::<1>();
    // x86-64:  [-8 >> 1] with zero-fill = 2147483644 (0x7FFF_FFFC)
    // ARM/WASM: [-8 >> 1] with sign-extend = -4

    // shr_arithmetic: consistent everywhere
    let shifted = v.shr_arithmetic::<1>();
    // All platforms: -4
}

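This difference exists because the x86-64 backend maps shr to the logical shift instructions (for example psrld) for all integer types, while the ARM NEON and WASM backends use arithmetic shifts for signed types. SSE/AVX do provide arithmetic shifts as well (psraw, psrad); those are what sign-extending behavior requires.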
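
The two shift flavors can be reproduced on a single lane with plain Rust integers. This is only a scalar sketch using the standard library, not magetypes code; the function names are hypothetical and chosen for illustration.

```rust
/// Logical right shift of a signed lane: reinterpret the bits as unsigned
/// so the vacated high bits are filled with zeros.
fn shr_logical_lane(x: i32, n: u32) -> i32 {
    ((x as u32) >> n) as i32
}

/// Arithmetic right shift: `>>` on a signed integer copies the sign bit
/// into the vacated high bits.
fn shr_arithmetic_lane(x: i32, n: u32) -> i32 {
    x >> n
}
```

For -8 shifted right by 1, the logical version yields 2147483644 (0x7FFF_FFFC) and the arithmetic version yields -4, matching the comments in the example above.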

Blend Signature

Operation                         x86-64 / AArch64                    WASM
blend(mask, true_val, false_val)  blend(true_val, false_val) on mask  self.blend(other, mask) on value

The method exists on all platforms but the calling convention differs. For portable code that uses blend, test on all target platforms.
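
Whichever calling convention a platform uses, blend is a per-lane select. The sketch below shows that operation on plain arrays. It is not the magetypes API: the function name is hypothetical, and a bool array stands in for the mask type as a simplifying assumption.

```rust
/// Lane-wise select: take `if_true[i]` where `mask[i]` is set,
/// otherwise keep `if_false[i]`.
fn blend_lanes(mask: [bool; 4], if_true: [f32; 4], if_false: [f32; 4]) -> [f32; 4] {
    let mut out = if_false;
    for i in 0..4 {
        if mask[i] {
            out[i] = if_true[i];
        }
    }
    out
}
```

A scalar reference like this can serve as the expected value when testing blend-based code on each target.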

interleave_lo / interleave_hi

Available on f32x4 across all platforms. Not available on integer types.

use magetypes::simd::{
    generic::f32x4,
    backends::F32x4Backend,
};

#[inline(always)]
fn interleave_example<T: F32x4Backend>(a: f32x4<T>, b: f32x4<T>) {
    // Works everywhere
    let lo = a.interleave_lo(b);
    let hi = a.interleave_hi(b);
}

Floating-Point Behavioral Differences

These differences arise from how hardware implements IEEE 754 operations. They affect specific edge cases (NaN, signed zero, near-zero cancellation) but not normal arithmetic.

Negation and Signed Zero

Operation  x86-64                 AArch64           WASM
neg(0.0)   +0.0 (uses sub(0, x))  -0.0 (uses vneg)  -0.0 (uses f32x4_neg)

x86 implements negation as 0 - x, which yields +0.0 when x is +0.0 (IEEE 754: +0 - +0 = +0). ARM and WASM flip the sign bit directly, so negating +0.0 yields -0.0. If the sign of zero matters for your algorithm, use bitwise XOR with a sign mask instead of the - operator.
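
The two negation strategies can be compared on a single f32 with the standard library alone. This is a scalar sketch with hypothetical function names, not magetypes code.

```rust
/// Negation written as `0 - x`, the form the table attributes to x86-64.
fn neg_via_sub(x: f32) -> f32 {
    0.0 - x
}

/// Negation by flipping the sign bit (XOR with the f32 sign mask).
fn neg_via_sign_flip(x: f32) -> f32 {
    f32::from_bits(x.to_bits() ^ 0x8000_0000)
}
```

For an input of +0.0, neg_via_sub returns +0.0 while neg_via_sign_flip returns -0.0. For nonzero inputs the two agree.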

Min/Max NaN Propagation

Operation    x86-64                      AArch64 / WASM / Scalar
min(NaN, x)  Returns x (second operand)  Returns x (non-NaN value)
min(x, NaN)  Returns NaN                 Returns x (non-NaN value)

SSE minps/maxps return the second operand whenever either input is NaN, so the result depends on operand order: a NaN in the first position is dropped, while a NaN in the second position is returned. The instruction therefore implements neither "propagate NaN" nor "return non-NaN" consistently. ARM and WASM (and scalar f32::min) always return the non-NaN value regardless of operand order. If your inputs may contain NaN, filter them first or use comparison + blend for consistent behavior.
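
The operand-order dependence is easy to see in scalar form. The sketch below uses only the standard library; both function names are hypothetical. The second function is one possible compare-then-select policy (always propagate NaN), shown as an assumption about what "consistent behavior" might mean for a given algorithm.

```rust
/// Mimics the SSE `minps` rule for one lane: `a < b ? a : b`.
/// Any comparison involving NaN is false, so the second operand wins.
fn min_sse_style(a: f32, b: f32) -> f32 {
    if a < b { a } else { b }
}

/// Compare-then-select minimum that returns NaN whenever either input
/// is NaN, independent of operand order.
fn min_nan_propagating(a: f32, b: f32) -> f32 {
    let smaller = if a < b { a } else { b };
    if a.is_nan() || b.is_nan() { f32::NAN } else { smaller }
}
```

With these definitions, min_sse_style(NaN, 1.0) is 1.0 but min_sse_style(1.0, NaN) is NaN, whereas f32::min returns 1.0 in both cases and min_nan_propagating returns NaN in both.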

FMA vs Separate Multiply-Add

Operation         x86-64 (AVX2+) / AArch64           WASM / Scalar fallback
mul_add(a, b, c)  Fused multiply-add (one rounding)  a * b + c (two roundings)

Hardware FMA computes a × b + c with a single rounding step, while the scalar/WASM fallback rounds the intermediate product before adding c. This matters most when a × b nearly cancels with c: the results can then differ by many ULPs near zero. For most inputs the difference is at most about 1 ULP. Accept small differences in dispatch parity tests.
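
A worked cancellation case makes the difference concrete. The sketch below uses the standard library's f32::mul_add, which rounds once, against a separate multiply and add. The inputs are chosen for illustration and the function name is hypothetical; it is not magetypes code.

```rust
/// Returns (fused, unfused) results of a * a + c for inputs where the
/// product nearly cancels with c.
fn fma_vs_separate() -> (f32, f32) {
    let a = 1.0_f32 + f32::EPSILON;           // 1 + 2^-23
    let c = -(1.0_f32 + 2.0 * f32::EPSILON);  // -(1 + 2^-22), exactly representable

    // Exact product is 1 + 2^-22 + 2^-46.
    let fused = a.mul_add(a, c); // single rounding keeps the 2^-46 term
    let unfused = a * a + c;     // product rounds to 1 + 2^-22, so the sum is 0
    (fused, unfused)
}
```

Here the fused result is 2^-46 (f32::EPSILON squared) and the unfused result is exactly 0.0, so a parity test comparing them needs an absolute tolerance rather than a ULP bound.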

Comparison NaN Semantics (simd_ne)

Operation        x86-64                            AArch64 / WASM
simd_ne(NaN, x)  True (unordered: NaN ≠ anything)  May vary by implementation

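The simd_ne operation uses the hardware's not-equal comparison, which may be "ordered" or "unordered" depending on platform. For portable NaN-aware inequality, compute simd_eq and invert the result with not; because equality with NaN is always false, the inverted result is true whenever either operand is NaN.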
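
The eq-then-not pattern can be checked lane by lane in scalar Rust. This is a sketch with a hypothetical function name that uses arrays in place of vector types; it is not magetypes code.

```rust
/// Lane-wise "not equal" built as NOT(a == b). Equality with NaN is
/// false, so any lane containing NaN reports true.
fn ne_via_eq_not(a: [f32; 4], b: [f32; 4]) -> [bool; 4] {
    let mut out = [false; 4];
    for i in 0..4 {
        out[i] = !(a[i] == b[i]);
    }
    out
}
```

Lanes holding NaN on either side, or on both sides, come out true; equal finite lanes come out false.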

Reduction Associativity (reduce_add)

Backends may group the additions in reduce_add differently: the scalar backend uses a left fold, while hardware backends use a pairwise tree. For inputs with large magnitude differences, floating-point non-associativity causes small relative errors (~1e-6). This is inherent to IEEE 754 and not a bug.
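
The effect of grouping can be shown with four f32 values and two summation orders. This is a scalar sketch with hypothetical function names; the adjacent-pair grouping is an assumption for illustration, and a real backend's tree may pair lanes differently. The input is deliberately extreme so the difference is visible.

```rust
/// Sequential left fold: ((((0 + v0) + v1) + v2) + v3).
fn sum_left_fold(v: [f32; 4]) -> f32 {
    v.iter().fold(0.0_f32, |acc, &x| acc + x)
}

/// Pairwise tree over adjacent lanes: (v0 + v1) + (v2 + v3).
fn sum_pairwise(v: [f32; 4]) -> f32 {
    (v[0] + v[1]) + (v[2] + v[3])
}
```

For the input [1e8, 1.0, -1e8, 1.0], whose exact sum is 2, the left fold returns 1.0 and the pairwise tree returns 0.0. Both are off by a tiny amount relative to the 1e8 magnitude of the inputs, which is why parity tests for reductions should scale their tolerance by input magnitude.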

Rounding Consistency (Fixed in 0.9.16)

As of version 0.9.16, round(), floor(), ceil(), and to_i32_round() produce identical results across all backends including the scalar fallback. Previously, the scalar backend used ties-away-from-zero for rounding while all hardware used ties-to-even (IEEE 754 default). This was fixed by implementing roundevenf in the scalar math library.
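
The two tie-breaking rules mentioned above can be compared with the standard library. This sketch assumes Rust 1.77 or later for f32::round_ties_even; the function name is hypothetical and the code does not call magetypes.

```rust
/// Returns (ties-away-from-zero, ties-to-even) roundings of `x`.
fn round_both_ways(x: f32) -> (f32, f32) {
    (x.round(), x.round_ties_even())
}
```

The rules only disagree on exact halves: 2.5 rounds to 3.0 under ties-away-from-zero and to 2.0 under ties-to-even, while -1.5 rounds to -2.0 under both.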

Summary: Safe Portable Patterns

Operation                          Portable Method
Bitwise AND                        .and()
Bitwise OR                         .or()
Bitwise XOR                        .xor()
Bitwise NOT                        .not()
Sign-extending right shift         .shr_arithmetic::<N>()
Arithmetic, comparisons, rounding  All operators and methods (bit-exact across backends, except the NaN and signed-zero cases above)
Reductions                         All methods (tiny FP associativity differences possible)
FMA (mul_add/mul_sub)              All methods (usually within 1 ULP scalar vs hardware; more under near-zero cancellation)
Transcendentals                    All methods (tolerance-based parity)

The vast majority of magetypes operations are fully portable with identical semantics. The floating-point edge cases listed above affect only NaN, signed zero, and near-zero cancellation scenarios.
