# Types and Dispatch
Magetypes vectors work with archmage's `incant!` macro and the `#[magetypes]` macro for multi-platform dispatch. You write one body; the macros generate the per-tier target-feature contexts, and `incant!` dispatches at runtime.
See magetypes/examples/idiomatic_patterns_all.rs — a single runnable file that exercises every pattern on this page.
## The Default: #[magetypes] + incant!
Write the algorithm once inside #[magetypes]. The macro generates one #[arcane]-wrapped variant per listed tier, substituting Token with the concrete token type. define(...) injects the matching-tier magetypes type aliases at the top of each variant body. incant! dispatches at runtime.
```rust
use archmage::prelude::*;

#[magetypes(define(f32x8), v4, v3, neon, wasm128, scalar)]
fn scale_plane_impl(token: Token, plane: &mut [f32], factor: f32) {
    // `f32x8` is in scope via `define` — resolves to `f32x8<X64V3Token>` in
    // the v3 variant, `f32x8<NeonToken>` in neon, etc.
    let factor_v = f32x8::splat(token, factor);
    let (chunks, tail) = f32x8::partition_slice_mut(token, plane);
    for chunk in chunks {
        (f32x8::load(token, chunk) * factor_v).store(chunk);
    }
    for v in tail { *v *= factor; }
}

pub fn scale_plane(plane: &mut [f32], factor: f32) {
    incant!(scale_plane_impl(plane, factor))
}
```
`define(...)` takes a list: `define(f32x8, u8x16, i16x8)` injects several types. Without `define`, you can write `type f32x8 = ::magetypes::simd::generic::f32x8<Token>;` manually at the top of the body — both forms work.
#[magetypes] IS the #[arcane] wrapper generator. Do not write per-tier #[arcane] wrappers by hand around a generic kernel — the macro already does that for every tier in the list. Each generated variant has its own #[target_feature] region; incant! picks the highest available at runtime.
Polyfills are automatic. f32x8::<NeonToken> emits two 128-bit NEON ops; f32x8::<Wasm128Token> emits two 128-bit SIMD128 ops; f32x8::<X64V3Token> emits one 256-bit AVX2 op. The same body works everywhere. Pick the vector width your algorithm wants — f32x8, f32x16, u8x16 — and the library handles the hardware split.
## Extract a Generic Kernel When You Need Reuse
If the same kernel is called from several entry points (or you want to share it with other crates), extract it as a generic function bounded on a backend trait. The #[magetypes] entry becomes a thin stub that passes its token through. T is inferred from the concrete Token in each tier.
```rust
use archmage::prelude::*;
use magetypes::simd::{
    backends::F32x8Backend,
    generic::f32x8 as GenericF32x8,
};

#[inline(always)] // MANDATORY — see below
fn dot_kernel<T: F32x8Backend>(token: T, a: &[f32], b: &[f32]) -> f32 {
    let mut acc = GenericF32x8::<T>::zero(token);
    let chunks = a.len() / 8;
    for i in 0..chunks {
        let va = GenericF32x8::<T>::load(token, a[i * 8..][..8].try_into().unwrap());
        let vb = GenericF32x8::<T>::load(token, b[i * 8..][..8].try_into().unwrap());
        acc = va.mul_add(vb, acc);
    }
    let mut total = acc.reduce_add();
    for i in (chunks * 8)..a.len() { total += a[i] * b[i]; }
    total
}

#[magetypes(v4, v3, neon, wasm128, scalar)]
fn dot_impl(token: Token, a: &[f32], b: &[f32]) -> f32 {
    dot_kernel(token, a, b) // T inferred — token has concrete type per tier
}

pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    incant!(dot_impl(a, b))
}
```

### Why #[inline(always)] is mandatory on the generic kernel
dot_kernel has no #[target_feature] of its own. It inherits the caller's features through inlining. When dot_impl_v3 (the V3 variant generated by #[magetypes]) calls dot_kernel(token, ...), inlining puts the kernel's body inside the V3 variant's #[target_feature(enable = "avx2,fma,...")] region, and LLVM emits native AVX2 instructions.
Without #[inline(always)], the kernel body stays in its own default-features region, intrinsics become function calls, and performance regresses ~18× even inside a #[magetypes]-generated variant. See magetypes/benches/generic_vs_concrete.rs for the measurement.
## When the Algorithm Genuinely Differs Per Platform
Some kernels benefit from per-platform shapes — different widths, different instruction sequences, different memory layouts. Write one #[arcane] per affected tier. Name each variant fn_<tier> (e.g., fn_v3, fn_neon). incant! resolves by suffix and doesn't care which macro or hand-written function produced which variant.
```rust
use archmage::prelude::*;
use magetypes::simd::generic::{f32x8, f32x4};

// x86-64 v3: native 256-bit f32x8
#[arcane(import_intrinsics)]
fn dot_product_v3(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let va = f32x8::<X64V3Token>::from_array(token, *a);
    let vb = f32x8::<X64V3Token>::from_array(token, *b);
    (va * vb).reduce_add()
}

// AArch64: native 128-bit f32x4, processed in two halves
#[arcane(import_intrinsics)]
fn dot_product_neon(token: NeonToken, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let sum1 = (f32x4::<NeonToken>::from_slice(token, &a[0..4])
        * f32x4::<NeonToken>::from_slice(token, &b[0..4])).reduce_add();
    let sum2 = (f32x4::<NeonToken>::from_slice(token, &a[4..8])
        * f32x4::<NeonToken>::from_slice(token, &b[4..8])).reduce_add();
    sum1 + sum2
}

// Scalar fallback: a plain function, no attribute needed
fn dot_product_scalar(_token: ScalarToken, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

pub fn dot_product(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    incant!(dot_product(a, b), [v3, neon, scalar])
}
```
In almost every case the "default" pattern above is fine — the polyfill path on NEON and WASM is already two-half processing, identical to what you'd hand-write. Reach for per-platform shapes only when a platform has an instruction the generic API can't express (e.g., AVX-512 mask-register shuffles on _v4x, cross-lane NEON permutes, WASM relaxed-SIMD ops).
## Slotting One Hand-Tuned Tier Into an Existing #[magetypes] Family
When only one tier needs hand-tuning, leave the rest to #[magetypes] and add a standalone #[arcane] for the special tier. incant! picks it up by suffix convention — it doesn't care which macro (or hand-writing) produced which variant.
Full recipe — reusing scale_plane_impl from the first section above, adding a _v4x specialization that uses native 512-bit f32x16:
```rust
use archmage::prelude::*;
use magetypes::simd::{
    generic::f32x8 as GenericF32x8,
    generic::f32x16 as GenericF32x16,
};

// Family generated by #[magetypes]: scale_plane_impl_v4, _v3, _neon, _wasm128, _scalar.
// Omit v4x from this list — we hand-tune it below.
#[magetypes(v4, v3, neon, wasm128, scalar)]
fn scale_plane_impl(token: Token, plane: &mut [f32], factor: f32) {
    #[allow(non_camel_case_types)]
    type f32x8 = GenericF32x8<Token>;
    let factor_v = f32x8::splat(token, factor);
    let (chunks, tail) = f32x8::partition_slice_mut(token, plane);
    for chunk in chunks {
        (f32x8::load(token, chunk) * factor_v).store(chunk);
    }
    for v in tail { *v *= factor; }
}

// Hand-tuned _v4x using native 512-bit f32x16 (behind the avx512 feature gate).
// The name `scale_plane_impl_v4x` matches incant!'s suffix convention exactly —
// that's how it joins the family.
#[cfg(all(target_arch = "x86_64", feature = "avx512"))]
#[arcane]
fn scale_plane_impl_v4x(token: X64V4xToken, plane: &mut [f32], factor: f32) {
    let factor_v = GenericF32x16::<X64V4xToken>::splat(token, factor);
    let (chunks, tail) = GenericF32x16::<X64V4xToken>::partition_slice_mut(token, plane);
    for chunk in chunks {
        (GenericF32x16::<X64V4xToken>::load(token, chunk) * factor_v).store(chunk);
    }
    for v in tail { *v *= factor; }
}

// One public entry. The tier list includes the hand-tuned _v4x ahead of the
// #[magetypes]-generated family. incant! routes by suffix.
pub fn scale_plane(plane: &mut [f32], factor: f32) {
    incant!(
        scale_plane_impl(plane, factor),
        [v4x(cfg(avx512)), v4(cfg(avx512)), v3, neon, wasm128, scalar]
    )
}
```
Key rules for this pattern:

- The hand-written function name MUST be `<base>_<tier>`, matching the existing family's prefix (here: `scale_plane_impl_v4x`).
- Do NOT also list that tier in `#[magetypes]`'s tier list — the macro would generate a conflicting `_v4x`. (`v4x` is not in the `#[magetypes]` default list, but if you were hand-tuning `_v3`, you'd write `#[magetypes(v4, neon, wasm128, scalar)]`.)
- The `incant!` tier list at the public entry includes all tiers you want considered — hand-written and generated alike.
## Nested incant! is Zero-Overhead
When incant!(foo(args)) appears inside a tier-annotated body (#[magetypes], #[autoversion], #[arcane], or token-based #[rite]), the outer macro rewrites it to the direct tier-matching call at compile time. No dispatcher branch, no cache probe — the inner function inlines into the outer's #[target_feature] region.
```rust
#[magetypes(v4, v3, neon, wasm128, scalar)]
fn pipeline_impl(token: Token, plane: &mut [f32], bias: f32, factor: f32) {
    #[allow(non_camel_case_types)]
    type f32x8 = GenericF32x8<Token>;
    let bias_v = f32x8::splat(token, bias);
    let (chunks, tail) = f32x8::partition_slice_mut(token, plane);
    for chunk in chunks {
        (f32x8::load(token, chunk) + bias_v).store(chunk);
    }
    for v in tail { *v += bias; }
    // Both of these are rewritten at compile time to direct _v3/_v4/_neon/
    // _wasm128/_scalar calls matching the current variant. Zero dispatcher hops.
    incant!(clamp01_impl(plane));
    incant!(scale_plane_impl(plane, factor));
}
```
See SPEC-INCANT-REWRITING.md for the rewriting rules and measurements (0.94 ns vs 5.6 ns with re-dispatch).
## Passthrough Dispatch
When you already have a token (e.g., inside a generic wrapper) and want to dispatch to specialized variants without re-summoning:
```rust
use archmage::{incant, IntoConcreteToken};

fn process_inner<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 {
    incant!(compute(data) with token, [v3, neon, wasm128, scalar])
}
```
See incant! Passthrough Mode for details.
## The #[rite] Escape Hatch
#[rite(v3, v4, neon, wasm128)] generates per-tier copies of a function with #[target_feature] + #[inline] attached directly — no wrapper, no optimization boundary. Callable by suffix from matching contexts.
#[rite] does NOT substitute Token. Each variant is the same body with a different #[target_feature]; the signature is taken verbatim (tokenless or with a concrete token type). #[rite] has no scalar tier — scalar has no features to enable. If you need scalar in a dispatch, use #[magetypes] or write a plain fn foo_scalar(_: ScalarToken, ...) yourself.
Reach for #[rite] for short inner helpers inside an #[arcane] or #[magetypes] body where you want explicit #[target_feature] control without delegating through the generic-bound pattern. For most magetypes code, extracted generic kernels (above) are cleaner.
## Choosing the Pattern
| You have | Use |
|---|---|
| One algorithm, works on every platform | #[magetypes] + incant! — inline body with GenericF32x8<Token> |
| One algorithm reused from multiple entry points | Extracted #[inline(always)] fn<T: F32x8Backend> + #[magetypes] + incant! |
| Algorithm genuinely differs per platform | Hand-written #[arcane] per tier + incant! |
| Mostly uniform + one tier needs hand-tuning | #[magetypes] for most + standalone #[arcane] for the one tier — both routed by one incant! |
| Scalar loop the compiler auto-vectorizes well | #[autoversion] (own dispatcher — no incant! needed) |
| Inner helper called from a specific tier's context | #[rite(v3)] or #[rite(v3, v4, neon, wasm128)] multi-tier |