#[autoversion]

#[autoversion] generates architecture-specific function variants and a runtime dispatcher from a single annotated function. Write scalar code, let the compiler auto-vectorize it for each target.

Quick start

use archmage::autoversion;

#[autoversion]
fn sum_of_squares(data: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for &x in data {
        sum += x * x;
    }
    sum
}

// Call directly — no token, no unsafe:
let result = sum_of_squares(&my_data);

The macro generates one variant per architecture tier — each compiled with #[target_feature] via #[arcane] — plus a dispatcher that calls the best variant the CPU supports.

On x86-64 with AVX2+FMA, that loop compiles to vfmadd231ps — a fused multiply-add across 8 floats per instruction. On aarch64 with NEON, you get fmla. The _scalar fallback compiles without SIMD target features as a safety net.
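To make the expansion concrete, here is a simplified hand-written sketch of the shape the macro produces, using only std. This is an approximation, not the macro's literal output: the real variants take token parameters and cover more tiers.

```rust
// Scalar fallback: compiled without extra target features.
fn sum_of_squares_scalar(data: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for &x in data {
        sum += x * x;
    }
    sum
}

// x86-64 variant: same body, compiled with AVX2+FMA enabled so the
// compiler is free to auto-vectorize it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn sum_of_squares_v3(data: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for &x in data {
        sum += x * x;
    }
    sum
}

// Dispatcher: runtime feature detection picks the best variant.
fn sum_of_squares(data: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        // Safe: the required CPU features were just verified.
        return unsafe { sum_of_squares_v3(data) };
    }
    sum_of_squares_scalar(data)
}

fn main() {
    println!("{}", sum_of_squares(&[1.0, 2.0, 3.0])); // 14
}
```

The real macro wraps this bookkeeping (per-tier cfg gates, token plumbing, tier ordering) so you only write the scalar body once.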

SimdToken — deprecated placeholder

Deprecated since 0.9.11. Use the tokenless form (above) or ScalarToken for incant! nesting.

The legacy _token: SimdToken parameter is still recognized — the macro strips it from the dispatcher. But SimdToken is a trait, not a type, so the parameter was never valid Rust on its own. New code should not use it.

What gets generated

With default tiers, #[autoversion] fn process(data: &[f32]) -> f32 generates:

| Function | Token | Architecture | Feature gate |
|---|---|---|---|
| process_v4 | X64V4Token | x86-64 | avx512 |
| process_v3 | X64V3Token | x86-64 | |
| process_neon | NeonToken | aarch64 | |
| process_wasm128 | Wasm128Token | wasm32 | |
| process_scalar | ScalarToken | all | |
| process | | all | |

The dispatcher (process) does runtime CPU feature detection via Token::summon() and calls the best match. When compiled with -Ctarget-cpu=native, detection compiles away entirely.

 flowchart TD
    CALL["process(data)"] --> V4{"X64V4Token::summon()?"}
    V4 -->|Some| PV4["process_v4(token, data)"]
    V4 -->|None / wrong arch| V3{"X64V3Token::summon()?"}
    V3 -->|Some| PV3["process_v3(token, data)"]
    V3 -->|None / wrong arch| NEON{"NeonToken::summon()?"}
    NEON -->|Some| PN["process_neon(token, data)"]
    NEON -->|None / wrong arch| WASM{"Wasm128Token::summon()?"}
    WASM -->|Some| PW["process_wasm128(token, data)"]
    WASM -->|None / wrong arch| PS["process_scalar(ScalarToken, data)"]

    style CALL fill:#5a3d1e,color:#fff
    style PV4 fill:#2d5a27,color:#fff
    style PV3 fill:#2d5a27,color:#fff
    style PN fill:#2d5a27,color:#fff
    style PW fill:#2d5a27,color:#fff
    style PS fill:#1a4a6e,color:#fff

Variants are always private — only the dispatcher gets the original function's visibility. Within the same module, variants are accessible for testing and benchmarking.

Variants for other architectures are excluded at compile time by #[cfg(target_arch)]. On x86-64, only _v4, _v3, and _scalar exist in the binary.
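The compile-away behavior can be illustrated with std's cfg!(target_feature = ...) macro, which is a compile-time boolean. This is a plain-Rust illustration of the principle, not the dispatcher's actual implementation:

```rust
// cfg!(target_feature = "avx2") is resolved at compile time: it is `true`
// when the build statically enables AVX2 (e.g. -Ctarget-cpu=native on an
// AVX2 machine), so the optimizer can delete the runtime probe entirely
// and the dispatcher collapses to a direct call.
fn best_x86_tier() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if cfg!(target_feature = "avx2") || is_x86_feature_detected!("avx2") {
            return "v3";
        }
    }
    "scalar"
}

fn main() {
    println!("best tier: {}", best_x86_tier());
}
```

On non-x86 targets the cfg'd block disappears at compile time, mirroring how variants for other architectures never reach the binary.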

Name collision patterns

#[autoversion] generates functions named {fn_name}_{tier_suffix}. If you have other functions with those names, you get collisions:

// COLLISION: both autoversion AND you define process_v3
#[autoversion]
fn process(data: &[f32]) -> f32 { data.iter().sum() }

#[arcane]
fn process_v3(_token: X64V3Token, data: &[f32]) -> f32 { /* hand-written */ }
// ERROR: duplicate definition of process_v3

Resolution: Use different base names. For the pattern where you want hand-written SIMD for some tiers and auto-vectorization for others, see Nesting with incant!.

Generated names by tier

| Tier | Suffix | Example |
|---|---|---|
| v1 | _v1 | process_v1 |
| v2 | _v2 | process_v2 |
| x64_crypto | _x64_crypto | process_x64_crypto |
| v3 | _v3 | process_v3 |
| v3_crypto | _v3_crypto | process_v3_crypto |
| v4 | _v4 | process_v4 |
| v4x | _v4x | process_v4x |
| neon | _neon | process_neon |
| arm_v2 | _arm_v2 | process_arm_v2 |
| arm_v3 | _arm_v3 | process_arm_v3 |
| neon_aes | _neon_aes | process_neon_aes |
| neon_sha3 | _neon_sha3 | process_neon_sha3 |
| neon_crc | _neon_crc | process_neon_crc |
| wasm128 | _wasm128 | process_wasm128 |
| wasm128_relaxed | _wasm128_relaxed | process_wasm128_relaxed |
| scalar | _scalar | process_scalar |
| default | _default | process_default |

scalar and default are mutually exclusive fallback tiers. scalar passes ScalarToken; default is tokenless.

Nesting with incant!

Hand-written SIMD for specific tiers, autoversioned auto-vectorization for the rest. Two approaches:

The default tier calls _default(args) without any token — a direct match for a tokenless autoversion dispatcher:

use archmage::prelude::*;

pub fn process(data: &[f32]) -> f32 {
    incant!(process(data), [v4, default])
}

/// Hand-written AVX-512 — #[arcane] handles #[cfg(target_arch)]
#[arcane(import_intrinsics)]
fn process_v4(_token: X64V4Token, data: &[f32]) -> f32 {
    // ... AVX-512 intrinsics ...
    todo!()
}

/// Auto-vectorized fallback — gets V3/NEON for free
#[autoversion(v3, neon)]
fn process_default(data: &[f32]) -> f32 {
    data.iter().sum()
}

incant! tries V4 first. If unavailable, calls process_default(data) — no token, no bridge. The autoversion dispatcher internally tries V3 → NEON → scalar.

ScalarToken param (alternative)

If you prefer the scalar tier name, give your autoversioned function a ScalarToken parameter. Autoversion keeps it in the dispatcher, matching what incant! passes:

pub fn process(data: &[f32]) -> f32 {
    incant!(process(data), [v4, scalar])
}

#[arcane(import_intrinsics)]
fn process_v4(_token: X64V4Token, data: &[f32]) -> f32 { todo!() }

#[autoversion(v3, neon)]
fn process_scalar(_: ScalarToken, data: &[f32]) -> f32 {
    data.iter().sum()
}

Why not just tokenless _scalar?

A tokenless #[autoversion] fn process_scalar(data: &[f32]) generates a dispatcher with no token parameter. But incant! calls process_scalar(ScalarToken, data) — signature mismatch. Use default or ScalarToken to avoid this.

Explicit tiers

#[autoversion(v3, neon)]
fn process(data: &[f32]) -> f32 { ... }

This generates only the listed tiers, plus scalar (always implicit). Use this when you don't need every platform, or when you want tiers beyond the defaults. Tier names accept a leading underscore — _v3 is identical to v3, matching the suffix pattern on generated names.

Default tiers (when no list given): v4, v3, neon, wasm128, scalar.

Tier list modifiers

Instead of replacing the entire default list, use + and - to modify it:

// Add arm_v2 to defaults
#[autoversion(+arm_v2)]
fn process(data: &[f32]) -> f32 { ... }

// Remove tiers you don't need
#[autoversion(-neon, -wasm128)]
fn process(data: &[f32]) -> f32 { ... }

// Make v4 unconditional (overrides the default avx512 gate)
#[autoversion(+v4)]
fn process(data: &[f32]) -> f32 { ... }

// Combine freely
#[autoversion(-wasm128, +arm_v2)]
fn process(data: &[f32]) -> f32 { ... }

All entries must be +/- (modifier mode) or none (override mode) — mixing is a compile error.

Feature-gated tiers

Per-tier gates: tier(cfg(feature))

#[autoversion(v4(cfg(avx512)), v3, neon)]
fn process(data: &[f32]) -> f32 { ... }

The v4 variant and its dispatch arm are wrapped in #[cfg(feature = "avx512")] — checked against the calling crate's features, not archmage's. If the crate doesn't define avx512, v4 is silently excluded.

The shorthand v4(avx512) also works and produces identical output. The cfg() form is canonical.

Whole-dispatch gate: cfg(feature)

#[autoversion(cfg(simd))]
fn process(data: &[f32]) -> f32 { ... }

With --features simd: full dispatch (v4 → v3 → neon → wasm128 → scalar). Without: direct scalar call, zero dispatch overhead.

Methods

Inherent methods — self works naturally

impl ImageBuffer {
    #[autoversion]
    fn normalize(&mut self, gamma: f32) {
        for pixel in &mut self.data {
            *pixel = (*pixel / 255.0).powf(gamma);
        }
    }
}

buffer.normalize(2.2);

All receiver types work: self, &self, &mut self. The generated variants use #[arcane] in sibling mode where self/Self resolve naturally.

_self = Type — for trait method delegation

Required when you need #[arcane]'s nested mode (trait impls can't expand to sibling functions):

impl MyType {
    #[autoversion(_self = MyType)]
    fn compute_impl(&self, data: &[f32]) -> f32 {
        _self.weights.iter().zip(data).map(|(w, d)| w * d).sum()
    }
}

Use _self (not self) in the body. Non-scalar variants get #[arcane(_self = Type)]; the scalar variant gets let _self = self; injected.

Trait methods — delegation pattern

Trait methods can't use #[autoversion] directly. Delegate:

trait Processor {
    fn process(&self, data: &[f32]) -> f32;
}

impl Processor for MyType {
    fn process(&self, data: &[f32]) -> f32 {
        self.process_impl(data)
    }
}

impl MyType {
    #[autoversion]
    fn process_impl(&self, data: &[f32]) -> f32 {
        self.weights.iter().zip(data).map(|(w, d)| w * d).sum()
    }
}

Const generics

#[autoversion]
fn sum_array<const N: usize>(data: &[f32; N]) -> f32 {
    let mut sum = 0.0f32;
    for &x in data { sum += x; }
    sum
}

let result = sum_array(&[1.0, 2.0, 3.0, 4.0]);
// Variant call with turbofish:
let s = sum_array_scalar::<4>(ScalarToken, &data);

Works with multiple const generics, type generics, lifetimes, and all combinations.
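A plain-Rust sketch (without the macro; the variant names mirror the convention above but are hand-written here) of how a const generic flows through variants and dispatcher, matching the turbofish call shown earlier:

```rust
// Scalar variant: the const parameter N carries through unchanged.
fn sum_array_scalar<const N: usize>(data: &[f32; N]) -> f32 {
    let mut sum = 0.0f32;
    for &x in data {
        sum += x;
    }
    sum
}

// x86-64 variant with AVX2 enabled for auto-vectorization.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_array_v3<const N: usize>(data: &[f32; N]) -> f32 {
    let mut sum = 0.0f32;
    for &x in data {
        sum += x;
    }
    sum
}

// Dispatcher: generic parameters are forwarded to each variant.
fn sum_array<const N: usize>(data: &[f32; N]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        return unsafe { sum_array_v3::<N>(data) };
    }
    sum_array_scalar::<N>(data)
}

fn main() {
    println!("{}", sum_array(&[1.0, 2.0, 3.0, 4.0])); // 10
}
```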

Benchmarking variants

Variants are real functions. Call them directly to measure auto-vectorization speedup:

use criterion::{Criterion, black_box, criterion_group, criterion_main};

#[autoversion]
fn sum_squares(data: &[f32]) -> f32 {
    data.iter().map(|&x| x * x).fold(0.0f32, |a, b| a + b)
}

fn bench(c: &mut Criterion) {
    let data: Vec<f32> = (0..4096).map(|i| i as f32 * 0.01).collect();
    let mut group = c.benchmark_group("sum_squares");

    // Dispatched — picks best available at runtime
    group.bench_function("dispatched", |b| {
        b.iter(|| sum_squares(black_box(&data)))
    });

    // Scalar baseline
    group.bench_function("scalar", |b| {
        b.iter(|| sum_squares_scalar(archmage::ScalarToken, black_box(&data)))
    });

    // Specific tier
    #[cfg(target_arch = "x86_64")]
    if let Some(t) = archmage::X64V3Token::summon() {
        group.bench_function("v3_avx2_fma", |b| {
            b.iter(|| sum_squares_v3(t, black_box(&data)));
        });
    }

    group.finish();
}

criterion_group!(benches, bench);
criterion_main!(benches);

For tight numeric loops on x86-64, the _v3 variant typically runs 4-8x faster than _scalar.

When to use what

| | #[autoversion] | #[magetypes] + incant! | Manual dispatch |
|---|---|---|---|
| Generates variants | Yes | Yes (#[magetypes]) | No |
| Generates dispatcher | Yes | No (incant! separately) | No |
| Body touched | No (signature only) | Yes (text substitution) | N/A |
| Best for | Scalar auto-vectorization | Hand-written SIMD types | Different algorithms per arch |
| Lines of code | 1 attribute | 2+ | Many |

Use #[autoversion] for scalar code the compiler can auto-vectorize — tight numeric loops, element-wise transforms, reductions.

Use #[magetypes] + incant! when you need f32x8, u8x32, and hand-tuned SIMD per architecture.

Use manual dispatch when different tiers need fundamentally different algorithms.

Nest them when some tiers need hand-written intrinsics and others benefit from auto-vectorization.

Found an error, or does something need clarification? Open an issue on GitHub.
Substantiated corrections will be incorporated with attribution.