Archmage & Magetypes

Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.

Archmage makes SIMD programming in Rust safe and ergonomic. Instead of scattering unsafe blocks throughout your code, you prove CPU feature availability once with a capability token, then write safe code that the compiler optimizes into raw SIMD instructions.

Magetypes provides SIMD vector types (f32x8, i32x4, etc.) with natural Rust operators that integrate with archmage tokens.

Zero Overhead

Archmage is never slower than equivalent unsafe code. The safety abstractions exist only at compile time. At runtime, you get the exact same assembly as hand-written #[target_feature] + unsafe code.

Benchmark: 1000 iterations of 8-float vector operations
  Manual unsafe code:     570 ns
  #[rite] in #[arcane]:   572 ns  ← identical
  #[arcane] in loop:     2320 ns  ← wrong pattern (see below)

The key is using the right pattern: put loops inside #[arcane], use #[rite] for helpers. See Token Hoisting and The #[rite] Macro.

The Problem

Raw SIMD in Rust requires unsafe:

#![allow(unused)]
fn main() {
use std::arch::x86_64::*;

// Every. Single. Call.
unsafe {
    let a = _mm256_loadu_ps(data.as_ptr());
    let b = _mm256_set1_ps(2.0);
    let c = _mm256_mul_ps(a, b);
    _mm256_storeu_ps(out.as_mut_ptr(), c);
}
}

This is tedious and error-prone. Miss a feature check? Undefined behavior on older CPUs.

The Solution

Archmage separates proof of capability from use of capability:

use archmage::prelude::*;

#[arcane]
fn multiply(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // Safe! Token proves AVX2+FMA, safe_unaligned_simd takes references
    let a = _mm256_loadu_ps(data);
    let b = _mm256_set1_ps(2.0);
    let c = _mm256_mul_ps(a, b);
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(&mut out, c);
    out
}

fn main() {
    // Runtime check happens ONCE here
    if let Some(token) = Desktop64::summon() {
        let result = multiply(token, &[1.0; 8]);
        println!("{:?}", result);
    }
}

Key Concepts

  1. Tokens are zero-sized proof types. Desktop64::summon() returns Some(token) only if the CPU supports AVX2+FMA.

  2. #[arcane] generates a #[target_feature] inner function. Inside, SIMD intrinsics are safe.

  3. Token hoisting: Call summon() once at your API boundary, pass the token through. Don't summon in hot loops.

Supported Platforms

Platform  | Tokens                                                 | Register Width
x86-64    | X64V2Token, X64V3Token/Desktop64, X64V4Token/Server64  | 128-512 bit
AArch64   | NeonToken/Arm64, NeonAesToken, NeonSha3Token           | 128 bit
WASM      | Simd128Token                                           | 128 bit

Next Steps

Installation

Add archmage to your Cargo.toml:

[dependencies]
archmage = "0.4"

For SIMD vector types with natural operators, also add magetypes:

[dependencies]
archmage = "0.4"
magetypes = "0.4"

Feature Flags

archmage

Feature  | Default | Description
std      |         | Standard library support
macros   |         | #[arcane], incant!, etc.
avx512   |         | AVX-512 token support

magetypes

Feature  | Default | Description
std      |         | Standard library support
avx512   |         | 512-bit types (f32x16, etc.)
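
To use the AVX-512 tokens and 512-bit types, enable the avx512 feature on both crates (a sketch, reusing the version numbers from Installation above):

[dependencies]
archmage = { version = "0.4", features = ["avx512"] }
magetypes = { version = "0.4", features = ["avx512"] }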

Platform Requirements

x86-64

Works out of the box. Tokens detect CPU features at runtime.

For compile-time optimization on known hardware:

RUSTFLAGS="-Ctarget-cpu=native" cargo build --release

AArch64

NEON is baseline on 64-bit ARM—NeonToken::summon() always succeeds.

WASM

Enable SIMD128 in your build:

RUSTFLAGS="-Ctarget-feature=+simd128" cargo build --target wasm32-unknown-unknown

Verify Installation

use archmage::{SimdToken, Desktop64, Arm64};

fn main() {
    // Tokens compile everywhere - summon() returns None on unsupported platforms
    match Desktop64::summon() {
        Some(token) => println!("{} available!", token.name()),
        None => println!("No AVX2+FMA"),
    }

    match Arm64::summon() {
        Some(token) => println!("{} available!", token.name()),
        None => println!("No NEON"),
    }
}

Run it:

cargo run

On x86-64 (Haswell+/Zen+): "X64V3 available!" and "No NEON". On AArch64: "No AVX2+FMA" and "Neon available!".

Your First SIMD Function

Let's write a function that squares 8 floats in parallel using AVX2.

Use archmage::prelude::* which includes safe_unaligned_simd for memory operations:

use archmage::prelude::*;

#[arcane]
fn square_f32x8(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // safe_unaligned_simd takes references - fully safe!
    let v = _mm256_loadu_ps(data);
    let squared = _mm256_mul_ps(v, v);

    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(&mut out, squared);
    out
}

fn main() {
    if let Some(token) = Desktop64::summon() {
        let input = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
        let output = square_f32x8(token, &input);
        println!("{:?}", output);
        // [1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0]
    } else {
        println!("AVX2 not available");
    }
}

Using magetypes

For the most ergonomic experience, use magetypes' vector types:

use archmage::{Desktop64, SimdToken};
use magetypes::simd::f32x8;

fn square_f32x8(token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    let v = f32x8::from_array(token, *data);
    let squared = v * v;  // Natural operator!
    squared.to_array()
}

fn main() {
    if let Some(token) = Desktop64::summon() {
        let input = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
        let output = square_f32x8(token, &input);
        println!("{:?}", output);
    }
}

What #[arcane] Does

The macro transforms your function:

#![allow(unused)]
fn main() {
// You write:
#[arcane]
fn square(token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // body
}

// Macro generates:
fn square(token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
    #[inline]
    unsafe fn __inner(token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
        // body - intrinsics are safe here!
    }
    // SAFETY: token proves CPU support
    unsafe { __inner(token, data) }
}
}

The token parameter proves you checked CPU features. The macro enables those features for the inner function, making intrinsics safe to call.

Key Points

  1. Desktop64 = AVX2 + FMA + BMI1 + BMI2 (Haswell 2013+, Zen 1+)
  2. summon() does runtime CPU detection
  3. #[arcane] makes intrinsics safe inside the function
  4. Token is zero-sized — no runtime overhead passing it around

Understanding Tokens

Tokens are the core of archmage's safety model. They're zero-sized proof types that demonstrate CPU feature availability.

The Token Hierarchy

x86-64 Tokens

Token              | Alias                    | Features                  | CPUs
X64V2Token         |                          | SSE4.2, POPCNT            | Nehalem 2008+
X64V3Token         | Desktop64, Avx2FmaToken  | + AVX2, FMA, BMI1, BMI2   | Haswell 2013+, Zen 1+
X64V4Token         | Server64, Avx512Token    | + AVX-512 F/BW/CD/DQ/VL   | Skylake-X 2017+, Zen 4+
Avx512ModernToken  |                          | + VNNI, VBMI, etc.        | Ice Lake 2019+, Zen 4+

AArch64 Tokens

Token          | Alias  | Features
NeonToken      | Arm64  | NEON (baseline, always available)
NeonAesToken   |        | + AES instructions
NeonSha3Token  |        | + SHA3 instructions
NeonCrcToken   |        | + CRC instructions

WASM Token

Token         | Features
Simd128Token  | WASM SIMD128

Summoning Tokens

#![allow(unused)]
fn main() {
use archmage::{Desktop64, SimdToken};

// Runtime detection
if let Some(token) = Desktop64::summon() {
    // CPU has AVX2+FMA
    process_simd(token, data);
} else {
    // Fallback
    process_scalar(data);
}
}

Compile-Time Guarantees

Check if detection is needed:

#![allow(unused)]
fn main() {
use archmage::{Desktop64, SimdToken};

match Desktop64::guaranteed() {
    Some(true) => {
        // Compiled with -Ctarget-cpu=haswell or higher
        // summon() will always succeed, check is elided
        let token = Desktop64::summon().unwrap();
    }
    Some(false) => {
        // Wrong architecture (e.g., running on ARM)
        // summon() will always return None
    }
    None => {
        // Runtime check needed
        if let Some(token) = Desktop64::summon() {
            // ...
        }
    }
}
}

ScalarToken: The Fallback

ScalarToken always succeeds—it's for fallback paths:

#![allow(unused)]
fn main() {
use archmage::{ScalarToken, SimdToken};

// Always works
let token = ScalarToken::summon().unwrap();

// Or just construct it directly (it's a unit struct)
let token = ScalarToken;
}

Token Properties

Tokens are:

  • Zero-sized: No runtime cost to pass around
  • Copy + Clone: Pass by value freely
  • Send + Sync: Safe to share across threads
  • 'static: Can be stored in static variables
#![allow(unused)]
fn main() {
// Zero-sized
assert_eq!(std::mem::size_of::<Desktop64>(), 0);

// Copy
fn takes_token(token: Desktop64) {
    let copy = token;  // No move, just copy
    use_both(token, copy);
}
}
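
Because tokens are Send + Sync and 'static, a summoned token can be cached in a static and copied into worker threads. A minimal sketch (archmage already caches detection internally, so the OnceLock here only illustrates the properties):

#![allow(unused)]
fn main() {
use archmage::{Desktop64, SimdToken};
use std::sync::OnceLock;

// 'static: the zero-sized token can live in a static
static DESKTOP: OnceLock<Option<Desktop64>> = OnceLock::new();

fn cached_desktop() -> Option<Desktop64> {
    *DESKTOP.get_or_init(|| Desktop64::summon())
}

// Send + Sync + Copy: the same token value can be used from any thread
std::thread::scope(|s| {
    for _ in 0..4 {
        s.spawn(|| {
            if let Some(token) = cached_desktop() {
                // pass `token` into #[arcane]/#[rite] functions as usual
                let _ = token;
            }
        });
    }
});
}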

Downcasting Tokens

Higher tokens can be used where lower ones are expected:

#![allow(unused)]
fn main() {
#[arcane]
fn needs_v3(token: X64V3Token, data: &[f32]) { /* ... */ }

if let Some(v4) = X64V4Token::summon() {
    // V4 is a superset of V3 — this works and inlines
    needs_v3(v4, &data);
}
}

V4 includes all V3 features, so the token is valid proof.

Trait Bounds

For generic code, use tier traits:

#![allow(unused)]
fn main() {
use archmage::HasX64V2;

fn process<T: HasX64V2>(token: T, data: &[f32]) {
    // Works with X64V2Token, X64V3Token, X64V4Token, etc.
}
}

Available traits:

  • HasX64V2 — SSE4.2 tier
  • HasX64V4 — AVX-512 tier (requires avx512 feature)
  • HasNeon — NEON baseline
  • HasNeonAes, HasNeonSha3 — NEON extensions
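
Any token at or above the required tier satisfies the bound, so callers pass whatever they summoned. A minimal sketch (the sum_all helper is hypothetical):

#![allow(unused)]
fn main() {
use archmage::{HasX64V2, X64V2Token, X64V3Token, SimdToken};

fn sum_all<T: HasX64V2>(_token: T, data: &[f32]) -> f32 {
    // Accepts X64V2Token, X64V3Token, X64V4Token, ...
    data.iter().sum()
}

let data = [1.0f32, 2.0, 3.0];
if let Some(t) = X64V3Token::summon() {
    let _ = sum_all(t, &data);      // a V3 token satisfies the V2 bound
} else if let Some(t) = X64V2Token::summon() {
    let _ = sum_all(t, &data);
}
}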

Compile-Time vs Runtime

Understanding when feature detection happens—and how LLVM optimizes across feature boundaries—is crucial for writing correct and fast SIMD code.

The Mechanisms

Mechanism                          | When     | What It Does
#[cfg(target_arch = "...")]        | Compile  | Include/exclude code from binary
#[cfg(target_feature = "...")]     | Compile  | True only if feature is in target spec
#[cfg(feature = "...")]            | Compile  | Cargo feature flag
#[target_feature(enable = "...")]  | Compile  | Tell LLVM to use these instructions in this function
-Ctarget-cpu=native                | Compile  | LLVM assumes current CPU's features globally
Token::summon()                    | Runtime  | CPUID instruction, returns Option<Token>

The Key Insight: #[target_feature(enable)]

This is the mechanism that makes SIMD work. It tells LLVM: "Inside this function, assume these CPU features are available."

#![allow(unused)]
fn main() {
#[target_feature(enable = "avx2,fma")]
unsafe fn process_avx2(data: &[f32; 8]) -> f32 {
    // LLVM generates AVX2 instructions here
    // _mm256_* intrinsics compile to single instructions
    let v = _mm256_loadu_ps(data.as_ptr());
    // ...
}
}

Why unsafe? The function emits AVX2 instructions, but the compiler can't verify that the caller checked for AVX2. If you call it on a CPU without AVX2, you get an illegal instruction fault. The unsafe is the contract: "caller must ensure CPU support."

This is what #[arcane] does for you:

#![allow(unused)]
fn main() {
// You write:
#[arcane]
fn process(token: Desktop64, data: &[f32; 8]) -> f32 { /* ... */ }

// Macro generates:
fn process(token: Desktop64, data: &[f32; 8]) -> f32 {
    #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
    #[inline]
    unsafe fn __inner(token: Desktop64, data: &[f32; 8]) -> f32 { /* ... */ }

    // SAFETY: Token existence proves summon() succeeded
    unsafe { __inner(token, data) }
}
}

The token proves the runtime check happened. The inner function gets LLVM's optimizations.

LLVM Optimization and Feature Boundaries

Archmage is never slower than equivalent unsafe code. When you use the right patterns (#[rite] helpers called from #[arcane]), the generated assembly is identical to hand-written #[target_feature] + unsafe code.

Here's why: LLVM's optimization passes respect #[target_feature] boundaries.

Same Features = Full Optimization

When caller and callee have the same target features, LLVM can:

  • Inline fully
  • Propagate constants
  • Eliminate redundant loads/stores
  • Combine operations across the call boundary
#![allow(unused)]
fn main() {
#[arcane]
fn outer(token: Desktop64, data: &[f32; 8]) -> f32 {
    inner(token, data) * 2.0  // Inlines perfectly!
}

#[arcane]
fn inner(token: Desktop64, data: &[f32; 8]) -> f32 {
    let v = f32x8::from_array(token, *data);
    v.reduce_add()
}
}

Both functions have #[target_feature(enable = "avx2,fma,...")]. LLVM sees one optimization region.

Different Features = Optimization Boundary

When features differ, LLVM must be conservative:

#![allow(unused)]
fn main() {
#[arcane]
fn v4_caller(token: X64V4Token, data: &[f32; 8]) -> f32 {
    // token: X64V4Token → avx512f,avx512bw,...
    v3_helper(token, data)  // Different features!
}

#[arcane]
fn v3_helper(token: X64V3Token, data: &[f32; 8]) -> f32 {
    // token: X64V3Token → avx2,fma,...
    // Different target_feature set = optimization boundary
}
}

This still works—V4 is a superset of V3—but LLVM can't fully inline across the boundary because the #[target_feature] annotations differ.

Generic Bounds = Optimization Boundary

Generics create the same problem:

#![allow(unused)]
fn main() {
#[arcane]
fn generic_impl<T: HasX64V2>(token: T, data: &[f32]) -> f32 {
    // LLVM doesn't know what T's features are at compile time
    // Must generate conservative code that works for any HasX64V2
}
}

The compiler generates one version of this function for the trait bound, not specialized versions for each concrete token. This prevents inlining and vectorization across the boundary.

Rule: Use concrete tokens for hot paths.

Downcasting vs Upcasting

Downcasting: Free

Higher tokens can be used where lower tokens are expected:

#![allow(unused)]
fn main() {
#[arcane]
fn v4_kernel(token: X64V4Token, data: &[f32; 8]) -> f32 {
    // V4 → V3 is free: just passing token, same LLVM features (superset)
    v3_sum(token, data)  // Desktop64 accepts X64V4Token
}

#[arcane]
fn v3_sum(token: Desktop64, data: &[f32; 8]) -> f32 {
    // ...
}
}

This works because X64V4Token has all the features of Desktop64 plus more. LLVM's target features are a superset, so optimization flows freely.

Upcasting: Safe but Creates Boundary

Going the other direction requires IntoConcreteToken:

#![allow(unused)]
fn main() {
fn dispatch<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 {
    if let Some(v4) = token.as_x64v4() {
        v4_path(v4, data)  // Uses AVX-512 if available
    } else if let Some(v3) = token.as_x64v3() {
        v3_path(v3, data)  // Falls back to AVX2
    } else {
        scalar_path(data)
    }
}
}

This is safe: as_x64v4() returns None if the token doesn't support V4. But it creates an optimization boundary because the generic T becomes a concrete type at the branch point.

Don't upcast in hot loops. Upcast once at your dispatch point, then pass concrete tokens through.
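
Concretely: do the upcast once in the generic entry point, and keep the hot loop behind a concrete token. A sketch of that shape (sum_rows and sum_rows_v3 are illustrative names):

#![allow(unused)]
fn main() {
use archmage::{arcane, IntoConcreteToken, X64V3Token};
use magetypes::simd::f32x8;

fn sum_rows<T: IntoConcreteToken>(token: T, rows: &[[f32; 8]]) -> f32 {
    // Upcast ONCE here, outside the loop...
    if let Some(v3) = token.as_x64v3() {
        sum_rows_v3(v3, rows)
    } else {
        rows.iter().flatten().sum()
    }
}

#[arcane]
fn sum_rows_v3(token: X64V3Token, rows: &[[f32; 8]]) -> f32 {
    // ...so the hot loop only ever sees a concrete token
    let mut acc = 0.0;
    for row in rows {
        acc += f32x8::from_array(token, *row).reduce_add();
    }
    acc
}
}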

The #[rite] Optimization

#[rite] exists to eliminate wrapper overhead for inner helpers:

#![allow(unused)]
fn main() {
// #[arcane] creates a wrapper:
fn entry(token: Desktop64, data: &[f32; 8]) -> f32 {
    #[target_feature(enable = "avx2,fma,...")]
    unsafe fn __inner(...) { ... }
    unsafe { __inner(...) }
}

// #[rite] is the function directly:
#[target_feature(enable = "avx2,fma,...")]
#[inline]
fn helper(token: Desktop64, data: &[f32; 8]) -> f32 { ... }
}

Since Rust 1.85+, calling a #[target_feature] function from a matching context is safe. So #[arcane] can call #[rite] helpers without unsafe:

#![allow(unused)]
fn main() {
#[arcane]
fn entry(token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let prod = mul_vectors(token, a, b);  // Calls #[rite], no unsafe!
    horizontal_sum(token, prod)
}

#[rite]
fn mul_vectors(_: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> __m256 { ... }

#[rite]
fn horizontal_sum(_: Desktop64, v: __m256) -> f32 { ... }
}

All three functions share the same #[target_feature] annotation. LLVM sees one optimization region.

When Detection Compiles Away

With -Ctarget-cpu=native or -Ctarget-cpu=haswell:

#![allow(unused)]
fn main() {
// When compiled with -Ctarget-cpu=haswell:
// - #[cfg(target_feature = "avx2")] is TRUE
// - X64V3Token::guaranteed() returns Some(true)
// - summon() becomes a no-op
// - LLVM eliminates the branch entirely

if let Some(token) = X64V3Token::summon() {
    // This branch is the only code generated
}
}

Check programmatically:

#![allow(unused)]
fn main() {
match X64V3Token::guaranteed() {
    Some(true) => println!("Compile-time guaranteed"),
    Some(false) => println!("Wrong architecture"),
    None => println!("Runtime check needed"),
}
}

Common Mistakes

Mistake 1: Generic Bounds in Hot Paths

#![allow(unused)]
fn main() {
// SLOW: Generic creates optimization boundary
fn process<T: HasX64V2>(token: T, data: &[f32]) -> f32 {
    // Called millions of times, can't inline properly
}

// FAST: Concrete token, full optimization
fn process(token: Desktop64, data: &[f32]) -> f32 {
    // LLVM knows exact features, inlines everything
}
}

Mistake 2: Assuming #[cfg(target_feature)] Is Runtime

#![allow(unused)]
fn main() {
// WRONG: This is compile-time, not runtime!
#[cfg(target_feature = "avx2")]
fn maybe_avx2() {
    // This function doesn't exist unless compiled with -Ctarget-cpu=haswell
}

// RIGHT: Use tokens for runtime detection
fn maybe_avx2() {
    if let Some(token) = Desktop64::summon() {
        avx2_impl(token);
    }
}
}

Mistake 3: Summoning in Hot Loops

#![allow(unused)]
fn main() {
// WRONG: CPUID in every iteration
for chunk in data.chunks(8) {
    if let Some(token) = Desktop64::summon() {  // Don't!
        process(token, chunk);
    }
}

// RIGHT: Summon once, pass through
if let Some(token) = Desktop64::summon() {
    for chunk in data.chunks(8) {
        process(token, chunk);
    }
}
}

How fast is summon() anyway?

Archmage caches detection results in a static atomic, so repeated summon() calls after the first are essentially a single atomic load (~1.2 ns). The first call does the actual feature detection.

Operation                      | Time
Desktop64::summon() (cached)   | ~1.3 ns
First call (actual detection)  | ~2.6 ns
With -Ctarget-cpu=haswell      | 0 ns (compiles to Some(token))

The caching makes summon() fast enough that even calling it frequently won't hurt performance. But the reason to hoist summons isn't the cost of summon() itself; it's keeping the dispatch decision outside your hot loop so LLVM can optimize the inner code.

Mistake 4: Using #[cfg(target_arch)] Unnecessarily

#![allow(unused)]
fn main() {
// UNNECESSARY: Tokens exist everywhere
#[cfg(target_arch = "x86_64")]
{
    if let Some(token) = Desktop64::summon() { ... }
}

// CLEANER: Just use the token
if let Some(token) = Desktop64::summon() {
    // Returns None on non-x86, compiles everywhere
}
}

Summary

Question                                 | Answer
"Does this code exist in the binary?"    | #[cfg(...)] — compile-time
"Can this CPU run AVX2?"                 | Token::summon() — runtime
"What instructions can LLVM use here?"   | #[target_feature(enable)] — per-function
"Is runtime check needed?"               | Token::guaranteed() — tells you
"Will these functions inline together?"  | Same target features + concrete types = yes
"Do generic bounds hurt performance?"    | Yes, they create optimization boundaries
"Is downcasting (V4→V3) free?"           | Yes, features are a superset
"Is upcasting safe?"                     | Yes, but it creates an optimization boundary

Token Hoisting

This is the most important performance rule in archmage.

Summon tokens once at your API boundary. Pass them through the call chain. Never summon in hot loops.

The Problem: 42% Performance Regression

#![allow(unused)]
fn main() {
// WRONG: Summoning in inner function
fn distance(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    if let Some(token) = X64V3Token::summon() {  // CPUID every call!
        distance_simd(token, a, b)
    } else {
        distance_scalar(a, b)
    }
}

fn find_closest(points: &[[f32; 8]], query: &[f32; 8]) -> usize {
    let mut best_idx = 0;
    let mut best_dist = f32::MAX;

    for (i, point) in points.iter().enumerate() {
        let d = distance(point, query);  // summon() called N times!
        if d < best_dist {
            best_dist = d;
            best_idx = i;
        }
    }
    best_idx
}
}

This is 42% slower than hoisting the token. CPUID is not free.

The Solution: Hoist to API Boundary

#![allow(unused)]
fn main() {
use archmage::{X64V3Token, SimdToken, arcane};
use magetypes::simd::f32x8;

// RIGHT: Summon once, pass through
fn find_closest(points: &[[f32; 8]], query: &[f32; 8]) -> usize {
    // Summon ONCE at entry
    if let Some(token) = X64V3Token::summon() {
        find_closest_simd(token, points, query)
    } else {
        find_closest_scalar(points, query)
    }
}

#[arcane]
fn find_closest_simd(token: X64V3Token, points: &[[f32; 8]], query: &[f32; 8]) -> usize {
    let mut best_idx = 0;
    let mut best_dist = f32::MAX;

    for (i, point) in points.iter().enumerate() {
        let d = distance_simd(token, point, query);  // Token passed, no summon!
        if d < best_dist {
            best_dist = d;
            best_idx = i;
        }
    }
    best_idx
}

#[arcane]
fn distance_simd(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    // Just use the token - no detection here
    let va = f32x8::from_array(token, *a);
    let vb = f32x8::from_array(token, *b);
    let diff = va - vb;
    (diff * diff).reduce_add().sqrt()
}
}

The Rule

┌─────────────────────────────────────────────────────────┐
│  Public API boundary                                     │
│  ┌─────────────────────────────────────────────────────┐│
│  │  if let Some(token) = Token::summon() {             ││
│  │      // ONLY place summon() is called               ││
│  │      internal_impl(token, ...);                     ││
│  │  }                                                  ││
│  └─────────────────────────────────────────────────────┘│
│                          │                               │
│                          ▼                               │
│  ┌─────────────────────────────────────────────────────┐│
│  │  #[arcane]                                          ││
│  │  fn internal_impl(token: Token, ...) {              ││
│  │      helper(token, ...);  // Pass token through     ││
│  │  }                                                  ││
│  └─────────────────────────────────────────────────────┘│
│                          │                               │
│                          ▼                               │
│  ┌─────────────────────────────────────────────────────┐│
│  │  #[arcane]                                          ││
│  │  fn helper(token: Token, ...) {                     ││
│  │      // Use token, never summon                     ││
│  │  }                                                  ││
│  └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘

Why Tokens Are Zero-Cost to Pass

#![allow(unused)]
fn main() {
// Tokens are zero-sized
assert_eq!(std::mem::size_of::<X64V3Token>(), 0);

// Passing them costs nothing at runtime
fn f(token: X64V3Token) { }  // No actual parameter in compiled code
}

The token exists only at compile time to prove you did the check. At runtime, it's completely erased.

When -Ctarget-cpu=native Helps

With compile-time feature guarantees, summon() becomes a no-op:

RUSTFLAGS="-Ctarget-cpu=native" cargo build --release

Now X64V3Token::summon() compiles to:

#![allow(unused)]
fn main() {
// Effectively becomes:
fn summon() -> Option<X64V3Token> {
    Some(X64V3Token)  // No CPUID, unconditional
}
}

But even then, always hoist. It's good practice, and your code works correctly when compiled without target-cpu.

Summary

Pattern                     | Performance | Correctness
summon() in hot loop        | 42% slower  | Works
summon() at API boundary    | Optimal     | Works
summon() with -Ctarget-cpu  | Optimal     | Works

The #[arcane] Macro

#[arcane] creates a safe wrapper around SIMD code. Use it at entry points—functions called from non-SIMD code (after summon(), from tests, public APIs).

For internal helpers called from other SIMD functions, use #[rite] instead—it has zero wrapper overhead.

Basic Usage

#![allow(unused)]
fn main() {
use archmage::prelude::*;

#[arcane]
fn add_vectors(_token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    // safe_unaligned_simd takes references - fully safe inside #[arcane]!
    let va = _mm256_loadu_ps(a);
    let vb = _mm256_loadu_ps(b);
    let sum = _mm256_add_ps(va, vb);

    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(&mut out, sum);
    out
}
}

What It Generates

#![allow(unused)]
fn main() {
// Your code:
#[arcane]
fn add(token: Desktop64, a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b)
}

// Generated:
fn add(token: Desktop64, a: __m256, b: __m256) -> __m256 {
    #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
    #[inline]
    unsafe fn __inner(token: Desktop64, a: __m256, b: __m256) -> __m256 {
        _mm256_add_ps(a, b)  // Safe inside #[target_feature]!
    }
    // SAFETY: Token proves CPU support was verified
    unsafe { __inner(token, a, b) }
}
}

Token-to-Features Mapping

Token                   | Enabled Features
X64V2Token              | sse3, ssse3, sse4.1, sse4.2, popcnt
X64V3Token / Desktop64  | + avx, avx2, fma, bmi1, bmi2, f16c
X64V4Token / Server64   | + avx512f, avx512bw, avx512cd, avx512dq, avx512vl
NeonToken / Arm64       | neon
Simd128Token            | simd128

Nesting #[arcane] Functions

Functions with the same token type inline into each other:

#![allow(unused)]
fn main() {
use magetypes::simd::f32x8;

#[arcane]
fn outer(token: Desktop64, data: &[f32; 8]) -> f32 {
    let sum = inner(token, data);  // Inlines!
    sum * 2.0
}

#[arcane]
fn inner(token: Desktop64, data: &[f32; 8]) -> f32 {
    // Both functions share the same #[target_feature] region
    // LLVM optimizes across both
    let v = f32x8::from_array(token, *data);
    v.reduce_add()
}
}

Downcasting Tokens

Higher tokens can call functions expecting lower tokens:

#![allow(unused)]
fn main() {
use magetypes::simd::f32x8;

#[arcane]
fn v4_kernel(token: X64V4Token, data: &[f32; 8]) -> f32 {
    // V4 ⊃ V3, so this works and inlines properly
    v3_sum(token, data)
    // ... could do AVX-512 specific work too ...
}

#[arcane]
fn v3_sum(token: X64V3Token, data: &[f32; 8]) -> f32 {
    // Actual SIMD: load 8 floats, horizontal sum
    let v = f32x8::from_array(token, *data);
    v.reduce_add()
}
}

Cross-Platform Stubs

On non-matching architectures, #[arcane] generates a stub:

#![allow(unused)]
fn main() {
// On ARM, this becomes:
#[cfg(not(target_arch = "x86_64"))]
fn add(token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    unreachable!("Desktop64 cannot exist on this architecture")
}
}

The stub compiles but can never be reached—Desktop64::summon() returns None on ARM.

Options

inline_always

Force aggressive inlining (requires nightly):

#![allow(unused)]
#![feature(target_feature_inline_always)]

fn main() {
#[arcane(inline_always)]
fn hot_path(token: Desktop64, data: &[f32]) -> f32 {
    // Uses #[inline(always)] instead of #[inline]
}
}

Common Patterns

Public API with Internal Implementation

#![allow(unused)]
fn main() {
pub fn process(data: &mut [f32]) {
    if let Some(token) = Desktop64::summon() {
        process_simd(token, data);
    } else {
        process_scalar(data);
    }
}

#[arcane]
fn process_simd(token: Desktop64, data: &mut [f32]) {
    // SIMD implementation
}

fn process_scalar(data: &mut [f32]) {
    // Fallback
}
}

Generic Over Tokens

#![allow(unused)]
fn main() {
use archmage::HasX64V2;
use std::arch::x86_64::*;

#[arcane]
fn generic_impl<T: HasX64V2>(token: T, a: __m128, b: __m128) -> __m128 {
    // Works with X64V2Token, X64V3Token, X64V4Token
    // Note: generic bounds create optimization boundaries
    _mm_add_ps(a, b)
}
}

Warning: Generic bounds prevent inlining across the boundary. Prefer concrete tokens for hot paths.

The #[rite] Macro

#[rite] should be your default choice for SIMD functions. It adds #[target_feature] + #[inline] directly—no wrapper overhead.

Use #[arcane] only at entry points where the token comes from the outside world.

The Rule

Caller                                                         | Use
Called from #[arcane] or #[rite] with same/compatible token    | #[rite]
Called from non-SIMD code (tests, public API, after summon())  | #[arcane]

Default to #[rite]. Only use #[arcane] when you need the safe wrapper.

#![allow(unused)]
fn main() {
use archmage::prelude::*;

// ENTRY POINT: receives token from caller
#[arcane]
pub fn dot_product(token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let products = mul_vectors(token, a, b);  // Calls #[rite] helper
    horizontal_sum(token, products)
}

// INNER HELPER: only called from #[arcane] context
#[rite]
fn mul_vectors(_token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> __m256 {
    // safe_unaligned_simd takes references - no unsafe needed!
    let va = _mm256_loadu_ps(a);
    let vb = _mm256_loadu_ps(b);
    _mm256_mul_ps(va, vb)
}

// INNER HELPER: only called from #[arcane] context
#[rite]
fn horizontal_sum(_token: Desktop64, v: __m256) -> f32 {
    let sum = _mm256_hadd_ps(v, v);
    let sum = _mm256_hadd_ps(sum, sum);
    let low = _mm256_castps256_ps128(sum);
    let high = _mm256_extractf128_ps::<1>(sum);
    _mm_cvtss_f32(_mm_add_ss(low, high))
}
}

What It Generates

#![allow(unused)]
fn main() {
// Your code:
#[rite]
fn helper(_token: Desktop64, v: __m256) -> __m256 {
    _mm256_add_ps(v, v)
}

// Generated (NO wrapper function):
#[target_feature(enable = "avx2,fma,bmi1,bmi2")]
#[inline]
fn helper(_token: Desktop64, v: __m256) -> __m256 {
    _mm256_add_ps(v, v)
}
}

Compare to #[arcane] which creates:

#![allow(unused)]
fn main() {
fn helper(_token: Desktop64, v: __m256) -> __m256 {
    #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
    #[inline]
    unsafe fn __inner(_token: Desktop64, v: __m256) -> __m256 {
        _mm256_add_ps(v, v)
    }
    unsafe { __inner(_token, v) }
}
}

Why This Works (Rust 1.85+)

Since Rust 1.85, calling a #[target_feature] function from another function with matching or superset features is safe—no unsafe block needed:

#![allow(unused)]
fn main() {
#[target_feature(enable = "avx2,fma")]
fn outer(data: &[f32; 8]) -> f32 {
    inner_add(data) + inner_mul(data)  // Safe! No unsafe needed!
}

#[target_feature(enable = "avx2")]
#[inline]
fn inner_add(data: &[f32; 8]) -> f32 { /* ... */ }

#[target_feature(enable = "avx2")]
#[inline]
fn inner_mul(data: &[f32; 8]) -> f32 { /* ... */ }
}

The caller's features (avx2,fma) are a superset of the callee's (avx2), so the compiler knows the call is safe.

Direct Calls Require Unsafe

If you call a #[rite] function from outside a #[target_feature] context, you need unsafe:

#![allow(unused)]
fn main() {
#[test]
fn test_helper() {
    if let Some(token) = Desktop64::summon() {
        // Direct call from test (no target_feature) requires unsafe
        let result = unsafe { helper(token, data) };
        assert_eq!(result, expected);
    }
}
}

This is correct—the test function doesn't have #[target_feature], so the compiler can't verify safety at compile time. The unsafe block says "I checked at runtime via summon()."
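
If you'd rather keep unsafe out of tests entirely, one option is a thin #[arcane] entry point around the helper. The names below are hypothetical, just to show the shape:

#![allow(unused)]
fn main() {
use archmage::prelude::*;

#[rite]
fn double_first(_token: Desktop64, v: &[f32; 8]) -> f32 {
    v[0] * 2.0
}

// The #[arcane] wrapper performs the checked call internally
#[arcane]
fn double_first_entry(token: Desktop64, v: &[f32; 8]) -> f32 {
    double_first(token, v)
}

#[test]
fn test_double_first() {
    if let Some(token) = Desktop64::summon() {
        // No unsafe needed: the call goes through the #[arcane] entry point
        assert_eq!(double_first_entry(token, &[3.0; 8]), 6.0);
    }
}
}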

Benefits

  1. Zero wrapper overhead: No extra function call indirection
  2. Better inlining: LLVM sees the actual function, not a wrapper
  3. Cleaner stack traces: No __inner functions in backtraces
  4. Syntactic sugar: No need to manually maintain feature strings

Choosing Between #[arcane] and #[rite]

Default to #[rite] — only use #[arcane] when necessary.

Situation                                  | Use        | Why
Internal helper                            | #[rite]    | Zero overhead, inlines fully
Composable building blocks                 | #[rite]    | Same target features = one optimization region
Most SIMD functions                        | #[rite]    | This should be your default
Entry point (receives token from outside)  | #[arcane]  | Needs safe wrapper
Public API                                 | #[arcane]  | Callers aren't in target_feature context
Called from tests                          | #[arcane]  | Tests aren't in target_feature context

Composing Helpers

#[rite] helpers compose naturally:

#![allow(unused)]
fn main() {
#[rite]
fn complex_op(token: Desktop64, a: &[f32; 8], b: &[f32; 8], c: &[f32; 8]) -> f32 {
    let ab = mul_vectors(token, a, b);       // Calls another #[rite]
    let vc = load_vector(token, c);          // Calls another #[rite]
    let sum = add_vectors_raw(token, ab, vc); // Calls another #[rite]
    horizontal_sum(token, sum)                // Calls another #[rite]
}
}

All helpers inline into the caller with zero overhead.

Inlining Behavior

#[rite] uses #[inline] which is sufficient for full inlining when called from matching #[target_feature] context. Benchmarks show #[rite] with #[inline] performs identically to manually inlined code.

Note: #[inline(always)] combined with #[target_feature] is not allowed on stable Rust, so we can't use it anyway. The good news is we don't need it—#[inline] works perfectly.

Benchmark results (1000 iterations, 8-float vector add):
  arcane_in_loop:     2.32 µs  (4.1x slower - wrapper overhead)
  rite_in_arcane:     572 ns   (baseline - full inlining)
  manual_inline:      570 ns   (baseline)

Cross-Platform Stubs

Archmage lets you write x86 SIMD code that compiles on ARM and vice versa. Functions become unreachable stubs on non-matching architectures.

How It Works

When you write:

#![allow(unused)]
fn main() {
use archmage::prelude::*;

#[arcane]
fn avx2_kernel(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // x86-64 SIMD code - safe_unaligned_simd takes references
    let v = _mm256_loadu_ps(data);
    // ...
}
}

On x86-64, you get the real implementation.

On ARM/WASM, you get:

#![allow(unused)]
fn main() {
fn avx2_kernel(token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    unreachable!("Desktop64 cannot exist on this architecture")
}
}

Why This Is Safe

The stub can never execute because:

  1. Desktop64::summon() returns None on ARM
  2. You can't construct Desktop64 any other way (safely)
  3. The only path to avx2_kernel is through a token you can't obtain
#![allow(unused)]
fn main() {
fn process(data: &[f32; 8]) -> [f32; 8] {
    if let Some(token) = Desktop64::summon() {
        avx2_kernel(token, data)  // Never reached on ARM
    } else {
        scalar_fallback(data)     // ARM takes this path
    }
}
}

Writing Cross-Platform Libraries

Structure your code with platform-specific implementations:

#![allow(unused)]
fn main() {
// Public API - works everywhere
pub fn process(data: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    if let Some(token) = Desktop64::summon() {
        return process_avx2(token, data);
    }

    #[cfg(target_arch = "aarch64")]
    if let Some(token) = NeonToken::summon() {
        return process_neon(token, data);
    }

    process_scalar(data);
}

#[cfg(target_arch = "x86_64")]
#[arcane]
fn process_avx2(token: Desktop64, data: &mut [f32]) {
    // AVX2 implementation
}

#[cfg(target_arch = "aarch64")]
#[arcane]
fn process_neon(token: NeonToken, data: &mut [f32]) {
    // NEON implementation
}

fn process_scalar(data: &mut [f32]) {
    // Works everywhere
}
}

Token Existence vs Token Availability

All token types exist on all platforms:

#![allow(unused)]
fn main() {
// These types compile on ARM:
use archmage::{Desktop64, X64V3Token, X64V4Token};

// But summon() returns None:
assert!(Desktop64::summon().is_none());  // On ARM

// And guaranteed() tells you:
assert_eq!(Desktop64::guaranteed(), Some(false));  // Wrong arch
}

This enables cross-platform code without #[cfg] soup:

#![allow(unused)]
fn main() {
// Compiles everywhere, dispatches at runtime
fn process<T: IntoConcreteToken>(token: T, data: &[f32]) {
    if let Some(t) = token.as_x64v3() {
        process_v3(t, data);
    } else if let Some(t) = token.as_neon() {
        process_neon(t, data);
    } else {
        process_scalar(data);
    }
}
}

The ScalarToken Escape Hatch

ScalarToken works everywhere:

#![allow(unused)]
fn main() {
use archmage::{ScalarToken, SimdToken};

// Always succeeds, any platform
let token = ScalarToken::summon().unwrap();

// Or just construct it
let token = ScalarToken;
}

Use it for fallback paths that need a token for API consistency:

#![allow(unused)]
fn main() {
fn must_have_token<T: SimdToken>(token: T, data: &[f32]) -> f32 {
    // ...
}

// On platforms without SIMD:
let result = must_have_token(ScalarToken, &data);
}

Testing Cross-Platform Code

Test your dispatch logic without needing every CPU:

#![allow(unused)]
fn main() {
#[test]
fn test_scalar_fallback() {
    // Force scalar path even on AVX2 machine
    let token = ScalarToken;
    let result = process_with_token(token, &data);
    assert_eq!(result, expected);
}

#[test]
#[cfg(target_arch = "x86_64")]
fn test_avx2_path() {
    if let Some(token) = Desktop64::summon() {
        let result = process_with_token(token, &data);
        assert_eq!(result, expected);
    }
}
}

Manual Dispatch

The simplest dispatch pattern: check for tokens explicitly, call the appropriate implementation.

Basic Pattern

#![allow(unused)]
fn main() {
use archmage::{Desktop64, SimdToken};

pub fn process(data: &mut [f32]) {
    if let Some(token) = Desktop64::summon() {
        process_avx2(token, data);
    } else {
        process_scalar(data);
    }
}

#[arcane]
fn process_avx2(token: Desktop64, data: &mut [f32]) {
    // AVX2 implementation
}

fn process_scalar(data: &mut [f32]) {
    // Scalar fallback
}
}

That's it. No #[cfg(target_arch)] needed—this compiles and runs everywhere.

No Architecture Guards Needed

Tokens exist on all platforms. On unsupported architectures, summon() returns None and #[arcane] functions become unreachable stubs. You write one dispatch block:

#![allow(unused)]
fn main() {
use archmage::{Desktop64, Arm64, Simd128Token, SimdToken};

pub fn process(data: &mut [f32]) {
    // Try x86 AVX2
    if let Some(token) = Desktop64::summon() {
        return process_x86(token, data);
    }

    // Try ARM NEON
    if let Some(token) = Arm64::summon() {
        return process_arm(token, data);
    }

    // Try WASM SIMD
    if let Some(token) = Simd128Token::summon() {
        return process_wasm(token, data);
    }

    // Scalar fallback
    process_scalar(data);
}

#[arcane]
fn process_x86(token: Desktop64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_arm(token: Arm64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_wasm(token: Simd128Token, data: &mut [f32]) { /* ... */ }

fn process_scalar(data: &mut [f32]) { /* ... */ }
}

On x86-64: Desktop64::summon() may succeed, others return None. On ARM: Arm64::summon() succeeds, others return None. On WASM: Simd128Token::summon() may succeed, others return None.

The #[arcane] functions for other architectures compile to unreachable stubs—the code exists but can never be called.

Multi-Tier x86 Dispatch

Check from highest to lowest capability:

#![allow(unused)]
fn main() {
use archmage::{X64V4Token, Desktop64, X64V2Token, SimdToken};

pub fn process(data: &mut [f32]) {
    // AVX-512 (requires avx512 feature)
    #[cfg(feature = "avx512")]
    if let Some(token) = X64V4Token::summon() {
        return process_v4(token, data);
    }

    // AVX2+FMA (Haswell+, Zen+)
    if let Some(token) = Desktop64::summon() {
        return process_v3(token, data);
    }

    // SSE4.2 (Nehalem+)
    if let Some(token) = X64V2Token::summon() {
        return process_v2(token, data);
    }

    process_scalar(data);
}
}

Note: #[cfg(feature = "avx512")] is a Cargo feature gate (compile-time opt-in), not an architecture check. The actual CPU detection is still runtime via summon().

When to Use Manual Dispatch

Use manual dispatch when:

  • You have 2-3 tiers
  • You want explicit, readable control flow
  • Different tiers have different APIs

Consider incant! when:

  • You have many tiers
  • All implementations have the same signature
  • You want automatic best-available selection

Avoiding Common Mistakes

Don't Dispatch in Hot Loops

#![allow(unused)]
fn main() {
// WRONG - CPUID every iteration
for chunk in data.chunks_mut(8) {
    if let Some(token) = Desktop64::summon() {
        process_chunk(token, chunk);
    }
}

// BETTER - hoist token outside loop
if let Some(token) = Desktop64::summon() {
    for chunk in data.chunks_mut(8) {
        process_chunk(token, chunk);  // But still has #[arcane] wrapper overhead
    }
} else {
    for chunk in data.chunks_mut(8) {
        process_chunk_scalar(chunk);
    }
}

// BEST - put the loop inside #[arcane], call #[rite] helpers
if let Some(token) = Desktop64::summon() {
    process_all_chunks(token, data);
} else {
    process_all_chunks_scalar(data);
}

#[arcane]
fn process_all_chunks(token: Desktop64, data: &mut [f32]) {
    for chunk in data.chunks_exact_mut(8) {
        process_chunk(token, chunk.try_into().unwrap());  // #[rite] inlines fully!
    }
}

#[rite]
fn process_chunk(_: Desktop64, chunk: &mut [f32; 8]) {
    // This inlines into process_all_chunks with zero overhead
}
}

The "BETTER" pattern still calls through an #[arcane] wrapper each iteration—an LLVM optimization barrier. The "BEST" pattern puts the loop inside #[arcane] and uses #[rite] for the inner work, so LLVM sees one optimization region for the entire loop.

Don't Forget Early Returns

#![allow(unused)]
fn main() {
// WRONG - falls through to scalar even when SIMD available
if let Some(token) = Desktop64::summon() {
    process_avx2(token, data);
    // Missing return!
}
process_scalar(data);  // Always runs!

// RIGHT
if let Some(token) = Desktop64::summon() {
    return process_avx2(token, data);
}
process_scalar(data);
}

incant! Macro

incant! automates dispatch to suffixed function variants. Write one call, get automatic fallback through capability tiers.

Basic Usage

#![allow(unused)]
fn main() {
use archmage::{incant, arcane, X64V3Token, NeonToken, SimdToken};
use magetypes::simd::f32x8;

// Define variants with standard suffixes
#[arcane]
fn sum_v3(token: X64V3Token, data: &[f32; 8]) -> f32 {
    f32x8::from_array(token, *data).reduce_add()
}

#[arcane]
fn sum_neon(token: NeonToken, data: &[f32; 8]) -> f32 {
    // NEON implementation
}

fn sum_scalar(data: &[f32]) -> f32 {
    data.iter().sum()
}

// Dispatch automatically
pub fn sum(data: &[f32; 8]) -> f32 {
    incant!(sum(data))
    // Tries: sum_v4 → sum_v3 → sum_neon → sum_wasm128 → sum_scalar
}
}

How It Works

incant! expands to a dispatch chain:

#![allow(unused)]
fn main() {
// incant!(process(data)) expands to approximately:
{
    #[cfg(all(target_arch = "x86_64", feature = "avx512"))]
    if let Some(token) = X64V4Token::summon() {
        return process_v4(token, data);
    }

    #[cfg(target_arch = "x86_64")]
    if let Some(token) = X64V3Token::summon() {
        return process_v3(token, data);
    }

    #[cfg(target_arch = "aarch64")]
    if let Some(token) = NeonToken::summon() {
        return process_neon(token, data);
    }

    #[cfg(target_arch = "wasm32")]
    if let Some(token) = Simd128Token::summon() {
        return process_wasm128(token, data);
    }

    process_scalar(data)
}
}

Suffix Conventions

Suffix    | Token         | Platform
_v4       | X64V4Token    | x86-64 AVX-512
_v3       | X64V3Token    | x86-64 AVX2+FMA
_v2       | X64V2Token    | x86-64 SSE4.2
_neon     | NeonToken     | AArch64
_wasm128  | Simd128Token  | WASM
_scalar   |               | Fallback

You don't need all variants—incant! skips missing ones.
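
For example, an operation with only an AVX2 variant and a scalar fallback still dispatches fine. A sketch (the variant names follow the suffix convention above):

#![allow(unused)]
fn main() {
use archmage::{arcane, incant, X64V3Token, SimdToken};
use magetypes::simd::f32x8;

#[arcane]
fn scale_v3(token: X64V3Token, data: &[f32; 8], factor: f32) -> [f32; 8] {
    (f32x8::from_array(token, *data) * f32x8::splat(token, factor)).to_array()
}

fn scale_scalar(data: &[f32; 8], factor: f32) -> [f32; 8] {
    let mut out = *data;
    for x in &mut out {
        *x *= factor;
    }
    out
}

fn scale(data: &[f32; 8], factor: f32) -> [f32; 8] {
    // Tries scale_v3 where available, otherwise scale_scalar
    incant!(scale(data, factor))
}
}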

Passthrough Mode

When you already have a token and want to dispatch to specialized variants:

#![allow(unused)]
fn main() {
fn outer<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 {
    // Passthrough: token already obtained, dispatch to best variant
    incant!(token => process(data))
}
}

This uses IntoConcreteToken to check the token's actual type and dispatch accordingly, without re-summoning.

Example: Complete Implementation

#![allow(unused)]
fn main() {
use archmage::{arcane, incant, X64V3Token, NeonToken, SimdToken};
use magetypes::simd::f32x8;

// AVX2 variant
#[cfg(target_arch = "x86_64")]
#[arcane]
fn dot_product_v3(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let va = f32x8::from_array(token, *a);
    let vb = f32x8::from_array(token, *b);
    (va * vb).reduce_add()
}

// NEON variant (128-bit, so process 4 at a time)
#[cfg(target_arch = "aarch64")]
#[arcane]
fn dot_product_neon(token: NeonToken, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    use magetypes::simd::f32x4;
    let sum1 = {
        let va = f32x4::from_slice(token, &a[0..4]);
        let vb = f32x4::from_slice(token, &b[0..4]);
        (va * vb).reduce_add()
    };
    let sum2 = {
        let va = f32x4::from_slice(token, &a[4..8]);
        let vb = f32x4::from_slice(token, &b[4..8]);
        (va * vb).reduce_add()
    };
    sum1 + sum2
}

// Scalar fallback
fn dot_product_scalar(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

// Public API
pub fn dot_product(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    incant!(dot_product(a, b))
}
}

When to Use incant!

Use incant! when:

  • You have multiple platform-specific implementations
  • You want automatic fallback through tiers
  • Function signatures are similar across variants

Use manual dispatch when:

  • You need custom fallback logic
  • Variants have different signatures
  • You want more explicit control

IntoConcreteToken Trait

IntoConcreteToken enables compile-time dispatch via monomorphization. Each token type returns Some(self) for its own type and None for others.

Basic Usage

#![allow(unused)]
fn main() {
use archmage::{IntoConcreteToken, SimdToken, X64V3Token, NeonToken, ScalarToken};

fn process<T: IntoConcreteToken>(token: T, data: &mut [f32]) {
    // Compiler eliminates non-matching branches via monomorphization
    if let Some(t) = token.as_x64v3() {
        process_avx2(t, data);
    } else if let Some(t) = token.as_neon() {
        process_neon(t, data);
    } else if let Some(_) = token.as_scalar() {
        process_scalar(data);
    }
}
}

When called with X64V3Token, the compiler sees:

  • as_x64v3() → Some(token) (takes this branch)
  • as_neon() → None (eliminated)
  • as_scalar() → None (eliminated)

Available Methods

#![allow(unused)]
fn main() {
pub trait IntoConcreteToken: SimdToken {
    fn as_x64v2(self) -> Option<X64V2Token> { None }
    fn as_x64v3(self) -> Option<X64V3Token> { None }
    fn as_x64v4(self) -> Option<X64V4Token> { None }      // requires avx512
    fn as_avx512_modern(self) -> Option<Avx512ModernToken> { None }
    fn as_neon(self) -> Option<NeonToken> { None }
    fn as_neon_aes(self) -> Option<NeonAesToken> { None }
    fn as_neon_sha3(self) -> Option<NeonSha3Token> { None }
    fn as_wasm128(self) -> Option<Simd128Token> { None }
    fn as_scalar(self) -> Option<ScalarToken> { None }
}
}

Each concrete token overrides its own method to return Some(self).

Upcasting with IntoConcreteToken

You can check if a token supports higher capabilities:

#![allow(unused)]
fn main() {
fn maybe_use_avx512<T: IntoConcreteToken>(token: T, data: &mut [f32]) {
    // Check if we actually have AVX-512
    if let Some(v4) = token.as_x64v4() {
        fast_path_avx512(v4, data);
    } else if let Some(v3) = token.as_x64v3() {
        normal_path_avx2(v3, data);
    }
}
}

Note: This creates an LLVM optimization boundary. The generic caller and feature-enabled callee have different target settings. Do this dispatch at entry points, not in hot code.

Dispatch Order

Check from highest to lowest capability:

#![allow(unused)]
fn main() {
fn dispatch<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 {
    // Highest first
    #[cfg(feature = "avx512")]
    if let Some(t) = token.as_x64v4() {
        return process_v4(t, data);
    }

    if let Some(t) = token.as_x64v3() {
        return process_v3(t, data);
    }

    if let Some(t) = token.as_neon() {
        return process_neon(t, data);
    }

    if let Some(t) = token.as_wasm128() {
        return process_wasm(t, data);
    }

    // Scalar fallback
    process_scalar(data)
}
}

vs incant!

Feature         | IntoConcreteToken       | incant!
Dispatch style  | Explicit if/else        | Macro-generated
Token passing   | Token already obtained  | Summons tokens
Flexibility     | Full control            | Convention-based
Verbosity       | More code               | Less code

Use IntoConcreteToken when you already have a token and need to specialize. Use incant! for entry-point dispatch.

Example: Library with Generic Token API

#![allow(unused)]
fn main() {
use archmage::{IntoConcreteToken, SimdToken, arcane};

/// Public API accepts any token
pub fn transform<T: IntoConcreteToken>(token: T, data: &mut [f32]) {
    if let Some(t) = token.as_x64v3() {
        transform_avx2(t, data);
    } else if let Some(t) = token.as_neon() {
        transform_neon(t, data);
    } else {
        transform_scalar(data);
    }
}

#[cfg(target_arch = "x86_64")]
#[arcane]
fn transform_avx2(token: X64V3Token, data: &mut [f32]) {
    // AVX2 implementation
}

#[cfg(target_arch = "aarch64")]
#[arcane]
fn transform_neon(token: NeonToken, data: &mut [f32]) {
    // NEON implementation
}

fn transform_scalar(data: &mut [f32]) {
    // Scalar fallback
}
}

Callers can pass any token:

#![allow(unused)]
fn main() {
if let Some(token) = Desktop64::summon() {
    transform(token, &mut data);  // Uses AVX2 path
}

// Or force scalar for testing
transform(ScalarToken, &mut data);
}

Tiered Fallback

Real applications need graceful degradation across capability tiers. Here's how to structure robust fallback chains.

The Tier Hierarchy

x86-64:
  X64V4Token (AVX-512)
      ↓
  X64V3Token (AVX2+FMA)  ← Most common target
      ↓
  X64V2Token (SSE4.2)
      ↓
  ScalarToken

AArch64:
  NeonSha3Token
      ↓
  NeonAesToken
      ↓
  NeonToken  ← Baseline (always available)
      ↓
  ScalarToken

WASM:
  Simd128Token
      ↓
  ScalarToken

Pattern: Capability Waterfall

#![allow(unused)]
fn main() {
use archmage::*;

pub fn process(data: &mut [f32]) -> f32 {
    // x86-64 path
    #[cfg(target_arch = "x86_64")]
    {
        #[cfg(feature = "avx512")]
        if let Some(token) = X64V4Token::summon() {
            return process_v4(token, data);
        }

        if let Some(token) = X64V3Token::summon() {
            return process_v3(token, data);
        }

        if let Some(token) = X64V2Token::summon() {
            return process_v2(token, data);
        }
    }

    // AArch64 path
    #[cfg(target_arch = "aarch64")]
    if let Some(token) = NeonToken::summon() {
        return process_neon(token, data);
    }

    // WASM path
    #[cfg(target_arch = "wasm32")]
    if let Some(token) = Simd128Token::summon() {
        return process_wasm(token, data);
    }

    // Universal fallback
    process_scalar(data)
}
}

Pattern: Width-Based Tiers

When your algorithm naturally works at different widths:

#![allow(unused)]
fn main() {
use magetypes::*;

pub fn sum_f32(data: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        // Try 256-bit first
        if let Some(token) = X64V3Token::summon() {
            return sum_f32x8(token, data);
        }

        // Fall back to 128-bit
        if let Some(token) = X64V2Token::summon() {
            return sum_f32x4(token, data);
        }
    }

    #[cfg(target_arch = "aarch64")]
    if let Some(token) = NeonToken::summon() {
        return sum_f32x4_neon(token, data);
    }

    sum_scalar(data)
}

#[cfg(target_arch = "x86_64")]
#[arcane]
fn sum_f32x8(token: X64V3Token, data: &[f32]) -> f32 {
    let mut acc = f32x8::zero(token);
    for chunk in data.chunks_exact(8) {
        let v = f32x8::from_slice(token, chunk);
        acc = acc + v;
    }
    // Handle remainder
    let mut sum = acc.reduce_add();
    for &x in data.chunks_exact(8).remainder() {
        sum += x;
    }
    sum
}
}

Pattern: Feature Detection Cache

For hot paths where you dispatch many times:

#![allow(unused)]
fn main() {
use std::sync::OnceLock;

#[derive(Clone, Copy)]
enum SimdLevel {
    Avx512,
    Avx2,
    Sse42,
    Scalar,
}

static SIMD_LEVEL: OnceLock<SimdLevel> = OnceLock::new();

fn detect_level() -> SimdLevel {
    *SIMD_LEVEL.get_or_init(|| {
        #[cfg(all(target_arch = "x86_64", feature = "avx512"))]
        if X64V4Token::summon().is_some() {
            return SimdLevel::Avx512;
        }

        #[cfg(target_arch = "x86_64")]
        if X64V3Token::summon().is_some() {
            return SimdLevel::Avx2;
        }

        #[cfg(target_arch = "x86_64")]
        if X64V2Token::summon().is_some() {
            return SimdLevel::Sse42;
        }

        SimdLevel::Scalar
    })
}

pub fn process(data: &mut [f32]) {
    match detect_level() {
        SimdLevel::Avx512 => {
            #[cfg(all(target_arch = "x86_64", feature = "avx512"))]
            process_v4(X64V4Token::summon().unwrap(), data);
        }
        SimdLevel::Avx2 => {
            #[cfg(target_arch = "x86_64")]
            process_v3(X64V3Token::summon().unwrap(), data);
        }
        // ... etc
    }
}
}

Anti-Pattern: Over-Engineering

Don't create tiers you don't need:

#![allow(unused)]
fn main() {
// WRONG: Too many tiers, most are never used
pub fn simple_add(a: f32, b: f32) -> f32 {
    if let Some(t) = X64V4Token::summon() { ... }
    else if let Some(t) = X64V3Token::summon() { ... }
    else if let Some(t) = X64V2Token::summon() { ... }
    else if let Some(t) = NeonToken::summon() { ... }
    else { a + b }
}

// RIGHT: Just do the simple thing
pub fn simple_add(a: f32, b: f32) -> f32 {
    a + b
}
}

SIMD only helps with bulk operations. For scalar math, just use scalar math.

Recommendations

  1. Desktop64 is usually enough for x86 — it covers 99% of modern PCs
  2. NeonToken is baseline on AArch64 — no fallback needed
  3. Test your scalar path — it's your safety net
  4. Profile before adding tiers — each tier is code to maintain

magetypes Type Overview

magetypes provides SIMD vector types with natural Rust operators. Each type wraps platform intrinsics and requires an archmage token for construction.

Available Types

x86-64 Types

Type    | Elements  | Width    | Min Token
f32x4   | 4 × f32   | 128-bit  | X64V2Token
f32x8   | 8 × f32   | 256-bit  | X64V3Token
f32x16  | 16 × f32  | 512-bit  | X64V4Token*
f64x2   | 2 × f64   | 128-bit  | X64V2Token
f64x4   | 4 × f64   | 256-bit  | X64V3Token
f64x8   | 8 × f64   | 512-bit  | X64V4Token*
i32x4   | 4 × i32   | 128-bit  | X64V2Token
i32x8   | 8 × i32   | 256-bit  | X64V3Token
i32x16  | 16 × i32  | 512-bit  | X64V4Token*
i8x16   | 16 × i8   | 128-bit  | X64V2Token
i8x32   | 32 × i8   | 256-bit  | X64V3Token
...     | ...       | ...      | ...

*Requires avx512 feature

AArch64 Types (NEON)

Type   | Elements  | Width    | Token
f32x4  | 4 × f32   | 128-bit  | NeonToken
f64x2  | 2 × f64   | 128-bit  | NeonToken
i32x4  | 4 × i32   | 128-bit  | NeonToken
i16x8  | 8 × i16   | 128-bit  | NeonToken
i8x16  | 16 × i8   | 128-bit  | NeonToken
u32x4  | 4 × u32   | 128-bit  | NeonToken
...    | ...       | ...      | ...

WASM Types (SIMD128)

Type   | Elements  | Width    | Token
f32x4  | 4 × f32   | 128-bit  | Simd128Token
f64x2  | 2 × f64   | 128-bit  | Simd128Token
i32x4  | 4 × i32   | 128-bit  | Simd128Token
...    | ...       | ...      | ...

Basic Usage

#![allow(unused)]
fn main() {
use archmage::{Desktop64, SimdToken};
use magetypes::simd::f32x8;

fn example() {
    if let Some(token) = Desktop64::summon() {
        // Construct from array
        let a = f32x8::from_array(token, [1.0; 8]);

        // Splat a single value
        let b = f32x8::splat(token, 2.0);

        // Natural operators
        let c = a + b;
        let d = c * c;

        // Extract result
        let result: [f32; 8] = d.to_array();
    }
}
}

Type Properties

All magetypes SIMD types are:

  • Copy — Pass by value freely
  • Clone — Explicit cloning works
  • Debug — Print for debugging
  • Send + Sync — Thread-safe
  • Token-gated construction — Cannot create without proving CPU support
#![allow(unused)]
fn main() {
// Zero-cost copies
let a = f32x8::splat(token, 1.0);
let b = a;  // Copy, not move
let c = a + b;  // Both still valid
}

Why no Pod/Zeroable? Implementing bytemuck traits would let users bypass token-gated construction (e.g., bytemuck::zeroed::<f32x8>()), creating vectors without proving CPU support. Use the token-gated cast_slice and from_bytes methods instead.

Using the Prelude

For convenience, import everything:

#![allow(unused)]
fn main() {
use magetypes::prelude::*;

// Now you have all types and archmage re-exports
if let Some(token) = Desktop64::summon() {
    let v = f32x8::splat(token, 1.0);
}
}

Platform-Specific Imports

If you need just one platform:

#![allow(unused)]
fn main() {
// x86-64 only
#[cfg(target_arch = "x86_64")]
use magetypes::simd::x86::*;

// AArch64 only
#[cfg(target_arch = "aarch64")]
use magetypes::simd::arm::*;

// WASM only
#[cfg(target_arch = "wasm32")]
use magetypes::simd::wasm::*;
}

Feature Flags

Feature  | Effect
avx512   | Enable 512-bit types on x86-64
std      | Standard library support (default)

Construction & Operators

magetypes provides natural Rust syntax for SIMD operations.

Construction

From Array

#![allow(unused)]
fn main() {
let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let v = f32x8::from_array(token, data);
}

From Slice

#![allow(unused)]
fn main() {
let slice = &[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let v = f32x8::from_slice(token, slice);
}

Splat (Broadcast)

#![allow(unused)]
fn main() {
let v = f32x8::splat(token, 3.14159);  // All lanes = π
}

Zero

#![allow(unused)]
fn main() {
let v = f32x8::zero(token);  // All lanes = 0
}

Load from Memory

#![allow(unused)]
fn main() {
// Unaligned load
let v = f32x8::load(token, ptr);

// From reference
let v = f32x8::from_array(token, *array_ref);
}

Extraction

To Array

#![allow(unused)]
fn main() {
let arr: [f32; 8] = v.to_array();
}

Store to Memory

#![allow(unused)]
fn main() {
v.store(ptr);                     // Unaligned store
unsafe { v.store_aligned(ptr) };  // Aligned store (UB if misaligned)
}

Extract Single Lane

#![allow(unused)]
fn main() {
let first = v.extract::<0>();
let third = v.extract::<2>();
}

Arithmetic Operators

All standard operators work:

#![allow(unused)]
fn main() {
let a = f32x8::splat(token, 2.0);
let b = f32x8::splat(token, 3.0);

let sum = a + b;        // [5.0; 8]
let diff = a - b;       // [-1.0; 8]
let prod = a * b;       // [6.0; 8]
let quot = a / b;       // [0.666...; 8]
let neg = -a;           // [-2.0; 8]
}

Compound Assignment

#![allow(unused)]
fn main() {
let mut v = f32x8::splat(token, 1.0);
v += f32x8::splat(token, 2.0);  // v = [3.0; 8]
v *= f32x8::splat(token, 2.0);  // v = [6.0; 8]
}

Fused Multiply-Add

FMA is faster and more precise than separate multiply and add:

#![allow(unused)]
fn main() {
// a * b + c (single instruction on AVX2/NEON)
let result = a.mul_add(b, c);

// a * b - c
let result = a.mul_sub(b, c);

// -(a * b) + c  (negated multiply-add)
let result = a.neg_mul_add(b, c);
}
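
A typical use of mul_add is a dot product accumulated across chunks, so each chunk costs a single FMA instead of a multiply followed by an add. A minimal sketch, assuming the from_slice, zero, and reduce_add APIs shown on this page (the dot helper itself is illustrative, not part of magetypes):

#![allow(unused)]
fn main() {
use archmage::{Desktop64, arcane};
use magetypes::simd::f32x8;

// Illustrative helper: dot product over arbitrary-length slices, one FMA per chunk.
#[arcane]
fn dot(token: Desktop64, a: &[f32], b: &[f32]) -> f32 {
    let mut acc = f32x8::zero(token);
    for (ca, cb) in a.chunks_exact(8).zip(b.chunks_exact(8)) {
        let va = f32x8::from_slice(token, ca);
        let vb = f32x8::from_slice(token, cb);
        acc = va.mul_add(vb, acc);  // acc = va * vb + acc
    }
    let mut sum = acc.reduce_add();
    // Scalar tail for the last < 8 elements
    for (x, y) in a.chunks_exact(8).remainder().iter().zip(b.chunks_exact(8).remainder()) {
        sum += x * y;
    }
    sum
}
}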

Comparisons

Comparisons return mask types:

#![allow(unused)]
fn main() {
let a = f32x8::from_array(token, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
let b = f32x8::splat(token, 4.0);

let lt = a.simd_lt(b);   // [true, true, true, false, false, false, false, false]
let eq = a.simd_eq(b);   // [false, false, false, true, false, false, false, false]
let ge = a.simd_ge(b);   // [false, false, false, true, true, true, true, true]
}

Blend with Mask

#![allow(unused)]
fn main() {
let mask = a.simd_lt(b);
let result = mask.blend(true_values, false_values);
}
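
For example, a branch-free "zero out the negative lanes" (a SIMD ReLU) is just a comparison plus a blend. A sketch assuming the blend(true_values, false_values) argument order shown above, with token and v already in scope:

#![allow(unused)]
fn main() {
// Branch-free "zero out negative lanes" (ReLU)
let zero = f32x8::zero(token);
let negative = v.simd_lt(zero);        // true where v < 0
let relu = negative.blend(zero, v);    // true lanes → zero, false lanes → v
}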

Min/Max

#![allow(unused)]
fn main() {
let min = a.min(b);  // Element-wise minimum
let max = a.max(b);  // Element-wise maximum

// Clamp to [0, 1] using splatted scalar bounds
let clamped = v.max(f32x8::splat(token, 0.0))
               .min(f32x8::splat(token, 1.0));
}

Absolute Value

#![allow(unused)]
fn main() {
let abs = v.abs();  // |v| for each lane
}

Reductions

Horizontal operations across lanes:

#![allow(unused)]
fn main() {
let sum = v.reduce_add();      // Sum of all lanes
let max = v.reduce_max();      // Maximum lane
let min = v.reduce_min();      // Minimum lane
}
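
Horizontal reductions are comparatively expensive, so the usual pattern is to accumulate vertically inside the loop and reduce once at the end. A sketch computing the mean of a slice this way (the mean helper is illustrative and assumes non-empty input):

#![allow(unused)]
fn main() {
use archmage::{Desktop64, arcane};
use magetypes::simd::f32x8;

// Illustrative helper: vertical adds in the loop, one horizontal reduction at the end.
#[arcane]
fn mean(token: Desktop64, data: &[f32]) -> f32 {
    let mut acc = f32x8::zero(token);
    for chunk in data.chunks_exact(8) {
        acc += f32x8::from_slice(token, chunk);   // stays in a register
    }
    let mut sum = acc.reduce_add();               // single reduce_add
    for x in data.chunks_exact(8).remainder() {
        sum += *x;                                // scalar tail
    }
    sum / data.len() as f32
}
}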

Integer Operations

For integer types (i32x8, u8x16, etc.):

#![allow(unused)]
fn main() {
let a = i32x8::splat(token, 10);
let b = i32x8::splat(token, 3);

// Arithmetic
let sum = a + b;
let diff = a - b;
let prod = a * b;

// Bitwise
let and = a & b;
let or = a | b;
let xor = a ^ b;
let not = !a;

// Shifts
let shl = a << 2;           // Shift left by constant
let shr = a >> 1;           // Shift right by constant
let shr_arith = a.shr_arithmetic(1);  // Sign-extending shift
}

Example: Dot Product

#![allow(unused)]
fn main() {
use archmage::{Desktop64, SimdToken, arcane};
use magetypes::simd::f32x8;

#[arcane]
fn dot_product(token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let va = f32x8::from_array(token, *a);
    let vb = f32x8::from_array(token, *b);
    (va * vb).reduce_add()
}
}

Example: Vector Normalization

#![allow(unused)]
fn main() {
#[arcane]
fn normalize(token: Desktop64, v: &mut [f32; 8]) {
    let vec = f32x8::from_array(token, *v);
    let len_sq = (vec * vec).reduce_add();
    let len = len_sq.sqrt();

    if len > 0.0 {
        let inv_len = f32x8::splat(token, 1.0 / len);
        let normalized = vec * inv_len;
        *v = normalized.to_array();
    }
}
}

Type Conversions

magetypes provides conversions between SIMD types and between SIMD and scalar types.

Float ↔ Integer Conversions

Float to Integer

#![allow(unused)]
fn main() {
let floats = f32x8::from_array(token, [1.5, 2.7, -3.2, 4.0, 5.9, 6.1, 7.0, 8.5]);

// Truncate toward zero (like `as i32`)
let ints = floats.to_i32x8();  // [1, 2, -3, 4, 5, 6, 7, 8]

// Round to nearest
let rounded = floats.to_i32x8_round();  // [2, 3, -3, 4, 6, 6, 7, 8]
}

Integer to Float

#![allow(unused)]
fn main() {
let ints = i32x8::from_array(token, [1, 2, 3, 4, 5, 6, 7, 8]);
let floats = ints.to_f32x8();  // [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
}

Width Conversions

Narrowing (Wider → Narrower)

#![allow(unused)]
fn main() {
// f64x4 → f32x4 (lose precision)
let doubles = f64x4::from_array(token, [1.0, 2.0, 3.0, 4.0]);
let floats = doubles.to_f32x4();

// i32x8 → i16x8 (pack with saturation)
let wide = i32x8::from_array(token, [1, 2, 3, 4, 5, 6, 7, 8]);
let narrow = wide.pack_i16();
}

Widening (Narrower → Wider)

#![allow(unused)]
fn main() {
// f32x4 → f64x4 (extend precision, lower half)
let floats = f32x4::from_array(token, [1.0, 2.0, 3.0, 4.0]);
let doubles = floats.to_f64x4_low();  // Converts first 2 elements

// i16x8 → i32x4 (sign-extend the lower four lanes)
let narrow = i16x8::from_array(token, [1, 2, 3, 4, 5, 6, 7, 8]);
let wide = narrow.extend_i32_low();  // [1, 2, 3, 4]
}

Bitcast (Reinterpret)

Reinterpret bits as a different type (same size):

#![allow(unused)]
fn main() {
// f32x8 → i32x8 (view float bits as integers)
let floats = f32x8::splat(token, 1.0);
let bits = floats.bitcast_i32x8();

// i32x8 → f32x8
let ints = i32x8::splat(token, 0x3f800000);  // IEEE 754 for 1.0
let floats = ints.bitcast_f32x8();
}

Warning: Bitcast doesn't convert values—it reinterprets the raw bits.
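
Bitcasting is mostly useful for bit tricks on float data. For instance, clearing the sign bit yields an absolute value, equivalent to abs(); a sketch assuming a token in scope:

#![allow(unused)]
fn main() {
// Absolute value by clearing the IEEE 754 sign bit (same result as v.abs())
let v = f32x8::from_array(token, [-1.0, 2.0, -3.0, 4.0, -5.0, 6.0, -7.0, 8.0]);
let sign_mask = i32x8::splat(token, 0x7FFF_FFFF);            // every bit except the sign bit
let abs = (v.bitcast_i32x8() & sign_mask).bitcast_f32x8();   // [1.0, 2.0, 3.0, ..., 8.0]
}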

Signed ↔ Unsigned

#![allow(unused)]
fn main() {
// i32x8 → u32x8 (reinterpret, no conversion)
let signed = i32x8::from_array(token, [-1, 0, 1, 2, 3, 4, 5, 6]);
let unsigned = signed.to_u32x8();  // [0xFFFFFFFF, 0, 1, 2, 3, 4, 5, 6]

// u32x8 → i32x8
let unsigned = u32x8::splat(token, 0xFFFFFFFF);
let signed = unsigned.to_i32x8();  // [-1; 8]
}

Lane Extraction and Insertion

#![allow(unused)]
fn main() {
// Extract single lane
let v = f32x8::from_array(token, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
let third = v.extract::<2>();  // 3.0

// Insert single lane
let v = v.insert::<2>(99.0);  // [1.0, 2.0, 99.0, 4.0, 5.0, 6.0, 7.0, 8.0]
}

Half-Width Operations

Split or combine vectors:

#![allow(unused)]
fn main() {
// Split f32x8 into two f32x4
let full = f32x8::from_array(token, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
let (low, high) = full.split();
// low  = [1.0, 2.0, 3.0, 4.0]
// high = [5.0, 6.0, 7.0, 8.0]

// Combine two f32x4 into f32x8
let combined = f32x8::from_halves(token, low, high);
}

Slice Casting (Token-Gated)

magetypes provides safe, token-gated slice casting as an alternative to bytemuck:

#![allow(unused)]
fn main() {
// Cast aligned &[f32] to &[f32x8]
let data: &[f32] = &[1.0; 64];
if let Some(chunks) = f32x8::cast_slice(token, data) {
    // chunks: &[f32x8] with 8 elements
    for chunk in chunks {
        // ...
    }
}

// Mutable version
let data: &mut [f32] = &mut [0.0; 64];
if let Some(chunks) = f32x8::cast_slice_mut(token, data) {
    // chunks: &mut [f32x8]
}
}

Why not bytemuck? Implementing Pod/Zeroable would let users bypass token-gated construction:

#![allow(unused)]
fn main() {
// bytemuck would allow this (BAD):
let v: f32x8 = bytemuck::Zeroable::zeroed();  // No token check!

// magetypes requires token (GOOD):
let v = f32x8::zero(token);  // Token proves CPU support
}

The token-gated cast_slice returns None if alignment or length is wrong—no UB possible.
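
Because the cast can fail, the usual pattern is to take the casted fast path when it succeeds and fall back to chunked loads and stores otherwise. A sketch (the scale helper is illustrative):

#![allow(unused)]
fn main() {
use archmage::{Desktop64, arcane};
use magetypes::simd::f32x8;

// Illustrative helper: use the casted view when possible, chunked loads otherwise.
#[arcane]
fn scale(token: Desktop64, data: &mut [f32], factor: f32) {
    let f = f32x8::splat(token, factor);
    if let Some(chunks) = f32x8::cast_slice_mut(token, data) {
        // Fast path: aligned and a multiple of 8 elements
        for chunk in chunks {
            *chunk = *chunk * f;
        }
    } else {
        // Fallback: unaligned or odd-length input
        for chunk in data.chunks_exact_mut(8) {
            let v = f32x8::from_slice(token, chunk) * f;
            v.store_slice(chunk);
        }
        for x in data.chunks_exact_mut(8).into_remainder() {
            *x *= factor;
        }
    }
}
}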

Byte-Level Access

View vectors as bytes (no token needed—you already have the vector):

#![allow(unused)]
fn main() {
let v = f32x8::splat(token, 1.0);

// View as bytes (zero-cost)
let bytes: &[u8; 32] = v.as_bytes();

// Mutable view
let mut v = f32x8::splat(token, 0.0);
let bytes: &mut [u8; 32] = v.as_bytes_mut();

// Create from bytes (token-gated)
let bytes = [0u8; 32];
let v = f32x8::from_bytes(token, &bytes);
}

Conversion Example: Image Processing

#![allow(unused)]
fn main() {
#[arcane]
fn brighten(token: Desktop64, pixels: &mut [u8]) {
    // Process 32 bytes at a time
    for chunk in pixels.chunks_exact_mut(32) {
        let v = u8x32::from_slice(token, chunk);

        // Convert to wider type for arithmetic
        let (lo, hi) = v.split();
        let lo_wide = lo.extend_u16_low();
        let hi_wide = hi.extend_u16_low();

        // Add brightness (with saturation)
        let brightness = u16x16::splat(token, 20);
        let lo_bright = lo_wide.saturating_add(brightness);
        let hi_bright = hi_wide.saturating_add(brightness);

        // Pack back to u8 with saturation
        let result = u8x32::from_halves(
            token,
            lo_bright.pack_u8_saturate(),
            hi_bright.pack_u8_saturate()
        );

        result.store_slice(chunk);
    }
}
}

Transcendental Functions

magetypes provides SIMD implementations of common mathematical functions. These are polynomial approximations optimized for speed.

Precision Levels

Functions come in multiple precision variants:

| Suffix | Precision | Speed | Use Case |
|--------|-----------|-------|----------|
| _lowp | ~12 bits | Fastest | Graphics, audio |
| _midp | ~20 bits | Balanced | General use |
| (none) | Full | Slowest | Scientific |

#![allow(unused)]
fn main() {
let v = f32x8::splat(token, 2.0);

let fast = v.exp2_lowp();      // ~12-bit precision, fastest
let balanced = v.exp2_midp();  // ~20-bit precision
let precise = v.exp2();        // Full precision
}

Exponential Functions

#![allow(unused)]
fn main() {
// Base-2 exponential: 2^x
let v = f32x8::splat(token, 3.0);
let result = v.exp2();  // [8.0; 8]

// Natural exponential: e^x
let result = v.exp();   // [e³; 8] ≈ [20.09; 8]
}

Logarithms

#![allow(unused)]
fn main() {
let v = f32x8::splat(token, 8.0);

// Base-2 logarithm: log₂(x)
let result = v.log2();  // [3.0; 8]

// Natural logarithm: ln(x)
let result = v.ln();    // [ln(8); 8] ≈ [2.08; 8]

// Base-10 logarithm: log₁₀(x)
let result = v.log10(); // [log₁₀(8); 8] ≈ [0.90; 8]
}

Power Functions

#![allow(unused)]
fn main() {
let base = f32x8::splat(token, 2.0);
let exp = f32x8::splat(token, 3.0);

// x^y (uses exp2(y * log2(x)))
let result = base.pow(exp);  // [8.0; 8]
}
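
The same identity can be written out by hand, which also lets you substitute the lower-precision exp2/log2 variants from the table above when full accuracy isn't needed. A sketch continuing the example above:

#![allow(unused)]
fn main() {
// x^y = exp2(y * log2(x)), the identity pow() is built on
let manual = (exp * base.log2()).exp2();   // ≈ base.pow(exp) = [8.0; 8]
}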

Root Functions

#![allow(unused)]
fn main() {
let v = f32x8::splat(token, 9.0);

// Square root
let result = v.sqrt();  // [3.0; 8]

// Cube root
let result = v.cbrt();  // [∛9; 8] ≈ [2.08; 8]

// Reciprocal square root: 1/√x
let result = v.rsqrt();  // [1/3; 8] ≈ [0.33; 8]
}

Approximations

Fast approximations for graphics/games:

#![allow(unused)]
fn main() {
// Reciprocal: 1/x (approximate)
let v = f32x8::splat(token, 4.0);
let result = v.rcp();  // ≈ [0.25; 8]

// Reciprocal square root (approximate)
let result = v.rsqrt();  // ≈ [0.5; 8]
}

For higher precision, use Newton-Raphson refinement:

#![allow(unused)]
fn main() {
// One Newton-Raphson iteration for rsqrt
let approx = v.rsqrt();
let refined = approx * (f32x8::splat(token, 1.5) - v * approx * approx * f32x8::splat(token, 0.5));
}

Special Handling

Domain Errors

#![allow(unused)]
fn main() {
let v = f32x8::from_array(token, [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]);

// sqrt of negative → NaN
let sqrt = v.sqrt();  // [NaN, 0.0, 1.0, 1.41, 1.73, 2.0, 2.24, 2.45]

// log of non-positive → NaN or -inf
let log = v.ln();     // [NaN, -inf, 0.0, 0.69, 1.10, 1.39, 1.61, 1.79]
}
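
When NaN or -inf lanes are unacceptable, clamp the input into the valid domain before calling the function. A sketch using max and splat from this page, with token and v in scope:

#![allow(unused)]
fn main() {
// Clamp to the smallest positive value so ln() never sees zero or negative lanes
let floor = f32x8::splat(token, f32::MIN_POSITIVE);
let safe_log = v.max(floor).ln();   // no NaN or -inf lanes
}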

Unchecked Variants

Some functions have _unchecked variants that skip domain validation:

#![allow(unused)]
fn main() {
// Assumes all inputs are valid (positive for sqrt/log)
let result = v.sqrt_unchecked();  // Faster, UB if negative
let result = v.ln_unchecked();    // Faster, UB if ≤ 0
}

Example: Gaussian Function

#![allow(unused)]
fn main() {
#[arcane]
fn gaussian(token: Desktop64, x: &[f32; 8], sigma: f32) -> [f32; 8] {
    let v = f32x8::from_array(token, *x);
    let sigma_v = f32x8::splat(token, sigma);
    let two = f32x8::splat(token, 2.0);

    // exp(-x² / (2σ²))
    let x_sq = v * v;
    let two_sigma_sq = two * sigma_v * sigma_v;
    let exponent = -(x_sq / two_sigma_sq);
    let result = exponent.exp_midp();  // Good precision, fast

    result.to_array()
}
}

Example: Softmax

#![allow(unused)]
fn main() {
#[arcane]
fn softmax(token: Desktop64, logits: &[f32; 8]) -> [f32; 8] {
    let v = f32x8::from_array(token, *logits);

    // Subtract max for numerical stability
    let max = v.reduce_max();
    let shifted = v - f32x8::splat(token, max);

    // exp(x - max)
    let exp = shifted.exp_midp();

    // Normalize
    let sum = exp.reduce_add();
    let result = exp / f32x8::splat(token, sum);

    result.to_array()
}
}

Platform Notes

  • x86-64: All functions available for f32x4, f32x8, f64x2, f64x4
  • AArch64: Full support via NEON
  • WASM: Most functions available, some via scalar fallback

The implementation uses polynomial approximations tuned per platform for best performance.

Memory Operations

Efficiently moving data between memory and SIMD registers is critical for performance.

Load Operations

Unaligned Load

#![allow(unused)]
fn main() {
use magetypes::simd::f32x8;

// From array reference
let arr = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let v = f32x8::from_array(token, arr);

// From slice (must have enough elements)
let slice = &[1.0f32; 16];
let v = f32x8::from_slice(token, &slice[0..8]);
}

Aligned Load

If you know your data is aligned:

#![allow(unused)]
fn main() {
// Aligned load (UB if not aligned to 32 bytes for f32x8)
let v = unsafe { f32x8::load_aligned(ptr) };
}

Partial Load

Load fewer elements than the vector width:

#![allow(unused)]
fn main() {
// Load 4 elements into lower half, zero upper half
let v = f32x8::load_low(token, &[1.0, 2.0, 3.0, 4.0]);
// v = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
}

Store Operations

Unaligned Store

#![allow(unused)]
fn main() {
let v = f32x8::splat(token, 42.0);

// To array
let arr: [f32; 8] = v.to_array();

// To slice
let mut buf = [0.0f32; 8];
v.store_slice(&mut buf);
}

Aligned Store

#![allow(unused)]
fn main() {
// Aligned store (UB if not aligned)
unsafe { v.store_aligned(ptr) };
}

Partial Store

Store only some elements:

#![allow(unused)]
fn main() {
// Store lower 4 elements
v.store_low(&mut buf[0..4]);
}

Streaming Stores

For large data where you won't read back soon:

#![allow(unused)]
fn main() {
// Non-temporal store (bypasses cache)
unsafe { v.stream(ptr) };
}

Use streaming stores when:

  • Writing large arrays sequentially
  • Data won't be read again soon
  • Avoiding cache pollution is important
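
For example, filling a large output buffer that won't be read back soon. A hedged sketch: it checks 32-byte alignment before using the non-temporal path (required by the underlying instruction) and falls back to a regular store otherwise; the fill_buffer helper is illustrative and assumes stream takes a *mut f32 as shown above:

#![allow(unused)]
fn main() {
use archmage::{Desktop64, arcane};
use magetypes::simd::f32x8;

// Illustrative helper: non-temporal stores for a big write-only fill.
#[arcane]
fn fill_buffer(token: Desktop64, out: &mut [f32], value: f32) {
    let v = f32x8::splat(token, value);
    for chunk in out.chunks_exact_mut(8) {
        let ptr = chunk.as_mut_ptr();
        if (ptr as usize) % 32 == 0 {
            unsafe { v.stream(ptr) };   // bypass the cache
        } else {
            v.store_slice(chunk);       // unaligned fallback
        }
    }
    for x in out.chunks_exact_mut(8).into_remainder() {
        *x = value;
    }
}
}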

Gather and Scatter

Load/store non-contiguous elements:

#![allow(unused)]
fn main() {
// Gather: load from scattered indices
let data = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0];
let indices = i32x8::from_array(token, [0, 2, 4, 6, 8, 1, 3, 5]);
let gathered = f32x8::gather(&data, indices);
// gathered = [0.0, 20.0, 40.0, 60.0, 80.0, 10.0, 30.0, 50.0]

// Scatter: store to scattered indices
let mut output = [0.0f32; 10];
let values = f32x8::splat(token, 1.0);
values.scatter(&mut output, indices);
}

Note: Gather/scatter may be slow on some CPUs. Profile before using.

Prefetch

Hint the CPU to load data into cache:

#![allow(unused)]
fn main() {
use std::arch::x86_64::*;

// Prefetch for read
unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) };

// Prefetch levels:
// _MM_HINT_T0  - All cache levels
// _MM_HINT_T1  - L2 and above
// _MM_HINT_T2  - L3 and above
// _MM_HINT_NTA - Non-temporal (don't pollute cache)
}

Interleaved Data

For RGBARGBA... or similar interleaved formats:

#![allow(unused)]
fn main() {
// Deinterleave 4 channels (RGBA)
let (r, g, b, a) = f32x8::deinterleave_4ch(
    token,
    &rgba_data[0..8],
    &rgba_data[8..16],
    &rgba_data[16..24],
    &rgba_data[24..32]
);

// Process channels separately
let r_bright = r + f32x8::splat(token, 0.1);

// Reinterleave
let (out0, out1, out2, out3) = f32x8::interleave_4ch(token, r_bright, g, b, a);
}

Chunked Processing

Process large arrays in SIMD-sized chunks:

#![allow(unused)]
fn main() {
#[arcane]
fn process_large(token: Desktop64, data: &mut [f32]) {
    // Process full chunks
    for chunk in data.chunks_exact_mut(8) {
        let v = f32x8::from_slice(token, chunk);
        let result = v * v;  // Process
        result.store_slice(chunk);
    }

    // Handle remainder
    for x in data.chunks_exact_mut(8).into_remainder() {
        *x = *x * *x;
    }
}
}

Alignment Tips

  1. Use #[repr(align(32))] for AVX2 data:

    #![allow(unused)]
    fn main() {
    #[repr(C, align(32))]
    struct AlignedData {
        values: [f32; 8],
    }
    }
  2. Allocate aligned memory:

    #![allow(unused)]
    fn main() {
    use std::alloc::{alloc, Layout};
    
    let layout = Layout::from_size_align(size, 32).unwrap();
    let ptr = unsafe { alloc(layout) };
    }
  3. Check alignment at runtime:

    #![allow(unused)]
    fn main() {
    fn is_aligned<T>(ptr: *const T, align: usize) -> bool {
        (ptr as usize) % align == 0
    }
    }

Performance Tips

  1. Minimize loads/stores — Keep data in registers
  2. Prefer unaligned — Modern CPUs handle it well
  3. Use streaming for large writes — Saves cache space
  4. Batch operations — Load once, do multiple ops, store once (see the sketch after this list)
  5. Avoid gather/scatter — Sequential access is faster
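
Tip 4 in practice: keep values in registers across several operations instead of round-tripping through memory between each step. A minimal sketch (the fused helper is illustrative):

#![allow(unused)]
fn main() {
use archmage::{Desktop64, arcane};
use magetypes::simd::f32x8;

// Illustrative helper: one load, several register-to-register ops, one store.
#[arcane]
fn fused(token: Desktop64, data: &mut [f32; 8]) {
    let v = f32x8::from_array(token, *data);
    let scaled = v * f32x8::splat(token, 2.0);
    let shifted = scaled + f32x8::splat(token, 1.0);
    let clamped = shifted.min(f32x8::splat(token, 10.0));
    *data = clamped.to_array();
}
}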

Methods with #[arcane]

Using #[arcane] on methods requires special handling because of how the macro transforms the function body.

The Problem

#![allow(unused)]
fn main() {
impl MyType {
    // This won't work as expected!
    #[arcane]
    fn process(&self, token: Desktop64) -> f32 {
        self.data[0]  // Error: `self` not available in inner function
    }
}
}

The macro generates an inner function where self becomes a regular parameter.

The Solution: _self = Type

Pass _self = Type to the macro and use _self in place of self inside the body:

#![allow(unused)]
fn main() {
use archmage::{Desktop64, arcane};
use magetypes::simd::f32x8;

struct Vector8([f32; 8]);

impl Vector8 {
    #[arcane(_self = Vector8)]
    fn magnitude(&self, token: Desktop64) -> f32 {
        // Use _self, not self
        let v = f32x8::from_array(token, _self.0);
        (v * v).reduce_add().sqrt()
    }
}
}

All Receiver Types

&self (Shared Reference)

#![allow(unused)]
fn main() {
impl Vector8 {
    #[arcane(_self = Vector8)]
    fn dot(&self, token: Desktop64, other: &Self) -> f32 {
        let a = f32x8::from_array(token, _self.0);
        let b = f32x8::from_array(token, other.0);
        (a * b).reduce_add()
    }
}
}

&mut self (Mutable Reference)

#![allow(unused)]
fn main() {
impl Vector8 {
    #[arcane(_self = Vector8)]
    fn normalize(&mut self, token: Desktop64) {
        let v = f32x8::from_array(token, _self.0);
        let len = (v * v).reduce_add().sqrt();
        if len > 0.0 {
            let inv = f32x8::splat(token, 1.0 / len);
            let normalized = v * inv;
            _self.0 = normalized.to_array();
        }
    }
}
}

self (By Value)

#![allow(unused)]
fn main() {
impl Vector8 {
    #[arcane(_self = Vector8)]
    fn scaled(self, token: Desktop64, factor: f32) -> Self {
        let v = f32x8::from_array(token, _self.0);
        let s = f32x8::splat(token, factor);
        Vector8((v * s).to_array())
    }
}
}

Trait Implementations

Works with traits too:

#![allow(unused)]
fn main() {
trait SimdOps {
    fn double(&self, token: Desktop64) -> Self;
}

impl SimdOps for Vector8 {
    #[arcane(_self = Vector8)]
    fn double(&self, token: Desktop64) -> Self {
        let v = f32x8::from_array(token, _self.0);
        Vector8((v + v).to_array())
    }
}
}

Why _self?

The name _self reminds you that:

  1. You're not using the normal self keyword
  2. The macro has transformed the function
  3. You need to be explicit about the type

It's a deliberate choice to make the transformation visible.

Generated Code

#![allow(unused)]
fn main() {
// You write:
impl Vector8 {
    #[arcane(_self = Vector8)]
    fn process(&self, token: Desktop64) -> f32 {
        f32x8::from_array(token, _self.0).reduce_add()
    }
}

// Macro generates:
impl Vector8 {
    fn process(&self, token: Desktop64) -> f32 {
        #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
        #[inline]
        unsafe fn __inner(_self: &Vector8, token: Desktop64) -> f32 {
            f32x8::from_array(token, _self.0).reduce_add()
        }
        unsafe { __inner(self, token) }
    }
}
}

Common Patterns

Builder Pattern

#![allow(unused)]
fn main() {
impl ImageProcessor {
    #[arcane(_self = ImageProcessor)]
    fn with_brightness(self, token: Desktop64, amount: f32) -> Self {
        let mut result = _self;
        // Process brightness...
        result
    }

    #[arcane(_self = ImageProcessor)]
    fn with_contrast(self, token: Desktop64, amount: f32) -> Self {
        let mut result = _self;
        // Process contrast...
        result
    }
}

// Usage
let processed = processor
    .with_brightness(token, 1.2)
    .with_contrast(token, 1.1);
}

Mutable Iteration

#![allow(unused)]
fn main() {
impl Buffer {
    #[arcane(_self = Buffer)]
    fn process_all(&mut self, token: Desktop64) {
        for chunk in _self.data.chunks_exact_mut(8) {
            let v = f32x8::from_slice(token, chunk);
            let result = v * v;
            result.store_slice(chunk);
        }
    }
}
}

LLVM Optimization Boundaries

Understanding when LLVM can and cannot optimize across function calls is crucial for peak SIMD performance.

The Problem

#[target_feature] changes LLVM's target settings for a function. When caller and callee have different settings, LLVM cannot optimize across the boundary.

#![allow(unused)]
fn main() {
// Generic caller - baseline target settings
fn dispatch<T: IntoConcreteToken>(token: T, data: &[f32]) {
    if let Some(t) = token.as_x64v3() {
        process_avx2(t, data);  // Different LLVM target!
    }
}

// AVX2 callee - AVX2 target settings
#[arcane]
fn process_avx2(token: X64V3Token, data: &[f32]) {
    // Can't inline back into dispatch()
}
}

Why It Matters

SIMD performance depends heavily on:

  1. Inlining — Avoids function call overhead
  2. Register allocation — Keeps values in SIMD registers
  3. Instruction scheduling — Reorders for pipeline efficiency

All of these break at target feature boundaries.

Good: Same Token Type Chain

#![allow(unused)]
fn main() {
#[arcane]
fn outer(token: X64V3Token, data: &[f32]) -> f32 {
    let a = step1(token, data);     // Same token → inlines
    let b = step2(token, data);     // Same token → inlines
    a + b
}

#[arcane]
fn step1(token: X64V3Token, data: &[f32]) -> f32 {
    // Shares LLVM target settings with outer
    // Can inline, share registers, optimize across boundary
}

#[arcane]
fn step2(token: X64V3Token, data: &[f32]) -> f32 {
    // Same deal
}
}

Good: Downcast (Higher → Lower)

#![allow(unused)]
fn main() {
#[arcane]
fn v4_main(token: X64V4Token, data: &[f32]) -> f32 {
    // Calling V3 function with V4 token
    // V4 is superset of V3, LLVM can still optimize
    v3_helper(token, data)
}

#[arcane]
fn v3_helper(token: X64V3Token, data: &[f32]) -> f32 {
    // This inlines properly
}
}

Bad: Generic Boundary

#![allow(unused)]
fn main() {
// Generic function - no target features
fn generic<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 {
    // This is compiled with baseline settings
    if let Some(t) = token.as_x64v3() {
        concrete(t, data)  // BOUNDARY - can't inline back
    } else {
        0.0
    }
}

#[arcane]
fn concrete(token: X64V3Token, data: &[f32]) -> f32 {
    // This has AVX2 settings
    // LLVM won't inline this into generic()
}
}

Bad: Upcast Check in Hot Code

#![allow(unused)]
fn main() {
#[arcane]
fn process(token: X64V3Token, data: &[f32]) -> f32 {
    // WRONG: Checking for higher tier inside hot function
    for chunk in data.chunks(8) {
        if let Some(v4) = token.as_x64v4() {  // Wait, this always fails!
            // V3 token can't become V4
        }
    }
}
}

Even when the check makes sense, it's an optimization barrier.

Pattern: Dispatch at Entry

#![allow(unused)]
fn main() {
// Public API - dispatch happens here
pub fn process(data: &[f32]) -> f32 {
    // Summon and dispatch ONCE
    #[cfg(feature = "avx512")]
    if let Some(token) = X64V4Token::summon() {
        return process_v4(token, data);
    }

    if let Some(token) = X64V3Token::summon() {
        return process_v3(token, data);
    }

    process_scalar(data)
}

// Each implementation is self-contained
#[arcane]
fn process_v4(token: X64V4Token, data: &[f32]) -> f32 {
    // All V4 code, fully optimizable
    let result = step1_v4(token, data);
    step2_v4(token, result)
}

#[arcane]
fn step1_v4(token: X64V4Token, data: &[f32]) -> f32 { /* ... */ }

#[arcane]
fn step2_v4(token: X64V4Token, result: f32) -> f32 { /* ... */ }
}

Pattern: Trait with Concrete Impls

#![allow(unused)]
fn main() {
trait Processor {
    fn process(&self, data: &[f32]) -> f32;
}

struct V3Processor(X64V3Token);

impl Processor for V3Processor {
    fn process(&self, data: &[f32]) -> f32 {
        // Note: this can't use #[arcane] on trait method
        // Call through to arcane function instead
        process_v3_impl(self.0, data)
    }
}

#[arcane]
fn process_v3_impl(token: X64V3Token, data: &[f32]) -> f32 {
    // Full optimization here
}
}

Measuring the Impact

#![allow(unused)]
fn main() {
use criterion::Criterion;

// Benchmark both patterns (generic_dispatch and concrete_path stand in for the
// two dispatch styles shown above)
fn bench_generic_dispatch(c: &mut Criterion) {
    let data = vec![0.0f32; 1024];
    c.bench_function("generic", |b| {
        let token = Desktop64::summon().unwrap();
        b.iter(|| generic_dispatch(token, &data))
    });
}

fn bench_concrete_dispatch(c: &mut Criterion) {
    let data = vec![0.0f32; 1024];
    c.bench_function("concrete", |b| {
        let token = Desktop64::summon().unwrap();
        b.iter(|| concrete_path(token, &data))
    });
}
}

Typical impact: 10-30% performance difference for small functions.

Summary

| Pattern | Inlining | Recommendation |
|---------|----------|----------------|
| Same concrete token | ✅ Full | Best for hot paths |
| Downcast (V4→V3) | ✅ Full | Safe and fast |
| Generic → concrete | ❌ Boundary | Entry point only |
| Upcast check | ❌ Boundary | Avoid in hot code |

AVX-512 Patterns

AVX-512 provides 512-bit vectors and advanced features. Here's how to use it effectively with archmage.

Enabling AVX-512

Add the feature to your Cargo.toml:

[dependencies]
archmage = { version = "0.4", features = ["avx512"] }
magetypes = { version = "0.4", features = ["avx512"] }

AVX-512 Tokens

| Token | Features | CPUs |
|-------|----------|------|
| X64V4Token | F, BW, CD, DQ, VL | Skylake-X, Zen 4 |
| Avx512ModernToken | + VNNI, VBMI, IFMA, etc. | Ice Lake+, Zen 4+ |
| Avx512Fp16Token | + FP16 | Sapphire Rapids |

Aliases:

  • Server64 = X64V4Token
  • Avx512Token = X64V4Token

Basic Usage

use archmage::{X64V4Token, SimdToken, arcane};
use magetypes::simd::f32x16;

#[arcane]
fn process_512(token: X64V4Token, data: &[f32; 16]) -> f32 {
    let v = f32x16::from_array(token, *data);
    (v * v).reduce_add()
}

fn main() {
    if let Some(token) = X64V4Token::summon() {
        let data = [1.0f32; 16];
        let result = process_512(token, &data);
        println!("Result: {}", result);
    } else {
        println!("AVX-512 not available");
    }
}

512-bit Types

| Type | Elements | Intrinsic Type |
|------|----------|----------------|
| f32x16 | 16 × f32 | __m512 |
| f64x8 | 8 × f64 | __m512d |
| i32x16 | 16 × i32 | __m512i |
| i64x8 | 8 × i64 | __m512i |
| i16x32 | 32 × i16 | __m512i |
| i8x64 | 64 × i8 | __m512i |

Masking

AVX-512's killer feature is per-lane masking:

#![allow(unused)]
fn main() {
use std::arch::x86_64::*;

#[arcane]
fn masked_add(token: X64V4Token, a: __m512, b: __m512, mask: __mmask16) -> __m512 {
    // Only add lanes where mask bit is 1
    // Other lanes keep value from `a`
    _mm512_mask_add_ps(a, mask, a, b)
}
}
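
One common use is handling a buffer tail without a scalar loop: build a mask with one bit per valid lane and use masked load/store. A hedged sketch using the raw _mm512_maskz_loadu_ps / _mm512_mask_storeu_ps intrinsics (pointer-based intrinsics still need unsafe even inside #[arcane]); the square_tail helper is illustrative:

#![allow(unused)]
fn main() {
use std::arch::x86_64::*;
use archmage::{X64V4Token, arcane};

// Illustrative helper: process a < 16-element tail with a lane mask instead of a scalar loop.
#[arcane]
fn square_tail(token: X64V4Token, tail: &mut [f32]) {
    debug_assert!(tail.len() < 16);
    let mask: __mmask16 = (1u16 << tail.len()) - 1;   // one bit per valid lane
    unsafe {
        let v = _mm512_maskz_loadu_ps(mask, tail.as_ptr());   // masked-off lanes load as 0.0
        let sq = _mm512_mul_ps(v, v);
        _mm512_mask_storeu_ps(tail.as_mut_ptr(), mask, sq);   // only valid lanes are written
    }
}
}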

Tiered Fallback with AVX-512

#![allow(unused)]
fn main() {
pub fn process(data: &mut [f32]) {
    #[cfg(feature = "avx512")]
    if let Some(token) = X64V4Token::summon() {
        return process_avx512(token, data);
    }

    if let Some(token) = X64V3Token::summon() {
        return process_avx2(token, data);
    }

    process_scalar(data);
}

#[cfg(feature = "avx512")]
#[arcane]
fn process_avx512(token: X64V4Token, data: &mut [f32]) {
    for chunk in data.chunks_exact_mut(16) {
        let v = f32x16::from_slice(token, chunk);
        let result = v * v;
        result.store_slice(chunk);
    }
    // Handle remainder with AVX2 (V4 can downcast to V3)
    let remainder = data.chunks_exact_mut(16).into_remainder();
    if !remainder.is_empty() {
        process_avx2(token, remainder);  // Downcast works!
    }
}

#[arcane]
fn process_avx2(token: X64V3Token, data: &mut [f32]) {
    for chunk in data.chunks_exact_mut(8) {
        let v = f32x8::from_slice(token, chunk);
        let result = v * v;
        result.store_slice(chunk);
    }
    for x in data.chunks_exact_mut(8).into_remainder() {
        *x = *x * *x;
    }
}
}

AVX-512 Performance Considerations

Frequency Throttling

Heavy AVX-512 use can cause CPU frequency throttling:

  • Light AVX-512: Minimal impact
  • Heavy 512-bit ops: Up to 20% frequency reduction
  • Heavy 512-bit + FMA: Up to 30% reduction

For short bursts, this doesn't matter. For sustained workloads, consider whether 256-bit code is actually faster thanks to the higher sustained clock frequency.

When AVX-512 Wins

  • Large data: Processing 16 floats vs 8 is 2× work per instruction
  • Masked operations: No equivalent in AVX2
  • Gather/scatter: Much faster than AVX2
  • Specific instructions: VPTERNLOG, conflict detection, etc.

When AVX2 Might Win

  • Short bursts: Throttling overhead not amortized
  • Memory-bound code: Wider vectors don't help if waiting for RAM
  • Mixed workloads: Frequency penalty affects scalar code too

Checking for AVX-512

#![allow(unused)]
fn main() {
use archmage::{X64V4Token, SimdToken};

fn check_avx512() {
    match X64V4Token::guaranteed() {
        Some(true) => println!("Compile-time AVX-512"),
        Some(false) => println!("Not x86-64"),
        None => {
            if X64V4Token::summon().is_some() {
                println!("Runtime AVX-512 available");
            } else {
                println!("No AVX-512");
            }
        }
    }
}
}

Example: Matrix Multiply

#![allow(unused)]
fn main() {
#[cfg(feature = "avx512")]
#[arcane]
fn matmul_4x4_avx512(
    token: X64V4Token,
    a: &[[f32; 4]; 4],
    b: &[[f32; 4]; 4],
    c: &mut [[f32; 4]; 4]
) {
    use std::arch::x86_64::*;

    // Load B columns into registers
    let b_col0 = _mm512_set_ps(
        b[3][0], b[2][0], b[1][0], b[0][0],
        b[3][0], b[2][0], b[1][0], b[0][0],
        b[3][0], b[2][0], b[1][0], b[0][0],
        b[3][0], b[2][0], b[1][0], b[0][0]
    );
    // ... broadcast and FMA pattern
}
}

WASM SIMD

WebAssembly SIMD128 provides 128-bit vectors in the browser and WASI environments.

Setup

Enable SIMD128 in your build:

RUSTFLAGS="-Ctarget-feature=+simd128" cargo build --target wasm32-unknown-unknown

Or in .cargo/config.toml:

[target.wasm32-unknown-unknown]
rustflags = ["-Ctarget-feature=+simd128"]

The Token

#![allow(unused)]
fn main() {
use archmage::{Simd128Token, SimdToken};

fn check_wasm_simd() {
    if let Some(token) = Simd128Token::summon() {
        process_simd(token, &data);
    } else {
        process_scalar(&data);
    }
}
}

Note: On WASM, Simd128Token::summon() succeeds if the binary was compiled with SIMD128 support. There's no runtime feature detection in WASM—the capability is determined at compile time.

Available Types

| Type | Elements |
|------|----------|
| f32x4 | 4 × f32 |
| f64x2 | 2 × f64 |
| i32x4 | 4 × i32 |
| i64x2 | 2 × i64 |
| i16x8 | 8 × i16 |
| i8x16 | 16 × i8 |
| u32x4 | 4 × u32 |
| u64x2 | 2 × u64 |
| u16x8 | 8 × u16 |
| u8x16 | 16 × u8 |

Basic Usage

#![allow(unused)]
fn main() {
use archmage::{Simd128Token, arcane};
use magetypes::simd::f32x4;

#[arcane]
fn dot_product(token: Simd128Token, a: &[f32; 4], b: &[f32; 4]) -> f32 {
    let va = f32x4::from_array(token, *a);
    let vb = f32x4::from_array(token, *b);
    (va * vb).reduce_add()
}
}

Cross-Platform Code

Write once, run on x86, ARM, and WASM:

#![allow(unused)]
fn main() {
use archmage::{Desktop64, NeonToken, Simd128Token, SimdToken, arcane, incant};

// Define platform-specific implementations
#[cfg(target_arch = "x86_64")]
#[arcane]
fn sum_v3(token: Desktop64, data: &[f32; 8]) -> f32 {
    use magetypes::simd::f32x8;
    f32x8::from_array(token, *data).reduce_add()
}

#[cfg(target_arch = "aarch64")]
#[arcane]
fn sum_neon(token: NeonToken, data: &[f32; 8]) -> f32 {
    use magetypes::simd::f32x4;
    let a = f32x4::from_slice(token, &data[0..4]);
    let b = f32x4::from_slice(token, &data[4..8]);
    a.reduce_add() + b.reduce_add()
}

#[cfg(target_arch = "wasm32")]
#[arcane]
fn sum_wasm128(token: Simd128Token, data: &[f32; 8]) -> f32 {
    use magetypes::simd::f32x4;
    let a = f32x4::from_slice(token, &data[0..4]);
    let b = f32x4::from_slice(token, &data[4..8]);
    a.reduce_add() + b.reduce_add()
}

fn sum_scalar(data: &[f32; 8]) -> f32 {
    data.iter().sum()
}

// Public API
pub fn sum(data: &[f32; 8]) -> f32 {
    incant!(sum(data))
}
}

WASM-Specific Considerations

No Runtime Detection

Unlike x86/ARM, WASM doesn't have runtime feature detection. The SIMD support is baked in at compile time:

#![allow(unused)]
fn main() {
// On WASM, this is always the same result
// (based on compile-time -Ctarget-feature=+simd128)
let has_simd = Simd128Token::summon().is_some();
}

Browser Compatibility

WASM SIMD is supported in:

  • Chrome 91+ (May 2021)
  • Firefox 89+ (June 2021)
  • Safari 16.4+ (March 2023)
  • Node.js 16.4+

For older browsers, provide a non-SIMD fallback WASM binary.

Relaxed SIMD

WASM also has "relaxed SIMD" with even more instructions. As of 2024, this requires additional flags:

RUSTFLAGS="-Ctarget-feature=+simd128,+relaxed-simd" cargo build

Example: Image Processing in Browser

#![allow(unused)]
fn main() {
use wasm_bindgen::prelude::*;
use archmage::{Simd128Token, SimdToken, arcane};
use magetypes::simd::u8x16;

#[wasm_bindgen]
pub fn brighten_image(pixels: &mut [u8], amount: u8) {
    if let Some(token) = Simd128Token::summon() {
        brighten_simd(token, pixels, amount);
    } else {
        brighten_scalar(pixels, amount);
    }
}

#[arcane]
fn brighten_simd(token: Simd128Token, pixels: &mut [u8], amount: u8) {
    let add = u8x16::splat(token, amount);

    for chunk in pixels.chunks_exact_mut(16) {
        let v = u8x16::from_slice(token, chunk);
        let bright = v.saturating_add(add);
        bright.store_slice(chunk);
    }

    // Handle remainder
    for pixel in pixels.chunks_exact_mut(16).into_remainder() {
        *pixel = pixel.saturating_add(amount);
    }
}

fn brighten_scalar(pixels: &mut [u8], amount: u8) {
    for pixel in pixels {
        *pixel = pixel.saturating_add(amount);
    }
}
}

Testing WASM Code

Use wasm-pack test:

wasm-pack test --node

Or test natively with the scalar fallback:

#![allow(unused)]
fn main() {
#[test]
fn test_sum() {
    let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let result = sum(&data);
    assert_eq!(result, 36.0);
}
}

Token Reference

Complete reference for all archmage tokens.

x86-64 Tokens

X64V2Token

Features: SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT

CPUs: Intel Nehalem (2008)+, AMD Bulldozer (2011)+

#![allow(unused)]
fn main() {
use archmage::{X64V2Token, SimdToken};

if let Some(token) = X64V2Token::summon() {
    // 128-bit SSE operations
}
}

X64V3Token / Desktop64 / Avx2FmaToken

Features: All V2 + AVX, AVX2, FMA, BMI1, BMI2, F16C, MOVBE

CPUs: Intel Haswell (2013)+, AMD Zen 1 (2017)+

#![allow(unused)]
fn main() {
use archmage::{X64V3Token, Desktop64, SimdToken};

// These are the same type:
let t1: Option<X64V3Token> = X64V3Token::summon();
let t2: Option<Desktop64> = Desktop64::summon();
}

Aliases:

  • Desktop64 — Friendly name for typical desktop/laptop CPUs
  • Avx2FmaToken — Legacy name (deprecated)

X64V4Token / Server64 / Avx512Token

Features: All V3 + AVX-512F, AVX-512BW, AVX-512CD, AVX-512DQ, AVX-512VL

CPUs: Intel Skylake-X (2017)+, AMD Zen 4 (2022)+

Requires: avx512 feature

#![allow(unused)]
fn main() {
#[cfg(feature = "avx512")]
use archmage::{X64V4Token, Server64, SimdToken};

if let Some(token) = X64V4Token::summon() {
    // 512-bit AVX-512 operations
}
}

Aliases:

  • Server64 — Friendly name for server CPUs
  • Avx512Token — Direct alias

Avx512ModernToken

Features: All V4 + VPOPCNTDQ, IFMA, VBMI, VNNI, BF16, VBMI2, BITALG, VPCLMULQDQ, GFNI, VAES

CPUs: Intel Ice Lake (2019)+, AMD Zen 4 (2022)+

Requires: avx512 feature

#![allow(unused)]
fn main() {
#[cfg(feature = "avx512")]
if let Some(token) = Avx512ModernToken::summon() {
    // Modern AVX-512 extensions
}
}

Avx512Fp16Token

Features: AVX-512FP16

CPUs: Intel Sapphire Rapids (2023)+

Requires: avx512 feature

#![allow(unused)]
fn main() {
#[cfg(feature = "avx512")]
if let Some(token) = Avx512Fp16Token::summon() {
    // Native FP16 operations
}
}

AArch64 Tokens

NeonToken / Arm64

Features: NEON (always available on AArch64)

CPUs: All 64-bit ARM processors

#![allow(unused)]
fn main() {
use archmage::{NeonToken, Arm64, SimdToken};

// Always succeeds on AArch64
let token = NeonToken::summon().unwrap();
}

Alias: Arm64

NeonAesToken

Features: NEON + AES

CPUs: Most ARMv8 processors with crypto extensions

#![allow(unused)]
fn main() {
if let Some(token) = NeonAesToken::summon() {
    // AES acceleration available
}
}

NeonSha3Token

Features: NEON + SHA3

CPUs: ARMv8.2+ with SHA3 extension

#![allow(unused)]
fn main() {
if let Some(token) = NeonSha3Token::summon() {
    // SHA3 acceleration available
}
}

NeonCrcToken

Features: NEON + CRC

CPUs: Most ARMv8 processors

#![allow(unused)]
fn main() {
if let Some(token) = NeonCrcToken::summon() {
    // CRC32 acceleration available
}
}

WASM Token

Simd128Token

Features: WASM SIMD128

Requires: Compile with -Ctarget-feature=+simd128

#![allow(unused)]
fn main() {
use archmage::{Simd128Token, SimdToken};

if let Some(token) = Simd128Token::summon() {
    // WASM SIMD128 operations
}
}

Universal Token

ScalarToken

Features: None (pure scalar fallback)

Availability: Always, on all platforms

#![allow(unused)]
fn main() {
use archmage::{ScalarToken, SimdToken};

// Always succeeds
let token = ScalarToken::summon().unwrap();

// Or construct directly
let token = ScalarToken;
}

SimdToken Trait

All tokens implement SimdToken:

#![allow(unused)]
fn main() {
pub trait SimdToken: Copy + Clone + Send + Sync + 'static {
    const NAME: &'static str;

    /// Compile-time guarantee check
    fn guaranteed() -> Option<bool>;

    /// Runtime detection
    fn summon() -> Option<Self>;

    /// Alias for summon()
    fn attempt() -> Option<Self>;

    /// Legacy alias (deprecated)
    fn try_new() -> Option<Self>;

    /// Unsafe construction (deprecated)
    unsafe fn forge_token_dangerously() -> Self;
}
}

guaranteed()

Returns:

  • Some(true) — Feature is compile-time guaranteed (e.g., -Ctarget-cpu=haswell)
  • Some(false) — Wrong architecture (token can never exist)
  • None — Runtime check needed
#![allow(unused)]
fn main() {
match Desktop64::guaranteed() {
    Some(true) => {
        // summon() will always succeed, check is elided
        let token = Desktop64::summon().unwrap();
    }
    Some(false) => {
        // Wrong arch, use fallback
    }
    None => {
        // Need runtime check
        if let Some(token) = Desktop64::summon() {
            // ...
        }
    }
}
}

summon()

Performs runtime CPU feature detection. Returns Some(token) if features are available.

#![allow(unused)]
fn main() {
if let Some(token) = Desktop64::summon() {
    // CPU supports AVX2+FMA
}
}

Token Size

All tokens are zero-sized:

#![allow(unused)]
fn main() {
use std::mem::size_of;

assert_eq!(size_of::<X64V3Token>(), 0);
assert_eq!(size_of::<NeonToken>(), 0);
assert_eq!(size_of::<ScalarToken>(), 0);
}

Passing tokens has zero runtime cost.

Trait Reference

Reference for archmage traits.

SimdToken

The base trait for all capability tokens.

#![allow(unused)]
fn main() {
pub trait SimdToken: Copy + Clone + Send + Sync + 'static {
    const NAME: &'static str;
    fn guaranteed() -> Option<bool>;
    fn summon() -> Option<Self>;
    fn attempt() -> Option<Self>;
}
}

Implementors: All token types

IntoConcreteToken

Enables compile-time dispatch via type checking.

#![allow(unused)]
fn main() {
pub trait IntoConcreteToken: SimdToken {
    fn as_x64v2(self) -> Option<X64V2Token> { None }
    fn as_x64v3(self) -> Option<X64V3Token> { None }
    fn as_x64v4(self) -> Option<X64V4Token> { None }
    fn as_avx512_modern(self) -> Option<Avx512ModernToken> { None }
    fn as_avx512_fp16(self) -> Option<Avx512Fp16Token> { None }
    fn as_neon(self) -> Option<NeonToken> { None }
    fn as_neon_aes(self) -> Option<NeonAesToken> { None }
    fn as_neon_sha3(self) -> Option<NeonSha3Token> { None }
    fn as_neon_crc(self) -> Option<NeonCrcToken> { None }
    fn as_wasm128(self) -> Option<Simd128Token> { None }
    fn as_scalar(self) -> Option<ScalarToken> { None }
}
}

Usage:

#![allow(unused)]
fn main() {
fn dispatch<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 {
    if let Some(t) = token.as_x64v3() {
        process_avx2(t, data)
    } else if let Some(t) = token.as_neon() {
        process_neon(t, data)
    } else {
        process_scalar(data)
    }
}
}

Each concrete token returns Some(self) for its own method, None for others. The compiler eliminates dead branches.

Tier Traits

HasX64V2

Marker trait for tokens that provide x86-64-v2 features (SSE4.2+).

#![allow(unused)]
fn main() {
pub trait HasX64V2: SimdToken {}
}

Implementors: X64V2Token, X64V3Token, X64V4Token, Avx512ModernToken, Avx512Fp16Token

Usage:

#![allow(unused)]
fn main() {
fn process<T: HasX64V2>(token: T, data: &[f32]) {
    // Can use SSE4.2 intrinsics
}
}

HasX64V4

Marker trait for tokens that provide x86-64-v4 features (AVX-512).

#![allow(unused)]
fn main() {
#[cfg(feature = "avx512")]
pub trait HasX64V4: SimdToken {}
}

Implementors: X64V4Token, Avx512ModernToken, Avx512Fp16Token

Requires: avx512 feature

HasNeon

Marker trait for tokens that provide NEON features.

#![allow(unused)]
fn main() {
pub trait HasNeon: SimdToken {}
}

Implementors: NeonToken, NeonAesToken, NeonSha3Token, NeonCrcToken

HasNeonAes

Marker trait for tokens that provide NEON + AES features.

#![allow(unused)]
fn main() {
pub trait HasNeonAes: HasNeon {}
}

Implementors: NeonAesToken

HasNeonSha3

Marker trait for tokens that provide NEON + SHA3 features.

#![allow(unused)]
fn main() {
pub trait HasNeonSha3: HasNeon {}
}

Implementors: NeonSha3Token

Width Traits (Deprecated)

Warning: These traits are misleading and should not be used in new code.

Has128BitSimd (Deprecated)

Only enables SSE/SSE2 (x86 baseline). Does NOT enable SSE4, AVX, or anything useful beyond baseline.

Use instead: HasX64V2 or concrete tokens

Has256BitSimd (Deprecated)

Only enables AVX (NOT AVX2, NOT FMA). This is almost never what you want.

Use instead: X64V3Token or Desktop64

Has512BitSimd (Deprecated)

Only enables AVX-512F. Missing critical AVX-512 extensions.

Use instead: X64V4Token or HasX64V4

magetypes Traits

SimdTypes

Associates SIMD types with a token.

#![allow(unused)]
fn main() {
pub trait SimdTypes {
    type F32: SimdFloat;
    type F64: SimdFloat;
    type I32: SimdInt;
    type I64: SimdInt;
    // ...
}
}

Usage:

#![allow(unused)]
fn main() {
fn process<T: SimdTypes>(token: T, data: &[f32]) {
    let v = T::F32::splat(1.0);
    // ...
}
}

WidthDispatch

Provides access to all SIMD widths from any token.

#![allow(unused)]
fn main() {
pub trait WidthDispatch {
    fn w128(&self) -> W128Types;
    fn w256(&self) -> Option<W256Types>;
    fn w512(&self) -> Option<W512Types>;
}
}

Using Traits Correctly

Prefer Concrete Tokens

#![allow(unused)]
fn main() {
// GOOD: Concrete token, full optimization
fn process(token: X64V3Token, data: &[f32]) { }

// OK: Trait bound, but optimization boundary
fn process<T: HasX64V2>(token: T, data: &[f32]) { }
}

Trait Bounds at API Boundaries

#![allow(unused)]
fn main() {
// Public API can be generic
pub fn process<T: IntoConcreteToken>(token: T, data: &[f32]) {
    // But dispatch to concrete implementations
    if let Some(t) = token.as_x64v3() {
        process_avx2(t, data);
    }
}

// Internal implementations use concrete tokens
#[arcane]
fn process_avx2(token: X64V3Token, data: &[f32]) { }
}

Don't Over-Constrain

#![allow(unused)]
fn main() {
// WRONG: Over-constrained, hard to call
fn process<T: HasX64V2 + HasNeon>(token: T, data: &[f32]) {
    // No token implements both!
}

// RIGHT: Use IntoConcreteToken for multi-platform
fn process<T: IntoConcreteToken>(token: T, data: &[f32]) {
    if let Some(t) = token.as_x64v3() {
        // x86 path
    } else if let Some(t) = token.as_neon() {
        // ARM path
    }
}
}

Feature Flags

Reference for Cargo feature flags in archmage and magetypes.

archmage Features

std (default)

Enables standard library support.

[dependencies]
archmage = "0.4"  # std enabled by default

Disable for no_std:

[dependencies]
archmage = { version = "0.4", default-features = false }

macros (default)

Enables procedural macros: #[arcane], #[rite], #[magetypes], incant!, etc.

# Disable macros (rare)
archmage = { version = "0.4", default-features = false, features = ["std"] }

avx512

Enables AVX-512 tokens and 512-bit types.

archmage = { version = "0.4", features = ["avx512"] }

Unlocks:

  • X64V4Token / Server64 / Avx512Token
  • Avx512ModernToken
  • Avx512Fp16Token
  • HasX64V4 trait

safe_unaligned_simd

Re-exports safe_unaligned_simd crate in the prelude.

archmage = { version = "0.4", features = ["safe_unaligned_simd"] }

Then use:

#![allow(unused)]
fn main() {
use archmage::prelude::*;
// safe_unaligned_simd functions available
}

magetypes Features

std (default)

Standard library support.

avx512

Enables 512-bit types.

magetypes = { version = "0.4", features = ["avx512"] }

Unlocks:

  • f32x16, f64x8
  • i32x16, i64x8
  • i16x32, i8x64
  • u32x16, u64x8, u16x32, u8x64

Feature Combinations

[dependencies]
archmage = { version = "0.4", features = ["avx512", "safe_unaligned_simd"] }
magetypes = { version = "0.4", features = ["avx512"] }

Minimal no_std

[dependencies]
archmage = { version = "0.4", default-features = false, features = ["macros"] }
magetypes = { version = "0.4", default-features = false }

Cross-Platform Library

[dependencies]
archmage = "0.4"
magetypes = "0.4"

[features]
default = ["std"]
std = ["archmage/std", "magetypes/std"]
avx512 = ["archmage/avx512", "magetypes/avx512"]

Cargo Feature vs CPU Feature

Don't confuse Cargo features with CPU features:

| Cargo Feature | Effect |
|---------------|--------|
| avx512 | Compiles AVX-512 code paths |
| (none) | Code exists but may not be called |

| CPU Feature | Effect |
|-------------|--------|
| AVX-512 | CPU can execute AVX-512 instructions |
| (none) | Runtime fallback to other path |

#![allow(unused)]
fn main() {
// Cargo feature controls compilation
#[cfg(feature = "avx512")]
fn avx512_path(token: X64V4Token, data: &[f32]) { }

// Token controls runtime dispatch
if let Some(token) = X64V4Token::summon() {  // Runtime check
    avx512_path(token, data);
}
}

RUSTFLAGS

Not Cargo features, but important compiler flags:

-Ctarget-cpu=native

Compile for current CPU:

RUSTFLAGS="-Ctarget-cpu=native" cargo build --release

Effects:

  • Token::guaranteed() returns Some(true) for supported features
  • summon() becomes a no-op
  • LLVM generates optimal code for your CPU

-Ctarget-cpu=<name>

Compile for specific CPU:

# Haswell = AVX2+FMA
RUSTFLAGS="-Ctarget-cpu=haswell" cargo build --release

# Skylake-AVX512 = AVX-512
RUSTFLAGS="-Ctarget-cpu=skylake-avx512" cargo build --release

-Ctarget-feature=+<feature>

Enable specific features:

# Just AVX2
RUSTFLAGS="-Ctarget-feature=+avx2" cargo build

# WASM SIMD
RUSTFLAGS="-Ctarget-feature=+simd128" cargo build --target wasm32-unknown-unknown

docs.rs Configuration

For docs.rs to show all features:

# Cargo.toml
[package.metadata.docs.rs]
all-features = true
rustdoc-args = ["--cfg", "docsrs"]
#![allow(unused)]
fn main() {
// lib.rs
#![cfg_attr(docsrs, feature(doc_cfg))]

#[cfg(feature = "avx512")]
#[cfg_attr(docsrs, doc(cfg(feature = "avx512")))]
pub use tokens::X64V4Token;
}