Manual Dispatch

The simplest dispatch pattern: check for tokens explicitly, call the appropriate implementation.

Basic Pattern

use archmage::{Desktop64, SimdToken};

pub fn process(data: &mut [f32]) {
    if let Some(token) = Desktop64::summon() {
        process_avx2(token, data);
    } else {
        process_scalar(data);
    }
}

#[arcane]
fn process_avx2(token: Desktop64, data: &mut [f32]) {
    // AVX2 implementation
}

fn process_scalar(data: &mut [f32]) {
    // Scalar fallback
}

That's it. No #[cfg(target_arch)] needed—this compiles and runs everywhere.
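For contrast, here is a rough sketch of the same dispatch written directly against std::arch—the boilerplate the token pattern replaces. The function names are illustrative, and the scalar body is a placeholder.

// Without tokens: a per-architecture guard, a runtime check, and an unsafe call.
pub fn process_std_arch(data: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the runtime check above guarantees AVX2 is present.
            unsafe { process_avx2_raw(data) };
            return;
        }
    }
    process_scalar(data);
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn process_avx2_raw(data: &mut [f32]) {
    // AVX2 implementation
}

fn process_scalar(data: &mut [f32]) {
    // Scalar fallback
}

Every additional architecture multiplies the guards; the token version above stays a single if let.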

No Architecture Guards Needed

Tokens exist on all platforms. On unsupported architectures, summon() returns None and #[arcane] functions become unreachable stubs. You write one dispatch block:

use archmage::{Desktop64, Arm64, Simd128Token, SimdToken};

pub fn process(data: &mut [f32]) {
    // Try x86 AVX2
    if let Some(token) = Desktop64::summon() {
        return process_x86(token, data);
    }

    // Try ARM NEON
    if let Some(token) = Arm64::summon() {
        return process_arm(token, data);
    }

    // Try WASM SIMD
    if let Some(token) = Simd128Token::summon() {
        return process_wasm(token, data);
    }

    // Scalar fallback
    process_scalar(data);
}

#[arcane]
fn process_x86(token: Desktop64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_arm(token: Arm64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_wasm(token: Simd128Token, data: &mut [f32]) { /* ... */ }

fn process_scalar(data: &mut [f32]) { /* ... */ }

  • On x86-64, Desktop64::summon() may succeed (AVX2 is not guaranteed); the others return None.
  • On ARM, Arm64::summon() succeeds (NEON is baseline on AArch64); the others return None.
  • On WASM, Simd128Token::summon() may succeed; the others return None.

The #[arcane] functions for other architectures compile to unreachable stubs—the code exists but can never be called.
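Because the same process() entry point compiles on every target, one portable test can pin whichever path gets dispatched against the scalar reference. A minimal sketch, assuming the SIMD and scalar implementations produce bit-identical results; compare with a tolerance if they legitimately differ in rounding.

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn dispatched_path_matches_scalar() {
        let mut dispatched: Vec<f32> = (0..64).map(|i| i as f32).collect();
        let mut reference = dispatched.clone();

        process(&mut dispatched);       // takes whichever token summons here
        process_scalar(&mut reference); // always the portable path

        assert_eq!(dispatched, reference);
    }
}

Run under cargo test on each CI target and the same test exercises the AVX2, NEON, WASM, and scalar paths with no platform-specific test code.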

Multi-Tier x86 Dispatch

Check from highest to lowest capability:

use archmage::{X64V4Token, Desktop64, X64V2Token, SimdToken};

pub fn process(data: &mut [f32]) {
    // AVX-512 (requires avx512 feature)
    #[cfg(feature = "avx512")]
    if let Some(token) = X64V4Token::summon() {
        return process_v4(token, data);
    }

    // AVX2+FMA (Haswell+, Zen+)
    if let Some(token) = Desktop64::summon() {
        return process_v3(token, data);
    }

    // SSE4.2 (Nehalem+)
    if let Some(token) = X64V2Token::summon() {
        return process_v2(token, data);
    }

    process_scalar(data);
}

Note: #[cfg(feature = "avx512")] is a Cargo feature gate (a compile-time opt-in), not an architecture check. CPU detection still happens at runtime via summon().
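The consumer side is one line of Cargo metadata; a minimal sketch, assuming the feature lives in your own crate under the name avx512:

# Cargo.toml — hypothetical feature gating the AVX-512 tier at compile time
[features]
avx512 = []

Build with cargo build --features avx512 to compile the AVX-512 check in; without the feature, dispatch starts at the AVX2 tier.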

When to Use Manual Dispatch

Use manual dispatch when:

  • You have 2-3 tiers
  • You want explicit, readable control flow
  • Different tiers have different APIs (see the sketch after these lists)

Consider incant! when:

  • You have many tiers
  • All implementations have the same signature
  • You want automatic best-available selection
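That last manual-dispatch point deserves an example: the tiers don't have to share a shape at all. In this sketch (smooth and smooth_avx2 are hypothetical names), the SIMD tier is a token-taking function while the scalar tier is plain inline code—something a uniform-signature selector could not express.

pub fn smooth(data: &mut [f32]) {
    if let Some(token) = Desktop64::summon() {
        return smooth_avx2(token, data);
    }
    // The scalar tier is short enough to live inline: no matching
    // function signature required.
    for i in 1..data.len() {
        data[i] = 0.5 * (data[i] + data[i - 1]);
    }
}

#[arcane]
fn smooth_avx2(token: Desktop64, data: &mut [f32]) {
    // AVX2 implementation
}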

Avoiding Common Mistakes

Don't Dispatch in Hot Loops

// WRONG - CPUID every iteration
for chunk in data.chunks_mut(8) {
    if let Some(token) = Desktop64::summon() {
        process_chunk(token, chunk);
    }
}

// BETTER - hoist token outside loop
if let Some(token) = Desktop64::summon() {
    for chunk in data.chunks_mut(8) {
        process_chunk(token, chunk);  // But still has #[arcane] wrapper overhead
    }
} else {
    for chunk in data.chunks_mut(8) {
        process_chunk_scalar(chunk);
    }
}

// BEST - put the loop inside #[arcane], call #[rite] helpers
if let Some(token) = Desktop64::summon() {
    process_all_chunks(token, data);
} else {
    process_all_chunks_scalar(data);
}

#[arcane]
fn process_all_chunks(token: Desktop64, data: &mut [f32]) {
    for chunk in data.chunks_exact_mut(8) {
        process_chunk(token, chunk.try_into().unwrap());  // #[rite] inlines fully!
    }
}

#[rite]
fn process_chunk(_: Desktop64, chunk: &mut [f32; 8]) {
    // This inlines into process_all_chunks with zero overhead
}

The "BETTER" pattern still calls through an #[arcane] wrapper each iteration—an LLVM optimization barrier. The "BEST" pattern puts the loop inside #[arcane] and uses #[rite] for the inner work, so LLVM sees one optimization region for the entire loop.

Don't Forget Early Returns

// WRONG - falls through to scalar even when SIMD available
if let Some(token) = Desktop64::summon() {
    process_avx2(token, data);
    // Missing return!
}
process_scalar(data);  // Always runs!

// RIGHT
if let Some(token) = Desktop64::summon() {
    return process_avx2(token, data);
}
process_scalar(data);
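If the missing return has bitten you before, a match rules the fallthrough out by construction, since exactly one arm runs:

// ALSO RIGHT - exactly one arm executes, nothing to forget
match Desktop64::summon() {
    Some(token) => process_avx2(token, data),
    None => process_scalar(data),
}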