Token Hoisting

This is the most important performance rule in archmage.

Summon tokens once at your API boundary. Pass them through the call chain. Never summon in hot loops.

The Problem: 42% Performance Regression

use archmage::{X64V3Token, SimdToken};

// WRONG: Summoning in inner function
fn distance(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    if let Some(token) = X64V3Token::summon() {  // CPUID every call!
        distance_simd(token, a, b)
    } else {
        distance_scalar(a, b)
    }
}

fn find_closest(points: &[[f32; 8]], query: &[f32; 8]) -> usize {
    let mut best_idx = 0;
    let mut best_dist = f32::MAX;

    for (i, point) in points.iter().enumerate() {
        let d = distance(point, query);  // summon() called N times!
        if d < best_dist {
            best_dist = d;
            best_idx = i;
        }
    }
    best_idx
}

This is 42% slower than hoisting the token. CPUID is not free.

The Solution: Hoist to API Boundary

use archmage::{X64V3Token, SimdToken, arcane};
use magetypes::simd::f32x8;

// RIGHT: Summon once, pass through
fn find_closest(points: &[[f32; 8]], query: &[f32; 8]) -> usize {
    // Summon ONCE at entry
    if let Some(token) = X64V3Token::summon() {
        find_closest_simd(token, points, query)
    } else {
        find_closest_scalar(points, query)
    }
}

#[arcane]
fn find_closest_simd(token: X64V3Token, points: &[[f32; 8]], query: &[f32; 8]) -> usize {
    let mut best_idx = 0;
    let mut best_dist = f32::MAX;

    for (i, point) in points.iter().enumerate() {
        let d = distance_simd(token, point, query);  // Token passed, no summon!
        if d < best_dist {
            best_dist = d;
            best_idx = i;
        }
    }
    best_idx
}

#[arcane]
fn distance_simd(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    // Just use the token - no detection here
    let va = f32x8::from_array(token, *a);
    let vb = f32x8::from_array(token, *b);
    let diff = va - vb;
    (diff * diff).reduce_add().sqrt()
}
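The dispatcher above calls find_closest_scalar, which this page does not define. A minimal sketch of what that fallback might look like, using only the standard library; both function bodies below are assumptions written for illustration, not code from archmage:

```rust
/// Hypothetical scalar fallback: Euclidean distance between two 8-element points.
fn distance_scalar(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Hypothetical scalar fallback with the same shape as find_closest_simd:
/// returns the index of the point nearest to `query` (0 for an empty slice).
fn find_closest_scalar(points: &[[f32; 8]], query: &[f32; 8]) -> usize {
    let mut best_idx = 0;
    let mut best_dist = f32::MAX;

    for (i, point) in points.iter().enumerate() {
        let d = distance_scalar(point, query);
        if d < best_dist {
            best_dist = d;
            best_idx = i;
        }
    }
    best_idx
}
```

The scalar path takes no token at all: it needs no CPU feature, so there is nothing to prove.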

The Rule

┌─────────────────────────────────────────────────────────┐
│  Public API boundary                                     │
│  ┌─────────────────────────────────────────────────────┐│
│  │  if let Some(token) = Token::summon() {             ││
│  │      // ONLY place summon() is called               ││
│  │      internal_impl(token, ...);                     ││
│  │  }                                                  ││
│  └─────────────────────────────────────────────────────┘│
│                          │                               │
│                          ▼                               │
│  ┌─────────────────────────────────────────────────────┐│
│  │  #[arcane]                                          ││
│  │  fn internal_impl(token: Token, ...) {              ││
│  │      helper(token, ...);  // Pass token through     ││
│  │  }                                                  ││
│  └─────────────────────────────────────────────────────┘│
│                          │                               │
│                          ▼                               │
│  ┌─────────────────────────────────────────────────────┐│
│  │  #[arcane]                                          ││
│  │  fn helper(token: Token, ...) {                     ││
│  │      // Use token, never summon                     ││
│  │  }                                                  ││
│  └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘
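The three layers in the diagram can be checked mechanically by counting detection calls. The sketch below uses only the standard library; DemoToken, SUMMON_CALLS, and the sum_squares functions are hypothetical stand-ins invented for this example, and the fake summon() always succeeds instead of inspecting the CPU:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts how many times the stand-in detection routine has run.
static SUMMON_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for an archmage token; not a real archmage type.
#[derive(Clone, Copy)]
struct DemoToken(());

impl DemoToken {
    /// Stand-in for summon(): records the call, then pretends detection succeeded.
    fn summon() -> Option<DemoToken> {
        SUMMON_CALLS.fetch_add(1, Ordering::Relaxed);
        Some(DemoToken(()))
    }
}

/// Helper layer: takes the token as proof, never summons.
fn square(_token: DemoToken, x: f32) -> f32 {
    x * x
}

/// Internal layer: passes the token through to the helper.
fn sum_squares_impl(token: DemoToken, xs: &[f32]) -> f32 {
    xs.iter().map(|&x| square(token, x)).sum::<f32>()
}

/// Public API boundary: the only place summon() is called.
fn sum_squares(xs: &[f32]) -> f32 {
    match DemoToken::summon() {
        Some(token) => sum_squares_impl(token, xs),
        None => xs.iter().map(|&x| x * x).sum::<f32>(),
    }
}

/// Runs the hoisted API once and returns (result, summon calls made during the run).
fn demo() -> (f32, usize) {
    let before = SUMMON_CALLS.load(Ordering::Relaxed);
    let total = sum_squares(&[1.0, 2.0, 3.0]);
    (total, SUMMON_CALLS.load(Ordering::Relaxed) - before)
}
```

However many elements the slice holds, the counter advances by one per call to sum_squares, because only the boundary function summons.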

Why Tokens Are Zero-Cost to Pass

use archmage::X64V3Token;

// Passing a token costs nothing at runtime
fn f(_token: X64V3Token) {}  // No actual parameter in compiled code

fn main() {
    // Tokens are zero-sized
    assert_eq!(std::mem::size_of::<X64V3Token>(), 0);
}

The token exists only at compile time to prove you did the check. At runtime, it's completely erased.
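To see why a zero-sized token adds nothing to the data it travels with, here is a small standard-library sketch. ProofToken and Kernel are hypothetical types made up for this example and are not how archmage defines its tokens:

```rust
use std::mem::size_of;

/// Hypothetical zero-sized proof token; not archmage's actual definition.
/// The private unit field means code outside this module cannot construct
/// one directly.
#[derive(Clone, Copy, Debug)]
pub struct ProofToken(());

/// A struct that carries a token alongside real data.
pub struct Kernel {
    pub token: ProofToken,
    pub scale: f32,
}

/// Returns (size of the token, size of Kernel, size of a bare f32).
pub fn sizes() -> (usize, usize, usize) {
    (size_of::<ProofToken>(), size_of::<Kernel>(), size_of::<f32>())
}
```

Under these assumptions the token reports a size of 0, and Kernel is exactly as large as its f32 field, so storing or passing the token adds no bytes.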

When `-Ctarget-cpu=native` Helps

With compile-time feature guarantees, summon() becomes a no-op:

RUSTFLAGS="-Ctarget-cpu=native" cargo build --release

Now X64V3Token::summon() compiles to:

// Effectively becomes:
fn summon() -> Option<X64V3Token> {
    Some(X64V3Token)  // No CPUID, unconditional
}

Even then, always hoist. It is good practice, and it keeps your code fast when it is compiled without `-Ctarget-cpu`, where summon() goes back to runtime detection.
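One way such a compile-time short-circuit could be written with the standard library alone is sketched below. DemoAvx2Token is a hypothetical type, the check covers only AVX2 (not the full X64V3 feature set), and nothing here is taken from archmage's source:

```rust
/// Hypothetical token for AVX2; not an archmage type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct DemoAvx2Token(());

impl DemoAvx2Token {
    /// Sketch of a summon that short-circuits when the feature is guaranteed
    /// at compile time and falls back to runtime detection otherwise.
    pub fn summon() -> Option<Self> {
        if cfg!(target_feature = "avx2") {
            // The build already targets AVX2 (for example via -Ctarget-cpu=native
            // on an AVX2 machine), so this branch is a compile-time constant.
            return Some(DemoAvx2Token(()));
        }
        Self::detect_at_runtime()
    }

    /// Runtime detection on x86 targets, using the standard library macro.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    fn detect_at_runtime() -> Option<Self> {
        if std::is_x86_feature_detected!("avx2") {
            Some(DemoAvx2Token(()))
        } else {
            None
        }
    }

    /// Other architectures never have AVX2.
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    fn detect_at_runtime() -> Option<Self> {
        None
    }
}
```

The caller's code is identical in both builds: it still summons once at the boundary and passes the token down, which is why hoisting stays the right habit either way.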

Summary

| Pattern | Performance | Correctness |
|---|---|---|
| `summon()` in hot loop | 42% slower | Works |
| `summon()` at API boundary | Optimal | Works |
| `summon()` with `-Ctarget-cpu` | Optimal | Works |
