# Compile-Time vs Runtime

Understanding when feature detection happens—and how LLVM optimizes across feature boundaries—is crucial for writing correct and fast SIMD code.

## The Mechanisms

| Mechanism | When | What It Does |
|---|---|---|
| `#[cfg(target_arch = "...")]` | Compile | Include/exclude code from the binary |
| `#[cfg(target_feature = "...")]` | Compile | True only if the feature is in the target spec |
| `#[cfg(feature = "...")]` | Compile | Cargo feature flag |
| `#[target_feature(enable = "...")]` | Compile | Tell LLVM to use these instructions in this function |
| `-Ctarget-cpu=native` | Compile | LLVM assumes the current CPU's features globally |
| `Token::summon()` | Runtime | CPUID instruction; returns `Option<Token>` |
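
To make the split concrete, here is a minimal standard-library-only sketch of three of these mechanisms side by side (x86-64 assumed; the token API appears below):

```rust
// Compile-time: this item is simply absent from the binary unless AVX2 is
// part of the target spec (e.g. -Ctarget-cpu=haswell or -Ctarget-feature=+avx2).
#[cfg(target_feature = "avx2")]
fn only_compiled_for_avx2_targets() {}

// Compile-time: LLVM may emit AVX2 instructions inside this function,
// whatever the global target spec says; the caller must guarantee support.
#[target_feature(enable = "avx2")]
unsafe fn may_use_avx2_instructions() { /* ... */ }

// Runtime: asks the CPU the program is actually running on.
fn cpu_has_avx2() -> bool {
    is_x86_feature_detected!("avx2")
}
```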

## The Key Insight: `#[target_feature(enable)]`

This is the mechanism that makes SIMD work. It tells LLVM: "Inside this function, assume these CPU features are available."

```rust
use core::arch::x86_64::*;

#[target_feature(enable = "avx2,fma")]
unsafe fn process_avx2(data: &[f32; 8]) -> f32 {
    // LLVM generates AVX2 instructions here:
    // _mm256_* intrinsics compile to single instructions.
    let v = _mm256_loadu_ps(data.as_ptr());
    // ...
}
```

**Why `unsafe`?** The function uses AVX2 instructions, but the compiler can't verify that the caller checked for AVX2. If you call this on a CPU without AVX2, you get an illegal-instruction fault. The `unsafe` is the contract: "the caller must ensure CPU support."
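
Upholding that contract by hand, without tokens, looks roughly like this (a sketch using only the standard library; assume `process_avx2` above is completed to return an actual result):

```rust
fn process(data: &[f32; 8]) -> f32 {
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        // SAFETY: we just verified AVX2 and FMA on the running CPU.
        unsafe { process_avx2(data) }
    } else {
        // Scalar fallback for older CPUs.
        data.iter().sum()
    }
}
```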

This is what `#[arcane]` does for you:

```rust
// You write:
#[arcane]
fn process(token: Desktop64, data: &[f32; 8]) -> f32 { /* ... */ }

// The macro generates:
fn process(token: Desktop64, data: &[f32; 8]) -> f32 {
    #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
    #[inline]
    unsafe fn __inner(token: Desktop64, data: &[f32; 8]) -> f32 { /* ... */ }

    // SAFETY: the token's existence proves summon() succeeded.
    unsafe { __inner(token, data) }
}
```

The token proves the runtime check happened. The inner function gets LLVM's optimizations.
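
Calling it is then a one-time dispatch (a usage sketch, assuming the `process` signature above):

```rust
fn run(data: &[f32; 8]) -> f32 {
    match Desktop64::summon() {
        // The runtime check happens exactly once, here.
        Some(token) => process(token, data),
        // Scalar fallback when the CPU lacks the required features.
        None => data.iter().sum(),
    }
}
```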

## LLVM Optimization and Feature Boundaries

Archmage is never slower than equivalent `unsafe` code. When you use the right patterns (`#[rite]` helpers called from `#[arcane]`), the generated assembly is identical to hand-written `#[target_feature]` + `unsafe` code.

Here's why: LLVM's optimization passes respect `#[target_feature]` boundaries.

### Same Features = Full Optimization

When caller and callee have the same target features, LLVM can:

- Inline fully
- Propagate constants
- Eliminate redundant loads/stores
- Combine operations across the call boundary

```rust
#[arcane]
fn outer(token: Desktop64, data: &[f32; 8]) -> f32 {
    inner(token, data) * 2.0  // Inlines perfectly!
}

#[arcane]
fn inner(token: Desktop64, data: &[f32; 8]) -> f32 {
    let v = f32x8::from_array(token, *data);
    v.reduce_add()
}
```

Both functions carry `#[target_feature(enable = "avx2,fma,...")]`. LLVM sees one optimization region.

### Different Features = Optimization Boundary

When features differ, LLVM must be conservative:

```rust
#[arcane]
fn v4_caller(token: X64V4Token, data: &[f32; 8]) -> f32 {
    // token: X64V4Token → avx512f,avx512bw,...
    v3_helper(token, data)  // Different features!
}

#[arcane]
fn v3_helper(token: X64V3Token, data: &[f32; 8]) -> f32 {
    // token: X64V3Token → avx2,fma,...
    // Different target_feature set = optimization boundary
}
```

This still works (V4 is a superset of V3), but LLVM can't fully inline across the boundary because the `#[target_feature]` annotations differ.

### Generic Bounds = Optimization Boundary

Generics create the same problem:

```rust
#[arcane]
fn generic_impl<T: HasX64V2>(token: T, data: &[f32]) -> f32 {
    // LLVM doesn't know what T's features are at compile time.
    // It must generate conservative code that works for any HasX64V2.
}
```

The `#[target_feature]` set here can only reflect what the trait bound guarantees, not whatever concrete token you actually pass in, so every instantiation of this function is pinned to the `HasX64V2` baseline. This prevents inlining and vectorization across the boundary.

**Rule:** Use concrete tokens for hot paths.

## Downcasting vs Upcasting

### Downcasting: Free

Higher tokens can be used where lower tokens are expected:

```rust
#[arcane]
fn v4_kernel(token: X64V4Token, data: &[f32; 8]) -> f32 {
    // V4 → V3 is free: just passing the token, same LLVM features (superset)
    v3_sum(token, data)  // Desktop64 accepts X64V4Token
}

#[arcane]
fn v3_sum(token: Desktop64, data: &[f32; 8]) -> f32 {
    // ...
}
```

This works because `X64V4Token` has all the features of `Desktop64` plus more. The caller's LLVM target features are a superset of the callee's, so optimization flows freely.

### Upcasting: Safe but Creates a Boundary

Going the other direction requires `IntoConcreteToken`:

```rust
fn dispatch<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 {
    if let Some(v4) = token.as_x64v4() {
        v4_path(v4, data)  // Uses AVX-512 if available
    } else if let Some(v3) = token.as_x64v3() {
        v3_path(v3, data)  // Falls back to AVX2
    } else {
        scalar_path(data)
    }
}
```

This is safe: `as_x64v4()` returns `None` if the token doesn't support V4. But the dispatch is an optimization boundary, because the generic `T` only becomes a concrete type at the branch point.

**Don't upcast in hot loops.** Upcast once at your dispatch point, then pass concrete tokens through.
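
A sketch of that pattern, reusing the hypothetical `v3_path`/`scalar_path` helpers from the dispatch example above (tokens are passed by value repeatedly throughout these examples, so the sketch assumes they are `Copy`):

```rust
fn sum_chunks<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 {
    if let Some(v3) = token.as_x64v3() {
        // Upcast once; the hot loop only ever sees the concrete token.
        data.chunks(8).map(|chunk| v3_path(v3, chunk)).sum()
    } else {
        data.chunks(8).map(|chunk| scalar_path(chunk)).sum()
    }
}
```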

## The `#[rite]` Optimization

`#[rite]` exists to eliminate wrapper overhead for inner helpers:

```rust
// #[arcane] creates a wrapper:
fn entry(token: Desktop64, data: &[f32; 8]) -> f32 {
    #[target_feature(enable = "avx2,fma,...")]
    unsafe fn __inner(...) { ... }
    unsafe { __inner(...) }
}

// #[rite] is the function directly:
#[target_feature(enable = "avx2,fma,...")]
#[inline]
fn helper(token: Desktop64, data: &[f32; 8]) -> f32 { ... }
```

Since Rust 1.86, calling a safe `#[target_feature]` function from a context that enables the same features is allowed without `unsafe`. So `#[arcane]` can call `#[rite]` helpers without `unsafe`:

```rust
#[arcane]
fn entry(token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let prod = mul_vectors(token, a, b);  // Calls #[rite], no unsafe!
    horizontal_sum(token, prod)
}

#[rite]
fn mul_vectors(_: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> __m256 { ... }

#[rite]
fn horizontal_sum(_: Desktop64, v: __m256) -> f32 { ... }
```

All three functions share the same `#[target_feature]` annotation. LLVM sees one optimization region.
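
For concreteness, plausible bodies for those two helpers might look like the sketch below. This is not archmage's prescribed implementation: the intrinsics come from `core::arch::x86_64`, and the `#[rite]` and `Desktop64` imports are assumed to be in scope.

```rust
use core::arch::x86_64::*;

#[rite]
fn mul_vectors(_: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> __m256 {
    // Inside the shared #[target_feature] region these compile to
    // a couple of AVX instructions.
    unsafe {
        let va = _mm256_loadu_ps(a.as_ptr());
        let vb = _mm256_loadu_ps(b.as_ptr());
        _mm256_mul_ps(va, vb)
    }
}

#[rite]
fn horizontal_sum(_: Desktop64, v: __m256) -> f32 {
    // Simple reduction for the sketch: spill to a stack array and sum the lanes.
    let mut out = [0.0f32; 8];
    unsafe { _mm256_storeu_ps(out.as_mut_ptr(), v) };
    out.iter().sum()
}
```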

## When Detection Compiles Away

With `-Ctarget-cpu=native` or `-Ctarget-cpu=haswell`:

```rust
// When compiled with -Ctarget-cpu=haswell:
// - #[cfg(target_feature = "avx2")] is true
// - X64V3Token::guaranteed() returns Some(true)
// - summon() becomes a no-op
// - LLVM eliminates the branch entirely

if let Some(token) = X64V3Token::summon() {
    // This branch is the only code generated
}
```

Check programmatically:

```rust
match X64V3Token::guaranteed() {
    Some(true) => println!("Compile-time guaranteed"),
    Some(false) => println!("Wrong architecture"),
    None => println!("Runtime check needed"),
}
```

## Common Mistakes

### Mistake 1: Generic Bounds in Hot Paths

```rust
// SLOW: the generic bound creates an optimization boundary
fn process<T: HasX64V2>(token: T, data: &[f32]) -> f32 {
    // Called millions of times, can't inline properly
}

// FAST: concrete token, full optimization
fn process(token: Desktop64, data: &[f32]) -> f32 {
    // LLVM knows the exact features and inlines everything
}
```

### Mistake 2: Assuming `#[cfg(target_feature)]` Is Runtime

```rust
// WRONG: this is compile-time, not runtime!
#[cfg(target_feature = "avx2")]
fn maybe_avx2() {
    // This function doesn't exist unless AVX2 is a compile-time target
    // feature (e.g. -Ctarget-cpu=haswell or -Ctarget-feature=+avx2)
}

// RIGHT: use tokens for runtime detection
fn maybe_avx2() {
    if let Some(token) = Desktop64::summon() {
        avx2_impl(token);
    }
}
```

### Mistake 3: Summoning in Hot Loops

```rust
// WRONG: runs feature detection in every iteration
for chunk in data.chunks(8) {
    if let Some(token) = Desktop64::summon() {  // Don't!
        process(token, chunk);
    }
}

// RIGHT: summon once, pass the token through
if let Some(token) = Desktop64::summon() {
    for chunk in data.chunks(8) {
        process(token, chunk);
    }
}
```

### How fast is `summon()` anyway?

Archmage caches detection results in a static atomic, so repeated `summon()` calls after the first are essentially a single atomic load (~1.2 ns). The first call does the actual feature detection.
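
The exact internals are archmage's own, but the caching idea is roughly this pattern (a sketch, not the real implementation):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// 0 = not yet checked, 1 = supported, 2 = unsupported.
static AVX2_STATE: AtomicU8 = AtomicU8::new(0);

fn avx2_supported_cached() -> bool {
    match AVX2_STATE.load(Ordering::Relaxed) {
        1 => true,
        2 => false,
        _ => {
            // First call: do the real detection, then remember the answer.
            let yes = is_x86_feature_detected!("avx2");
            AVX2_STATE.store(if yes { 1 } else { 2 }, Ordering::Relaxed);
            yes
        }
    }
}
```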

| Operation | Time |
|---|---|
| `Desktop64::summon()` (cached) | ~1.3 ns |
| First call (actual detection) | ~2.6 ns |
| With `-Ctarget-cpu=haswell` | 0 ns (compiles to `Some(token)`) |

The caching makes `summon()` fast enough that even calling it frequently won't hurt performance. But the reason to hoist summons isn't the cost of `summon()` itself; it's keeping the dispatch decision outside your hot loop so LLVM can optimize the inner code as a single region.

### Mistake 4: Using `#[cfg(target_arch)]` Unnecessarily

```rust
// UNNECESSARY: tokens exist on every architecture
#[cfg(target_arch = "x86_64")]
{
    if let Some(token) = Desktop64::summon() { ... }
}

// CLEANER: just use the token
if let Some(token) = Desktop64::summon() {
    // Returns None on non-x86, compiles everywhere
}
```

## Summary

| Question | Answer |
|---|---|
| "Does this code exist in the binary?" | `#[cfg(...)]`, decided at compile time |
| "Can this CPU run AVX2?" | `Token::summon()`, checked at runtime |
| "What instructions can LLVM use here?" | `#[target_feature(enable)]`, per function |
| "Is a runtime check needed?" | `Token::guaranteed()` tells you |
| "Will these functions inline together?" | Same target features + concrete types = yes |
| "Do generic bounds hurt performance?" | Yes, they create optimization boundaries |
| "Is downcasting (V4 → V3) free?" | Yes, the features are a superset |
| "Is upcasting safe?" | Yes, but it creates an optimization boundary |