Archmage & Magetypes
Safely invoke your intrinsic power, using the tokens granted to you by the CPU. Cast primitive magics faster than any mage alive.
Archmage makes SIMD programming in Rust safe and ergonomic. Instead of scattering unsafe blocks throughout your code, you prove CPU feature availability once with a capability token, then write safe code that the compiler optimizes into raw SIMD instructions.
Magetypes provides SIMD vector types (f32x8, i32x4, etc.) with natural Rust operators that integrate with archmage tokens.
Zero Overhead
Archmage is never slower than equivalent unsafe code. The safety abstractions exist only at compile time. At runtime, you get the exact same assembly as hand-written #[target_feature] + unsafe code.
Benchmark: 1000 iterations of 8-float vector operations
Manual unsafe code: 570 ns
#[rite] in #[arcane]: 572 ns ← identical
#[arcane] in loop: 2320 ns ← wrong pattern (see below)
The key is using the right pattern: put loops inside #[arcane], use #[rite] for helpers. See Token Hoisting and The #[rite] Macro.
The Problem
Raw SIMD in Rust requires unsafe:
use std::arch::x86_64::*;

// Every. Single. Call.
unsafe {
    let a = _mm256_loadu_ps(data.as_ptr());
    let b = _mm256_set1_ps(2.0);
    let c = _mm256_mul_ps(a, b);
    _mm256_storeu_ps(out.as_mut_ptr(), c);
}
This is tedious and error-prone. Miss a feature check? Undefined behavior on older CPUs.
The Solution
Archmage separates proof of capability from use of capability:
use archmage::prelude::*;

#[arcane]
fn multiply(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // Safe! Token proves AVX2+FMA, safe_unaligned_simd takes references
    let a = _mm256_loadu_ps(data);
    let b = _mm256_set1_ps(2.0);
    let c = _mm256_mul_ps(a, b);
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(&mut out, c);
    out
}

fn main() {
    // Runtime check happens ONCE here
    if let Some(token) = Desktop64::summon() {
        let result = multiply(token, &[1.0; 8]);
        println!("{:?}", result);
    }
}
Key Concepts
- Tokens are zero-sized proof types. Desktop64::summon() returns Some(token) only if the CPU supports AVX2+FMA.
- #[arcane] generates a #[target_feature] inner function. Inside, SIMD intrinsics are safe.
- Token hoisting: call summon() once at your API boundary, pass the token through. Don't summon in hot loops.
Supported Platforms
| Platform | Tokens | Register Width |
|---|---|---|
| x86-64 | X64V2Token, X64V3Token/Desktop64, X64V4Token/Server64 | 128-512 bit |
| AArch64 | NeonToken/Arm64, NeonAesToken, NeonSha3Token | 128 bit |
| WASM | Simd128Token | 128 bit |
Next Steps
- Installation — Add archmage to your project
- Your First SIMD Function — Write real SIMD code
- Understanding Tokens — Learn the token system
Installation
Add archmage to your Cargo.toml:
[dependencies]
archmage = "0.4"
For SIMD vector types with natural operators, also add magetypes:
[dependencies]
archmage = "0.4"
magetypes = "0.4"
Feature Flags
archmage
| Feature | Default | Description |
|---|---|---|
std | ✓ | Standard library support |
macros | ✓ | #[arcane], incant!, etc. |
avx512 | ✗ | AVX-512 token support |
magetypes
| Feature | Default | Description |
|---|---|---|
std | ✓ | Standard library support |
avx512 | ✗ | 512-bit types (f32x16, etc.) |
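Non-default flags are opted into from Cargo.toml, in the same way as the dependency blocks above. A sketch (pin versions to match your project) that enables AVX-512 support in both crates:

[dependencies]
archmage = { version = "0.4", features = ["avx512"] }
magetypes = { version = "0.4", features = ["avx512"] }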
Platform Requirements
x86-64
Works out of the box. Tokens detect CPU features at runtime.
For compile-time optimization on known hardware:
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release
AArch64
NEON is baseline on 64-bit ARM—NeonToken::summon() always succeeds.
WASM
Enable SIMD128 in your build:
RUSTFLAGS="-Ctarget-feature=+simd128" cargo build --target wasm32-unknown-unknown
Verify Installation
use archmage::{SimdToken, Desktop64, Arm64};

fn main() {
    // Tokens compile everywhere - summon() returns None on unsupported platforms
    match Desktop64::summon() {
        Some(token) => println!("{} available!", token.name()),
        None => println!("No AVX2+FMA"),
    }
    match Arm64::summon() {
        Some(token) => println!("{} available!", token.name()),
        None => println!("No NEON"),
    }
}
Run it:
cargo run
On x86-64 (Haswell+/Zen+): "X64V3 available!" and "No NEON". On AArch64: "No AVX2+FMA" and "Neon available!".
Your First SIMD Function
Let's write a function that squares 8 floats in parallel using AVX2.
The Recommended Way
Use archmage::prelude::* which includes safe_unaligned_simd for memory operations:
use archmage::prelude::*;

#[arcane]
fn square_f32x8(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // safe_unaligned_simd takes references - fully safe!
    let v = _mm256_loadu_ps(data);
    let squared = _mm256_mul_ps(v, v);
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(&mut out, squared);
    out
}

fn main() {
    if let Some(token) = Desktop64::summon() {
        let input = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
        let output = square_f32x8(token, &input);
        println!("{:?}", output); // [1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0]
    } else {
        println!("AVX2 not available");
    }
}
Using magetypes
For the most ergonomic experience, use magetypes' vector types:
use archmage::{Desktop64, SimdToken};
use magetypes::simd::f32x8;

fn square_f32x8(token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    let v = f32x8::from_array(token, *data);
    let squared = v * v; // Natural operator!
    squared.to_array()
}

fn main() {
    if let Some(token) = Desktop64::summon() {
        let input = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
        let output = square_f32x8(token, &input);
        println!("{:?}", output);
    }
}
What #[arcane] Does
The macro transforms your function:
// You write:
#[arcane]
fn square(token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    // body
}

// Macro generates:
fn square(token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
    #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
    #[inline]
    unsafe fn __inner(token: Desktop64, data: &[f32; 8]) -> [f32; 8] {
        // body - intrinsics are safe here!
    }
    // SAFETY: token proves CPU support
    unsafe { __inner(token, data) }
}
The token parameter proves you checked CPU features. The macro enables those features for the inner function, making intrinsics safe to call.
Key Points
- Desktop64 = AVX2 + FMA + BMI1 + BMI2 (Haswell 2013+, Zen 1+)
- summon() does runtime CPU detection
- #[arcane] makes intrinsics safe inside the function
- Token is zero-sized — no runtime overhead passing it around
Understanding Tokens
Tokens are the core of archmage's safety model. They're zero-sized proof types that demonstrate CPU feature availability.
The Token Hierarchy
x86-64 Tokens
| Token | Alias | Features | CPUs |
|---|---|---|---|
X64V2Token | — | SSE4.2, POPCNT | Nehalem 2008+ |
X64V3Token | Desktop64, Avx2FmaToken | + AVX2, FMA, BMI1, BMI2 | Haswell 2013+, Zen 1+ |
X64V4Token | Server64, Avx512Token | + AVX-512 F/BW/CD/DQ/VL | Skylake-X 2017+, Zen 4+ |
Avx512ModernToken | — | + VNNI, VBMI, etc. | Ice Lake 2019+, Zen 4+ |
AArch64 Tokens
| Token | Alias | Features |
|---|---|---|
NeonToken | Arm64 | NEON (baseline, always available) |
NeonAesToken | — | + AES instructions |
NeonSha3Token | — | + SHA3 instructions |
NeonCrcToken | — | + CRC instructions |
WASM Token
| Token | Features |
|---|---|
Simd128Token | WASM SIMD128 |
Summoning Tokens
#![allow(unused)] fn main() { use archmage::{Desktop64, SimdToken}; // Runtime detection if let Some(token) = Desktop64::summon() { // CPU has AVX2+FMA process_simd(token, data); } else { // Fallback process_scalar(data); } }
Compile-Time Guarantees
Check if detection is needed:
#![allow(unused)] fn main() { use archmage::{Desktop64, SimdToken}; match Desktop64::guaranteed() { Some(true) => { // Compiled with -Ctarget-cpu=haswell or higher // summon() will always succeed, check is elided let token = Desktop64::summon().unwrap(); } Some(false) => { // Wrong architecture (e.g., running on ARM) // summon() will always return None } None => { // Runtime check needed if let Some(token) = Desktop64::summon() { // ... } } } }
ScalarToken: The Fallback
ScalarToken always succeeds—it's for fallback paths:
#![allow(unused)] fn main() { use archmage::{ScalarToken, SimdToken}; // Always works let token = ScalarToken::summon().unwrap(); // Or just construct it directly (it's a unit struct) let token = ScalarToken; }
Token Properties
Tokens are:
- Zero-sized: No runtime cost to pass around
- Copy + Clone: Pass by value freely
- Send + Sync: Safe to share across threads
- 'static: Can be stored in static variables
#![allow(unused)] fn main() { // Zero-sized assert_eq!(std::mem::size_of::<Desktop64>(), 0); // Copy fn takes_token(token: Desktop64) { let copy = token; // No move, just copy use_both(token, copy); } }
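Because tokens are also Send + Sync and 'static, a summoned token can be cached in a static and handed to worker threads. A minimal sketch using std's OnceLock (archmage already caches detection internally, so this is about ergonomics rather than speed):

use std::sync::OnceLock;
use archmage::{Desktop64, SimdToken};

static TOKEN: OnceLock<Option<Desktop64>> = OnceLock::new();

fn cached_token() -> Option<Desktop64> {
    // Option<Desktop64> is Copy, so callers get their own copy
    *TOKEN.get_or_init(|| Desktop64::summon())
}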
Downcasting Tokens
Higher tokens can be used where lower ones are expected:
#![allow(unused)] fn main() { #[arcane] fn needs_v3(token: X64V3Token, data: &[f32]) { /* ... */ } if let Some(v4) = X64V4Token::summon() { // V4 is a superset of V3 — this works and inlines needs_v3(v4, &data); } }
V4 includes all V3 features, so the token is valid proof.
Trait Bounds
For generic code, use tier traits:
#![allow(unused)] fn main() { use archmage::HasX64V2; fn process<T: HasX64V2>(token: T, data: &[f32]) { // Works with X64V2Token, X64V3Token, X64V4Token, etc. } }
Available traits:
- HasX64V2 — SSE4.2 tier
- HasX64V4 — AVX-512 tier (requires the avx512 feature)
- HasNeon — NEON baseline
- HasNeonAes, HasNeonSha3 — NEON extensions
Compile-Time vs Runtime
Understanding when feature detection happens—and how LLVM optimizes across feature boundaries—is crucial for writing correct and fast SIMD code.
The Mechanisms
| Mechanism | When | What It Does |
|---|---|---|
#[cfg(target_arch = "...")] | Compile | Include/exclude code from binary |
#[cfg(target_feature = "...")] | Compile | True only if feature is in target spec |
#[cfg(feature = "...")] | Compile | Cargo feature flag |
#[target_feature(enable = "...")] | Compile | Tell LLVM to use these instructions in this function |
-Ctarget-cpu=native | Compile | LLVM assumes current CPU's features globally |
Token::summon() | Runtime | CPUID instruction, returns Option<Token> |
The Key Insight: #[target_feature(enable)]
This is the mechanism that makes SIMD work. It tells LLVM: "Inside this function, assume these CPU features are available."
#![allow(unused)] fn main() { #[target_feature(enable = "avx2,fma")] unsafe fn process_avx2(data: &[f32; 8]) -> f32 { // LLVM generates AVX2 instructions here // _mm256_* intrinsics compile to single instructions let v = _mm256_loadu_ps(data.as_ptr()); // ... } }
Why unsafe? The function uses AVX2 instructions, but LLVM doesn't verify the caller checked for AVX2. If you call this on a CPU without AVX2, you get an illegal instruction fault. The unsafe is the contract: "caller must ensure CPU support."
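Upheld by hand, that contract looks something like the sketch below, using the standard library's runtime detection macro. This is exactly the check-then-unsafe boilerplate the token pattern replaces:

fn process(data: &[f32; 8]) -> f32 {
    if std::arch::is_x86_feature_detected!("avx2")
        && std::arch::is_x86_feature_detected!("fma")
    {
        // SAFETY: we just verified AVX2 and FMA are available on this CPU
        unsafe { process_avx2(data) }
    } else {
        data.iter().sum() // scalar fallback
    }
}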
This is what #[arcane] does for you:
// You write:
#[arcane]
fn process(token: Desktop64, data: &[f32; 8]) -> f32 { /* ... */ }

// Macro generates:
fn process(token: Desktop64, data: &[f32; 8]) -> f32 {
    #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
    #[inline]
    unsafe fn __inner(token: Desktop64, data: &[f32; 8]) -> f32 { /* ... */ }

    // SAFETY: Token existence proves summon() succeeded
    unsafe { __inner(token, data) }
}
The token proves the runtime check happened. The inner function gets LLVM's optimizations.
LLVM Optimization and Feature Boundaries
Archmage is never slower than equivalent unsafe code. When you use the right patterns (#[rite] helpers called from #[arcane]), the generated assembly is identical to hand-written #[target_feature] + unsafe code.
Here's why: LLVM's optimization passes respect #[target_feature] boundaries.
Same Features = Full Optimization
When caller and callee have the same target features, LLVM can:
- Inline fully
- Propagate constants
- Eliminate redundant loads/stores
- Combine operations across the call boundary
#![allow(unused)] fn main() { #[arcane] fn outer(token: Desktop64, data: &[f32; 8]) -> f32 { inner(token, data) * 2.0 // Inlines perfectly! } #[arcane] fn inner(token: Desktop64, data: &[f32; 8]) -> f32 { let v = f32x8::from_array(token, *data); v.reduce_add() } }
Both functions have #[target_feature(enable = "avx2,fma,...")]. LLVM sees one optimization region.
Different Features = Optimization Boundary
When features differ, LLVM must be conservative:
#![allow(unused)] fn main() { #[arcane] fn v4_caller(token: X64V4Token, data: &[f32; 8]) -> f32 { // token: X64V4Token → avx512f,avx512bw,... v3_helper(token, data) // Different features! } #[arcane] fn v3_helper(token: X64V3Token, data: &[f32; 8]) -> f32 { // token: X64V3Token → avx2,fma,... // Different target_feature set = optimization boundary } }
This still works—V4 is a superset of V3—but LLVM can't fully inline across the boundary because the #[target_feature] annotations differ.
Generic Bounds = Optimization Boundary
Generics create the same problem:
#![allow(unused)] fn main() { #[arcane] fn generic_impl<T: HasX64V2>(token: T, data: &[f32]) -> f32 { // LLVM doesn't know what T's features are at compile time // Must generate conservative code that works for any HasX64V2 } }
The #[target_feature] set for this function is fixed by the trait bound (the SSE4.2 tier that HasX64V2 guarantees), not by the concrete token the caller passes, so even after monomorphization LLVM must emit code that is valid for the weakest matching CPU. This prevents inlining and vectorization across the boundary.
Rule: Use concrete tokens for hot paths.
Downcasting vs Upcasting
Downcasting: Free
Higher tokens can be used where lower tokens are expected:
#![allow(unused)] fn main() { #[arcane] fn v4_kernel(token: X64V4Token, data: &[f32; 8]) -> f32 { // V4 → V3 is free: just passing token, same LLVM features (superset) v3_sum(token, data) // Desktop64 accepts X64V4Token } #[arcane] fn v3_sum(token: Desktop64, data: &[f32; 8]) -> f32 { // ... } }
This works because X64V4Token has all the features of Desktop64 plus more. LLVM's target features are a superset, so optimization flows freely.
Upcasting: Safe but Creates Boundary
Going the other direction requires IntoConcreteToken:
#![allow(unused)] fn main() { fn dispatch<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 { if let Some(v4) = token.as_x64v4() { v4_path(v4, data) // Uses AVX-512 if available } else if let Some(v3) = token.as_x64v3() { v3_path(v3, data) // Falls back to AVX2 } else { scalar_path(data) } } }
This is safe—as_x64v4() returns None if the token doesn't support V4. But it creates an optimization boundary because the generic T becomes a concrete type at the branch point.
Don't upcast in hot loops. Upcast once at your dispatch point, then pass concrete tokens through.
The #[rite] Optimization
#[rite] exists to eliminate wrapper overhead for inner helpers:
#![allow(unused)] fn main() { // #[arcane] creates a wrapper: fn entry(token: Desktop64, data: &[f32; 8]) -> f32 { #[target_feature(enable = "avx2,fma,...")] unsafe fn __inner(...) { ... } unsafe { __inner(...) } } // #[rite] is the function directly: #[target_feature(enable = "avx2,fma,...")] #[inline] fn helper(token: Desktop64, data: &[f32; 8]) -> f32 { ... } }
Since Rust 1.85+, calling a #[target_feature] function from a matching context is safe. So #[arcane] can call #[rite] helpers without unsafe:
#![allow(unused)] fn main() { #[arcane] fn entry(token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> f32 { let prod = mul_vectors(token, a, b); // Calls #[rite], no unsafe! horizontal_sum(token, prod) } #[rite] fn mul_vectors(_: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> __m256 { ... } #[rite] fn horizontal_sum(_: Desktop64, v: __m256) -> f32 { ... } }
All three functions share the same #[target_feature] annotation. LLVM sees one optimization region.
When Detection Compiles Away
With -Ctarget-cpu=native or -Ctarget-cpu=haswell:
#![allow(unused)] fn main() { // When compiled with -Ctarget-cpu=haswell: // - #[cfg(target_feature = "avx2")] is TRUE // - X64V3Token::guaranteed() returns Some(true) // - summon() becomes a no-op // - LLVM eliminates the branch entirely if let Some(token) = X64V3Token::summon() { // This branch is the only code generated } }
Check programmatically:
#![allow(unused)] fn main() { match X64V3Token::guaranteed() { Some(true) => println!("Compile-time guaranteed"), Some(false) => println!("Wrong architecture"), None => println!("Runtime check needed"), } }
Common Mistakes
Mistake 1: Generic Bounds in Hot Paths
// SLOW: Generic creates optimization boundary
fn process<T: HasX64V2>(token: T, data: &[f32]) -> f32 {
    // Called millions of times, can't inline properly
}

// FAST: Concrete token, full optimization
fn process(token: Desktop64, data: &[f32]) -> f32 {
    // LLVM knows exact features, inlines everything
}
Mistake 2: Assuming #[cfg(target_feature)] Is Runtime
// WRONG: This is compile-time, not runtime!
#[cfg(target_feature = "avx2")]
fn maybe_avx2() {
    // This function doesn't exist unless compiled with -Ctarget-cpu=haswell
}

// RIGHT: Use tokens for runtime detection
fn maybe_avx2() {
    if let Some(token) = Desktop64::summon() {
        avx2_impl(token);
    }
}
Mistake 3: Summoning in Hot Loops
// WRONG: CPUID in every iteration
for chunk in data.chunks(8) {
    if let Some(token) = Desktop64::summon() { // Don't!
        process(token, chunk);
    }
}

// RIGHT: Summon once, pass through
if let Some(token) = Desktop64::summon() {
    for chunk in data.chunks(8) {
        process(token, chunk);
    }
}
How fast is summon() anyway?
Archmage caches detection results in a static atomic, so repeated summon() calls after the first are essentially a single atomic load (~1.2 ns). The first call does the actual feature detection.
| Operation | Time |
|---|---|
Desktop64::summon() (cached) | ~1.3 ns |
| First call (actual detection) | ~2.6 ns |
With -Ctarget-cpu=haswell | 0 ns (compiles to Some(token)) |
The caching makes summon() fast enough that even calling it frequently won't hurt performance. But the reason to hoist summons isn't performance of summon(), it's keeping the dispatch decision outside your hot loop so LLVM can optimize the inner code.
Mistake 4: Using #[cfg(target_arch)] Unnecessarily
// UNNECESSARY: Tokens exist everywhere
#[cfg(target_arch = "x86_64")]
{
    if let Some(token) = Desktop64::summon() { ... }
}

// CLEANER: Just use the token
if let Some(token) = Desktop64::summon() {
    // Returns None on non-x86, compiles everywhere
}
Summary
| Question | Answer |
|---|---|
| "Does this code exist in the binary?" | #[cfg(...)] — compile-time |
| "Can this CPU run AVX2?" | Token::summon() — runtime |
| "What instructions can LLVM use here?" | #[target_feature(enable)] — per-function |
| "Is runtime check needed?" | Token::guaranteed() — tells you |
| "Will these functions inline together?" | Same target features + concrete types = yes |
| "Do generic bounds hurt performance?" | Yes, they create optimization boundaries |
| "Is downcasting (V4→V3) free?" | Yes, features are superset |
| "Is upcasting safe?" | Yes, but creates optimization boundary |
Token Hoisting
This is the most important performance rule in archmage.
Summon tokens once at your API boundary. Pass them through the call chain. Never summon in hot loops.
The Problem: 42% Performance Regression
// WRONG: Summoning in inner function
fn distance(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    if let Some(token) = X64V3Token::summon() { // CPUID every call!
        distance_simd(token, a, b)
    } else {
        distance_scalar(a, b)
    }
}

fn find_closest(points: &[[f32; 8]], query: &[f32; 8]) -> usize {
    let mut best_idx = 0;
    let mut best_dist = f32::MAX;
    for (i, point) in points.iter().enumerate() {
        let d = distance(point, query); // summon() called N times!
        if d < best_dist {
            best_dist = d;
            best_idx = i;
        }
    }
    best_idx
}
This is 42% slower than hoisting the token. CPUID is not free.
The Solution: Hoist to API Boundary
use archmage::{X64V3Token, SimdToken, arcane};
use magetypes::simd::f32x8;

// RIGHT: Summon once, pass through
fn find_closest(points: &[[f32; 8]], query: &[f32; 8]) -> usize {
    // Summon ONCE at entry
    if let Some(token) = X64V3Token::summon() {
        find_closest_simd(token, points, query)
    } else {
        find_closest_scalar(points, query)
    }
}

#[arcane]
fn find_closest_simd(token: X64V3Token, points: &[[f32; 8]], query: &[f32; 8]) -> usize {
    let mut best_idx = 0;
    let mut best_dist = f32::MAX;
    for (i, point) in points.iter().enumerate() {
        let d = distance_simd(token, point, query); // Token passed, no summon!
        if d < best_dist {
            best_dist = d;
            best_idx = i;
        }
    }
    best_idx
}

#[arcane]
fn distance_simd(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    // Just use the token - no detection here
    let va = f32x8::from_array(token, *a);
    let vb = f32x8::from_array(token, *b);
    let diff = va - vb;
    (diff * diff).reduce_add().sqrt()
}
The Rule
┌─────────────────────────────────────────────────────────┐
│ Public API boundary │
│ ┌─────────────────────────────────────────────────────┐│
│ │ if let Some(token) = Token::summon() { ││
│ │ // ONLY place summon() is called ││
│ │ internal_impl(token, ...); ││
│ │ } ││
│ └─────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐│
│ │ #[arcane] ││
│ │ fn internal_impl(token: Token, ...) { ││
│ │ helper(token, ...); // Pass token through ││
│ │ } ││
│ └─────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐│
│ │ #[arcane] ││
│ │ fn helper(token: Token, ...) { ││
│ │ // Use token, never summon ││
│ │ } ││
│ └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘
Why Tokens Are Zero-Cost to Pass
#![allow(unused)] fn main() { // Tokens are zero-sized assert_eq!(std::mem::size_of::<X64V3Token>(), 0); // Passing them costs nothing at runtime fn f(token: X64V3Token) { } // No actual parameter in compiled code }
The token exists only at compile time to prove you did the check. At runtime, it's completely erased.
When -Ctarget-cpu=native Helps
With compile-time feature guarantees, summon() becomes a no-op:
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release
Now X64V3Token::summon() compiles to:
#![allow(unused)] fn main() { // Effectively becomes: fn summon() -> Option<X64V3Token> { Some(X64V3Token) // No CPUID, unconditional } }
But even then, always hoist. It's good practice, and your code works correctly when compiled without target-cpu.
Summary
| Pattern | Performance | Correctness |
|---|---|---|
summon() in hot loop | 42% slower | Works |
summon() at API boundary | Optimal | Works |
summon() with -Ctarget-cpu | Optimal | Works |
The #[arcane] Macro
#[arcane] creates a safe wrapper around SIMD code. Use it at entry points—functions called from non-SIMD code (after summon(), from tests, public APIs).
For internal helpers called from other SIMD functions, use #[rite] instead—it has zero wrapper overhead.
Basic Usage
#![allow(unused)] fn main() { use archmage::prelude::*; #[arcane] fn add_vectors(_token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] { // safe_unaligned_simd takes references - fully safe inside #[arcane]! let va = _mm256_loadu_ps(a); let vb = _mm256_loadu_ps(b); let sum = _mm256_add_ps(va, vb); let mut out = [0.0f32; 8]; _mm256_storeu_ps(&mut out, sum); out } }
What It Generates
// Your code:
#[arcane]
fn add(token: Desktop64, a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b)
}

// Generated:
fn add(token: Desktop64, a: __m256, b: __m256) -> __m256 {
    #[target_feature(enable = "avx2,fma,bmi1,bmi2")]
    #[inline]
    unsafe fn __inner(token: Desktop64, a: __m256, b: __m256) -> __m256 {
        _mm256_add_ps(a, b) // Safe inside #[target_feature]!
    }
    // SAFETY: Token proves CPU support was verified
    unsafe { __inner(token, a, b) }
}
Token-to-Features Mapping
| Token | Enabled Features |
|---|---|
X64V2Token | sse3, ssse3, sse4.1, sse4.2, popcnt |
X64V3Token / Desktop64 | + avx, avx2, fma, bmi1, bmi2, f16c |
X64V4Token / Server64 | + avx512f, avx512bw, avx512cd, avx512dq, avx512vl |
NeonToken / Arm64 | neon |
Simd128Token | simd128 |
Nesting #[arcane] Functions
Functions with the same token type inline into each other:
#![allow(unused)] fn main() { use magetypes::simd::f32x8; #[arcane] fn outer(token: Desktop64, data: &[f32; 8]) -> f32 { let sum = inner(token, data); // Inlines! sum * 2.0 } #[arcane] fn inner(token: Desktop64, data: &[f32; 8]) -> f32 { // Both functions share the same #[target_feature] region // LLVM optimizes across both let v = f32x8::from_array(token, *data); v.reduce_add() } }
Downcasting Tokens
Higher tokens can call functions expecting lower tokens:
#![allow(unused)] fn main() { use magetypes::simd::f32x8; #[arcane] fn v4_kernel(token: X64V4Token, data: &[f32; 8]) -> f32 { // V4 ⊃ V3, so this works and inlines properly v3_sum(token, data) // ... could do AVX-512 specific work too ... } #[arcane] fn v3_sum(token: X64V3Token, data: &[f32; 8]) -> f32 { // Actual SIMD: load 8 floats, horizontal sum let v = f32x8::from_array(token, *data); v.reduce_add() } }
Cross-Platform Stubs
On non-matching architectures, #[arcane] generates a stub:
#![allow(unused)] fn main() { // On ARM, this becomes: #[cfg(not(target_arch = "x86_64"))] fn add(token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] { unreachable!("Desktop64 cannot exist on this architecture") } }
The stub compiles but can never be reached—Desktop64::summon() returns None on ARM.
Options
inline_always
Force aggressive inlining (requires nightly):
#![allow(unused)] #![feature(target_feature_inline_always)] fn main() { #[arcane(inline_always)] fn hot_path(token: Desktop64, data: &[f32]) -> f32 { // Uses #[inline(always)] instead of #[inline] } }
Common Patterns
Public API with Internal Implementation
#![allow(unused)] fn main() { pub fn process(data: &mut [f32]) { if let Some(token) = Desktop64::summon() { process_simd(token, data); } else { process_scalar(data); } } #[arcane] fn process_simd(token: Desktop64, data: &mut [f32]) { // SIMD implementation } fn process_scalar(data: &mut [f32]) { // Fallback } }
Generic Over Tokens
#![allow(unused)] fn main() { use archmage::HasX64V2; use std::arch::x86_64::*; #[arcane] fn generic_impl<T: HasX64V2>(token: T, a: __m128, b: __m128) -> __m128 { // Works with X64V2Token, X64V3Token, X64V4Token // Note: generic bounds create optimization boundaries _mm_add_ps(a, b) } }
Warning: Generic bounds prevent inlining across the boundary. Prefer concrete tokens for hot paths.
The #[rite] Macro
#[rite] should be your default choice for SIMD functions. It adds #[target_feature] + #[inline] directly—no wrapper overhead.
Use #[arcane] only at entry points where the token comes from the outside world.
The Rule
| Caller | Use |
|---|---|
Called from #[arcane] or #[rite] with same/compatible token | #[rite] |
Called from non-SIMD code (tests, public API, after summon()) | #[arcane] |
Default to #[rite]. Only use #[arcane] when you need the safe wrapper.
use archmage::prelude::*;

// ENTRY POINT: receives token from caller
#[arcane]
pub fn dot_product(token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let products = mul_vectors(token, a, b); // Calls #[rite] helper
    horizontal_sum(token, products)
}

// INNER HELPER: only called from #[arcane] context
#[rite]
fn mul_vectors(_token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> __m256 {
    // safe_unaligned_simd takes references - no unsafe needed!
    let va = _mm256_loadu_ps(a);
    let vb = _mm256_loadu_ps(b);
    _mm256_mul_ps(va, vb)
}

// INNER HELPER: only called from #[arcane] context
#[rite]
fn horizontal_sum(_token: Desktop64, v: __m256) -> f32 {
    let sum = _mm256_hadd_ps(v, v);
    let sum = _mm256_hadd_ps(sum, sum);
    let low = _mm256_castps256_ps128(sum);
    let high = _mm256_extractf128_ps::<1>(sum);
    _mm_cvtss_f32(_mm_add_ss(low, high))
}
What It Generates
#![allow(unused)] fn main() { // Your code: #[rite] fn helper(_token: Desktop64, v: __m256) -> __m256 { _mm256_add_ps(v, v) } // Generated (NO wrapper function): #[target_feature(enable = "avx2,fma,bmi1,bmi2")] #[inline] fn helper(_token: Desktop64, v: __m256) -> __m256 { _mm256_add_ps(v, v) } }
Compare to #[arcane] which creates:
#![allow(unused)] fn main() { fn helper(_token: Desktop64, v: __m256) -> __m256 { #[target_feature(enable = "avx2,fma,bmi1,bmi2")] #[inline] unsafe fn __inner(_token: Desktop64, v: __m256) -> __m256 { _mm256_add_ps(v, v) } unsafe { __inner(_token, v) } } }
Why This Works (Rust 1.85+)
Since Rust 1.85, calling a #[target_feature] function from another function with matching or superset features is safe—no unsafe block needed:
#![allow(unused)] fn main() { #[target_feature(enable = "avx2,fma")] fn outer(data: &[f32; 8]) -> f32 { inner_add(data) + inner_mul(data) // Safe! No unsafe needed! } #[target_feature(enable = "avx2")] #[inline] fn inner_add(data: &[f32; 8]) -> f32 { /* ... */ } #[target_feature(enable = "avx2")] #[inline] fn inner_mul(data: &[f32; 8]) -> f32 { /* ... */ } }
The caller's features (avx2,fma) are a superset of the callee's (avx2), so the compiler knows the call is safe.
Direct Calls Require Unsafe
If you call a #[rite] function from outside a #[target_feature] context, you need unsafe:
#![allow(unused)] fn main() { #[test] fn test_helper() { if let Some(token) = Desktop64::summon() { // Direct call from test (no target_feature) requires unsafe let result = unsafe { helper(token, data) }; assert_eq!(result, expected); } } }
This is correct—the test function doesn't have #[target_feature], so the compiler can't verify safety at compile time. The unsafe block says "I checked at runtime via summon()."
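If you'd rather keep tests free of unsafe, one option (consistent with the "Called from tests" row in the table below) is to route the call through a thin #[arcane] entry point. A sketch, assuming the f32x8 API shown elsewhere in this book:

use archmage::prelude::*;
use magetypes::simd::f32x8;

#[rite]
fn sum8(token: Desktop64, data: &[f32; 8]) -> f32 {
    f32x8::from_array(token, *data).reduce_add()
}

// Thin #[arcane] wrapper: tests call this instead of the #[rite] helper,
// so no unsafe block is needed in the test body.
#[arcane]
fn sum8_entry(token: Desktop64, data: &[f32; 8]) -> f32 {
    sum8(token, data)
}

#[test]
fn sum8_matches_scalar() {
    if let Some(token) = Desktop64::summon() {
        let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
        assert_eq!(sum8_entry(token, &data), data.iter().sum::<f32>());
    }
}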
Benefits
- Zero wrapper overhead: No extra function call indirection
- Better inlining: LLVM sees the actual function, not a wrapper
- Cleaner stack traces: No __inner functions in backtraces
- Syntactic sugar: No need to manually maintain feature strings
Choosing Between #[arcane] and #[rite]
Default to #[rite] — only use #[arcane] when necessary.
| Situation | Use | Why |
|---|---|---|
| Internal helper | #[rite] | Zero overhead, inlines fully |
| Composable building blocks | #[rite] | Same target features = one optimization region |
| Most SIMD functions | #[rite] | This should be your default |
| Entry point (receives token from outside) | #[arcane] | Needs safe wrapper |
| Public API | #[arcane] | Callers aren't in target_feature context |
| Called from tests | #[arcane] | Tests aren't in target_feature context |
Composing Helpers
#[rite] helpers compose naturally:
#![allow(unused)] fn main() { #[rite] fn complex_op(token: Desktop64, a: &[f32; 8], b: &[f32; 8], c: &[f32; 8]) -> f32 { let ab = mul_vectors(token, a, b); // Calls another #[rite] let vc = load_vector(token, c); // Calls another #[rite] let sum = add_vectors_raw(token, ab, vc); // Calls another #[rite] horizontal_sum(token, sum) // Calls another #[rite] } }
All helpers inline into the caller with zero overhead.
Inlining Behavior
#[rite] uses #[inline] which is sufficient for full inlining when called from matching #[target_feature] context. Benchmarks show #[rite] with #[inline] performs identically to manually inlined code.
Note: #[inline(always)] combined with #[target_feature] is not allowed on stable Rust, so we can't use it anyway. The good news is we don't need it—#[inline] works perfectly.
Benchmark results (1000 iterations, 8-float vector add):
arcane_in_loop: 2.32 µs (4.1x slower - wrapper overhead)
rite_in_arcane: 572 ns (baseline - full inlining)
manual_inline: 570 ns (baseline)
Cross-Platform Stubs
Archmage lets you write x86 SIMD code that compiles on ARM and vice versa. Functions become unreachable stubs on non-matching architectures.
How It Works
When you write:
#![allow(unused)] fn main() { use archmage::prelude::*; #[arcane] fn avx2_kernel(_token: Desktop64, data: &[f32; 8]) -> [f32; 8] { // x86-64 SIMD code - safe_unaligned_simd takes references let v = _mm256_loadu_ps(data); // ... } }
On x86-64, you get the real implementation.
On ARM/WASM, you get:
#![allow(unused)] fn main() { fn avx2_kernel(token: Desktop64, data: &[f32; 8]) -> [f32; 8] { unreachable!("Desktop64 cannot exist on this architecture") } }
Why This Is Safe
The stub can never execute because:
- Desktop64::summon() returns None on ARM
- You can't (safely) construct Desktop64 any other way
- The only path to avx2_kernel is through a token you can't obtain
#![allow(unused)] fn main() { fn process(data: &[f32; 8]) -> [f32; 8] { if let Some(token) = Desktop64::summon() { avx2_kernel(token, data) // Never reached on ARM } else { scalar_fallback(data) // ARM takes this path } } }
Writing Cross-Platform Libraries
Structure your code with platform-specific implementations:
// Public API - works everywhere
pub fn process(data: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    if let Some(token) = Desktop64::summon() {
        return process_avx2(token, data);
    }

    #[cfg(target_arch = "aarch64")]
    if let Some(token) = NeonToken::summon() {
        return process_neon(token, data);
    }

    process_scalar(data);
}

#[cfg(target_arch = "x86_64")]
#[arcane]
fn process_avx2(token: Desktop64, data: &mut [f32]) {
    // AVX2 implementation
}

#[cfg(target_arch = "aarch64")]
#[arcane]
fn process_neon(token: NeonToken, data: &mut [f32]) {
    // NEON implementation
}

fn process_scalar(data: &mut [f32]) {
    // Works everywhere
}
Token Existence vs Token Availability
All token types exist on all platforms:
#![allow(unused)] fn main() { // These types compile on ARM: use archmage::{Desktop64, X64V3Token, X64V4Token}; // But summon() returns None: assert!(Desktop64::summon().is_none()); // On ARM // And guaranteed() tells you: assert_eq!(Desktop64::guaranteed(), Some(false)); // Wrong arch }
This enables cross-platform code without #[cfg] soup:
#![allow(unused)] fn main() { // Compiles everywhere, dispatches at runtime fn process<T: IntoConcreteToken>(token: T, data: &[f32]) { if let Some(t) = token.as_x64v3() { process_v3(t, data); } else if let Some(t) = token.as_neon() { process_neon(t, data); } else { process_scalar(data); } } }
The ScalarToken Escape Hatch
ScalarToken works everywhere:
#![allow(unused)] fn main() { use archmage::{ScalarToken, SimdToken}; // Always succeeds, any platform let token = ScalarToken::summon().unwrap(); // Or just construct it let token = ScalarToken; }
Use it for fallback paths that need a token for API consistency:
#![allow(unused)] fn main() { fn must_have_token<T: SimdToken>(token: T, data: &[f32]) -> f32 { // ... } // On platforms without SIMD: let result = must_have_token(ScalarToken, &data); }
Testing Cross-Platform Code
Test your dispatch logic without needing every CPU:
#![allow(unused)] fn main() { #[test] fn test_scalar_fallback() { // Force scalar path even on AVX2 machine let token = ScalarToken; let result = process_with_token(token, &data); assert_eq!(result, expected); } #[test] #[cfg(target_arch = "x86_64")] fn test_avx2_path() { if let Some(token) = Desktop64::summon() { let result = process_with_token(token, &data); assert_eq!(result, expected); } } }
Manual Dispatch
The simplest dispatch pattern: check for tokens explicitly, call the appropriate implementation.
Basic Pattern
#![allow(unused)] fn main() { use archmage::{Desktop64, SimdToken}; pub fn process(data: &mut [f32]) { if let Some(token) = Desktop64::summon() { process_avx2(token, data); } else { process_scalar(data); } } #[arcane] fn process_avx2(token: Desktop64, data: &mut [f32]) { // AVX2 implementation } fn process_scalar(data: &mut [f32]) { // Scalar fallback } }
That's it. No #[cfg(target_arch)] needed—this compiles and runs everywhere.
No Architecture Guards Needed
Tokens exist on all platforms. On unsupported architectures, summon() returns None and #[arcane] functions become unreachable stubs. You write one dispatch block:
use archmage::{Desktop64, Arm64, Simd128Token, SimdToken};

pub fn process(data: &mut [f32]) {
    // Try x86 AVX2
    if let Some(token) = Desktop64::summon() {
        return process_x86(token, data);
    }
    // Try ARM NEON
    if let Some(token) = Arm64::summon() {
        return process_arm(token, data);
    }
    // Try WASM SIMD
    if let Some(token) = Simd128Token::summon() {
        return process_wasm(token, data);
    }
    // Scalar fallback
    process_scalar(data);
}

#[arcane]
fn process_x86(token: Desktop64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_arm(token: Arm64, data: &mut [f32]) { /* ... */ }

#[arcane]
fn process_wasm(token: Simd128Token, data: &mut [f32]) { /* ... */ }

fn process_scalar(data: &mut [f32]) { /* ... */ }
On x86-64: Desktop64::summon() may succeed, others return None.
On ARM: Arm64::summon() succeeds, others return None.
On WASM: Simd128Token::summon() may succeed, others return None.
The #[arcane] functions for other architectures compile to unreachable stubs—the code exists but can never be called.
Multi-Tier x86 Dispatch
Check from highest to lowest capability:
use archmage::{X64V4Token, Desktop64, X64V2Token, SimdToken};

pub fn process(data: &mut [f32]) {
    // AVX-512 (requires avx512 feature)
    #[cfg(feature = "avx512")]
    if let Some(token) = X64V4Token::summon() {
        return process_v4(token, data);
    }

    // AVX2+FMA (Haswell+, Zen+)
    if let Some(token) = Desktop64::summon() {
        return process_v3(token, data);
    }

    // SSE4.2 (Nehalem+)
    if let Some(token) = X64V2Token::summon() {
        return process_v2(token, data);
    }

    process_scalar(data);
}
Note: #[cfg(feature = "avx512")] is a Cargo feature gate (compile-time opt-in), not an architecture check. The actual CPU detection is still runtime via summon().
When to Use Manual Dispatch
Use manual dispatch when:
- You have 2-3 tiers
- You want explicit, readable control flow
- Different tiers have different APIs
Consider incant! when:
- You have many tiers
- All implementations have the same signature
- You want automatic best-available selection
Avoiding Common Mistakes
Don't Dispatch in Hot Loops
// WRONG - CPUID every iteration
for chunk in data.chunks_mut(8) {
    if let Some(token) = Desktop64::summon() {
        process_chunk(token, chunk);
    }
}

// BETTER - hoist token outside loop
if let Some(token) = Desktop64::summon() {
    for chunk in data.chunks_mut(8) {
        process_chunk(token, chunk); // But still has #[arcane] wrapper overhead
    }
} else {
    for chunk in data.chunks_mut(8) {
        process_chunk_scalar(chunk);
    }
}

// BEST - put the loop inside #[arcane], call #[rite] helpers
if let Some(token) = Desktop64::summon() {
    process_all_chunks(token, data);
} else {
    process_all_chunks_scalar(data);
}

#[arcane]
fn process_all_chunks(token: Desktop64, data: &mut [f32]) {
    for chunk in data.chunks_exact_mut(8) {
        process_chunk(token, chunk.try_into().unwrap()); // #[rite] inlines fully!
    }
}

#[rite]
fn process_chunk(_: Desktop64, chunk: &mut [f32; 8]) {
    // This inlines into process_all_chunks with zero overhead
}
The "BETTER" pattern still calls through an #[arcane] wrapper each iteration—an LLVM optimization barrier. The "BEST" pattern puts the loop inside #[arcane] and uses #[rite] for the inner work, so LLVM sees one optimization region for the entire loop.
Don't Forget Early Returns
// WRONG - falls through to scalar even when SIMD available
if let Some(token) = Desktop64::summon() {
    process_avx2(token, data); // Missing return!
}
process_scalar(data); // Always runs!

// RIGHT
if let Some(token) = Desktop64::summon() {
    return process_avx2(token, data);
}
process_scalar(data);
incant! Macro
incant! automates dispatch to suffixed function variants. Write one call, get automatic fallback through capability tiers.
Basic Usage
use archmage::{incant, arcane};
use magetypes::simd::f32x8;

// Define variants with standard suffixes
#[arcane]
fn sum_v3(token: X64V3Token, data: &[f32; 8]) -> f32 {
    f32x8::from_array(token, *data).reduce_add()
}

#[arcane]
fn sum_neon(token: NeonToken, data: &[f32; 4]) -> f32 {
    // NEON implementation
}

fn sum_scalar(data: &[f32]) -> f32 {
    data.iter().sum()
}

// Dispatch automatically
pub fn sum(data: &[f32; 8]) -> f32 {
    incant!(sum(data)) // Tries: sum_v4 → sum_v3 → sum_neon → sum_wasm128 → sum_scalar
}
How It Works
incant! expands to a dispatch chain:
// incant!(process(data)) expands to approximately:
{
    #[cfg(all(target_arch = "x86_64", feature = "avx512"))]
    if let Some(token) = X64V4Token::summon() {
        return process_v4(token, data);
    }
    #[cfg(target_arch = "x86_64")]
    if let Some(token) = X64V3Token::summon() {
        return process_v3(token, data);
    }
    #[cfg(target_arch = "aarch64")]
    if let Some(token) = NeonToken::summon() {
        return process_neon(token, data);
    }
    #[cfg(target_arch = "wasm32")]
    if let Some(token) = Simd128Token::summon() {
        return process_wasm128(token, data);
    }
    process_scalar(data)
}
Suffix Conventions
| Suffix | Token | Platform |
|---|---|---|
_v4 | X64V4Token | x86-64 AVX-512 |
_v3 | X64V3Token | x86-64 AVX2+FMA |
_v2 | X64V2Token | x86-64 SSE4.2 |
_neon | NeonToken | AArch64 |
_wasm128 | Simd128Token | WASM |
_scalar | — | Fallback |
You don't need all variants—incant! skips missing ones.
Passthrough Mode
When you already have a token and want to dispatch to specialized variants:
#![allow(unused)] fn main() { fn outer<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 { // Passthrough: token already obtained, dispatch to best variant incant!(token => process(data)) } }
This uses IntoConcreteToken to check the token's actual type and dispatch accordingly, without re-summoning.
Example: Complete Implementation
use archmage::{arcane, incant, X64V3Token, NeonToken, SimdToken};
use magetypes::simd::f32x8;

// AVX2 variant
#[cfg(target_arch = "x86_64")]
#[arcane]
fn dot_product_v3(token: X64V3Token, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let va = f32x8::from_array(token, *a);
    let vb = f32x8::from_array(token, *b);
    (va * vb).reduce_add()
}

// NEON variant (128-bit, so process 4 at a time)
#[cfg(target_arch = "aarch64")]
#[arcane]
fn dot_product_neon(token: NeonToken, a: &[f32; 8], b: &[f32; 8]) -> f32 {
    use magetypes::simd::f32x4;
    let sum1 = {
        let va = f32x4::from_slice(token, &a[0..4]);
        let vb = f32x4::from_slice(token, &b[0..4]);
        (va * vb).reduce_add()
    };
    let sum2 = {
        let va = f32x4::from_slice(token, &a[4..8]);
        let vb = f32x4::from_slice(token, &b[4..8]);
        (va * vb).reduce_add()
    };
    sum1 + sum2
}

// Scalar fallback
fn dot_product_scalar(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

// Public API
pub fn dot_product(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    incant!(dot_product(a, b))
}
When to Use incant!
Use incant! when:
- You have multiple platform-specific implementations
- You want automatic fallback through tiers
- Function signatures are similar across variants
Use manual dispatch when:
- You need custom fallback logic
- Variants have different signatures
- You want more explicit control
IntoConcreteToken Trait
IntoConcreteToken enables compile-time dispatch via monomorphization. Each token type returns Some(self) for its own type and None for others.
Basic Usage
use archmage::{IntoConcreteToken, SimdToken, X64V3Token, NeonToken, ScalarToken};

fn process<T: IntoConcreteToken>(token: T, data: &mut [f32]) {
    // Compiler eliminates non-matching branches via monomorphization
    if let Some(t) = token.as_x64v3() {
        process_avx2(t, data);
    } else if let Some(t) = token.as_neon() {
        process_neon(t, data);
    } else if let Some(_) = token.as_scalar() {
        process_scalar(data);
    }
}
When called with X64V3Token, the compiler sees:
- as_x64v3() → Some(token) (takes this branch)
- as_neon() → None (eliminated)
- as_scalar() → None (eliminated)
Available Methods
pub trait IntoConcreteToken: SimdToken {
    fn as_x64v2(self) -> Option<X64V2Token> { None }
    fn as_x64v3(self) -> Option<X64V3Token> { None }
    fn as_x64v4(self) -> Option<X64V4Token> { None } // requires avx512
    fn as_avx512_modern(self) -> Option<Avx512ModernToken> { None }
    fn as_neon(self) -> Option<NeonToken> { None }
    fn as_neon_aes(self) -> Option<NeonAesToken> { None }
    fn as_neon_sha3(self) -> Option<NeonSha3Token> { None }
    fn as_wasm128(self) -> Option<Simd128Token> { None }
    fn as_scalar(self) -> Option<ScalarToken> { None }
}
Each concrete token overrides its own method to return Some(self).
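As a simplified illustration of the mechanism (not the crate's actual source), a concrete token overrides only its own accessor and inherits None for the rest:

impl IntoConcreteToken for X64V3Token {
    fn as_x64v3(self) -> Option<X64V3Token> {
        Some(self) // the only accessor X64V3Token overrides
    }
    // every other as_* method keeps its default `None` body
}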
Upcasting with IntoConcreteToken
You can check if a token supports higher capabilities:
#![allow(unused)] fn main() { fn maybe_use_avx512<T: IntoConcreteToken>(token: T, data: &mut [f32]) { // Check if we actually have AVX-512 if let Some(v4) = token.as_x64v4() { fast_path_avx512(v4, data); } else if let Some(v3) = token.as_x64v3() { normal_path_avx2(v3, data); } } }
Note: This creates an LLVM optimization boundary. The generic caller and feature-enabled callee have different target settings. Do this dispatch at entry points, not in hot code.
Dispatch Order
Check from highest to lowest capability:
#![allow(unused)] fn main() { fn dispatch<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 { // Highest first #[cfg(feature = "avx512")] if let Some(t) = token.as_x64v4() { return process_v4(t, data); } if let Some(t) = token.as_x64v3() { return process_v3(t, data); } if let Some(t) = token.as_neon() { return process_neon(t, data); } if let Some(t) = token.as_wasm128() { return process_wasm(t, data); } // Scalar fallback process_scalar(data) } }
vs incant!
| Feature | IntoConcreteToken | incant! |
|---|---|---|
| Dispatch style | Explicit if/else | Macro-generated |
| Token passing | Token already obtained | Summons tokens |
| Flexibility | Full control | Convention-based |
| Verbosity | More code | Less code |
Use IntoConcreteToken when you already have a token and need to specialize. Use incant! for entry-point dispatch.
Example: Library with Generic Token API
use archmage::{IntoConcreteToken, SimdToken, arcane};

/// Public API accepts any token
pub fn transform<T: IntoConcreteToken>(token: T, data: &mut [f32]) {
    if let Some(t) = token.as_x64v3() {
        transform_avx2(t, data);
    } else if let Some(t) = token.as_neon() {
        transform_neon(t, data);
    } else {
        transform_scalar(data);
    }
}

#[cfg(target_arch = "x86_64")]
#[arcane]
fn transform_avx2(token: X64V3Token, data: &mut [f32]) {
    // AVX2 implementation
}

#[cfg(target_arch = "aarch64")]
#[arcane]
fn transform_neon(token: NeonToken, data: &mut [f32]) {
    // NEON implementation
}

fn transform_scalar(data: &mut [f32]) {
    // Scalar fallback
}
Callers can pass any token:
#![allow(unused)] fn main() { if let Some(token) = Desktop64::summon() { transform(token, &mut data); // Uses AVX2 path } // Or force scalar for testing transform(ScalarToken, &mut data); }
Tiered Fallback
Real applications need graceful degradation across capability tiers. Here's how to structure robust fallback chains.
The Tier Hierarchy
x86-64:
X64V4Token (AVX-512)
↓
X64V3Token (AVX2+FMA) ← Most common target
↓
X64V2Token (SSE4.2)
↓
ScalarToken
AArch64:
NeonSha3Token
↓
NeonAesToken
↓
NeonToken ← Baseline (always available)
↓
ScalarToken
WASM:
Simd128Token
↓
ScalarToken
Pattern: Capability Waterfall
use archmage::*;

pub fn process(data: &mut [f32]) -> f32 {
    // x86-64 path
    #[cfg(target_arch = "x86_64")]
    {
        #[cfg(feature = "avx512")]
        if let Some(token) = X64V4Token::summon() {
            return process_v4(token, data);
        }
        if let Some(token) = X64V3Token::summon() {
            return process_v3(token, data);
        }
        if let Some(token) = X64V2Token::summon() {
            return process_v2(token, data);
        }
    }

    // AArch64 path
    #[cfg(target_arch = "aarch64")]
    if let Some(token) = NeonToken::summon() {
        return process_neon(token, data);
    }

    // WASM path
    #[cfg(target_arch = "wasm32")]
    if let Some(token) = Simd128Token::summon() {
        return process_wasm(token, data);
    }

    // Universal fallback
    process_scalar(data)
}
Pattern: Width-Based Tiers
When your algorithm naturally works at different widths:
use magetypes::*;

pub fn sum_f32(data: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        // Try 256-bit first
        if let Some(token) = X64V3Token::summon() {
            return sum_f32x8(token, data);
        }
        // Fall back to 128-bit
        if let Some(token) = X64V2Token::summon() {
            return sum_f32x4(token, data);
        }
    }

    #[cfg(target_arch = "aarch64")]
    if let Some(token) = NeonToken::summon() {
        return sum_f32x4_neon(token, data);
    }

    sum_scalar(data)
}

#[cfg(target_arch = "x86_64")]
#[arcane]
fn sum_f32x8(token: X64V3Token, data: &[f32]) -> f32 {
    let mut acc = f32x8::zero(token);
    for chunk in data.chunks_exact(8) {
        let v = f32x8::from_slice(token, chunk);
        acc = acc + v;
    }
    // Handle remainder
    let mut sum = acc.reduce_add();
    for &x in data.chunks_exact(8).remainder() {
        sum += x;
    }
    sum
}
Pattern: Feature Detection Cache
For hot paths where you dispatch many times:
use std::sync::OnceLock;

#[derive(Clone, Copy)]
enum SimdLevel {
    Avx512,
    Avx2,
    Sse42,
    Scalar,
}

static SIMD_LEVEL: OnceLock<SimdLevel> = OnceLock::new();

fn detect_level() -> SimdLevel {
    *SIMD_LEVEL.get_or_init(|| {
        #[cfg(all(target_arch = "x86_64", feature = "avx512"))]
        if X64V4Token::summon().is_some() {
            return SimdLevel::Avx512;
        }
        #[cfg(target_arch = "x86_64")]
        if X64V3Token::summon().is_some() {
            return SimdLevel::Avx2;
        }
        #[cfg(target_arch = "x86_64")]
        if X64V2Token::summon().is_some() {
            return SimdLevel::Sse42;
        }
        SimdLevel::Scalar
    })
}

pub fn process(data: &mut [f32]) {
    match detect_level() {
        SimdLevel::Avx512 => {
            #[cfg(all(target_arch = "x86_64", feature = "avx512"))]
            process_v4(X64V4Token::summon().unwrap(), data);
        }
        SimdLevel::Avx2 => {
            #[cfg(target_arch = "x86_64")]
            process_v3(X64V3Token::summon().unwrap(), data);
        }
        // ... etc
    }
}
Anti-Pattern: Over-Engineering
Don't create tiers you don't need:
// WRONG: Too many tiers, most are never used
pub fn simple_add(a: f32, b: f32) -> f32 {
    if let Some(t) = X64V4Token::summon() { ... }
    else if let Some(t) = X64V3Token::summon() { ... }
    else if let Some(t) = X64V2Token::summon() { ... }
    else if let Some(t) = NeonToken::summon() { ... }
    else { a + b }
}

// RIGHT: Just do the simple thing
pub fn simple_add(a: f32, b: f32) -> f32 {
    a + b
}
SIMD only helps with bulk operations. For scalar math, just use scalar math.
Recommendations
- Desktop64 is usually enough for x86 — it covers 99% of modern PCs
- NeonToken is baseline on AArch64 — no fallback needed
- Test your scalar path — it's your safety net
- Profile before adding tiers — each tier is code to maintain
magetypes Type Overview
magetypes provides SIMD vector types with natural Rust operators. Each type wraps platform intrinsics and requires an archmage token for construction.
Available Types
x86-64 Types
| Type | Elements | Width | Min Token |
|---|---|---|---|
f32x4 | 4 × f32 | 128-bit | X64V2Token |
f32x8 | 8 × f32 | 256-bit | X64V3Token |
f32x16 | 16 × f32 | 512-bit | X64V4Token* |
f64x2 | 2 × f64 | 128-bit | X64V2Token |
f64x4 | 4 × f64 | 256-bit | X64V3Token |
f64x8 | 8 × f64 | 512-bit | X64V4Token* |
i32x4 | 4 × i32 | 128-bit | X64V2Token |
i32x8 | 8 × i32 | 256-bit | X64V3Token |
i32x16 | 16 × i32 | 512-bit | X64V4Token* |
i8x16 | 16 × i8 | 128-bit | X64V2Token |
i8x32 | 32 × i8 | 256-bit | X64V3Token |
| ... | ... | ... | ... |
*Requires avx512 feature
AArch64 Types (NEON)
| Type | Elements | Width | Token |
|---|---|---|---|
f32x4 | 4 × f32 | 128-bit | NeonToken |
f64x2 | 2 × f64 | 128-bit | NeonToken |
i32x4 | 4 × i32 | 128-bit | NeonToken |
i16x8 | 8 × i16 | 128-bit | NeonToken |
i8x16 | 16 × i8 | 128-bit | NeonToken |
u32x4 | 4 × u32 | 128-bit | NeonToken |
| ... | ... | ... | ... |
WASM Types (SIMD128)
| Type | Elements | Width | Token |
|---|---|---|---|
f32x4 | 4 × f32 | 128-bit | Simd128Token |
f64x2 | 2 × f64 | 128-bit | Simd128Token |
i32x4 | 4 × i32 | 128-bit | Simd128Token |
| ... | ... | ... | ... |
Basic Usage
#![allow(unused)] fn main() { use archmage::{Desktop64, SimdToken}; use magetypes::simd::f32x8; fn example() { if let Some(token) = Desktop64::summon() { // Construct from array let a = f32x8::from_array(token, [1.0; 8]); // Splat a single value let b = f32x8::splat(token, 2.0); // Natural operators let c = a + b; let d = c * c; // Extract result let result: [f32; 8] = d.to_array(); } } }
Type Properties
All magetypes SIMD types are:
- Copy — Pass by value freely
- Clone — Explicit cloning works
- Debug — Print for debugging
- Send + Sync — Thread-safe
- Token-gated construction — Cannot create without proving CPU support
#![allow(unused)] fn main() { // Zero-cost copies let a = f32x8::splat(token, 1.0); let b = a; // Copy, not move let c = a + b; // Both still valid }
Why no Pod/Zeroable? Implementing bytemuck traits would let users bypass token-gated construction (e.g., bytemuck::zeroed::<f32x8>()), creating vectors without proving CPU support. Use the token-gated cast_slice and from_bytes methods instead.
Using the Prelude
For convenience, import everything:
#![allow(unused)] fn main() { use magetypes::prelude::*; // Now you have all types and archmage re-exports if let Some(token) = Desktop64::summon() { let v = f32x8::splat(token, 1.0); } }
Platform-Specific Imports
If you need just one platform:
#![allow(unused)] fn main() { // x86-64 only #[cfg(target_arch = "x86_64")] use magetypes::simd::x86::*; // AArch64 only #[cfg(target_arch = "aarch64")] use magetypes::simd::arm::*; // WASM only #[cfg(target_arch = "wasm32")] use magetypes::simd::wasm::*; }
Feature Flags
| Feature | Effect |
|---|---|
avx512 | Enable 512-bit types on x86-64 |
std | Standard library support (default) |
Construction & Operators
magetypes provides natural Rust syntax for SIMD operations.
Construction
From Array
#![allow(unused)] fn main() { let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]; let v = f32x8::from_array(token, data); }
From Slice
#![allow(unused)] fn main() { let slice = &[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]; let v = f32x8::from_slice(token, slice); }
Splat (Broadcast)
#![allow(unused)] fn main() { let v = f32x8::splat(token, 3.14159); // All lanes = π }
Zero
#![allow(unused)] fn main() { let v = f32x8::zero(token); // All lanes = 0 }
Load from Memory
#![allow(unused)] fn main() { // Unaligned load let v = f32x8::load(token, ptr); // From reference let v = f32x8::from_array(token, *array_ref); }
Extraction
To Array
#![allow(unused)] fn main() { let arr: [f32; 8] = v.to_array(); }
Store to Memory
#![allow(unused)] fn main() { v.store(ptr); // Unaligned store v.store_aligned(ptr); // Aligned store (UB if misaligned) }
Extract Single Lane
#![allow(unused)] fn main() { let first = v.extract::<0>(); let third = v.extract::<2>(); }
Arithmetic Operators
All standard operators work:
#![allow(unused)] fn main() { let a = f32x8::splat(token, 2.0); let b = f32x8::splat(token, 3.0); let sum = a + b; // [5.0; 8] let diff = a - b; // [-1.0; 8] let prod = a * b; // [6.0; 8] let quot = a / b; // [0.666...; 8] let neg = -a; // [-2.0; 8] }
Compound Assignment
#![allow(unused)] fn main() { let mut v = f32x8::splat(token, 1.0); v += f32x8::splat(token, 2.0); // v = [3.0; 8] v *= f32x8::splat(token, 2.0); // v = [6.0; 8] }
Fused Multiply-Add
FMA is faster and more precise than separate multiply and add:
#![allow(unused)] fn main() { // a * b + c (single instruction on AVX2/NEON) let result = a.mul_add(b, c); // a * b - c let result = a.mul_sub(b, c); // -(a * b) + c (negated multiply-add) let result = a.neg_mul_add(b, c); }
Comparisons
Comparisons return mask types:
#![allow(unused)] fn main() { let a = f32x8::from_array(token, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]); let b = f32x8::splat(token, 4.0); let lt = a.simd_lt(b); // [true, true, true, false, false, false, false, false] let eq = a.simd_eq(b); // [false, false, false, true, false, false, false, false] let ge = a.simd_ge(b); // [false, false, false, true, true, true, true, true] }
Blend with Mask
#![allow(unused)] fn main() { let mask = a.simd_lt(b); let result = mask.blend(true_values, false_values); }
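For example, a ReLU-style clamp built from the documented blend signature (a sketch, assuming the f32x8 mask API shown above):

let v = f32x8::from_array(token, [-1.0, 2.0, -3.0, 4.0, -5.0, 6.0, -7.0, 8.0]);
let zero = f32x8::zero(token);
let negative = v.simd_lt(zero);
// Lanes where the mask is true take the first argument (zero); others keep v
let relu = negative.blend(zero, v); // [0.0, 2.0, 0.0, 4.0, 0.0, 6.0, 0.0, 8.0]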
Min/Max
#![allow(unused)] fn main() { let min = a.min(b); // Element-wise minimum let max = a.max(b); // Element-wise maximum // With scalar let clamped = v.max(f32x8::splat(token, 0.0)) .min(f32x8::splat(token, 1.0)); }
Absolute Value
#![allow(unused)] fn main() { let abs = v.abs(); // |v| for each lane }
Reductions
Horizontal operations across lanes:
#![allow(unused)] fn main() { let sum = v.reduce_add(); // Sum of all lanes let max = v.reduce_max(); // Maximum lane let min = v.reduce_min(); // Minimum lane }
Integer Operations
For integer types (i32x8, u8x16, etc.):
#![allow(unused)] fn main() { let a = i32x8::splat(token, 10); let b = i32x8::splat(token, 3); // Arithmetic let sum = a + b; let diff = a - b; let prod = a * b; // Bitwise let and = a & b; let or = a | b; let xor = a ^ b; let not = !a; // Shifts let shl = a << 2; // Shift left by constant let shr = a >> 1; // Shift right by constant let shr_arith = a.shr_arithmetic(1); // Sign-extending shift }
Example: Dot Product
#![allow(unused)] fn main() { use archmage::{Desktop64, SimdToken, arcane}; use magetypes::simd::f32x8; #[arcane] fn dot_product(token: Desktop64, a: &[f32; 8], b: &[f32; 8]) -> f32 { let va = f32x8::from_array(token, *a); let vb = f32x8::from_array(token, *b); (va * vb).reduce_add() } }
Example: Vector Normalization
#![allow(unused)] fn main() { #[arcane] fn normalize(token: Desktop64, v: &mut [f32; 8]) { let vec = f32x8::from_array(token, *v); let len_sq = (vec * vec).reduce_add(); let len = len_sq.sqrt(); if len > 0.0 { let inv_len = f32x8::splat(token, 1.0 / len); let normalized = vec * inv_len; *v = normalized.to_array(); } } }
Type Conversions
magetypes provides conversions between SIMD types and between SIMD and scalar types.
Float ↔ Integer Conversions
Float to Integer
#![allow(unused)] fn main() { let floats = f32x8::from_array(token, [1.5, 2.7, -3.2, 4.0, 5.9, 6.1, 7.0, 8.5]); // Truncate toward zero (like `as i32`) let ints = floats.to_i32x8(); // [1, 2, -3, 4, 5, 6, 7, 8] // Round to nearest let rounded = floats.to_i32x8_round(); // [2, 3, -3, 4, 6, 6, 7, 8] }
Integer to Float
#![allow(unused)] fn main() { let ints = i32x8::from_array(token, [1, 2, 3, 4, 5, 6, 7, 8]); let floats = ints.to_f32x8(); // [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0] }
Width Conversions
Narrowing (Wider → Narrower)
#![allow(unused)] fn main() { // f64x4 → f32x4 (lose precision) let doubles = f64x4::from_array(token, [1.0, 2.0, 3.0, 4.0]); let floats = doubles.to_f32x4(); // i32x8 → i16x8 (pack with saturation) let wide = i32x8::from_array(token, [1, 2, 3, 4, 5, 6, 7, 8]); let narrow = wide.pack_i16(); }
Widening (Narrower → Wider)
#![allow(unused)] fn main() { // f32x4 → f64x4 (extend precision, lower half) let floats = f32x4::from_array(token, [1.0, 2.0, 3.0, 4.0]); let doubles = floats.to_f64x4_low(); // Converts first 2 elements // i16x8 → i32x8 (sign-extend lower half) let narrow = i16x8::from_array(token, [1, 2, 3, 4, 5, 6, 7, 8]); let wide = narrow.extend_i32_low(); // [1, 2, 3, 4] }
Bitcast (Reinterpret)
Reinterpret bits as a different type (same size):
#![allow(unused)] fn main() { // f32x8 → i32x8 (view float bits as integers) let floats = f32x8::splat(token, 1.0); let bits = floats.bitcast_i32x8(); // i32x8 → f32x8 let ints = i32x8::splat(token, 0x3f800000); // IEEE 754 for 1.0 let floats = ints.bitcast_f32x8(); }
Warning: Bitcast doesn't convert values—it reinterprets the raw bits.
Signed ↔ Unsigned
#![allow(unused)] fn main() { // i32x8 → u32x8 (reinterpret, no conversion) let signed = i32x8::from_array(token, [-1, 0, 1, 2, 3, 4, 5, 6]); let unsigned = signed.to_u32x8(); // [0xFFFFFFFF, 0, 1, 2, 3, 4, 5, 6] // u32x8 → i32x8 let unsigned = u32x8::splat(token, 0xFFFFFFFF); let signed = unsigned.to_i32x8(); // [-1; 8] }
Lane Extraction and Insertion
#![allow(unused)] fn main() { // Extract single lane let v = f32x8::from_array(token, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]); let third = v.extract::<2>(); // 3.0 // Insert single lane let v = v.insert::<2>(99.0); // [1.0, 2.0, 99.0, 4.0, 5.0, 6.0, 7.0, 8.0] }
Half-Width Operations
Split or combine vectors:
#![allow(unused)] fn main() { // Split f32x8 into two f32x4 let full = f32x8::from_array(token, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]); let (low, high) = full.split(); // low = [1.0, 2.0, 3.0, 4.0] // high = [5.0, 6.0, 7.0, 8.0] // Combine two f32x4 into f32x8 let combined = f32x8::from_halves(token, low, high); }
Slice Casting (Token-Gated)
magetypes provides safe, token-gated slice casting as an alternative to bytemuck:
#![allow(unused)] fn main() { // Cast aligned &[f32] to &[f32x8] let data: &[f32] = &[1.0; 64]; if let Some(chunks) = f32x8::cast_slice(token, data) { // chunks: &[f32x8] with 8 elements for chunk in chunks { // ... } } // Mutable version let data: &mut [f32] = &mut [0.0; 64]; if let Some(chunks) = f32x8::cast_slice_mut(token, data) { // chunks: &mut [f32x8] } }
Why not bytemuck? Implementing Pod/Zeroable would let users bypass token-gated construction:
#![allow(unused)] fn main() { // bytemuck would allow this (BAD): let v: f32x8 = bytemuck::Zeroable::zeroed(); // No token check! // magetypes requires token (GOOD): let v = f32x8::zero(token); // Token proves CPU support }
The token-gated cast_slice returns None if alignment or length is wrong—no UB possible.
Byte-Level Access
View vectors as bytes (no token needed—you already have the vector):
#![allow(unused)] fn main() { let v = f32x8::splat(token, 1.0); // View as bytes (zero-cost) let bytes: &[u8; 32] = v.as_bytes(); // Mutable view let mut v = f32x8::splat(token, 0.0); let bytes: &mut [u8; 32] = v.as_bytes_mut(); // Create from bytes (token-gated) let bytes = [0u8; 32]; let v = f32x8::from_bytes(token, &bytes); }
Conversion Example: Image Processing
#![allow(unused)] fn main() { #[arcane] fn brighten(token: Desktop64, pixels: &mut [u8]) { // Process 32 bytes at a time for chunk in pixels.chunks_exact_mut(32) { let v = u8x32::from_slice(token, chunk); // Convert to wider type for arithmetic let (lo, hi) = v.split(); let lo_wide = lo.extend_u16_low(); let hi_wide = hi.extend_u16_low(); // Add brightness (with saturation) let brightness = u16x16::splat(token, 20); let lo_bright = lo_wide.saturating_add(brightness); let hi_bright = hi_wide.saturating_add(brightness); // Pack back to u8 with saturation let result = u8x32::from_halves( token, lo_bright.pack_u8_saturate(), hi_bright.pack_u8_saturate() ); result.store_slice(chunk); } } }
Transcendental Functions
magetypes provides SIMD implementations of common mathematical functions. These are polynomial approximations optimized for speed.
Precision Levels
Functions come in multiple precision variants:
| Suffix | Precision | Speed | Use Case |
|---|---|---|---|
| _lowp | ~12 bits | Fastest | Graphics, audio |
| _midp | ~20 bits | Balanced | General use |
| (none) | Full | Slowest | Scientific |
#![allow(unused)] fn main() { let v = f32x8::splat(token, 2.0); let fast = v.exp2_lowp(); // ~12-bit precision, fastest let balanced = v.exp2_midp(); // ~20-bit precision let precise = v.exp2(); // Full precision }
Exponential Functions
#![allow(unused)] fn main() { // Base-2 exponential: 2^x let v = f32x8::splat(token, 3.0); let result = v.exp2(); // [8.0; 8] // Natural exponential: e^x let result = v.exp(); // [e³; 8] ≈ [20.09; 8] }
Logarithms
#![allow(unused)] fn main() { let v = f32x8::splat(token, 8.0); // Base-2 logarithm: log₂(x) let result = v.log2(); // [3.0; 8] // Natural logarithm: ln(x) let result = v.ln(); // [ln(8); 8] ≈ [2.08; 8] // Base-10 logarithm: log₁₀(x) let result = v.log10(); // [log₁₀(8); 8] ≈ [0.90; 8] }
Power Functions
#![allow(unused)] fn main() { let base = f32x8::splat(token, 2.0); let exp = f32x8::splat(token, 3.0); // x^y (uses exp2(y * log2(x))) let result = base.pow(exp); // [8.0; 8] }
Root Functions
#![allow(unused)] fn main() { let v = f32x8::splat(token, 9.0); // Square root let result = v.sqrt(); // [3.0; 8] // Cube root let result = v.cbrt(); // [∛9; 8] ≈ [2.08; 8] // Reciprocal square root: 1/√x let result = v.rsqrt(); // [1/3; 8] ≈ [0.33; 8] }
Approximations
Fast approximations for graphics/games:
#![allow(unused)] fn main() { // Reciprocal: 1/x (approximate) let v = f32x8::splat(token, 4.0); let result = v.rcp(); // ≈ [0.25; 8] // Reciprocal square root (approximate) let result = v.rsqrt(); // ≈ [0.5; 8] }
For higher precision, use Newton-Raphson refinement:
#![allow(unused)] fn main() { // One Newton-Raphson iteration for rsqrt let approx = v.rsqrt(); let refined = approx * (f32x8::splat(token, 1.5) - v * approx * approx * f32x8::splat(token, 0.5)); }
Special Handling
Domain Errors
#![allow(unused)] fn main() { let v = f32x8::from_array(token, [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]); // sqrt of negative → NaN let sqrt = v.sqrt(); // [NaN, 0.0, 1.0, 1.41, 1.73, 2.0, 2.24, 2.45] // log of non-positive → NaN or -inf let log = v.ln(); // [NaN, -inf, 0.0, 0.69, 1.10, 1.39, 1.61, 1.79] }
Unchecked Variants
Some functions have _unchecked variants that skip domain validation:
#![allow(unused)] fn main() { // Assumes all inputs are valid (positive for sqrt/log) let result = v.sqrt_unchecked(); // Faster, UB if negative let result = v.ln_unchecked(); // Faster, UB if ≤ 0 }
Example: Gaussian Function
#![allow(unused)] fn main() { #[arcane] fn gaussian(token: Desktop64, x: &[f32; 8], sigma: f32) -> [f32; 8] { let v = f32x8::from_array(token, *x); let sigma_v = f32x8::splat(token, sigma); let two = f32x8::splat(token, 2.0); // exp(-x² / (2σ²)) let x_sq = v * v; let two_sigma_sq = two * sigma_v * sigma_v; let exponent = -(x_sq / two_sigma_sq); let result = exponent.exp_midp(); // Good precision, fast result.to_array() } }
Example: Softmax
#![allow(unused)] fn main() { #[arcane] fn softmax(token: Desktop64, logits: &[f32; 8]) -> [f32; 8] { let v = f32x8::from_array(token, *logits); // Subtract max for numerical stability let max = v.reduce_max(); let shifted = v - f32x8::splat(token, max); // exp(x - max) let exp = shifted.exp_midp(); // Normalize let sum = exp.reduce_add(); let result = exp / f32x8::splat(token, sum); result.to_array() } }
Platform Notes
- x86-64: All functions available for f32x4, f32x8, f64x2, f64x4
- AArch64: Full support via NEON
- WASM: Most functions available, some via scalar fallback
The implementation uses polynomial approximations tuned per platform for best performance.
Memory Operations
Efficiently moving data between memory and SIMD registers is critical for performance.
Load Operations
Unaligned Load
#![allow(unused)] fn main() { use magetypes::simd::f32x8; // From array reference let arr = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]; let v = f32x8::from_array(token, arr); // From slice (must have enough elements) let slice = &[1.0f32; 16]; let v = f32x8::from_slice(token, &slice[0..8]); }
Aligned Load
If you know your data is aligned:
#![allow(unused)] fn main() { // Aligned load (UB if not aligned to 32 bytes for f32x8) let v = unsafe { f32x8::load_aligned(ptr) }; }
Partial Load
Load fewer elements than the vector width:
#![allow(unused)] fn main() { // Load 4 elements into lower half, zero upper half let v = f32x8::load_low(token, &[1.0, 2.0, 3.0, 4.0]); // v = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0] }
Store Operations
Unaligned Store
#![allow(unused)] fn main() { let v = f32x8::splat(token, 42.0); // To array let arr: [f32; 8] = v.to_array(); // To slice let mut buf = [0.0f32; 8]; v.store_slice(&mut buf); }
Aligned Store
#![allow(unused)] fn main() { // Aligned store (UB if not aligned) unsafe { v.store_aligned(ptr) }; }
Partial Store
Store only some elements:
#![allow(unused)] fn main() { // Store lower 4 elements v.store_low(&mut buf[0..4]); }
Streaming Stores
For large data where you won't read back soon:
#![allow(unused)] fn main() { // Non-temporal store (bypasses cache) unsafe { v.stream(ptr) }; }
Use streaming stores when:
- Writing large arrays sequentially
- Data won't be read again soon
- Avoiding cache pollution is important
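For instance, zero-filling a large buffer without polluting the cache might look like the sketch below. It assumes the stream method shown above takes a raw *mut f32 and that non-temporal stores require alignment, so the buffer is given 32-byte alignment explicitly.
use archmage::{Desktop64, arcane};
use magetypes::simd::f32x8;

// 32-byte alignment means every 8-float chunk starts on an aligned boundary.
#[repr(C, align(32))]
struct AlignedBlock([f32; 1024]);

#[arcane]
fn clear_block(token: Desktop64, block: &mut AlignedBlock) {
    let zero = f32x8::splat(token, 0.0);
    for chunk in block.0.chunks_exact_mut(8) {
        // Non-temporal store: the data goes to memory without filling the cache.
        unsafe { zero.stream(chunk.as_mut_ptr()) };
    }
}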
Gather and Scatter
Load/store non-contiguous elements:
#![allow(unused)] fn main() { // Gather: load from scattered indices let data = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]; let indices = i32x8::from_array(token, [0, 2, 4, 6, 8, 1, 3, 5]); let gathered = f32x8::gather(&data, indices); // gathered = [0.0, 20.0, 40.0, 60.0, 80.0, 10.0, 30.0, 50.0] // Scatter: store to scattered indices let mut output = [0.0f32; 10]; let values = f32x8::splat(token, 1.0); values.scatter(&mut output, indices); }
Note: Gather/scatter may be slow on some CPUs. Profile before using.
Prefetch
Hint the CPU to load data into cache:
#![allow(unused)] fn main() { use std::arch::x86_64::*; // Prefetch for read unsafe { _mm_prefetch(ptr as *const i8, _MM_HINT_T0) }; // Prefetch levels: // _MM_HINT_T0 - All cache levels // _MM_HINT_T1 - L2 and above // _MM_HINT_T2 - L3 and above // _MM_HINT_NTA - Non-temporal (don't pollute cache) }
Interleaved Data
For RGBARGBA... or similar interleaved formats:
#![allow(unused)] fn main() { // Deinterleave 4 channels (RGBA) let (r, g, b, a) = f32x8::deinterleave_4ch( token, &rgba_data[0..8], &rgba_data[8..16], &rgba_data[16..24], &rgba_data[24..32] ); // Process channels separately let r_bright = r + f32x8::splat(token, 0.1); // Reinterleave let (out0, out1, out2, out3) = f32x8::interleave_4ch(token, r_bright, g, b, a); }
Chunked Processing
Process large arrays in SIMD-sized chunks:
#![allow(unused)] fn main() { #[arcane] fn process_large(token: Desktop64, data: &mut [f32]) { // Process full chunks for chunk in data.chunks_exact_mut(8) { let v = f32x8::from_slice(token, chunk); let result = v * v; // Process result.store_slice(chunk); } // Handle remainder for x in data.chunks_exact_mut(8).into_remainder() { *x = *x * *x; } } }
Alignment Tips
- Use #[repr(align(32))] for AVX2 data:
#![allow(unused)] fn main() { #[repr(C, align(32))] struct AlignedData { values: [f32; 8], } }
- Allocate aligned memory:
#![allow(unused)] fn main() { use std::alloc::{alloc, Layout}; let layout = Layout::from_size_align(size, 32).unwrap(); let ptr = unsafe { alloc(layout) }; }
- Check alignment at runtime:
#![allow(unused)] fn main() { fn is_aligned<T>(ptr: *const T, align: usize) -> bool { (ptr as usize) % align == 0 } }
Performance Tips
- Minimize loads/stores — Keep data in registers
- Prefer unaligned — Modern CPUs handle it well
- Use streaming for large writes — Saves cache space
- Batch operations — Load once, do multiple ops, store once (see the sketch below)
- Avoid gather/scatter — Sequential access is faster
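A minimal sketch of the first and fourth tips, using the f32x8 API shown earlier: each chunk is loaded once, transformed entirely in registers, and stored once (remainder handling omitted; see Chunked Processing above).
use archmage::{Desktop64, arcane};
use magetypes::simd::f32x8;

#[arcane]
fn scale_and_clamp(token: Desktop64, data: &mut [f32], scale: f32) {
    let s = f32x8::splat(token, scale);
    let lo = f32x8::splat(token, 0.0);
    let hi = f32x8::splat(token, 1.0);
    for chunk in data.chunks_exact_mut(8) {
        let v = f32x8::from_slice(token, chunk); // one load
        let r = (v * s).max(lo).min(hi);         // several ops, all in registers
        r.store_slice(chunk);                    // one store
    }
}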
Methods with #[arcane]
Using #[arcane] on methods requires special handling because of how the macro transforms the function body.
The Problem
#![allow(unused)] fn main() { impl MyType { // This won't work as expected! #[arcane] fn process(&self, token: Desktop64) -> f32 { self.data[0] // Error: `self` not available in inner function } } }
The macro generates an inner function where self becomes a regular parameter.
The Solution: _self = Type
Pass _self = Type as an argument to the macro, then reference _self (not self) in the body:
#![allow(unused)] fn main() { use archmage::{Desktop64, arcane}; use magetypes::simd::f32x8; struct Vector8([f32; 8]); impl Vector8 { #[arcane(_self = Vector8)] fn magnitude(&self, token: Desktop64) -> f32 { // Use _self, not self let v = f32x8::from_array(token, _self.0); (v * v).reduce_add().sqrt() } } }
All Receiver Types
&self (Shared Reference)
#![allow(unused)] fn main() { impl Vector8 { #[arcane(_self = Vector8)] fn dot(&self, token: Desktop64, other: &Self) -> f32 { let a = f32x8::from_array(token, _self.0); let b = f32x8::from_array(token, other.0); (a * b).reduce_add() } } }
&mut self (Mutable Reference)
#![allow(unused)] fn main() { impl Vector8 { #[arcane(_self = Vector8)] fn normalize(&mut self, token: Desktop64) { let v = f32x8::from_array(token, _self.0); let len = (v * v).reduce_add().sqrt(); if len > 0.0 { let inv = f32x8::splat(token, 1.0 / len); let normalized = v * inv; _self.0 = normalized.to_array(); } } } }
self (By Value)
#![allow(unused)] fn main() { impl Vector8 { #[arcane(_self = Vector8)] fn scaled(self, token: Desktop64, factor: f32) -> Self { let v = f32x8::from_array(token, _self.0); let s = f32x8::splat(token, factor); Vector8((v * s).to_array()) } } }
Trait Implementations
Works with traits too:
#![allow(unused)] fn main() { trait SimdOps { fn double(&self, token: Desktop64) -> Self; } impl SimdOps for Vector8 { #[arcane(_self = Vector8)] fn double(&self, token: Desktop64) -> Self { let v = f32x8::from_array(token, _self.0); Vector8((v + v).to_array()) } } }
Why _self?
The name _self reminds you that:
- You're not using the normal self keyword
- The macro has transformed the function
- You need to be explicit about the type
It's a deliberate choice to make the transformation visible.
Generated Code
#![allow(unused)] fn main() { // You write: impl Vector8 { #[arcane(_self = Vector8)] fn process(&self, token: Desktop64) -> f32 { f32x8::from_array(token, _self.0).reduce_add() } } // Macro generates: impl Vector8 { fn process(&self, token: Desktop64) -> f32 { #[target_feature(enable = "avx2,fma,bmi1,bmi2")] #[inline] unsafe fn __inner(_self: &Vector8, token: Desktop64) -> f32 { f32x8::from_array(token, _self.0).reduce_add() } unsafe { __inner(self, token) } } } }
Common Patterns
Builder Pattern
#![allow(unused)] fn main() { impl ImageProcessor { #[arcane(_self = ImageProcessor)] fn with_brightness(self, token: Desktop64, amount: f32) -> Self { let mut result = _self; // Process brightness... result } #[arcane(_self = ImageProcessor)] fn with_contrast(self, token: Desktop64, amount: f32) -> Self { let mut result = _self; // Process contrast... result } } // Usage let processed = processor .with_brightness(token, 1.2) .with_contrast(token, 1.1); }
Mutable Iteration
#![allow(unused)] fn main() { impl Buffer { #[arcane(_self = Buffer)] fn process_all(&mut self, token: Desktop64) { for chunk in _self.data.chunks_exact_mut(8) { let v = f32x8::from_slice(token, chunk); let result = v * v; result.store_slice(chunk); } } } }
LLVM Optimization Boundaries
Understanding when LLVM can and cannot optimize across function calls is crucial for peak SIMD performance.
The Problem
#[target_feature] changes LLVM's target settings for a function. When caller and callee have different settings, LLVM cannot optimize across the boundary.
#![allow(unused)] fn main() { // Generic caller - baseline target settings fn dispatch<T: IntoConcreteToken>(token: T, data: &[f32]) { if let Some(t) = token.as_x64v3() { process_avx2(t, data); // Different LLVM target! } } // AVX2 callee - AVX2 target settings #[arcane] fn process_avx2(token: X64V3Token, data: &[f32]) { // Can't inline back into dispatch() } }
Why It Matters
SIMD performance depends heavily on:
- Inlining — Avoids function call overhead
- Register allocation — Keeps values in SIMD registers
- Instruction scheduling — Reorders for pipeline efficiency
All of these break at target feature boundaries.
Good: Same Token Type Chain
#![allow(unused)] fn main() { #[arcane] fn outer(token: X64V3Token, data: &[f32]) -> f32 { let a = step1(token, data); // Same token → inlines let b = step2(token, data); // Same token → inlines a + b } #[arcane] fn step1(token: X64V3Token, data: &[f32]) -> f32 { // Shares LLVM target settings with outer // Can inline, share registers, optimize across boundary } #[arcane] fn step2(token: X64V3Token, data: &[f32]) -> f32 { // Same deal } }
Good: Downcast (Higher → Lower)
#![allow(unused)] fn main() { #[arcane] fn v4_main(token: X64V4Token, data: &[f32]) -> f32 { // Calling V3 function with V4 token // V4 is superset of V3, LLVM can still optimize v3_helper(token, data) } #[arcane] fn v3_helper(token: X64V3Token, data: &[f32]) -> f32 { // This inlines properly } }
Bad: Generic Boundary
#![allow(unused)] fn main() { // Generic function - no target features fn generic<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 { // This is compiled with baseline settings if let Some(t) = token.as_x64v3() { concrete(t, data) // BOUNDARY - can't inline back } else { 0.0 } } #[arcane] fn concrete(token: X64V3Token, data: &[f32]) -> f32 { // This has AVX2 settings // LLVM won't inline this into generic() } }
Bad: Upcast Check in Hot Code
#![allow(unused)] fn main() { #[arcane] fn process(token: X64V3Token, data: &[f32]) -> f32 { // WRONG: Checking for higher tier inside hot function for chunk in data.chunks(8) { if let Some(v4) = token.as_x64v4() { // Wait, this always fails! // V3 token can't become V4 } } } }
Even when the check makes sense, it's an optimization barrier.
Pattern: Dispatch at Entry
#![allow(unused)] fn main() { // Public API - dispatch happens here pub fn process(data: &[f32]) -> f32 { // Summon and dispatch ONCE #[cfg(feature = "avx512")] if let Some(token) = X64V4Token::summon() { return process_v4(token, data); } if let Some(token) = X64V3Token::summon() { return process_v3(token, data); } process_scalar(data) } // Each implementation is self-contained #[arcane] fn process_v4(token: X64V4Token, data: &[f32]) -> f32 { // All V4 code, fully optimizable let result = step1_v4(token, data); step2_v4(token, result) } #[arcane] fn step1_v4(token: X64V4Token, data: &[f32]) -> f32 { /* ... */ } #[arcane] fn step2_v4(token: X64V4Token, result: f32) -> f32 { /* ... */ } }
Pattern: Trait with Concrete Impls
#![allow(unused)] fn main() { trait Processor { fn process(&self, data: &[f32]) -> f32; } struct V3Processor(X64V3Token); impl Processor for V3Processor { fn process(&self, data: &[f32]) -> f32 { // Note: this can't use #[arcane] on trait method // Call through to arcane function instead process_v3_impl(self.0, data) } } #[arcane] fn process_v3_impl(token: X64V3Token, data: &[f32]) -> f32 { // Full optimization here } }
Measuring the Impact
#![allow(unused)] fn main() { // Benchmark both patterns fn bench_generic_dispatch(c: &mut Criterion) { c.bench_function("generic", |b| { let token = Desktop64::summon().unwrap(); b.iter(|| generic_dispatch(token, &data)) }); } fn bench_concrete_dispatch(c: &mut Criterion) { c.bench_function("concrete", |b| { let token = Desktop64::summon().unwrap(); b.iter(|| concrete_path(token, &data)) }); } }
Typical impact: 10-30% performance difference for small functions.
Summary
| Pattern | Inlining | Recommendation |
|---|---|---|
| Same concrete token | ✅ Full | Best for hot paths |
| Downcast (V4→V3) | ✅ Full | Safe and fast |
| Generic → concrete | ❌ Boundary | Entry point only |
| Upcast check | ❌ Boundary | Avoid in hot code |
AVX-512 Patterns
AVX-512 provides 512-bit vectors and advanced features. Here's how to use it effectively with archmage.
Enabling AVX-512
Add the feature to your Cargo.toml:
[dependencies]
archmage = { version = "0.4", features = ["avx512"] }
magetypes = { version = "0.4", features = ["avx512"] }
AVX-512 Tokens
| Token | Features | CPUs |
|---|---|---|
X64V4Token | F, BW, CD, DQ, VL | Skylake-X, Zen 4 |
Avx512ModernToken | + VNNI, VBMI, IFMA, etc. | Ice Lake+, Zen 4+ |
Avx512Fp16Token | + FP16 | Sapphire Rapids |
Aliases:
- Server64 = X64V4Token
- Avx512Token = X64V4Token
Basic Usage
use archmage::{X64V4Token, SimdToken, arcane}; use magetypes::simd::f32x16; #[arcane] fn process_512(token: X64V4Token, data: &[f32; 16]) -> f32 { let v = f32x16::from_array(token, *data); (v * v).reduce_add() } fn main() { if let Some(token) = X64V4Token::summon() { let data = [1.0f32; 16]; let result = process_512(token, &data); println!("Result: {}", result); } else { println!("AVX-512 not available"); } }
512-bit Types
| Type | Elements | Intrinsic Type |
|---|---|---|
f32x16 | 16 × f32 | __m512 |
f64x8 | 8 × f64 | __m512d |
i32x16 | 16 × i32 | __m512i |
i64x8 | 8 × i64 | __m512i |
i16x32 | 32 × i16 | __m512i |
i8x64 | 64 × i8 | __m512i |
Masking
AVX-512's killer feature is per-lane masking:
#![allow(unused)] fn main() { use std::arch::x86_64::*; #[arcane] fn masked_add(token: X64V4Token, a: __m512, b: __m512, mask: __mmask16) -> __m512 { // Only add lanes where mask bit is 1 // Other lanes keep value from `a` _mm512_mask_add_ps(a, mask, a, b) } }
Tiered Fallback with AVX-512
#![allow(unused)] fn main() { pub fn process(data: &mut [f32]) { #[cfg(feature = "avx512")] if let Some(token) = X64V4Token::summon() { return process_avx512(token, data); } if let Some(token) = X64V3Token::summon() { return process_avx2(token, data); } process_scalar(data); } #[cfg(feature = "avx512")] #[arcane] fn process_avx512(token: X64V4Token, data: &mut [f32]) { for chunk in data.chunks_exact_mut(16) { let v = f32x16::from_slice(token, chunk); let result = v * v; result.store_slice(chunk); } // Handle remainder with AVX2 (V4 can downcast to V3) let remainder = data.chunks_exact_mut(16).into_remainder(); if !remainder.is_empty() { process_avx2(token, remainder); // Downcast works! } } #[arcane] fn process_avx2(token: X64V3Token, data: &mut [f32]) { for chunk in data.chunks_exact_mut(8) { let v = f32x8::from_slice(token, chunk); let result = v * v; result.store_slice(chunk); } for x in data.chunks_exact_mut(8).into_remainder() { *x = *x * *x; } } }
AVX-512 Performance Considerations
Frequency Throttling
Heavy AVX-512 use can cause CPU frequency throttling:
- Light AVX-512: Minimal impact
- Heavy 512-bit ops: Up to 20% frequency reduction
- Heavy 512-bit + FMA: Up to 30% reduction
For short bursts, this doesn't matter. For sustained workloads, consider whether 256-bit operations end up faster thanks to the higher sustained clock frequency.
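One way to act on this is to route small inputs to the 256-bit path and reserve 512-bit vectors for large buffers. The sketch below reuses the process_avx512, process_avx2, and process_scalar helpers from the tiered-fallback example above; process_adaptive and its 4096-element cutoff are purely illustrative and should be tuned with benchmarks.
#[cfg(feature = "avx512")]
use archmage::X64V4Token;
use archmage::{SimdToken, X64V3Token};

pub fn process_adaptive(data: &mut [f32]) {
    // Illustrative cutoff: below this, the 512-bit path rarely pays for itself.
    #[cfg(feature = "avx512")]
    if data.len() >= 4096 {
        if let Some(token) = X64V4Token::summon() {
            return process_avx512(token, data);
        }
    }
    if let Some(token) = X64V3Token::summon() {
        return process_avx2(token, data);
    }
    process_scalar(data);
}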
When AVX-512 Wins
- Large data: Processing 16 floats vs 8 is 2× work per instruction
- Masked operations: No equivalent in AVX2
- Gather/scatter: Much faster than AVX2
- Specific instructions: VPTERNLOG, conflict detection, etc.
When AVX2 Might Win
- Short bursts: Throttling overhead not amortized
- Memory-bound code: Wider vectors don't help if waiting for RAM
- Mixed workloads: Frequency penalty affects scalar code too
Checking for AVX-512
#![allow(unused)] fn main() { use archmage::{X64V4Token, SimdToken}; fn check_avx512() { match X64V4Token::guaranteed() { Some(true) => println!("Compile-time AVX-512"), Some(false) => println!("Not x86-64"), None => { if X64V4Token::summon().is_some() { println!("Runtime AVX-512 available"); } else { println!("No AVX-512"); } } } } }
Example: Matrix Multiply
#![allow(unused)] fn main() { #[cfg(feature = "avx512")] #[arcane] fn matmul_4x4_avx512( token: X64V4Token, a: &[[f32; 4]; 4], b: &[[f32; 4]; 4], c: &mut [[f32; 4]; 4] ) { use std::arch::x86_64::*; // Load B columns into registers let b_col0 = _mm512_set_ps( b[3][0], b[2][0], b[1][0], b[0][0], b[3][0], b[2][0], b[1][0], b[0][0], b[3][0], b[2][0], b[1][0], b[0][0], b[3][0], b[2][0], b[1][0], b[0][0] ); // ... broadcast and FMA pattern } }
WASM SIMD
WebAssembly SIMD128 provides 128-bit vectors in the browser and WASI environments.
Setup
Enable SIMD128 in your build:
RUSTFLAGS="-Ctarget-feature=+simd128" cargo build --target wasm32-unknown-unknown
Or in .cargo/config.toml:
[target.wasm32-unknown-unknown]
rustflags = ["-Ctarget-feature=+simd128"]
The Token
#![allow(unused)] fn main() { use archmage::{Simd128Token, SimdToken}; fn check_wasm_simd() { if let Some(token) = Simd128Token::summon() { process_simd(token, &data); } else { process_scalar(&data); } } }
Note: On WASM, Simd128Token::summon() succeeds if the binary was compiled with SIMD128 support. There's no runtime feature detection in WASM—the capability is determined at compile time.
Available Types
| Type | Elements |
|---|---|
f32x4 | 4 × f32 |
f64x2 | 2 × f64 |
i32x4 | 4 × i32 |
i64x2 | 2 × i64 |
i16x8 | 8 × i16 |
i8x16 | 16 × i8 |
u32x4 | 4 × u32 |
u64x2 | 2 × u64 |
u16x8 | 8 × u16 |
u8x16 | 16 × u8 |
Basic Usage
#![allow(unused)] fn main() { use archmage::{Simd128Token, arcane}; use magetypes::simd::f32x4; #[arcane] fn dot_product(token: Simd128Token, a: &[f32; 4], b: &[f32; 4]) -> f32 { let va = f32x4::from_array(token, *a); let vb = f32x4::from_array(token, *b); (va * vb).reduce_add() } }
Cross-Platform Code
Write once, run on x86, ARM, and WASM:
#![allow(unused)] fn main() { use archmage::{Desktop64, NeonToken, Simd128Token, SimdToken, incant}; // Define platform-specific implementations #[cfg(target_arch = "x86_64")] #[arcane] fn sum_v3(token: Desktop64, data: &[f32; 8]) -> f32 { use magetypes::simd::f32x8; f32x8::from_array(token, *data).reduce_add() } #[cfg(target_arch = "aarch64")] #[arcane] fn sum_neon(token: NeonToken, data: &[f32; 8]) -> f32 { use magetypes::simd::f32x4; let a = f32x4::from_slice(token, &data[0..4]); let b = f32x4::from_slice(token, &data[4..8]); a.reduce_add() + b.reduce_add() } #[cfg(target_arch = "wasm32")] #[arcane] fn sum_wasm128(token: Simd128Token, data: &[f32; 8]) -> f32 { use magetypes::simd::f32x4; let a = f32x4::from_slice(token, &data[0..4]); let b = f32x4::from_slice(token, &data[4..8]); a.reduce_add() + b.reduce_add() } fn sum_scalar(data: &[f32; 8]) -> f32 { data.iter().sum() } // Public API pub fn sum(data: &[f32; 8]) -> f32 { incant!(sum(data)) } }
WASM-Specific Considerations
No Runtime Detection
Unlike x86/ARM, WASM doesn't have runtime feature detection. The SIMD support is baked in at compile time:
#![allow(unused)] fn main() { // On WASM, this is always the same result // (based on compile-time -Ctarget-feature=+simd128) let has_simd = Simd128Token::summon().is_some(); }
Browser Compatibility
WASM SIMD is supported in:
- Chrome 91+ (May 2021)
- Firefox 89+ (June 2021)
- Safari 16.4+ (March 2023)
- Node.js 16.4+
For older browsers, provide a non-SIMD fallback WASM binary.
Relaxed SIMD
WASM also has "relaxed SIMD" with even more instructions. As of 2024, this requires additional flags:
RUSTFLAGS="-Ctarget-feature=+simd128,+relaxed-simd" cargo build
Example: Image Processing in Browser
#![allow(unused)] fn main() { use wasm_bindgen::prelude::*; use archmage::{Simd128Token, SimdToken, arcane}; use magetypes::simd::u8x16; #[wasm_bindgen] pub fn brighten_image(pixels: &mut [u8], amount: u8) { if let Some(token) = Simd128Token::summon() { brighten_simd(token, pixels, amount); } else { brighten_scalar(pixels, amount); } } #[arcane] fn brighten_simd(token: Simd128Token, pixels: &mut [u8], amount: u8) { let add = u8x16::splat(token, amount); for chunk in pixels.chunks_exact_mut(16) { let v = u8x16::from_slice(token, chunk); let bright = v.saturating_add(add); bright.store_slice(chunk); } // Handle remainder for pixel in pixels.chunks_exact_mut(16).into_remainder() { *pixel = pixel.saturating_add(amount); } } fn brighten_scalar(pixels: &mut [u8], amount: u8) { for pixel in pixels { *pixel = pixel.saturating_add(amount); } } }
Testing WASM Code
Use wasm-pack test:
wasm-pack test --node
Or test natively with the scalar fallback:
#![allow(unused)] fn main() { #[test] fn test_sum() { let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]; let result = sum(&data); assert_eq!(result, 36.0); } }
Token Reference
Complete reference for all archmage tokens.
x86-64 Tokens
X64V2Token
Features: SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT
CPUs: Intel Nehalem (2008)+, AMD Bulldozer (2011)+
#![allow(unused)] fn main() { use archmage::{X64V2Token, SimdToken}; if let Some(token) = X64V2Token::summon() { // 128-bit SSE operations } }
X64V3Token / Desktop64 / Avx2FmaToken
Features: All V2 + AVX, AVX2, FMA, BMI1, BMI2, F16C, MOVBE
CPUs: Intel Haswell (2013)+, AMD Zen 1 (2017)+
#![allow(unused)] fn main() { use archmage::{X64V3Token, Desktop64, SimdToken}; // These are the same type: let t1: Option<X64V3Token> = X64V3Token::summon(); let t2: Option<Desktop64> = Desktop64::summon(); }
Aliases:
- Desktop64 — Friendly name for typical desktop/laptop CPUs
- Avx2FmaToken — Legacy name (deprecated)
X64V4Token / Server64 / Avx512Token
Features: All V3 + AVX-512F, AVX-512BW, AVX-512CD, AVX-512DQ, AVX-512VL
CPUs: Intel Skylake-X (2017)+, AMD Zen 4 (2022)+
Requires: avx512 feature
#![allow(unused)] fn main() { #[cfg(feature = "avx512")] use archmage::{X64V4Token, Server64, SimdToken}; if let Some(token) = X64V4Token::summon() { // 512-bit AVX-512 operations } }
Aliases:
- Server64 — Friendly name for server CPUs
- Avx512Token — Direct alias
Avx512ModernToken
Features: All V4 + VPOPCNTDQ, IFMA, VBMI, VNNI, BF16, VBMI2, BITALG, VPCLMULQDQ, GFNI, VAES
CPUs: Intel Ice Lake (2019)+, AMD Zen 4 (2022)+
Requires: avx512 feature
#![allow(unused)] fn main() { #[cfg(feature = "avx512")] if let Some(token) = Avx512ModernToken::summon() { // Modern AVX-512 extensions } }
Avx512Fp16Token
Features: AVX-512FP16
CPUs: Intel Sapphire Rapids (2023)+
Requires: avx512 feature
#![allow(unused)] fn main() { #[cfg(feature = "avx512")] if let Some(token) = Avx512Fp16Token::summon() { // Native FP16 operations } }
AArch64 Tokens
NeonToken / Arm64
Features: NEON (always available on AArch64)
CPUs: All 64-bit ARM processors
#![allow(unused)] fn main() { use archmage::{NeonToken, Arm64, SimdToken}; // Always succeeds on AArch64 let token = NeonToken::summon().unwrap(); }
Alias: Arm64
NeonAesToken
Features: NEON + AES
CPUs: Most ARMv8 processors with crypto extensions
#![allow(unused)] fn main() { if let Some(token) = NeonAesToken::summon() { // AES acceleration available } }
NeonSha3Token
Features: NEON + SHA3
CPUs: ARMv8.2+ with SHA3 extension
#![allow(unused)] fn main() { if let Some(token) = NeonSha3Token::summon() { // SHA3 acceleration available } }
NeonCrcToken
Features: NEON + CRC
CPUs: Most ARMv8 processors
#![allow(unused)] fn main() { if let Some(token) = NeonCrcToken::summon() { // CRC32 acceleration available } }
WASM Token
Simd128Token
Features: WASM SIMD128
Requires: Compile with -Ctarget-feature=+simd128
#![allow(unused)] fn main() { use archmage::{Simd128Token, SimdToken}; if let Some(token) = Simd128Token::summon() { // WASM SIMD128 operations } }
Universal Token
ScalarToken
Features: None (pure scalar fallback)
Availability: Always, on all platforms
#![allow(unused)] fn main() { use archmage::{ScalarToken, SimdToken}; // Always succeeds let token = ScalarToken::summon().unwrap(); // Or construct directly let token = ScalarToken; }
SimdToken Trait
All tokens implement SimdToken:
#![allow(unused)] fn main() { pub trait SimdToken: Copy + Clone + Send + Sync + 'static { const NAME: &'static str; /// Compile-time guarantee check fn guaranteed() -> Option<bool>; /// Runtime detection fn summon() -> Option<Self>; /// Alias for summon() fn attempt() -> Option<Self>; /// Legacy alias (deprecated) fn try_new() -> Option<Self>; /// Unsafe construction (deprecated) unsafe fn forge_token_dangerously() -> Self; } }
guaranteed()
Returns:
- Some(true) — Feature is compile-time guaranteed (e.g., -Ctarget-cpu=haswell)
- Some(false) — Wrong architecture (token can never exist)
- None — Runtime check needed
#![allow(unused)] fn main() { match Desktop64::guaranteed() { Some(true) => { // summon() will always succeed, check is elided let token = Desktop64::summon().unwrap(); } Some(false) => { // Wrong arch, use fallback } None => { // Need runtime check if let Some(token) = Desktop64::summon() { // ... } } } }
summon()
Performs runtime CPU feature detection. Returns Some(token) if features are available.
#![allow(unused)] fn main() { if let Some(token) = Desktop64::summon() { // CPU supports AVX2+FMA } }
Token Size
All tokens are zero-sized:
#![allow(unused)] fn main() { use std::mem::size_of; assert_eq!(size_of::<X64V3Token>(), 0); assert_eq!(size_of::<NeonToken>(), 0); assert_eq!(size_of::<ScalarToken>(), 0); }
Passing tokens has zero runtime cost.
Trait Reference
Reference for archmage traits.
SimdToken
The base trait for all capability tokens.
#![allow(unused)] fn main() { pub trait SimdToken: Copy + Clone + Send + Sync + 'static { const NAME: &'static str; fn guaranteed() -> Option<bool>; fn summon() -> Option<Self>; fn attempt() -> Option<Self>; } }
Implementors: All token types
IntoConcreteToken
Enables compile-time dispatch via type checking.
#![allow(unused)] fn main() { pub trait IntoConcreteToken: SimdToken { fn as_x64v2(self) -> Option<X64V2Token> { None } fn as_x64v3(self) -> Option<X64V3Token> { None } fn as_x64v4(self) -> Option<X64V4Token> { None } fn as_avx512_modern(self) -> Option<Avx512ModernToken> { None } fn as_avx512_fp16(self) -> Option<Avx512Fp16Token> { None } fn as_neon(self) -> Option<NeonToken> { None } fn as_neon_aes(self) -> Option<NeonAesToken> { None } fn as_neon_sha3(self) -> Option<NeonSha3Token> { None } fn as_neon_crc(self) -> Option<NeonCrcToken> { None } fn as_wasm128(self) -> Option<Simd128Token> { None } fn as_scalar(self) -> Option<ScalarToken> { None } } }
Usage:
#![allow(unused)] fn main() { fn dispatch<T: IntoConcreteToken>(token: T, data: &[f32]) -> f32 { if let Some(t) = token.as_x64v3() { process_avx2(t, data) } else if let Some(t) = token.as_neon() { process_neon(t, data) } else { process_scalar(data) } } }
Each concrete token returns Some(self) for its own method, None for others. The compiler eliminates dead branches.
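For intuition, the impl for a concrete token only overrides its own accessor and inherits the None defaults for everything else. The sketch below is illustrative of the mechanism, not the crate's actual source, and such an impl can only live inside archmage itself since both the trait and the token types are defined there.
// Conceptual sketch (inside the archmage crate, where these types are local):
impl IntoConcreteToken for NeonToken {
    // NeonToken answers its own accessor with itself...
    fn as_neon(self) -> Option<NeonToken> {
        Some(self)
    }
    // ...and keeps the trait's default `None` bodies for every other accessor,
    // so non-NEON branches in a generic dispatcher are provably dead and
    // get eliminated during monomorphization.
}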
Tier Traits
HasX64V2
Marker trait for tokens that provide x86-64-v2 features (SSE4.2+).
#![allow(unused)] fn main() { pub trait HasX64V2: SimdToken {} }
Implementors: X64V2Token, X64V3Token, X64V4Token, Avx512ModernToken, Avx512Fp16Token
Usage:
#![allow(unused)] fn main() { fn process<T: HasX64V2>(token: T, data: &[f32]) { // Can use SSE4.2 intrinsics } }
HasX64V4
Marker trait for tokens that provide x86-64-v4 features (AVX-512).
#![allow(unused)] fn main() { #[cfg(feature = "avx512")] pub trait HasX64V4: SimdToken {} }
Implementors: X64V4Token, Avx512ModernToken, Avx512Fp16Token
Requires: avx512 feature
HasNeon
Marker trait for tokens that provide NEON features.
#![allow(unused)] fn main() { pub trait HasNeon: SimdToken {} }
Implementors: NeonToken, NeonAesToken, NeonSha3Token, NeonCrcToken
HasNeonAes
Marker trait for tokens that provide NEON + AES features.
#![allow(unused)] fn main() { pub trait HasNeonAes: HasNeon {} }
Implementors: NeonAesToken
HasNeonSha3
Marker trait for tokens that provide NEON + SHA3 features.
#![allow(unused)] fn main() { pub trait HasNeonSha3: HasNeon {} }
Implementors: NeonSha3Token
Width Traits (Deprecated)
Warning: These traits are misleading and should not be used in new code.
Has128BitSimd (Deprecated)
Only enables SSE/SSE2 (x86 baseline). Does NOT enable SSE4, AVX, or anything useful beyond baseline.
Use instead: HasX64V2 or concrete tokens
Has256BitSimd (Deprecated)
Only enables AVX (NOT AVX2, NOT FMA). This is almost never what you want.
Use instead: X64V3Token or Desktop64
Has512BitSimd (Deprecated)
Only enables AVX-512F. Missing critical AVX-512 extensions.
Use instead: X64V4Token or HasX64V4
magetypes Traits
SimdTypes
Associates SIMD types with a token.
#![allow(unused)] fn main() { pub trait SimdTypes { type F32: SimdFloat; type F64: SimdFloat; type I32: SimdInt; type I64: SimdInt; // ... } }
Usage:
#![allow(unused)] fn main() { fn process<T: SimdTypes>(token: T, data: &[f32]) { let v = T::F32::splat(1.0); // ... } }
WidthDispatch
Provides access to all SIMD widths from any token.
#![allow(unused)] fn main() { pub trait WidthDispatch { fn w128(&self) -> W128Types; fn w256(&self) -> Option<W256Types>; fn w512(&self) -> Option<W512Types>; } }
Using Traits Correctly
Prefer Concrete Tokens
#![allow(unused)] fn main() { // GOOD: Concrete token, full optimization fn process(token: X64V3Token, data: &[f32]) { } // OK: Trait bound, but optimization boundary fn process<T: HasX64V2>(token: T, data: &[f32]) { } }
Trait Bounds at API Boundaries
#![allow(unused)] fn main() { // Public API can be generic pub fn process<T: IntoConcreteToken>(token: T, data: &[f32]) { // But dispatch to concrete implementations if let Some(t) = token.as_x64v3() { process_avx2(t, data); } } // Internal implementations use concrete tokens #[arcane] fn process_avx2(token: X64V3Token, data: &[f32]) { } }
Don't Over-Constrain
#![allow(unused)] fn main() { // WRONG: Over-constrained, hard to call fn process<T: HasX64V2 + HasNeon>(token: T, data: &[f32]) { // No token implements both! } // RIGHT: Use IntoConcreteToken for multi-platform fn process<T: IntoConcreteToken>(token: T, data: &[f32]) { if let Some(t) = token.as_x64v3() { // x86 path } else if let Some(t) = token.as_neon() { // ARM path } } }
Feature Flags
Reference for Cargo feature flags in archmage and magetypes.
archmage Features
std (default)
Enables standard library support.
[dependencies]
archmage = "0.4" # std enabled by default
Disable for no_std:
[dependencies]
archmage = { version = "0.4", default-features = false }
macros (default)
Enables procedural macros: #[arcane], #[rite], #[magetypes], incant!, etc.
# Disable macros (rare)
archmage = { version = "0.4", default-features = false, features = ["std"] }
avx512
Enables AVX-512 tokens and 512-bit types.
archmage = { version = "0.4", features = ["avx512"] }
Unlocks:
- X64V4Token / Server64 / Avx512Token
- Avx512ModernToken
- Avx512Fp16Token
- HasX64V4 trait
safe_unaligned_simd
Re-exports safe_unaligned_simd crate in the prelude.
archmage = { version = "0.4", features = ["safe_unaligned_simd"] }
Then use:
#![allow(unused)] fn main() { use archmage::prelude::*; // safe_unaligned_simd functions available }
magetypes Features
std (default)
Standard library support.
avx512
Enables 512-bit types.
magetypes = { version = "0.4", features = ["avx512"] }
Unlocks:
- f32x16, f64x8
- i32x16, i64x8
- i16x32, i8x64
- u32x16, u64x8, u16x32, u8x64
Feature Combinations
Full-Featured x86
[dependencies]
archmage = { version = "0.4", features = ["avx512", "safe_unaligned_simd"] }
magetypes = { version = "0.4", features = ["avx512"] }
Minimal no_std
[dependencies]
archmage = { version = "0.4", default-features = false, features = ["macros"] }
magetypes = { version = "0.4", default-features = false }
Cross-Platform Library
[dependencies]
archmage = "0.4"
magetypes = "0.4"
[features]
default = ["std"]
std = ["archmage/std", "magetypes/std"]
avx512 = ["archmage/avx512", "magetypes/avx512"]
Cargo Feature vs CPU Feature
Don't confuse Cargo features with CPU features:
| Cargo Feature | Effect |
|---|---|
| avx512 | Compiles AVX-512 code paths |
| (none) | AVX-512 code paths are not compiled |
| CPU Feature | Effect |
|---|---|
| AVX-512 | CPU can execute AVX-512 instructions |
| (none) | Runtime fallback to other path |
#![allow(unused)] fn main() { // Cargo feature controls compilation #[cfg(feature = "avx512")] fn avx512_path(token: X64V4Token, data: &[f32]) { } // Token controls runtime dispatch if let Some(token) = X64V4Token::summon() { // Runtime check avx512_path(token, data); } }
RUSTFLAGS
Not Cargo features, but important compiler flags:
-Ctarget-cpu=native
Compile for current CPU:
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release
Effects:
- Token::guaranteed() returns Some(true) for supported features
- summon() becomes a no-op
- LLVM generates optimal code for your CPU
-Ctarget-cpu=<name>
Compile for specific CPU:
# Haswell = AVX2+FMA
RUSTFLAGS="-Ctarget-cpu=haswell" cargo build --release
# Skylake-AVX512 = AVX-512
RUSTFLAGS="-Ctarget-cpu=skylake-avx512" cargo build --release
-Ctarget-feature=+<feature>
Enable specific features:
# Just AVX2
RUSTFLAGS="-Ctarget-feature=+avx2" cargo build
# WASM SIMD
RUSTFLAGS="-Ctarget-feature=+simd128" cargo build --target wasm32-unknown-unknown
docs.rs Configuration
For docs.rs to show all features:
# Cargo.toml
[package.metadata.docs.rs]
all-features = true
rustdoc-args = ["--cfg", "docsrs"]
#![allow(unused)] fn main() { // lib.rs #![cfg_attr(docsrs, feature(doc_cfg))] #[cfg(feature = "avx512")] #[cfg_attr(docsrs, doc(cfg(feature = "avx512")))] pub use tokens::X64V4Token; }