AVX-512 Patterns
AVX-512 provides 512-bit vectors and advanced features. Here's how to use it effectively with archmage.
Enabling AVX-512
Add the feature to your Cargo.toml:
```toml
[dependencies]
archmage = { version = "0.4", features = ["avx512"] }
magetypes = { version = "0.4", features = ["avx512"] }
```
AVX-512 Tokens
| Token | Features | CPUs |
|---|---|---|
| X64V4Token | F, BW, CD, DQ, VL | Skylake-X, Zen 4 |
| Avx512ModernToken | + VNNI, VBMI, IFMA, etc. | Ice Lake+, Zen 4+ |
| Avx512Fp16Token | + FP16 | Sapphire Rapids |
Aliases:
- Server64 = X64V4Token
- Avx512Token = X64V4Token
Basic Usage
```rust
use archmage::{X64V4Token, SimdToken, arcane};
use magetypes::simd::f32x16;

#[arcane]
fn process_512(token: X64V4Token, data: &[f32; 16]) -> f32 {
    // Load all 16 floats into one 512-bit vector, square, then sum the lanes.
    let v = f32x16::from_array(token, *data);
    (v * v).reduce_add()
}

fn main() {
    if let Some(token) = X64V4Token::summon() {
        let data = [1.0f32; 16];
        let result = process_512(token, &data);
        println!("Result: {}", result);
    } else {
        println!("AVX-512 not available");
    }
}
```
512-bit Types
| Type | Elements | Intrinsic Type |
|---|---|---|
| f32x16 | 16 × f32 | __m512 |
| f64x8 | 8 × f64 | __m512d |
| i32x16 | 16 × i32 | __m512i |
| i64x8 | 8 × i64 | __m512i |
| i16x32 | 32 × i16 | __m512i |
| i8x64 | 64 × i8 | __m512i |
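The lane counts in the table follow directly from the vector width: a 512-bit register holds 64 bytes, so the lane count is 64 divided by the element size. The helper below is a hypothetical illustration of that arithmetic (`lanes_512` is not part of archmage or magetypes):

```rust
use std::mem::size_of;

/// Number of lanes of element type `T` in one 512-bit (64-byte) vector.
/// Hypothetical helper for illustration; it is not meaningful for zero-sized types.
const fn lanes_512<T>() -> usize {
    64 / size_of::<T>()
}
```

For example, `lanes_512::<f32>()` evaluates to 16 and `lanes_512::<i8>()` to 64, matching the `f32x16` and `i8x64` rows.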
Masking
AVX-512's killer feature is per-lane masking:
```rust
use archmage::{X64V4Token, arcane};
use std::arch::x86_64::*;

#[arcane]
fn masked_add(token: X64V4Token, a: __m512, b: __m512, mask: __mmask16) -> __m512 {
    // Only add lanes where the mask bit is 1;
    // other lanes keep their value from `a` (the first argument is the pass-through source).
    _mm512_mask_add_ps(a, mask, a, b)
}
```
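To make the masking semantics concrete, here is a scalar sketch of what `_mm512_mask_add_ps(src, k, a, b)` computes per lane. It uses plain arrays and a `u16` in place of `__m512` and `__mmask16`, and the function name `mask_add_model` is made up for illustration:

```rust
/// Scalar model of `_mm512_mask_add_ps(src, k, a, b)`:
/// lane i becomes a[i] + b[i] when bit i of `k` is set, otherwise it keeps src[i].
fn mask_add_model(src: [f32; 16], k: u16, a: [f32; 16], b: [f32; 16]) -> [f32; 16] {
    let mut out = src;
    for lane in 0..16 {
        if (k >> lane) & 1 == 1 {
            out[lane] = a[lane] + b[lane];
        }
    }
    out
}
```

A model like this can also serve as a reference when unit-testing masked SIMD code.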
Tiered Fallback with AVX-512
```rust
use archmage::{X64V3Token, SimdToken, arcane};
use magetypes::simd::f32x8;
#[cfg(feature = "avx512")]
use archmage::X64V4Token;
#[cfg(feature = "avx512")]
use magetypes::simd::f32x16;

pub fn process(data: &mut [f32]) {
    // Try the widest tier first, then fall back.
    #[cfg(feature = "avx512")]
    if let Some(token) = X64V4Token::summon() {
        return process_avx512(token, data);
    }

    if let Some(token) = X64V3Token::summon() {
        return process_avx2(token, data);
    }

    process_scalar(data);
}

#[cfg(feature = "avx512")]
#[arcane]
fn process_avx512(token: X64V4Token, data: &mut [f32]) {
    let mut chunks = data.chunks_exact_mut(16);
    for chunk in &mut chunks {
        let v = f32x16::from_slice(token, chunk);
        let result = v * v;
        result.store_slice(chunk);
    }

    // Handle remainder with AVX2 (V4 can downcast to V3)
    let remainder = chunks.into_remainder();
    if !remainder.is_empty() {
        process_avx2(token, remainder); // Downcast works!
    }
}

#[arcane]
fn process_avx2(token: X64V3Token, data: &mut [f32]) {
    let mut chunks = data.chunks_exact_mut(8);
    for chunk in &mut chunks {
        let v = f32x8::from_slice(token, chunk);
        let result = v * v;
        result.store_slice(chunk);
    }

    // Fewer than 8 elements left: finish with scalar code.
    for x in chunks.into_remainder() {
        *x = *x * *x;
    }
}

// Scalar fallback: squares each element, matching the SIMD paths.
fn process_scalar(data: &mut [f32]) {
    for x in data.iter_mut() {
        *x = *x * *x;
    }
}
```
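The way the tiers divide a slice can be sketched with plain integer arithmetic: full 16-element chunks go to the 512-bit loop, the leftover is offered to the 8-element loop, and whatever remains is handled one element at a time. The function `split_tiers` below is a hypothetical helper for illustration only:

```rust
/// Returns (512-bit chunks, 256-bit chunks, scalar elements) for a slice of `len` f32s,
/// mirroring the 16-wide, then 8-wide, then scalar cascade described above.
fn split_tiers(len: usize) -> (usize, usize, usize) {
    let chunks_512 = len / 16;
    let rest = len % 16;
    let chunks_256 = rest / 8;
    let scalar = rest % 8;
    (chunks_512, chunks_256, scalar)
}
```

For a slice of 45 elements this gives 2 chunks of 16, 1 chunk of 8, and 5 scalar elements (32 + 8 + 5 = 45).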
AVX-512 Performance Considerations
Frequency Throttling
Heavy AVX-512 use can cause CPU frequency throttling:
- Light AVX-512: Minimal impact
- Heavy 512-bit ops: Up to 20% frequency reduction
- Heavy 512-bit + FMA: Up to 30% reduction
For short bursts, the lower clock itself costs little. For sustained workloads, benchmark whether 256-bit vectors are actually faster at the higher frequency.
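One way to settle the 256-bit versus 512-bit question is to time both kernels on the target machine. The harness below is a minimal sketch using only the standard library; `time_kernel` is a hypothetical name, and a serious comparison would run long enough for the CPU to reach its steady-state frequency (or use a dedicated benchmarking crate):

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `kernel` once as a warm-up, then `iters` timed times, and returns the
/// average time per call. An `iters` of 0 is treated as 1.
fn time_kernel<F: FnMut(&mut [f32])>(mut kernel: F, data: &mut [f32], iters: u32) -> Duration {
    let iters = iters.max(1);
    kernel(&mut *data); // warm-up: caches, page faults, frequency ramp
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from eliding the repeated calls.
        kernel(black_box(&mut *data));
    }
    start.elapsed() / iters
}
```

Calling this once with the 256-bit kernel and once with the 512-bit kernel, on identical input sizes, gives a first-order comparison. Note that the kernel mutates `data` on every call, so inputs should be chosen to stay in a sensible numeric range.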
When AVX-512 Wins
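- Large data: Processing 16 floats instead of 8 doubles the work per instruction
- Masked operations: AVX2 has no general per-lane masking, only blends and masked loads/stores
- Gather/scatter: AVX-512 adds scatter, which AVX2 lacks entirely, alongside wider gathers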
- Specific instructions: VPTERNLOG, conflict detection, etc.
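As an illustration of why VPTERNLOG is useful, the sketch below models it in scalar code: for every bit position, the three input bits form a 3-bit index (the first operand supplies the high bit) into the 8-bit immediate, which acts as a truth table. One instruction can therefore evaluate any three-input boolean function. The name `ternlog_model` is made up for this example and operates on a single `u32` rather than a full vector:

```rust
/// Scalar model of the VPTERNLOG bit logic on one 32-bit element.
/// For each bit, index = (a_bit << 2) | (b_bit << 1) | c_bit selects a bit of `imm8`.
fn ternlog_model(a: u32, b: u32, c: u32, imm8: u8) -> u32 {
    let mut out = 0u32;
    for bit in 0..32 {
        let idx = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1);
        out |= (((imm8 as u32) >> idx) & 1) << bit;
    }
    out
}
```

With this encoding, `0x96` is a three-way XOR and `0xCA` is a bitwise select (`a ? b : c`), each of which takes two or more instructions in AVX2.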
When AVX2 Might Win
- Short bursts: The frequency transition has a fixed cost that a few 512-bit instructions cannot amortize
- Memory-bound code: Wider vectors don't help if waiting for RAM
- Mixed workloads: Frequency penalty affects scalar code too
Checking for AVX-512
```rust
use archmage::{X64V4Token, SimdToken};

fn check_avx512() {
    match X64V4Token::guaranteed() {
        Some(true) => println!("Compile-time AVX-512"),
        Some(false) => println!("Not x86-64"),
        None => {
            // Not known at compile time: fall back to runtime detection.
            if X64V4Token::summon().is_some() {
                println!("Runtime AVX-512 available");
            } else {
                println!("No AVX-512");
            }
        }
    }
}
```
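For comparison, a similar runtime check can be written with the standard library alone. The sketch below tests only the five AVX-512 subsets listed for X64V4Token in the token table; it is an assumption that this approximates what `summon()` verifies, since the token may also require baseline features that are not checked here. The function name is hypothetical:

```rust
/// Runtime check for the AVX-512 subsets F, BW, CD, DQ and VL using std only.
#[cfg(target_arch = "x86_64")]
fn has_avx512_v4_subsets() -> bool {
    std::is_x86_feature_detected!("avx512f")
        && std::is_x86_feature_detected!("avx512bw")
        && std::is_x86_feature_detected!("avx512cd")
        && std::is_x86_feature_detected!("avx512dq")
        && std::is_x86_feature_detected!("avx512vl")
}

/// On other architectures AVX-512 is never available.
#[cfg(not(target_arch = "x86_64"))]
fn has_avx512_v4_subsets() -> bool {
    false
}
```

Unlike a token, the returned `bool` carries no proof in the type system, so the caller remains responsible for guarding any `unsafe` intrinsic calls with it.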
Example: Matrix Multiply
```rust
use archmage::{X64V4Token, arcane};

#[cfg(feature = "avx512")]
#[arcane]
fn matmul_4x4_avx512(
    token: X64V4Token,
    a: &[[f32; 4]; 4],
    b: &[[f32; 4]; 4],
    c: &mut [[f32; 4]; 4]
) {
    use std::arch::x86_64::*;

    // Load B columns into registers.
    // _mm512_set_ps takes lanes from highest to lowest, so column 0 of B
    // is repeated four times across the 16 lanes.
    let b_col0 = _mm512_set_ps(
        b[3][0], b[2][0], b[1][0], b[0][0],
        b[3][0], b[2][0], b[1][0], b[0][0],
        b[3][0], b[2][0], b[1][0], b[0][0],
        b[3][0], b[2][0], b[1][0], b[0][0]
    );
    // ... broadcast and FMA pattern
}
```
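When filling in a SIMD kernel like this, a scalar reference is handy for checking results. The sketch below assumes the kernel is meant to overwrite `c` with the product A × B (the truncated example does not say whether it overwrites or accumulates); `matmul_4x4_scalar` is a hypothetical name:

```rust
/// Scalar reference: c = a × b for 4×4 row-major matrices.
/// Assumes overwrite semantics rather than accumulation into `c`.
fn matmul_4x4_scalar(a: &[[f32; 4]; 4], b: &[[f32; 4]; 4], c: &mut [[f32; 4]; 4]) {
    for i in 0..4 {
        for j in 0..4 {
            let mut sum = 0.0f32;
            for k in 0..4 {
                sum += a[i][k] * b[k][j];
            }
            c[i][j] = sum;
        }
    }
}
```

Because FMA rounds once per multiply-add while this loop rounds twice, comparisons against a SIMD version should use a small tolerance rather than exact equality.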