AVX-512 Patterns

AVX-512 provides 512-bit vectors, per-lane masking, and instructions that have no AVX2 equivalent. This page shows how to use it effectively with archmage.

Enabling AVX-512

Enable the `avx512` feature on both crates in your Cargo.toml:

[dependencies]
archmage = { version = "0.4", features = ["avx512"] }
magetypes = { version = "0.4", features = ["avx512"] }

AVX-512 Tokens

Token	Features	CPUs
`X64V4Token`	F, BW, CD, DQ, VL	Skylake-X, Zen 4
`Avx512ModernToken`	+ VNNI, VBMI, IFMA, etc.	Ice Lake+, Zen 4+
`Avx512Fp16Token`	+ FP16	Sapphire Rapids

Aliases:

Server64 = X64V4Token
Avx512Token = X64V4Token

Basic Usage

use archmage::{X64V4Token, SimdToken, arcane};
use magetypes::simd::f32x16;

#[arcane]
fn process_512(token: X64V4Token, data: &[f32; 16]) -> f32 {
    let v = f32x16::from_array(token, *data);
    (v * v).reduce_add()
}

fn main() {
    if let Some(token) = X64V4Token::summon() {
        let data = [1.0f32; 16];
        let result = process_512(token, &data);
        println!("Result: {}", result);
    } else {
        println!("AVX-512 not available");
    }
}

512-bit Types

Type	Elements	Intrinsic Type
`f32x16`	16 × f32	`__m512`
`f64x8`	8 × f64	`__m512d`
`i32x16`	16 × i32	`__m512i`
`i64x8`	8 × i64	`__m512i`
`i16x32`	32 × i16	`__m512i`
`i8x64`	64 × i8	`__m512i`

Masking

AVX-512's killer feature is per-lane masking:

use archmage::{X64V4Token, arcane};
use std::arch::x86_64::*;

#[arcane]
fn masked_add(token: X64V4Token, a: __m512, b: __m512, mask: __mmask16) -> __m512 {
    // Lanes whose mask bit is 1 receive a + b.
    // Lanes whose mask bit is 0 keep the value from the first argument (`a`).
    _mm512_mask_add_ps(a, mask, a, b)
}
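
To make the masking semantics concrete, here is a scalar model of what `_mm512_mask_add_ps(src, mask, a, b)` computes, lane by lane. It is a sketch for reasoning and testing, not archmage API: `mask_add_ps_model` and `tail_mask` are hypothetical names, and the tail-mask helper illustrates one common use of masks (covering a final partial vector) that this page does not otherwise discuss.

```rust
/// Scalar model of a merge-masked 16-lane add.
/// Lane i becomes a[i] + b[i] when bit i of `mask` is set;
/// otherwise it keeps src[i].
pub fn mask_add_ps_model(src: [f32; 16], mask: u16, a: [f32; 16], b: [f32; 16]) -> [f32; 16] {
    let mut out = src;
    for lane in 0..16 {
        if (mask >> lane) & 1 == 1 {
            out[lane] = a[lane] + b[lane];
        }
    }
    out
}

/// Hypothetical helper: builds a mask with the lowest `remaining` bits set,
/// saturating at all 16 lanes.
pub fn tail_mask(remaining: usize) -> u16 {
    if remaining >= 16 {
        u16::MAX
    } else {
        (1u16 << remaining) - 1
    }
}
```

With `mask = 0b101`, only lanes 0 and 2 are updated; every other lane passes `src` through unchanged.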

Tiered Fallback with AVX-512

use archmage::{SimdToken, X64V3Token, arcane};
use magetypes::simd::f32x8;
#[cfg(feature = "avx512")]
use archmage::X64V4Token;
#[cfg(feature = "avx512")]
use magetypes::simd::f32x16;

pub fn process(data: &mut [f32]) {
    // Try the widest tier first, then fall back.
    #[cfg(feature = "avx512")]
    if let Some(token) = X64V4Token::summon() {
        return process_avx512(token, data);
    }

    if let Some(token) = X64V3Token::summon() {
        return process_avx2(token, data);
    }

    process_scalar(data);
}

#[cfg(feature = "avx512")]
#[arcane]
fn process_avx512(token: X64V4Token, data: &mut [f32]) {
    let mut chunks = data.chunks_exact_mut(16);
    for chunk in &mut chunks {
        let v = f32x16::from_slice(token, chunk);
        let result = v * v;
        result.store_slice(chunk);
    }
    // Handle the remainder (fewer than 16 elements) with AVX2 (V4 can downcast to V3)
    let remainder = chunks.into_remainder();
    if !remainder.is_empty() {
        process_avx2(token, remainder);  // Downcast works!
    }
}

#[arcane]
fn process_avx2(token: X64V3Token, data: &mut [f32]) {
    let mut chunks = data.chunks_exact_mut(8);
    for chunk in &mut chunks {
        let v = f32x8::from_slice(token, chunk);
        let result = v * v;
        result.store_slice(chunk);
    }
    // Fewer than 8 elements left: finish one at a time.
    for x in chunks.into_remainder() {
        *x = *x * *x;
    }
}

// Scalar fallback: the same squaring operation, one element at a time.
fn process_scalar(data: &mut [f32]) {
    for x in data {
        *x = *x * *x;
    }
}
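
The remainder cascade above (16-wide chunks, then 8-wide chunks, then single elements) can be checked without any SIMD hardware. The following is a plain-Rust sketch of that control flow only: `square_tiered` is a hypothetical name, the inner loops stand in for the vector kernels, and the returned counts exist purely so the split is observable.

```rust
/// Squares every element in place, walking the slice in 16-wide, then
/// 8-wide, then single-element steps. Returns how many elements each
/// tier handled, as (wide, mid, scalar).
pub fn square_tiered(data: &mut [f32]) -> (usize, usize, usize) {
    let (mut n16, mut n8, mut n1) = (0, 0, 0);

    // Tier 1: full 16-element chunks (what a 512-bit kernel would take).
    let mut wide = data.chunks_exact_mut(16);
    for chunk in &mut wide {
        for x in chunk.iter_mut() {
            *x *= *x;
        }
        n16 += 16;
    }

    // Tier 2: at most one full 8-element chunk is left over.
    let rest = wide.into_remainder();
    let mut mid = rest.chunks_exact_mut(8);
    for chunk in &mut mid {
        for x in chunk.iter_mut() {
            *x *= *x;
        }
        n8 += 8;
    }

    // Tier 3: fewer than 8 elements remain.
    for x in mid.into_remainder() {
        *x *= *x;
        n1 += 1;
    }

    (n16, n8, n1)
}
```

For a 27-element slice the tiers handle 16, 8, and 3 elements respectively, so no element is skipped or processed twice.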

AVX-512 Performance Considerations

Frequency Throttling

Heavy AVX-512 use can cause CPU frequency throttling:

Light AVX-512: Minimal impact
Heavy 512-bit ops: Up to 20% frequency reduction
Heavy 512-bit + FMA: Up to 30% reduction

Occasional light 512-bit instructions rarely matter. For sustained heavy workloads, benchmark against a 256-bit version: it may be faster overall because it runs at a higher clock frequency.

When AVX-512 Wins

Large data: Processing 16 floats vs 8 is 2× work per instruction
Masked operations: AVX2 has no general per-instruction masking, only blends and masked loads/stores
Gather/scatter: Much faster than AVX2, which has gather but no scatter at all
Specific instructions: VPTERNLOG, conflict detection, etc.
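
As an illustration of why VPTERNLOG is useful: it evaluates any three-input boolean function in one instruction, selected by an 8-bit truth table. The scalar model below sketches that behavior on 64-bit words. `ternlog` is a hypothetical helper written for this page, not an archmage or `std::arch` function.

```rust
/// Scalar model of a ternary-logic operation on 64-bit words.
/// For each bit position, the bits of a, b, and c form a 3-bit index
/// (a is the most significant bit), and the result bit is that index's
/// entry in the `imm8` truth table.
pub fn ternlog(a: u64, b: u64, c: u64, imm8: u8) -> u64 {
    let mut out = 0u64;
    for bit in 0..64 {
        let idx = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1);
        out |= ((u64::from(imm8) >> idx) & 1) << bit;
    }
    out
}
```

Under this model, `0x96` is a three-way XOR, `0xCA` is a bitwise select (`a ? b : c`), and `0xE8` is a majority vote. Each of these takes two or more separate instructions in AVX2.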

When AVX2 Might Win

Short bursts: The cost of the frequency transition is not amortized over so little wide work
Memory-bound code: Wider vectors don't help if waiting for RAM
Mixed workloads: Frequency penalty affects scalar code too
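
These trade-offs can be put into a rough back-of-the-envelope model. The sketch below is an illustration, not a measurement. It assumes that a fraction `vector_fraction` of the AVX2 runtime is vectorized and compute-bound, that this part runs exactly 2× faster at 512 bits, and that the frequency reduction slows all code (vector and scalar) uniformly. The function name and parameters are hypothetical.

```rust
/// Estimated speedup of an AVX-512 build over an AVX2 build (values above
/// 1.0 mean AVX-512 is faster), under the simplifying assumptions stated
/// in the text.
///
/// vector_fraction: share of the AVX2 runtime spent in vector code (0.0..=1.0)
/// freq_reduction:  fractional clock reduction under AVX-512 (0.2 means 20%)
pub fn avx512_speedup(vector_fraction: f64, freq_reduction: f64) -> f64 {
    // Runtime at the original clock: the vector part halves, the rest is unchanged.
    let scaled_time = vector_fraction / 2.0 + (1.0 - vector_fraction);
    // The lower clock stretches the whole runtime by 1 / (1 - freq_reduction).
    (1.0 - freq_reduction) / scaled_time
}
```

In this model the break-even point is `vector_fraction = 2 × freq_reduction`. With a 20% frequency reduction, AVX-512 only pays off when more than 40% of the AVX2 runtime is vector work; a fully vectorized kernel still comes out about 1.6× faster. Real hardware varies, so treat this as a reason to benchmark rather than as a prediction.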

Checking for AVX-512

use archmage::{X64V4Token, SimdToken};

fn check_avx512() {
    match X64V4Token::guaranteed() {
        Some(true) => println!("Compile-time AVX-512"),
        Some(false) => println!("Not x86-64"),
        None => {
            if X64V4Token::summon().is_some() {
                println!("Runtime AVX-512 available");
            } else {
                println!("No AVX-512");
            }
        }
    }
}
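
For comparison, a similar runtime check can be written with the standard library alone. This is a sketch and not a description of what `summon()` does internally: it tests only the five AVX-512 subsets listed for `X64V4Token` in the table above, and `has_v4_avx512_features` is a hypothetical name.

```rust
/// Returns true when the CPU reports F, BW, CD, DQ, and VL at runtime.
#[cfg(target_arch = "x86_64")]
pub fn has_v4_avx512_features() -> bool {
    is_x86_feature_detected!("avx512f")
        && is_x86_feature_detected!("avx512bw")
        && is_x86_feature_detected!("avx512cd")
        && is_x86_feature_detected!("avx512dq")
        && is_x86_feature_detected!("avx512vl")
}

/// On other architectures the features can never be present.
#[cfg(not(target_arch = "x86_64"))]
pub fn has_v4_avx512_features() -> bool {
    false
}
```

A bare boolean like this carries no proof of the check into the type system, which is the gap that a token value passed to each SIMD function is meant to fill.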

Example: Matrix Multiply

use archmage::{X64V4Token, arcane};

#[cfg(feature = "avx512")]
#[arcane]
fn matmul_4x4_avx512(
    token: X64V4Token,
    a: &[[f32; 4]; 4],
    b: &[[f32; 4]; 4],
    c: &mut [[f32; 4]; 4]
) {
    use std::arch::x86_64::*;

    // Load column 0 of B, replicated into all four 128-bit lanes.
    // `_mm512_set_ps` takes its arguments from lane 15 down to lane 0,
    // so b[0][0] ends up in lane 0.
    let b_col0 = _mm512_set_ps(
        b[3][0], b[2][0], b[1][0], b[0][0],
        b[3][0], b[2][0], b[1][0], b[0][0],
        b[3][0], b[2][0], b[1][0], b[0][0],
        b[3][0], b[2][0], b[1][0], b[0][0]
    );
    // ... broadcast and FMA pattern
}
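
When filling in a kernel like this, a scalar reference is useful for checking results. The sketch below shows one common way to phrase a broadcast-and-FMA matrix multiply in plain Rust: each `a[i][k]` is the value a SIMD kernel would broadcast, and the inner loop is the fused multiply-add across a row of B. This is an assumed formulation for testing purposes and not the elided body above, which loads B by columns and may be arranged differently. `matmul_4x4_reference` is a hypothetical name.

```rust
/// Scalar 4x4 matrix multiply, C = A * B, written in broadcast-and-FMA form.
pub fn matmul_4x4_reference(a: &[[f32; 4]; 4], b: &[[f32; 4]; 4]) -> [[f32; 4]; 4] {
    let mut c = [[0.0f32; 4]; 4];
    for i in 0..4 {
        for k in 0..4 {
            // The scalar that a vector kernel would broadcast to every lane.
            let a_ik = a[i][k];
            for j in 0..4 {
                // c[i][j] += a[i][k] * b[k][j], as a fused multiply-add.
                c[i][j] = a_ik.mul_add(b[k][j], c[i][j]);
            }
        }
    }
    c
}
```

Comparing a SIMD kernel's output against this function on a few inputs (the identity matrix, a diagonal matrix, small integers) catches most lane-ordering mistakes.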
