Memory Operations
Efficiently moving data between memory and SIMD registers is critical for performance.
Load Operations
Unaligned Load
```rust
use magetypes::simd::f32x8;

// From array reference
let arr = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let v = f32x8::from_array(token, arr);

// From slice (must have enough elements)
let slice = &[1.0f32; 16];
let v = f32x8::from_slice(token, &slice[0..8]);
```
Aligned Load
If you know your data is aligned:
```rust
// Aligned load (UB if `ptr` is not aligned to 32 bytes for f32x8)
let v = unsafe { f32x8::load_aligned(ptr) };
```
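For a self-contained sketch, one way to guarantee that alignment is a `#[repr(align(32))]` wrapper (see the Alignment Tips below); this assumes `load_aligned` takes a `*const f32` as in the snippet above:

```rust
// 32-byte-aligned storage for one f32x8 worth of data
#[repr(C, align(32))]
struct Aligned8([f32; 8]);

let data = Aligned8([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
let ptr = data.0.as_ptr();

// Sound here because the wrapper guarantees 32-byte alignment.
let v = unsafe { f32x8::load_aligned(ptr) };
```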
Partial Load
Load fewer elements than the vector width:
```rust
// Load 4 elements into the lower half, zero the upper half
let v = f32x8::load_low(token, &[1.0, 2.0, 3.0, 4.0]);
// v = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
```
Store Operations
Unaligned Store
```rust
let v = f32x8::splat(token, 42.0);

// To array
let arr: [f32; 8] = v.to_array();

// To slice
let mut buf = [0.0f32; 8];
v.store_slice(&mut buf);
```
Aligned Store
```rust
// Aligned store (UB if `ptr` is not aligned)
unsafe { v.store_aligned(ptr) };
```
Partial Store
Store only some elements:
```rust
// Store the lower 4 elements
v.store_low(&mut buf[0..4]);
```
Streaming Stores
For large data where you won't read back soon:
```rust
// Non-temporal store (bypasses the cache)
unsafe { v.stream(ptr) };
```
Use streaming stores when:
- Writing large arrays sequentially
- Data won't be read again soon
- Avoiding cache pollution is important
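As a sketch of that pattern: fill a large, 32-byte-aligned buffer with non-temporal stores, then fence before the data is read again (on x86, streaming stores such as `_mm256_stream_ps` require aligned destinations and are weakly ordered). The buffer type and size here are illustrative, and `stream` is assumed to take a `*mut f32` as above:

```rust
use std::arch::x86_64::_mm_sfence;

// Illustrative 32-byte-aligned output buffer (32 KiB).
#[repr(C, align(32))]
struct OutBuf([f32; 8192]);

let mut out = Box::new(OutBuf([0.0; 8192]));
let v = f32x8::splat(token, 1.0);

for chunk in out.0.chunks_exact_mut(8) {
    // Each chunk is 8 * 4 = 32 bytes from an aligned base, so the
    // destination pointer stays 32-byte aligned.
    unsafe { v.stream(chunk.as_mut_ptr()) };
}

// Non-temporal stores are weakly ordered; fence before reading the buffer
// back or handing it to another thread.
unsafe { _mm_sfence() };
```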
Gather and Scatter
Load/store non-contiguous elements:
```rust
// Gather: load from scattered indices
let data = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0];
let indices = i32x8::from_array(token, [0, 2, 4, 6, 8, 1, 3, 5]);
let gathered = f32x8::gather(&data, indices);
// gathered = [0.0, 20.0, 40.0, 60.0, 80.0, 10.0, 30.0, 50.0]

// Scatter: store to scattered indices
let mut output = [0.0f32; 10];
let values = f32x8::splat(token, 1.0);
values.scatter(&mut output, indices);
```
Note: Gather/scatter may be slow on some CPUs. Profile before using.
Prefetch
Hint the CPU to load data into cache:
```rust
use std::arch::x86_64::*;

// Prefetch for read
unsafe { _mm_prefetch(ptr as *const i8, _MM_HINT_T0) };

// Prefetch levels:
// _MM_HINT_T0  - all cache levels
// _MM_HINT_T1  - L2 and above
// _MM_HINT_T2  - L3 and above
// _MM_HINT_NTA - non-temporal (don't pollute the cache)
```
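A sketch of how this is typically used: prefetch a fixed distance ahead of the chunk being processed while streaming through a slice. The distance (64 floats, i.e. four cache lines) is only a starting point to tune, and the `#[arcane]`/token pattern follows the chunked-processing example later in this chapter:

```rust
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

// Illustrative prefetch distance: 64 floats = 256 bytes = 4 cache lines.
const PREFETCH_AHEAD: usize = 64;

#[arcane]
fn sum_with_prefetch(token: Desktop64, data: &[f32]) -> f32 {
    let mut acc = f32x8::splat(token, 0.0);
    for (i, chunk) in data.chunks_exact(8).enumerate() {
        let ahead = i * 8 + PREFETCH_AHEAD;
        if ahead < data.len() {
            // Hint the CPU to start pulling upcoming data into cache.
            unsafe { _mm_prefetch(data.as_ptr().add(ahead) as *const i8, _MM_HINT_T0) };
        }
        acc = acc + f32x8::from_slice(token, chunk);
    }
    // Horizontal sum of the accumulator plus the scalar tail.
    acc.to_array().iter().sum::<f32>()
        + data.chunks_exact(8).remainder().iter().sum::<f32>()
}
```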
Interleaved Data
For RGBARGBA... or similar interleaved formats:
```rust
// Deinterleave 4 channels (RGBA)
let (r, g, b, a) = f32x8::deinterleave_4ch(
    token,
    &rgba_data[0..8],
    &rgba_data[8..16],
    &rgba_data[16..24],
    &rgba_data[24..32],
);

// Process channels separately
let r_bright = r + f32x8::splat(token, 0.1);

// Reinterleave
let (out0, out1, out2, out3) = f32x8::interleave_4ch(token, r_bright, g, b, a);
```
Chunked Processing
Process large arrays in SIMD-sized chunks:
```rust
#[arcane]
fn process_large(token: Desktop64, data: &mut [f32]) {
    // Process full 8-element chunks
    for chunk in data.chunks_exact_mut(8) {
        let v = f32x8::from_slice(token, chunk);
        let result = v * v; // process
        result.store_slice(chunk);
    }

    // Handle the remainder with scalar code
    for x in data.chunks_exact_mut(8).into_remainder() {
        *x = *x * *x;
    }
}
```
Alignment Tips
- Use `#[repr(align(32))]` for AVX2 data:

  ```rust
  #[repr(C, align(32))]
  struct AlignedData {
      values: [f32; 8],
  }
  ```

- Allocate aligned memory:

  ```rust
  use std::alloc::{alloc, Layout};

  // `size` is the allocation size in bytes; free it later with
  // `std::alloc::dealloc(ptr, layout)`.
  let layout = Layout::from_size_align(size, 32).unwrap();
  let ptr = unsafe { alloc(layout) };
  ```

- Check alignment at runtime:

  ```rust
  fn is_aligned<T>(ptr: *const T, align: usize) -> bool {
      (ptr as usize) % align == 0
  }
  ```
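The runtime check pairs naturally with a debug assertion in front of the aligned paths; a small sketch, with `ptr` and `v` as in the aligned-store example above:

```rust
// Catch misaligned pointers in debug builds before hitting UB in release.
debug_assert!(is_aligned(ptr, 32), "destination must be 32-byte aligned");
unsafe { v.store_aligned(ptr) };
```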
Performance Tips
- Minimize loads/stores — Keep data in registers
- Prefer unaligned — Modern CPUs handle it well
- Use streaming for large writes — Saves cache space
- Batch operations — Load once, do multiple ops, store once (see the sketch below)
- Avoid gather/scatter — Sequential access is faster
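As an illustration of tip 4, a sketch that applies a scale, an offset, and a square with one load and one store per chunk; the constants and function name are illustrative, and the structure follows the chunked-processing example above:

```rust
#[arcane]
fn scale_offset_square(token: Desktop64, data: &mut [f32]) {
    let scale = f32x8::splat(token, 2.0);
    let offset = f32x8::splat(token, 0.5);

    for chunk in data.chunks_exact_mut(8) {
        let v = f32x8::from_slice(token, chunk); // one load
        let v = v * scale + offset;              // work stays in registers
        let v = v * v;
        v.store_slice(chunk);                    // one store
    }

    // Scalar tail, as in the chunked-processing example.
    for x in data.chunks_exact_mut(8).into_remainder() {
        let t = *x * 2.0 + 0.5;
        *x = t * t;
    }
}
```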