Unfortunately, due to the complexity and specialized nature of AVX-512, such optimizations are typically reserved for performance-critical applications and require expertise in low-level programming and processor microarchitecture.

  • I worked in media broadcasting. We had an internal lib to scale/convert whatever format in real time, and it went from basic operations, to SSE3, to AVX-512, to CUDA. And yes, hand-crafting some functions/loops with assembly can give an enormous boost.
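
    To give a rough idea of what that kind of rewrite looks like, here is a minimal sketch, not the internal lib mentioned above. It assumes a simple "brighten 8-bit samples with clamping" operation and uses SSE2 intrinsics (baseline on x86-64) rather than SSE3 or AVX-512, so it builds on most machines. The function names `brighten_scalar` and `brighten_simd` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Baseline: add an offset to each 8-bit sample, clamping at 255. */
static void brighten_scalar(uint8_t *px, size_t n, uint8_t offset) {
    for (size_t i = 0; i < n; i++) {
        unsigned v = (unsigned)px[i] + offset;
        px[i] = (uint8_t)(v > 255u ? 255u : v);
    }
}

/* Same operation, 16 samples per iteration where SSE2 is available. */
static void brighten_simd(uint8_t *px, size_t n, uint8_t offset) {
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i off = _mm_set1_epi8((char)offset);  /* offset in all 16 lanes */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(px + i));
        v = _mm_adds_epu8(v, off);  /* unsigned saturating add */
        _mm_storeu_si128((__m128i *)(px + i), v);
    }
#endif
    /* Leftover samples, or the whole buffer when SSE2 is not available. */
    brighten_scalar(px + i, n - i, offset);
}
```

    Both functions should produce identical output; the SIMD one just does the clamp in a single instruction instead of a branch per sample. How much faster it is in practice depends on the compiler (which may auto-vectorize the scalar loop anyway) and on memory bandwidth, so measure before assuming a win.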