What is this page?

This page is a copy of a great x64 SIMD guide from https://www.officedaytime.com/simd512e/, but adapted for those of us who use C#.

I amended it to include the names of intrinsics from the System.Runtime.Intrinsics.X86 namespace as of .NET Core 3.1 (future versions pending).
These names are hard to find in MSDN documentation, and they differ from both Intel’s instruction names and C++ intrinsics. Hence, here they are.

C# names, when available, are written below C++ intrinsics.

♯ — links to MSDN documentation are marked with a sharp sign.

Some methods not included here are available for AVX and AVX2. If you cannot find what you want, try changing the class name from SseX to Avx or Avx2.
The same goes for Vector128 versus Vector256; the sketch below shows the pattern.
AVX512 is completely missing from .NET Core at this point.
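
For example, the 128-bit and 256-bit forms of an operation usually differ only in the class and vector type. A minimal C# sketch (a128, b128, a256 and b256 are hypothetical variables assumed to be initialized elsewhere):

    // using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    Vector128<int> sum128 = Sse2.Add(a128, b128);   // PADDD xmm, xmm
    Vector256<int> sum256 = Avx2.Add(a256, b256);   // VPADDD ymm, ymm (same method name, wider class and type)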

x86/x64 SIMD Instruction List (SSE to AVX512)

MMX register (64-bit) instructions are omitted.

S1=SSE  S2=SSE2 S3=SSE3 SS3=SSSE3 S4.1=SSE4.1 S4.2=SSE4.2 V1=AVX V2=AVX2 V5=AVX512

Instructions marked * become scalar instructions (only the lowest element is calculated) when PS/PD/DQ is changed to SS/SD/SI.
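
For instance, the packed and scalar forms appear as separate methods in C#. A sketch (a and b are hypothetical Vector128<float> values):

    // using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    Vector128<float> packed = Sse.Add(a, b);        // ADDPS: all four elements
    Vector128<float> scalar = Sse.AddScalar(a, b);  // ADDSS: lowest element only, upper three taken from a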

C/C++ intrinsic name is written below each instruction in blue.

This document is intended to help you find the correct name of an instruction you are not sure of, so that you can then look it up in the manuals. Refer to the manuals before coding.

Intel's manuals -> https://software.intel.com/en-us/articles/intel-sdm

If you find any errors or have other feedback, please post to the feedback form or email me at the address at the bottom of this page.

 


MOVE     ?MM = XMM / YMM / ZMM

  Integer Floating-Point YMM lane (128-bit)
QWORD DWORD WORD BYTE Double Single Half
?MM whole
from / to
?MM/mem
MOVDQA (S2
_mm_load_si128
Sse2.LoadAlignedVector128 
_mm_store_si128
Sse2.StoreAligned 
MOVDQU (S2
_mm_loadu_si128
Sse2.LoadVector128 
_mm_storeu_si128
Sse2.Store 
MOVAPD (S2
_mm_load_pd
Sse2.LoadAlignedVector128 
_mm_loadr_pd
_mm_store_pd
Sse2.StoreAligned 
_mm_storer_pd

MOVUPD (S2
_mm_loadu_pd
Sse2.LoadVector128 
_mm_storeu_pd
Sse2.Store 
MOVAPS (S1
_mm_load_ps
Sse.LoadAlignedVector128 
_mm_loadr_ps
_mm_store_ps
Sse.StoreAligned 
_mm_storer_ps

MOVUPS (S1
_mm_loadu_ps
Sse.LoadVector128 
_mm_storeu_ps
Sse.Store 
 
VMOVDQA64 (V5...
_mm_mask_load_epi64
_mm_mask_store_epi64
etc
VMOVDQU64 (V5...
_mm_mask_loadu_epi64
_mm_mask_storeu_epi64
etc
VMOVDQA32 (V5...
_mm_mask_load_epi32
_mm_mask_store_epi32
etc
VMOVDQU32 (V5...
_mm_mask_loadu_epi32
_mm_mask_storeu_epi32
etc
VMOVDQU16 (V5+BW...
_mm_mask_loadu_epi16
_mm_mask_storeu_epi16
etc
VMOVDQU8 (V5+BW...
_mm_mask_loadu_epi8
_mm_mask_storeu_epi8
etc
XMM upper half
from / to
mem
MOVHPD (S2
_mm_loadh_pd
Sse2.LoadHigh 
_mm_storeh_pd
Sse2.StoreHigh 
MOVHPS (S1
_mm_loadh_pi
Sse.LoadHigh 
_mm_storeh_pi
Sse.StoreHigh 
 
XMM upper half
from / to
XMM lower half
MOVHLPS (S1
_mm_movehl_ps
Sse.MoveHighToLow 

MOVLHPS (S1
_mm_movelh_ps
Sse.MoveLowToHigh 
 
XMM lower half
from / to
mem
MOVQ (S2
_mm_loadl_epi64
Sse2.LoadScalarVector128 
_mm_storel_epi64
Sse2.StoreScalar 
      MOVLPD (S2
_mm_loadl_pd
Sse2.LoadLow 
_mm_storel_pd
Sse2.StoreLow 
MOVLPS (S1
_mm_loadl_pi
Sse.LoadLow 
_mm_storel_pi
Sse.StoreLow 
   
XMM lowest 1 elem
from / to
r/m
MOVQ (S2
_mm_cvtsi64_si128
Sse2.X64.ConvertScalarToVector128Int64 
Sse2.X64.ConvertScalarToVector128UInt64 
_mm_cvtsi128_si64
Sse2.X64.ConvertToInt64 
Sse2.X64.ConvertToUInt64 
MOVD (S2
_mm_cvtsi32_si128
Sse2.ConvertScalarToVector128Int32 
Sse2.ConvertScalarToVector128UInt32 
_mm_cvtsi128_si32
Sse2.ConvertToInt32 
Sse2.ConvertToUInt32 
   
XMM lowest 1 elem
from / to
XMM/mem
MOVQ (S2
_mm_move_epi64
Sse2.MoveScalar 
      MOVSD (S2
_mm_load_sd
Sse2.LoadScalarVector128 
_mm_store_sd
Sse2.StoreScalar 
_mm_move_sd
Sse2.MoveScalar 
MOVSS (S1
_mm_load_ss
Sse.LoadScalarVector128 
_mm_store_ss
Sse.StoreScalar 
_mm_move_ss
Sse.MoveScalar 
   
XMM whole
from
1 elem
TIP 2
_mm_set1_epi64x
VPBROADCASTQ (V2
_mm_broadcastq_epi64
Avx2.BroadcastScalarToVector128 
TIP 2
_mm_set1_epi32
VPBROADCASTD (V2
_mm_broadcastd_epi32
Avx2.BroadcastScalarToVector128 
TIP 2
_mm_set1_epi16
VPBROADCASTW (V2
_mm_broadcastw_epi16
Avx2.BroadcastScalarToVector128 
_mm_set1_epi8
VPBROADCASTB (V2
_mm_broadcastb_epi8
Avx2.BroadcastScalarToVector128 
TIP 2
_mm_set1_pd
_mm_load1_pd
MOVDDUP (S3
_mm_movedup_pd
Sse3.MoveAndDuplicate 
_mm_loaddup_pd
Sse3.LoadAndDuplicateToVector128 

TIP 2
_mm_set1_ps
_mm_load1_ps

VBROADCASTSS
from mem (V1
from XMM (V2
_mm_broadcast_ss
Avx.BroadcastScalarToVector128 
YMM / ZMM whole
from
1 elem
VPBROADCASTQ (V2
_mm256_broadcastq_epi64
Avx2.BroadcastScalarToVector256 
VPBROADCASTD (V2
_mm256_broadcastd_epi32
Avx2.BroadcastScalarToVector256 
VPBROADCASTW (V2
_mm256_broadcastw_epi16
Avx2.BroadcastScalarToVector256 
VPBROADCASTB (V2
_mm256_broadcastb_epi8
Avx2.BroadcastScalarToVector256 
VBROADCASTSD
 from mem (V1
 from XMM (V2
_mm256_broadcast_sd
Avx.BroadcastScalarToVector256 
VBROADCASTSS
 from mem (V1
 from XMM (V2
_mm256_broadcast_ss
Avx.BroadcastScalarToVector256 
  VBROADCASTF128 (V1
_mm256_broadcast_ps
Avx.BroadcastVector128ToVector256 
_mm256_broadcast_pd
Avx.BroadcastVector128ToVector256 

VBROADCASTI128 (V2
_mm256_broadcastsi128_si256
Avx2.BroadcastVector128ToVector256 
YMM / ZMM whole
from
2/4/8 elems
VBROADCASTI64X2 (V5+DQ...
_mm512_broadcast_i64x2
VBROADCASTI64X4 (V5
_mm512_broadcast_i64x4
VBROADCASTI32X2 (V5+DQ...
_mm512_broadcast_i32x2
VBROADCASTI32X4 (V5...
_mm512_broadcast_i32x4
VBROADCASTI32X8 (V5+DQ
_mm512_broadcast_i32x8
VBROADCASTF64X2 (V5+DQ...
_mm512_broadcast_f64x2
VBROADCASTF64X4 (V5
_mm512_broadcast_f64x4
VBROADCASTF32X2 (V5+DQ...
_mm512_broadcast_f32x2
VBROADCASTF32X4 (V5...
_mm512_broadcast_f32x4
VBROADCASTF32X8 (V5+DQ
_mm512_broadcast_f32x8
?MM
from
multiple elems
_mm_set_epi64x
Vector256.Create 
_mm_setr_epi64x
_mm_set_epi32
Vector256.Create 
_mm_setr_epi32
_mm_set_epi16
Vector256.Create 
_mm_setr_epi16
_mm_set_epi8
Vector256.Create 
_mm_setr_epi8
_mm_set_pd
Vector256.Create 
_mm_setr_pd
_mm_set_ps
Vector256.Create 
_mm_setr_ps
   
?MM whole
from
zero
TIP 1
_mm_setzero_si128
Vector256<T>.Zero 
TIP 1
_mm_setzero_pd
Vector256<T>.Zero 
TIP 1
_mm_setzero_ps
Vector256<T>.Zero 
   
extract PEXTRQ (S4.1
_mm_extract_epi64
Sse41.X64.Extract 
PEXTRD (S4.1
_mm_extract_epi32
Sse41.Extract 
PEXTRW to r (S2
PEXTRW to r/m (S4.1
_mm_extract_epi16
Sse2.Extract 

PEXTRB (S4.1
_mm_extract_epi8
Sse41.Extract 
->MOVHPD (S2
_mm_loadh_pd
Sse2.LoadHigh 
_mm_storeh_pd
Sse2.StoreHigh 

->MOVLPD (S2
_mm_loadl_pd
Sse2.LoadLow 
_mm_storel_pd
Sse2.StoreLow 
EXTRACTPS (S4.1
_mm_extract_ps
Sse41.Extract 
  VEXTRACTF128 (V1
_mm256_extractf128_ps
Avx.ExtractVector128 
_mm256_extractf128_pd
Avx.ExtractVector128 
_mm256_extractf128_si256
Avx.ExtractVector128 

VEXTRACTI128 (V2
_mm256_extracti128_si256
Avx2.ExtractVector128 
VEXTRACTI64X2 (V5+DQ...
_mm512_extracti64x2_epi64
VEXTRACTI64X4 (V5
_mm512_extracti64x4_epi64
VEXTRACTI32X4 (V5...
_mm512_extracti32x4_epi32
VEXTRACTI32X8 (V5+DQ
_mm512_extracti32x8_epi32
VEXTRACTF64X2 (V5+DQ...
_mm512_extractf64x2_pd
VEXTRACTF64X4 (V5
_mm512_extractf64x4_pd
VEXTRACTF32X4 (V5...
_mm512_extractf32x4_ps
VEXTRACTF32X8 (V5+DQ
_mm512_extractf32x8_ps
insert PINSRQ (S4.1
_mm_insert_epi64
Sse41.X64.Insert 
PINSRD (S4.1
_mm_insert_epi32
Sse41.Insert 
PINSRW (S2
_mm_insert_epi16
Sse2.Insert 
PINSRB (S4.1
_mm_insert_epi8
Sse41.Insert 
->MOVHPD (S2
_mm_loadh_pd
Sse2.LoadHigh 
_mm_storeh_pd
Sse2.StoreHigh 

->MOVLPD (S2
_mm_loadl_pd
Sse2.LoadLow 
_mm_storel_pd
Sse2.StoreLow 
INSERTPS (S4.1
_mm_insert_ps
Sse41.Insert 
  VINSERTF128 (V1
_mm256_insertf128_ps
Avx.InsertVector128 
_mm256_insertf128_pd
Avx.InsertVector128 
_mm256_insertf128_si256
Avx.InsertVector128 

VINSERTI128 (V2
_mm256_inserti128_si256
Avx2.InsertVector128 
VINSERTI64X2 (V5+DQ...
_mm512_inserti64x2
VINSERTI64X4 (V5...
_mm512_inserti64x4
VINSERTI32X4 (V5...
_mm512_inserti32x4
VINSERTI32X8 (V5+DQ
_mm512_inserti32x8
VINSERTF64X2 (V5+DQ...
_mm512_insertf64x2
VINSERTF64X4 (V5
_mm512_insertf64x4
VINSERTF32X4 (V5...
_mm512_insertf32x4
VINSERTF32X8 (V5+DQ
_mm512_insertf32x8
unpack
PUNPCKHQDQ (S2
_mm_unpackhi_epi64
Sse2.UnpackHigh 
PUNPCKLQDQ (S2
_mm_unpacklo_epi64
Sse2.UnpackLow 
PUNPCKHDQ (S2
_mm_unpackhi_epi32
Sse2.UnpackHigh 

PUNPCKLDQ (S2
_mm_unpacklo_epi32
Sse2.UnpackLow 
PUNPCKHWD (S2
_mm_unpackhi_epi16
Sse2.UnpackHigh 

PUNPCKLWD (S2
_mm_unpacklo_epi16
Sse2.UnpackLow 
PUNPCKHBW (S2
_mm_unpackhi_epi8
Sse2.UnpackHigh 

PUNPCKLBW (S2
_mm_unpacklo_epi8
Sse2.UnpackLow 
UNPCKHPD (S2
_mm_unpackhi_pd
Sse2.UnpackHigh 

UNPCKLPD (S2
_mm_unpacklo_pd
Sse2.UnpackLow 
UNPCKHPS (S1
_mm_unpackhi_ps
Sse.UnpackHigh 

UNPCKLPS (S1
_mm_unpacklo_ps
Sse.UnpackLow 
   
shuffle/permute
VPERMQ (V2
_mm256_permute4x64_epi64
Avx2.Permute4x64 

VPERMI2Q (V5...
_mm_permutex2var_epi64
PSHUFD (S2
_mm_shuffle_epi32
Sse2.Shuffle 

VPERMD (V2
_mm256_permutevar8x32_epi32
Avx2.PermuteVar8x32 

_mm256_permutexvar_epi32
VPERMI2D (V5...
_mm_permutex2var_epi32
PSHUFHW (S2
_mm_shufflehi_epi16
Sse2.ShuffleHigh 

PSHUFLW (S2
_mm_shufflelo_epi16
Sse2.ShuffleLow 

VPERMW (V5+BW...
_mm_permutexvar_epi16
VPERMI2W (V5+BW...
_mm_permutex2var_epi16
PSHUFB (SS3
_mm_shuffle_epi8
Ssse3.Shuffle 
SHUFPD (S2
_mm_shuffle_pd
Sse2.Shuffle 

VPERMILPD (V1
_mm_permute_pd
Avx.Permute 
_mm_permutevar_pd
Avx.PermuteVar 

VPERMPD (V2
_mm256_permute4x64_pd
Avx2.Permute4x64 

VPERMI2PD (V5...
_mm_permutex2var_pd
SHUFPS (S1
_mm_shuffle_ps
Sse.Shuffle 

VPERMILPS (V1
_mm_permute_ps
Avx.Permute 
_mm_permutevar_ps
Avx.PermuteVar 

VPERMPS (V2
_mm256_permutevar8x32_ps
Avx2.PermuteVar8x32 

VPERMI2PS (V5...
_mm_permutex2var_ps
  VPERM2F128 (V1
_mm256_permute2f128_ps
Avx.Permute2x128 
_mm256_permute2f128_pd
Avx.Permute2x128 
_mm256_permute2f128_si256
Avx.Permute2x128 

VPERM2I128 (V2
_mm256_permute2x128_si256
Avx2.Permute2x128 
VSHUFI64X2 (V5...
_mm512_shuffle_i64x2
VSHUFI32X4 (V5...
_mm512_shuffle_i32x4
VSHUFF64X2 (V5...
_mm512_shuffle_f64x2
VSHUFF32X4 (V5...
_mm512_shuffle_f32x4
blend
VPBLENDMQ (V5...
_mm_mask_blend_epi64
VPBLENDD (V2
_mm_blend_epi32
Avx2.Blend 

VPBLENDMD (V5...
_mm_mask_blend_epi32
PBLENDW (S4.1
_mm_blend_epi16
Sse41.Blend 

VPBLENDMW (V5+BW...
_mm_mask_blend_epi16
PBLENDVB (S4.1
_mm_blendv_epi8
Sse41.BlendVariable 

VPBLENDMB (V5+BW...
_mm_mask_blend_epi8
BLENDPD (S4.1
_mm_blend_pd
Sse41.Blend 

BLENDVPD (S4.1
_mm_blendv_pd
Sse41.BlendVariable 

VBLENDMPD (V5...
_mm_mask_blend_pd
BLENDPS (S4.1
_mm_blend_ps
Sse41.Blend 

BLENDVPS (S4.1
_mm_blendv_ps
Sse41.BlendVariable 

VBLENDMPS (V5...
_mm_mask_blend_ps
   
move and duplicate MOVDDUP (S3
_mm_movedup_pd
Sse3.MoveAndDuplicate 
_mm_loaddup_pd
Sse3.LoadAndDuplicateToVector128 
MOVSHDUP (S3
_mm_movehdup_ps
Sse3.MoveHighAndDuplicate 

MOVSLDUP (S3
_mm_moveldup_ps
Sse3.MoveLowAndDuplicate 
 
mask move VPMASKMOVQ (V2
_mm_maskload_epi64
Avx2.MaskLoad 
_mm_maskstore_epi64
Avx2.MaskStore 
VPMASKMOVD (V2
_mm_maskload_epi32
Avx2.MaskLoad 
_mm_maskstore_epi32
Avx2.MaskStore 
    VMASKMOVPD (V1
_mm_maskload_pd
Avx.MaskLoad 
_mm_maskstore_pd
Avx.MaskStore 
VMASKMOVPS (V1
_mm_maskload_ps
Avx.MaskLoad 
_mm_maskstore_ps
Avx.MaskStore 
   
extract highest bit       PMOVMSKB (S2
_mm_movemask_epi8
Sse2.MoveMask 
MOVMSKPD (S2
_mm_movemask_pd
Sse2.MoveMask 
MOVMSKPS (S1
_mm_movemask_ps
Sse.MoveMask 
   
VPMOVQ2M (V5+DQ...
_mm_movepi64_mask
VPMOVD2M (V5+DQ...
_mm_movepi32_mask
VPMOVW2M (V5+BW...
_mm_movepi16_mask
VPMOVB2M (V5+BW...
_mm_movepi8_mask
gather
VPGATHERDQ (V2
_mm_i32gather_epi64
Avx2.GatherVector128 
_mm_mask_i32gather_epi64
Avx2.GatherMaskVector128 

VPGATHERQQ (V2
_mm_i64gather_epi64
Avx2.GatherVector128 
_mm_mask_i64gather_epi64
Avx2.GatherMaskVector128 
VPGATHERDD (V2
_mm_i32gather_epi32
Avx2.GatherVector128 
_mm_mask_i32gather_epi32
Avx2.GatherMaskVector128 

VPGATHERQD (V2
_mm_i64gather_epi32
Avx2.GatherVector128 
_mm_mask_i64gather_epi32
Avx2.GatherMaskVector128 
    VGATHERDPD (V2
_mm_i32gather_pd
Avx2.GatherVector128 
_mm_mask_i32gather_pd
Avx2.GatherMaskVector128 

VGATHERQPD (V2
_mm_i64gather_pd
Avx2.GatherVector128 
_mm_mask_i64gather_pd
Avx2.GatherMaskVector128 
VGATHERDPS (V2
_mm_i32gather_ps
Avx2.GatherVector128 
_mm_mask_i32gather_ps
Avx2.GatherMaskVector128 

VGATHERQPS (V2
_mm_i64gather_ps
Avx2.GatherVector128 
_mm_mask_i64gather_ps
Avx2.GatherMaskVector128 
   
scatter
VPSCATTERDQ (V5...
_mm_i32scatter_epi64
_mm_mask_i32scatter_epi64

VPSCATTERQQ (V5...
_mm_i64scatter_epi64
_mm_mask_i64scatter_epi64
VPSCATTERDD (V5...
_mm_i32scatter_epi32
_mm_mask_i32scatter_epi32

VPSCATTERQD (V5...
_mm_i64scatter_epi32
_mm_mask_i64scatter_epi32
    VSCATTERDPD (V5...
_mm_i32scatter_pd
_mm_mask_i32scatter_pd

VSCATTERQPD (V5...
_mm_i64scatter_pd
_mm_mask_i64scatter_pd
VSCATTERDPS (V5...
_mm_i32scatter_ps
_mm_mask_i32scatter_ps

VSCATTERQPS (V5...
_mm_i64scatter_ps
_mm_mask_i64scatter_ps
   
compress
VPCOMPRESSQ (V5...
_mm_mask_compress_epi64
_mm_mask_compressstoreu_epi64
VPCOMPRESSD (V5...
_mm_mask_compress_epi32
_mm_mask_compressstoreu_epi32
VCOMPRESSPD (V5...
_mm_mask_compress_pd
_mm_mask_compressstoreu_pd
VCOMPRESSPS (V5...
_mm_mask_compress_ps
_mm_mask_compressstoreu_ps
expand
VPEXPANDQ (V5...
_mm_mask_expand_epi64
_mm_mask_expandloadu_epi64
VPEXPANDD (V5...
_mm_mask_expand_epi32
_mm_mask_expandloadu_epi32
VEXPANDPD (V5...
_mm_mask_expand_pd
_mm_mask_expandloadu_pd
VEXPANDPS (V5...
_mm_mask_expand_ps
_mm_mask_expandloadu_ps
align right VALIGNQ (V5...
_mm_alignr_epi64
VALIGND (V5...
_mm_alignr_epi32
PALIGNR (SS3
_mm_alignr_epi8
Ssse3.AlignRight 
expand Opmask bits VPMOVM2Q (V5+DQ...
_mm_movm_epi64
VPMOVM2D (V5+DQ...
_mm_movm_epi32
VPMOVM2W (V5+BW...
_mm_movm_epi16
VPMOVM2B (V5+BW...
_mm_movm_epi8

 

Conversions

from \ to Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
Integer QWORD VPMOVQD (V5...
_mm_cvtepi64_epi32
VPMOVSQD (V5...
_mm_cvtsepi64_epi32
VPMOVUSQD (V5...
_mm_cvtusepi64_epi32
VPMOVQW (V5...
_mm_cvtepi64_epi16
VPMOVSQW (V5...
_mm_cvtsepi64_epi16
VPMOVUSQW (V5...
_mm_cvtusepi64_epi16
VPMOVQB (V5...
_mm_cvtepi64_epi8
VPMOVSQB (V5...
_mm_cvtsepi64_epi8
VPMOVUSQB (V5...
_mm_cvtusepi64_epi8
CVTSI2SD (S2 scalar only
_mm_cvtsi64_sd
Sse2.X64.ConvertScalarToVector128Double 

VCVTQQ2PD* (V5+DQ...
_mm_cvtepi64_pd
VCVTUQQ2PD* (V5+DQ...
_mm_cvtepu64_pd
CVTSI2SS (S1 scalar only
_mm_cvtsi64_ss
Sse.X64.ConvertScalarToVector128Single 

VCVTQQ2PS* (V5+DQ...
_mm_cvtepi64_ps
VCVTUQQ2PS* (V5+DQ...
_mm_cvtepu64_ps
DWORD TIP 3
PMOVSXDQ (S4.1
_mm_cvtepi32_epi64
Sse41.ConvertToVector128Int64 

PMOVZXDQ (S4.1
_mm_cvtepu32_epi64
Sse41.ConvertToVector128Int64 
  PACKSSDW (S2
_mm_packs_epi32
Sse2.PackSignedSaturate 

PACKUSDW (S4.1
_mm_packus_epi32
Sse41.PackUnsignedSaturate 

VPMOVDW (V5...
_mm_cvtepi32_epi16
VPMOVSDW (V5...
_mm_cvtsepi32_epi16
VPMOVUSDW (V5...
_mm_cvtusepi32_epi16
VPMOVDB (V5...
_mm_cvtepi32_epi8
VPMOVSDB (V5...
_mm_cvtsepi32_epi8
VPMOVUSDB (V5...
_mm_cvtusepi32_epi8
CVTDQ2PD* (S2
_mm_cvtepi32_pd
Sse2.ConvertToVector128Double 

VCVTUDQ2PD* (V5...
_mm_cvtepu32_pd
CVTDQ2PS* (S2
_mm_cvtepi32_ps
Sse2.ConvertToVector128Single 

VCVTUDQ2PS* (V5...
_mm_cvtepu32_ps
WORD PMOVSXWQ (S4.1
_mm_cvtepi16_epi64
Sse41.ConvertToVector128Int64 

PMOVZXWQ (S4.1
_mm_cvtepu16_epi64
Sse41.ConvertToVector128Int64 
TIP 3
PMOVSXWD (S4.1
_mm_cvtepi16_epi32
Sse41.ConvertToVector128Int32 

PMOVZXWD (S4.1
_mm_cvtepu16_epi32
Sse41.ConvertToVector128Int32 
PACKSSWB (S2
_mm_packs_epi16
Sse2.PackSignedSaturate 

PACKUSWB (S2
_mm_packus_epi16
Sse2.PackUnsignedSaturate 

VPMOVWB (V5+BW...
_mm_cvtepi16_epi8
VPMOVSWB (V5+BW...
_mm_cvtsepi16_epi8
VPMOVUSWB (V5+BW...
_mm_cvtusepi16_epi8
BYTE PMOVSXBQ (S4.1
_mm_cvtepi8_epi64
Sse41.ConvertToVector128Int64 

PMOVZXBQ (S4.1
_mm_cvtepu8_epi64
Sse41.ConvertToVector128Int64 
PMOVSXBD (S4.1
_mm_cvtepi8_epi32
Sse41.ConvertToVector128Int32 

PMOVZXBD (S4.1
_mm_cvtepu8_epi32
Sse41.ConvertToVector128Int32 
TIP 3
PMOVSXBW (S4.1
_mm_cvtepi8_epi16
Sse41.ConvertToVector128Int16 

PMOVZXBW (S4.1
_mm_cvtepu8_epi16
Sse41.ConvertToVector128Int16 
Floating-Point Double CVTSD2SI / CVTTSD2SI (S2 scalar only
_mm_cvtsd_si64 / _mm_cvttsd_si64
Sse2.X64.ConvertToInt64  / Sse2.X64.ConvertToInt64WithTruncation 

VCVTPD2QQ* / VCVTTPD2QQ* (V5+DQ...
_mm_cvtpd_epi64 / _mm_cvttpd_epi64
VCVTPD2UQQ* / VCVTTPD2UQQ* (V5+DQ...
_mm_cvtpd_epu64 / _mm_cvttpd_epu64
the right-hand forms are the truncating variants
CVTPD2DQ* / CVTTPD2DQ* (S2
_mm_cvtpd_epi32 / _mm_cvttpd_epi32
Sse2.ConvertToVector128Int32  / Sse2.ConvertToVector128Int32WithTruncation 

VCVTPD2UDQ* / VCVTTPD2UDQ* (V5...
_mm_cvtpd_epu32 / _mm_cvttpd_epu32
the right-hand forms are the truncating variants
CVTPD2PS* (S2
_mm_cvtpd_ps
Sse2.ConvertToVector128Single 
Single CVTSS2SI / CVTTSS2SI (S1 scalar only
_mm_cvtss_si64 / _mm_cvttss_si64
Sse.X64.ConvertToInt64  / Sse.X64.ConvertToInt64WithTruncation 

VCVTPS2QQ* / VCVTTPS2QQ* (V5+DQ...
_mm_cvtps_epi64 / _mm_cvttps_epi64
VCVTPS2UQQ* / VCVTTPS2UQQ* (V5+DQ...
_mm_cvtps_epu64 / _mm_cvttps_epu64
the right-hand forms are the truncating variants
CVTPS2DQ* / CVTTPS2DQ* (S2
_mm_cvtps_epi32 / _mm_cvttps_epi32
Sse2.ConvertToVector128Int32  / Sse2.ConvertToVector128Int32WithTruncation 

VCVTPS2UDQ* / VCVTTPS2UDQ* (V5...
_mm_cvtps_epu32 / _mm_cvttps_epu32
the right-hand forms are the truncating variants
  CVTPS2PD* (S2
_mm_cvtps_pd
Sse2.ConvertToVector128Double 
VCVTPS2PH (F16C
_mm_cvtps_ph
Half VCVTPH2PS (F16C
_mm_cvtph_ps

 

Arithmetic Operations

  Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
add PADDQ (S2
_mm_add_epi64
Sse2.Add 
PADDD (S2
_mm_add_epi32
Sse2.Add 
PADDW (S2
_mm_add_epi16
Sse2.Add 

PADDSW (S2
_mm_adds_epi16
Sse2.AddSaturate 

PADDUSW (S2
_mm_adds_epu16
Sse2.AddSaturate 
PADDB (S2
_mm_add_epi8
Sse2.Add 

PADDSB (S2
_mm_adds_epi8
Sse2.AddSaturate 

PADDUSB (S2
_mm_adds_epu8
Sse2.AddSaturate 
ADDPD* (S2
_mm_add_pd
Sse2.Add 
ADDPS* (S1
_mm_add_ps
Sse.Add 
sub PSUBQ (S2
_mm_sub_epi64
Sse2.Subtract 
PSUBD (S2
_mm_sub_epi32
Sse2.Subtract 
PSUBW (S2
_mm_sub_epi16
Sse2.Subtract 

PSUBSW (S2
_mm_subs_epi16
Sse2.SubtractSaturate 

PSUBUSW (S2
_mm_subs_epu16
Sse2.SubtractSaturate 
PSUBB (S2
_mm_sub_epi8
Sse2.Subtract 

PSUBSB (S2
_mm_subs_epi8
Sse2.SubtractSaturate 

PSUBUSB (S2
_mm_subs_epu8
Sse2.SubtractSaturate 
SUBPD* (S2
_mm_sub_pd
Sse2.Subtract 
SUBPS* (S1
_mm_sub_ps
Sse.Subtract 
 
mul VPMULLQ (V5+DQ...
_mm_mullo_epi64
PMULDQ (S4.1
_mm_mul_epi32
Sse41.Multiply 

PMULUDQ (S2
_mm_mul_epu32
Sse2.Multiply 

PMULLD (S4.1
_mm_mullo_epi32
Sse41.MultiplyLow 
PMULHW (S2
_mm_mulhi_epi16
Sse2.MultiplyHigh 

PMULHUW (S2
_mm_mulhi_epu16
Sse2.MultiplyHigh 

PMULLW (S2
_mm_mullo_epi16
Sse2.MultiplyLow 

MULPD* (S2
_mm_mul_pd
Sse2.Multiply 
MULPS* (S1
_mm_mul_ps
Sse.Multiply 
div DIVPD* (S2
_mm_div_pd
Sse2.Divide 
DIVPS* (S1
_mm_div_ps
Sse.Divide 
reciprocal         VRCP14PD* (V5...
_mm_rcp14_pd
VRCP28PD* (V5+ER
_mm512_rcp28_pd
RCPPS* (S1
_mm_rcp_ps
Sse.Reciprocal 

VRCP14PS* (V5...
_mm_rcp14_ps
VRCP28PS* (V5+ER
_mm512_rcp28_ps
 
square root         SQRTPD* (S2
_mm_sqrt_pd
Sse2.Sqrt 
SQRTPS* (S1
_mm_sqrt_ps
Sse.Sqrt 
 
reciprocal of square root         VRSQRT14PD* (V5...
_mm_rsqrt14_pd
VRSQRT28PD* (V5+ER
_mm512_rsqrt28_pd
RSQRTPS* (S1
_mm_rsqrt_ps
Sse.ReciprocalSqrt 

VRSQRT14PS* (V5...
_mm_rsqrt14_ps
VRSQRT28PS* (V5+ER
_mm512_rsqrt28_ps
 
power of two         VEXP2PD* (V5+ER
_mm512_exp2a23_round_pd
VEXP2PS* (V5+ER
_mm512_exp2a23_round_ps
 
multiply nth power of 2 VSCALEFPD* (V5...
_mm_scalef_pd
VSCALEFPS* (V5...
_mm_scalef_ps
max TIP 8
VPMAXSQ (V5...
_mm_max_epi64
VPMAXUQ (V5...
_mm_max_epu64
TIP 8
PMAXSD (S4.1
_mm_max_epi32
Sse41.Max 

PMAXUD (S4.1
_mm_max_epu32
Sse41.Max 
PMAXSW (S2
_mm_max_epi16
Sse2.Max 

PMAXUW (S4.1
_mm_max_epu16
Sse41.Max 
TIP 8
PMAXSB (S4.1
_mm_max_epi8
Sse41.Max 

PMAXUB (S2
_mm_max_epu8
Sse2.Max 
TIP 8
MAXPD* (S2
_mm_max_pd
Sse2.Max 
TIP 8
MAXPS* (S1
_mm_max_ps
Sse.Max 
 
min TIP 8
VPMINSQ (V5...
_mm_min_epi64
VPMINUQ (V5...
_mm_min_epu64
TIP 8
PMINSD (S4.1
_mm_min_epi32
Sse41.Min 

PMINUD (S4.1
_mm_min_epu32
Sse41.Min 
PMINSW (S2
_mm_min_epi16
Sse2.Min 

PMINUW (S4.1
_mm_min_epu16
Sse41.Min 

TIP 8
PMINSB (S4.1
_mm_min_epi8
Sse41.Min 

PMINUB (S2
_mm_min_epu8
Sse2.Min 
TIP 8
MINPD* (S2
_mm_min_pd
Sse2.Min 
TIP 8
MINPS* (S1
_mm_min_ps
Sse.Min 
average     PAVGW (S2
_mm_avg_epu16
Sse2.Average 
PAVGB (S2
_mm_avg_epu8
Sse2.Average 
     
absolute TIP 4
VPABSQ (V5...
_mm_abs_epi64
TIP 4
PABSD (SS3
_mm_abs_epi32
Ssse3.Abs 
TIP 4
PABSW (SS3
_mm_abs_epi16
Ssse3.Abs 
TIP 4
PABSB (SS3
_mm_abs_epi8
Ssse3.Abs 
TIP 5 TIP 5  
sign operation   PSIGND (SS3
_mm_sign_epi32
Ssse3.Sign 
PSIGNW (SS3
_mm_sign_epi16
Ssse3.Sign 
PSIGNB (SS3
_mm_sign_epi8
Ssse3.Sign 
     
round         ROUNDPD* (S4.1
_mm_round_pd
Sse41.RoundToNearestInteger 
_mm_floor_pd
Sse41.Floor 
_mm_ceil_pd
Sse41.Ceiling 

VRNDSCALEPD* (V5...
_mm_roundscale_pd
ROUNDPS* (S4.1
_mm_round_ps
Sse41.RoundToNearestInteger 
_mm_floor_ps
Sse41.Floor 
_mm_ceil_ps
Sse41.Ceiling 

VRNDSCALEPS* (V5...
_mm_roundscale_ps
 
difference from rounded value         VREDUCEPD* (V5+DQ...
_mm_reduce_pd
VREDUCEPS* (V5+DQ...
_mm_reduce_ps
 
add / sub         ADDSUBPD (S3
_mm_addsub_pd
Sse3.AddSubtract 
ADDSUBPS (S3
_mm_addsub_ps
Sse3.AddSubtract 
 
horizontal add   PHADDD (SS3
_mm_hadd_epi32
Ssse3.HorizontalAdd 
PHADDW (SS3
_mm_hadd_epi16
Ssse3.HorizontalAdd 

PHADDSW (SS3
_mm_hadds_epi16
Ssse3.HorizontalAddSaturate 
  HADDPD (S3
_mm_hadd_pd
Sse3.HorizontalAdd 
HADDPS (S3
_mm_hadd_ps
Sse3.HorizontalAdd 
 
horizontal sub   PHSUBD (SS3
_mm_hsub_epi32
Ssse3.HorizontalSubtract 
PHSUBW (SS3
_mm_hsub_epi16
Ssse3.HorizontalSubtract 

PHSUBSW (SS3
_mm_hsubs_epi16
Ssse3.HorizontalSubtractSaturate 
  HSUBPD (S3
_mm_hsub_pd
Sse3.HorizontalSubtract 
HSUBPS (S3
_mm_hsub_ps
Sse3.HorizontalSubtract 
 
dot product         DPPD (S4.1
_mm_dp_pd
Sse41.DotProduct 
DPPS (S4.1
_mm_dp_ps
Sse41.DotProduct 
 
multiply and add PMADDWD (S2
_mm_madd_epi16
Sse2.MultiplyAddAdjacent 
PMADDUBSW (SS3
_mm_maddubs_epi16
Ssse3.MultiplyAddAdjacent 
fused multiply and add / sub         VFMADDxxxPD* (FMA
_mm_fmadd_pd
Fma.MultiplyAdd 

VFMSUBxxxPD* (FMA
_mm_fmsub_pd
Fma.MultiplySubtract 

VFMADDSUBxxxPD (FMA
_mm_fmaddsub_pd
Fma.MultiplyAddSubtract 

VFMSUBADDxxxPD (FMA
_mm_fmsubadd_pd
Fma.MultiplySubtractAdd 

VFNMADDxxxPD* (FMA
_mm_fnmadd_pd
Fma.MultiplyAddNegated 

VFNMSUBxxxPD* (FMA
_mm_fnmsub_pd
Fma.MultiplySubtractNegated 

xxx=132/213/231
VFMADDxxxPS* (FMA
_mm_fmadd_ps
Fma.MultiplyAdd 

VFMSUBxxxPS* (FMA
_mm_fmsub_ps
Fma.MultiplySubtract 

VFMADDSUBxxxPS (FMA
_mm_fmaddsub_ps
Fma.MultiplyAddSubtract 

VFMSUBADDxxxPS (FMA
_mm_fmsubadd_ps
Fma.MultiplySubtractAdd 

VFNMADDxxxPS* (FMA
_mm_fnmadd_ps
Fma.MultiplyAddNegated 

VFNMSUBxxxPS* (FMA
_mm_fnmsub_ps
Fma.MultiplySubtractNegated 

xxx=132/213/231
 

 

Compare

  Integer
QWORD DWORD WORD BYTE
compare for == PCMPEQQ (S4.1
_mm_cmpeq_epi64
Sse41.CompareEqual 

_mm_cmpeq_epi64_mask (V5...
VPCMPUQ (0) (V5...
_mm_cmpeq_epu64_mask
PCMPEQD (S2
_mm_cmpeq_epi32
Sse2.CompareEqual 

_mm_cmpeq_epi32_mask (V5...
VPCMPUD (0) (V5...
_mm_cmpeq_epu32_mask
PCMPEQW (S2
_mm_cmpeq_epi16
Sse2.CompareEqual 

_mm_cmpeq_epi16_mask (V5+BW...
VPCMPUW (0) (V5+BW...
_mm_cmpeq_epu16_mask
PCMPEQB (S2
_mm_cmpeq_epi8
Sse2.CompareEqual 

_mm_cmpeq_epi8_mask (V5+BW...
VPCMPUB (0) (V5+BW...
_mm_cmpeq_epu8_mask
compare for < VPCMPQ (1) (V5...
_mm_cmplt_epi64_mask
VPCMPUQ (1) (V5...
_mm_cmplt_epu64_mask
VPCMPD (1) (V5...
_mm_cmplt_epi32_mask
VPCMPUD (1) (V5...
_mm_cmplt_epu32_mask
VPCMPW (1) (V5+BW...
_mm_cmplt_epi16_mask
VPCMPUW (1) (V5+BW...
_mm_cmplt_epu16_mask
VPCMPB (1) (V5+BW...
_mm_cmplt_epi8_mask
VPCMPUB (1) (V5+BW...
_mm_cmplt_epu8_mask
compare for <= VPCMPQ (2) (V5...
_mm_cmple_epi64_mask
VPCMPUQ (2) (V5...
_mm_cmple_epu64_mask
VPCMPD (2) (V5...
_mm_cmple_epi32_mask
VPCMPUD (2) (V5...
_mm_cmple_epu32_mask
VPCMPW (2) (V5+BW...
_mm_cmple_epi16_mask
VPCMPUW (2) (V5+BW...
_mm_cmple_epu16_mask
VPCMPB (2) (V5+BW...
_mm_cmple_epi8_mask
VPCMPUB (2) (V5+BW...
_mm_cmple_epu8_mask
compare for > PCMPGTQ (S4.2
_mm_cmpgt_epi64
Sse42.CompareGreaterThan 

VPCMPQ (6) (V5...
_mm_cmpgt_epi64_mask
VPCMPUQ (6) (V5...
_mm_cmpgt_epu64_mask
PCMPGTD (S2
_mm_cmpgt_epi32
Sse2.CompareGreaterThan 

VPCMPD (6) (V5...
_mm_cmpgt_epi32_mask
VPCMPUD (6) (V5...
_mm_cmpgt_epu32_mask
PCMPGTW (S2
_mm_cmpgt_epi16
Sse2.CompareGreaterThan 

VPCMPW (6) (V5+BW...
_mm_cmpgt_epi16_mask
VPCMPUW (6) (V5+BW...
_mm_cmpgt_epu16_mask
PCMPGTB (S2
_mm_cmpgt_epi8
Sse2.CompareGreaterThan 

VPCMPB (6) (V5+BW...
_mm_cmpgt_epi8_mask
VPCMPUB (6) (V5+BW...
_mm_cmpgt_epu8_mask
compare for >= VPCMPQ (5) (V5...
_mm_cmpge_epi64_mask
VPCMPUQ (5) (V5...
_mm_cmpge_epu64_mask
VPCMPD (5) (V5...
_mm_cmpge_epi32_mask
VPCMPUD (5) (V5...
_mm_cmpge_epu32_mask
VPCMPW (5) (V5+BW...
_mm_cmpge_epi16_mask
VPCMPUW (5) (V5+BW...
_mm_cmpge_epu16_mask
VPCMPB (5) (V5+BW...
_mm_cmpge_epi8_mask
VPCMPUB (5) (V5+BW...
_mm_cmpge_epu8_mask
compare for != VPCMPQ (4) (V5...
_mm_cmpneq_epi64_mask
VPCMPUQ (4) (V5...
_mm_cmpneq_epu64_mask
VPCMPD (4) (V5...
_mm_cmpneq_epi32_mask
VPCMPUD (4) (V5...
_mm_cmpneq_epu32_mask
VPCMPW (4) (V5+BW...
_mm_cmpneq_epi16_mask
VPCMPUW (4) (V5+BW...
_mm_cmpneq_epu16_mask
VPCMPB (4) (V5+BW...
_mm_cmpneq_epi8_mask
VPCMPUB (4) (V5+BW...
_mm_cmpneq_epu8_mask

 

Floating-Point
Double Single Half
when either (or both) is NaN condition unmet condition met condition unmet condition met  
Exception on QNaN YES NO YES NO YES NO YES NO  
compare for == VCMPEQ_OSPD* (V1
_mm_cmp_pd
Avx.Compare 
CMPEQPD* (S2
_mm_cmpeq_pd
Sse2.CompareEqual 
VCMPEQ_USPD* (V1
_mm_cmp_pd
Avx.Compare 
VCMPEQ_UQPD* (V1
_mm_cmp_pd
Avx.Compare 
VCMPEQ_OSPS* (V1
_mm_cmp_ps
Avx.Compare 
CMPEQPS* (S1
_mm_cmpeq_ps
Sse.CompareEqual 
VCMPEQ_USPS* (V1
_mm_cmp_ps
Avx.Compare 
VCMPEQ_UQPS* (V1
_mm_cmp_ps
Avx.Compare 
 
compare for < CMPLTPD* (S2
_mm_cmplt_pd
Sse2.CompareLessThan 
VCMPLT_OQPD* (V1
_mm_cmp_pd
Avx.Compare 
    CMPLTPS* (S1
_mm_cmplt_ps
Sse.CompareLessThan 
VCMPLT_OQPS* (V1
_mm_cmp_ps
Avx.Compare 
     
compare for <= CMPLEPD* (S2
_mm_cmple_pd
Sse2.CompareLessThanOrEqual 
VCMPLE_OQPD* (V1
_mm_cmp_pd
Avx.Compare 
CMPLEPS* (S1
_mm_cmple_ps
Sse.CompareLessThanOrEqual 
VCMPLE_OQPS* (V1
_mm_cmp_ps
Avx.Compare 
 
compare for > VCMPGTPD* (V1
_mm_cmpgt_pd (S2
Sse2.CompareGreaterThan 
VCMPGT_OQPD* (V1
_mm_cmp_pd
Avx.Compare 
    VCMPGTPS* (V1
_mm_cmpgt_ps (S1
Sse.CompareGreaterThan 
VCMPGT_OQPS* (V1
_mm_cmp_ps
Avx.Compare 
     
compare for >= VCMPGEPD* (V1
_mm_cmpge_pd (S2
Sse2.CompareGreaterThanOrEqual 
VCMPGE_OQPD* (V1
_mm_cmp_pd
Avx.Compare 
    VCMPGEPS* (V1
_mm_cmpge_ps (S1
Sse.CompareGreaterThanOrEqual 
VCMPGE_OQPS* (V1
_mm_cmp_ps
Avx.Compare 
     
compare for != VCMPNEQ_OSPD* (V1
_mm_cmp_pd
Avx.Compare 
VCMPNEQ_OQPD* (V1
_mm_cmp_pd
Avx.Compare 
VCMPNEQ_USPD* (V1
_mm_cmp_pd
Avx.Compare 
CMPNEQPD* (S2
_mm_cmpneq_pd
Sse2.CompareNotEqual 
VCMPNEQ_OSPS* (V1
_mm_cmp_ps
Avx.Compare 
VCMPNEQ_OQPS* (V1
_mm_cmp_ps
Avx.Compare 
VCMPNEQ_USPS* (V1
_mm_cmp_ps
Avx.Compare 
CMPNEQPS* (S1
_mm_cmpneq_ps
Sse.CompareNotEqual 
 
compare for ! < CMPNLTPD* (S2
_mm_cmpnlt_pd
Sse2.CompareNotLessThan 
VCMPNLT_UQPD* (V1
_mm_cmp_pd
Avx.Compare 
CMPNLTPS* (S1
_mm_cmpnlt_ps
Sse.CompareNotLessThan 
VCMPNLT_UQPS* (V1
_mm_cmp_ps
Avx.Compare 
 
compare for ! <=     CMPNLEPD* (S2
_mm_cmpnle_pd
Sse2.CompareNotLessThanOrEqual 
VCMPNLE_UQPD* (V1
_mm_cmp_pd
Avx.Compare 
    CMPNLEPS* (S1
_mm_cmpnle_ps
Sse.CompareNotLessThanOrEqual 
VCMPNLE_UQPS* (V1
_mm_cmp_ps
Avx.Compare 
 
compare for ! > VCMPNGTPD* (V1
_mm_cmpngt_pd (S2
Sse2.CompareNotGreaterThan 
VCMPNGT_UQPD* (V1
_mm_cmp_pd
Avx.Compare 
VCMPNGTPS* (V1
_mm_cmpngt_ps (S1
Sse.CompareNotGreaterThan 
VCMPNGT_UQPS* (V1
_mm_cmp_ps
Avx.Compare 
 
compare for ! >=     VCMPNGEPD* (V1
_mm_cmpnge_pd (S2
Sse2.CompareNotGreaterThanOrEqual 
VCMPNGE_UQPD* (V1
_mm_cmp_pd
Avx.Compare 
    VCMPNGEPS* (V1
_mm_cmpnge_ps (S1
Sse.CompareNotGreaterThanOrEqual 
VCMPNGE_UQPS* (V1
_mm_cmp_ps
Avx.Compare 
 
compare for ordered VCMPORD_SPD* (V1
_mm_cmp_pd
Avx.Compare 
CMPORDPD* (S2
_mm_cmpord_pd
Sse2.CompareOrdered 
VCMPORD_SPS* (V1
_mm_cmp_ps
Avx.Compare 
CMPORDPS* (S1
_mm_cmpord_ps
Sse.CompareOrdered 
 
compare for unordered     VCMPUNORD_SPD* (V1
_mm_cmp_pd
Avx.Compare 
CMPUNORDPD* (S2
_mm_cmpunord_pd
Sse2.CompareUnordered 
    VCMPUNORD_SPS* (V1
_mm_cmp_ps
Avx.Compare 
CMPUNORDPS* (S1
_mm_cmpunord_ps
Sse.CompareUnordered 
 
TRUE VCMPTRUE_USPD* (V1
_mm_cmp_pd
Avx.Compare 
VCMPTRUEPD* (V1
_mm_cmp_pd
Avx.Compare 
VCMPTRUE_USPS* (V1
_mm_cmp_ps
Avx.Compare 
VCMPTRUEPS* (V1
_mm_cmp_ps
Avx.Compare 
 
FALSE VCMPFALSE_OSPD* (V1
_mm_cmp_pd
Avx.Compare 
VCMPFALSEPD* (V1
_mm_cmp_pd
Avx.Compare 
    VCMPFALSE_OSPS* (V1
_mm_cmp_ps
Avx.Compare 
VCMPFALSEPS* (V1
_mm_cmp_ps
Avx.Compare 
     

 

  Floating-Point
Double Single Half
compare scalar values
to set flag register
COMISD (S2
_mm_comieq_sd
Sse2.CompareScalarOrderedEqual 
_mm_comilt_sd
Sse2.CompareScalarOrderedLessThan 
_mm_comile_sd
Sse2.CompareScalarOrderedLessThanOrEqual 
_mm_comigt_sd
Sse2.CompareScalarOrderedGreaterThan 
_mm_comige_sd
Sse2.CompareScalarOrderedGreaterThanOrEqual 
_mm_comineq_sd
Sse2.CompareScalarOrderedNotEqual 

UCOMISD (S2
_mm_ucomieq_sd
Sse2.CompareScalarUnorderedEqual 
_mm_ucomilt_sd
Sse2.CompareScalarUnorderedLessThan 
_mm_ucomile_sd
Sse2.CompareScalarUnorderedLessThanOrEqual 
_mm_ucomigt_sd
Sse2.CompareScalarUnorderedGreaterThan 
_mm_ucomige_sd
Sse2.CompareScalarUnorderedGreaterThanOrEqual 
_mm_ucomineq_sd
Sse2.CompareScalarUnorderedNotEqual 
COMISS (S1
_mm_comieq_ss
Sse.CompareScalarOrderedEqual 
_mm_comilt_ss
Sse.CompareScalarOrderedLessThan 
_mm_comile_ss
Sse.CompareScalarOrderedLessThanOrEqual 
_mm_comigt_ss
Sse.CompareScalarOrderedGreaterThan 
_mm_comige_ss
Sse.CompareScalarOrderedGreaterThanOrEqual 
_mm_comineq_ss
Sse.CompareScalarOrderedNotEqual 

UCOMISS (S1
_mm_ucomieq_ss
Sse.CompareScalarUnorderedEqual 
_mm_ucomilt_ss
Sse.CompareScalarUnorderedLessThan 
_mm_ucomile_ss
Sse.CompareScalarUnorderedLessThanOrEqual 
_mm_ucomigt_ss
Sse.CompareScalarUnorderedGreaterThan 
_mm_ucomige_ss
Sse.CompareScalarUnorderedGreaterThanOrEqual 
_mm_ucomineq_ss
Sse.CompareScalarUnorderedNotEqual 
 

 

Bitwise Logical Operations

  Integer Floating-Point
QWORD DWORD WORD BYTE Double Single Half
and PAND (S2
_mm_and_si128
Sse2.And 
ANDPD (S2
_mm_and_pd
Sse2.And 
ANDPS (S1
_mm_and_ps
Sse.And 
 
VPANDQ (V5...
_mm512_and_epi64
etc
VPANDD (V5...
_mm512_and_epi32
etc
and not PANDN (S2
_mm_andnot_si128
Sse2.AndNot 
ANDNPD (S2
_mm_andnot_pd
Sse2.AndNot 
ANDNPS (S1
_mm_andnot_ps
Sse.AndNot 
 
VPANDNQ (V5...
_mm512_andnot_epi64
etc
VPANDND (V5...
_mm512_andnot_epi32
etc
or POR (S2
_mm_or_si128
Sse2.Or 
ORPD (S2
_mm_or_pd
Sse2.Or 
ORPS (S1
_mm_or_ps
Sse.Or 
 
VPORQ (V5...
_mm512_or_epi64
etc
VPORD (V5...
_mm512_or_epi32
etc
xor PXOR (S2
_mm_xor_si128
Sse2.Xor 
XORPD (S2
_mm_xor_pd
Sse2.Xor 
XORPS (S1
_mm_xor_ps
Sse.Xor 
VPXORQ (V5...
_mm512_xor_epi64
etc
VPXORD (V5...
_mm512_xor_epi32
etc
test PTEST (S4.1
_mm_testz_si128
Sse41.TestZ 
_mm_testc_si128
Sse41.TestC 
_mm_testnzc_si128
Sse41.TestNotZAndNotC 
VTESTPD (V1
_mm_testz_pd
Avx.TestZ 
_mm_testc_pd
Avx.TestC 
_mm_testnzc_pd
Avx.TestNotZAndNotC 
VTESTPS (V1
_mm_testz_ps
Avx.TestZ 
_mm_testc_ps
Avx.TestC 
_mm_testnzc_ps
Avx.TestNotZAndNotC 
 
VPTESTMQ (V5...
_mm_test_epi64_mask
VPTESTNMQ (V5...
_mm_testn_epi64_mask
VPTESTMD (V5...
_mm_test_epi32_mask
VPTESTNMD (V5...
_mm_testn_epi32_mask
VPTESTMW (V5+BW...
_mm_test_epi16_mask
VPTESTNMW (V5+BW...
_mm_testn_epi16_mask
VPTESTMB (V5+BW...
_mm_test_epi8_mask
VPTESTNMB (V5+BW...
_mm_testn_epi8_mask
ternary operation VPTERNLOGQ (V5...
_mm_ternarylogic_epi64
VPTERNLOGD (V5...
_mm_ternarylogic_epi32

 

Bit Shift / Rotate

  Integer
QWORD DWORD WORD BYTE
shift left logical PSLLQ (S2
_mm_slli_epi64
Sse2.ShiftLeftLogical 
_mm_sll_epi64
Sse2.ShiftLeftLogical 
PSLLD (S2
_mm_slli_epi32
Sse2.ShiftLeftLogical 
_mm_sll_epi32
Sse2.ShiftLeftLogical 
PSLLW (S2
_mm_slli_epi16
Sse2.ShiftLeftLogical 
_mm_sll_epi16
Sse2.ShiftLeftLogical 
 
VPSLLVQ (V2
_mm_sllv_epi64
Avx2.ShiftLeftLogicalVariable 
VPSLLVD (V2
_mm_sllv_epi32
Avx2.ShiftLeftLogicalVariable 
VPSLLVW (V5+BW...
_mm_sllv_epi16
 
shift right logical PSRLQ (S2
_mm_srli_epi64
Sse2.ShiftRightLogical 
_mm_srl_epi64
Sse2.ShiftRightLogical 
PSRLD (S2
_mm_srli_epi32
Sse2.ShiftRightLogical 
_mm_srl_epi32
Sse2.ShiftRightLogical 
PSRLW (S2
_mm_srli_epi16
Sse2.ShiftRightLogical 
_mm_srl_epi16
Sse2.ShiftRightLogical 
 
VPSRLVQ (V2
_mm_srlv_epi64
Avx2.ShiftRightLogicalVariable 
VPSRLVD (V2
_mm_srlv_epi32
Avx2.ShiftRightLogicalVariable 
VPSRLVW (V5+BW...
_mm_srlv_epi16
 
shift right arithmetic VPSRAQ (V5...
_mm_srai_epi64
_mm_sra_epi64
PSRAD (S2
_mm_srai_epi32
Sse2.ShiftRightArithmetic 
_mm_sra_epi32
Sse2.ShiftRightArithmetic 
PSRAW (S2
_mm_srai_epi16
Sse2.ShiftRightArithmetic 
_mm_sra_epi16
Sse2.ShiftRightArithmetic 
 
VPSRAVQ (V5...
_mm_srav_epi64
VPSRAVD (V2
_mm_srav_epi32
Avx2.ShiftRightArithmeticVariable 
VPSRAVW (V5+BW...
_mm_srav_epi16
 
rotate left VPROLQ (V5...
_mm_rol_epi64
VPROLD (V5...
_mm_rol_epi32
VPROLVQ (V5...
_mm_rolv_epi64
VPROLVD (V5...
_mm_rolv_epi32
rotate right VPRORQ (V5...
_mm_ror_epi64
VPRORD (V5...
_mm_ror_epi32
VPRORVQ (V5...
_mm_rorv_epi64
VPRORVD (V5...
_mm_rorv_epi32

 

Byte Shift

128-bit
shift left logical PSLLDQ (S2
_mm_slli_si128
Sse2.ShiftLeftLogical128BitLane 
shift right logical PSRLDQ (S2
_mm_srli_si128
Sse2.ShiftRightLogical128BitLane 
packed align right PALIGNR (SS3
_mm_alignr_epi8
Ssse3.AlignRight 

 

Compare Strings

explicit length implicit length
return index PCMPESTRI (S4.2
_mm_cmpestri
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRI (S4.2
_mm_cmpistri
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz
return mask PCMPESTRM (S4.2
_mm_cmpestrm
_mm_cmpestra
_mm_cmpestrc
_mm_cmpestro
_mm_cmpestrs
_mm_cmpestrz
PCMPISTRM (S4.2
_mm_cmpistrm
_mm_cmpistra
_mm_cmpistrc
_mm_cmpistro
_mm_cmpistrs
_mm_cmpistrz

 

Others

LDMXCSR (S1
_mm_setcsr
Load MXCSR register
STMXCSR (S1
_mm_getcsr
Save MXCSR register state

PSADBW (S2
_mm_sad_epu8
Sse2.SumAbsoluteDifferences 
Compute sum of absolute differences
MPSADBW (S4.1
_mm_mpsadbw_epu8
Sse41.MultipleSumAbsoluteDifferences 
Performs eight 4-byte wide Sum of Absolute Differences operations to produce eight word integers.
VDBPSADBW (V5+BW...
_mm_dbsad_epu8
Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes

PMULHRSW (SS3
_mm_mulhrs_epi16
Ssse3.MultiplyHighRoundScale 
Packed Multiply High with Round and Scale

PHMINPOSUW (S4.1
_mm_minpos_epu16
Sse41.MinHorizontal 
Finds the value and location of the minimum unsigned word from one of 8 horizontally packed unsigned words. The resulting value and location (offset within the source) are packed into the low dword of the destination XMM register.
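
A small C# illustration of PHMINPOSUW (the input values are made up):

    // using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    Vector128<ushort> data = Vector128.Create((ushort)9, 3, 7, 3, 8, 6, 5, 4);
    Vector128<ushort> r = Sse41.MinHorizontal(data);   // PHMINPOSUW
    ushort minValue = r.GetElement(0);                 // 3
    ushort minIndex = r.GetElement(1);                 // 1 (index of the first occurrence)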

VPCONFLICTQ (V5+CD...
_mm512_conflict_epi64
VPCONFLICTD (V5+CD...
_mm512_conflict_epi32
Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/Register

VPLZCNTQ (V5+CD...
_mm_lzcnt_epi64
VPLZCNTD (V5+CD...
_mm_lzcnt_epi32
Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values

VFIXUPIMMPD* (V5...
_mm512_fixupimm_pd
VFIXUPIMMPS* (V5...
_mm512_fixupimm_ps
Fix Up Special Packed Float64/32 Values
VFPCLASSPD* (V5...
_mm512_fpclass_pd_mask
VFPCLASSPS* (V5...
_mm512_fpclass_ps_mask
Tests Types of Packed Float64/32 Values
VRANGEPD* (V5+DQ...
_mm_range_pd
VRANGEPS* (V5+DQ...
_mm_range_ps
Range Restriction Calculation For Packed Pairs of Float64/32 Values
VGETEXPPD* (V5...
_mm512_getexp_pd
VGETEXPPS* (V5...
_mm512_getexp_ps
Convert Exponents of Packed DP/SP FP Values to FP Values
VGETMANTPD* (V5...
_mm512_getmant_pd
VGETMANTPS* (V5...
_mm512_getmant_ps
Extract Float64/32 Vector of Normalized Mantissas from Float64/32 Vector

AESDEC (AESNI
_mm_aesdec_si128
Aes.Decrypt 
Perform an AES decryption round using a 128-bit state and a round key
AESDECLAST (AESNI
_mm_aesdeclast_si128
Aes.DecryptLast 
Perform the last AES decryption round using a 128-bit state and a round key
AESENC (AESNI
_mm_aesenc_si128
Aes.Encrypt 
Perform an AES encryption round using a 128-bit state and a round key
AESENCLAST (AESNI
_mm_aesenclast_si128
Aes.EncryptLast 
Perform the last AES encryption round using a 128-bit state and a round key
AESIMC (AESNI
_mm_aesimc_si128
Aes.InverseMixColumns 
Perform an inverse mix column transformation primitive
AESKEYGENASSIST (AESNI
_mm_aeskeygenassist_si128
Aes.KeygenAssist 
Assist the creation of round keys with a key expansion schedule
PCLMULQDQ (PCLMULQDQ
_mm_clmulepi64_si128
Pclmulqdq.CarrylessMultiply 
Perform carryless multiplication of two 64-bit numbers

SHA1RNDS4 (SHA
_mm_sha1rnds4_epu32
Perform Four Rounds of SHA1 Operation
SHA1NEXTE (SHA
_mm_sha1nexte_epu32
Calculate SHA1 State Variable E after Four Rounds
SHA1MSG1 (SHA
_mm_sha1msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA1 Message Dwords
SHA1MSG2 (SHA
_mm_sha1msg2_epu32
Perform a Final Calculation for the Next Four SHA1 Message Dwords
SHA256RNDS2 (SHA
_mm_sha256rnds2_epu32
Perform Two Rounds of SHA256 Operation
SHA256MSG1 (SHA
_mm_sha256msg1_epu32
Perform an Intermediate Calculation for the Next Four SHA256 Message Dwords
SHA256MSG2 (SHA
_mm_sha256msg2_epu32
Perform a Final Calculation for the Next Four SHA256 Message Dwords

VPBROADCASTMB2Q (V5+CD...
_mm_broadcastmb_epi64
VPBROADCASTMW2D (V5+CD...
_mm_broadcastmw_epi32
Broadcast Mask to Vector Register

VZEROALL (V1
_mm256_zeroall
Zero all YMM registers
VZEROUPPER (V1
_mm256_zeroupper
Zero upper 128 bits of all YMM registers

MOVNTPS (S1
_mm_stream_ps
Sse.StoreAlignedNonTemporal 
Non-temporal store of four packed single-precision floating-point values from an XMM register into memory
MASKMOVDQU (S2
_mm_maskmoveu_si128
Sse2.MaskMove 
Non-temporal store of selected bytes from an XMM register into memory
MOVNTPD (S2
_mm_stream_pd
Sse2.StoreAlignedNonTemporal 
Non-temporal store of two packed double-precision floating-point values from an XMM register into memory
MOVNTDQ (S2
_mm_stream_si128
Sse2.StoreAlignedNonTemporal 
Non-temporal store of double quadword from an XMM register into memory
LDDQU (S3
_mm_lddqu_si128
Sse3.LoadDquVector128 
Special 128-bit unaligned load designed to avoid cache line splits
MOVNTDQA (S4.1
_mm_stream_load_si128
Sse41.LoadAlignedVector128NonTemporal 
Provides a non-temporal hint that can cause adjacent 16-byte items within an aligned 64-byte region (a streaming line) to be fetched and held in a small set of temporary buffers ("streaming load buffers"). Subsequent streaming loads to other aligned 16-byte items in the same streaming line may be supplied from the streaming load buffer and can improve throughput.

VGATHERPFxDPS (V5+PF
_mm512_mask_prefetch_i32gather_ps
VGATHERPFxQPS (V5+PF
_mm512_mask_prefetch_i64gather_ps
VGATHERPFxDPD (V5+PF
_mm512_mask_prefetch_i32gather_pd
VGATHERPFxQPD (V5+PF
_mm512_mask_prefetch_i64gather_pd
x=0/1
Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint
VSCATTERPFxDPS (V5+PF
_mm512_prefetch_i32scatter_ps
VSCATTERPFxQPS (V5+PF
_mm512_prefetch_i64scatter_ps
VSCATTERPFxDPD (V5+PF
_mm512_prefetch_i32scatter_pd
VSCATTERPFxQPD (V5+PF
_mm512_prefetch_i64scatter_pd
x=0/1
Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint with Intent to Write

 

 

TIPS

TIP 1: Zero Clear

XOR instructions do the job for both integer and floating-point data.

Example: Zero all of 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1

        pxor         xmm1, xmm1

Example: Set 0.0f to 4 floats in XMM1

        xorps        xmm1, xmm1

Example: Set 0.0 to 2 doubles in XMM1

        xorpd        xmm1, xmm1
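
Example (C#, a sketch): the Zero property gives the same result, and the JIT typically emits the xor idiom for it.

    // using System.Runtime.Intrinsics;
    Vector128<int>    izero = Vector128<int>.Zero;     // 4 DWORDS = 0
    Vector128<float>  fzero = Vector128<float>.Zero;   // 4 floats = 0.0f
    Vector128<double> dzero = Vector128<double>.Zero;  // 2 doubles = 0.0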

 

TIP 2: Copy the lowest 1 element to other elements in XMM register

Shuffle instructions do the job.

Example: Copy the lowest float element to the other 3 elements in XMM1.

        shufps       xmm1, xmm1, 0

Example: Copy the lowest WORD element to the other 7 elements in XMM1

        pshuflw       xmm1, xmm1, 0
        pshufd        xmm1, xmm1, 0

Example: Copy the lower QWORD element to the upper element in XMM1

        pshufd        xmm1, xmm1, 44h     ; 01 00 01 00 B = 44h

Is this better?

        punpcklqdq    xmm1, xmm1
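
Example (C#, a sketch): the same broadcasts, assuming v is a Vector128<float> and w is a Vector128<short> initialized elsewhere (the word broadcast needs AVX2).

    // using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    Vector128<float> allFloats = Sse.Shuffle(v, v, 0);                 // SHUFPS v, v, 0
    Vector128<short> allWords  = Avx2.BroadcastScalarToVector128(w);   // VPBROADCASTW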

 

TIP 3: Integer Sign Extension / Zero Extension

Unpack instructions do the job.

Example: Zero extend 8 WORDS in XMM1 to DWORDS in XMM1 (lower 4) and XMM2 (upper 4).

        movdqa     xmm2, xmm1     ; src data WORD[7] [6] [5] [4] [3] [2] [1] [0]
        pxor       xmm3, xmm3     ; upper 16-bit to attach to each WORD = all 0
        punpcklwd  xmm1, xmm3     ; lower 4 DWORDS:  0 [3] 0 [2] 0 [1] 0 [0] 
        punpckhwd  xmm2, xmm3     ; upper 4 DWORDS:  0 [7] 0 [6] 0 [5] 0 [4]

Example: Sign extend 16 BYTES in XMM1 to WORDS in XMM1 (lower 8) and XMM2 (upper 8).

        pxor       xmm3, xmm3
        movdqa     xmm2, xmm1
        pcmpgtb    xmm3, xmm1     ; upper 8-bit to attach to each BYTE = src >= 0 ? 0 : -1
        punpcklbw  xmm1, xmm3     ; lower 8 WORDS
        punpckhbw  xmm2, xmm3     ; upper 8 WORDS

Example (intrinsics): Sign extend 8 WORDS in __m128i variable words8 to DWORDS in dwords4lo (lower 4) and dwords4hi (upper 4)

    const __m128i izero = _mm_setzero_si128();
    __m128i words8hi = _mm_cmpgt_epi16(izero, words8);
    __m128i dwords4lo = _mm_unpacklo_epi16(words8, words8hi);
    __m128i dwords4hi = _mm_unpackhi_epi16(words8, words8hi);
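
Example (C#, a sketch): the same sign extension of 8 WORDS, with the direct SSE4.1 route noted for the lower half (words8 is a hypothetical Vector128<short>).

    // using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    Vector128<short> words8hi = Sse2.CompareGreaterThan(Vector128<short>.Zero, words8); // 0 or -1 per element
    Vector128<int> dwords4lo = Sse2.UnpackLow(words8, words8hi).AsInt32();
    Vector128<int> dwords4hi = Sse2.UnpackHigh(words8, words8hi).AsInt32();
    // With SSE4.1 the lower 4 can be widened in one call: Sse41.ConvertToVector128Int32(words8)   // PMOVSXWD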

 

TIP 4: Absolute Values of Integers

If an integer value is positive or zero, it is already the absolute value. Otherwise, complementing all bits and then adding 1 gives the absolute value.

Example: Set absolute values of 8 signed WORDS in XMM1 to XMM1

                                  ; if src is positive or 0; if src is negative
        pxor      xmm2, xmm2      
        pcmpgtw   xmm2, xmm1      ; xmm2 <- 0              ; xmm2 <- -1
        pxor      xmm1, xmm2      ; xor with 0(do nothing) ; xor with -1(complement all bits)
        psubw     xmm1, xmm2      ; subtract 0(do nothing) ; subtract -1(add 1)

Example (intrinsics): Set absolute values of 4 DWORDS in __m128i variable dwords4 to dwords4

    const __m128i izero = _mm_setzero_si128();
    __m128i tmp = _mm_cmpgt_epi32(izero, dwords4);
    dwords4 = _mm_xor_si128(dwords4, tmp);
    dwords4 = _mm_sub_epi32(dwords4, tmp);
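
Example (C#, a sketch): the same trick (dwords4 is a hypothetical Vector128<int>).

    // using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    Vector128<int> tmp = Sse2.CompareGreaterThan(Vector128<int>.Zero, dwords4); // 0 or -1 per element
    dwords4 = Sse2.Xor(dwords4, tmp);        // complement the negative elements
    dwords4 = Sse2.Subtract(dwords4, tmp);   // add 1 to the complemented ones
    // With SSSE3 a single call does it: Ssse3.Abs(dwords4)   // PABSD, returns Vector128<uint>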

 

TIP 5: Absolute Values of Floating-Points

Floating-point values are not stored in two's complement, so just clearing the sign (highest) bit gives the absolute value.

Example: Set absolute values of 4 floats in XMM1 to XMM1

; data
              align   16
signoffmask   dd      4 dup (7fffffffH)       ; mask for clearing the highest bit
        
; code
        andps   xmm1, xmmword ptr signoffmask        

Example (intrinsics): Set absolute values of 4 floats in __m128 variable floats4 to floats4

        const __m128 signmask = _mm_set1_ps(-0.0f); // 0x80000000

        floats4 = _mm_andnot_ps(signmask, floats4);
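
Example (C#, a sketch): the same sign-bit clearing (floats4 is a hypothetical Vector128<float>).

    // using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    Vector128<float> signmask = Vector128.Create(-0.0f);   // 0x80000000 in every element
    floats4 = Sse.AndNot(signmask, floats4);               // ANDNPS: clears the sign bit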

 

TIP 6: Lacking some integer MUL instructions?

Signed versus unsigned makes a difference only for the calculation of the upper part. For the lower part, the same instruction can be used for both signed and unsigned operands.

unsigned WORD * unsigned WORD -> Upper WORD: PMULHUW, Lower WORD: PMULLW

signed WORD * signed WORD -> Upper WORD: PMULHW, Lower WORD: PMULLW
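
Example (C#, a sketch): a full signed 16x16 -> 32-bit product assembled from the low and high halves (a and b are hypothetical Vector128<short> values).

    // using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    Vector128<short> lo = Sse2.MultiplyLow(a, b);    // PMULLW: low 16 bits (same for signed and unsigned)
    Vector128<short> hi = Sse2.MultiplyHigh(a, b);   // PMULHW: high 16 bits of the signed product
    Vector128<int> products0to3 = Sse2.UnpackLow(lo, hi).AsInt32();   // interleave low/high into 32-bit products
    Vector128<int> products4to7 = Sse2.UnpackHigh(lo, hi).AsInt32();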

 

TIP 8: max / min

Bitwise operations after obtaining a mask by comparison do the job.

Example: Compare each signed DWORD in XMM1 and XMM2 and set smaller one to XMM1

; A=xmm1  B=xmm2                    ; if A>B        ; if A<=B
        movdqa      xmm0, xmm1
        pcmpgtd     xmm1, xmm2      ; xmm1=-1       ; xmm1=0
        pand        xmm2, xmm1      ; xmm2=B        ; xmm2=0
        pandn       xmm1, xmm0      ; xmm1=0        ; xmm1=A
        por         xmm1, xmm2      ; xmm1=B        ; xmm1=A

Example (intrinsics): Compare each signed byte in __m128i variables a, b and set larger one to maxAB

    __m128i mask = _mm_cmpgt_epi8(a, b);
    __m128i selectedA = _mm_and_si128(mask, a);
    __m128i selectedB = _mm_andnot_si128(mask, b);
    __m128i maxAB = _mm_or_si128(selectedA, selectedB);
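
Example (C#, a sketch): the same select-by-mask; on SSE4.1 a single Max call replaces it (a and b are hypothetical Vector128<sbyte> values).

    // using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    Vector128<sbyte> mask      = Sse2.CompareGreaterThan(a, b);   // -1 where a > b, else 0
    Vector128<sbyte> selectedA = Sse2.And(mask, a);
    Vector128<sbyte> selectedB = Sse2.AndNot(mask, b);
    Vector128<sbyte> maxAB     = Sse2.Or(selectedA, selectedB);
    // With SSE4.1: Sse41.Max(a, b)   // PMAXSB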

 

TIP 10: Set all bits

PCMPEQx instructions do the job.

Example: set -1 to all of the 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1.

        pcmpeqb         xmm1, xmm1
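
Example (C#, a sketch): comparing a register with itself yields all ones.

    // using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
    Vector128<byte> v = Vector128<byte>.Zero;
    Vector128<byte> allOnes = Sse2.CompareEqual(v, v);   // PCMPEQB: 0xFF in every byte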

 


ver 2019101400

Original content by daytime. Instruction names © Intel, .NET method names © Microsoft.

Generated by this script. More details (in Russian) here.