dzhao wrote:
Gerd Isenberg wrote:
I have the impression that accessing global variables in 64-bit mode becomes more expensive. There is no compact mode with 32-bit addresses. There is a rip-relative addressing mode, but assembly generated by vc2005 indicates a pointer is needed to access globals all the time. That applies to globals like static class members as well as statics inside the local scope of a function. Thus passing a board-, search- or equivalently a this-pointer around - even in a recursive search - might be faster than accessing globals. It might even make sense to keep all the constant data inside a one-time initialized, embedded non-static "const" member.
Code:
lea r10, base of some data_segment
mov rax, [r10 + offset global var]
Gerd

I noticed this when I played with x64 for the first time. I think the reason the compiler uses a register instead of an explicit constant to access a global is to reduce the code size. A 64-bit constant is 8 bytes, which is a large operand and no good for fast instruction decoding.
I don't think you need to put constants on the heap if you do multithreading and use a pointer to address a search tree or board.
In such a case r10 is already known (or initialized), that is, it is the board (or tree) pointer for a thread. The first load only executes once.

As far as I understand there are only two instructions which take an immediate 64-bit address, called moffset64, a direct memory offset that specifies a quadword (64-bit) operand in memory:
Code:
MOV RAX, moffset64    ; opcode A1
MOV moffset64, RAX    ; opcode A3
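To make the code-size argument concrete, here is a minimal sketch in C that writes out the bytes of the moffset64 load next to a rip-relative load of the same register. The function names encode_mov_rax_moffs64 and encode_mov_rax_riprel are made up for this illustration, and the byte layouts assume the usual 64-bit encodings (REX.W prefix 48, opcode A1 with an 8-byte address, versus opcode 8B with ModRM 05 and a 4-byte displacement).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode "mov rax, [moffset64]".
   Layout assumed: REX.W (0x48), opcode 0xA1, 8-byte absolute address,
   little-endian. Returns the instruction length in bytes. */
size_t encode_mov_rax_moffs64(uint8_t *out, uint64_t addr)
{
    assert(out != NULL);
    out[0] = 0x48;                       /* REX.W: 64-bit operand size */
    out[1] = 0xA1;                       /* MOV RAX, moffs64 */
    for (int i = 0; i < 8; ++i)
        out[2 + i] = (uint8_t)(addr >> (8 * i));
    return 10;
}

/* Hypothetical helper: encode "mov rax, [rip + disp32]".
   Layout assumed: REX.W (0x48), opcode 0x8B, ModRM 0x05
   (mod=00, reg=rax, rm=101), 4-byte displacement, little-endian. */
size_t encode_mov_rax_riprel(uint8_t *out, int32_t disp)
{
    assert(out != NULL);
    uint32_t u = (uint32_t)disp;
    out[0] = 0x48;                       /* REX.W */
    out[1] = 0x8B;                       /* MOV r64, r/m64 */
    out[2] = 0x05;                       /* ModRM: rip-relative, reg = rax */
    for (int i = 0; i < 4; ++i)
        out[3 + i] = (uint8_t)(u >> (8 * i));
    return 7;
}
```

Called with a 16-byte buffer, the first helper fills 10 bytes and the second 7, so under these assumptions the absolute 64-bit form costs three extra bytes per access.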
Otherwise the compiler relies on a ModRM/SIB mode or on the rip-relative addressing mode. A ModRM/SIB mode requires a base register, which is adjusted by the OS loader. Of course the lea is not always needed if the compiler keeps the pointer to global data across function boundaries. Either way, the register is "wasted" on accessing globals, similar to a this-pointer used to access "private" data. I managed to relax register pressure and speed up my code by eliminating all global data references in critical code: once-initialized constants are kept as embedded objects, accessible via the this-pointer, which I use anyway.
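A minimal sketch of that idea in plain C, where an explicit context pointer stands in for the this-pointer. All names here (SearchContext, fileMask, occupied, initContext, occupiedOnFile) are hypothetical and only illustrate the layout; they are not taken from any real engine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread context: the once-initialized constants live
   next to the mutable state, so a single base pointer reaches both. */
typedef struct SearchContext {
    uint64_t fileMask[8];   /* constants, filled in once by initContext */
    uint64_t occupied;      /* illustrative board state */
} SearchContext;

/* One-time initialization of the embedded constants. */
void initContext(SearchContext *ctx)
{
    assert(ctx != 0);
    for (int f = 0; f < 8; ++f)
        ctx->fileMask[f] = 0x0101010101010101ULL << f;
    ctx->occupied = 0;
}

/* Every access is "context pointer + small offset"; no global symbol
   is referenced, so no extra address register has to be set up. */
uint64_t occupiedOnFile(const SearchContext *ctx, int file)
{
    assert(ctx != 0 && file >= 0 && file < 8);
    return ctx->occupied & ctx->fileMask[file];
}
```

Since the pointer is passed down the search anyway, the constants come along for free; whether this actually beats rip-relative globals depends on the compiler and should be measured.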
Here is one sample with static data where the vc compiler for some reason does not emit a lea instruction, but instead uses memory operands four times, each with a four-byte displacement.
Code:
U32 popCount(U64 bb) {
    // masks[k] selects bit k of every byte
    static const U64 CACHE_ALIGN masks[8] = {
        C64(0x0101010101010101), C64(0x0202020202020202),
        C64(0x0404040404040404), C64(0x0808080808080808),
        C64(0x1010101010101010), C64(0x2020202020202020),
        C64(0x4040404040404040), C64(0x8080808080808080),
    };
    __m128i x0, x1, x2, x3, zr; U32 cnt;
    __m128i * pM = (__m128i*) masks;
    x0 = _mm_cvtsi64x_si128 ( bb );
    x0 = _mm_unpacklo_epi64 ( x0, x0 );  // bb in both 64-bit halves
    zr = _mm_setzero_si128();
    // ~bb & mask is zero in a byte exactly where that bit of bb is set
    x3 = _mm_andnot_si128 ( x0, pM[3] );
    x2 = _mm_andnot_si128 ( x0, pM[2] );
    x1 = _mm_andnot_si128 ( x0, pM[1] );
    x0 = _mm_andnot_si128 ( x0, pM[0] );
    // each byte becomes 0xFF (-1) where the tested bit is set
    x3 = _mm_cmpeq_epi8 ( x3, zr );
    x2 = _mm_cmpeq_epi8 ( x2, zr );
    x1 = _mm_cmpeq_epi8 ( x1, zr );
    x0 = _mm_cmpeq_epi8 ( x0, zr );
    // per-byte sums of -1, i.e. negated partial bit counts
    x2 = _mm_add_epi8 ( x2, x3 );
    x0 = _mm_add_epi8 ( x0, x1 );
    x0 = _mm_add_epi8 ( x0, x2 );
    x0 = _mm_sad_epu8 ( x0, zr );        // horizontal byte sum per 64-bit half
    // negate both partial sums; the final & 255 drops the multiples of 256
    cnt = -_mm_cvtsi128_si32( x0 )
          -_mm_extract_epi16( x0, 4 );
    return cnt & 255;
}
bb$ = 8
?popCount@@YAI_K@Z PROC
  00000  66 0f ef db               pxor        xmm3, xmm3
  00004  66 48 0f 6e d1            movd        xmm2, rcx
  00009  66 0f 6c d2               punpcklqdq  xmm2, xmm2
  0000d  66 0f 6f e2               movdqa      xmm4, xmm2
  00011  66 0f 6f c2               movdqa      xmm0, xmm2
  00015  66 0f 6f ca               movdqa      xmm1, xmm2
  00019  66 0f df 15 30 00 00 00   pandn       xmm2, XMMWORD PTR ?masks+48
  00021  66 0f df 25 00 00 00 00   pandn       xmm4, XMMWORD PTR ?masks
  00029  66 0f df 0d 20 00 00 00   pandn       xmm1, XMMWORD PTR ?masks+32
  00031  66 0f df 05 10 00 00 00   pandn       xmm0, XMMWORD PTR ?masks+16
  00039  66 0f 74 e3               pcmpeqb     xmm4, xmm3
  0003d  66 0f 74 c3               pcmpeqb     xmm0, xmm3
  00041  66 0f 74 cb               pcmpeqb     xmm1, xmm3
  00045  66 0f fc e0               paddb       xmm4, xmm0
  00049  66 0f 74 d3               pcmpeqb     xmm2, xmm3
  0004d  66 0f fc ca               paddb       xmm1, xmm2
  00051  66 0f fc e1               paddb       xmm4, xmm1
  00055  66 0f f6 e3               psadbw      xmm4, xmm3
  00059  66 0f 7e e0               movd        eax, xmm4
  0005d  66 0f c5 cc 04            pextrw      ecx, xmm4, 4
  00062  03 c8                     add         ecx, eax
  00064  f7 d9                     neg         ecx
  00066  0f b6 c1                  movzx       eax, cl
  00069  c3                        ret         0
?popCount@@YAI_K@Z ENDP
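A possible explanation for the missing lea can be read off the listing itself: the byte after each pandn opcode (15, 25, 0d, 05) is the ModRM byte, and in 64-bit mode mod=00 with rm=101 selects rip + disp32, which needs no base register. The following is a small sketch in C that decodes those bytes; the names ModRM, decodeModRM and isRipRelative are made up for this example, and the four-byte displacements in the listing are presumably still to be fixed up by the linker.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of an x86 ModRM byte. */
typedef struct ModRM {
    unsigned mod;   /* bits 7..6: addressing form */
    unsigned reg;   /* bits 5..3: register operand (here: xmm number) */
    unsigned rm;    /* bits 2..0: register or memory operand */
} ModRM;

/* Split a ModRM byte into its fields. */
ModRM decodeModRM(uint8_t b)
{
    ModRM m;
    m.mod = (unsigned)(b >> 6);
    m.reg = (unsigned)((b >> 3) & 7);
    m.rm  = (unsigned)(b & 7);
    return m;
}

/* In 64-bit mode, mod=00 and rm=101 means [rip + disp32]. */
int isRipRelative(uint8_t b)
{
    ModRM m = decodeModRM(b);
    return m.mod == 0 && m.rm == 5;
}
```

Applied to 0x15, 0x25, 0x0d and 0x05, isRipRelative returns true each time and the reg field gives xmm2, xmm4, xmm1 and xmm0, matching the four pandn lines. For 0xd2 from the punpcklqdq line it returns false, since that one is a register-to-register form.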