Hash speed and memory usage

cdani · Post by **cdani** » Tue Mar 04, 2014 3:02 pm

Hi.
In the last version of Andscacs I added a little hash of evaluations. It helps quite to win speed because the function of evaluation does a lot of things.
I tried to do it with dynamic allocation (calloc) but it’s quite slower than a fixed length array.
For the moment I keep a fixed array but small, 4MB, just to not disturb anything in the final user system.

So the questions are, if it’s normal that exists such a speed difference and it can be arranged in any way, and if it cannot be arranged, which you consider it’s the maximum size I can use for this array without considering that the engine is using unfair advantage over the other engines, or disturbing the system.

I understand that the best way is to do it dynamical, and allow the user/gui to configure the size.

As additional data, if I increase it to 8MB, the general speed increases easily 2% or more.

As a complementary question, as all the engines use memory for a lot of static variables and arrays, is there any non writed rule of what is considered maximum MB of this static memory?

Thanks!

bob · Post by **bob** » Tue Mar 04, 2014 5:38 pm

cdani wrote:Hi.
In the last version of Andscacs I added a little hash of evaluations. It helps quite to win speed because the function of evaluation does a lot of things.
I tried to do it with dynamic allocation (calloc) but it’s quite slower than a fixed length array.
For the moment I keep a fixed array but small, 4MB, just to not disturb anything in the final user system.

So the questions are, if it’s normal that exists such a speed difference and it can be arranged in any way, and if it cannot be arranged, which you consider it’s the maximum size I can use for this array without considering that the engine is using unfair advantage over the other engines, or disturbing the system.

I understand that the best way is to do it dynamical, and allow the user/gui to configure the size.

As additional data, if I increase it to 8MB, the general speed increases easily 2% or more.

As a complementary question, as all the engines use memory for a lot of static variables and arrays, is there any non writed rule of what is considered maximum MB of this static memory?

Thanks!

I am not sure I understand what you mean by dynamic. If you create such a hash table by a static declaration (global variable in the form of a large array) or if you use malloc() to grab the same amount of memory, you should see exactly zero difference in performance. If you mean allocating a new ENTRY each time you store something, that would be horribly inefficient.

cdani · Post by **cdani** » Tue Mar 04, 2014 9:04 pm

bob wrote: I am not sure I understand what you mean by dynamic. If you create such a hash table by a static declaration (global variable in the form of a large array) or if you use malloc() to grab the same amount of memory, you should see exactly zero difference in performance. If you mean allocating a new ENTRY each time you store something, that would be horribly inefficient.

I will explain with an example.

Code: Select all

int _tmain(int argc, _TCHAR* argv[])
{
	char * var_dynamic;
	int var_static[10];

	char var_index, var_resul;

	var_dynamic = (char *)malloc(10);

	var_index = 5;

	var_resul = var_dynamic[var_index];
	var_resul = var_static[var_index];

	return 0;
}

This code when in assembly translates to something like:

Code: Select all

	var_resul = var_dynamic[var_index];
009D13F3  movsx       eax,byte ptr [ebp-45h]  
009D13F7  mov         ecx,dword ptr [ebp-0Ch]  
009D13FA  mov         dl,byte ptr [ecx+eax]  
009D13FD  mov         byte ptr [ebp-51h],dl  
	var_resul = var_static[var_index];
009D1400  movsx       eax,byte ptr [ebp-45h]  
009D1404  mov         cl,byte ptr [ebp+eax*4-3Ch]  
009D1408  mov         byte ptr [ebp-51h],cl

There is one line plus of assembly code when using memory allocated by malloc, because it's necessary to load the address of the pointer. This should be true regardless of the compiler.
So it must be slower, and is noticeable slower in my tests, unless there is a workaround that I was not able to find.

Thanks!

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Mar 04, 2014 9:36 pm

There might be copy-on-write issues with calloc. Have you tried malloc and memset?
http://blogs.fau.de/hager/archives/825

Otherwise, beside of "chaotic" changes due to compiler optimization and code-alignment issues etc., there should be no big difference except loading the pointer which takes a register on x86-32, which is possibly nil on x86-64.

On Win32/64 memory sizes:
http://www.viva64.com/en/k/0036/

Antonio Torrecillas · Post by **Antonio Torrecillas** » Tue Mar 04, 2014 9:49 pm

note that your var_static is allocated in the stack, a pointer exist (the stack pointer ) that indicate the memory region your are looking for. Your var_dinamic doesn't benefit from this and need to be loaded first in a register. I'm not an assembler expert but this is probably the extra instruction that you've noticed. Defining var_static outside the main function must make this two approach similar.
have a look at http://www.agner.org/optimize/optimizing_cpp.pdf
there is so much things to learn

cdani · Post by **cdani** » Tue Mar 04, 2014 10:42 pm

Antonio Torrecillas wrote:note that your var_static is allocated in the stack, a pointer exist (the stack pointer ) that indicate the memory region your are looking for. Your var_dinamic doesn't benefit from this and need to be loaded first in a register. I'm not an assembler expert but this is probably the extra instruction that you've noticed. Defining var_static outside the main function must make this two approach similar.

Yes. The overhead defining inside or outside the function is the same; always need an instruction more to load the pointer, so it's slower. In Andscacs it's defined outside, so it's not in the stack.

Gerd Isenberg wrote:There might be copy-on-write issues with calloc. Have you tried malloc and memset?

Did not know this! It's very interesting. So I tested:

In milliseconds, depth 20 from start position, average of 3 measures for each case:
static 35064
malloc 35875
calloc 36085

Slightly better with malloc, but nowhere near static array.
Many times it's not only having to access the memory to load the pointer, but also using a register that will not be available for other functions, as you know.
I think this cannot be fixed easily, because will be necessary to change the assembly instructions dynamically on runtime to point to the allocated memory region.

cdani · Post by **cdani** » Tue Mar 04, 2014 11:25 pm

Just to highlight more this performance issue, even with 32 MB of hash and malloc I don't get the same performance gain as with 4MB! of static array.
It must be similar with the transposition table hash, I think.
But may be it's another thing that I'm not able to see...

bob · Post by **bob** » Tue Mar 04, 2014 11:45 pm

cdani wrote:
bob wrote: I am not sure I understand what you mean by dynamic. If you create such a hash table by a static declaration (global variable in the form of a large array) or if you use malloc() to grab the same amount of memory, you should see exactly zero difference in performance. If you mean allocating a new ENTRY each time you store something, that would be horribly inefficient.
I will explain with an example.
Code: Select all
int _tmain(int argc, _TCHAR* argv[])
{
	char * var_dynamic;
	int var_static[10];

	char var_index, var_resul;

	var_dynamic = (char *)malloc(10);

	var_index = 5;

	var_resul = var_dynamic[var_index];
	var_resul = var_static[var_index];

	return 0;
}
This code when in assembly translates to something like:
Code: Select all
	var_resul = var_dynamic[var_index];
009D13F3  movsx       eax,byte ptr [ebp-45h]  
009D13F7  mov         ecx,dword ptr [ebp-0Ch]  
009D13FA  mov         dl,byte ptr [ecx+eax]  
009D13FD  mov         byte ptr [ebp-51h],dl  
	var_resul = var_static[var_index];
009D1400  movsx       eax,byte ptr [ebp-45h]  
009D1404  mov         cl,byte ptr [ebp+eax*4-3Ch]  
009D1408  mov         byte ptr [ebp-51h],cl  
There is one line plus of assembly code when using memory allocated by malloc, because it's necessary to load the address of the pointer. This should be true regardless of the compiler.
So it must be slower, and is noticeable slower in my tests, unless there is a workaround that I was not able to find.

Thanks!

I don't see how 1 instruction can make it "noticeably slower". You only do this once per node at most. It takes a LOT of instructions to handle a single node. move generation, ordering, search recursion, ttable probe, repeat check, legality checks, evaluation.

In fact, you would make it faster if you got rid of the char nonsense. Char subscripts are bad from the get-go.

You are sort of apple and oranging things here too. That local variable is allocated on the stack, so that EBP always points close to it. Declare it globally outside the procedure... for the dynamic variable, malloc() it outside the procedure. Then compare to see how it looks.

Looks the same to me. I'd get rid of char subscripts to get rid of the mogsx stuff.

cdani · Post by **cdani** » Wed Mar 05, 2014 12:04 am

bob wrote: In fact, you would make it faster if you got rid of the char nonsense. Char subscripts are bad from the get-go.

Sorry if I did not explained it. I don't use an array of chars. This was just an example I wrote to show what the compiler does. But the result in assembly it's exactly the same as if you see what the compiler does when Stockfish accesses his TT.
What you suggest is exactly what my program does when I use malloc, and it's slower as you can see from the post I put before.
May be it's another thing, and not this added instruction.

cdani · Post by **cdani** » Wed Mar 05, 2014 6:32 am

I have found more or less a solution! I suppose it's related on how the cpu handles the memory.
It was enough to manually setting the hash memory to 0 with a for loop after doing the same with memset, and now it goes quite faster. Following the example I put before, now requires 35250 milliseconds. It's slower than with a static array, but a lot better.
I tried in different ways, and the faster was this dirty one, at least on my system:

Code: Select all

memset(thishash, 0, numofbytes);

char *j;
int i;
j=(char*)uintptr_t(thishash);
for (i=0;i<numofbytes;i++)
	  *j=1;
for (i=0;i<numofbytes;i++)
	  *j=0;

If I do with only the second for loop, it's slower. And if I add a third for loop, also it's slower.
With static array this trick it's not necessary, so clearly the system handles different the two types of memory.

But all those things sure will be different from system to system, so I suppose the best approach is to use static arrays for hash that will be little in size, and malloc hash for the bigger ones.

At last, may be this will be useful to someone, at least to avoid crashing his head

And now we return to one of the starting questions, about the size of static memory usage that you consider fair.

Thanks for your patience!

Hash speed and memory usage

Hash speed and memory usage

Re: Hash speed and memory usage

Re: Hash speed and memory usage

Re: Hash speed and memory usage

Re: Hash speed and memory usage

Re: Hash speed and memory usage

Re: Hash speed and memory usage

Re: Hash speed and memory usage

Re: Hash speed and memory usage

Re: Hash speed and memory usage