Removing Large Arrays

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Removing Large Arrays

Post by D Sceviour »

Deberger wrote: Wed Mar 11, 2020 12:39 am
D Sceviour wrote: Tue Mar 10, 2020 8:30 pm Often I test the speed of different implementations of a function. The results indicate the fixed array PushToEdges[] seems to be faster on average:

10 second test ...
PushToEdges 16519006400
push_to_edge 16610155136
The optimizer can remove what we are testing because the results are unused.

(warning: variable ‘d’ set but not used)

What happens if you propagate the edge values?

For example:

Code: Select all

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* PushToEdges[], push_to_edge() and time_in_ms() are defined in the
   surrounding test harness (not shown here). */

int main() {
  uint64_t c, d, e;
  int start;

  printf("10 second test ...\n");
  start = time_in_ms();
  e = 0;
  d = 0;

  while ((time_in_ms() - start) < 10000) {
       for (c = 0; c < 64; c++) {
          d += PushToEdges[c];
          e++;
       }
  }
  printf(" PushToEdges  %" PRIu64 "\n", e);
  printf(" prevent optimization removal %" PRIu64 "\n", d);
  start = time_in_ms();
  e = 0;
  d = 0;

  while ((time_in_ms() - start) < 10000) {
       for (c = 0; c < 64; c++) {
          d += push_to_edge(c);
          e++;
       }
  }
  printf(" push_to_edge %" PRIu64 "\n", e);
  printf(" prevent optimization removal %" PRIu64 "\n", d);
  return 0;
}
Excellent! This correction shows the array is much faster. I get:

10 second test ...
PushToEdges 15432851328
prevent optimization removal 954907675920
push_to_edge 6208815040
prevent optimization removal 409781792640
Terje
Posts: 347
Joined: Tue Nov 19, 2019 4:34 am
Location: https://github.com/TerjeKir/weiss
Full name: Terje Kirstihagen

Re: Removing Large Arrays

Post by Terje »

D Sceviour wrote: Wed Mar 11, 2020 3:20 am
Excellent! This correction shows the array is much faster. I get:

10 second test ...
PushToEdges 15432851328
prevent optimization removal 954907675920
push_to_edge 6208815040
prevent optimization removal 409781792640
This doesn't show what you want it to. As others have pointed out, the array won't always be in L1 cache like in your test. Accessing the array in-order is probably an advantage. Finally, the loops are vectorized by gcc - the performance is nothing like what you'd see in SF.
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: Removing Large Arrays

Post by Joerg Oster »

Terje wrote: Wed Mar 11, 2020 3:03 pm
This doesn't show what you want it to. As others have pointed out, the array won't always be in L1 cache like in your test. Accessing the array in-order is probably an advantage. Finally, the loops are vectorized by gcc - the performance is nothing like what you'd see in SF.
This array is only accessed frequently once Stockfish has reached a won endgame.
It is very likely the array will stay in L1 cache then!
Jörg Oster
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Removing Large Arrays

Post by D Sceviour »

Terje wrote: Wed Mar 11, 2020 3:03 pm This doesn't show what you want it to. As others have pointed out, the array won't always be in L1 cache like in your test. Accessing the array in-order is probably an advantage. Finally, the loops are vectorized by gcc - the performance is nothing like what you'd see in SF.
What I am trying to demonstrate is whether the array is faster than an inline calculation, and the results show it is in every variant I tried. I broke open the machine code to have a look with:

gcc -O -Wall -fverbose-asm -mtune=generic test.c -S

and found nothing unusual. The -O switch is essential when calling inline functions. "Generic" tuning is about the only way to keep the machine code readable; MMX optimizations and the many registers make assembly very difficult to follow now. I used to hate C because it was not as efficient as hand-written machine code, but these days gcc is about the only practical way to utilize modern hardware.

The system decides what to put in the cache and what to leave out. I have no idea how to issue direct orders to gcc to demand or ignore cache loading. The best that can be done is to vary qualifiers such as const, volatile and inline to influence the assembly. There are also defaults that can be overridden, such as "-ftree-loop-distribution".

Of course this cannot duplicate the results inside SF. The only way to do that is to insert the routine into the SF code and then compile. However, that is not necessary to demonstrate the reasoning.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Removing Large Arrays

Post by Dann Corbit »

Without demonstration inside the stockfish code, using a profiler to see where the time is really going, the trivial tests are meaningless.
When there are a hundred arrays and large data objects all competing for the cache, the code will behave very differently from when things are all sitting nicely in the cache.

It will also vary a lot from machine to machine. My 3970x has a very large cache, so many arrays will not be a bother. But on a machine with a small cache, it can be a big problem.

A fetch from main memory is very expensive, a fetch from the outer cache is far less expensive, a fetch from the inner cache is almost free. So accessing a table is fast or slow and the reason why is "It depends."
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Removing Large Arrays

Post by mar »

First you have to compare apples to apples: the patch contains functional changes and produces different values from the old tables. So you bake the new formula into tables (of bytes, not ints, which is four times smaller), then you can bench hundreds or thousands of relevant EG positions and let us know which one is faster.

The formula generates something like 20 times as much machine code and hugely increases register pressure, versus one cache line (when done properly) each time the EG code triggers (times two, since there are two tables/formulas). Can you measure the difference? I very much doubt it. Do you like the "simplified" code? I very much doubt that either.
Martin Sedlak
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Removing Large Arrays

Post by D Sceviour »

Dann Corbit wrote: Wed Mar 11, 2020 7:04 pm Without demonstration inside the stockfish code, using a profiler to see where the time is really going, the trivial tests are meaningless.
When there are a hundred arrays and large data objects all competing for the cache, the code will behave very differently from when things are all sitting nicely in the cache.

It will also vary a lot from machine to machine. My 3970x has a very large cache, so many arrays will not be a bother. But on a machine with a small cache, it can be a big problem.

A fetch from main memory is very expensive, a fetch from the outer cache is far less expensive, a fetch from the inner cache is almost free. So accessing a table is fast or slow and the reason why is "It depends."
It should be possible to emulate the internal processor timers. In the simplified code, the cache demand is singular so there is no interference from other threads. I get the same ratio results if threads are run in the background to overload the memory and processors. The system does not recognize "Stockfish"; it recognizes thread calls.

There is of course the prefetch() command to request cache loading. I use it because it makes things go faster, but I have never been clear on how it works. For example, how does it move the memory?

ram -> L3 -> L2 -> L1 -> register

And what happens if the loop asks for the data before it has been loaded into the cache? Does it wait in the queue for the main bus transfer, or does the system just jump to:

ram -> register

Or does all I/O go through one of the caches? It depends.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Removing Large Arrays

Post by Dann Corbit »

mar wrote: Wed Mar 11, 2020 7:40 pm First you have to compare apples to apples, i.e. the patch contains functional changes and produces different values than the old tables.
So you bake the new formula into tables (of bytes, not ints, that is 4 times smaller), then you can bench hundreds/thousands of relevant EG positions and let us know which one is faster.

the formula generates like 20 times as much machine code and hugely increases register pressure.
versus one cache line (when done properly) when the EG code triggers (times two since there are two tables/formulas). can you measure the difference? I very much doubt that. Do you like the "simplified" code? I very much doubt that either.
The new code is an abomination {to me, which means little}.

Stockfish is so crammed with magic numbers that it should be a Disney movie with wands and sorcerer's apprentices and bippity-boppity-boos.

My only point was that a 20 line program to compare the two methods is not valid. You have to profile both versions.

I do not defend the new code, and my personal preference that there should be meaningful constants or comments is also irrelevant.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Deberger
Posts: 91
Joined: Sat Nov 02, 2019 6:42 pm
Full name: ɹǝƃɹǝqǝᗡ ǝɔnɹꓭ

Re: Removing Large Arrays

Post by Deberger »

D Sceviour wrote: Wed Mar 11, 2020 3:20 am
Excellent! This correction shows the array is much faster. I get:

10 second test ...
PushToEdges 15432851328
prevent optimization removal 954907675920
push_to_edge 6208815040
prevent optimization removal 409781792640

Post-modern software development means never bothering to unit test,

preferring to squander hundreds of thousands of test games instead.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Removing Large Arrays

Post by bob »

There is one thing everyone is missing. Have you ever had a case where you removed a bit of code (code that is not even executed during the search) and found that the speed improved, or got worse? Have you ever added a bit of code (say a couple of checks to do something when a bug you are hunting happens) and the speed increases with the new code, and drops again when you remove it?

This is not a regular occurrence, but it happens with some frequency in MY testing. It has to do with alignment in memory, i.e. whether a change shifts all of the commonly used data into one cache line where it was previously split over two and the second line was getting overwritten somewhere. Some of these very small speed changes are purely random in nature. Not saying that has anything to do with the current test, but my point is this:

If you take a single piece of code and test it in isolation, you might get one result in terms of speed, while when you test the same code inside the engine, the speed might change significantly. Lots of cache references between execution points where the code is used will change the speed of that code since it is possible nothing stays in cache between two execution cycles.

Bottom line? I'd use the array method. This is certainly not going to be a multiple Elo gain or loss. It is more likely caused by something ELSE in the code that was changed at the same time. Otherwise it is a flaw in the testing methodology.