Removing Large Arrays

Ovyron · Post by **Ovyron** » Tue Mar 10, 2020 5:52 pm

At least they shouldn't call it "simplification", as this doesn't simplify anything under any definition. Looks like change for the sake of change.

RubiChess · Post by **RubiChess** » Tue Mar 10, 2020 6:03 pm

Look at the discussion here: https://github.com/official-stockfish/S ... /pull/2570
Developers seem to agree that equation is faster than memory lookup to a table. But we agree that is not what we would call 'simplification'.

mar · Post by **mar** » Tue Mar 10, 2020 7:19 pm

no way the formula is faster if the table is in L1
the patch seems to do more than what was posted here, we were misled.
it replaces two tables plus it allegedly fixes some EG bug, so there are functional changes (also tables differ from formulas).
so it's not about performance at all.

mvanthoor · Post by **mvanthoor** » Tue Mar 10, 2020 8:13 pm

AndrewGrant wrote: ↑Tue Mar 10, 2020 1:41 pm I've long disagreed with a certain users string of "simplifications"....

if it succeeds you gain nothing but obfuscate a complex code base even further.

In the past, I've often noticed that C programmers tend to try and make code as brief as possible.

I can understand why you would to this back in the old days of the 90's, 80's or even before. (I'm just old enough to have experienced the tail end of the 80's with regard to computers.) Text screens were only 80 columns wide, and 25 lines high. Code completion wasn't as good as it is now, or didn't even exist. Compilers weren't as smart. Shorter code meant much more code on screen and faster editing. Less variables often meant more speed and less memory usage. With current IDE's and compilers, those things are often moot points.

A tiny piece of the magic bitboard generation/initialization in my engine (under development) looks like this:

Code: Select all

        let mask = if is_rook {
        	create_rook_mask(square)
        } else {
            create_bishop_mask(square)
        };
        let bits = mask.count_ones();
        let permutations = 2u64.pow(bits);
        let end_offset = offset + permutations - 1;
        ...

I could also have written it like this (if it's short enough, Rust will auto-format the if-statement as a ternary operator):

Code: Select all

let m = if isr { crrm(s) } else { crbm(s) };
let b = m.count_ones();
let p = 2u64.pow(b);
let e = o + p - 1;

or even shorter...

Code: Select all

let m = if isr { crrm(s) } else { crbm(s) };
let e = o + 2u64.pow(m.count_ones()) - 1;

There is no reason to write my code in the second way, let alone the third. Nobody (not even myself) will ever understand version 2 or 3 of this code, as opposed to version 1, where it's completely clear that I'm calculating the end of a range that starts at offset and ends at end_offset, and this range has something to do with all possible permutations of the rook mask for a certain square. (Further down in the code it'll be clear that each permutation of the mask is a blocker board, that I'm generating the attack board, and that the index of the attack board is going to fall from offset up to and including end_offset within the attack table.)

The compiler will do all of the inlines, stripping, and optimizations for me.

Watch this video, where the speaker actually shows how it works. A piece of code containing methods and structs and more stuff to make code readable and manageable for _humans_, is stripped down to "3 + input" by the compiler.

Because of that, I'm all for clarity over brevity, and clarity over cleverness.

So I agree: replacing that table with a function is a bad idea, because I can actually understand and surmise at first glance what that table is doing (especially when I know that this table comes from a chess program), but I can't do the same with the function.

I even disagree with the short syntax Rust uses:

fn: why not function?
impl: why not "implement"?
Vec and vec! (a type, and a macro to fill such a type): why not Vector and vector! ?
Implicit returns when the last line of a function is an expression:

So instead of (some random pseudo-code here):

Code: Select all

fn get_stuff() -> u64 {
	let mut number = random::generate();
	number += current_time_stamp();
	return hash(number);
}

The linter will complain that you're returning a variable that you didn't use for anything else but temporary storage, and it will suggest to remove it, and the return as well, because it's unneeded:

Code: Select all

fn get_stuff() -> u64 {
	hash(random::generate() + current_time_stamp())
}

And don't you dare put a ";" at the end of that line, because that would make it a statement, instead of a returnable expression, and you'd get the error that your function isn't returning a u64. IMHO, things like that are brevity for brevity's sake, and it creates cryptic code.

Everyone has to program the way they like it, especially if a project is personal. That's even true if it's an open source project, if one is the originator or even sole maintainer. People can join, or not, at their discretion. But, I personally think that writing code more clearly with less brevity is the better way to go if cooperation and having contribution from others to advance the project faster are the main goals.

D Sceviour · Post by **D Sceviour** » Tue Mar 10, 2020 8:30 pm

Often, I test the speed of different function methods. The results indicate the fixed array PushToEdges[] seems to be faster on average:

10 second test ...
PushToEdges 16519006400
push_to_edge 16610155136

Code: Select all

// gcc -O3 -Wall -m64 test.c -o test -lm -s
// speed test method for different functions

#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define rank_of(sq)                  ((sq) >> 3)
#define file_of(sq)                  ((sq) & 0x7)
#define _max(a, b)                   ((a) >= (b) ? (a) : (b))
#define _min(a, b)                   ((a) < (b) ? (a) : (b))

const int PushToEdges[64] = {
    100, 90, 80, 70, 70, 80, 90, 100,
     90, 70, 60, 50, 50, 60, 70,  90,
     80, 60, 40, 30, 30, 40, 60,  80,
     70, 50, 30, 20, 20, 30, 50,  70,
     70, 50, 30, 20, 20, 30, 50,  70,
     80, 60, 40, 30, 30, 40, 60,  80,
     90, 70, 60, 50, 50, 60, 70,  90,
    100, 90, 80, 70, 70, 80, 90, 100
  };

enum { RANK_1, RANK_2, RANK_3, RANK_4, RANK_5, RANK_6, RANK_7, RANK_8, N_RANK };
enum { FILE_A, FILE_B, FILE_C, FILE_D, FILE_E, FILE_F, FILE_G, FILE_H, N_FILE };

inline int edge_distance_f(int f) { return _min(f, (FILE_H - f)); } 
inline int edge_distance_r(int r) { return _min(r, (RANK_8 - r)); }

inline int push_to_edge(int s) {
   int rd = edge_distance_r(rank_of(s)), fd = edge_distance_f(file_of(s));
   return 90 - (7 * fd * fd / 2 + 7 * rd * rd / 2);
}

uint64_t time_in_ms() {
struct timespec t;
  clock_gettime(CLOCK_MONOTONIC, &t);
  return (uint64_t)t.tv_sec * 1000 + t.tv_nsec / 1000000;
}

int main() {
uint64_t c,d,e;
int start;

  printf("10 second test ...\n");
  start = time_in_ms();
  e = 0;

  while((time_in_ms() - start) < 10000) {
       for (c = 0; c < 64; c++) {
          d = PushToEdges[c];
          e++;
       }
  }
  printf(" PushToEdges  %I64d\n",e);
  start = time_in_ms();
  e = 0;

  while((time_in_ms() - start) < 10000) {
       for (c = 0; c < 64; c++) {
          d = push_to_edge(c);
          e++;
       }
  }
  printf(" push_to_edge %I64d\n",e);
  return 0;
}

mar · Post by **mar** » Tue Mar 10, 2020 8:39 pm

D Sceviour wrote: ↑Tue Mar 10, 2020 8:30 pm Often, I test the speed of different function methods. The results indicate the fixed array PushToEdges[] seems to be faster on average:

10 second test ...
PushToEdges 16519006400
push_to_edge 16610155136

nonsense, the compiler completely eliminates the inner loops, so it boils down to nothing (busy-loop for 10 seconds).

EDIT: making the counters volatile, I got this:

Code: Select all

10 second test ...
 PushToEdges  3889291392
 push_to_edge 2689210048

since you count loops, more is better so table-based version (if in cache) is indeed significantly faster
the problem is that the patch probably succeeded due to a bugfix, so the "simplification" made it into the code undeservedly it seems.

D Sceviour · Post by **D Sceviour** » Tue Mar 10, 2020 8:55 pm

mar wrote: ↑Tue Mar 10, 2020 8:39 pm
D Sceviour wrote: ↑Tue Mar 10, 2020 8:30 pm Often, I test the speed of different function methods. The results indicate the fixed array PushToEdges[] seems to be faster on average:

10 second test ...
PushToEdges 16519006400
push_to_edge 16610155136
nonsense, the compiler completely eliminates the inner loops, so it boils down to nothing (busy-loop for 10 seconds).

It is not nonsense. For one thing, you can understand the busy-loop. I only spent about 20 minutes writing this example, and there are better ways to do it. The timer test could be performed with much less frequency to give the busy-loop more activity. Still, I have been using this type of method for a long time and have no complaints in the methodology. The method does not describe the effects of internal optimization of the caches which could have an effect on results. However, that would be more due to hardware than software.

EDIT: I see your edit now, and am surpised you get even better results.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Mar 10, 2020 9:16 pm

D Sceviour wrote: ↑Tue Mar 10, 2020 8:30 pm Often, I test the speed of different function methods. The results indicate the fixed array PushToEdges[] seems to be faster on average:

10 second test ...
PushToEdges 16519006400
push_to_edge 16610155136

In a real chess program with many instances fighting for the caches, things may turn out differently than a tiny loop test suggests.
Even small array lookups will throw other cachelines out of L1, computation may be done concurrently with other instructions improving ipc.

D Sceviour · Post by **D Sceviour** » Tue Mar 10, 2020 10:00 pm

Gerd Isenberg wrote: ↑Tue Mar 10, 2020 9:16 pm
D Sceviour wrote: ↑Tue Mar 10, 2020 8:30 pm Often, I test the speed of different function methods. The results indicate the fixed array PushToEdges[] seems to be faster on average:

10 second test ...
PushToEdges 16519006400
push_to_edge 16610155136
In a real chess program with many instances fighting for the caches, things may turn out differently than a tiny loop test suggests.
Even small array lookups will throw other cachelines out of L1, computation may be done concurrently with other instructions improving ipc.

I agree that optimization is a problem for comparison of small arrays and small loops. However, I used the timer type test to compare large syzygy reads with coded endgame table base calculations and demonstrated syzygy was about 30,000 times slower.

In the posted code, the volatile variable is similar to deleting the -O3 optimization.

Deberger · Post by **Deberger** » Wed Mar 11, 2020 12:39 am

D Sceviour wrote: ↑Tue Mar 10, 2020 8:30 pm Often, I test the speed of different function methods. The results indicate the fixed array PushToEdges[] seems to be faster on average:

10 second test ...
PushToEdges 16519006400
push_to_edge 16610155136

The optimizer can remove what we are testing because the results are unused.

(warning: variable ‘d’ set but not used)

What happens if you propagate the edge values?

For example:

Code: Select all

int main() {
uint64_t c,d,e;
int start;

  printf("10 second test ...\n");
  start = time_in_ms();
  e = 0;
  d = 0;

  while((time_in_ms() - start) < 10000) {
       for (c = 0; c < 64; c++) {
          d += PushToEdges[c];
          e++;
       }
  }
  printf(" PushToEdges  %ld\n",e);
  printf(" prevent optimization removal %ld\n",d);
  start = time_in_ms();
  e = 0;
  d = 0;

  while((time_in_ms() - start) < 10000) {
       for (c = 0; c < 64; c++) {
          d += push_to_edge(c);
          e++;
       }
  }
  printf(" push_to_edge %ld\n",e);
  printf(" prevent optimization removal %ld\n",d);
  return 0;
}

Removing Large Arrays

Re: Removing Large Arrays

Re: Removing Large Arrays

Re: Removing Large Arrays

Re: Removing Large Arrays

Re: Removing Large Arrays

Re: Removing Large Arrays

Re: Removing Large Arrays

Re: Removing Large Arrays

Re: Removing Large Arrays

Re: Removing Large Arrays