C++ optimizations

Discussion of chess software programming and technical issues.

Moderator: Ras

wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: C++ optimizations

Post by wgarvin »

lucasart wrote:So I don't understand where this massive speed gain is coming from.
I didn't read the code, but one possibility might be aliasing. Maybe the old code used pointers to generic types like int, while the new code used structs/enums in such a way that the compiler knows they can't alias in certain critical blocks of code, resulting in more efficient generated code? Just a wild guess.
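A minimal sketch of the kind of thing I mean (the function and struct names here are made up for illustration, not taken from the engine):

```cpp
#include <cassert>

// With two plain int pointers the compiler must assume they might point
// to the same object, so it has to reload *a after the store through b.
int through_ints(int* a, int* b) {
    *a = 1;
    *b = 2;
    return *a;  // 2 if a == b, 1 otherwise -- the compiler can't know
}

// With two distinct struct types, strict aliasing lets the compiler assume
// the stores cannot overlap, so it may fold the return into the constant 1.
struct From { int v; };
struct To   { int v; };
int through_structs(From* f, To* t) {
    f->v = 1;
    t->v = 2;
    return f->v;
}
```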

Oh: if you haven't tried it yet, try disabling C++ exceptions and see if that helps at all. They aren't "supposed" to cost anything, but some compilers can't optimize as well with them turned on. Turn off RTTI while you're at it. :)
lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C++ optimizations

Post by lucasart »

mcostalba wrote:Congratulations !

Actually I have tested with -std=c++11 but unfortunately no luck here :-(
I finally found the culprit. In DiscoCheck, my move struct is defined like this:

Code: Select all

struct move_t {
	uint16_t fsq:6, tsq:6;
	uint16_t promotion:3;
	uint16_t ep:1;
};
If I use instead:

Code: Select all

struct move_t {
	uint16_t fsq:8, tsq:8;
	uint16_t promotion:8;
	uint16_t ep:1;
};
the speed gain on a perft benchmark is already huge (56 M leaf/sec to 85 M leaf/sec)
And I get almost 100 M leaf/sec with this:

Code: Select all

struct move_t {
	unsigned fsq, tsq;
	int promotion;
	bool ep;
};
And the reason is obvious if you think about how the code runs in assembly language. Extracting m.fsq and m.tsq, for example, requires some bit shifting and masking in the first case, while in the second and third cases it's straightforward. In the second case, sizeof(move_t) == 4, and if the 32-bit register EAX contains a move_t m, then AL == m.fsq, AH == m.tsq, and m.promotion is the third byte of EAX (less commonly used; if there isn't a trick to get at it directly, a single shr EAX, 16 leaves m.promotion in AL).
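To make the point concrete, here is roughly what the compiler has to do for each layout, written out by hand (a sketch, not the DiscoCheck code; the bit order assumed here matches GCC/Clang on little-endian x86, since bitfield layout is implementation-defined):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The two packed layouts under discussion (field order as in the post).
struct move6 { uint16_t fsq:6, tsq:6, promotion:3, ep:1; };  // 2 bytes
struct move8 { uint32_t fsq:8, tsq:8, promotion:8, ep:1; };  // 4 bytes

// 6-bit layout: tsq is not byte-aligned, so reading it costs a shift
// and a mask on every access.
unsigned tsq_from_6bit(uint16_t raw) { return (raw >> 6) & 0x3F; }

// 8-bit layout: tsq is simply the second byte, so on x86 the "shift and
// mask" collapses to a read of the AH sub-register when m sits in EAX.
unsigned tsq_from_8bit(uint32_t raw) { return (raw >> 8) & 0xFF; }
```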

The problem is that sizeof(move_t) really has to be 2, for the purpose of packing hash entries into 16 bytes later, while retaining a sufficient number of bits from the hash key to avoid collisions (I use 62 bits in DiscoCheck to be on the safe side).
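For context, the packing constraint looks like this (a hypothetical entry layout, not the exact DiscoCheck one: with a 2-byte move there is room for 62 key bits plus the usual bookkeeping in 16 bytes):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16-byte hash entry: a 2-byte move leaves room for
// 62 bits of hash key plus depth/score/bound/age bookkeeping.
struct hash_entry {
    uint64_t key62 : 62;  // upper 62 bits of the Zobrist key
    uint64_t bound : 2;   // exact / lower / upper
    uint16_t move;        // the packed 2-byte move_t
    int16_t  score;
    uint8_t  depth;
    uint8_t  age;
    uint16_t pad;         // explicit padding: 8 + 2 + 2 + 1 + 1 + 2 = 16
};
```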

So, for a pure perft(), rather than a fully fledged engine, this is quite an optimization.

PS: Contrary to my initial (stupid) idea, this has nothing to do with C++11 or C++ vs C. The same code compiled in C99 or C++11 performed exactly the same (the C++ executable being slightly more bloated, and a tiny bit slower, as usual).
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: C++ optimizations

Post by Sven »

lucasart wrote:
mcostalba wrote:Congratulations !

Actually I have tested with -std=c++11 but unfortunately no luck here :-(
I finally found the culprit. In DiscoCheck, my move struct is defined like this:

Code: Select all

struct move_t {
	uint16_t fsq:6, tsq:6;
	uint16_t promotion:3;
	uint16_t ep:1;
};
If I use instead:

Code: Select all

struct move_t {
	uint16_t fsq:8, tsq:8;
	uint16_t promotion:8;
	uint16_t ep:1;
};
the speed gain on a perft benchmark is already huge (56 M leaf/sec to 85 M leaf/sec)
And I get almost 100 M leaf/sec with this:

Code: Select all

struct move_t {
	unsigned fsq, tsq;
	int promotion;
	bool ep;
};
And the reason is obvious if you think about how the code runs in assembly language. Extracting m.fsq and m.tsq, for example, requires some bit shifting and masking in the first case, while in the second and third cases it's straightforward. In the second case, sizeof(move_t) == 4, and if the 32-bit register EAX contains a move_t m, then AL == m.fsq, AH == m.tsq, and m.promotion is the third byte of EAX (less commonly used; if there isn't a trick to get at it directly, a single shr EAX, 16 leaves m.promotion in AL).
That is quite astonishing.

Fruit 2.1, for instance, uses a 16-bit move type (no bitfields, but explicit masking and shifting to access from, to etc.). I would be really surprised if you could get even close to this kind of (perft) speedup for Fruit by changing its move type into a 4-byte or even 16-byte struct. Based on my own experience I would say that this kind of change is good for a speed change of some 2-3%, but not more.

I hesitate to ask this, but are you really sure that this was the only change that caused that huge perft speedup?

Sven
abulmo
Posts: 151
Joined: Thu Nov 12, 2009 6:31 pm

Re: C++ optimizations

Post by abulmo »

Gian-Carlo Pascutto wrote:Profiling would shed light.
With which tools ?
Instrumenting profilers like gprof mostly measure the impact of the added instrumentation code and give numbers so far from reality that they mean nothing.
Statistical sampling tools like sysprof give... stats, which, as you know, are one of the three kinds of lies. They produce results that vary too randomly between runs to shed any light.

The only realistic way I have found is to make small changes and measure their impact with a chronometer (using accurate functions from a realtime library). Even that is hard and needs several runs to achieve better than 1% accuracy.
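In C++11 a monotonic clock is enough for this; here is a minimal sketch of the several-runs-keep-the-best approach (the workload is a stand-in for whatever you are measuring):

```cpp
#include <chrono>

// Run the workload several times and keep the best (least noisy)
// wall-clock time, using the monotonic std::chrono::steady_clock.
template <class F>
double best_of(int runs, F&& workload) {
    double best = 1e300;
    for (int i = 0; i < runs; i++) {
        auto t0 = std::chrono::steady_clock::now();
        workload();
        auto t1 = std::chrono::steady_clock::now();
        double sec = std::chrono::duration<double>(t1 - t0).count();
        if (sec < best)
            best = sec;
    }
    return best;
}
```

One caveat: store the workload's result into a volatile sink (or otherwise use it), or the compiler may optimize the whole loop away and you end up timing nothing.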
Richard
lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C++ optimizations

Post by lucasart »

Sven Schüle wrote: That is quite astonishing.

Fruit 2.1, for instance, uses a 16-bit move type (no bitfields, but explicit masking and shifting to access from, to etc.). I would be really surprised if you could get even close to this kind of (perft) speedup for Fruit by changing its move type into a 4-byte or even 16-byte struct. Based on my own experience I would say that this kind of change is good for a speed change of some 2-3%, but not more.

I hesitate to ask this, but are you really sure that this was the only change that caused that huge perft speedup?

Sven
Yes, I'm 100% certain. Beware of comparison with Fruit though. Fruit is not very fast (design choice), and is a fully fledged chess engine. I'm talking about a raw perft code here.

If you want to verify:

Code: Select all

$ git clone https://github.com/lucasart/chess
$ cd chess
$ ./make.sh
$ ./chess
Now modify fsq and tsq to 8 bits here, in board.h

Code: Select all

struct move_t {
	uint16_t fsq:6, tsq:6;
	uint16_t prom:2;	// 0=Knight...3=Queen
	uint16_t flag:2;	// 0=normal,1=ep,2=prom,3=castling
and run again

Code: Select all

./make.sh
./chess
Here are the results on my machine:

before
speed: 68205337 leaf/sec

after
speed: 118395027 leaf/sec

It's like DAY and NIGHT !
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
syzygy
Posts: 5722
Joined: Tue Feb 28, 2012 11:56 pm

Re: C++ optimizations

Post by syzygy »

abulmo wrote:
Gian-Carlo Pascutto wrote:Profiling would shed light.
With which tools ?
Instrumenting profilers like gprof mostly measure the impact of the added instrumentation code and give numbers so far from reality that they mean nothing.
Profile new and old and look at the difference.
Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: C++ optimizations

Post by Karlo Bala »

lucasart wrote:
Sven Schüle wrote: That is quite astonishing.

Fruit 2.1, for instance, uses a 16-bit move type (no bitfields, but explicit masking and shifting to access from, to etc.). I would be really surprised if you could get even close to this kind of (perft) speedup for Fruit by changing its move type into a 4-byte or even 16-byte struct. Based on my own experience I would say that this kind of change is good for a speed change of some 2-3%, but not more.

I hesitate to ask this, but are you really sure that this was the only change that caused that huge perft speedup?

Sven
Yes, I'm 100% certain. Beware of comparison with Fruit though. Fruit is not very fast (design choice), and is a fully fledged chess engine. I'm talking about a raw perft code here.

If you want to verify:

Code: Select all

$ git clone https://github.com/lucasart/chess
$ cd chess
$ ./make.sh
$ ./chess
Now modify fsq and tsq to 8 bits here, in board.h

Code: Select all

struct move_t {
	uint16_t fsq:6, tsq:6;
	uint16_t prom:2;	// 0=Knight...3=Queen
	uint16_t flag:2;	// 0=normal,1=ep,2=prom,3=castling
and run again

Code: Select all

./make.sh
./chess
Here are the results on my machine:

before
speed: 68205337 leaf/sec

after
speed: 118395027 leaf/sec

It's like DAY and NIGHT !
Instead of defining a move as a structure, perhaps you should try defining a move as an unsigned short (16 bits). I'd like to see the results.
Best Regards,
Karlo Balla Jr.
lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C++ optimizations

Post by lucasart »

Karlo Bala wrote:
lucasart wrote:
Sven Schüle wrote: That is quite astonishing.

Fruit 2.1, for instance, uses a 16-bit move type (no bitfields, but explicit masking and shifting to access from, to etc.). I would be really surprised if you could get even close to this kind of (perft) speedup for Fruit by changing its move type into a 4-byte or even 16-byte struct. Based on my own experience I would say that this kind of change is good for a speed change of some 2-3%, but not more.

I hesitate to ask this, but are you really sure that this was the only change that caused that huge perft speedup?

Sven
Yes, I'm 100% certain. Beware of comparison with Fruit though. Fruit is not very fast (design choice), and is a fully fledged chess engine. I'm talking about a raw perft code here.

If you want to verify:

Code: Select all

$ git clone https://github.com/lucasart/chess
$ cd chess
$ ./make.sh
$ ./chess
Now modify fsq and tsq to 8 bits here, in board.h

Code: Select all

struct move_t {
	uint16_t fsq:6, tsq:6;
	uint16_t prom:2;	// 0=Knight...3=Queen
	uint16_t flag:2;	// 0=normal,1=ep,2=prom,3=castling
and run again

Code: Select all

./make.sh
./chess
Here are the results on my machine:

before
speed: 68205337 leaf/sec

after
speed: 118395027 leaf/sec

It's like DAY and NIGHT !
Instead of defining a move as a structure, perhaps you should try defining a move as an unsigned short (16 bits). I'd like to see the results.
I think that's a waste of time. It basically means you have to do by hand all the bit shifting and masking that the compiler otherwise does (or doesn't have to do at all, using 8-bit sub-registers, as I mentioned before). I really doubt you'll beat the compiler, but you can try.
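For the record, the manual version would look something like this (a sketch; the helper names are mine, and the shift amounts match the 6/6/2/2 layout from board.h), i.e. exactly the shifting and masking the compiler already emits for the bitfields:

```cpp
#include <cassert>
#include <cstdint>

typedef uint16_t move16;

// Hand-rolled 16-bit move: fsq (6 bits), tsq (6 bits), prom (2), flag (2).
inline move16 encode(unsigned fsq, unsigned tsq, unsigned prom, unsigned flag) {
    return (move16)(fsq | (tsq << 6) | (prom << 12) | (flag << 14));
}
inline unsigned fsq_of (move16 m) { return m & 0x3F; }
inline unsigned tsq_of (move16 m) { return (m >> 6) & 0x3F; }
inline unsigned prom_of(move16 m) { return (m >> 12) & 3; }
inline unsigned flag_of(move16 m) { return m >> 14; }
```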
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: C++ optimizations

Post by Karlo Bala »

lucasart wrote:
I think that's a waste of time. It basically means you have to do by hand all the bit shifting and masking that the compiler otherwise does (or doesn't have to do at all, using 8-bit sub-registers, as I mentioned before). I really doubt you'll beat the compiler, but you can try.
The question is whether you can beat the compiler. :)

Bitfields are the first thing everybody tries. Why do you think almost no one uses a structure for move representation?

http://www.talkchess.com/forum/viewtopi ... 63&t=38939
Best Regards,
Karlo Balla Jr.
lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C++ optimizations

Post by lucasart »

Karlo Bala wrote:
lucasart wrote:
I think that's a waste of time. It basically means you have to do by hand all the bit shifting and masking that the compiler otherwise does (or doesn't have to do at all, using 8-bit sub-registers, as I mentioned before). I really doubt you'll beat the compiler, but you can try.
The question is whether you can beat the compiler. :)

Bitfields are the first thing everybody tries. Why do you think almost no one uses a structure for move representation?

http://www.talkchess.com/forum/viewtopi ... 63&t=38939
Feel free to try for yourself, but I claim that you're wrong until proven otherwise. Perhaps that was true with the compilers of ten years ago, but not anymore. Note how GCC optimized the change from 6 to 8 bits, recognizing that it can use 8-bit sub-registers instead of doing bit shifting and masking.

My recommendation is as follows:

- let's say you define a struct like this

Code: Select all

struct move_t {
  uint32_t fsq:8, tsq:8;
  uint32_t prom:2;
  uint32_t flag:2;
};
- now when you have a move_t m at the start of a function, do not use m.fsq, m.tsq, m.flag, m.prom more than once each (except in rare if conditions where it doesn't matter). Typically you'll use m.fsq and m.tsq a lot, but perhaps m.flag and m.prom only once or twice in a rare branch. Instead, start like this:

Code: Select all

const int fsq = m.fsq, tsq = m.tsq;
and *then* use fsq and tsq normally.

- the "syntactic sugar" introduced by the language hides potentially expensive operations; do not forget that! It's just like using the C++ STL: do it right and you achieve the same performance as plain C; do it wrong and your performance is like Java written by a noob (that's like double noob!)
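Put together, the pattern looks like this (a hypothetical function: Board and piece_on are only for illustration):

```cpp
#include <cassert>
#include <cstdint>

struct move_t {
    uint32_t fsq:8, tsq:8;
    uint32_t prom:2;
    uint32_t flag:2;
};

struct Board { int piece_on[64]; };  // illustrative board: one piece code per square

// Extract each bitfield exactly once into a plain local, then use the
// locals: every later use is a cheap register read, not a shift-and-mask.
int moved_piece(const Board& b, move_t m) {
    const int fsq = m.fsq, tsq = m.tsq;
    (void)tsq;  // a real make_move() would use tsq the same way
    return b.piece_on[fsq];
}
```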
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.