C++ optimizations

Discussion of chess software programming and technical issues.

Moderator: Ras

Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: C++ optimizations

Post by Karlo Bala »

lucasart wrote:
wgarvin wrote:
syzygy wrote:
Linus wrote: So my gut feel is that getting rid of the bitfield as early as possible, and turning all bitfield accesses into regular load/shift/mask/store operations is always the right thing to do.
Yeah... he was talking about what the gcc compiler writers should do in their optimizer, but oddly enough it sounds like good advice for us to follow too: if you want the fastest engine, don't use the compiler's bitfields, but do your own shifting and masking instead. :P

Of course you should use macros or inline functions to keep that nastiness just in one place, and have clear readable code everywhere else.
OK, I decided to experiment. Here are my results, expressed in leaf/sec on a varied perft() benchmark:

1/ struct with compiler handled bitfields

Code: Select all

struct move_t {
	uint16_t fsq:6, tsq:6;
	uint16_t prom:2;	// 0=Knight...3=Queen
	uint16_t flag:2;	// 0=normal,1=ep,2=prom,3=castling

	bool operator== (move_t m) const {
		return fsq == m.fsq && tsq == m.tsq && flag == m.flag;
	}
	
	int get_prom() const {
		assert(flag == PROMOTION);
		return prom + KNIGHT;
	}
	
	void set_prom(int piece) {
		assert(KNIGHT <= piece && piece <= QUEEN);
		prom = piece - KNIGHT;
	}
};
speed: 69563522 leaf/sec

2/ class with encapsulation

A preparatory step before switching the class from compiler bitfields to manual ones (see 3/). As expected, there is absolutely zero overhead:

Code: Select all

class move_t {
	uint16_t fsq:6, tsq:6;
	uint16_t prom:2;	// 0=Knight...3=Queen
	uint16_t flag:2;	// 0=normal,1=ep,2=prom,3=castling

public:
	bool operator== (move_t m) const {
		return fsq == m.get_fsq() && tsq == m.get_tsq() && flag == m.get_flag();
	}
	
	int get_fsq() const { return fsq; }
	int get_tsq() const { return tsq; }
	int get_flag() const { return flag; }

	void set_fsq(unsigned _fsq) { fsq = _fsq; }
	void set_tsq(unsigned _tsq) { tsq = _tsq; }
	void set_flag(unsigned _flag) { flag = _flag; }
	
	int get_prom() const {
		assert(flag == PROMOTION);
		return prom + KNIGHT;
	}
	
	void set_prom(int piece) {
		assert(KNIGHT <= piece && piece <= QUEEN);
		prom = piece - KNIGHT;
	}
};
speed: 68930908 leaf/sec

3/ manual bitfields

It's quite ugly, and the compiler warns about uninitialized variables, so I had to add a default ctor (with an infinitesimal and unmeasurable performance cost, so don't worry about that). At least the ugliness is kept inside the class, and the rest of the code doesn't have to be ugly:

Code: Select all

class move_t {
	/* 16 bit field:
	 * fsq = 0..5
	 * tsq = 6..11
	 * prom = 12,13 (0=Knight..3=Queen)
	 * flag = 14,15 (0=NORMAL...3=CASTLING)
	 * */
	uint16_t mask;

public:
	move_t(): mask(0) {}	// silence compiler warnings
	bool operator== (move_t m) const { return mask == m.mask; }
	
	int get_fsq() const { return mask & 0x3f; }
	int get_tsq() const { return (mask >> 6) & 0x3f; }
	int get_flag() const { return (mask >> 14) & 3; }

	void set_fsq(unsigned fsq)	 { mask &= 0xffc0; mask ^= fsq; }
	void set_tsq(unsigned tsq)	 { mask &= 0xf03f; mask ^= (tsq << 6); }
	void set_flag(unsigned flag) { mask &= 0x3fff; mask ^= (flag << 14); }
	
	int get_prom() const {
		assert(get_flag() == PROMOTION);
		return ((mask >> 12) & 3) + KNIGHT;
	}
	
	void set_prom(int piece) {
		assert(KNIGHT <= piece && piece <= QUEEN);
		mask &= 0xcfff;
		mask ^= (piece - KNIGHT) << 12;
	}
};
speed: 93625373 leaf/sec
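
A quick way to convince yourself that the hand-written hex masks match the layout comment is a round-trip check like the following. This snippet is illustrative only and not part of the benchmark; PROMOTION, KNIGHT and QUEEN are the same constants used above, and 52/60 are e7/e8 assuming a1 = 0.

Code: Select all

move_t m;
m.set_fsq(52);		// e7
m.set_tsq(60);		// e8
m.set_flag(PROMOTION);
m.set_prom(QUEEN);

assert(m.get_fsq() == 52);
assert(m.get_tsq() == 60);
assert(m.get_flag() == PROMOTION);
assert(m.get_prom() == QUEEN);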

I am shocked by the poor performance of GCC here! My struct/class with compiler-handled bitfields didn't incur any undue sign extension, so it should have been just as efficient as manually handled bitfields. But admittedly you guys were right: GCC sucks at bitfields, big time!

But I still don't like doing this. It's like fixing your code instead of fixing the compiler. I don't see what prevents the GNU guys from doing it right in GCC, so that people don't have to write this kind of code. This is so 80s... We're in 2012 :cry:

I wonder if other compilers do better here. I pushed the patches 2/ and 3/ on my github:
https://github.com/lucasart/chess

PS: 94 M leaf/sec with a 16-bit move_t on an Intel Dual Core E5300 @ 2.6 GHz (one thread, no hash table cheating) is pretty damned fast perft :twisted:
You should get a small speed up when you need to set or get all fields together.

Code: Select all

void set_move (unsigned fsq, unsigned tsq, unsigned prom, unsigned flag)
{
   mask = ((((((flag<<2)|prom)<<6)|tsq)<<6)|fsq);
}

void get_move (unsigned& fsq, unsigned& tsq, unsigned& prom, unsigned& flag)
{
   flag = mask;
   fsq = flag & 0x3f;
   flag >>= 6;
   tsq = flag & 0x3f;
   flag >>= 6;
   prom = flag & 0x3;
   flag >>= 2;
}
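
For example, a move generator could then pack a move with one combined store instead of three masked ones. A minimal usage sketch, assuming set_move/get_move are members of move_t and using e2 = 12, e4 = 28 (a1 = 0):

Code: Select all

move_t m;
m.set_move(12, 28, 0, 0);		// e2-e4, no promotion, normal flag

unsigned fsq, tsq, prom, flag;
m.get_move(fsq, tsq, prom, flag);	// one load, fields unpacked locally
assert(fsq == 12 && tsq == 28 && prom == 0 && flag == 0);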

Best Regards,
Karlo Balla Jr.
lucasart
Posts: 3242
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C++ optimizations

Post by lucasart »

Karlo Bala wrote: You should get a small speed up when you can set or get all fields together.
Yeah, I thought about that too. But it would incur quite a bit of rewriting. Maybe I should use my 16-bit class with a single getter/setter that outputs to / inputs from another struct like this:

Code: Select all

struct FatMove {
  int fsq, tsq, prom, flag;
};
For example in a function that returns a move_t I could write

Code: Select all

{
  FatMove m;
  /* use m normally without rewriting anything */
  return move_t(m);  // conversion done at once here
}
And generally work with "fat moves" and dump them in one go into a "slim move" move_t like

Code: Select all

m.set_move(fat_move);
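
A minimal sketch of what that conversion might look like, reusing the setters from 3/. The FatMove constructor and the set_move(FatMove) overload below are hypothetical, not code from the repo, and fm.prom is assumed to hold a piece in the KNIGHT..QUEEN range as set_prom expects:

Code: Select all

class move_t {
	uint16_t mask;	// same 16-bit layout as in 3/

public:
	move_t(): mask(0) {}

	explicit move_t(const FatMove& fm): mask(0) {
		set_fsq(fm.fsq);
		set_tsq(fm.tsq);
		set_flag(fm.flag);
		if (fm.flag == PROMOTION)
			set_prom(fm.prom);	// fm.prom assumed in KNIGHT..QUEEN
	}

	void set_move(const FatMove& fm) { *this = move_t(fm); }

	// set_fsq/set_tsq/set_flag/set_prom and the getters exactly as in 3/
};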
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
ZirconiumX
Posts: 1361
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: C++ optimizations

Post by ZirconiumX »

lucasart wrote: In an ideal world that should be suppressed by a statement like

Code: Select all

__assume(fsq < 64);
I know that MSVC has the __assume keyword. I don't know how to do it with GCC, and whether the optimizer would use this information efficiently anyway.
If you are willing to abandon strict ISO C99 and compatibility with compilers other than GCC,

Code: Select all

 if (__builtin_expect(sq >= 64, 0)) omfgbbq_bugsie_alert(); 
That should do something similar to your __assume() statement. It is meant for branch prediction, but if GCC has any sense it should optimize away the & 0x3f mask, since sq >= 64 can never be true there.
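
For what it's worth, the closest GCC analogue to MSVC's __assume is probably __builtin_unreachable() (GCC 4.5 and later). A minimal portable wrapper might look like this; the ASSUME name is just illustrative:

Code: Select all

#if defined(_MSC_VER)
#define ASSUME(cond) __assume(cond)
#elif defined(__GNUC__)
// tell the optimizer the condition can never be false
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#define ASSUME(cond) ((void)0)
#endif

// e.g. ASSUME(fsq < 64); should let the compiler drop redundant masking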

Matthew:out
tu ne cede malis, sed contra audentior ito