C++ optimizations

Discussion of chess software programming and technical issues.

Moderator: Ras

Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: C++ optimizations

Post by Karlo Bala »

lucasart wrote:
wgarvin wrote:
syzygy wrote:
Linus wrote: So my gut feel is that getting rid of the bitfield as early as possible, and turning all bitfield accesses into regular load/shift/mask/store operations is always the right thing to do.
Yeah... he was talking about what the gcc compiler writers should do in their optimizer, but oddly enough it sounds like good advice for us to follow too: if you want the fastest engine, don't use the compiler's bitfields, but do your own shifting and masking instead. :P

Of course you should use macros or inline functions to keep that nastiness just in one place, and have clear readable code everywhere else.
OK, I decided to experiment. Here are my results, expressed in leaf/sec on a varied perft() benchmark:

1/ struct with compiler handled bitfields

Code: Select all

struct move_t {
	uint16_t fsq:6, tsq:6;
	uint16_t prom:2;	// 0=Knight...3=Queen
	uint16_t flag:2;	// 0=normal,1=ep,2=prom,3=castling

	bool operator== (move_t m) const {
		return fsq == m.fsq && tsq == m.tsq && flag == m.flag;
	}
	
	int get_prom() const {
		assert(flag == PROMOTION);
		return prom + KNIGHT;
	}
	
	void set_prom(int piece) {
		assert(KNIGHT <= piece && piece <= QUEEN);
		prom = piece - KNIGHT;
	}
};
speed: 69563522 leaf/sec

2/ class with encapsulation

A preparatory step before switching the class from compiler bitfields to manual ones (see 3/). As expected, there is absolutely zero overhead:

Code: Select all

class move_t {
	uint16_t fsq:6, tsq:6;
	uint16_t prom:2;	// 0=Knight...3=Queen
	uint16_t flag:2;	// 0=normal,1=ep,2=prom,3=castling

public:
	bool operator== (move_t m) const {
		return fsq == m.get_fsq() && tsq == m.get_tsq() && flag == m.get_flag();
	}
	
	int get_fsq() const { return fsq; }
	int get_tsq() const { return tsq; }
	int get_flag() const { return flag; }

	void set_fsq(unsigned _fsq) { fsq = _fsq; }
	void set_tsq(unsigned _tsq) { tsq = _tsq; }
	void set_flag(unsigned _flag) { flag = _flag; }
	
	int get_prom() const {
		assert(flag == PROMOTION);
		return prom + KNIGHT;
	}
	
	void set_prom(int piece) {
		assert(KNIGHT <= piece && piece <= QUEEN);
		prom = piece - KNIGHT;
	}
};
speed: 68930908 leaf/sec

3/ manual bitfields

It's quite ugly, and the compiler warns about uninitialized variables, so I had to add a default ctor (with an infinitesimal and unmeasurable performance cost, so don't worry about that). At least the ugliness is kept inside the class, and the rest of the code doesn't have to be ugly:

Code: Select all

class move_t {
	/* 16 bit field:
	 * fsq = 0..5
	 * tsq = 6..11
	 * prom = 12,13 (0=Knight..3=Queen)
	 * flag = 14,15 (0=NORMAL...3=CASTLING)
	 * */
	uint16_t mask;

public:
	move_t(): mask(0) {}	// silence compiler warnings
	bool operator== (move_t m) const { return mask == m.mask; }
	
	int get_fsq() const { return mask & 0x3f; }
	int get_tsq() const { return (mask >> 6) & 0x3f; }
	int get_flag() const { return (mask >> 14) & 3; }

	void set_fsq(unsigned fsq)	 { mask &= 0xffc0; mask ^= fsq; }
	void set_tsq(unsigned tsq)	 { mask &= 0xf03f; mask ^= (tsq << 6); }
	void set_flag(unsigned flag) { mask &= 0x3fff; mask ^= (flag << 14); }
	
	int get_prom() const {
		assert(get_flag() == PROMOTION);
		return ((mask >> 12) & 3) + KNIGHT;
	}
	
	void set_prom(int piece) {
		assert(KNIGHT <= piece && piece <= QUEEN);
		mask &= 0xcfff;
		mask ^= (piece - KNIGHT) << 12;
	}
};
speed: 93625373 leaf/sec
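
A quick way to convince yourself that the hand-written hex masks match the layout comment is a round-trip check like the following. This snippet is illustrative only and not part of the benchmark; PROMOTION, KNIGHT and QUEEN are the same constants used above, and 52/60 are e7/e8 assuming a1 = 0.

Code: Select all

move_t m;
m.set_fsq(52);		// e7
m.set_tsq(60);		// e8
m.set_flag(PROMOTION);
m.set_prom(QUEEN);

assert(m.get_fsq() == 52);
assert(m.get_tsq() == 60);
assert(m.get_flag() == PROMOTION);
assert(m.get_prom() == QUEEN);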

I am shocked by the poor performance of GCC here! My struct/class with compiler-handled bitfields didn't incur any undue sign extension, so it should have been just as efficient as manually handled bitfields. But admittedly you guys were right: GCC sucks at bitfields, big time!

But I still don't like doing this. It's like fixing your code instead of fixing the compiler. I don't see what prevents the GNU guys from doing it right in GCC, so that people don't have to write this kind of code. This is so 80s... We're in 2012 :cry:

I wonder if other compilers do better here. I pushed the patches 2/ and 3/ on my github:
https://github.com/lucasart/chess

PS: 94 M leaf/sec with a 16-bit move_t on an Intel Dual Core E5300 @ 2.6 GHz (one thread, no hash table cheating) is pretty damned fast perft :twisted:
You should get a small speed up when you need to set or get all fields together.

Code: Select all

void set_move (unsigned fsq, unsigned tsq, unsigned prom, unsigned flag)
{
   mask = ((((((flag<<2)|prom)<<6)|tsq)<<6)|fsq);
}

void get_move (unsigned& fsq, unsigned& tsq, unsigned& prom, unsigned& flag)
{
   flag = mask;
   fsq = flag & 0x3f;
   flag >>= 6;
   tsq = flag & 0x3f;
   flag >>= 6;
   prom = flag & 0x3;
   flag >>= 2;
}
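
For example, a move generator could then pack a move with one combined store instead of three masked ones. A minimal usage sketch, assuming set_move/get_move are members of move_t and using e2 = 12, e4 = 28 (a1 = 0):

Code: Select all

move_t m;
m.set_move(12, 28, 0, 0);		// e2-e4, no promotion, normal flag

unsigned fsq, tsq, prom, flag;
m.get_move(fsq, tsq, prom, flag);	// one load, fields unpacked locally
assert(fsq == 12 && tsq == 28 && prom == 0 && flag == 0);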

Best Regards,
Karlo Balla Jr.
lucasart
Posts: 3242
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C++ optimizations

Post by lucasart »

Karlo Bala wrote: You should get a small speed up when you can set or get all fields together.
Yeah, I thought about that too. But it would incur quite a bit of rewriting. Maybe I should use my 16-bit class with a single getter/setter that outputs to / inputs from another struct like this:

Code: Select all

struct FatMove {
  int fsq, tsq, prom, flag;
};
For example in a function that returns a move_t I could write

Code: Select all

{
  FatMove m;
  /* use m normally without rewriting anything */
  return move_t(m);  // conversion done at once here
}
And generally work with "fat moves" and dump them in one go into a "slim move" move_t like

Code: Select all

m.set_move(fat_move);
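
A minimal sketch of what that conversion might look like, reusing the setters from 3/. The FatMove constructor and the set_move(FatMove) overload below are hypothetical, not code from the repo, and fm.prom is assumed to hold a piece in the KNIGHT..QUEEN range as set_prom expects:

Code: Select all

class move_t {
	uint16_t mask;	// same 16-bit layout as in 3/

public:
	move_t(): mask(0) {}

	explicit move_t(const FatMove& fm): mask(0) {
		set_fsq(fm.fsq);
		set_tsq(fm.tsq);
		set_flag(fm.flag);
		if (fm.flag == PROMOTION)
			set_prom(fm.prom);	// fm.prom assumed in KNIGHT..QUEEN
	}

	void set_move(const FatMove& fm) { *this = move_t(fm); }

	// set_fsq/set_tsq/set_flag/set_prom and the getters exactly as in 3/
};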
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
ZirconiumX
Posts: 1361
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: C++ optimizations

Post by ZirconiumX »

lucasart wrote: In an ideal world that should be suppressed by a statement like

Code: Select all

__assume(fsq < 64);
I know that MSVC has the __assume keyword. I don't know how to do it with GCC, and whether the optimizer would use this information efficiently anyway.
If you are willing to abandon strict ISO C99 and compatibility with compilers other than GCC,

Code: Select all

 if (__builtin_expect(sq >= 64, 0)) omfgbbq_bugsie_alert(); 
That should do something similar to your __assume() statement. It is meant for branch prediction, but if GCC has any sense it should optimize away the & 0x3f mask, since sq >= 64 can never be true there.
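
For what it's worth, the closest GCC analogue to MSVC's __assume is probably __builtin_unreachable() (GCC 4.5 and later). A minimal portable wrapper might look like this; the ASSUME name is just illustrative:

Code: Select all

#if defined(_MSC_VER)
#define ASSUME(cond) __assume(cond)
#elif defined(__GNUC__)
// tell the optimizer the condition can never be false
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#define ASSUME(cond) ((void)0)
#endif

// e.g. ASSUME(fsq < 64); should let the compiler drop redundant masking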

Matthew:out
tu ne cede malis, sed contra audentior ito