C++ optimizations

Discussion of chess software programming and technical issues.

Moderator: Ras

lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C++ optimizations

Post by lucasart »

lucasart wrote: - now when you have a move_t m at the start of a function, do not use m.fsq, m.tsq, m.flag, m.prom more than once (except in rare if conditions where it doesn't matter). Typically you'll use m.fsq and m.tsq a lot, but perhaps m.flag and m.prom only once or twice in a rare branch. Rather, start like this:

Code: Select all

const int fsq = m.fsq, tsq = m.tsq;
and *then* use fsq, and tsq normally.
In fact, this optimization should be performed by the compiler. The compiler analyzes your function block, and it KNOWS that m is not modified, so it knows that if m.fsq, m.tsq etc. appear in several places, the extraction should be performed once and for all. Try it, with gcc -O3, you'll be surprised! Also, using const everywhere you can may help the compiler, as well as avoid human errors.
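A minimal sketch of the pattern (the move_t layout and the manhattan_distance function are hypothetical, just to illustrate hoisting the fields into const locals):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative move layout (field names from the post; widths assumed).
struct move_t {
	uint16_t fsq : 6, tsq : 6, prom : 2, flag : 2;
};

// Extract each bitfield once into a const local, then reuse the locals.
// This guarantees a single extraction per field, whatever the compiler
// concludes about aliasing.
int manhattan_distance(move_t m) {
	const int fsq = m.fsq, tsq = m.tsq;
	const int dr = (fsq >> 3) - (tsq >> 3);	// rank difference
	const int df = (fsq & 7) - (tsq & 7);	// file difference
	return (dr < 0 ? -dr : dr) + (df < 0 ? -df : df);
}
```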

Generally, I would say: never assume, always verify. And never listen to anyone, especially on talkchess :lol:
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: C++ optimizations

Post by wgarvin »

lucasart wrote:
lucasart wrote: - now when you have a move_t m at the start of a function, do not use m.fsq, m.tsq, m.flag, m.prom more than once (except in rare if conditions where it doesn't matter). Typically you'll use m.fsq and m.tsq a lot, but perhaps m.flag and m.prom only once or twice in a rare branch. Rather, start like this:

Code: Select all

const int fsq = m.fsq, tsq = m.tsq;
and *then* use fsq, and tsq normally.
In fact, this optimization should be performed by the compiler. The compiler analyzes your function block, and it KNOWS that m is not modified, so it knows that if m.fsq, m.tsq etc. appear in several places, the extraction should be performed once and for all. Try it, with gcc -O3, you'll be surprised!
In simple cases, yes. But doing it manually is helpful because it makes it obvious to the compiler that even if something potentially aliases m.fsq after this line of code, that aliasing cannot affect the value of the local variable fsq. So fsq will almost certainly live in a register and not get any spurious loads. Even with the strict-aliasing rule, m.fsq might present ambiguities to the compiler (especially in more complex code, or where non-inline functions are called, etc.).

That's one reason I hate the strict-aliasing rules: you can write non-conforming code that seems to work for a while, but it might silently fail one day and cause wrong behaviour. And meanwhile it still doesn't always help solve even these simple cases of aliasing. You still end up needing to put restrict on things to tell the compiler when you know there is no actual aliasing. And there is almost no way to comply with it when writing deliberately-aliasing low-level code (for example, to write a decompressor that does deliberate unaligned loads of various sizes from a bytestream, there is no portable way of doing this that conforms with the strict-aliasing rule... but it's easy if you just disable strict aliasing). For any complex codebase that includes type-punning and low-level bit twiddling, just turning off strict aliasing is by far the safest way to handle it.
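As an editorial aside (not from the post): one idiom for unaligned loads that does stay within the strict-aliasing rules is copying through memcpy into a local, which gcc and clang fold into a single load instruction at -O2. A sketch, with load_u32 being a made-up helper name; note the result is in the host's byte order:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Strict-aliasing-safe unaligned 32-bit load: no pointer-type punning
// is involved, only a byte copy, so the compiler's aliasing analysis
// stays sound. Mainstream compilers compile this to one mov on x86.
static uint32_t load_u32(const unsigned char *p) {
	uint32_t v;
	std::memcpy(&v, p, sizeof v);	// bytes land in host order
	return v;
}
```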
lucasart wrote:Also, using const everywhere you can may help the compiler, as well as avoid human errors.
Const doesn't actually help the compiler at all (with the small exception that if you put it right on a declaration/definition of a global variable, it can be placed in a read-only data segment). The problem is this: even if you declare a pointer as "const int* p", that only tells the compiler that you won't modify the pointed-to memory through this pointer p. You might modify it through some other pointer, and you might pass p to some function that casts away the constness and modifies the memory. So basically, the compiler doesn't learn anything useful about memory aliasing from your use of const. All it knows is (1) what it can figure out by analysis of the code, (2) what is promised to it by the language standard (such as the strict-aliasing rule), and (3) what is promised to it by your use of restrict. And even using restrict correctly is confusing and difficult for some programmers (and it tends to expose bugs in compilers too, although this has been slowly improving over the years).
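A small sketch of why const on a pointer buys the optimizer nothing (read_twice is a hypothetical function; because p and q may alias, the compiler must reload *p after the store, const or not):

```cpp
#include <cassert>

// 'const int *p' only promises that WE won't write through p.
// The pointed-to memory may still change through q, so the compiler
// must assume the store '*q = ...' can modify *p and reload it.
int read_twice(const int *p, int *q) {
	int a = *p;	// first load
	*q = a + 1;	// may alias *p
	int b = *p;	// forced reload: b may differ from a
	return b - a;
}
```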

The real reason to use const is that it expresses your intention not to modify the object to other programmers (and to your future self reading the code six months from now). Used correctly, it can help make code easier to read.
lucasart wrote:Generally, I would say: never assume, always verify. And never listen to anyone, especially on talkchess :lol:
This I agree with 100% !
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: C++ optimizations

Post by wgarvin »

lucasart wrote:
Karlo Bala wrote:Instead of defining a move as a structure, perhaps you should try to define a move as unsigned short (16 bit). I'd like to see results.
I think that's a waste of time. It basically means that you have to do manually all the bit shifting and filtering that the compiler otherwise does (or doesn't have to do, using 8-bit sub-registers as I mentioned before). I really doubt you'll beat the compiler, but you can try.
Actually, bitfields are one area where compilers have traditionally not always generated the very best code. Ask Bob Hyatt about this. :P It's one of those "well, nobody uses this much" areas of the language that doesn't get a huge amount of attention from the optimizer. I haven't tried them in any recent compiler versions to see if this has changed in the last couple of years, but it used to be that you could get better generated code from most compilers by doing your own explicit shifts and masking than by using bitfields.

Doing it by hand has other advantages. You can manually pick one field to put at the MSB end of the variable and one at the LSB end (meaning you need fewer operations to extract their values). You can also mix signed and unsigned fields more easily (e.g. if you only have one signed field, put it at the MSB end so the sign extension is "free" when you read its value). And even if you know the order your compiler puts the bitfields into the word... some other compiler might put them in the opposite order.

Another common problem with bitfields is people using signed types such as "int". That means each bitfield has its own sign bit and needs to be sign-extended when the value is extracted into a temporary of the full type such as "int". A one-bit signed bitfield contains nothing but the sign bit!
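A tiny sketch of that sign-bit point (struct names are made up; `signed int` is written explicitly because the signedness of a plain `int` bitfield is implementation-defined in older standards):

```cpp
#include <cassert>

// A 1-bit signed bitfield holds only the sign bit: its two possible
// values are 0 and -1, and every read sign-extends to the full type.
struct sbit { signed int s : 1; };

// The unsigned equivalent holds 0 or 1, as one would expect.
struct ubit { unsigned int u : 1; };
```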
syzygy
Posts: 5727
Joined: Tue Feb 28, 2012 11:56 pm

Re: C++ optimizations

Post by syzygy »

wgarvin wrote:Actually, bitfields are one area where compilers have traditionally not always generated the very best code. Ask Bob Hyatt about this.
Or ask Linus: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696
Linus wrote:So my gut feel is that getting rid of the bitfield as early as possible, and turning all bitfield accesses into regular load/shift/mask/store operations is always the right thing to do.
Interestingly, the problem reported by Linus is that gcc mixes byte access with word access, which modern x86 cpus cannot really cope with.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: C++ optimizations

Post by wgarvin »

syzygy wrote:
Linus wrote:So my gut feel is that getting rid of the bitfield as early as possible, and turning all bitfield accesses into regular load/shift/mask/store operations is always the right thing to do.
Yeah... he was talking about what the gcc compiler writers should do in their optimizer, but oddly enough it sounds like good advice for us to follow too: if you want the fastest engine, don't use the compiler's bitfields, but do your own shifting and masking instead. :P

Of course you should use macros or inline functions to keep that nastiness just in one place, and have clear readable code everywhere else.
abulmo
Posts: 151
Joined: Thu Nov 12, 2009 6:31 pm

Re: C++ optimizations

Post by abulmo »

syzygy wrote:
abulmo wrote:
Gian-Carlo Pascutto wrote:Profiling would shed light.
With which tools ?
Instrumented profilers like gprof mostly measure the impact of the added instrumenting code and give numbers so far from any reality that they mean nothing.
Profile new and old and look at the difference.
What's the meaning of the difference of two wrong numbers ?
Here is what I get when profiling a perft 7 with an instrumented profiler (from intel (*)):

Code: Select all

58.16% Board::count_moves() const
12.97% Board::get_pinning_direction(Coordinate, Color, Color) const
 8.13% Board::is_direction_not_pinned(int, int) const
 7.22% Board::is_attacked(Coordinate, Color) const
 4.50% Board::update(Move&)
 3.18% Board::generate_moves(MoveArray&) const
 2.66% Board::bulk_perft(int)
 2.22% Board::count_moves_black() const
 1.77% Board::restore(Move const&)
and with a probing profiler (sysprof):

Code: Select all

50.29% Board::count_moves() const
21.85% Board::get_pinning_direction(Coordinate, Color, Color) const
 6.86% Board::is_attacked(Coordinate, Color) const
 5.35% Board::update(Move&)
 4.16% Board::is_direction_not_pinned(int, int)
 3.36% Board::generate_moves(MoveArray&) const
 3.23% Board::restore(Move const&)
So which number should I believe ?

IMHO, it's OK for getting a rough idea of where the time is spent, but no more.

(*) I prefer not to show the result of gprof, which finds time spent in non-existing functions like frame_dummy or in obviously uncalled functions.
Richard
syzygy
Posts: 5727
Joined: Tue Feb 28, 2012 11:56 pm

Re: C++ optimizations

Post by syzygy »

abulmo wrote:
syzygy wrote:
abulmo wrote:
Gian-Carlo Pascutto wrote:Profiling would shed light.
With which tools ?
Instrumented profilers like gprof mostly measure the impact of the added instrumenting code and give numbers so far from any reality that they mean nothing.
Profile new and old and look at the difference.
What's the meaning of the difference of two wrong numbers ?
They'll be off in the same way. You'll see where the new version is faster than the old version. If the new version is 40% or so faster, it will be hard not to get a clue by looking at the profiling reports.
Here is what I get when profiling a perft 7 with an instrumented profiler (from intel (*)):
Now slow down one of your functions with a loop, profile again, and see if you can find the function that you modified by comparing the two profiling reports that were created using the same tools.

I agree that profiling adds noise, especially if you have a lot of small functions that get called a lot, but as long as you know what you're doing and you have a reasonable idea of what's happening under the hood, it can certainly help to get a better understanding of a program's performance.
lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C++ optimizations

Post by lucasart »

wgarvin wrote: You can also mix signed and unsigned fields more easily (e.g. if you only have one signed field, put it at the MSB end so the sign-extension is "free" when you read its value).
Yes, that is a typical "newbism". Note how I avoided making that mistake by using uint32_t in the struct {...};
In some cases, however (like in DiscoCheck, with the way I put the depth in the hash entries), it is useful to use an int bitfield with its own sign bit. This is rare, but very useful when you need it!
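For instance, a hypothetical hash-entry layout (not DiscoCheck's actual one) where a signed depth bitfield lets quiescence-search depths go negative, with the sign extension handled by the compiler on every read:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative transposition-table entry; field widths are assumptions.
struct hash_entry {
	uint32_t key32;		// verification bits
	int32_t  score : 16;	// signed search score
	int32_t  depth : 8;	// signed: qsearch depths can be negative
	uint32_t bound : 2;	// lower/upper/exact
	uint32_t age   : 6;	// generation counter
};
```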
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C++ optimizations

Post by lucasart »

wgarvin wrote:
syzygy wrote:
Linus wrote:So my gut feel is that getting rid of the bitfield as early as possible, and turning all bitfield accesses into regular load/shift/mask/store operations is always the right thing to do.
Yeah... he was talking about what the gcc compiler writers should do in their optimizer, but oddly enough it sounds like good advice for us to follow too: if you want the fastest engine, don't use the compiler's bitfields, but do your own shifting and masking instead. :P

Of course you should use macros or inline functions to keep that nastiness just in one place, and have clear readable code everywhere else.
OK I decided to experiment, here are my results (results are expressed in leaf/sec on a varied perft() benchmark):

1/ struct with compiler-handled bitfields

Code: Select all

struct move_t {
	uint16_t fsq:6, tsq:6;
	uint16_t prom:2;	// 0=Knight...3=Queen
	uint16_t flag:2;	// 0=normal,1=ep,2=prom,3=castling

	bool operator== (move_t m) const {
		return fsq == m.fsq && tsq == m.tsq && flag == m.flag;
	}
	
	int get_prom() const {
		assert(flag == PROMOTION);
		return prom + KNIGHT;
	}
	
	void set_prom(int piece) {
		assert(KNIGHT <= piece && piece <= QUEEN);
		prom = piece - KNIGHT;
	}
};
speed: 69563522 leaf/sec

2/ class with encapsulation

A preparatory step to modify the class to go from compiler bitfields to manual ones (see 3/). As expected there is absolutely zero overhead:

Code: Select all

class move_t {
	uint16_t fsq:6, tsq:6;
	uint16_t prom:2;	// 0=Knight...3=Queen
	uint16_t flag:2;	// 0=normal,1=ep,2=prom,3=castling

public:
	bool operator== (move_t m) const {
		return fsq == m.get_fsq() && tsq == m.get_tsq() && flag == m.get_flag();
	}
	
	int get_fsq() const { return fsq; }
	int get_tsq() const { return tsq; }
	int get_flag() const { return flag; }

	void set_fsq(unsigned _fsq) { fsq = _fsq; }
	void set_tsq(unsigned _tsq) { tsq = _tsq; }
	void set_flag(unsigned _flag) { flag = _flag; }
	
	int get_prom() const {
		assert(flag == PROMOTION);
		return prom + KNIGHT;
	}
	
	void set_prom(int piece) {
		assert(KNIGHT <= piece && piece <= QUEEN);
		prom = piece - KNIGHT;
	}
};
speed: 68930908 leaf/sec

3/ manual bitfields

it's quite ugly, and the compiler screams about uninitialized variables, so I had to add a default ctor (with infinitesimal and unmeasurable performance cost, so don't worry about that). At least the ugliness is kept inside the class, and the rest of the code doesn't have to be ugly:

Code: Select all

class move_t {
	/* 16 bit field:
	 * fsq = 0..5
	 * tsq = 6..11
	 * prom = 12,13 (0=Knight..3=Queen)
	 * flag = 14,15 (0=NORMAL...3=CASTLING)
	 * */
	uint16_t mask;

public:
	move_t(): mask(0) {}	// silence compiler warnings
	bool operator== (move_t m) const { return mask == m.mask; }
	
	int get_fsq() const { return mask & 0x3f; }
	int get_tsq() const { return (mask >> 6) & 0x3f; }
	int get_flag() const { return (mask >> 14) & 3; }

	void set_fsq(unsigned fsq)	 { mask &= 0xffc0; mask ^= fsq; }
	void set_tsq(unsigned tsq)	 { mask &= 0xf03f; mask ^= (tsq << 6); }
	void set_flag(unsigned flag) { mask &= 0x3fff; mask ^= (flag << 14); }
	
	int get_prom() const {
		assert(get_flag() == PROMOTION);
		return ((mask >> 12) & 3) + KNIGHT;
	}
	
	void set_prom(int piece) {
		assert(KNIGHT <= piece && piece <= QUEEN);
		mask &= 0xcfff;
		mask ^= (piece - KNIGHT) << 12;
	}
};
speed: 93625373 leaf/sec

I am shocked by the poor performance of GCC here! My struct/class with compiler-handled bitfields didn't incur any undue sign extension, so it should have been just as efficient as manually handled bitfields. But admittedly you guys were right: GCC sucks at bitfields, big time!

But I still don't like doing this. It's like fixing your code instead of fixing the compiler. I don't see what prevents the GNU guys from doing it right in GCC, so that people don't have to write this kind of code. This is so 80s... We're in 2012 :cry:

I wonder if other compilers do better here. I pushed the patches 2/ and 3/ on my github:
https://github.com/lucasart/chess

PS: 94 M leaf/sec with a 16-bit move_t on an Intel Dual-Core E5300 @ 2.6 GHz (one thread, no hash table cheating) is pretty damned fast perft :twisted:
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C++ optimizations

Post by lucasart »

lucasart wrote: I don't see what prevents the GNU guys from doing it right in GCC
I can see one reason why GCC can't produce code as fast as my manual version. And that's essentially parameter validation. For example in

Code: Select all

void set_fsq(unsigned fsq)	 { mask &= 0xffc0; mask ^= fsq; }
what if fsq is outside the range 0..63? The standard forces the compiler to at least do something like this:

Code: Select all

void set_fsq(unsigned fsq) {
    fsq &= 63;  /* keep only the 6 low bits, to avoid side effects on the other fields */
    mask &= 0xffc0; mask ^= fsq;
}
In an ideal world, that masking could be suppressed by a statement like

Code: Select all

__assume(fsq < 64);
I know that MSVC has the __assume keyword. I don't know how to do it with GCC, and whether the optimizer would use this information efficiently anyway.
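An editorial note: GCC does have an equivalent since version 4.5, __builtin_unreachable(). A sketch wrapping both behind a hypothetical ASSUME macro (mask is widened to plain unsigned here for simplicity):

```cpp
#include <cassert>

// Portable "assume" wrapper: tells the optimizer a condition cannot be
// false, so e.g. range masking of fsq may be dropped. Violating the
// assumption at runtime is undefined behaviour.
#if defined(_MSC_VER)
#  define ASSUME(cond) __assume(cond)
#elif defined(__GNUC__)
#  define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#  define ASSUME(cond) ((void)0)
#endif

// Manual bitfield setter from the post, with the assumption made visible.
void set_fsq(unsigned &mask, unsigned fsq) {
	ASSUME(fsq < 64);	// caller guarantees a valid square
	mask &= 0xffc0u;
	mask ^= fsq;
}
```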
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.