performance of dynamically allocated move lists

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

performance of dynamically allocated move lists

Post by Rein Halbersma »

Code: Select all

 
container     enhancement    time (ms)   penalty
array<128>                   29359
vector        reserve(0)     42046       43.2%
              reserve(1)     41828       42.5%
              reserve(2)     39531       34.6%
              reserve(4)     37313       27.1%
              reserve(8)     35328       20.3%
              reserve(16)    33062       12.6%
              reserve(32)    32875       12.0%
              reserve(64)    35437       20.7%
              reserve(128)   35688       21.6%
              reserve(256)   40734       38.7%
deque         reserve(4K)    48969       66.8%
The above table shows the timings (in milliseconds) for a 15-ply search with my draughts engine on the initial position of international draughts. The baseline is a stack-allocated move list of fixed size 128 moves, for which I used the STL container std::array<Move, 128>; the penalty column is the slowdown relative to this baseline. I then ran the exact same search with the dynamically allocated STL containers std::vector<Move> and std::deque<Move>. For each run I varied the amount of pre-allocated memory with a reserve(X) statement, where X ran from 0 to 256 moves for std::vector; for std::deque the pre-allocation of 4K-byte blocks is done implicitly by the STL-provided allocator.
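For concreteness, a minimal sketch of the container swap being measured (the Move struct, the typedefs, and the reserve amount below are placeholders, not my actual code):

Code: Select all

#include <array>
#include <vector>

struct Move { unsigned char bytes[24]; };      // stand-in for my 24-byte Move

typedef std::array<Move, 128> StackList;       // baseline: fixed-size, stack-allocated
typedef std::vector<Move>     HeapList;        // dynamically allocated alternative

void generate(HeapList& moves /*, const Position& pos */)
{
    moves.reserve(32);                         // the X that was varied from 0 to 256
    // ... push_back() each generated move ...
}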

Whenever more than X moves are generated after reserve(X) has been called, the STL automatically re-allocates roughly twice as much memory and copies the old contents into the new storage. For this 15-ply search the average branching factor was a little less than 16; I didn't try to collect more detailed statistics on its distribution. Note that my Move struct is 24 bytes, so there could be some cache alignment issues (perhaps padding to 32 bytes would be an idea). I didn't try to address this.
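For reference only (I didn't test it), the padding idea could look something like this, with a dummy payload standing in for my real Move fields:

Code: Select all

struct alignas(32) Move                 // pad the 24-byte struct out to 32 bytes
{
    unsigned char payload[24];          // dummy payload standing in for the real fields
};
static_assert(sizeof(Move) == 32, "two Moves per 64-byte cache line");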

The performance results are quite surprising to me. It "only" costs 12% in total time to depth (29 vs 33 seconds) to use std::vector with space reserved for 32 moves instead of a stack-based std::array. That 12%, of course, buys you full insurance against crashes in positions with more moves than the stack-reserved maximum. For chess engines, with 4-byte Move structs and an average branching factor of 32, I would expect similar performance differences when moving to dynamically allocated move lists. With incremental move generation (which I can't use in draughts because of the capture rules), the performance in chess could be even slightly better.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: performance of dynamically allocated move lists

Post by wgarvin »

Why not just keep a giant stack of moves per thread, and within each search function just remember pointers to the beginning and end of the current "move list" for that node? That way, you never spend any time allocating or copying move lists, and you don't waste any space. About the only downside is that you lose control over which cache lines (or cache ways) the move list for a particular node lands in, but I don't think anybody is optimizing placement with that much care anyway.

I know you're a fan of STL, but this might be a case where plain simple C style is simpler and faster...
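Roughly like this (just a sketch; the sizes, the Move type and the generate() call are placeholders, not code from any particular engine):

Code: Select all

struct Move { unsigned char bytes[24]; };      // placeholder move type

static const int STACK_SIZE = 16 * 1024;
static Move move_stack[STACK_SIZE];            // one big array (make it thread_local for SMP)

void search(Move* first /*, const Position& pos, int depth, ... */)
{
    Move* last = first;
    // generate moves for this node directly onto the stack, bumping 'last' past them:
    //   last = generate(pos, first);
    for (Move* m = first; m != last; ++m) {
        // make *m, then recurse with 'last' as the child's 'first', so the
        // child's move list starts right after this node's:
        //   search(last /*, pos, depth - 1, ... */);
        // undo *m
    }
    // nothing to free: returning simply abandons [first, last)
}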
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: performance of dynamically allocated move lists

Post by Rein Halbersma »

wgarvin wrote:Why not just keep a giant stack of moves per thread, and within each search function just remember pointers to the beginning and end of the current "move list" for that node? That way, you never spend any time allocating or copying move lists, and you don't waste any space. About the only downside is that you lose control over which cache lines (or cache ways) the move list for a particular node lands in, but I don't think anybody is optimizing placement with that much care anyway.

I know you're a fan of STL, but this might be a case where plain simple C style is simpler and faster...
About Big Global Move Stacks: been there done that, didn't like the T-shirt :-) I currently use the std::array approach with a hopefully sufficient amount of moves. How much ELO does 12% cost anyway? BTW, I'm looking for good custom allocators to see how far one can push the std::vector abstraction penalty. For 5% I'd be willing to switch.
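To give an idea of the kind of allocator I mean, here is a rough sketch of a bump allocator behind the std::vector interface. It is only an illustration: no overflow check, the arena has to be reset between searches, and all the names are made up.

Code: Select all

#include <cstddef>
#include <vector>

struct Move { unsigned char bytes[24]; };           // placeholder move type

template <class T>
struct ArenaAlloc {                                 // minimal C++11 allocator
    using value_type = T;
    alignas(16) static char buffer[1 << 20];        // scratch arena (thread_local for SMP)
    static std::size_t used;                        // bump pointer; no overflow check!

    ArenaAlloc() {}
    template <class U> ArenaAlloc(const ArenaAlloc<U>&) {}

    T* allocate(std::size_t n) {
        std::size_t bytes = (n * sizeof(T) + 15) & ~std::size_t(15);  // keep 16-byte alignment
        T* p = reinterpret_cast<T*>(buffer + used);
        used += bytes;
        return p;
    }
    void deallocate(T*, std::size_t) {}             // no-op: reset 'used' between searches
};
template <class T> alignas(16) char ArenaAlloc<T>::buffer[1 << 20];
template <class T> std::size_t ArenaAlloc<T>::used = 0;

template <class T, class U>
bool operator==(const ArenaAlloc<T>&, const ArenaAlloc<U>&) { return true; }
template <class T, class U>
bool operator!=(const ArenaAlloc<T>&, const ArenaAlloc<U>&) { return false; }

using MoveList = std::vector<Move, ArenaAlloc<Move>>;
With something like this, a plain moves.reserve(32) still works, but the memory comes off the arena instead of the general-purpose heap.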
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: performance of dynamically allocated move lists

Post by wgarvin »

Rein Halbersma wrote:
wgarvin wrote:Why not just keep a giant stack of moves per thread, and within each search function just remember pointers to the beginning and end of the current "move list" for that node? That way, you never spend any time allocating or copying move lists, and you don't waste any space. About the only downside is that you lose control over which cache lines (or cache ways) the move list for a particular node lands in, but I don't think anybody is optimizing placement with that much care anyway.

I know you're a fan of STL, but this might be a case where plain simple C style is simpler and faster...
About Big Global Move Stacks: been there done that, didn't like the T-shirt :-) I currently use the std::array approach with a hopefully sufficient amount of moves. How much ELO does 12% cost anyway? BTW, I'm looking for good custom allocators to see how far one can push the std::vector abstraction penalty. For 5% I'd be willing to switch.
There are actually two std::vector abstraction penalties.

One of them is the performance cost. The other is the cost of readers needing to understand how std::vector works before they can follow the code.

:lol:

Of course for std::vector and other bog-standard stuff, this penalty is very small (and worth it, because they add a lot of value over plain C). But there are lots of programmers out there who would claim to know C++ but actually aren't that familiar with most of what's offered by the STL. If I'm being honest, I'm one of them. My opinion is that almost nothing from the STL is worth using except std::vector, std::sort and std::stable_sort, std::unique, std::lower_bound, and maybe std::string and std::map for non-critical things. Everything else is either too complicated or slow (the other container classes), harder to use than the C version (iostreams), or lets you write 1 line of code instead of a simple 3-line idiom (std::swap, std::find, etc.).

It's true there are things in <algorithm> which are very handy, if you happen to be in the exact situation they were designed for and happen to know what they do. But even if you find them and use them, it just means the next person reading your code will have to look them up too. I would much rather read a simple 3-line loop where I can figure out what it does without having to google it or dig around in some include directory for the header to find out what "std::find_if" means (and then forget it 2 days later, since I only come across it about once a year). I don't think things like std::min_element add any value unless you're using them in some indirect way, like passing them as a functor argument to something else.

For the last 9 years I've worked at places that did not use much STL, so I've never seen most of <algorithm> used in production code. So I'm at one end of that spectrum. But even at the other end... have you ever seen a use of std::next_permutation or std::make_heap or std::random_shuffle? Are there really applications for this stuff, or did the STL designers just add everything they could think of?

[Edit: std::min_element is a good example, because it replaces a loop that remembers a pointer to the element with the lowest value. And you can call it like a function, e.g. in the middle of an argument list for something else. So that's convenient. It might occasionally be useful, but what do you do when you discover that you actually need the *two* lowest values? Or the lowest and the highest? The simplest thing is to just rip out that nice "abstract" call to std::min_element and put the dumb loop back in. At which point, you might (rightly) wonder why you didn't just write the dumb loop in the first place. Of course, some people in this situation immediately set out to write their own min_2_elements template function that returns a std::pair<T, T> or something. They will put it in some utils.h file that's included by everything in your entire project, and proceed to use it in the one or maybe two places where it applies, and there it will sit for the next 6 years without ever being used again. If you find one of your programmers doing this, my advice is to fire them and be more careful when hiring their replacement.]
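For reference, the dumb loop I have in mind is something like this (assuming a vector of ints with at least two elements):

Code: Select all

#include <algorithm>
#include <cstddef>
#include <vector>

// find the two smallest values in v (v.size() >= 2 assumed)
void two_smallest(const std::vector<int>& v, int& lo1, int& lo2)
{
    lo1 = std::min(v[0], v[1]);
    lo2 = std::max(v[0], v[1]);
    for (std::size_t i = 2; i < v.size(); ++i) {
        if (v[i] < lo1)      { lo2 = lo1; lo1 = v[i]; }
        else if (v[i] < lo2) { lo2 = v[i]; }
    }
}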
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: performance of dynamically allocated move lists

Post by wgarvin »

If I'm being doubly honest, I should admit that I don't write a lot of new code. I spend most of my time maintaining existing code. When I do write code, it's usually simple, low-level, or both.

I am probably biased against the STL because it's a wrong-shaped hammer for my day job. Maybe there are domains where using it heavily is a big win. For game engines it's not really that helpful.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: performance of dynamically allocated move lists

Post by Rein Halbersma »

Hi Wylie,

Thanks for sharing your work experiences with STL code.

I meant to simply present the performance difference of dynamically vs statically allocated move lists. The fact that I used a few STL containers for this experiment was really for illustration purposes only. I could also have done this with malloc/free and my own out-of-moves counter, etc. It was literally only a single typedef change in my current code to test this.

In any case, I find the common vocabulary of <algorithm> very useful, because it's much easier to remember what a dozen or so commonly used algorithms do than to mentally parse 10-line hand-written loops from different authors, each with their own naming conventions, careless handling of boundary cases, etc.

Rein

BTW, your min_2_elements example can be done without explicit loops:

Code: Select all

std::partial_sort(v.begin(), v.begin() + 2, v.end());
which leaves the 2 smallest elements sorted in the first two positions. If you don't want those first 2 elements sorted, you can do

Code: Select all

std::nth_element(v.begin(), v.begin() + 1, v.end());
which puts the 2 smallest elements in the first two positions, in no particular order.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: performance of dynamically allocated move lists

Post by wgarvin »

Aargh! Both of those examples mutate the container. That's almost always undesirable... Unless the container itself was a temporary being filled by some other overly-generic code, in which case most of the effort spent filling the container was probably wasted because you only needed 2 of the elements. "Thinking generic" can save some time and effort, but it can also be an insidious trap!

Once upon a time I believed in the STL, and I thought templates and functors were great and I could even look at something like Boost without a shudder of horror. The first time you have to debug memory corruption in some complex container class without the aid of a decent debugger, you realize that even good things come with a cost. When you rewrite your math library to use less template metaprogramming and your entire build speeds up by 10%, you sigh with joy--even if the real reason for the rewrite was that all those templates made it a confusing pain in the ass to debug.

Perfection is achieved not when there's nothing left to add, but when there's nothing left to take away--therefore a perfect program should have no external dependencies, no library includes of any kind! :lol:

Okay, I'm being ridiculous... Real programs use standard facilities, and that's a good thing. But I just want to spread the gospel against "abstract" coding.
Making things simple is really hard. Using complex facilities that look simple from the outside can be dangerous... all abstractions are leaky! As the XP peo
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: performance of dynamically allocated move lists

Post by wgarvin »

Stupid iPhone.. sigh.. the rest of my post went like this:

As the XP people would say, Do The Simplest Thing That Could Possibly Work. Sometimes the simplest thing might be using the STL (e.g. using std::sort will almost always be better than some hand-rolled thing if you need to sort a range of N objects).

Now if you enjoy using the STL to solve this sort of problem, then of course you should go ahead and do that. In my case, I just find the C++ language and libraries to be a bloated and overly-complicated toolbox containing thousands of tools, and the way that I cope is by not using most of them!
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: performance of dynamically allocated move lists

Post by wgarvin »

Rein Halbersma wrote:I meant to simply present the performance difference of dynamically vs statically allocated move lists. The fact that I used a few STL containers for this experiment was really for illustration purposes only. I could also have done this with malloc/free and my own out-of-moves counter, etc.
I wonder if it's worth trying _alloca? It will waste a little space, but probably be much faster than malloc/new/vector<>. It's not as flexible as the "Big Global Move Stack" because once you _alloca the memory, you're stuck with it until the function returns (whereas if you use the "Big Global Move Stack" to search captures, and then later generate a different list of non-captures, you can overwrite the captures and not waste any space, and it doesn't matter if the new set of moves is larger or smaller). In the "Big Global Move Stack", only those sets of moves that you are still busy exploring in some node on the current path ever take up space.

[Edit: I guess if you want to add elements to the container and have it grow, _alloca is a bad choice.. never mind.]
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: performance of dynamically allocated move lists

Post by Rein Halbersma »

There is also boost::auto_buffer, which first tries a stack-allocated fixed-size array and falls back to a heap allocation only when that is exhausted. So the overhead is simply that of size-checked array access. I tried it, and pre-allocating 128 Move elements gave almost identical performance (<0.2% penalty) to the std::array I currently have (which in turn performs identically to a C-style array, but provides the iterator interface as a convenience).
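The underlying mechanism (sketched below with made-up names; this is not boost::auto_buffer's actual interface) is simply an embedded array that spills into a std::vector once a node produces more than N moves:

Code: Select all

#include <cstddef>
#include <vector>

struct Move { unsigned char bytes[24]; };          // placeholder move type

template <class T, std::size_t N>
class SmallBuffer {                                // sketch of the idea, not boost's API
    T local_[N];                                   // embedded, stack-allocated storage
    std::vector<T> heap_;                          // used only after the first spill
    std::size_t size_ = 0;
public:
    void push_back(const T& x) {
        if (heap_.empty() && size_ < N) {          // still fits in the embedded array
            local_[size_++] = x;
            return;
        }
        if (heap_.empty()) {                       // first spill: copy everything over
            heap_.reserve(2 * N);
            heap_.assign(local_, local_ + N);
        }
        heap_.push_back(x);                        // vector handles all further growth
        ++size_;
    }
    std::size_t size() const { return size_; }
    const T& operator[](std::size_t i) const {     // one size check per access, as above
        return heap_.empty() ? local_[i] : heap_[i];
    }
};

using MoveList = SmallBuffer<Move, 128>;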