Hyperbola Quiesscene: hardly any improvement

trojanfoe · Post by **trojanfoe** » Tue Jan 13, 2009 1:30 am

Hi there, after Tord Romstad talked about HQ in his PC/iPod comparison thread I jumped onto the excellent chess programming wiki (http://chessprogramming.wikispaces.com) and implemented it in my engine. Before using HQ I was using Kogge-Stone to generate the slider masks.

Unfortunetly, however, I haven't received a very large NPS increase for doing this (around 15%) but I did do something that might cause a problem. With Kogge Stone I was calling occlusion-fill within the sliding piece generate functions as well as the attack mask generation function so I thought I'd be clever and generate it just once (the attack masks) and use those masks in the move generation. To do this I had store alot more data in my position class (offset of each piece, attack mask of each piece as well as global(ish) offset/count values to allow me to quickly locate a piece's data). I am developing on a E8400 Core 2 Duo which although it runs at 3GHz does suffer from reduced cache memory. Is it possible I have thrown away the advantage of HQ by referencing too much memory during the move generation?

Cheers,
Andy

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Jan 13, 2009 9:56 am

trojanfoe wrote:Hi there, after Tord Romstad talked about HQ in his PC/iPod comparison thread I jumped onto the excellent chess programming wiki (http://chessprogramming.wikispaces.com) and implemented it in my engine. Before using HQ I was using Kogge-Stone to generate the slider masks.

Unfortunetly, however, I haven't received a very large NPS increase for doing this (around 15%) but I did do something that might cause a problem. With Kogge Stone I was calling occlusion-fill within the sliding piece generate functions as well as the attack mask generation function so I thought I'd be clever and generate it just once (the attack masks) and use those masks in the move generation. To do this I had store alot more data in my position class (offset of each piece, attack mask of each piece as well as global(ish) offset/count values to allow me to quickly locate a piece's data). I am developing on a E8400 Core 2 Duo which although it runs at 3GHz does suffer from reduced cache memory. Is it possible I have thrown away the advantage of HQ by referencing too much memory during the move generation?

Cheers,
Andy

15% looks not that bad. But better compare HQ with magic bitboards rather than with piece-wise Kogge-Stone. Fillstuff is intended to do stuff direction-wise with multiple sliders. You make things differently, like trying to safe stuff on the ply-stack which is otherwise (with magics and HQ) intended to do on the fly.

Gerd

trojanfoe · Post by **trojanfoe** » Tue Jan 13, 2009 10:23 am

Agreed - that's an advantage I lose by performing per-piece (rather than per-piece-type) sliding, however I implemented the 'use attack mask during move generation' scheme on Kogge-Stone before implementing HQ and got a slight performance penalty so it's not completely killing it.

I don't think I am able to try magic bitboards as I cannot understand it, although I probably haven't given it enough time. I understood that HQ was as good as magics (I was delighted to hear that as it's much easier to understand

).

Cheers,
Andy

trojanfoe · Post by **trojanfoe** » Tue Jan 13, 2009 12:20 pm

I have undone the 'save attack masks for use in the move generator' and the performance has gone up (by 50% on 32-bit Windows, and 25% on 64-bit Linux - I am not able to test 64-bit Windows at the moment). However I have conditionally compiled Kogge-Stone and HQ and using KS I get slightly better performance? I think I must have other problems I need to address....

It has surprised me however how 'caching' the attack masks causes a performance drop.

Tord Romstad · Post by **Tord Romstad** » Tue Jan 13, 2009 12:21 pm

trojanfoe wrote:I don't think I am able to try magic bitboards as I cannot understand it, although I probably haven't given it enough time. I understood that HQ was as good as magics

I don't think that is universally true. For instance, as I pointed out myself, magic bitboards were significantly faster in 32-bit mode, both on recent x86 CPUs and on ARMs. In 64-bit mode, I also think it is too early to draw any conclusions. In my own benchmarks, there was hardly any difference in speed (if I recall correctly, magic bitboards were about 1% faster, which of course completely insignificant), but I am not sure my magic bitboard implementation is anywhere close to perfect. It seems that most bitboarders have simply copied and pasted a public domain magic bitboard attack generator by Pradu, whereas I wrote my own code. Given my utter cluelessness and incompetence about anything low-level, it quite possible that my implementation is vastly inferior to Pradu's.

Tord

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Jan 13, 2009 12:47 pm

trojanfoe wrote:I have undone the 'save attack masks for use in the move generator' and the performance has gone up (by 50% on 32-bit Windows, and 25% on 64-bit Linux - I am not able to test 64-bit Windows at the moment). However I have conditionally compiled Kogge-Stone and HQ and using KS I get slightly better performance? I think I must have other problems I need to address....

It has surprised me however how 'caching' the attack masks causes a performance drop.

You do Kogge-Stone piece wise and four (bishop, rook) or eight (queen) ray-directions per piece in 32-bit mode and it is faster than HQ!? That is amazing - I can't believe

Possibly the kogge-stone stuff, which however needs a more stores/loads to the processor stack in 32-bit mode due to lack of registers, somehow hides the latency of other memory reads/writes, for instance tt probes, where in HQ that "latency hiding" is not possible due to additional memory references (even if small tables).

Seems your program is rather handicapped by memory latency and stalls - do you have huge position objects on a ply stack and perform copy-make rather than make-unmake?

Tord Romstad · Post by **Tord Romstad** » Tue Jan 13, 2009 12:52 pm

I just hacked up two simple benchmark programs, one for magic bitboards and one for hyperbolic quintessence. Both programs just generate millions of random occupied squares bitboards, loop through all 64 squares for each of the bitboards, and generates queen attacks from all of them. These queen attacks are added together and printed to the standard output when the program exits. It also displays the total computation time, in milliseconds.

In this minimalistic benchmark, magic bitboards are much faster, even in 64-bit mode. The magic mulitiplication program took 8.048 seconds, while the hyperbolic quintessence program took 14.448 seconds.

The source code for the two programs can be downloaded here and here. For use on Windows systems, you probably have to replace the get_system_time() function with whatever function is used to get the time in milliseconds in Windows.

As you can see from the source code, magic bitboards are not really that complicated. The whole source code for magic bitboard attack generation, including initialization code and the huge constant tables, is less than 150 lines of code, and does not require any inline assembly language. As a comparison, the HQ attack generation code (again, including initialization) is about 100 lines of code, without any big tables, but with an ugly little inline assembly language function.

Tord

trojanfoe · Post by **trojanfoe** » Wed Jan 14, 2009 12:17 am

Many thanks Tord, I will implement your magic bitboard code in my engine and see what happens.

What would be your guess for the performance of Kogge-Stone in comparison to HQ or magics? Currently KS is performing slightly better than HQ in my program (having removed the attack mask caching) and I am fairly certain I have other issues:

1) I don't use an unmake_move(), but rather just blat the parent position over the modified child position before moving onto the next move in the list. Again this could be causing problems due to excess memory accesses?

2) I don't use generate_evasions() when in check - I always use generate_moves(), which might save my quite a few clock cycles.

3) The pawn code looks a little unwieldy at the moment, especially with respect to en-passant captures.

Cheers,
Andy

trojanfoe · Post by **trojanfoe** » Wed Jan 14, 2009 1:07 am

I have implemented magic bitboards and have got a very slight increase in performance (1.9Mnps compared with 1.8Mnps for HQ/KS on a perft 6 of the starting position). I can only assume the issues I have outlined above are causing the issues so I will address those before moving on with the searching/evaluation elements of the engine.

Once again many thanks Tord for your efforts helping me; they are much appreciated.

Cheers,
Andy

Edsel Apostol · Post by **Edsel Apostol** » Wed Jan 14, 2009 2:51 am

Hi Tord,

Can you try also the 32 bit version of the Kindergarten Bitboard Method? It seems to be the fastest for me in 32 bit, next is the Magic Move generation in Olithink.

Hyperbola Quiesscene: hardly any improvement

Hyperbola Quiesscene: hardly any improvement

Re: Hyperbola Quiesscene: hardly any improvement

Re: Hyperbola Quiesscene: hardly any improvement

Re: Hyperbola Quiesscene: hardly any improvement

Re: Hyperbola Quiesscene: hardly any improvement

Re: Hyperbola Quiesscene: hardly any improvement

Re: Hyperbola Quiesscene: hardly any improvement

Re: Hyperbola Quiesscene: hardly any improvement

Re: Hyperbola Quiesscene: hardly any improvement

Re: Hyperbola Quiesscene: hardly any improvement