ABDADA speedup results

Daniel Shawul · Post by **Daniel Shawul** » Fri May 03, 2013 12:54 am

Now I have added work sharing also at the root and i am getting speedup numbers as high as 1.8X. The reason for this pretty good speed up numbers maybe that I am experimenting on evaluation-less Nebiyu. With the old YBW, i almost always get a speed up of 2x on those 30 positions, so it is not something specific for ABDADA. For scorpio the numbers would probably be lower ,but I am only interested in the relative performance of ABDADA compared to YBW so this is fine. The results are different from run to run but 1.7x for Nebiyu seems to be a good estimate

Code: Select all

1.320	1.833	1.817

I tried to play with the minimum work sharing depth. This is not done to reduce splitting as there is none but to reduce the load on TT and ABDADA overhead in general. The above was for min_depth >= 4 , and the speedup for min_depth >= 0 is about 1.74x, which is not far from the above 1.8x speedup. Even though not very definitive, the search over head reduced, and also the nps scaling dropped as well as one would expect with more TT requests.

Code: Select all

1.295	1.749	1.792

Daniel

Michel · Post by **Michel** » Sat May 04, 2013 1:08 am

All this seems pretty encouraging.

I have one more question. How do you deal with LMR and LMP?

I now control these with a move counter. However this will be wrong when a thread revisits moves (I would like to make the revisiting transparent by sending the delayed moves back to the move picker object).

Of course it is possible to fix this by letting the move picker keep track of the move count but I wonder if one should really worry about it.

Daniel Shawul · Post by **Daniel Shawul** » Sat May 04, 2013 2:14 am

Michel wrote:All this seems pretty encouraging.

I have one more question. How do you deal with LMR and LMP?

I now control these with a move counter. However this will be wrong when a thread revisits moves (I would like to make the revisiting transparent by sending the delayed moves back to the move picker object).

Of course it is possible to fix this by letting the move picker keep track of the move count but I wonder if one should really worry about it.

That had me flabbergasted first time I tried a two pass search. I was confused why two pass search was significantly slower than single pass search, and i was looking all the wrong places, too many TT calls? etc. In the end the culprit was wrong application of LMR, which is obvious in hindsight. You have to make sure to reset the legal_moves counter to 0 , and save the location of where you start testing for LMR reduction (usually non-capture phase). Since the moves are already generated in the second pass, you start reducing when legal_moves > non_cap_start. I guess simply generating the moves again may save some trouble for a little cost. It took me a day to solve all the nuisances to make sure every move gets reduced by the same amount when searched with single pass vs two pass scheme. I saved single pass reduction value when skipping the move, and compared it with current reduction amount during second pass and printed any mis-matches. That is how i caught all problems due to LMR.

Some other bug that shocked me but will not affect you is that I was counting internal nodes twice or more. This i found out when i tried to make sure that hash_probe() is called exactly once. Btw you should make sure that you do that too, otherwise your search would think that after you store CRAP to search the move exclusively, and then you probe the TT again to find the CRAP you stored, you would think some other processor did that and skip the move. Anyway this bug was responsible for 30% inflation of scorpios nps since i had nodes++ in similar place as hash_probe(). God knows how many wrong conclusions I may have made in the past with such a wrong nodes count. Funny thing I actually run faster now with the 30% lowered nps because i don't call hash_probe() unnecessarily. I did that when a test for null_move, singular search, iid_search etc fails due to some twisted iterative search, but for a recursive search hash_probes are done once at the top so it will not affect you. Thanks for listening for what is useless to almost anyone but me

Daniel Shawul · Post by **Daniel Shawul** » Sat May 04, 2013 5:33 am

All this seems pretty encouraging.

It is almost too good to be true. Now with 4 threads and I got a speed up of 3x. The search times are getting shorter even though i used fixed depth of 20 so the result is not reliable. But it is indicative of what can be achieved. There are some rare crashes that i haven't fixed yet, and probably wouldn't have happened if i used processes. Also I synchronize the threads after every iteration of the iterative deepening but there is no need for that with processes.
Anyway here are results for that i just run
2 threads:

Code: Select all

1.253   1.697   1.909

4 threads

Code: Select all

1.607   3.010   3.424

I think that for 8 or more threads performance may start to degrade because the way I see it all processors tend to work on the same ply. We usually have a limit <= 4 processors at a split point etc for a YBW search , but I don't know how to apply that for ABDADA. The paper does not mention such performance degradation so it may well scale up to many processors. I just wonder why this algorithm is not as popular as it should ,since it is simple , efficient, fault tolerant etc.
Edit: I just found out why I can't apply the limit on number of processors at a ply. The original algorithm stores 'nproc' in TT which I avoided since I didn't see any use for it. But now I do. They mentioned it is used to control speculative parallelism. I tell you this algorithm has everything in it.

Michel · Post by **Michel** » Sat May 04, 2013 7:09 pm

I think that for 8 or more threads performance may start to degrade because the way I see it all processors tend to work on the same ply. We usually have a limit <= 4 processors at a split point etc for a YBW search , but I don't know how to apply that for ABDADA.

Another paper mentions a speed up of 70 for 128 threads, if I remember correctly.

just found out why I can't apply the limit on number of processors at a ply. The original algorithm stores 'nproc' in TT which I avoided since I didn't see any use for it.

Well nproc gives the number of threads searching a node. I don't see how you can avoid this since it controls the effect of the exclusive access flag.
This is the reason it seems to me that tt access has to be synchronized.
If nproc of a node ever becomes >0 without any thread searching it then the node becomes inaccessible for exclusive search.

Daniel Shawul · Post by **Daniel Shawul** » Sat May 04, 2013 8:14 pm

The reason why I did not need to store nprocs is because I follow YBW 'strictly' i.e. all available processors will search the first move. But you can do speculative parallelization when you have too many processors that it would be a waste to spend them all on one move. For example for 256 processors, one can decide to do speculative parallelization when the first move recieves more than 16 processors (nprocs > 16).

For the scheme I have now it is enough to store a TT entry with a boolean flag indicating exclusive search (2 bit flag for CRAP,UPPER,LOWER,EXACT), and then store back the correct TT entry when you are done. It could happen that two processors simultaneously decide to exclusively search the same move since TT operations are non-atomic, in which case the work will be redundunt but no harm will be done. Note that in my case the TT entry can even be replaced even though a CRAP flag is stored in it, and nothing breaks down except that you will do redundunt search. But if you rely on 'nprocs' for things other than speculative parallelization, then it would be fatal as you mentioned. Btw I use a lockless hashing scheme which may be important for this case since many processors are checking the same TT entry before exclusive search so likelihood of collisions may be higher. Also I set the depth of CRAP TT entries to the highest possible to prevent them from being replaced but it would work without that anyway.

Laszlo Gaspar · Post by **Laszlo Gaspar** » Sat May 04, 2013 8:44 pm

Hi new ABDADA fans

,

I have already implemented ABDADA and found it fascinating too! I had got +40 elo on 2 CPU at that time.
My implementation differed in some ways. I already had a second hash table with 10000 entries for repetition detection which contained only a 1 bit flag besides the 32bit hash signature to help detecting if the position was already visited. I expanded it to a 64bit variable so 64 threads can set their own flag independently. You can use this variable not just determining the 1st pass and 2nd pass moves but you can inform certain processes that "this subtree is ready, stop and do something else".
I'm checking this table after make(move) but before calling alphabeta(depth - 1).

Daniel Shawul · Post by **Daniel Shawul** » Sat May 04, 2013 9:25 pm

Hi Laszlo!
It is good to hear that you got 40 elos which is what most other engines get with a YBW too. Did you use processes or threads ? I am now using threads but I intend to scrap that and rewrite using processes, and hopefully MPI has system independent calls for allocating shared memory for distributed TT. If that is the case the implementation would be really easy which is my main goal here more than efficiency. So I will not be trying to store processor ids in the TT like you did. So far the way I see it the most challenging part is to do a two pass search. My TT did not ta change at all since I had an extra flag.

Laszlo Gaspar · Post by **Laszlo Gaspar** » Sat May 04, 2013 10:08 pm

Hi Daniel,

I used threads as it was easy in Java. Nowadays I am thinking about a cluster aware version too but there is still nothing concrete. Thank you for pointing out the LMR-problem I missed it in my design

!
Now that you have ABDADA you could try out the lazy smp idea by Daniel Homan! I first got improvement but later tests didn't confirm it and I am curious...

Michel · Post by **Michel** » Sat May 04, 2013 11:51 pm

@Daniel
You are a better coder than I am it seems!

Am I correct that schematically your scheme is
as follows (for simplicity I am ignoring the first move)

Code: Select all

first pass: 
for all moves (other than the first one) do
  MakeMove
  search(depth-1,exlusive=1)
  If CRAP is set in child node
    return to parent
    Unmake move and store move for second pass 
  else
    set CRAP flag in child
    search child node
    store result and reset CRAP
    return to parent

second_pass:
    search all stored moves with search(depth-1,exclusive=0); children
    now now ignore the CRAP flag.

It seems that this should indeed work and would be more robust
against synchronization errors than using nproc.

ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results