Developments of the last two years

hgm · Post by **hgm** » Mon Mar 04, 2013 12:34 pm

lucasart wrote:That's interesting, although it's not the original challenge. Basically you take my code, get your compiler to spue out assembly, and optimize manually the assembly.

Well, this was exactly what I did for my own perft. As I already stated, I could optimize away 40% of the instructions, but did not achieve any speedup whatsoever.

lucasart · Post by **lucasart** » Mon Mar 04, 2013 1:08 pm

Rebel wrote:
lucasart wrote: I need to hack the code to remove all the DiscoCheck that is not useful and will send you a cleand up version that only does perft.
Great

Actually if the challenge is to improve assembly code produced by a compiler, then we don't need my code at all. Let's take a simpler one that is written in C (as opposed to C++): how about OliPerft ?
http://home.arcor.de/dreamlike/chess/index.html
(download OliPerft 1.01 source code)

Uri Blass · Post by **Uri Blass** » Mon Mar 04, 2013 1:11 pm

bob wrote:
Don wrote:
Uri Blass wrote:
Don wrote:

There are some examples on the web of fairly simply programs being optimized in assembler to get a 2x speedup.

Don
I wonder if you think that 2x speedup is possible for a chess program
by some good programmer.

If yes then maybe somebody can prove it by improving stockfish only by translating it to assmbler and making it twice faster.
You probably think that is ridiculous, right? I don't.

I think an exceptionally talented assembler guru could squeeze a lot out of Stockfish. I don't know if it would be 2X but it might be even more and that would not surprise me whatsoever.

There are 2 limits to how fast a "functionally identical" optimized version of Stockfish could get. The first limit is simply what is possible which we can probably never know. A metaphor for this is to ask, "what could God achieve?" I can easily imagine that there are many possible optimizations that we never even thought of and probably even a few "head slappers" in there.

The second limitation is the one that is most relevant and that is what we believe is possible - and that is the one that limits us the most. If we become satisfied at some point that we have pretty much gotten close to the limit, we won't look for any more.

If you are talking about pure code optimization where you completely ignore anything that might speed up the code substantially but is more of a algorithmic nature, then there is probably still at least 50% speedup possible. Maybe a lot more. You should realize that compilers are probably missing quite a bit even though they continue to improve. They cannot understand anything that is too complicated and they have to make the most conservative assumptions to be sure of being correct and they cannot rewrite algorithms a better way. Keep in mind that C also has limitations too, such as lack of access to all the instructions that a chip provides. In fact most programs now have to work around the popCount thing - either with special intrinsic's to address this limitation or a small assembly language routine. That should tell you something!
2x is likely possible. More? very difficult to say and it gets more unlikely as you increase the improvement. BUT, and there is a very big BUT here... time. It is a time-consuming project to rewrite to ask. Cray Blitz had 20K lines of ask to go with the base fortran source. It was a daunting task to write, to debug, and to maintain. Much easier to add null-move (1988 or so idea) to the fortran search, than it was to add it to the cray ASM code I had written. One was a 10 minute project, one was a several day project, due to all the hand-optimizing harry and I had done...

It is sometimes worth it for code that will not change, but most parts of a chess program are subject to rewrite/modification at any point in time, making the effort expended go way up.

Even if the code changes then there is no contradiction between modification of the code and optimizing the code.

One programmer in a team of 2 programmers may only modify the slow code and another programmer can simply optimize the code without modifying the algorithm.

The programs will always have 2 version(best slow version based on testing and some fast version based on older code)

If the programmer who optimize the code is fast enough to get 100% speed improvement in 6 months,
then you can expect the old optimized program to be better because usually you do not get something that is equivalent to being twice faster in 6 months.

hgm · Post by **hgm** » Mon Mar 04, 2013 1:47 pm

lucasart wrote:Actually if the challenge is to improve assembly code produced by a compiler, then we don't need my code at all. Let's take a simpler one that is written in C (as opposed to C++): how about OliPerft ?
http://home.arcor.de/dreamlike/chess/index.html
(download OliPerft 1.01 source code)

Actually, why not do this on the AI routine (D(...)) of Fairy-Max, which still seems an order of magnitude smaller than OliPerft (only ~100 short lines of C code). Then it would not just be an academic exercise, but actually have some practical use as well.

http://hgm.nubati.net/fairymax.tar.gz

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Mar 04, 2013 1:50 pm

Rebel wrote:
Gerd Isenberg wrote:
Rebel wrote:
Don wrote: Look at some of the writings of Michael Abrash and I think you will change your tune
Precisely, the man has been my teacher
Hey, you never told us before
Thanks to Don for link and hint.
Gerd, good to see you join the discussion. Is there a good page about how to write good and optimized X64 ASM code?

I only have links to various Intel/AMD manuals, specially the optimization manuals:
https://chessprogramming.wikispaces.com/x86-64#Manuals
and from Agner Fog
http://www.agner.org/optimize/
and just found
http://software.intel.com/en-us/article ... 4-assembly

I never used any x86-64 assembly i.e. MASM64 by myself (a little inline assembly before with VC for x86-32), and SSE2 C++ wrapper classes using SSE2 (and maybe in future with AVX2) intrinsics - the generated assembly (VC2008) was quite fine, and I think I am not able to improve it much or at all with MASM64.

Don · Post by **Don** » Mon Mar 04, 2013 2:04 pm

hgm wrote:Then the challenge is not properly formulated. He asks for an assembly-written perft() that beat his C-code. Apparently any assemby-code would not do, and there are some restrictions on it. Like that the assembly code in question could be considered an optimized version of his C-code algorithm. But that is an ill-defined task. Modern optimizers do amazing things, and when I try to time multiplier performance, and write code like
Code: Select all
cnt=1e9; while(--cnt) a *= 1.00000000001;
it always times as 0, because the optimizer removes the entire loop. (Unless you actually print the value of a afterwards.)

So if the C-code perft is slow, because it Makes/UnMakes the last ply, refraining from making that ply causes an enormous speedup. But a clever optimizer (especially a human optimizer!) can easily recognize the set of assembly instructions that make the move and then immediately take it back as a set of instructions that as a group do nothing, and remove it for that reason.

I remember having to optimize PDP 11 assembly code for Akkerman's function for a course in computer organization, which is a heavily recursively defined functions. The best I could come up with was beaten by orders of magnitude by some others, who had used the fact that for some low value of the arguments there existed a formula for that function in closed form (something like Akk(0, N) = N, when the definition for that case would reduce to Akk(0,N) = Akk(0,N-1) and Akk(0,0)=0. So my optimized code was calculating N recursively by N recursive function calls each doing an INC on the passed value to produce N. Others has 'optimized that away'.

The only reasonable way to have a contest is to have a well defined task and put no constraints on HOW it's approached, except that you stick to the language in question. For example a perft that takes a 4 field fen as input and as output returns perft to the specified depth. It is not a optimization of an existing program and you cannot use assembly routines to speed up the C version - other than that, anything goes. Even so, you would be measuring the performance of the implementation which could be inferior whether for the C version of the assembly version.

Martin Thoresen · Post by **Martin Thoresen** » Mon Mar 04, 2013 2:36 pm

Aaron Becker wrote:About two years ago, I stopped developing Daydreamer, my own engine, because it was taking up time that I needed to spend on school. Now I have more free time again and have been thinking about coming back to computer chess, and I was wondering what's been going on in the last few years.

I see some new names among engine developers, and certainly the engines at the top are stronger. Have there been any important new techniques developed, or are the improvements due to something else, maybe just incremental tuning.

In short, what did I miss?

Hello Aaron!
Quite funny, when selecting participants for the now "resurrected" TCEC, I was thinking of Daydreamer just when I saw that you stopped working on it, so it hadn't been updated since the "old" TCEC.

In any case, welcome back. I can't remember if it was SMP or not, but if it is, then it is a good candidate for Season 2 that will start sometime in April.

Best,
Martin

lucasart · Post by **lucasart** » Mon Mar 04, 2013 3:29 pm

hgm wrote:
lucasart wrote:Actually if the challenge is to improve assembly code produced by a compiler, then we don't need my code at all. Let's take a simpler one that is written in C (as opposed to C++): how about OliPerft ?
http://home.arcor.de/dreamlike/chess/index.html
(download OliPerft 1.01 source code)
Actually, why not do this on the AI routine (D(...)) of Fairy-Max, which still seems an order of magnitude smaller than OliPerft (only ~100 short lines of C code). Then it would not just be an academic exercise, but actually have some practical use as well.

http://hgm.nubati.net/fairymax.tar.gz

Agreed: smaller is better.

hgm · Post by **hgm** » Mon Mar 04, 2013 3:47 pm

Well, to make it really small, and precisely define the task, let's say we count the time-to-depth of micro-Max 1.6 from the opening position. Requirement is that the assembly code not only produces the same move and score, but also the same node count. (You would have to add printing of N for that.)

Only the routine D() consumes any time.

Note that micro-Max 4.8 (not exactly this simplified version) did speed up by a factor 1.7 compared to my plain gcc -O2 compile when someone did a careful PGO on it, so it should definitely be possible for a human to improve on a plain compile by using insight on the number of times a loop will be iterated. My suspicion is that the PGO used heavy unrolling of loops, as the size of the exe increased by an order of magnitude.

Code: Select all

/***************************************************************************/
/*                               micro-Max,                                */
/* A chess program smaller than 2KB (of non-blank source), by H.G. Muller  */
/***************************************************************************/
/* version 1.6 (1433 non-blank characters) features:                       */
/* - recursive negamax search                                              */
/* - quiescence search with recaptures                                     */
/* - recapture extensions                                                  */
/* - (internal) iterative deepening                                        */
/* - best-move-first 'sorting'                                             */
/* - full FIDE rules and move-legality checking                            */

/* accepts under-promotions: type 1,2,3 (=R,B,N) after input move          */
/* (input buffer c[] & *P made global, K and N encoding swapped for this)  */

#define W while

int M=136,S=128,I=8e3,C=799,Q,O,K,N;   

char L,*P,
w[]={0,1,1,-1,3,3,5,9},                     
o[]={-16,-15,-17,0,1,16,0,1,16,15,17,0,14,18,31,33,0,
     7,-1,6,11,8,3,6,                         
     6,4,5,7,3,5,4,6},                         
b[129],

n[]=".?+knbrq?*?KNBRQ",

c[9];

D(k,q,l,e,E,z,n)       
int k,q,l,e,E,z,n;     
{                       
 int j,r,m,v,d,h,i,F,G,s;
 char t,p,u,x,y,X,Y,H,B;

 q--;                                         
 d=X=Y=0;                                     

 W(d++<n||                                     
   z==8&K==I&&(N<1e6&d<98||                   
   (K=X,L=Y&~M,d=2)))                         
 {x=B=X;                                       
  h=Y&S;                                   
  m=d>1?-I:e;                                 
  N++;                                         
  do{u=b[x];                                   
   if(u&k)                                     
   {r=p=u&7;                                   
    j=o[p+16];                                 
    W(r=p>2&r<0?-r:-o[++j])                   
    {A:                                       
     y=x;F=G=S;                               
     do{                                       
      H=y=h?Y^h:y+r;                       
      if(y&M)break;                           
      m=E-S&&b[E]&&y-E<2&E-y<2?I:m;    /* castling-on-Pawn-check bug fixed */
      if(p<3&y==E)H^=16;                       
      t=b[H];if(t&k|p<3&!(y-x&7)-!t)break;       
      i=99*w[t&7];                             
      m=i<0?I:m;                       /* castling-on-Pawn-check bug fixed */
      if(m>=l)goto C;                         

      if(s=d-(y!=z))                           
      {v=p<6?b[x+8]-b[y+8]:0;
       b[G]=b[H]=b[x]=0;b[y]=u|32;             
       if(!(G&M))b[F]=k+6,v+=30;               
       if(p<3)                                 
       {v-=9*((x-2&M||b[x-2]-u)+               
              (x+2&M||b[x+2]-u)-1);           
        if(y+r+1&S)b[y]|=7,i+=C;               
       }
       v=-D(24-k,-l,m>q?-m:-q,-e-v-i,F,y,s);   
       if(K-I)                                 
       {if(v+I&&x==K&y==L&z==8)               
        {Q=-e-i;O=F;
         if(b[y]-u&7&&P-c>5)b[y]-=c[4]&3;        /* under-promotions */
         return l;
        }v=m;                                   
       }                                       
       b[G]=k+6;b[F]=b[y]=0;b[x]=u;b[H]=t;     
       if(v>m)                         
        m=v,X=x,Y=y|S&F;                       
       if(h){h=0;goto A;}                           
      }
      if(x+r-y|u&32|                           
         p>2&(p-3|j-7||                       
         b[G=x+3^r>>1&7]-k-6                   
         ||b[G^1]|b[G^2])                     
        )t+=p<5;                               
      else F=y;                               
     }W(!t);                                   
  }}}W((x=x+9&~M)-B);                         
C:if(m>I-M|m<M-I)d=98;                         
  m=m+I?m:-D(24-k,-I,I,0,S,S,1);   
 }                                             
 return m+=m<e;                               
}

main()
{
 int k=8;

 K=8;W(K--)
 {b[K]=(b[K+112]=o[K+24]+8)+8;b[K+16]=18;b[K+96]=9;
  L=8;W(L--)b[16*L+K+8]=(K-4)*(K-4)+(L-3.5)*(L-3.5);
 }                                                   

 W(1)                                               
 {N=-1;W(++N<121)
   printf(" %c",N&8&&(N+=7)?10:n[b[N]&15]);         
  P=c;W((*P++=getchar())>10);
  K=I;                                               
  if(*c-10)K=*c-16*c[1]+C,L=c[2]-16*c[3]+C;         
  k^=D(k,-I,I,Q,O,8,2)-I?0:24;
 }
}

Don · Post by **Don** » Mon Mar 04, 2013 3:55 pm

hgm wrote:Well, to make it really small, and precisely define the task, let's say we count the time-to-depth of micro-Max 1.6 from the opening position. Requirement is that the assembly code not only produces the same move and score, but also the same node count. (You would have to add printing of N for that.)

Only the routine D() consumes any time.

Note that micro-Max 4.8 (not exactly this simplified version) did speed up by a factor 1.7 compared to my plain gcc -O2 compile when someone did a careful PGO on it, so it should definitely be possible for a human to improve on a plain compile by using insight on the number of times a loop will be iterated. My suspicion is that the PGO used heavy unrolling of loops, as the size of the exe increased by an order of magnitude.

Exactly. It should be no surprise that compilers are off by a lot in the ultimate performance and that humans can get some of that back with assembly. Sometimes just a few option changes make a big difference, or a different compiler give 10%, etc. So it's not like compilers are almost perfect.

Code: Select all

/***************************************************************************/
/*                               micro-Max,                                */
/* A chess program smaller than 2KB (of non-blank source), by H.G. Muller  */
/***************************************************************************/
/* version 1.6 (1433 non-blank characters) features:                       */
/* - recursive negamax search                                              */
/* - quiescence search with recaptures                                     */
/* - recapture extensions                                                  */
/* - (internal) iterative deepening                                        */
/* - best-move-first 'sorting'                                             */
/* - full FIDE rules and move-legality checking                            */

/* accepts under-promotions: type 1,2,3 (=R,B,N) after input move          */
/* (input buffer c[] & *P made global, K and N encoding swapped for this)  */

#define W while

int M=136,S=128,I=8e3,C=799,Q,O,K,N;   

char L,*P,
w[]={0,1,1,-1,3,3,5,9},                     
o[]={-16,-15,-17,0,1,16,0,1,16,15,17,0,14,18,31,33,0,
     7,-1,6,11,8,3,6,                         
     6,4,5,7,3,5,4,6},                         
b[129],

n[]=".?+knbrq?*?KNBRQ",

c[9];

D(k,q,l,e,E,z,n)       
int k,q,l,e,E,z,n;     
{                       
 int j,r,m,v,d,h,i,F,G,s;
 char t,p,u,x,y,X,Y,H,B;

 q--;                                         
 d=X=Y=0;                                     

 W(d++<n||                                     
   z==8&K==I&&(N<1e6&d<98||                   
   (K=X,L=Y&~M,d=2)))                         
 {x=B=X;                                       
  h=Y&S;                                   
  m=d>1?-I:e;                                 
  N++;                                         
  do{u=b[x];                                   
   if(u&k)                                     
   {r=p=u&7;                                   
    j=o[p+16];                                 
    W(r=p>2&r<0?-r:-o[++j])                   
    {A:                                       
     y=x;F=G=S;                               
     do{                                       
      H=y=h?Y^h:y+r;                       
      if(y&M)break;                           
      m=E-S&&b[E]&&y-E<2&E-y<2?I:m;    /* castling-on-Pawn-check bug fixed */
      if(p<3&y==E)H^=16;                       
      t=b[H];if(t&k|p<3&!(y-x&7)-!t)break;       
      i=99*w[t&7];                             
      m=i<0?I:m;                       /* castling-on-Pawn-check bug fixed */
      if(m>=l)goto C;                         

      if(s=d-(y!=z))                           
      {v=p<6?b[x+8]-b[y+8]:0;
       b[G]=b[H]=b[x]=0;b[y]=u|32;             
       if(!(G&M))b[F]=k+6,v+=30;               
       if(p<3)                                 
       {v-=9*((x-2&M||b[x-2]-u)+               
              (x+2&M||b[x+2]-u)-1);           
        if(y+r+1&S)b[y]|=7,i+=C;               
       }
       v=-D(24-k,-l,m>q?-m:-q,-e-v-i,F,y,s);   
       if(K-I)                                 
       {if(v+I&&x==K&y==L&z==8)               
        {Q=-e-i;O=F;
         if(b[y]-u&7&&P-c>5)b[y]-=c[4]&3;        /* under-promotions */
         return l;
        }v=m;                                   
       }                                       
       b[G]=k+6;b[F]=b[y]=0;b[x]=u;b[H]=t;     
       if(v>m)                         
        m=v,X=x,Y=y|S&F;                       
       if(h){h=0;goto A;}                           
      }
      if(x+r-y|u&32|                           
         p>2&(p-3|j-7||                       
         b[G=x+3^r>>1&7]-k-6                   
         ||b[G^1]|b[G^2])                     
        )t+=p<5;                               
      else F=y;                               
     }W(!t);                                   
  }}}W((x=x+9&~M)-B);                         
C:if(m>I-M|m<M-I)d=98;                         
  m=m+I?m:-D(24-k,-I,I,0,S,S,1);   
 }                                             
 return m+=m<e;                               
}

main()
{
 int k=8;

 K=8;W(K--)
 {b[K]=(b[K+112]=o[K+24]+8)+8;b[K+16]=18;b[K+96]=9;
  L=8;W(L--)b[16*L+K+8]=(K-4)*(K-4)+(L-3.5)*(L-3.5);
 }                                                   

 W(1)                                               
 {N=-1;W(++N<121)
   printf(" %c",N&8&&(N+=7)?10:n[b[N]&15]);         
  P=c;W((*P++=getchar())>10);
  K=I;                                               
  if(*c-10)K=*c-16*c[1]+C,L=c[2]-16*c[3]+C;         
  k^=D(k,-I,I,Q,O,8,2)-I?0:24;
 }
}

Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years