Back to assembly


stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Back to assembly

Post by stegemma »

Here are the two games:

In the first one, Satana had 5 minutes + 10 s while Drago searched to a fixed depth of 6 plies. Drago played in a few seconds per move, while Satana made better use of its time:

[pgn]
[Event "Drago 0.21 - Satana 2.0.7"]
[Site "LIGSWIN"]
[Date "2015.02.18"]
[Round "1"]
[White "Drago 0.21"]
[Black "Satana.2.0.7.w64bit"]
[Result "0-1"]
[BlackElo "1400"]
[ECO "D00"]
[Opening "Blackmar-Diemer"]
[Time "17:32:29"]
[Variation "Lemberger Countergambit"]
[WhiteElo "2400"]
[TimeControl "300+10"]
[Termination "normal"]
[PlyCount "60"]
[WhiteType "program"]
[BlackType "program"]

1. e4 d5 {(d7d5 <e7e6-9128310 nodes -0.02s left>) -1.08/6 7} 2. Nc3 dxe4
{(d5e4 <g7g5-7983305 nodes -0.02s left>) -1.07/6 7} 3. d4 e5 {(e7e5
<e4d3-7051828 nodes -0.03s left>) -0.07/6 7} 4. dxe5 Qxd1+ {(d8d1
<b8d7-6383527 nodes -0.03s left>) -1.06/6 7} 5. Kxd1 Nc6 {(b8c6
<c8f5-7500463 nodes -0.02s left>) -1.00/6 7} 6. Bb5 a6 {(a7a6 <h7h6-5878476
nodes -0.03s left>) -1.04/6 7} 7. Bxc6+ bxc6 {(b7c6 <b7c6-8446210 nodes
-0.02s left>) -1.03/6 7} 8. Be3 f6 {(f7f6 <f8e7-8378355 nodes -0.02s left>)
-1.05/6 7} 9. exf6 Nxf6 {(g8f6 <g8f6-8661488 nodes -0.03s left>) -1.05/6 8}
10. a4 Nd5 {(f6d5 <c8e6-9137777 nodes -0.03s left>) -1.00/6 8} 11. Nxd5
cxd5 {(c6d5 <c6d5-9349430 nodes -0.03s left>) -0.08/6 8} 12. Ne2 Bd6 {(f8d6
<a6a5-9795866 nodes -0.03s left>) -0.99/6 8} 13. Nc3 Be5 {(d6e5
<c8d7-9586561 nodes -0.03s left>) -0.99/6 8} 14. Nxd5 Be6 {(c8e6
<c8f5-8443417 nodes -0.03s left>) -1.01/6 8} 15. Nc3 Bf5 {(e6f5
<e8e7-8494867 nodes -0.02s left>) -1.04/6 8} 16. Ke1 c6 {(c7c6
<a8c8-8708181 nodes -0.02s left>) -1.05/6 8} 17. a5 Rb8 {(a8b8
<f5e6-9217368 nodes -0.03s left>) -1.04/6 8} 18. f4 exf3 {(e4f3
<e4f3-9649563 nodes -0.02s left>) +0.99/6 8} 19. gxf3 Rxb2 {(b8b2
<b8a8-8883173 nodes -0.03s left>) +1.00/6 8} 20. Ne2 Bxc2 {(f5c2
<e5d6-9362117 nodes -0.02s left>) +1.00/6 8} 21. Bd4 Bxd4 {(e5d4
<b2b5-9674838 nodes -0.01s left>) +0.07/8 8} 22. Nxd4 O-O {(e8g8
<e8g8-11134139 nodes -0.02s left>) +0.07/6 8} 23. Kd2 Bg6+ {(c2g6
<c2g6-10065800 nodes -0.02s left>) +0.99/6 8} 24. Kc3 Rg2 {(b2g2
<b2g2-9365110 nodes -0.02s left>) +1.00/6 8} 25. h4 c5 {(c6c5 <c6c5-8969182
nodes -0.02s left>) +2.10/6 8} 26. Nc6 Rc2+ {(g2c2 <g2c2-205158 nodes 8.42s
left>) +319.00/6 0} 27. Kb3 Rxf3+ {(f8f3 <f8f3-4104 nodes 8.87s left>)
+319.00/4 0} 28. Ka4 Rc4+ {(c2c4 <c2c4-1321 nodes 9.12s left>) +319.00/2 0}
29. Nb4 cxb4 {(c5b4 <c5b4-137 nodes 9.37s left>) +319.00/2 0} 30. h5 Be8#
{(g6e8 <g6e8-358 nodes 9.62s left>) +319.00/2 0} 0-1
[/pgn]

In the second one, Satana still had 5 minutes + 10 s and Drago searched to a fixed depth of 8 plies. The right time control would be about 15 minutes at this depth, and I was executing the moves manually, switching between the real and the virtual machine, so Drago lost on time in an interesting position:

[pgn]
[Event "Drago 0.21 - Satana 2.0.7 - 8"]
[Site "LIGSWIN"]
[Date "2015.02.18"]
[Round "1"]
[White "Drago 0.21"]
[Black "Satana.2.0.7.w64bit"]
[Result "0-1"]
[BlackElo "1400"]
[ECO "B12"]
[Opening "Caro-Kann"]
[Time "17:43:48"]
[Variation "2.d4 d5"]
[WhiteElo "2400"]
[TimeControl "300+10"]
[Termination "time forfeit"]
[PlyCount "12"]
[WhiteType "program"]
[BlackType "program"]

1. e4 d5 {(d7d5 <e7e6-9093374 nodes -0.02s left>) -1.08/6 7} 2. d4 c6
{(c7c6 <d8d7-7840589 nodes -0.02s left>) -1.06/6 7} 3. Bf4 dxe4 {(d5e4
<b7b5-7365178 nodes -0.03s left>) -0.09/6 7} 4. Nd2 Qxd4 {(d8d4
<g8f6-6973971 nodes -0.03s left>) +0.92/6 7} 5. c3 Qf6 {(d4f6 <d4d2-6738029
nodes -0.02s left>) -0.02/6 7} 6. Bc7 e3 {(e4e3 <f6e6-6102870 nodes -0.03s
left>) -0.05/6 7} 7. ... {White forfeits on time} 0-1
[/pgn]

And now Drago could play Ne4!, which seems strong to me.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Back to assembly

Post by matthewlai »

stegemma wrote:With my last engine, Satana, I experimented with programming a chess engine in C++; surely it is my own limitation, but I can't get a decent program that way. I thought that programming in C++ would ease the development stage (I use C++ for business applications, so I know it) and above all speed up debugging, but I was wrong. I find that the bugs are the same and the time needed to find them is no less in C++ than in assembly: both depend on me, not on the language.

So I decided to come back to assembly programming, now that I can use the full power of the new 64-bit registers. I partially used the MMX registers in Freccia but never R8...R15 and XMM, available in Intel and AMD CPUs. My engine would not be fully portable, but I could continue the journey made with Drago -> Raffaela -> Freccia in a more modern and faster way.

Above all, I'm the fool who still programs in assembly, that's my role! ;)
One of the biggest advantages I found of writing something in C/C++ vs assembly is being able to write simple and intuitive code, and have the compiler/optimizer do all the ugly transformations to make it fast.

Things like function inlining, partial loop unrolling, reordering instructions to more evenly use CPU resources, and using less obvious instructions to do things slightly faster ("xor rax, rax" instead of "mov rax, 0", or masking and shifting for modulus).

Do you do those things by hand? Or do you try to keep the code easy to read and maintain, and give up some performance for it?
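
To make the "masking and shifting for modulus" example concrete, here is a minimal sketch in C. The function names are made up for illustration. For unsigned values and a power-of-two divisor, the remainder is just the low bits and the quotient is a right shift, which is the kind of rewrite an optimizing compiler usually applies on its own:

```c
#include <stdint.h>

/* What you would naturally write. */
uint32_t mod8_plain(uint32_t x) { return x % 8u; }
uint32_t div8_plain(uint32_t x) { return x / 8u; }

/* The hand-transformed versions: valid because 8 is a power of two
   and x is unsigned. */
uint32_t mod8_mask(uint32_t x)  { return x & 7u; }
uint32_t div8_shift(uint32_t x) { return x >> 3; }
```

For signed operands the two forms are not equivalent for negative values, so the compiler has to emit a small fix-up; that is one more detail it tracks so the source can stay readable.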
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Back to assembly

Post by mar »

matthewlai wrote:and using less obvious instructions to do things slightly faster ("xor rax, rax" instead of "mov rax, 0", or masking and shifting for modulus).
I don't see how xor is faster than mov. Point is xor takes 3 bytes instead of 7 for mov. Also xor changes flags (unlike mov). So I agree it's better to let the compiler decide.
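
For reference, a small sketch of the encodings behind those byte counts, written as C byte arrays so the sizes can be checked. The array names are invented here, and the bytes shown are the usual x86-64 encodings; an assembler is free to pick a different (often shorter) encoding for the same mnemonic:

```c
#include <stddef.h>

/* Usual machine-code bytes for four ways to zero RAX on x86-64. */
static const unsigned char xor_rax_rax[] = {0x48, 0x31, 0xC0};             /* xor rax, rax */
static const unsigned char mov_rax_0[]   = {0x48, 0xC7, 0xC0, 0, 0, 0, 0}; /* mov rax, 0 (sign-extended imm32) */
static const unsigned char xor_eax_eax[] = {0x31, 0xC0};                   /* xor eax, eax */
static const unsigned char mov_eax_0[]   = {0xB8, 0, 0, 0, 0};             /* mov eax, 0 */

/* Encoded lengths in bytes, in the order listed above. */
static const size_t zeroing_lengths[] = {
    sizeof xor_rax_rax, sizeof mov_rax_0, sizeof xor_eax_eax, sizeof mov_eax_0
};
```

The 32-bit forms also clear the whole 64-bit register, because writing a 32-bit register on x86-64 zero-extends into the upper half.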
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Back to assembly

Post by matthewlai »

mar wrote:
matthewlai wrote:and using less obvious instructions to do things slightly faster ("xor rax, rax" instead of "mov rax, 0", or masking and shifting for modulus).
I don't see how xor is faster than mov. Point is xor takes 3 bytes instead of 7 for mov. Also xor changes flags (unlike mov). So I agree it's better to let the compiler decide.
That was just a random example I remember hearing about. But yes, you clearly see what I meant.

PS. In some cases smaller = faster, if the actively used sections of code are bigger than the i-cache.
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Back to assembly

Post by stegemma »

mar wrote:
matthewlai wrote:and using less obvious instructions to do things slightly faster ("xor rax, rax" instead of "mov rax, 0", or masking and shifting for modulus).
I don't see how xor is faster than mov. Point is xor takes 3 bytes instead of 7 for mov. Also xor changes flags (unlike mov). So I agree it's better to let the compiler decide.
It was faster on old CPUs; maybe that is no longer the case on new ones. Still, an experienced assembly programmer uses "xor reg,reg" just as anybody writing C uses ++x instead of x = x + 1.
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Back to assembly

Post by stegemma »

matthewlai wrote:
One of the biggest advantages I found of writing something in C/C++ vs assembly is being able to write simple and intuitive code, and have the compiler/optimizer do all the ugly transformations to make it fast.

Things like function inlining, partial loop unrolling, reordering instructions to more evenly use CPU resources, and using less obvious instructions to do things slightly faster ("xor rax, rax" instead of "mov rax, 0", or masking and shifting for modulus).

Do you do those things by hand? Or do you try to keep the code easy to read and maintain, and give up some performance for it?
You can do the same things in assembly, sometimes using macros. Instruction reordering is done by the CPU itself, or you can do it by hand for small amounts of code.

My idea, for now, is to use all of the new 64-bit registers to improve move generation. This could maybe be done by the compiler, but it is more interesting to study how to do it by hand. The ultimate goal could be a Zero-RAM move generator, one that uses all of the CPU registers and only them, without ever accessing RAM. This could be done fairly easily for perft, but hardly for alpha-beta, of course. Move generation by itself could fit in the CPU registers; this could be as useless as... counting perft 15 ;)
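
As a rough illustration of register-only move generation, here is a sketch in C rather than assembly. It is not Drago's or Satana's code. It assumes a bitboard representation with a1 = bit 0 and h8 = bit 63, which the post does not specify, and the function name is hypothetical. Knight attacks are computed purely with shifts and masks, so there are no table lookups and a compiler can keep everything in registers:

```c
#include <stdint.h>

/* Attack set for every knight in `knights` (assumed layout: a1 = bit 0,
   h8 = bit 63). The masks drop targets that wrapped around a board edge. */
uint64_t knight_attacks(uint64_t knights) {
    const uint64_t not_a  = 0xFEFEFEFEFEFEFEFEULL; /* every file except a     */
    const uint64_t not_ab = 0xFCFCFCFCFCFCFCFCULL; /* every file except a, b  */
    const uint64_t not_h  = 0x7F7F7F7F7F7F7F7FULL; /* every file except h     */
    const uint64_t not_gh = 0x3F3F3F3F3F3F3F3FULL; /* every file except g, h  */

    return ((knights << 17) & not_a)    /* up 2, right 1   */
         | ((knights << 10) & not_ab)   /* up 1, right 2   */
         | ((knights >>  6) & not_ab)   /* down 1, right 2 */
         | ((knights >> 15) & not_a)    /* down 2, right 1 */
         | ((knights << 15) & not_h)    /* up 2, left 1    */
         | ((knights <<  6) & not_gh)   /* up 1, left 2    */
         | ((knights >> 10) & not_gh)   /* down 1, left 2  */
         | ((knights >> 17) & not_h);   /* down 2, left 1  */
}
```

For example, `knight_attacks(1ULL << 1)` (a knight on b1) yields the squares d2, a3 and c3. Sliding pieces and a full search need more state than this, which is where staying out of RAM becomes hard.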
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Back to assembly

Post by matthewlai »

stegemma wrote:
You can do the same things in assembly, sometimes using macros. Instruction reordering is done by the CPU itself, or you can do it by hand for small amounts of code.

My idea, for now, is to use all of the new 64-bit registers to improve move generation. This could maybe be done by the compiler, but it is more interesting to study how to do it by hand. The ultimate goal could be a Zero-RAM move generator, one that uses all of the CPU registers and only them, without ever accessing RAM. This could be done fairly easily for perft, but hardly for alpha-beta, of course. Move generation by itself could fit in the CPU registers; this could be as useless as... counting perft 15 ;)
Ah ok that's true. I guess a lot of it can be done with macros.

I have read quite a bit of assembly code (mostly compiler generated), so I'm fairly familiar with it, but I almost never actually write assembly code myself, so I never see things like macros, since compilers don't use them :D.

The biggest program I've written in asm was tic-tac-toe (with minimax). Took me 8 hours. I decided to not write more assembly programs :D (I wrote an equivalent C version first, and it took me about half an hour).

I guess that's the problem with programming - we always stick with what we know, and keep getting better at it, while never improving on things we aren't good at.

The new registers sound like a lot of fun! I remember how annoying it was in x86-32 with all the non-orthogonal registers, and never having enough of them!

One thing I have learned very early on is it's almost hopeless for me to compete with the compiler on speed of generated code. I have also read some code generated by "gcc -O3", and it always felt like some asm master sat down and spent 1 hour per 10 lines optimizing the hell out of it. It's typically very hard to understand, but once you do, it's mind-blowing how efficient it is, exploiting every little quirk of the CPU.

Would definitely still be a fun exercise, though :D.

And auto-vectorization still doesn't work very well from what I heard. So if your code does a lot of SIMD, it may still pay off to write asm (though in most cases intrinsics are just as good).
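
For anyone who has not seen intrinsics, a minimal sketch of what that looks like in C. The function name is made up, and the SSE2 path is guarded so the code still builds on a target without SSE2. You pick the SIMD instructions, and the compiler still handles register allocation and the loop around them:

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Element-wise out[i] = a[i] + b[i] for n 32-bit integers. */
void add_i32(const int *a, const int *b, int *out, size_t n) {
    size_t i = 0;
#if defined(__SSE2__)
    /* Four integers per iteration using unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
#endif
    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```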
sean_vn
Posts: 4
Joined: Thu Feb 05, 2015 4:49 pm

Re: Back to assembly

Post by sean_vn »

The main reason to use assembly language is to use unusual instructions in unusual ways to gain speed. You would know the instruction set for the machine and think of a fast algorithm based on those instructions. It would be difficult even to express the idea in C.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Back to assembly

Post by wgarvin »

sean_vn wrote:The main reason to use assembly language is to use unusual instructions in unusual ways to gain speed. You would know the instruction set for the machine and think of a fast algorithm based on those instructions. It would be difficult even to express the idea in C.
I'm not sure I agree with that, at least with respect to x86/x64. The "unusual" instructions tend to exist for backward-compatibility reasons, and be much slower than the "typical" ones used by compilers. Except of course for SIMD code, which you can still write in C or C++ using compiler intrinsics (and leave the register allocation, spilling, inlining, loop unrolling, etc. for the compiler to deal with).

The only reasons I would suggest for writing anything in x86/x64 assembly these days are:
(1) for fun or the learning experience, or
(2) you are working on a device driver or embedded system (and maybe not even then), or
(3) you've profiled your program and 9% of the runtime is spent inside one relatively small function and you think it's worth spending the next 2 days trying to see if you can beat the compiler! :lol:

And for (3) there's a definite chance that you can, if you understand the hardware really well... but it might be a lot of work compared to writing the C++ version.

[Edit: as for the example of "xor rax, rax" versus "mov rax, 0" mentioned above in the thread, on modern CPUs those are both recognized as dependency-breaking instructions, and they are about equally good. I think there's a four-byte version of the mov rax,0 one as well, but I can't remember for sure. To anyone interested in this stuff, google "Agner Fog" and read his microarchitecture docs, they are a great source of arcane details about modern x86 chips!]
sean_vn
Posts: 4
Joined: Thu Feb 05, 2015 4:49 pm

Re: Back to assembly

Post by sean_vn »

Hey, there is a lot of junk in the Intel instruction set, but bswap, haddps, rdrand, and some of the crc instructions do things that are not so expressible in C. Sometimes you can gain.
If there is no special instruction you can exploit, then gcc or Java HotSpot will generally do better than you can.
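
On bswap specifically, a byte swap can at least be written in portable C, as in the sketch below (the function name is made up). Optimizing compilers such as GCC and Clang typically recognize this kind of pattern and may emit a single bswap for it, and both also provide `__builtin_bswap64`. Whether a given compiler version actually does so is worth checking in the generated assembly:

```c
#include <stdint.h>

/* Reverse the byte order of a 64-bit value using only shifts and masks. */
uint64_t bswap64_portable(uint64_t x) {
    /* Swap adjacent bytes. */
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    /* Swap adjacent 16-bit pairs. */
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    /* Swap the two 32-bit halves. */
    return (x << 32) | (x >> 32);
}
```

Instructions like rdrand are a different story, since they expose hardware behavior rather than a computation, and there an intrinsic or assembly is the only way in.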