Page 18 of 36

Re: Progress on Rustic

Posted: Mon Feb 22, 2021 4:24 pm
by lithander
mvanthoor wrote: Sun Feb 21, 2021 10:11 pm At that point, the move generators and make/unmake functions were near-identical, one written completely in C, the other in Rust (so, completely different structure, but the same functionality), and they were also near-identical in speed at +/- 40 million leaves/sec on my system.
I didn't find a Dev-enabled Windows build of Weiss and don't know how to compile 'makefiles' under Windows and so I compiled it via Cygwin instead but it only reached 20 NPS. I don't think it was a debug build but maybe 32bit?

Code: Select all

tjahn@DESKTOP-NTIPCRE ~/weiss-master/src
$ ./weiss-popcnt.exe
perft

Perft starting:
Depth : 5
FEN   : r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1

Perft complete:
Time : 9408ms
NPS  : 20587870
Nodes: 193690690
For Rustic i used the builds you provided. Both the popcount and bmi2 versions reached 36M NPS, generic only 20M!

Code: Select all

rustic-alpha-1_64-bit-popcnt.exe -p 6

Perft 1: 20 (0 ms, inf leaves/sec)
Perft 2: 400 (0 ms, inf leaves/sec)
Perft 3: 8902 (0 ms, inf leaves/sec)
Perft 4: 197281 (5 ms, 39456200 leaves/sec)
Perft 5: 4865609 (138 ms, 35258036 leaves/sec)
Perft 6: 119060324 (3296 ms, 36122671 leaves/sec)
Total time spent: 3439 ms
Execution speed: 36095532 leaves/second
mvanthoor wrote: Sun Feb 21, 2021 10:11 pm There's nothing stopping you from implementing a magic bitboard move generator in C#...
I will put it on my ToDo! ;)

I know at least two other active community members here have written engines in C# that use Bitboards.
emadsen wrote: Sun Feb 21, 2021 10:11 pm Mad Chess
and
jshriver wrote: Sun Feb 21, 2021 10:11 pm Ceres
I wonder how fast their move generators are? And if in their opinion there's a significant performance penalty for using C# instead of C or Rust?

(Btw... is there a more elegant way to mention users on PhpBB other than hacking the quote markup tag like I did?)

Re: Progress on Rustic

Posted: Mon Feb 22, 2021 4:44 pm
by Guenther
lithander wrote: Mon Feb 22, 2021 4:24 pm
I know at least two other active community members here have written engines in C# that use Bitboards.
emadsen wrote: Sun Feb 21, 2021 10:11 pm Mad Chess
and
jshriver wrote: Sun Feb 21, 2021 10:11 pm Ceres
I wonder how fast their move generators are? And if in their opinion there's a significant performance penalty for using C# instead of C or Rust?

(Btw... is there a more elegant way to mention users on PhpBB other than hacking the quote markup tag like I did?)
Just nitpicking, the second one is wrong, Ceres is not written by an active member and not by Joshua.

From still active programs those are C# too (none of them active here though, both bitboards AFAIK)

https://github.com/Tearth/Cosette
https://github.com/zongzhengli/Absolute-Zero

Re: Progress on Rustic

Posted: Mon Feb 22, 2021 5:39 pm
by lithander
When Joshua announced Ceres it I mistook him for the author! :oops:
Guenther wrote: Mon Feb 22, 2021 4:44 pm From still active programs those are C# too (none of them active here though, both bitboards AFAIK)
https://github.com/Tearth/Cosette
https://github.com/zongzhengli/Absolute-Zero
Good finds! Both implement 'perft' command.

Absolute-Zero computes 'perft 6' in less than a second at 126MN/s! :shock:

Code: Select all

perft 6
Depth     Time       Speed         Nodes
-----------------------------------------------------------------------
1         0 ms       11111 kN/s    20
2         0 ms       86957 kN/s    400
3         0 ms       132273 kN/s   8902
4         2 ms       127492 kN/s   197281
5         36 ms      133332 kN/s   4865609
6         939 ms     126768 kN/s   119060324
-----------------------------------------------------------------------
Cosette takes 10x longer!

Code: Select all

> perft 6
Depth 0 - Leafs: 1 (0.003 ML/s), Time: 0.000 s, Time per leaf: 355000.000 ns
Depth 1 - Leafs: 20 (0.003 ML/s), Time: 0.007 s, Time per leaf: 373230.000 ns
Depth 2 - Leafs: 400 (1.528 ML/s), Time: 0.000 s, Time per leaf: 654.250 ns
Depth 3 - Leafs: 8902 (2.605 ML/s), Time: 0.003 s, Time per leaf: 383.914 ns
Depth 4 - Leafs: 197281 (3.075 ML/s), Time: 0.064 s, Time per leaf: 325.200 ns
Depth 5 - Leafs: 4865609 (10.759 ML/s), Time: 0.452 s, Time per leaf: 92.941 ns
Depth 6 - Leafs: 119060324 (13.535 ML/s), Time: 8.796 s, Time per leaf: 73.882 ns
Both use Bitboards as far as I can see.

Marcel, sorry for hijacking your thread. I was just typing out the question 'Any ideas why there's an order of magnitude difference in speed?' when I realized that this is really offtopic, right? This should be about Rustic! Sorry, I'm going to shut up now. :(

Re: Progress on Rustic

Posted: Mon Feb 22, 2021 6:30 pm
by mvanthoor
lithander wrote: Mon Feb 22, 2021 5:39 pm Marcel, sorry for hijacking your thread. I was just typing out the question 'Any ideas why there's an order of magnitude difference in speed?' when I realized that this is really offtopic, right? This should be about Rustic! Sorry, I'm going to shut up now. :(
It's alright; you could (unintentionally) post something that's useful to me, even if it's not about Rustic or Rust itself.

With regard to perft speed comparison, you have to be careful.

- If a move generator uses bulk-counting, that will speed up perft extremely.
- Speed will be even higher if there's even a 1 MB hash table available.
- A move generator can use "tricks" to be fully legal, such as not generating moves for pinned pieces, etc... this will speed up the move generator, and thus perft. (It also has a small impact on the speed of the search, obviously, but not as much as it has on perft.)

At this point, Rustic does not do bulk counting, the has can be disabled, and it doesn't do any special tricks in the move generator.

36M leaves/sec seem to be right for Rustic's popcnt / bmi2 versions, for the starting position, at least on a CPU such as a i7-6700K.

20 Mlps feels slow, even for the generic 64-bit code, but that can be because of the CPU. I even don't know for sure (can't find any definitive information) if "target-cpu" in Rust actually enables ALL of the available instruction sets, or only the default ones, and then generates code for the actual micro-architecture.

For Alpha 2, I'm going to look into Stockfish's makefile, to see how it compiles for Clang. I can then do the same for Rust, because both Clang and Rust use LLVM as a backend (and thus have the same options).

With regard to other engines, the problem with speed can be manifold. Using bitboards is not a silver bullet.

- It could be that they use massive data structures such as: Move {int ..., int ..., int ...., int ....} instead of single int for the entire move (working by bits), and then copy those structures around.
- It could be that they allocate memory where no allocation is necessary. For example, use a vector of size 1 for the move list, and then it needs to be reallocated for each move that is added (better start with something like size 256, or better yet, use an array and keep the move list on the stack)
- It could be that they are using objects for everything, such as a King object, Queen object, even with each Square being an object, all packed within a Board object, everything with its own interface. That can be very neat if you tend to have 5 board implementations and several different piece implementations, but as a piece can also just be a number in an integer array, the object-oriented approach is probably slower.
- it could be that code is just written inefficiently. Instead of assigning a sort order score to moves and then just picking the fastest one, it could be that they actually sort the list with something very slow like bubble sort (and then move entire Move { } objects around instead of pointers, or integers)
- ... and so on.

One would need to profile an engine to see where the problems are, and then start refactoring at the point of the worst one. (Which could actually mean that starting from scratch can be faster.)

===

I just (like five minutes ago) nuked my entire hash table. It worked, but I ended up with a generic object (the HashData struct) in a dynamic object (the Hash bucket), in a dynamic object (the Hash table), which were all stuck together using traits, some of which were generic themselves.

Even though it worked, and speed didn't seem to suffer (at least not measurably), the code was... somewhat... much... more convoluted than intended.

I'm going to redesign the hash table from scratch, because I probably made some massive mistakes a year ago by not doing things by the Rust philosophy. I wouldn't be surprised if this caused those +/- 100 lines of massively complicated and unreadable code. Ah, well. I'll just start over. It's not a long term project for nothing. Writing this chess engine is as much about learning about Rust than it is about writing a chess engine. (Some people wrote 3-4 engines before they actually started on their main one, so I think I can rewrite a hash table if need be :lol: )

I want the hash table to be generic, so I can use it for perft, search, pawns, and later, for evaluation stuff. However, I don't want to use a HashMap (which would probably be much easier), because I want to be able to control the details such as replacement schemes and such myself, without being dependent on HashMap's implementation.

See you later, when I've learned to design generic hash table in Rust that looks more like this:

Image

and not like this:

Image

Re: Progress on Rustic

Posted: Mon Feb 22, 2021 6:46 pm
by mvanthoor
lithander wrote: Mon Feb 22, 2021 4:24 pm I didn't find a Dev-enabled Windows build of Weiss and don't know how to compile 'makefiles' under Windows and so I compiled it via Cygwin instead but it only reached 20 NPS. I don't think it was a debug build but maybe 32bit?
Cygwin is slower than compiles done in MSYS2.

20 Mlps is about half speed for Weiss; same happens to Rustic if compiled as a 32-bit engine. I assume something is wrong with the Weiss compile, because it should be about as fast as Rustic (probably a bit faster now, because Rustic does some more incremental updates in the move functions that Weiss omits; at least, that was the case about 9 months ago).

Re: Progress on Rustic

Posted: Mon Feb 22, 2021 9:22 pm
by Ras
lithander wrote: Mon Feb 22, 2021 4:24 pm I compiled it via Cygwin instead but it only reached 20 NPS. I don't think it was a debug build but maybe 32bit?
Open a Cygwin terminal window, go to the path of the generated executable, and use file myexename. The file command will tell you whether it's 32 or 64 bit.

Re: Progress on Rustic

Posted: Mon Feb 22, 2021 10:33 pm
by lithander
Ras wrote: Mon Feb 22, 2021 9:22 pm Open a Cygwin terminal window, go to the path of the generated executable, and use file myexename. The file command will tell you whether it's 32 or 64 bit.

Code: Select all

tjahn@DESKTOP-NTIPCRE ~/weiss-master/src
$ file weiss-popcnt.exe
weiss-popcnt.exe: PE32+ executable (console) x86-64, for MS Windows
PE32+ is 64bit, right? But then I have no explanation why it's only half as fast as Marcel expected it to be.

Re: Progress on Rustic

Posted: Mon Feb 22, 2021 11:24 pm
by mvanthoor
lithander wrote: Mon Feb 22, 2021 10:33 pm PE32+ is 64bit, right? But then I have no explanation why it's only half as fast as Marcel expected it to be.
I don't know either; I haven't used Cygwin for a LONG time. Maybe it's slowed down by Cygwin1.dll. Every Cygwin compile depends on tit. (But that would be one massive slowdown.) I'll compile Weiss in MSYS2 somewhere during the coming week and post the results.

====-

I've rewritten the hash table from scratch. I can now initialize different hash tables by using different structs:

"let hash_table = HashTable::<PerftData>::new(64) " will create a 64 MB hash table for perft.
"let hash_table = HashTable::<SearchData>::new(64) " will create a 64 MB hash table for Search.

Now I just need to create an interface that works for both (and all others; probably I'll just return a copy of the Entry struct), and that would be it.

One thing I noticed: when the bucket is an array of 4 elements, a 64 MB hash table is actually 64 MB; Rustic uses 64 MB + a bit of memory for itself. If I make the bucket a Vector with the same number of elements, the overhead becomes huge: a 64 MB hash table takes 106 MB of space. I wonder why.

@Ras: See? :D Code in Rust can be written in less than 3 weeks if you know how to do it. I just needed more practice (and a less crappy starting point than my old hash table). The current implementation is actually quite logical and simple... now that I know how to actually do this correctly, instead of trying to plug holes that the compiler points out :oops:

Re: Progress on Rustic

Posted: Mon Feb 22, 2021 11:45 pm
by Ras
lithander wrote: Mon Feb 22, 2021 10:33 pmPE32+ is 64bit, right?
Yes.
But then I have no explanation why it's only half as fast as Marcel expected it to be.
I checked out Weiss 1.3 on my Linux machine with AMD 3400G (older Zen+ architecture), and you have to edit the makefile a bit. The release versions with added "-DDEV" (for enabling perft) are at around 15 MNPS, except the PEXT one which is half of that, but that's expected due to my CPU with inefficient BMI2. The dev version itself is about 16 MNPS, i.e. the debug build isn't slow. The PGO (profile guided optimisation) build with "-DDEV" added is at 18 MNPS.

For reference, my engine does use check evasions also in perft, but otherwise doesn't have bulk counting either, and no bitboards, no hash table use in perft, the move generator is not the most efficient one, and I don't use PGO, but I get around 20 MNPS in perft 5 from the position Weiss 1.3 is using for perft. Since a 6700K should be about 6% faster than my 3400G in singlethread, you could download my engine to cross-check under Windows - perft is enabled also in the release EXE. Just remember to set the same position as Weiss.

Re: Progress on Rustic

Posted: Tue Feb 23, 2021 12:14 am
by Ras
mvanthoor wrote: Mon Feb 22, 2021 11:24 pm@Ras: See? :D Code in Rust can be written in less than 3 weeks if you know how to do it.
Sure, after enough puzzling and several rewrites, you may arrive at some point where the compiler doesn't complain. If you were ever to do that multi-threaded, you'd probably go for unsafe anyway because mutexing the access isn't an option.