Crafty 23.6 released

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Consideration of a C++ Crafty

Post by bob »

Don wrote:
bob wrote: My approach is similar: an array of thread pointers to access local data, with the per-thread data allocated in the usual way and a little care exercised due to NUMA considerations. The first CPU to touch a thread block gets that block in its local memory, so this "first touch" needs to be spread over all threads and processors to distribute the data uniformly; each thread then wants to use the local data on its own core, not on a remote one, etc...
I remember that being an issue with Cilkchess on a big SGI machine we used with 256 processors.

But I did not realize that it was an issue here. So you can control which processor gets the data? How do I do that? I basically keep everything for a single thread in this structure and I have an array of them. I allocate memory for the evaluation cache and pawn structure caches; other read/write tables (such as the history table) are declared statically because their sizes are small and fixed.

So how can I control which processes get assigned to a given thread?
Easy.

1. Malloc the data you need. You need to make things align on a page boundary. I.e. if a split block is 1024 bytes (and a page is 4 KB), each thread should get a multiple of 4 split blocks allocated. You malloc at one point, something like this:

split_blocks = malloc(4 * sizeof(split_block) * 64) or something similar.

2. Now, each individual thread should initialize the split blocks it prefers to use without touching any others. When a page initially faults in, it faults in to the local memory of the processor that caused the fault.

That's it...
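
A minimal sketch of step 1, assuming a POSIX system. Plain malloc() does not promise page alignment, so this sketch swaps in posix_memalign() and pads each thread's share to whole pages. The names split_block_t, per_thread_bytes, alloc_split_blocks and thread_blocks are made up for illustration and are not Crafty's.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for a split block; 1024 bytes as in the example. */
typedef struct {
  char data[1024];
} split_block_t;

/* Bytes reserved per thread: its blocks, padded up to whole pages. */
static size_t per_thread_bytes(size_t blocks_per_thread) {
  size_t page = (size_t) sysconf(_SC_PAGESIZE);
  size_t raw = blocks_per_thread * sizeof(split_block_t);
  return (raw + page - 1) / page * page;
}

/* One page-aligned region for all threads. Nothing is written here,
   so that each thread can be the first to touch its own pages later. */
static void *alloc_split_blocks(size_t nthreads, size_t blocks_per_thread) {
  size_t page = (size_t) sysconf(_SC_PAGESIZE);
  void *base = NULL;
  if (posix_memalign(&base, page,
                     nthreads * per_thread_bytes(blocks_per_thread)) != 0)
    return NULL;
  return base;
}

/* Start of the page-aligned slice owned by thread tid. */
static split_block_t *thread_blocks(void *base, size_t tid,
                                    size_t blocks_per_thread) {
  assert(base != NULL);
  return (split_block_t *) ((char *) base +
                            tid * per_thread_bytes(blocks_per_thread));
}
```

Each thread would then initialize only thread_blocks(base, its_id, n) and nothing else, which is step 2. Whether the pages are really still untouched after allocation depends on the allocator, so treat this as a starting point to measure rather than a guarantee.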
bob

Re: Consideration of a C++ Crafty

Post by bob »

Norbert Raimund Leisner wrote: Which software was used for compiling Crafty 23.6?

http://sourceforge.net/projects/orwelldevcpp , http://gcc.gnu.org or another?

Norbert
I use either gcc 4.7.x or Intel's latest... either works fine, although Intel's is better if you use Intel CPUs and not AMD.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Consideration of a C++ Crafty

Post by Don »

Thanks Bob, I am eager to try this out.

Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
bob

Re: Consideration of a C++ Crafty

Post by bob »

One other note I forgot about. You should lock each thread onto a particular processor (core) before doing this. Then let each thread use memset() or whatever to zero its data, which faults it in onto the right local memory pages. This assumes NUMA hardware running in NUMA mode (each CPU has N pages of memory with consecutive physical addresses), as opposed to interleaving the address space across all processors as some NUMA boxes can do.
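
Roughly, the pin-then-touch idea could look like the sketch below. It assumes Linux with glibc (pthread_setaffinity_np is a GNU extension) and that cores are numbered 0..N-1; worker_t, first_touch and alloc_and_touch are hypothetical names, not anything from Crafty.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for cpu_set_t and pthread_setaffinity_np */
#endif
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_WORKERS 64

typedef struct {
  int cpu;       /* core this thread is locked to */
  char *region;  /* page-aligned slice this thread will own */
  size_t bytes;
  int pinned;    /* 1 if the affinity call succeeded */
} worker_t;

static void *first_touch(void *arg) {
  worker_t *w = arg;
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(w->cpu, &set);
  /* Lock to one core first, so the page faults below happen on that core. */
  w->pinned = pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
  memset(w->region, 0, w->bytes); /* first touch of this thread's pages */
  return NULL;
}

/* Allocates nthreads slices and lets one pinned thread zero each slice.
   Returns the base pointer or NULL; *pinned_count reports how many
   threads could actually be pinned. */
static char *alloc_and_touch(int nthreads, size_t pages_per_thread,
                             int *pinned_count) {
  size_t page = (size_t) sysconf(_SC_PAGESIZE);
  size_t bytes = pages_per_thread * page;
  long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
  pthread_t tid[MAX_WORKERS];
  worker_t w[MAX_WORKERS];
  void *base = NULL;
  int started = 0;

  assert(nthreads > 0 && nthreads <= MAX_WORKERS);
  if (ncpus < 1) ncpus = 1;
  if (posix_memalign(&base, page, (size_t) nthreads * bytes) != 0)
    return NULL;
  for (int i = 0; i < nthreads; i++) {
    w[i].cpu = (int) (i % ncpus);
    w[i].region = (char *) base + (size_t) i * bytes;
    w[i].bytes = bytes;
    w[i].pinned = 0;
    if (pthread_create(&tid[i], NULL, first_touch, &w[i]) != 0) break;
    started++;
  }
  *pinned_count = 0;
  for (int i = 0; i < started; i++) {
    pthread_join(tid[i], NULL);
    *pinned_count += w[i].pinned;
  }
  return (char *) base;
}
```

In a real engine the search threads themselves would do the touching and stay pinned afterwards; the short-lived threads here only keep the sketch small. Pinning can fail (for example inside a restricted cpuset), which is why the count is reported instead of assumed.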
Werner
Posts: 2876
Joined: Wed Mar 08, 2006 10:09 pm
Location: Germany
Full name: Werner Schüle

Re: Consideration of a C++ Crafty

Post by Werner »

Werner wrote: I have started a match on my i7 920 with crafty-236-64-speed-ja.exe for CEGT 40/20.
After 50 games:

1   Crafty 23.6 x64 1CPU  2706  +21/-17/=12 54.00%   27.0/50
2   Texel 1.02 x64        2679  +17/-21/=12 46.00%   23.0/50
(Crafty 23.5 x64 1CPU 2630 )
Werner
...and here after 100 games:

1   Texel 1.02 x64        2679  +40/-34/=26 53.00%   53.0/100
2   Crafty 23.6 x64 1CPU  2659  +34/-40/=26 47.00%   47.0/100
Werner
Jimbo I
Posts: 149
Joined: Thu Feb 15, 2007 4:34 am
Location: USA

Re: Crafty 23.6 released

Post by Jimbo I »

I'm having a problem with the 23.6 Ablett full-feature compile. (I used the 64-bit version; I didn't try the 32-bit version.) I set up an Arena tournament with skill levels 1 through 5, along with a few non-Crafty engines. All of the Crafty 23.6 skill levels lose just about all games by time forfeit. Crafty just doesn't move fast enough and usually doesn't make the 40/4 time control. I do have the "noise 100000" command in the crafty.rc file, which seems to fix time forfeit issues when using full-strength Crafties. (It doesn't seem to make a difference with the low skill values, so I guess it's OK to leave it in the configuration file.)

I tried the same tournament with the same crafty.rc file using Crafty 23.5, and I had no problems. Crafty 23.5 would use most of the clock early and would have to make a mad scramble to make time control, but it always did make it.

I know that Jim won't be making any more compiles, but I thought this was worth mentioning just in case the problem is a source code issue.
bob

Re: Crafty 23.6 released

Post by bob »

It is probably related to how I "slow down" the search. The "spin loop" does not pay attention to the time, which might be a problem at very fast games...

On your hardware, can you tell me the normal NPS, and the NPS at those very low skill settings? I might be able to tune that a bit...
Jimbo I

Re: Crafty 23.6 released

Post by Jimbo I »

OK, I checked. It's a dual core laptop, but I only run the engine on one thread.

The normal Crafty runs at around 2,600 to 3,000 KNPS, and the low skill level Crafties run at around 3 KNPS normally, but sometimes get up to 6 or 8 KNPS.

I hope that's enough information to go on. Thanks Bob!
bob

Re: Crafty 23.6 released

Post by bob »

That 6-8K might be too slow. I see about 5M nps (one thread) on my dual-core 2.0 GHz i7 MacBook. I'll take a look at my slowdown and see if there is another solution. Tried nanosleep() but it did not work very well and slowed it down too much... will keep looking...
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Crafty 23.6 released

Post by Sven »

Just a quick thought: maybe the number of nodes until checking the clock again needs to be adapted to NPS? E.g. if you always check after 64k nodes, then with 5 MNPS you check the clock about every 13 ms, but with 8 KNPS you do that only about every 8 seconds ...
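
A small sketch of that idea, with the arithmetic spelled out. It assumes "64k" means 65536 nodes; the function names, the 10 ms target and the clamp limits are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Nodes to search between clock checks so that a check happens roughly
   every interval_ms milliseconds at the measured speed, clamped to
   [min_nodes, max_nodes]. */
static uint64_t nodes_between_checks(uint64_t nps, uint64_t interval_ms,
                                     uint64_t min_nodes, uint64_t max_nodes) {
  uint64_t n;
  assert(min_nodes > 0 && min_nodes <= max_nodes);
  n = nps * interval_ms / 1000;
  if (n < min_nodes) n = min_nodes;
  if (n > max_nodes) n = max_nodes;
  return n;
}

/* Time between clock checks, in ms, for a fixed node interval. */
static double check_interval_ms(uint64_t nodes, uint64_t nps) {
  return 1000.0 * (double) nodes / (double) nps;
}
```

With a fixed 65536 nodes, check_interval_ms gives about 13.1 ms at 5,000,000 NPS and 8192 ms at 8,000 NPS. With a 10 ms target, nodes_between_checks(8000, 10, 16, 65536) comes out at 80 nodes, so the slowed-down engine would look at the clock about as often as the full-speed one.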