Pondering and memory bandwidth

MattieShoes · Post by **MattieShoes** » Mon Apr 13, 2009 6:15 am

So I have a Core2 quad 2.5 GHz box sitting here. When I'm making an engine-engine match, I've left pondering off. It occurs to me that I could turn it on and the non-SMP engines wouldn't be stealing cycles from each other -- they'd be running on different cores. But I have no real idea how hard the engine hits the memory. Obviously it'll vary from engine to engine, but I guess what I'm wondering is: does a "generic single threaded engine" use enough memory bandwidth that pondering on one engine would significantly impact the search for the other? FSB should be 1.333 GHz in this case.

It also occurred to me that I could run four simultaneous matches with pondering off, but that'd put even more load on the bus... I'd probably have to add some cooling too -- I imagine 4x100% on the CPU for extended periods will let me fry an egg on it.

bob · Post by **bob** » Mon Apr 13, 2009 8:57 am

MattieShoes wrote: So I have a Core2 quad 2.5 GHz box sitting here. When I'm making an engine-engine match, I've left pondering off. It occurs to me that I could turn it on and the non-SMP engines wouldn't be stealing cycles from each other -- they'd be running on different cores. But I have no real idea how hard the engine hits the memory. Obviously it'll vary from engine to engine, but I guess what I'm wondering is: does a "generic single threaded engine" use enough memory bandwidth that pondering on one engine would significantly impact the search for the other? FSB should be 1.333 GHz in this case.

It also occurred to me that I could run four simultaneous matches with pondering off, but that'd put even more load on the bus... I'd probably have to add some cooling too -- I imagine 4x100% on the CPU for extended periods will let me fry an egg on it.

I test like that all the time and don't see any difference. Easy test is to run one program, on a particular test position, and see how long it takes. Then run the same program two times to use both cores and see if there is any difference. In the case of Crafty, it is minimal...

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Mon Apr 13, 2009 11:49 am

It's safe to assume that for the majority of engines, you will get one 64-byte cacheline transfer per node searched, and almost no (unpredictable/uncached) access to main memory besides that.

(Engines which don't probe the hash table in quiescence search will be at about 30%-40% of that.)

This is far below the memory speed of most contemporary systems.
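To put rough numbers on that, here is a back-of-the-envelope sketch. The node rate of 1 million nodes per second is purely an assumed figure for illustration, and the peak is the theoretical figure for a 64-bit bus at the 1333 MT/s mentioned earlier in the thread, not a measured value. Random accesses are limited by latency rather than peak throughput, so treat the result as an order-of-magnitude check only.

```python
# Back-of-the-envelope estimate. Every number here is an assumption
# chosen for illustration, not a measurement of any particular engine.
CACHELINE_BYTES = 64                       # one transfer per node, as stated above
NODES_PER_SECOND = 1_000_000               # assumed single-threaded engine speed
FSB_TRANSFERS_PER_SECOND = 1_333_000_000   # 1333 MT/s front-side bus
FSB_WIDTH_BYTES = 8                        # 64-bit data bus


def engine_traffic_bytes_per_second(nps, probe_fraction=1.0):
    """Memory traffic if a fraction of nodes each fetch one cacheline."""
    return nps * CACHELINE_BYTES * probe_fraction


peak = FSB_TRANSFERS_PER_SECOND * FSB_WIDTH_BYTES
traffic = engine_traffic_bytes_per_second(NODES_PER_SECOND)
share = traffic / peak

print(f"{traffic / 1e6:.0f} MB/s out of {peak / 1e9:.1f} GB/s peak ({share:.2%})")
```

Under these assumptions a single engine uses well under 1% of the theoretical bus bandwidth, so even four of them running at once would leave plenty of headroom.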

hgm · Post by **hgm** » Mon Apr 13, 2009 12:40 pm

My engines were only about 1% slower if a chess game was in progress on the other core. In theory it is possible on C2D to totally wreck the performance of the other core by flushing the shared L2 cache as fast as you can during ponder time. This potentially gives you a much larger advantage than pondering. I don't know of any engines that implement such a 'bugger mode'. Perhaps I will try it one time in micro-Max, which does not ponder anyway.

bob · Post by **bob** » Mon Apr 13, 2009 6:58 pm

Gian-Carlo Pascutto wrote:It's safe to assume that for the majority of engines, you will get one 64-byte cacheline transfer per node searched, and almost no (unpredictable/uncached) access to main memory besides that.

(Engines which don't probe the hash table in quiescence search will be at about 30%-40% of that.)

This is far below the memory speed of most contemporary systems.

There are some other issues, such as the magic move generation tables, that will stress almost any cache size prior to the Nehalem boxes with 8 MB L3 and beyond...

CRoberson · Post by **CRoberson** » Mon Apr 13, 2009 7:45 pm

You can test this quite simply.

1) Run engine A on some benchmark or position for 30 to 60 seconds.
Note the nps, depth and total nodes. (this should be a single threaded
program).

2) exit out of engine A.
3) Run the test again but with 2 copies of engine A simultaneously.
Again, note the same info.

4) exit out of both engines.
5) Run the test again but with 3 copies of engine A simultaneously.
Again, note the same info.

6) exit out of all three engines.
7) Do the test again with 4 copies of engine A and note the same info.

Try an engine that is fairly good. Make sure the combined memory use
of the 4 engines fits in RAM, otherwise the machine will start swapping.

If the notes you took are the same for 1 run vs 4 runs, there are
not any problems. Years ago, I noticed some machines having issues
at 4 engines, but not at 3 engines. In such situations, 2 engines
are not a problem.
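The procedure above could be automated along these lines. This is only a sketch: it assumes the engine can be started with a command that does a fixed amount of work and then exits, so wall-clock time stands in for nps. The command shown is a placeholder Python workload, not a real engine invocation, and the helper names are made up for this example.

```python
import subprocess
import sys
import time


def run_copies(command, copies):
    """Start `copies` identical processes at once; return seconds until all exit."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for _ in range(copies)
    ]
    exit_codes = [p.wait() for p in procs]
    elapsed = time.perf_counter() - start
    if any(exit_codes):
        raise RuntimeError(f"a copy failed, exit codes: {exit_codes}")
    return elapsed


def scaling_table(command, max_copies=4):
    """Wall-clock time for 1, 2, ..., max_copies simultaneous copies."""
    return {n: run_copies(command, n) for n in range(1, max_copies + 1)}


def slowdown(table):
    """Each timing relative to the single-copy run (1.0 means no slowdown)."""
    base = table[1]
    return {n: t / base for n, t in table.items()}


if __name__ == "__main__":
    # Placeholder workload; substitute your engine's own fixed-work benchmark.
    command = [sys.executable, "-c", "sum(range(2_000_000))"]
    for n, ratio in slowdown(scaling_table(command)).items():
        print(f"{n} copies: {ratio:.2f}x the single-copy time")
```

If the ratios stay close to 1.0 up to the number of physical cores, the copies are not getting in each other's way. Repeating each run a few times helps separate real contention from timing noise.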

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Mon Apr 13, 2009 9:04 pm

bob wrote: There are some other issues, such as the magic move generation tables, that will stress almost any cache size prior to the Nehalem boxes with 8 MB L3 and beyond...

In real games only a small part of those tables is active at any given time. Rank/File occupation is not random.

bob · Post by **bob** » Tue Apr 14, 2009 12:00 am

Gian-Carlo Pascutto wrote:
bob wrote: There are some other issues, such as the magic move generation tables, that will stress almost any cache size prior to the Nehalem boxes with 8 MB L3 and beyond...
In real games only a small part of those tables is active at any given time. Rank/File occupation is not random.

It's not random, but in a 20 ply search, it comes pretty close. It would be pretty easy to set up a bitmap for those tables, and during the indexing, set a bit in the bitmap to show which value is used. I'd bet the bitmap ends up almost all 1's after a 60 second move...
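A minimal sketch of that bitmap idea, in Python for brevity (an engine would do the same thing in its own language inside the attack-table lookup). The class and function names are hypothetical, and the table here is a small stand-in list rather than a real magic table.

```python
class UsageBitmap:
    """One bit per table entry, set the first time that entry is read."""

    def __init__(self, table_size):
        self.size = table_size
        self.bits = bytearray((table_size + 7) // 8)

    def mark(self, index):
        self.bits[index >> 3] |= 1 << (index & 7)

    def used(self):
        """Number of distinct entries read so far."""
        return sum(bin(byte).count("1") for byte in self.bits)

    def coverage(self):
        """Fraction of the table that has been read at least once."""
        return self.used() / self.size


def lookup(table, usage, index):
    """Table read that also records the index in the usage bitmap."""
    value = table[index]
    usage.mark(index)
    return value


# Stand-in for an attack table; a real one would be indexed by the magic hash.
table = list(range(16))
usage = UsageBitmap(len(table))
for index in (3, 3, 10):
    lookup(table, usage, index)
print(usage.used(), usage.coverage())
```

Dumping `coverage()` at the end of a 60 second search would show how much of the table was touched at all, though not how often each entry was touched.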

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Tue Apr 14, 2009 7:34 am

bob wrote:
Gian-Carlo Pascutto wrote: In real games only a small part of those tables is active at any given time. Rank/File occupation is not random.
It's not random, but in a 20 ply search, it comes pretty close. It would be pretty easy to set up a bitmap for those tables, and during the indexing, set a bit in the bitmap to show which value is used. I'd bet the bitmap ends up almost all 1's after a 60 second move...

After 60 seconds, perhaps, but that doesn't say a lot about the effectiveness of the cache. 99% of the accesses could have been local and 1% random all over. You will not stress memory then. You can profile this with valgrind, I think. I am sure that you will find that even 1M of cache will give very high hit rates.
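The difference between "every entry gets touched eventually" and "the cache misses often" can be illustrated with a toy simulation. This is a sketch under loud assumptions: a fully associative LRU cache (real caches are set-associative), a made-up table of 1000 cache lines, and the 99%/1% access mix taken from the paragraph above. None of the sizes correspond to a real engine.

```python
import random
from collections import OrderedDict


def simulate(table_lines=1000, cache_lines=128, hot_lines=64,
             accesses=400_000, local_fraction=0.99, seed=1):
    """Return (hit_rate, coverage) for a toy LRU cache in front of a table."""
    rng = random.Random(seed)
    cache = OrderedDict()   # line number -> True, ordered oldest to newest
    touched = set()         # every line read at least once (the "bitmap")
    hits = 0
    for _ in range(accesses):
        if rng.random() < local_fraction:
            line = rng.randrange(hot_lines)      # small, frequently reused set
        else:
            line = rng.randrange(table_lines)    # random access anywhere
        touched.add(line)
        if line in cache:
            hits += 1
            cache.move_to_end(line)              # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)        # evict least recently used
    return hits / accesses, len(touched) / table_lines


hit_rate, coverage = simulate()
print(f"hit rate {hit_rate:.1%}, table coverage {coverage:.1%}")
```

With this access mix, nearly the whole table ends up touched while the hit rate stays very high, which is why a full usage bitmap by itself says little about memory pressure.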

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Tue Apr 14, 2009 10:00 am

Gian-Carlo Pascutto wrote: I am sure that you will find that even 1M of cache will give very high hit rates.

Put differently: do you really believe that your move generation is hitting main memory each node, or even regularly? That would kill performance pretty badly, and it would show up clearly in profiling.
