Pondering and memory bandwidth

Discussion of chess software programming and technical issues.

Moderator: Ras

MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: Pondering and memory bandwidth

Post by MattieShoes »

Thanks for the help. :-) Strange, it never even occurred to me that I could test this in a minute or two without bothering y'all :-P

I started four engines so all memory was allocated but they were idle. Then I threw the first one into analyze mode, waited 30 seconds, threw the second into analyze mode, waited 30 seconds, etc.

The first three are essentially identical -- the second one actually ran marginally faster than the first. Memory alignment, OS overhead, I don't know. The fourth one is about 2-3% slower.

After the first four, I figured I should test a 5th just to verify that it would in fact suck to run 5 engines simultaneously on a 4 core box. It came out about 87% slower than the average of the first four.

And because my life isn't complete if I don't graph everything I can get my hands on:
[graph of the results]
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Pondering and memory bandwidth

Post by wgarvin »

Gian-Carlo Pascutto wrote:
bob wrote: There are some other issues, such as the magic move generation tables, that will stress most any cache size prior to the nehalem boxes with 8mb L3 and beyond...
In real games only a small part of those tables is active at any given time. Rank/File occupation is not random.
Are you sure about that? The magic tables are used to answer questions about all of the positions encountered in the search tree, not just the over-the-board position. I would kind of expect a large fraction of the cachelines making up the table to be touched during any given search?
Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: Pondering and memory bandwidth

Post by Gian-Carlo Pascutto »

wgarvin wrote:
Gian-Carlo Pascutto wrote:
bob wrote: There are some other issues, such as the magic move generation tables, that will stress most any cache size prior to the nehalem boxes with 8mb L3 and beyond...
In real games only a small part of those tables is active at any given time. Rank/File occupation is not random.
Are you sure about that? The magic tables are used to answer questions about all of the positions encountered in the search tree, not just the over-the-board position. I would kind of expect a large fraction of the cachelines making up the table to be touched during any given search?
You're repeating what Bob already said and what I rebutted above.

Even if you touch the entire table, as long as temporal locality is good enough for the cache to be effective, you're not hitting main memory a lot.

If you'd be hitting main memory a lot, you'd be slow, and magic movegen wouldn't be interesting.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Pondering and memory bandwidth

Post by bob »

Gian-Carlo Pascutto wrote:I am sure that you will find that even 1M of cache will give very high hitrates.
Put differently: do you really believe that your move generation is hitting main memory each node, or even regularly? That would kill performance pretty badly, and it would show up clearly in profiling.
I certainly believe it hits it every node. Not for every piece, but with 2 bishops, 2 rooks and a queen, that turns into 6 magic move generations, and I'd bet at least one of those goes to memory unless you are talking about the Nehalems with 8mb and beyond of L3. Every node goes to the hash table, which means something is going to get displaced. Ditto for pawn hash. I'd agree that the hitrates are quite high, else the processors would be very slow. But at least 1/2 the hits are instructions, if not more.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Pondering and memory bandwidth

Post by bob »

Gian-Carlo Pascutto wrote:
wgarvin wrote:
Gian-Carlo Pascutto wrote:
bob wrote: There are some other issues, such as the magic move generation tables, that will stress most any cache size prior to the nehalem boxes with 8mb L3 and beyond...
In real games only a small part of those tables is active at any given time. Rank/File occupation is not random.
Are you sure about that? The magic tables are used to answer questions about all of the positions encountered in the search tree, not just the over-the-board position. I would kind of expect a large fraction of the cachelines making up the table to be touched during any given search?
You're repeating what Bob already said and what I rebutted above.

Even if you touch the entire table, as long as temporal locality is good enough for the cache to be effective, you're not hitting main memory a lot.

If you'd be hitting main memory a lot, you'd be slow, and magic movegen wouldn't be interesting.
I'm not following your "rebuttal". I take the occupied squares, mask off just the diagonal occupied bits, multiply by a magic number, and shift right by some number of bits. That value is then used as an index into another big table that contains the resulting moves. It seems likely that a cache miss will happen somewhere in all of that when you repeat it 6 times for every node searched. Not to mention the times the above is used for things besides move generation (mobility calculations, for one thing). So that is probably off by a factor of 2.
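To make the mask/multiply/shift/index pipeline concrete, here is a toy, single-square sketch in Python. It handles only a bishop sitting on a1 and only its long diagonal; real engines build such tables for all 64 squares and both slider types, which is why the full tables run to several hundred kilobytes and compete for cache. The constant used below happens to work as an exact magic for this diagonal (it scatters the six relevant bits into the top byte of the product without carry interference); it is illustrative, not taken from any particular engine.

```python
MASK64 = (1 << 64) - 1

# Interior of the a1-h8 diagonal (b2..g7). Edge squares are excluded
# because a blocker there cannot shorten the ray any further.
DIAG_MASK = sum(1 << s for s in (9, 18, 27, 36, 45, 54))
MAGIC = 0x0202020202020202      # one set bit per rank, on the b-file
SHIFT = 64 - 6                  # keep a 6-bit index

def a1_diag_attacks(occ):
    """Reference attacks for a bishop on a1, computed by ray walk."""
    attacks = 0
    for sq in (9, 18, 27, 36, 45, 54, 63):
        attacks |= 1 << sq
        if occ & (1 << sq):      # first blocker ends the ray
            break
    return attacks

def subsets(mask):
    """Enumerate all subsets of a bitmask (the Carry-Rippler trick)."""
    sub = 0
    while True:
        yield sub
        sub = (sub - mask) & mask
        if sub == 0:
            return

# Build the attack table, verifying the magic produces no destructive
# collisions (two occupancies sharing a slot must share an attack set).
TABLE = [None] * 64
for occ in subsets(DIAG_MASK):
    idx = ((occ * MAGIC) & MASK64) >> SHIFT
    ref = a1_diag_attacks(occ)
    assert TABLE[idx] in (None, ref), "destructive collision"
    TABLE[idx] = ref

def magic_lookup(occ):
    """The three steps Bob describes: mask occupancy, multiply, shift, index."""
    return TABLE[(((occ & DIAG_MASK) * MAGIC) & MASK64) >> SHIFT]
```

With a blocker on d4, for instance, `magic_lookup` returns just b2, c3 and d4; irrelevant occupied squares are stripped by the mask before the multiply, which is exactly why, as Gian-Carlo notes, only the table entries matching positions actually reached in the search get touched.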
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Pondering and memory bandwidth

Post by bob »

MattieShoes wrote:Thanks for the help. :-) Strange, it never even occurred to me that I could test this in a minute or two without bothering y'all :-P

I started four engines so all memory was allocated but they were idle. then i threw the first one into analyze mode, waited 30 seconds, threw the second into analyze mode, waited 30 seconds, etc.

First three are essentially identical -- the second one actually ran marginally faster than the first. memory alignment, OS overhead, I don't know. Fourth one is about 2-3% slower.

After the first four, I figured I should test a 5th just to verify that it would in fact suck to run 5 engines simultaneously on a 4 core box. It came out about 87% slower than the average of the first four.

And because my life isn't complete if I don't graph everything I can get my hands on:
Sounds about right. Note that this isn't a perfect test, because depending on the processor, you have some local cache and some shared cache. On the Nehalem I was testing on, each core had 32K of L1 instruction cache plus 32K of L1 data cache and 256K of L2, with an 8MB L3 shared between the 4 cores. Running the same program 4 times lets the instructions and unmodified data be shared among all cores, with just one copy in the shared cache, whereas running four different programs would change this a bit. So you should probably re-run the test with 4 different programs to see how much that hurts. Hopefully not much.
Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: Pondering and memory bandwidth

Post by Gian-Carlo Pascutto »

Bob: Can you do a test with cachegrind?

You can let it simulate various cache configurations. L1, L2, associativity...
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: Pondering and memory bandwidth

Post by MattieShoes »

That makes sense, I'll have to try some more tests later. For now, I don't really care if there's a performance hit as long as it's a consistent one. I've only been running two matches at a time anyway, without pondering.

It's the lower-end Yorkfield with 6 meg of L2 cache. I know next to nothing about low-level processor stuff, like what gets shoved into cache and what doesn't, branch prediction... It's on my to-learn list, along with statistical analysis, LMR, matrix math, how EGTBs work, how the sensors on gas pumps work, how to make honey walnut shrimp, Python, how to integrate results from a PN and A/B search effectively, etc...

So another simple question I can't find an answer to... Bayeselo has the rating command, which lists + and -. I assume this is a confidence interval based on the Bayesian analysis (also on the to-learn list), but I don't know exactly what the confidence interval is (is it 95% here too? one-tailed or two-tailed?), and I can't find it documented anywhere. No doubt this is because the people who actually care already know... :-)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Pondering and memory bandwidth

Post by bob »

Gian-Carlo Pascutto wrote:Bob: Can you do a test with cachegrind?

You can let it simulate various cache configurations. L1, L2, associativity...
I'll have to find it and download it again; I have just upgraded all my linux boxes and the old versions don't run due to new glibc and .so lib versions...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Pondering and memory bandwidth

Post by bob »

MattieShoes wrote:That makes sense, I'll have to try some more tests later. For now, I don't really care if there's a performance hit as long as it's a consistent one. I've only been running two matches at a time anyway, without pondering.

It's the lower end yorkfield with 6 meg L2 cache. I know next to nothing about low level processor stuff, like what gets shoved into cache and what doesn't, branch prediction... It's on my to-learn list, along with statistical analysis, LMR, matrix math, how EGTB's work, how the sensors on gas pumps work, how to make honey walnut shrimp, python, how to integrate results from a PN and A/B search effectively, etc. . .

So another simple question I can't find an answer to... Bayeselo has the rating command, then it lists + and -. I assume this is a confidence interval based on the bayesian analysis (also on the to-learn list) but I don't know exactly what the confidence interval is (is it 95% here too? one-tailed, two tailed?), and I can't find it documented anywhere. No doubt this is because the people that actually care already know... :-)
I believe that when Rémi joined a discussion about this last year, he said it is a two-tailed test with a 95% confidence interval. Using 1 SD really gives too wide a range to be useful for measuring small changes, and 2 SD can be problematic if the change is very small.
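For a rough sanity check on those +/- numbers, here is a sketch of a two-tailed 95% interval computed directly from raw match results using the normal approximation on the score fraction. This is not what Bayeselo does internally (its estimate is Bayesian and handles draws differently), so expect the numbers to differ somewhat, especially for small samples.

```python
import math

def elo_interval(wins, losses, draws, z=1.96):
    """Approximate Elo difference and two-tailed 95% interval
    from a match score, via the normal approximation."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Sample variance of the per-game score (win=1, draw=0.5, loss=0).
    var = (wins * (1.0 - score) ** 2
           + losses * (0.0 - score) ** 2
           + draws * (0.5 - score) ** 2) / n
    se = math.sqrt(var / n)          # standard error of the mean score

    def to_elo(s):
        s = min(max(s, 1e-6), 1 - 1e-6)   # clamp away from 0 and 1
        return -400 * math.log10(1 / s - 1)

    return to_elo(score), to_elo(score - z * se), to_elo(score + z * se)
```

For example, `elo_interval(100, 100, 100)` gives an estimate of 0 Elo with an interval of roughly ±30, and the interval shrinks as the square root of the number of games, which is why small improvements need so many games to confirm.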