Laskos wrote:
syzygy wrote:
Laskos wrote:
syzygy wrote:(Maybe Kai should have taken the Elo gain per core observed in other threads, rounded to one decimal, and based on that reconstructed the times to, say, 3 decimals? Would have made a fine scientific paper. However, this is just a forum thread.)
The problem is that the times are not universal; they depend on TC or depth. (...)
Please know that what I wrote between parentheses was purposeful nonsense.
The way you measured it was the proper way to do it. Bob did it in exactly the same way not long ago. See here, here and here:
bob wrote:To measure this "the extra search nodes can help" one can do a very simple test.
Play a series of matches to fixed depth using 1 cpu, then 2 cpus, then 4. Since the depth is fixed, the time to complete is irrelevant, and the only change will be the search overhead. If it does help, the 4 cpu fixed depth program should perform better than the 1 cpu fixed depth program.
I can run that test although I already am almost 100% certain about the outcome, thanks to past testing...
bob wrote:This is going to only answer one question, "Is the parallel search overhead all wasted effort or is there some accuracy/quality gained from it?"
bob wrote:After the first two-thread match is almost complete, zero change in Elo from the two 1-thread runs. Got another 2 thread run for verification queued up, and then 4 thread runs. Should be able to answer this subject definitively tomorrow if nothing goes wrong overnight.
Thanks Ronald, I was unaware of these posts. So, Bob was not so out of his depth as he seemed to be in this thread. The result in my first post contradicted his "no gain" rule (belief?), and he started to post some preposterous arguments. A bit strange to deny and distort hard facts on the very shaky grounds that it cannot happen. "That can not possibly be, because it could never possibly be." (Chekhov)
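For anyone who wants to try the quoted procedure, here is a minimal sketch built on the python-chess library. This is my own illustration, not anyone's actual test harness; the engine paths, the fixed depth of 12 and the 100-game match length are placeholder assumptions, not values from the thread.

# Minimal sketch of the fixed-depth test described in the quoted posts,
# using the python-chess library.  Paths, depth and game count are
# placeholders; a serious run needs an opening book and many more games.
import chess
import chess.engine

ENGINE_PATH   = "./engine_under_test"   # placeholder path
OPPONENT_PATH = "./opponent_engine"     # placeholder path
DEPTH = 12                              # fixed search depth for both sides
GAMES = 100                             # games per thread count

def play_game(white, black):
    """Play one fixed-depth game and return the PGN result string."""
    board = chess.Board()
    while not board.is_game_over():
        engine = white if board.turn == chess.WHITE else black
        board.push(engine.play(board, chess.engine.Limit(depth=DEPTH)).move)
    return board.result()

for threads in (1, 2, 4):
    test = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    opp = chess.engine.SimpleEngine.popen_uci(OPPONENT_PATH)
    test.configure({"Threads": threads})    # only the tested engine scales
    score = 0.0
    for g in range(GAMES):
        # Alternate colours; score is counted from the tested engine's side.
        if g % 2 == 0:
            score += {"1-0": 1.0, "0-1": 0.0}.get(play_game(test, opp), 0.5)
        else:
            score += {"0-1": 1.0, "1-0": 0.0}.get(play_game(opp, test), 0.5)
    test.quit()
    opp.quit()
    print(f"{threads} thread(s): {score}/{GAMES} vs the same opponent")

If the fixed-depth scores with 1, 2 and 4 threads come out about equal, the extra nodes searched in parallel bought nothing; if the multi-threaded runs score measurably higher, the wider tree is adding real strength.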
Here's some context, dating back to the 70s, including names such as Marsland, Schaeffer and Newborn in the beginning, and ending with names like Hsu, Hyatt, etc. in recent years.
EVERYONE has used parallel search to increase depth, up to the point where the search depth doesn't improve any further. Schaeffer with Sun Phoenix comes to mind, using part of the nodes to run a normal search and part to run a material-only search. The ONLY reason anyone has done anything similar is that they reach a point (Schaeffer's program ran on a local network of Suns interconnected with 10 Mbit/sec Ethernet, so communication was a big issue) where additional processors don't help with the depth at all and the speedup flattens out (or even turns south in some cases).
So, depth is key, and always has been. For this "widening" to be realistic, we first assume that Komodo's parallel search simply can't produce a reasonable speedup with a paltry 4 cpus. Where EVERYONE is getting something in the range of 3.1-3.3, Komodo falls short. That leaves one possibility: searching wider (less pruning) to make the tree bigger, where it is easier to search in parallel, but with less depth increase. But will the width help enough to offset the loss in depth? That has not happened to date, particularly with just 4 cores. Hence I consider this to be beyond unlikely...
ANY parallel search searches a wider tree. Most know that YBW is the common "trigger" criterion. Once you have searched one move at any particular node, you can reasonably assume that it is an ALL node that needs every branch searched. A split there is perfect. The fly in the ointment is that we are only right about 90% of the time (at least in Crafty, 90+% of the time when a fail high occurs, it happens on the first move). That means that 10% of the splits incur some overhead. If you use 4 cores, in the worst case you waste 3 of the cores because the second move causes the fail high. In other cases there is less overhead, when the fail high comes on the 3rd, and finally no overhead if it comes on the 5th. That makes the tree wider. But to date, no experimental results have shown that such overhead improves the quality. I have run several such tests, solely to verify that the parallel search is not broken (a search to fixed depth with 1, 2 or 4 cores should produce about the same Elo against a common set of opponents; if it drops with > 1 core, I know I have a bug).
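To put rough numbers on that overhead argument, here is a toy Python illustration (my own sketch, not Crafty code): split the node after the first move, hand the next four moves to four cores, and count how many of those parallel searches a serial search would never have reached.

# Toy accounting of YBW split overhead with 4 cores, as described above.
def wasted_searches(fail_high_move, threads=4):
    # Moves 2 .. 1+threads are searched in parallel once the serial
    # first move has failed to produce a cutoff.
    parallel_moves = range(2, 2 + threads)
    # A serial search would stop at fail_high_move, so any parallel move
    # beyond it is pure overhead (extra width, no extra depth).
    return sum(1 for m in parallel_moves if m > fail_high_move)

for m in (2, 3, 4, 5):
    print(f"fail high on move {m}: {wasted_searches(m)} of 4 searches wasted")
# Worst case (fail high on move 2): 3 of the 4 cores' work is wasted.
# Fail high on move 5: nothing is wasted, the serial search needed it all.

Per the 90% figure above, only about one split in ten lands in this table at all; the rest are genuine ALL nodes where every parallel search was needed anyway.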
So experience, and lots of it, by MANY people, has shown that going for more depth offers the greatest return from additional cores. For a while. I have run thousands of tests with 8-12 cores, quite a few with 16, and just a few (ignoring Cray Blitz) with 32. And even some with Eugene Nalimov using a 64-cpu Itanium box Microsoft had at the time. And in every case, additional cores added additional depth. For 1-12 there is no doubt this is the way to go for every program I have seen that uses a parallel search (reasonable one, forget the hash table nonsense and such).
The fact that Komodo searches a wider tree just means it searches a wider tree. Why? No idea. I can think of several programming bugs that would cause this. One simple one as an example: suppose you use classic LMR where you sort the moves and, as you step through the move list, you increase the reduction every few moves. Now suppose you split after searching the first move, and each thread counts just the moves it searches. Since each thread searches 1/4 of the moves on average, no thread searches as many moves as a single thread would, so the moves-searched counter never gets nearly as large and the reductions don't reach the same level. That's the sort of bug I have had in the past, and one I would look for if the size of my fixed-depth parallel search grew faster than about 30% per core added. Is that the case here? Absolutely unknown, and there is no data to suggest any sort of explanation other than "it happens."
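Here is a hypothetical illustration of that kind of bookkeeping bug (the reduction schedule below is invented for the example; the code is not from Komodo, Crafty or any other engine):

# Hypothetical LMR bookkeeping bug: a per-thread move counter after a
# split makes every helper thread reduce less than the serial search.
def reduction(moves_searched):
    # Classic LMR-style schedule: later moves get bigger reductions.
    if moves_searched < 4:
        return 0
    return 1 + moves_searched // 8

serial_count = 32                      # one thread counts all 32 moves
per_thread_count = serial_count // 4   # each of 4 threads counts ~1/4

print(reduction(serial_count))       # 5 plies of reduction for late moves
print(reduction(per_thread_count))   # only 2 plies with the buggy counter

With smaller reductions on every helper thread, the fixed-depth tree grows faster than the roughly 30% per core mentioned above. One plausible fix (an assumption on my part, not anything stated in the thread) is to keep the moves-searched counter at the shared split point, so each helper reduces just as the serial search would.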
My quibble with the discussion has been that the term "SMP efficiency" is well-defined and is purely based on parallel performance improvements. If your parallel search intentionally does NOT search the same tree as the serial search, then the term is not applicable. If it unintentionally does not search the same tree, that just reduces the SMP efficiency, because you are searching nodes that the serial search does not NEED to search (because of alpha/beta backward pruning or normal forward pruning or reductions). Here we can't even discuss SMP efficiency, since we have no way to compute the speedup t(1 core)/t(4 cores): no times were provided.
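For reference, the usual definitions (standard parallel-computing usage, nothing engine-specific) are, for N cores,

\[
S(N) = \frac{t(1\ \text{core})}{t(N\ \text{cores})}, \qquad E(N) = \frac{S(N)}{N},
\]

so the 3.1-3.3 speedups mentioned above correspond to an efficiency of roughly 0.8 on 4 cores, and without the actual times there is simply nothing to plug into either formula.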
Those points summarize what I have repeated several times by now. Nothing new. Nothing controversial. IF Komodo intentionally changes the shape of the tree, then that would be interesting. If the parallel search can't search the same tree at least 3x faster, that is also interesting since many others are doing so.
My judgement, based on doing parallel search for 35 years now (the first parallel version of Blitz was in 1978, on a dual-cpu Univac 1100/20 machine), is that depth comes first, until you begin to reach asymptotic behavior, at which time you might try to alter the asymptote, or else use the remaining cpu power to do something else, such as the previously mentioned material-only tactical verification search and the like...
I don't claim to know more about parallel search than anyone else, as applied to alpha/beta and chess. But I would certainly claim to not know any less than anyone else currently (or previously) involved. Experience is a necessary tool. And that comes from actually doing it, as opposed to guessing about it.
That's my take on the discussion to date...