44 elo swing depending on hardware!

lkaufman · Post by **lkaufman** » Wed Oct 16, 2013 7:00 pm

I ran a match on a third generation I7 quad between Komodo (latest dev. version) and Houdini 3 at bullet speed (30 seconds plus a quarter of a second increment), overprovisioned (meaning one test per thread rather than per core). Just as the light-speed list shows, at these speeds there is a sizable gap; Houdini won by 34 elo after 2600 games. No surprise.
Then I ran the same match on my sixteen core machine, with the time limit increased by a third to make the average depth reached about the same. Much to my surprise, Komodo won the match by ten elo after 2300 games! This is a swing of 44 elo points. Incredible!
The relative NPS of the two programs explains the result. On the quad the ratio (Houdini NPS divided by Komodo NPS) was 1.53. On the 16 core the same ratio was 1.25. This huge difference can account for something like 44 elo points at this fast a level.
But the question is WHY is there such a huge difference in the ratio between the two programs?
I ran the same test with Komodo against Stockfish but got only a negligible difference in results or in NPS ratio. Overprovisioning doesn't seem to be a big factor; when I run the same tests limited to the number of cores I get smaller NPS ratios on both machines, still much higher on the quad.
I also tried timing the opening position, 20 ply search on a single core. Here the ratios both grew: on the quad it was 1.68, on the 16 core it was 1.64.
Can anyone explain these huge disparities in NPS ratios?
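
To put the NPS argument above in numbers, here is a minimal sketch. It assumes that playing strength scales linearly with the logarithm of search speed (a fixed number of Elo per doubling) and that NPS is a fair proxy for effective speed; neither assumption comes from the post. The variable names are illustrative only.

```python
import math

# NPS ratios (Houdini NPS / Komodo NPS) reported in the post.
ratio_quad = 1.53
ratio_16core = 1.25

# Houdini +34 on the quad, Komodo +10 on the 16 core.
observed_swing_elo = 34 + 10

# How much faster Houdini runs, relative to Komodo, on the quad
# compared with the 16 core machine.
relative_speed = ratio_quad / ratio_16core

# The same factor expressed as a fraction of one speed doubling.
doublings = math.log2(relative_speed)

# Elo per doubling that would be needed for the NPS change alone
# to account for the whole swing (assumption: Elo is linear in log2 speed).
implied_elo_per_doubling = observed_swing_elo / doublings

print(f"relative speed factor:    {relative_speed:.3f}")
print(f"fraction of a doubling:   {doublings:.3f}")
print(f"implied Elo per doubling: {implied_elo_per_doubling:.0f}")
```

Under these assumptions the change in ratio is a bit under a third of a doubling, so the NPS difference alone explains the swing only if a doubling is worth roughly 150 Elo at this time control.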

pohl4711 · Post by **pohl4711** » Wed Oct 16, 2013 7:35 pm

lkaufman wrote:I ran a match on a third generation I7 quad between Komodo (latest dev. version) and Houdini 3 at bullet speed (30 seconds plus a quarter of a second increment), overprovisioned (meaning one test per thread rather than per core). Just as the light-speed list shows, at these speeds there is a sizable gap; Houdini won by 34 elo after 2600 games. No surprise.
Then I ran the same match on my sixteen core machine, with the time limit increased by a third to make the average depth reached about the same. Much to my surprise, Komodo won the match by ten elo after 2300 games! This is a swing of 44 elo points. Incredible!
The relative NPS of the two programs explains the result. On the quad the ratio (Houdini NPS divided by Komodo NPS) was 1.53. On the 16 core the same ratio was 1.25. This huge difference can account for something like 44 elo points at this fast a level.
But the question is WHY is there such a huge difference in the ratio between the two programs?
I ran the same test with Komodo against Stockfish but got only a negligible difference in results or in NPS ratio. Overprovisioning doesn't seem to be a big factor; when I run the same tests limited to the number of cores I get smaller NPS ratios on both machines, still much higher on the quad.
I also tried timing the opening position, 20 ply search on a single core. Here the ratios both grew: on the quad it was 1.68, on the 16 core it was 1.64.
Can anyone explain these huge disparities in NPS ratios?

On my quad-core notebooks, the ratio is around 1.5 (Houdini 2.2 MN/s and Komodo 1.5 MN/s) when I use the LittleBlitzer GUI with 3 or 4 games at the same time.
Perhaps you should try what happens when you switch Hyperthreading off and play only 4 games at the same time. Then the LittleBlitzer GUI should use only one CPU on your sixteen-core machine (check the Windows Task Manager). Perhaps Houdini has some problems with so many CPUs/cores in one machine (perhaps the NUMA architecture?), even though it uses only a single thread. It's worth a try... Hyperthreading can have strange effects when playing several engine matches in parallel. I would strongly recommend switching Hyperthreading off for all engine testing!

Stefan

Modern Times · Post by **Modern Times** » Wed Oct 16, 2013 8:24 pm

lkaufman wrote:I ran a match on a third generation I7 quad between Komodo (latest dev. version) and Houdini 3 at bullet speed (30 seconds plus a quarter of a second increment), overprovisioned (meaning one test per thread rather than per core)

In my mind, this overprovisioning completely invalidates the test. You would never catch a CCRL or CEGT tester doing that.

lkaufman · Post by **lkaufman** » Thu Oct 17, 2013 12:45 am

Modern Times wrote:
lkaufman wrote:I ran a match on a third generation I7 quad between Komodo (latest dev. version) and Houdini 3 at bullet speed (30 seconds plus a quarter of a second increment), overprovisioned (meaning one test per thread rather than per core)
In my mind, this overprovisioning completely invalidates the test. You would never catch a CCRL or CEGT tester doing that.

OK, but when I ran it with 15 matches (instead of 32, keeping one for the GUI) the NPS ratio dropped even further (to 1.22), as did the depth advantage shown by Houdini. Results were nearly even (about 3 elo lead for Houdini) after 1300 games, nothing like the beating we took on the quad. So aside from whether overprovisioning is valid or not, it is not the explanation of the huge disparity between the two machines. Maybe others can report on the NPS ratio shown by LittleBlitzer in direct matches between Houdini 3 and Komodo 6 at levels around 30" + .25", more or less depending on whether you have slow or fast hardware. Maybe we can see whether other hardware besides my 16 core machine shows ratios around 1 1/4 instead of 1 1/2.
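
To judge how much of these differences could be noise, a rough error margin can be attached to each of the three match results quoted so far. This is a sketch only: it uses the standard logistic Elo model and a normal approximation, and because the posts do not give draw rates it defaults to a draw ratio of 0, which yields the widest (most conservative) interval. The function names and the `draw_ratio` parameter are illustrative, not taken from any testing tool.

```python
import math

def expected_score(elo_diff):
    """Expected score under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(score):
    """Inverse of expected_score (score must be strictly between 0 and 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(elo_diff, games, draw_ratio=0.0, z=1.96):
    """Approximate 95% interval for an observed Elo difference.

    draw_ratio is an assumption; 0.0 gives the widest interval.
    """
    s = expected_score(elo_diff)
    # Per-game variance of the score with wins, draws and losses.
    variance = s * (1.0 - s) - draw_ratio / 4.0
    se = math.sqrt(variance / games)
    return elo_from_score(s - z * se), elo_from_score(s + z * se)

# Results from Houdini's point of view, as reported in the thread.
matches = [
    ("quad, overprovisioned", 34, 2600),
    ("16 core, overprovisioned", -10, 2300),
    ("16 core, 15 games at once", 3, 1300),
]

for label, elo, games in matches:
    low, high = elo_interval(elo, games)
    print(f"{label}: {elo:+d} Elo, roughly {low:+.1f} to {high:+.1f}")
```

Even with this conservative setting, the interval for the quad result and the interval for the overprovisioned 16 core result do not overlap, which suggests the swing is not just sampling noise. A realistic draw ratio would narrow the intervals further.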

Eelco de Groot · Post by **Eelco de Groot** » Thu Oct 17, 2013 3:59 am

I did not really understand what Larry meant by overprovisioning. But it seems from your post, Larry, that on a quad you would normally run four matches without pondering, but now, with hyperthreading on, you use eight logical cores and run eight tests in parallel without pondering? Especially on a large machine like a sixteen core, does it not increase statistical noise? Maybe for running many fast games, the extra number of games you can do in the same time compensates, if it is just for a first-stage test for instance. But the error bars will possibly be larger. It should not be so hard to test that.

I don't think for this kind of measurement, if there is already discrepancy shown in the nodes per second, running matches is giving much extra information. Would it not be reasonable to assume that Houdini is not very well optimized for the sixteen core machine, even though Robert Houdart also uses an AMD as part of his test set-up and I assume this is also AMD (it is not specified in the post what sort of CPU is in that machine)?

Eelco

lkaufman · Post by **lkaufman** » Thu Oct 17, 2013 5:50 am

Eelco de Groot wrote:I did not really understand what Larry meant by overprovisioning. But it seems from your post, Larry, that on a quad you would normally run four matches without pondering, but now, with hyperthreading on, you use eight logical cores and run eight tests in parallel without pondering? Especially on a large machine like a sixteen core, does it not increase statistical noise? Maybe for running many fast games, the extra number of games you can do in the same time compensates, if it is just for a first-stage test for instance. But the error bars will possibly be larger. It should not be so hard to test that.

I don't think for this kind of measurement, if there is already discrepancy shown in the nodes per second, running matches is giving much extra information. Would it not be reasonable to assume that Houdini is not very well optimized for the sixteen core machine, even though Robert Houdart also uses an AMD as part of his test set-up and I assume this is also AMD (it is not specified in the post what sort of CPU is in that machine)?

Eelco

Yes, you explained overprovisioning correctly. When running one version of Komodo against another, it is pretty clear to me that overprovisioning is the right way to test, as a 20-25% increase in games per minute (when levels are equivalent, for example fixed depth) is hard to turn down when there are no apparent side effects.
But when I run unrelated engines the issue is less clear. Against Stockfish it doesn't seem to matter whether I overprovision or not, results and relative node rates are similar. But it does seem to matter a little against Houdini.
All of my machines are Intel. The sixteen core has two 8 core xeon processors. I just today got a 20 core machine (two new 10 core xeons) to replace my older 12 core, which I'm giving to Don. With the new hardware and the new partner (Mark L.) progress should accelerate.
I don't believe the nature of the hardware we had has much to do with Komodo. I mean we ran many tests on single-processor (i.e. quadcore) machines, and many on dual-processor (i.e. 12 and 16 core) machines, without ever noticing any particular discrepancy when testing Komodo vs Komodo. Maybe because Houdini was developed on AMD, it somehow works better on Intel machines with one processor than on those with two. But the huge disparity (relative to Komodo) in performance (about 20%, with or without overprovisioning) seems hard to explain. I would think it would be very difficult to write a chess program with the goal of having performance differences this large between machines (relative to other engines), without doing anything obviously silly or stupid.
I'm hoping that someone can come up with a clear technical explanation of how such a disparity might occur. Perhaps if we know the cause, we can improve Komodo on the single processor machines. My results on the 16 core indicate that even at bullet chess where Houdini excels we have now reached parity on the big machine, but we still have a bit of a gap at this speed on normal quad machines.
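
The trade-off between the 20-25% throughput gain from overprovisioning and the concern about larger error bars can be sketched with simple arithmetic. The sketch assumes that error bars shrink with the square root of the number of games and that overprovisioning introduces no bias; `noise_inflation` is a hypothetical factor for how much noisier each individual game becomes, not a measured value.

```python
import math

def relative_error_bar(throughput_gain, noise_inflation=1.0):
    """Error bar after the same wall-clock time, relative to the baseline.

    throughput_gain: extra games per minute (0.25 means 25% more games).
    noise_inflation: hypothetical multiplier on per-game standard deviation.
    """
    return noise_inflation / math.sqrt(1.0 + throughput_gain)

def break_even_inflation(throughput_gain):
    """Per-game noise increase at which the extra games stop paying off."""
    return math.sqrt(1.0 + throughput_gain)

for gain in (0.20, 0.25):
    print(
        f"{gain:.0%} more games: error bar x{relative_error_bar(gain):.3f}, "
        f"break-even noise inflation x{break_even_inflation(gain):.3f}"
    )
```

Under these assumptions, 20-25% more games shrinks the error bar by roughly 9-11% for the same testing time, so overprovisioning only loses if it makes each game more than about 10-12% noisier, or if it biases the result in favour of one engine.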

Modern Times · Post by **Modern Times** » Thu Oct 17, 2013 7:29 am

I have an Intel Sandybridge laptop, so I may run some fast blitz or bullet tests on that vs my AMD and see what happens.

Michel · Post by **Michel** » Thu Oct 17, 2013 9:39 am

A little bit OT.

Recently on fishtest some people had started abusing the system by running many more tests in parallel than there were free cores on their machines, presumably to get higher up in the list of played games more quickly.

This behaviour was brought into the spotlight by Alexandre Meirelles. Initially, however, his suspicions were dismissed by most people, including me, as FUD.

Nonetheless to put the matter to rest a statistical test was implemented which showed the abuse was real.

Another contribution was by Marco who implemented code to detect time losses (responding to suspicions by Uri).

These two sanity checks on the results returned by fishtest make it possible to detect abusers which in turn seems to have eliminated the abuse!

Modern Times · Post by **Modern Times** » Thu Oct 17, 2013 10:00 am

Michel wrote:A little bit OT.

These two sanity checks on the results returned by fishtest make it possible to detect abusers which in turn seems to have eliminated the abuse!

Good to hear, because results in those circumstances can't be trusted, and may lead to wrong decisions being taken on code changes.

Vinvin · Post by **Vinvin** » Thu Oct 17, 2013 11:16 am

lkaufman wrote:meaning one test per thread rather than per core

Do you mean you use hyper-threading?
