Hardware vs Software

bob · Post by **bob** » Wed Dec 03, 2008 7:41 am

Here are the results. Crafty-22.9X1 is normal Crafty. 22.9X2 is normal except that null-move is completely commented out. 22.9X3 is normal except that LMR has been completely disabled. 22.9X4 is normal but with both LMR and null-move removed. The -1 or -2 suffix just means run #1 or run #2, to give a feel for what kind of variation there is between runs.

Code:

Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.1      2695    4    4 62256   65%  2585   20% 
   2 Toga2             2695    4    3 62256   64%  2585   20% 
   3 Crafty-22.9X1-1   2638    4    4 31128   51%  2629   21% 
   4 Crafty-22.9X1-2   2636    5    5 31128   51%  2629   21% 
   5 Fruit 2.1         2597    3    3 62256   52%  2585   22% 
   6 Crafty-22.9X2-2   2596    4    4 31128   45%  2629   21% 
   7 Crafty-22.9X3-2   2596    4    4 31128   46%  2629   20% 
   8 Crafty-22.9X2-1   2594    4    5 31128   45%  2629   21% 
   9 Crafty-22.9X3-1   2591    4    5 31128   45%  2629   20% 
  10 Glaurung 1.1 SMP  2530    3    4 62256   43%  2585   19% 
  11 Crafty-22.9X4-1   2517    5    5 31128   35%  2629   19% 
  12 Crafty-22.9X4-2   2514    5    5 31128   35%  2629   18%

Normal is roughly 2637 in this test. Removing either null-move or LMR alone drops this by approximately 40 Elo. Removing both drops the rating by around 120 Elo.
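
The drops quoted above can be re-derived from the table. The sketch below averages the two runs of each Crafty variant and converts each Elo gap into an expected score using the standard logistic Elo model; the dictionary labels and function names are made up for illustration and are not part of the test setup.

```python
# Elo figures copied from the results table; run #1 and run #2 of each variant.
ratings = {
    "normal":       (2638, 2636),   # Crafty-22.9X1
    "no null-move": (2594, 2596),   # Crafty-22.9X2
    "no LMR":       (2591, 2596),   # Crafty-22.9X3
    "neither":      (2517, 2514),   # Crafty-22.9X4
}


def mean(pair):
    """Average of the two runs."""
    return sum(pair) / len(pair)


def expected_score(elo_diff):
    """Logistic Elo model: expected score of a player rated elo_diff higher."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


baseline = mean(ratings["normal"])
drops = {name: baseline - mean(runs)
         for name, runs in ratings.items() if name != "normal"}

for name, drop in drops.items():
    print(f"{name:13s} -{drop:5.1f} Elo, "
          f"normal would be expected to score {expected_score(drop):.1%}")
```

With these figures the single-feature drops are 42.0 and 43.5 Elo, and removing both costs 121.5 Elo, so the combined loss is larger than the sum of the two individual losses (85.5).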

Null-move and LMR are the two biggest search enhancements of the past 15 years. And together they added about 120 Elo. I could always try normal Crafty, but take the NPS from about 1M on this hardware (I only test with 1 CPU here) and back it down to about 75K, which is what I was getting in 1996 on a Pentium Pro 200, roughly a factor of 15x, and see how that impacts performance. Although I should probably factor in the single-core Pentium Pro vs a quad-core Xeon for today, which runs around 10M NPS, as that is a more representative example of what hardware speeds have done since 1996. So 75K to 10M is a factor of 24 or so. But something tells me that factor of 24-25x is _way_ more than 120 Elo...
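
As a quick check of the speed ratios, the NPS figures from the post can be plugged in directly. This is only arithmetic on the quoted numbers; the constant and function names are illustrative.

```python
import math


def speedup(old_nps, new_nps):
    """Return (ratio, doublings) between two nodes-per-second figures."""
    ratio = new_nps / old_nps
    return ratio, math.log2(ratio)


NPS_1996 = 75_000          # Pentium Pro 200, figure from the post
NPS_ONE_CPU = 1_000_000    # one CPU of the test hardware, figure from the post
NPS_QUAD = 10_000_000      # quad-core Xeon, figure from the post

for label, nps in (("1 CPU", NPS_ONE_CPU), ("quad-core", NPS_QUAD)):
    ratio, doublings = speedup(NPS_1996, nps)
    print(f"{label}: {ratio:.1f}x faster, about {doublings:.1f} doublings")
```

The single-CPU ratio comes out near 13x, close to the "roughly 15x" above. The quad-core ratio comes out near 133x (about seven doublings), which is noticeably larger than the 24x figure in the post and, if anything, strengthens the point being made.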

Other suggestions???

BTW how many are surprised that removing both is only a 120 Elo loss???

bob · Post by **bob** » Wed Dec 03, 2008 7:43 am

Uri Blass wrote:
bob wrote:
CRoberson wrote:While at the 2008 ACCA Pan American Computer Chess Championships,
Bob claimed he didn't believe software played a serious role in all the
rating improvements we've seen. He thought hardware deserved the
credit (assuming I understood the statement correctly. We were jumping
across several subjects and back that night.).

I believe software has had much to do with it for several reasons.
I will start with one. The EBF with only MiniMax is 40. With Alpha-Beta
pruning, it drops to 6. In the early 1990's, the EBF was 4. Now, it is 2.

Dropping the EBF from 4 to 2 is huge. Let's look at a 20 ply search.
The speedup of EBF=2 vs EBF=4 is:
4^20/2^20 = 2^20 = 1,048,576

So, that is over a 1 million x speed up. Has hardware produced that much
since 1992?

Also, I believe eval improvements have caused an improvement in
rating scores.

An example of nonhardware improvements is on the SSDF rating list.
Rybka 1.0 beta scored 2775 on a 450 MHz AMD.

I believe my statement was more along the lines "Hardware has had a _larger_ influence over program performance increases over the past 20 years than the software has." This is based on my cluster testing where I now know just what some of the "great enhancements" have brought. If you'd like to pick one "revolutionary idea" (null-move? LMR? check extension? some evaluation concept? etc...) I can give it a test using Crafty with and without, assuming Crafty uses the idea you want to compare. I have not found a single +100 Elo idea in Crafty. LMR is a modest improvement. I don't recall the exact amount at present but could compute it. Remember, just because we are physically searching deeper, today's "ply" is not comparable to the "ply" of 20 years ago. Today's "plies" have significantly more errors in them due to the various types of pruning and reductions going on...

I'll test whatever you think is the "biggest" to see what happens...

Your cluster tests can prove that hardware helped Crafty more than software.
They cannot prove that it is the case in general.

Uri

I'd claim Crafty is pretty representative of _most_ programs of today: null-move R=3, LMR, check extensions, iterative deepening, PVS, simple q-search + checks, etc.

Now if you want to claim most programs are far different from Crafty with respect to search space/techniques, feel free to do so. But I doubt that will convince anyone...
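
For reference, the EBF arithmetic quoted at the top of this post can be reproduced with a rough model in which a search to depth d with effective branching factor b visits about b^d nodes. That model is an assumption made here for illustration, and the function name is hypothetical.

```python
import math


def tree_size(ebf, depth):
    """Rough node count for a search with effective branching factor ebf."""
    return ebf ** depth


DEPTH = 20
ratio = tree_size(4, DEPTH) // tree_size(2, DEPTH)
print(ratio)  # 1048576, i.e. 2**20

# Depth an EBF=4 search reaches with the node budget of an EBF=2 search at 20 plies:
# solve 4**d == 2**20 for d.
equivalent_depth = DEPTH * math.log(2) / math.log(4)
print(equivalent_depth)
```

Under this model a 20-ply search at EBF=2 costs about as much as a 10-ply search at EBF=4. The model says nothing about how reliable each of those plies is.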

bob · Post by **bob** » Wed Dec 03, 2008 7:49 am

Uri Blass wrote:
Branching factor proves nothing, because programs that do more pruning play weaker at fixed depth. But I can say, without relying on branching factor, that the improvement in software in the last few years is very big, and bigger than the improvement in hardware (I am not sure about the improvement since 1992, because it is not clear how we define it, but I am sure about the improvement from 2005 to 2008).

Note that the tests of Bob can show only that hardware helped more than software for Crafty.

Tests of the SSDF showed the following results.

Rybka 3 A1200 - Deep Shredder 11 Q6600 20-19
Rybka 3 A1200 - Zappa Mexico II Q6600 20-20

A1200 = 1 x 1.2 GHz
Q6600 = 4 x 2.4 GHz

Note that both Zappa and Shredder are clearly stronger than Fruit, which was the leading program in 2005 for single-processor machines.

I think that we can safely say that the software improvement in the last 3 years was more than 10:1, and I do not see a hardware improvement of 10:1 in the last 3 years.

Uri

Can we stop with the amateurish comparisons? Why pick different programs to compare? We had better and worse programs in 1995 as well. The question was, what have the software advances actually produced?

Feel free to name significant ones. Most would put null-move at the top, and LMR right behind it. Then we could factor in razoring, futility and extended futility from Heinz, but I already know those are very small improvements from testing by turning each off a month back or so.

What else is _significant_? Evaluation is not so interesting. I was doing passed pawn races in the 1970's. Outside passed pawns in 1995 in Crafty. So what _new_ thing since 1995 is such a big contributor? No hand-waving, no talking about ideas that _might_ have been developed. Actual documented techniques...

Uri Blass · Post by **Uri Blass** » Wed Dec 03, 2008 7:59 am

bob wrote:
I'd claim Crafty is pretty representative of _most_ programs of today: null-move R=3, LMR, check extensions, iterative deepening, PVS, simple q-search + checks, etc.

Now if you want to claim most programs are far different from Crafty with respect to search space/techniques, feel free to do so. But I doubt that will convince anyone...

I think that the implementation of LMR may be different for different programs.

It may be interesting to do the same comparison for Glaurung to see how much rating Glaurung earns from LMR and null move.

Uri

bob · Post by **bob** » Wed Dec 03, 2008 8:10 am

Uri Blass wrote:
I think that the implementation of LMR may be different for different programs.

It may be interesting to do the same comparison for Glaurung to see how much rating Glaurung earns from LMR and null move.

Uri

If you have the time, go for it. I don't have the time to study the source to see what needs to be commented out. And I don't have the time to try this on every program available. Null-move doesn't give Crafty any more or less improvement than any other program. LMR is not a giant step for mankind, so implementation details are going to be about small improvements or degradations, not big ones. I've never seen a quantitative analysis of precisely what null-move search does to a program's skill level, other than "it is clearly better". I just published the _precise_ data for this, played over enough games to be highly accurate.
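
To give a feel for what "enough games" means, here is a rough normal-approximation estimate of the Elo error margin for one Crafty run in the table (51% score, 21% draws, 31128 games). This is a back-of-the-envelope sketch with a hypothetical function name, and not necessarily the method that produced the + and - columns in the results.

```python
import math


def elo_margin(score, draw_rate, games, z=1.96):
    """Approximate +/- Elo margin for a match result (normal approximation).

    score      mean points per game, between 0 and 1
    draw_rate  fraction of drawn games
    games      number of games played
    z          1.96 corresponds to a ~95% interval
    """
    wins = score - draw_rate / 2.0
    # Per-game variance of the result, which is 1, 0.5 or 0 points.
    variance = wins * 1.0 + draw_rate * 0.25 - score ** 2
    std_err = math.sqrt(variance / games)
    # Slope of elo(score) = 400 * log10(score / (1 - score)).
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * std_err * slope


print(round(elo_margin(0.51, 0.21, 31128), 1))
```

The estimate lands between 3 and 4 Elo, the same order as the margins of 4 to 5 shown in the table, and well below the 40 Elo and 120 Elo differences being measured.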

CRoberson · Post by **CRoberson** » Wed Dec 03, 2008 8:10 am

Here is what I meant by PV verification:

Code:

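    for all moves
    {
        if first move
            v = -S(-beta, -alpha)        // first (PV) move: full window
        else
        {
            v = -S(-alpha-1, -alpha)     // other moves: null-window probe
            if (v > alpha && v < beta)
                v = -S(-beta, -alpha)    // probe beat alpha: re-search with full window
        }
    }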

Change that to:

Code:

      for all moves
      {
           v = -S(-beta,-alpha)
      }
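
A runnable sketch of the two loops above, for anyone who wants to experiment. It uses a random toy game tree instead of a chess engine, so `make_tree`, `Counter`, `alphabeta` and `pvs` are all illustrative names and none of this is Crafty code. The alpha update and beta cutoff that the pseudocode leaves implicit are written out.

```python
import random

INF = 10 ** 9


class Counter:
    """Counts visited nodes."""
    def __init__(self):
        self.nodes = 0


def make_tree(depth, width, rng):
    """Random game tree; a leaf is an integer score for the side to move."""
    if depth == 0:
        return rng.randint(-100, 100)
    return [make_tree(depth - 1, width, rng) for _ in range(width)]


def alphabeta(node, alpha, beta, stats):
    """Second loop: every move is searched with the full (alpha, beta) window."""
    stats.nodes += 1
    if isinstance(node, int):
        return node
    for child in node:
        v = -alphabeta(child, -beta, -alpha, stats)
        if v > alpha:
            alpha = v
        if alpha >= beta:
            break  # beta cutoff
    return alpha


def pvs(node, alpha, beta, stats):
    """First loop: full window for the first move, null-window probe for the rest."""
    stats.nodes += 1
    if isinstance(node, int):
        return node
    for i, child in enumerate(node):
        if i == 0:
            v = -pvs(child, -beta, -alpha, stats)
        else:
            v = -pvs(child, -alpha - 1, -alpha, stats)
            if alpha < v < beta:
                # The probe says this move is better than alpha: re-search it.
                v = -pvs(child, -beta, -alpha, stats)
        if v > alpha:
            alpha = v
        if alpha >= beta:
            break  # beta cutoff
    return alpha


tree = make_tree(depth=6, width=5, rng=random.Random(1))
ab_stats, pvs_stats = Counter(), Counter()
ab_value = alphabeta(tree, -INF, INF, ab_stats)
pvs_value = pvs(tree, -INF, INF, pvs_stats)
print(ab_value, pvs_value, ab_stats.nodes, pvs_stats.nodes)
```

Both functions return the same root value. How the node counts compare depends heavily on move ordering, and a random tree has none, so this toy says nothing about the Elo effect in a real engine.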

michiguel · Post by **michiguel** » Wed Dec 03, 2008 8:21 am

bob wrote:Here are the results. Crafty-22.9X1 is normal Crafty. 22.9X2 is normal except that null-move is completely commented out. 22.9X3 is normal except that LMR has been completely disabled. 22.9X4 is normal but with both LMR and null-move removed. The -1 or -2 suffix just means run #1 or run #2, to give a feel for what kind of variation there is between runs.
normal is roughly 2637 in this test. Removing null-move or LMR drops this by approximately 40 Elo. Removing both drops the rating by around 120 Elo.

Null-move and LMR are the two biggest search enhancements of the past 15 years. And they added +120 Elo. I could always try normal crafty, but take the NPS from about 1M on this hardware (I only test with 1 cpu here) and back it down to about 75K which is what I was getting in 1996 on a pentium pro 200, roughly a factor of 15x and see how that impacts performance. Although I should probably factor in the single-core pentium pro vs a quad-core xeon for today, which runs around 10M nps, as that is a more representative example of what hardware speeds have done since 1996. So 75K to 10M is a factor of 24 or so. But something tells me that factor of 24-25x is _way_ more than 120 Elo...

You may be right, but this is not a valid comparison! You should compare the improvement between Crafty model 1996 and Crafty model 2008, not the contribution of two single techniques as implemented in 2008.

What is the Elo difference between Crafty 1996 and Crafty 2008 running on equal hardware? This is not the optimum either, but it is closer to something more meaningful.

Miguel

CRoberson · Post by **CRoberson** » Wed Dec 03, 2008 8:29 am

bob wrote:Other suggestions???

What eval changes have been made since 1992? At least the ones
suggested by Roman D. have been added.

Bo Persson · Post by **Bo Persson** » Wed Dec 03, 2008 5:37 pm

bob wrote: I have the test running. 8 x 32,000 games will take around 8 hours so I should have the results around 11:00pm CST.

Isn't this in itself enough evidence that the hardware evolution is a major factor?

Running 256,000 tests wasn't even considered a few years ago.
