ICCR project is planning to be canceled

Sedat Canbaz
Posts: 3018
Joined: Thu Mar 09, 2006 11:58 am
Location: Antalya/Turkey

Re: ICCR project is planning to be canceled

Post by Sedat Canbaz »

The ICCR project has already been canceled. For more details:
http://www.sedatcanbaz.com/chess/iccr/

Best Regards,
Sedat Canbaz
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: ICCR project is planning to be canceled

Post by Don »

Sedat Canbaz wrote: The ICCR project has already been canceled. For more details:
http://www.sedatcanbaz.com/chess/iccr/

Best Regards,
Sedat Canbaz
Hi Sedat,

Sorry to hear that you cancelled your project. I read your announcement and I understand the difficulties with testing - there is much more noise than just error bars.

One comment about different hardware: I don't believe different hardware is guaranteed to return the same result even if you apply some adjustment to the time control, unless you apply that adjustment differently for each program, and that is pretty dicey. Each program responds differently to a particular machine. You can get relatively close but not right on.

It's probably not practical, but technically each program should be running on its own dedicated machine. Running 2 programs on the same machine introduces some issues. Each has an impact on the other, even if small, despite the fact that computer hardware technology has worked to minimize that as much as possible. All the programs share resources, which can impact the others. We assume that most of this cancels out, and probably most of it does. I would guess that this introduces another 5 Elo of uncertainty, but it's a wild guess. Ponder exacerbates this problem, so I think there is a bit more uncertainty with ponder games that share the same machine when testing.
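For a sense of scale, the purely statistical error bar of a match can be estimated from its win/draw/loss counts and set against a hardware-related guess like the 5 Elo above. The following is a minimal sketch, assuming independent games and a normal approximation; the function names and the sample counts are hypothetical and not taken from any rating list.

```python
import math

def elo_from_score(score):
    """Convert a score fraction (0 < score < 1) into an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_error_margin(wins, draws, losses, z=1.96):
    """Approximate error bar in Elo (z=1.96 gives about 95%) for a match result.

    Assumes independent games and a normal approximation of the mean score.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, with outcomes scored 1, 0.5 and 0.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(variance / n)
    low = elo_from_score(score - z * stderr)
    high = elo_from_score(score + z * stderr)
    return (high - low) / 2.0

# Hypothetical match results, for illustration only.
for wins, draws, losses in [(300, 400, 300), (3000, 4000, 3000)]:
    games = wins + draws + losses
    margin = elo_error_margin(wins, draws, losses)
    print(f"{games} games: about +/- {margin:.1f} Elo")
```

The statistical margin shrinks as more games are played, while any bias from shared hardware does not, which is one way to read the remark that there is more noise than the error bars show.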

When testing nearly identical versions of the same program it's probably less of an issue, but that's not what you are doing and even though we do some of that it's not our standard testing mode.

I believe MP programs should always be run on dedicated hardware when testing.

Hope things work out well for you in future computer chess endeavors!
Sedat Canbaz
Posts: 3018
Joined: Thu Mar 09, 2006 11:58 am
Location: Antalya/Turkey

Re: ICCR project is planning to be canceled

Post by Sedat Canbaz »

Dear Chess Friends,

First of all, I would like to say again that I have great respect for all members of the CEGT and CCRL teams.
Both teams are doing a great job, and many chess friends (including me) benefit from their work.

Thanks in advance for your understanding.

However, some Elo calculation issues are still not clear to me.

Let's say I start a new Auto232 adapted rating list based only on my own hardware.

I have a few questions for all chess experts (under the adapted conditions below):

1st question:
- Is there any chance of seeing an approx. 160-170 Elo difference in favor of the i7 980X @ 4.33 GHz 6-core?
* I mean specifically under the adapted time control conditions below, comparing the i7 980X @ 4.33 GHz 6c to the Intel Celeron 1.70 GHz 1c.

2nd question:
- What about the other adapted rating lists (CEGT/CCRL, if they are based on the AMD 4600+ 2.40 GHz): why is there such a large difference between 1c and 6c?
* Note: I know very well that they play matches such as 6c against 4c or 4c against 2c on the same machine.
- In my opinion, with correctly adapted time controls, no hardware should be stronger in Elo points. Or am I missing something?


3rd question:
- Is my method of calculating the adapted time controls correct and accurate?


Conditions:
--------------
- Time control: 75 min (adapted to the Celeron 1.7 GHz)
- Auto232 mode (all games are played between two separate computers via null-modem cable)
- Engine vs. engine matches on the same PC are not allowed
- Ponder ON
- Perfect 2012a
- 128 MB hash table
- Gaviota TB
- 16 MB TB cache
- All processor speeds are based on the Fritz Benchmark
- Engine: Houdini 2.0c (with maximum cores)
- Processors (which I have):

-------------------------------------------------------------------
Hardware-Processor        Speed      Cores    kN/s   Time Control
-------------------------------------------------------------------
Intel Core i7 980X      @ 4.33 GHz     6      18709      2min
Intel Core i7 970       @ 4.33 GHz     6      18706      2min
Intel Core i7 920       @ 4.00 GHz     4      12454      3min
Intel Core 2 Q9650      @ 3.82 GHz     4      10730      4min
Intel Core 2 QX6700       2.66 GHz     4       7231      5min
AMD Athlon 64 X2 4600+    2.40 GHz     2       2695     14min
AMD Turion 64 Mobile      2.20 GHz     2       2442     16min  
AMD Athlon 64 3400+       2.40 GHz     1       1367     28min 
Intel Pentium 4           2.66 GHz     1        762     50min
Intel Celeron             1.70 GHz     1        506     75min
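The time controls in the table are consistent with scaling the 75-minute reference inversely with the Fritz Benchmark speed and rounding to whole minutes. The sketch below assumes that rule; the post does not state the exact formula, so treat it as a reconstruction. The function and constant names are hypothetical.

```python
REFERENCE_KNS = 506      # Intel Celeron 1.70 GHz, kN/s from the table
REFERENCE_MINUTES = 75   # time control on the reference machine

def adapted_minutes(kns, ref_kns=REFERENCE_KNS, ref_minutes=REFERENCE_MINUTES):
    """Scale the reference time control inversely with benchmark speed."""
    return round(ref_minutes * ref_kns / kns)

# kN/s values copied from the table above.
BENCHMARKS = {
    "Intel Core i7 980X": 18709,
    "Intel Core i7 970": 18706,
    "Intel Core i7 920": 12454,
    "Intel Core 2 Q9650": 10730,
    "Intel Core 2 QX6700": 7231,
    "AMD Athlon 64 X2 4600+": 2695,
    "AMD Turion 64 Mobile": 2442,
    "AMD Athlon 64 3400+": 1367,
    "Intel Pentium 4": 762,
    "Intel Celeron": 506,
}

for name, kns in BENCHMARKS.items():
    print(f"{name:<24} {kns:>6} kN/s  {adapted_minutes(kns):>3} min")
```

Because the rule uses one benchmark number per machine, it treats every engine as if it sped up by the same factor, which is the approximation discussed elsewhere in this thread.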

More notes/details:
-----------------------
* From my experience I can say:
- With such an adapted rating list (75 min), all machines should perform at approximately equal Elo.
I mean that, unless an MP engine is buggy, 6 cores, 4 cores, 2 cores, or 1 core should have approx. the same Elo performance.

- I still believe that (with correctly adapted time controls) it is wrong to calculate that 6c is approx. 160-170 Elo stronger than 1c.
I mean when games played on the same i7 980X @ 4.33 GHz are mixed with games played on the Intel Celeron 1.7 GHz.

* In other words (for adapted time controls):
- It is not a good idea to play 6c against 1c matches on the same machine and then mix them with the games played on the Intel Celeron at 75 min.

* See also the SCCT Auto232 Rating, where there is a 162 Elo difference between 1c and 6c:
http://www.sedatcanbaz.com/chess/ratings/scct-auto232/
Note: All games are played with the same time control: 4 min + 2 sec
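To put a 162 Elo gap into perspective, the standard Elo expected-score formula converts a rating difference into an expected share of the points. This is a minimal sketch using that well-known logistic formula; the function name is hypothetical.

```python
def expected_score(elo_diff):
    """Expected score of the stronger side for a given Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

for diff in (0, 30, 162):
    print(f"{diff:>3} Elo -> expected score {expected_score(diff):.3f}")
```

At 162 Elo the stronger side is expected to take a little over 70% of the points, so a gap of this size is far outside what rounding of the time controls could explain on its own.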

* The reason (regarding correctly adapted time controls):
- Simply because it is an adapted rating list
- In reality, the Intel Celeron 1.70 GHz 1c at 75 min is not 160-170 Elo weaker than the i7 980X @ 4.33 GHz 6c at 2 min
- Personally, I expect to see approx. the same Elo performance from both machines
- Testing 6c against 1c on the same PC will lead to further misunderstandings (different Elo points, at least 30-40 Elo)

My final note on this issue (for those who still have difficulty understanding me):
* Remember that any engine's Elo performance depends on:
- processor speed
- time control
...



Best,
Sedat
Sedat Canbaz
Posts: 3018
Joined: Thu Mar 09, 2006 11:58 am
Location: Antalya/Turkey

Re: ICCR project is planning to be canceled

Post by Sedat Canbaz »

Thank you for your useful notes, dear Don.

BTW, I can't wait to test your new MP engine.
Could you please tell me the expected release date of Komodo MP?

Best,
Sedat
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: ICCR project is planning to be canceled

Post by Don »

Sedat Canbaz wrote: Dear Chess Friends,

First of all, I would like to say again that I have great respect for all members of the CEGT and CCRL teams.
Both teams are doing a great job, and many chess friends (including me) benefit from their work.

Thanks in advance for your understanding.

However, some Elo calculation issues are still not clear to me.

Let's say I start a new Auto232 adapted rating list based only on my own hardware.

I have a few questions for all chess experts (under the adapted conditions below):

1st question:
- Is there any chance of seeing an approx. 160-170 Elo difference in favor of the i7 980X @ 4.33 GHz 6-core?
* I mean specifically under the adapted time control conditions below, comparing the i7 980X @ 4.33 GHz 6c to the Intel Celeron 1.70 GHz 1c.
I assume that by "adapted time control" you mean an adjustment for hardware, for example increasing the time control on a slow processor so that it is equivalent to a faster time control on a fast one, and that you do it like CCRL and others do, standardizing on some reference hardware.

This is common practice and it's not a bad thing; however, I would point out that it is not perfect. I'll give you an example. Look at the language benchmarks pages. It's clear that every benchmark has different performance characteristics and some will run better on one platform and/or language than another.

With chess it's not nearly as bad because every chess program has a lot of similarities - you might say it's the same "type" of code. However, each program is still different and some will run better on a particular machine than another, but it might be exactly the opposite for a different program.

There are also reasons why how you calibrate the time control is probably only an approximation. I don't think that is a huge problem, because if you are using time controls that are very similar (after being adjusted), the advantage or disadvantage for programs which scale better or worse is second order and very minor. The bigger area of concern is that one program will just be "happier" on one machine than another.

I can give you an example of this. Many programs support an SSE4.2 mode which gives them a small performance boost. So imagine you have 2 testers, one of whom has an older machine without SSE4.2 support while the other has a machine with it. Even though BOTH testers run the calibration test (presumably using Crafty) to determine what is equivalent to the standard time control they want to test, a 32-bit program such as Spike is going to "prefer" to play on the non-SSE4.2 machine. It will have a definite advantage on that hardware. It's purely a matter of semantics, of course, whether you say it has an advantage on the non-SSE hardware or that it is handicapped on the SSE4.2 hardware relative to the other programs. But the real point is that this is additional noise being added to the testing procedure. In other words, you will see a consistent but small bias in favor of or against any given program on one tester's machine over another, no matter what you do. So the results will have a lot to do with who ran the most games as well as which matches they set up.
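The calibration bias described here can be sketched with a little arithmetic. The node rates below are hypothetical numbers chosen only for illustration, not measurements of Crafty, Spike, or any real machine, and the function name is an assumption of this sketch.

```python
def effective_time_factor(ref_nps_a, ref_nps_b, engine_nps_a, engine_nps_b):
    """Work an engine gets on machine B relative to machine A after calibration.

    The time control on B is scaled by the reference engine's speed ratio,
    so an engine that speeds up less than the reference ends up below 1.0.
    """
    calibration_ratio = ref_nps_b / ref_nps_a    # speedup of B for the reference engine
    engine_ratio = engine_nps_b / engine_nps_a   # speedup of B for the engine under test
    return engine_ratio / calibration_ratio

# Hypothetical: the reference engine doubles its speed on the newer machine,
# while an engine without the newer instructions gains only 80%.
factor = effective_time_factor(1000, 2000, 1000, 1800)
print(f"effective search work on machine B: {factor:.2f}x of machine A")
```

A factor below 1.0 means the engine is effectively playing at a slightly shorter time control on that tester's machine, which shows up as a small, consistent bias rather than as random noise.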

There is a way around this which will at least give self-consistent results once you accept the fact that the results are always going to be influenced by the testing procedure and conditions - you can standardize the testing conditions themselves. I described that in another forum post and won't go into that.

The question remains, though: how much difference does this really make, and is it worth obsessing over? Probably not. It depends on how accurate and how self-consistent you want the results to be, but you can get a pretty good sense of the error just by comparing results on various lists. You cannot go by absolute ratings; you have to pick 2 programs that are close together, compare their difference on one list versus another, and note how much it varies. A lot of this is simple sampling error, but a portion of it is an artifact of how each program responds to the test conditions of that particular test. You will find that in most cases there is general agreement within 20 or 30 Elo, even in the worst cases, when the number of games is significant.
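The comparison described above, looking at the gap between two nearby programs on each list rather than at absolute ratings, can be sketched as follows. The engine names and ratings are hypothetical placeholders, not figures from any real rating list.

```python
def pair_gap_discrepancy(list_a, list_b, engine_x, engine_y):
    """Return the X-minus-Y rating gap on each list and how far the gaps disagree."""
    gap_a = list_a[engine_x] - list_a[engine_y]
    gap_b = list_b[engine_x] - list_b[engine_y]
    return gap_a, gap_b, abs(gap_a - gap_b)

# Hypothetical ratings. The absolute levels differ between the lists,
# which is why only the gaps are compared.
list_a = {"Engine X": 3010, "Engine Y": 2995}
list_b = {"Engine X": 2960, "Engine Y": 2970}

gap_a, gap_b, disagreement = pair_gap_discrepancy(list_a, list_b, "Engine X", "Engine Y")
print(f"list A gap: {gap_a:+d}, list B gap: {gap_b:+d}, disagreement: {disagreement} Elo")
```

Repeating this for several pairs gives a rough picture of how much of the variation between lists comes from test conditions rather than from the engines themselves.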

Of course we are naturally most interested in Komodo test results and we have noticed that we get consistently worse results with some agencies over others. This is not a criticism because it can just as easily be said that we get consistently better results with some agencies over others - it's merely an artifact of the testing conditions (as well as the time controls.)

I can give you a trivial example of how this makes a difference. Suppose one agency tests with Fischer time controls and another uses the classic repeating time controls and there is a bug in how the author implemented one of these time controls? Which agency will report better results for your program?

That is not a hypothetical example because we recently discovered that our time control mechanism was less than optimal and certain time controls suffered more than others.


Roger Brown
Posts: 782
Joined: Wed Mar 08, 2006 9:22 pm

Re: ICCR project is planning to be canceled

Post by Roger Brown »

Sedat Canbaz wrote: Thank you for your useful notes, dear Don.

BTW, I can't wait to test your new MP engine.
Could you please tell me the expected release date of Komodo MP?

Best,
Sedat



Hello Sedat,

Speaking of impatience, I am gently reminding you of your promise to produce a Polyglot Perfect book for WinBoard users.

:-)

Later.
Sedat Canbaz
Posts: 3018
Joined: Thu Mar 09, 2006 11:58 am
Location: Antalya/Turkey

Re: ICCR project is planning to be canceled

Post by Sedat Canbaz »

Dear Roger,

Yes, I remember my promise (every day, in fact), and I am sorry for the delayed release.

But as you can see, the delay is due to my many chess activities.

Rest assured, within 1-2 weeks the Perfect 2012a.bin book will be available free for all chess lovers.

Actually, Perfect 2012a.bin is ready; I just want to release it as a full package (with many GUI book formats).

Best Regards,
Sedat
Sedat Canbaz
Posts: 3018
Joined: Thu Mar 09, 2006 11:58 am
Location: Antalya/Turkey

Re: ICCR project is planning to be canceled

Post by Sedat Canbaz »

Hello Don,

Actually, you are right on most points.

But I think you understand what I mean.

Sure, there is no perfect rating list; every list has its own advantages and disadvantages.

But there is one very strange thing that I cannot leave unmentioned:
- What if I were a tester on the CCRL team?

Say I tested my engines on my quad-core i7 920 @ 4.2 GHz and sent the games to CCRL (Graham).

Then I have a question for everyone:
- Where would my games (played on the i7 920 @ 4.2 GHz 4c) be ranked?

- I mean, below the AMD Phenom II X6 (6 CPUs) or above it?

Best,
Sedat