Scalability study for chess.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Scalability study for chess.

Post by Don »

Will a program with a very good evaluation function improve more with depth than one with a poor evaluation function?

I'm doing a scalability study for chess to help understand this. It's similar to past studies that measured how much a chess program improves with depth of search, but this study also tries to answer the question above.

In the study, I'm testing Toga with 11 different levels using 2 different evaluation functions, one weak and one strong. The details of the study are posted here with numbers and graphs:

http://greencheeks.homelinux.org:8015/

Any comments are welcome. I may extend the study to depth 12 or 13 after I've accumulated more data at the current levels.

This site is temporary and I'm simply hosting it on my home computer, while I'm looking for a more permanent place to host it. Does anyone have a publicly available site I can use for this?

I will also be making the PGN files containing all the games available. The study is ongoing and will continue for as long as my time and patience permit.

I am also interested in other studies which may have a bearing on this. I know that Hans Berliner did a study many years ago with 2 versions of Hitech (one of which he called Lotech), but the number of games was too low to have any real statistical meaning.
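That sample-size problem is easy to quantify. A rough sketch (a simple binomial model in Python; draws are ignored, so the numbers are illustrative only) of how the Elo confidence interval shrinks with the number of games:

```python
import math

def elo_diff(score):
    """Convert a match score fraction (0..1) to an Elo difference."""
    return -400 * math.log10(1 / score - 1)

def elo_interval(score, num_games, z=1.96):
    """Approximate 95% confidence interval for the Elo difference,
    treating each game as an independent win/loss trial (draws are
    ignored here; a real analysis would model them separately)."""
    se = math.sqrt(score * (1 - score) / num_games)
    lo = elo_diff(max(1e-9, score - z * se))
    hi = elo_diff(min(1 - 1e-9, score + z * se))
    return lo, hi

# The same 55% score is inconclusive over 100 games but meaningful
# over 10,000 games:
print(elo_interval(0.55, 100))    # roughly (-33, 106) Elo
print(elo_interval(0.55, 10000))  # roughly (28, 42) Elo
```

With only a hundred or so games, a gap of tens of Elo can disappear entirely into the noise, which is presumably what made the Hitech/Lotech numbers inconclusive.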
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Scalability study for chess.

Post by bob »

Don wrote:Will a program with a very good evaluation function improve more with depth than one with a poor evaluation function?

I'm doing a scalability study for chess to help understand this. It's similar to past studies that measured how much a chess program improves with depth of search, but this study also tries to answer the question above.

In the study, I'm testing Toga with 11 different levels using 2 different evaluation functions, one weak and one strong. The details of the study are posted here with numbers and graphs:

http://greencheeks.homelinux.org:8015/

Any comments are welcome. I may extend the study to depth 12 or 13 after I've accumulated more data at the current levels.

This site is temporary and I'm simply hosting it on my home computer, while I'm looking for a more permanent place to host it. Does anyone have a publicly available site I can use for this?

I will also be making the PGN files containing all the games available. The study is ongoing and will continue for as long as my time and patience permit.

I am also interested in other studies which may have a bearing on this. I know that Hans Berliner did a study many years ago with 2 versions of Hitech (one of which he called Lotech), but the number of games was too low to have any real statistical meaning.
My initial thought is that there is no significant difference. The gap between the two programs is smaller at the low end, but that could just be compression, since the Elo down there is so low. At the upper end the gap widens, but not by a lot, and I suspect that is normal: the "smarter" program ought to play better, and may well gain a bit more since it is artificially faster than its opponent. (Two programs searching to the same depth give an advantage to the slower program, since it gets to do more computation per move than its opponent. The best example might be Chessmaster, which is very slow; searching to depth 11 might take it forever in wild positions, but it would hardly miss a thing at that depth.)
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Scalability study for chess.

Post by Edmund »

By doing a fixed-depth search you are missing the interesting question: whether the time and depth gained through a lighter evaluation compensate for the lost knowledge.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Scalability study for chess.

Post by Don »

It's quite a coincidence that Tord recently posted about fixed-node levels, because this was a factor here. My testing method of choice has always been "fixed nodes" and I tested that way for years. However, most programs do not seem to support this method of testing. Had Toga supported it, I would have tested levels that were each 2X the nodes of the previous level.

Testing by time, as you suggest, is subject to more ambiguities, especially when I'm doing this test on a computer that does not have a constant load. Fixed depth was my only reasonable choice without hacking Toga.

From a casual glance at the game scores, I don't see an obvious difference in thinking time between the strong and weak versions of each program but I can average these up (per game) at some point and we can get a good estimate of how much more time one version uses per game than another.

When I do this I'll post this information on the same web site with an explanation.
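Averaging the per-move thinking times out of the PGNs can be sketched like this. The comment format below ({+0.20/10 3.1s}) is a hypothetical example, so the regex would need adjusting to whatever the real game files actually contain:

```python
import re

# Hypothetical per-move comment format: "{+0.20/10 3.1s}", where the
# number before "s}" is the thinking time for that move in seconds.
TIME_RE = re.compile(r"(\d+(?:\.\d+)?)s\}")

def average_move_time(pgn_text):
    """Average thinking time per move recorded in PGN comments."""
    times = [float(t) for t in TIME_RE.findall(pgn_text)]
    return sum(times) / len(times) if times else 0.0

sample = "1. e4 {+0.20/10 3.1s} e5 {-0.15/10 2.9s} 2. Nf3 {+0.25/11 4.0s} *"
print(average_move_time(sample))  # 3.333...
```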
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Scalability study for chess.

Post by Zach Wegner »

Don wrote:From a casual glance at the game scores, I don't see an obvious difference in thinking time between the strong and weak versions of each program but I can average these up (per game) at some point and we can get a good estimate of how much more time one version uses per game than another.
There probably won't be, because you're weakening the evaluation through its parameters: the terms are still calculated and then multiplied by 0, so both versions do the same amount of work.

BTW, I think the best way to test would be to hack Toga and use node counts. I might do this later if I get the time...
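For what it's worth, the termination logic for a fixed-node search is simple. Here is a toy negamax sketch (Python, over nested lists standing in for a game tree; none of this is Toga's actual code) showing where the budget check would go:

```python
class NodeLimitedSearch:
    """Toy negamax over nested lists (leaves are scores) that stops on a
    node budget rather than depth or time. A sketch only -- the point is
    the budget check, not the game rules."""

    def __init__(self, node_limit):
        self.node_limit = node_limit
        self.nodes = 0
        self.aborted = False

    def search(self, tree, depth, alpha=-10**9, beta=10**9):
        self.nodes += 1
        if self.nodes >= self.node_limit:
            # Budget spent: stop descending. Real engines poll every few
            # thousand nodes so the check costs essentially nothing.
            self.aborted = True
        if self.aborted or depth == 0 or not isinstance(tree, list):
            return tree if not isinstance(tree, list) else 0  # crude static eval
        best = alpha
        for child in tree:
            score = -self.search(child, depth - 1, -beta, -best)
            if score > best:
                best = score
            if best >= beta:
                break  # beta cutoff
        return best

# Given the same budget, two engines do the same amount of work, which is
# what makes fixed-node testing immune to machine load:
s = NodeLimitedSearch(node_limit=1000)
print(s.search([[3, 5], [2, 9]], depth=2), s.nodes)  # 3 6
```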
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Scalability study for chess.

Post by Don »

bob wrote:
Don wrote:Will a program with a very good evaluation function improve more with depth than one with a poor evaluation function?

I'm doing a scalability study for chess to help understand this. It's similar to past studies that measured how much a chess program improves with depth of search, but this study also tries to answer the question above.

In the study, I'm testing Toga with 11 different levels using 2 different evaluation functions, one weak and one strong. The details of the study are posted here with numbers and graphs:

http://greencheeks.homelinux.org:8015/

Any comments are welcome. I may extend the study to depth 12 or 13 after I've accumulated more data at the current levels.

This site is temporary and I'm simply hosting it on my home computer, while I'm looking for a more permanent place to host it. Does anyone have a publicly available site I can use for this?

I will also be making the PGN files containing all the games available. The study is ongoing and will continue for as long as my time and patience permit.

I am also interested in other studies which may have a bearing on this. I know that Hans Berliner did a study many years ago with 2 versions of Hitech (one of which he called Lotech), but the number of games was too low to have any real statistical meaning.
My initial thought is that there is no significant difference. The gap between the two programs is smaller at the low end, but that could just be compression, since the Elo down there is so low. At the upper end the gap widens, but not by a lot, and I suspect that is normal: the "smarter" program ought to play better, and may well gain a bit more since it is artificially faster than its opponent. (Two programs searching to the same depth give an advantage to the slower program, since it gets to do more computation per move than its opponent. The best example might be Chessmaster, which is very slow; searching to depth 11 might take it forever in wild positions, but it would hardly miss a thing at that depth.)
Bob,

I thought there would be more improvement for the strong version than I'm actually seeing, but we clearly need more data to get a good picture.

I would have preferred to do a time-based (or node-based) study, as I already mentioned in another post, since that is the most relevant measure for practical programs. Such a study would show the "real" difference between the two evaluation functions.

You pose an interesting question about the older programs that were far less selective. How would they do against modern programs at the same depth? How much is modern selectivity hurting programs today?

If you were to do a study of full width vs. selectivity by time, I'm pretty sure you would see the selective programs improve much faster. They play weaker at a given depth, but the extra speed more than compensates for this disadvantage, and indeed that is the whole point of selectivity.
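That intuition fits a back-of-the-envelope model: if a search visits roughly ebf**depth nodes, a lower effective branching factor buys much more depth in the same time. The numbers below are purely illustrative, not measurements of any real engine:

```python
import math

def depth_in_time(seconds, nodes_per_sec, ebf):
    """Depth reachable in a fixed time, assuming the tree grows as
    ebf**depth (an idealized model of search cost)."""
    nodes = seconds * nodes_per_sec
    return math.log(nodes) / math.log(ebf)

# A full-width searcher vs. a selective one at the same speed (1M nps,
# one minute per move); both branching factors are made up for illustration:
print(depth_in_time(60, 1_000_000, ebf=6.0))  # ~ depth 10
print(depth_in_time(60, 1_000_000, ebf=2.5))  # ~ depth 19.5
```

So even if selectivity costs some accuracy at each depth, the extra plies it buys per unit of time can easily dominate, which is exactly the trade a fixed-depth study cannot see.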

I could test full width vs. selective by depth later. Perhaps I will also hack Toga to allow testing by fixed nodes and do that very study too.
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: Scalability study for chess.

Post by Tord Romstad »

Don wrote:It's quite a coincidence that Tord recently posted about fixed-node levels, because this was a factor here. My testing method of choice has always been "fixed nodes" and I tested that way for years. However, most programs do not seem to support this method of testing. Had Toga supported it, I would have tested levels that were each 2X the nodes of the previous level.

Testing by time, as you suggest, is subject to more ambiguities, especially when I'm doing this test on a computer that does not have a constant load. Fixed depth was my only reasonable choice without hacking Toga.
You could use Glaurung instead, you know. It supports fixed-node searches, and if you ask the author for the latest development version, you'll get a program not significantly weaker than Toga for your tests. :wink:

Tord
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Scalability study for chess.

Post by Don »

Zach Wegner wrote:
Don wrote:From a casual glance at the game scores, I don't see an obvious difference in thinking time between the strong and weak versions of each program but I can average these up (per game) at some point and we can get a good estimate of how much more time one version uses per game than another.
There probably won't be, because you're weakening the evaluation through its parameters: the terms are still calculated and then multiplied by 0, so both versions do the same amount of work.

BTW, I think the best way to test would be to hack Toga and use node counts. I might do this later if I get the time...
Yes, please do. If you do this and make the source available, I can make a version for my testing machine (64 bit Linux) and I would do another study based on this.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Scalability study for chess.

Post by Don »

Tord Romstad wrote:
Don wrote:It's quite a coincidence that Tord recently posted about fixed-node levels, because this was a factor here. My testing method of choice has always been "fixed nodes" and I tested that way for years. However, most programs do not seem to support this method of testing. Had Toga supported it, I would have tested levels that were each 2X the nodes of the previous level.

Testing by time, as you suggest, is subject to more ambiguities, especially when I'm doing this test on a computer that does not have a constant load. Fixed depth was my only reasonable choice without hacking Toga.
You could use Glaurung instead, you know. It supports fixed-node searches, and if you ask the author for the latest development version, you'll get a program not significantly weaker than Toga for your tests. :wink:

Tord
Tord,

That would be perfect for this. Can you make me a Linux version of your best stable version for this study? I would prefer a 64 bit version if it performs better, but I have the libraries installed to be able to run 32 bit binaries.
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: Scalability study for chess.

Post by Tord Romstad »

Don,

Please check your private messages. :)

Tord