PREFETCH vs POPCNT vs ...

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

kingliveson

PREFETCH vs POPCNT vs ...

Post by kingliveson »

What kind of performance relative to one another should one expect with the following compile options:

POPCNT
PREFETCH
POPCNT/PREFETCH
DEFAULT

I am running a test with the same engine compiled with these different options; 300 1+0 games, and will post result when done. Anyone care to guess which will come out on top? The System is an AMD Phenom II 940.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: PREFETCH vs POPCNT vs ...

Post by Milos »

They combined can give like 10-15% speed improvement, so 10-12 ELO max.
On 300 games you error bars will around 30 ELO. So not conclusive which ever result you get...
kingliveson

Re: PREFETCH vs POPCNT vs ...

Post by kingliveson »

Milos wrote:They combined can give like 10-15% speed improvement, so 10-12 ELO max.
On 300 games you error bars will around 30 ELO. So not conclusive which ever result you get...
yea, I agree this result will not be conclusive. It may give a little hint though. Maybe some day when I have more time, I will do an extensive test. On paper, the combined should be better. I think the head-to-head match up will be really interesting...
bnemias
Posts: 373
Joined: Thu Aug 14, 2008 3:21 am
Location: Albuquerque, NM

Re: PREFETCH vs POPCNT vs ...

Post by bnemias »

kingliveson wrote:What kind of performance relative to one another should one expect with the following compile options:

POPCNT
PREFETCH
POPCNT/PREFETCH
DEFAULT
Can't you determine performance benefit without running a tournament?

I tested prefetch and alignment with 2 engines on two platforms. I found prefetch to be worthless. I could at least measure alignment benefit, but it was still very small. I tested on MIPS32 and core2.
kingliveson

Re: PREFETCH vs POPCNT vs ...

Post by kingliveson »

bnemias wrote:
kingliveson wrote:What kind of performance relative to one another should one expect with the following compile options:

POPCNT
PREFETCH
POPCNT/PREFETCH
DEFAULT
Can't you determine performance benefit without running a tournament?

I tested prefetch and alignment with 2 engines on two platforms. I found prefetch to be worthless. I could at least measure alignment benefit, but it was still very small. I tested on MIPS32 and core2.
One can always do bench marks...raw data like a tournament is another way. I have always wondered whether prefetching would actually have a negative effect or any effect at all for chess engines. Perhaps a smart developer can manipulate the behavior of the instruction set call--how often and how it is used in a chess program to take full advantage.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: PREFETCH vs POPCNT vs ...

Post by bob »

kingliveson wrote:
Milos wrote:They combined can give like 10-15% speed improvement, so 10-12 ELO max.
On 300 games you error bars will around 30 ELO. So not conclusive which ever result you get...
yea, I agree this result will not be conclusive. It may give a little hint though. Maybe some day when I have more time, I will do an extensive test. On paper, the combined should be better. I think the head-to-head match up will be really interesting...
You have to be careful with this "little hint" idea, which suggests that if A is better than B then that means something, whereas with a large error bar, it really means nothing at all, not even a "little hint".
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: PREFETCH vs POPCNT vs ...

Post by bob »

kingliveson wrote:What kind of performance relative to one another should one expect with the following compile options:

POPCNT
PREFETCH
POPCNT/PREFETCH
DEFAULT

I am running a test with the same engine compiled with these different options; 300 1+0 games, and will post result when done. Anyone care to guess which will come out on top? The System is an AMD Phenom II 940.
I am not sure how a compiler is going to use POPCNT, or what it is supposed to do, since there is no C semantic for doing that. Which means those of us doing a popcnt operation use some sort of table lookup idea, or else inline asm based on the A & A-1 trick. You'd have to either use inline asm and use a popcnt instruction, or else use the MSCV popcnt intrinsic which is not compatible with other compilers.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: PREFETCH vs POPCNT vs ...

Post by Milos »

bob wrote:I am not sure how a compiler is going to use POPCNT, or what it is supposed to do, since there is no C semantic for doing that. Which means those of us doing a popcnt operation use some sort of table lookup idea, or else inline asm based on the A & A-1 trick. You'd have to either use inline asm and use a popcnt instruction, or else use the MSCV popcnt intrinsic which is not compatible with other compilers.
The story about popcnt is quite simple.
If your processor supports SSE4a use hardware one. If it doesn't unless your bitboards are really sparse (when you use assemble optimized B&(B-1)) use the famous one from:
http://www.amd.com/us-en/assets/content ... /25112.PDF.
Everything else (especially compiler ones) is slower in practice for chess implementations...
kingliveson

Re: PREFETCH vs POPCNT vs ...

Post by kingliveson »

Milos wrote:
bob wrote:I am not sure how a compiler is going to use POPCNT, or what it is supposed to do, since there is no C semantic for doing that. Which means those of us doing a popcnt operation use some sort of table lookup idea, or else inline asm based on the A & A-1 trick. You'd have to either use inline asm and use a popcnt instruction, or else use the MSCV popcnt intrinsic which is not compatible with other compilers.
The story about popcnt is quite simple.
If your processor supports SSE4a use hardware one. If it doesn't unless your bitboards are really sparse (when you use assemble optimized B&(B-1)) use the famous one from:
http://www.amd.com/us-en/assets/content ... /25112.PDF.
Everything else (especially compiler ones) is slower in practice for chess implementations...
This is just a bit manipulation that is specific to hardware and not compilers or language standards. The code will compile, but if your hardware does not understand the machine language instructions, the program crashes. The algorithm can also be accomplished for machines without popcnt instruction set using inline asm that may not be as efficient. I am interested in the benefit it provides for chess programs.


bob wrote:
kingliveson wrote:
Milos wrote:They combined can give like 10-15% speed improvement, so 10-12 ELO max.
On 300 games you error bars will around 30 ELO. So not conclusive which ever result you get...
yea, I agree this result will not be conclusive. It may give a little hint though. Maybe some day when I have more time, I will do an extensive test. On paper, the combined should be better. I think the head-to-head match up will be really interesting...
You have to be careful with this "little hint" idea, which suggests that if A is better than B then that means something, whereas with a large error bar, it really means nothing at all, not even a "little hint".
I stand corrected. Since I promised to post the outcome, I will do just that. I would like suggestion though on the minimum games that one would need to start seeing a pattern. This question has been asked repeatedly as to how these optimizations affect chess engine perfomance. I and many others are interested in the final out come.

There was a crash along the way somewhere, but here is the results:

Code: Select all

Individual statistics:

1 DEFAULT                   :    8  124 (+ 24,= 80,- 20), 51.6 %

PREFETCH/POPCNT               :  42 (+  5,= 31,-  6), 48.8 %
POPCNT                        :  41 (+  9,= 27,-  5), 54.9 %
PREFETCH                      :  41 (+ 10,= 22,-  9), 51.2 %

2 PREFETCH/POPCNT           :    6  124 (+ 18,= 91,- 15), 51.2 %

DEFAULT                       :  42 (+  6,= 31,-  5), 51.2 %
POPCNT                        :  41 (+  4,= 31,-  6), 47.6 %
PREFETCH                      :  41 (+  8,= 29,-  4), 54.9 %

3 PREFETCH                  :   -2  123 (+ 20,= 82,- 21), 49.6 %

DEFAULT                       :  41 (+  9,= 22,- 10), 48.8 %
PREFETCH/POPCNT               :  41 (+  4,= 29,-  8), 45.1 %
POPCNT                        :  41 (+  7,= 31,-  3), 54.9 %

4 POPCNT                    :  -13  123 (+ 14,= 89,- 20), 47.6 %

DEFAULT                       :  41 (+  5,= 27,-  9), 45.1 %
PREFETCH/POPCNT               :  41 (+  6,= 31,-  4), 52.4 %
PREFETCH                      :  41 (+  3,= 31,-  7), 45.1 %

The test was done using 25 positions with color switch.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: PREFETCH vs POPCNT vs ...

Post by bob »

kingliveson wrote:
Milos wrote:
bob wrote:I am not sure how a compiler is going to use POPCNT, or what it is supposed to do, since there is no C semantic for doing that. Which means those of us doing a popcnt operation use some sort of table lookup idea, or else inline asm based on the A & A-1 trick. You'd have to either use inline asm and use a popcnt instruction, or else use the MSCV popcnt intrinsic which is not compatible with other compilers.
The story about popcnt is quite simple.
If your processor supports SSE4a use hardware one. If it doesn't unless your bitboards are really sparse (when you use assemble optimized B&(B-1)) use the famous one from:
http://www.amd.com/us-en/assets/content ... /25112.PDF.
Everything else (especially compiler ones) is slower in practice for chess implementations...
This is just a bit manipulation that is specific to hardware and not compilers or language standards. The code will compile, but if your hardware does not understand the machine language instructions, the program crashes. The algorithm can also be accomplished for machines without popcnt instruction set using inline asm that may not be as efficient. I am interested in the benefit it provides for chess programs.


bob wrote:
kingliveson wrote:
Milos wrote:They combined can give like 10-15% speed improvement, so 10-12 ELO max.
On 300 games you error bars will around 30 ELO. So not conclusive which ever result you get...
yea, I agree this result will not be conclusive. It may give a little hint though. Maybe some day when I have more time, I will do an extensive test. On paper, the combined should be better. I think the head-to-head match up will be really interesting...
You have to be careful with this "little hint" idea, which suggests that if A is better than B then that means something, whereas with a large error bar, it really means nothing at all, not even a "little hint".
I stand corrected. Since I promised to post the outcome, I will do just that. I would like suggestion though on the minimum games that one would need to start seeing a pattern. This question has been asked repeatedly as to how these optimizations affect chess engine perfomance. I and many others are interested in the final out come.

There was a crash along the way somewhere, but here is the results:

Code: Select all

Individual statistics:

1 DEFAULT                   :    8  124 (+ 24,= 80,- 20), 51.6 %

PREFETCH/POPCNT               :  42 (+  5,= 31,-  6), 48.8 %
POPCNT                        :  41 (+  9,= 27,-  5), 54.9 %
PREFETCH                      :  41 (+ 10,= 22,-  9), 51.2 %

2 PREFETCH/POPCNT           :    6  124 (+ 18,= 91,- 15), 51.2 %

DEFAULT                       :  42 (+  6,= 31,-  5), 51.2 %
POPCNT                        :  41 (+  4,= 31,-  6), 47.6 %
PREFETCH                      :  41 (+  8,= 29,-  4), 54.9 %

3 PREFETCH                  :   -2  123 (+ 20,= 82,- 21), 49.6 %

DEFAULT                       :  41 (+  9,= 22,- 10), 48.8 %
PREFETCH/POPCNT               :  41 (+  4,= 29,-  8), 45.1 %
POPCNT                        :  41 (+  7,= 31,-  3), 54.9 %

4 POPCNT                    :  -13  123 (+ 14,= 89,- 20), 47.6 %

DEFAULT                       :  41 (+  5,= 27,-  9), 45.1 %
PREFETCH/POPCNT               :  41 (+  6,= 31,-  4), 52.4 %
PREFETCH                      :  41 (+  3,= 31,-  7), 45.1 %

The test was done using 25 positions with color switch.
Look at the "interesting test results" thread I started. Things can change significantly even after thousands of games. The data in that post is pretty easy to follow.