Another Crafty-23.1 Nehalem scaling problem

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Another Crafty-23.1 Nehalem scaling problem

Post by zullil »

It occurs to me that Snow Leopard on my box can be booted with either a 32-bit kernel (the default, and what I have been using) or a 64-bit kernel.

I wonder if the choice I make will have any effect on scaling.

Are there particular lines in the output from

sysctl -a

that would be of interest here?
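For what it's worth, the lines that look most relevant to core/thread topology are probably these (assuming the 10.6-era sysctl namespace; names could differ slightly by release):

```shell
# macOS-only: report physical vs. logical CPU counts.
# With hyper-threading enabled, the logical count is double the physical one.
sysctl hw.physicalcpu hw.logicalcpu
sysctl machdep.cpu.core_count machdep.cpu.thread_count
```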
zullil

Re: Another Crafty-23.1 Nehalem scaling problem

Post by zullil »

I've come to several conclusions:

1) Crafty is not the problem. :)

2) icc is not working correctly for me.

3) The slightly poor scaling with mt=8 shown below is likely due to the fact that hyper-threading currently cannot be disabled in OS X 10.6.2. :shock: There is a checkbox in a preference pane that is supposed to do this, but testing (with Crafty and gnubg) has confirmed that toggling the status of that checkbox does not disable hyper-threading. Am about to file a bug report with Apple.

Thanks, Bob, for your assistance.



zullil wrote:
bob wrote:Here's what I would expect:

log.001: time=30.20 mat=0 n=97218373 fh=95% nps=3.2M
log.002: time=30.35 mat=0 n=198382851 fh=95% nps=6.5M
log.003: time=31.01 mat=0 n=399493603 fh=94% nps=12.9M
log.004: time=30.89 mat=0 n=690102470 fh=94% nps=22.3M

I ran the same position for 30 seconds, using 1, 2, 4 and 8 cpus. Scaling is about 7 on this box with the current version (almost identical to 23.1). I had slightly better scaling numbers on a Nehalem, but we don't currently have one up and running... But Nehalem ought to be somewhat better than this Core 2 Xeon box, since the Nehalem has a better memory system.
Something seems amiss with my icc, so I switched back to gcc. This is on the new Snow Leopard system. Scaling seems pretty good, I guess. Will repeat this exact experiment with my old Leopard system on the same hardware.

Code: Select all

darwin:
        $(MAKE) target=FreeBSD \
                CC=gcc CXX=g++ \
                CFLAGS='$(CFLAGS) -O3 -msse4.2' \
                CXFLAGS='$(CFLAGS) -O3 -msse4.2' \
                LDFLAGS=$(LDFLAGS) \
                LIBS='-lpthread -lstdc++' \
                opt='-DCPUS=8 -DINLINE64' \
                crafty-make

Code: Select all

max threads set to 1.
Crafty v23.1 (1 cpus)
White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=30.59  mat=0  n=97395988  fh=91%  nps=3.2M
extensions=3.3M qchecks=2.9M reduced=7.8M pruned=38.2M
predicted=0  evals=44.0M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=0  aborts=0  data=0/512  elap=30.59


max threads set to 2.
Crafty v23.1 (2 cpus)
White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=31.11  mat=0  n=175166321  fh=91%  nps=5.6M
extensions=5.9M qchecks=5.1M reduced=13.6M pruned=66.9M
predicted=0  evals=81.0M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=490  aborts=69  data=5/512  elap=31.11


max threads set to 4.
Crafty v23.1 (4 cpus)
White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=30.93  mat=0  n=345958868  fh=91%  nps=11.2M
extensions=12.7M qchecks=11.2M reduced=27.4M pruned=137.6M
predicted=0  evals=153.1M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=4036  aborts=648  data=14/512  elap=30.93


max threads set to 8.
Crafty v23.1 (8 cpus)
White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=30.78  mat=0  n=614520779  fh=90%  nps=20.0M
extensions=24.6M qchecks=22.7M reduced=48.3M pruned=250.0M
predicted=0  evals=262.9M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=72982  aborts=13450  data=41/512  elap=30.78
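The speedups implied by these runs can be checked directly from the node counts and times in the logs above (a quick sketch; the numbers are copied from the Snow Leopard output, and "speedup" here is just NPS relative to the 1-cpu run):

```python
# Node counts and elapsed times from the 30-second runs above.
nodes = {1: 97395988, 2: 175166321, 4: 345958868, 8: 614520779}
times = {1: 30.59, 2: 31.11, 4: 30.93, 8: 30.78}

base = nodes[1] / times[1]  # single-thread NPS
for cpus in (1, 2, 4, 8):
    nps = nodes[cpus] / times[cpus]
    # At 8 cpus the speedup comes out around 6.3 -- below bob's ~7.
    print(f"{cpus} cpus: nps={nps / 1e6:.1f}M  speedup={nps / base:.2f}x")
```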

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Another Crafty-23.1 Nehalem scaling problem

Post by bob »

zullil wrote:I've come to several conclusions:

1) Crafty is not the problem. :)

2) icc is not working correctly for me.

3) The slightly poor scaling with mt=8 shown below is likely due to the fact that hyper-threading currently cannot be disabled in OS X 10.6.2. :shock: There is a checkbox in a preference pane that is supposed to do this, but testing (with Crafty and gnubg) has confirmed that toggling the status of that checkbox does not disable hyper-threading. Am about to file a bug report with Apple.

Thanks, Bob, for your assistance.

Is there no BIOS setting that turns this off? Every machine I have worked on has allowed this. More importantly, any recent Linux kernel handles hyper-threading just fine: it first schedules one process per physical core, and only then starts scheduling processes on the logical cores, which share resources between the two logical cores on a single physical core. If the Mac OS scheduler doesn't understand this, then leaving HT on is a real problem.

In general, when these machines boot up, you see something like "Press <F2> to enter setup" or some such. There you will find, somewhere, a setting like "logical processor" which is on or off. You want it off; then HT is disabled. The wording is not always clear, but look for something like that. It does vary, but it has always been there on every box I have used that had PIV or Nehalem processors.
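On Linux, one way to check whether hyper-threading is actually in effect is to compare the logical processor count against the number of distinct (physical id, core id) pairs in /proc/cpuinfo (a sketch; assumes the kernel exposes the topology fields, with a fallback when it doesn't):

```shell
# Count logical processors vs. unique physical cores.
# If the two numbers match, hyper-threading is off (or absent).
logical=$(grep -c ^processor /proc/cpuinfo)
physical=$(awk -F: '/physical id|core id/ {printf "%s", $2}
                    /core id/ {print ""}' /proc/cpuinfo | sort -u | wc -l)
# Fallback for kernels/VMs that omit the topology fields.
[ "$physical" -eq 0 ] && physical=$logical
echo "logical=$logical physical_cores=$physical"
```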


zullil wrote:
bob wrote:Here's what I would expect:

log.001: time=30.20 mat=0 n=97218373 fh=95% nps=3.2M
log.002: time=30.35 mat=0 n=198382851 fh=95% nps=6.5M
log.003: time=31.01 mat=0 n=399493603 fh=94% nps=12.9M
log.004: time=30.89 mat=0 n=690102470 fh=94% nps=22.3M

I ran the same position for 30 seconds, using 1, 2, 4 and 8 cpus. Scaling is about 7 on this box with the current version (almost identical to 23.1). I had slightly better scaling numbers on a Nehalem, but we don't currently have one up and running... But Nehalem ought to be somewhat better than this Core 2 Xeon box, since the Nehalem has a better memory system.
Something seems amiss with my icc, so I switched back to gcc. This is on the new Snow Leopard system. Scaling seems pretty good, I guess. Will repeat this exact experiment with my old Leopard system on the same hardware.

Code: Select all

darwin:
        $(MAKE) target=FreeBSD \
                CC=gcc CXX=g++ \
                CFLAGS='$(CFLAGS) -O3 -msse4.2' \
                CXFLAGS='$(CFLAGS) -O3 -msse4.2' \
                LDFLAGS=$(LDFLAGS) \
                LIBS='-lpthread -lstdc++' \
                opt='-DCPUS=8 -DINLINE64' \
                crafty-make

Code: Select all

max threads set to 1.
Crafty v23.1 (1 cpus)
White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=30.59  mat=0  n=97395988  fh=91%  nps=3.2M
extensions=3.3M qchecks=2.9M reduced=7.8M pruned=38.2M
predicted=0  evals=44.0M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=0  aborts=0  data=0/512  elap=30.59


max threads set to 2.
Crafty v23.1 (2 cpus)
White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=31.11  mat=0  n=175166321  fh=91%  nps=5.6M
extensions=5.9M qchecks=5.1M reduced=13.6M pruned=66.9M
predicted=0  evals=81.0M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=490  aborts=69  data=5/512  elap=31.11


max threads set to 4.
Crafty v23.1 (4 cpus)
White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=30.93  mat=0  n=345958868  fh=91%  nps=11.2M
extensions=12.7M qchecks=11.2M reduced=27.4M pruned=137.6M
predicted=0  evals=153.1M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=4036  aborts=648  data=14/512  elap=30.93


max threads set to 8.
Crafty v23.1 (8 cpus)
White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=30.78  mat=0  n=614520779  fh=90%  nps=20.0M
extensions=24.6M qchecks=22.7M reduced=48.3M pruned=250.0M
predicted=0  evals=262.9M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=72982  aborts=13450  data=41/512  elap=30.78

bob

Re: Another Crafty-23.1 Nehalem scaling problem

Post by bob »

zullil wrote:Here are results from my 10.5.8 Leopard system on the same box. I'm coming to the conclusion that OS X is still learning to deal with Nehalem. I'm giving up on this now--too frustrating. Thanks for the help.

Code: Select all

max threads set to 1.
Crafty v23.1 (1 cpus)
White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=30.79  mat=0  n=97395988  fh=91%  nps=3.2M
extensions=3.3M qchecks=2.9M reduced=7.8M pruned=38.2M
predicted=0  evals=44.0M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=0  aborts=0  data=0/512  elap=30.79


max threads set to 2.

Crafty v23.1 (2 cpus)

White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=30.44  mat=0  n=163726848  fh=91%  nps=5.4M
extensions=5.8M qchecks=5.1M reduced=12.9M pruned=64.3M
predicted=0  evals=73.3M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=523  aborts=84  data=7/512  elap=30.44


max threads set to 4.
Crafty v23.1 (4 cpus)
White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=30.32  mat=0  n=301277071  fh=91%  nps=9.9M
extensions=11.1M qchecks=9.8M reduced=24.3M pruned=120.4M
predicted=0  evals=132.3M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=3574  aborts=643  data=15/512  elap=30.32


max threads set to 8.
Crafty v23.1 (8 cpus)
White(1): setboard 1rbr2k1/1q2bpp1/2pppn2/6B1/p3P3/2N2P2/PPP4P/1K1RQBR1 w - - 1 19

time=30.05  mat=0  n=468325941  fh=91%  nps=15.6M
extensions=17.6M qchecks=15.8M reduced=37.9M pruned=192.6M
predicted=0  evals=200.4M  50move=0  EGTBprobes=0  hits=0
SMP->  splits=53739  aborts=9453  data=38/512  elap=30.05

Something is definitely up. When you got the Nehalem box, did it come with a new OS X kernel, one that actually understands hyper-threading? I used to see just your kind of results when HT first came out. The older Linux kernels would see 4 logical CPUs on my dual PIV box, and when running 2 threads they would just as likely put both threads on one physical processor, using both of its logical processors, which is far worse than running one thread per physical CPU. Newer kernels handle this perfectly now.
zullil

Re: Another Crafty-23.1 Nehalem scaling problem

Post by zullil »

I've just discovered a bit more.

I can boot OS X 10.6.2 with either a 32-bit kernel or a 64-bit kernel. If I boot using the 64-bit kernel, then the Preference pane checkbox to enable/disable H-T doesn't work. But with the 32-bit kernel it functions correctly. With H-T off and mt=8, all eight physical cores are active.

Hoping to have this all sorted out soon!
zullil

Re: Another Crafty-23.1 Nehalem scaling problem

Post by zullil »

zullil wrote:I've just discovered a bit more.

I can boot OS X 10.6.2 with either a 32-bit kernel or a 64-bit kernel. If I boot using the 64-bit kernel, then the Preference pane checkbox to enable/disable H-T doesn't work. But with the 32-bit kernel it functions correctly. With H-T off and mt=8, all eight physical cores are active.

Hoping to have this all sorted out soon!
(And with H-T on and mt=8 it uses 8 distinct physical cores. So I guess it really shouldn't matter how H-T is set, as long as I stick to mt <= 8. )
bob

Re: Another Crafty-23.1 Nehalem scaling problem

Post by bob »

zullil wrote:
zullil wrote:I've just discovered a bit more.

I can boot OS X 10.6.2 with either a 32-bit kernel or a 64-bit kernel. If I boot using the 64-bit kernel, then the Preference pane checkbox to enable/disable H-T doesn't work. But with the 32-bit kernel it functions correctly. With H-T off and mt=8, all eight physical cores are active.

Hoping to have this all sorted out soon!
(And with H-T on and mt=8 it uses 8 distinct physical cores. So I guess it really shouldn't matter how H-T is set, as long as I stick to mt <= 8. )
So long as you can verify that 8 physical cores are being used. The cores are numbered quite non-intuitively to me, in that the logical/physical numbering occurs in an order that is not what I would expect. One of the guys at Intel who works on the Linux process scheduler explained the scheme to me, and I am not sure I would be able to remember it now, since six months have elapsed.
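On Linux, the logical-to-physical mapping doesn't have to be remembered: the kernel exposes it under sysfs, so each logical CPU's core assignment and its hyper-thread siblings can be listed directly (a sketch; assumes the sysfs topology files are present, as they are on any recent kernel):

```shell
# Print each logical CPU's physical core id and its sibling threads.
# Two logical CPUs listed as siblings share one physical core (HT pair).
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  core=$(cat "$cpu/topology/core_id" 2>/dev/null) || continue
  sib=$(cat "$cpu/topology/thread_siblings_list" 2>/dev/null)
  echo "${cpu##*/}: core_id=$core siblings=$sib"
done
```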