yorkman wrote: ↑Thu Apr 09, 2020 4:37 am
Here are my results with SF9, the same as in the benchmark from this thread:
./stockfish-9-popcnt bench 1024 256 26
Total time (ms) : 519221
Nodes searched : 45414888487
Nodes/second : 87467356
Look at the Total time! It took forever to finish this benchmark in the latest CentOS 8.1. In Windows, temps were fine and the clock speed was good at 3.20 GHz. Win'10 Ent reported 4 logical processors with HT enabled. I have two physical processors @ 64 cores each. I'm now using Linux but will probably go back to Win'10 Ent. since I now know the problem persists in Linux too. Strangely, I did get 160,000 kN/s to 170,000 kN/s a few times in Linux but for some reason not anymore. It may be because I went from 4 to 8 DIMMs (8x8=64GB).
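As a sanity check on the bench summary quoted above, here is a minimal Python sketch that parses the three summary lines and recomputes nodes per second from the other two. The helper names `parse_bench` and `recomputed_nps` are made up for this example, and the integer formula (nodes * 1000 / milliseconds) is an assumption about how the engine derives the figure, not something taken from its source.

```python
import re

def parse_bench(text):
    """Pull total time (ms), nodes searched and reported NPS out of bench summary text."""
    fields = {}
    for key, label in (("time_ms", "Total time (ms)"),
                       ("nodes", "Nodes searched"),
                       ("nps", "Nodes/second")):
        match = re.search(re.escape(label) + r"\s*:\s*(\d+)", text)
        if match is None:
            raise ValueError(f"missing field: {label}")
        fields[key] = int(match.group(1))
    return fields

def recomputed_nps(fields):
    # Assumption: NPS is nodes * 1000 / elapsed milliseconds, truncated to an integer.
    return fields["nodes"] * 1000 // fields["time_ms"]

# Summary lines copied from the post above.
summary = """
Total time (ms) : 519221
Nodes searched : 45414888487
Nodes/second : 87467356
"""

result = parse_bench(summary)
minutes = result["time_ms"] / 1000 / 60
print(f"bench ran for about {minutes:.1f} minutes")
print("recomputed NPS:", recomputed_nps(result), "reported NPS:", result["nps"])
```

For this run the recomputed and reported figures agree, so the low number comes from the search itself being slow, not from a reporting glitch.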
@zenmastur: That's the only ram I have at the moment. I wanted to buy 16 * 3200MHz but it's very hard to get that right now with the covid19 virus. And the prices are ridiculous too, especially since the cad$ dropped a lot so for the time being I have to wait and test with what I have.
You need to use a utility that lets you see the individual temperature and clock speed/%cpu load of each physical core in real time while the benchmark is running. Otherwise you are only guessing what's going on. I just looked at 512GB (8 x 64GB) of ECC RDIMMS DDR4-3200 for < $3200. It wasn't hard to find and considering the cost of one Epyc 7742 is over $7500 I wouldn't call it expensive. Even if you put 2TB of ram in your system it's still less than the cost of the CPU's.
I would try some simple tests before I got my panties in a bunch. Like what is the speed in Mnps of a single core running SF on your system. Then you can judge if this is what you should expect considering the CPU clock speed as compared to other RYZEN systems. If it's NOT a "reasonable" number then you have a starting point. If it is a "reasonable" number then double the core count and repeat.
It should scale until 64 cores are reached. You do need to make sure all cores are on the same NUMA node for the scaling to be "good". Once you go over the 64-core limit you should see the scaling change (or any time the cores are split between two NUMA nodes). If it's a NUMA issue you shouldn't see much scaling loss UNTIL you get cores on more than one NUMA node. If it's a memory latency issue you should see LOW NPS on a single core. If it's a bandwidth issue you should see it scale poorly as soon as the memory is near saturation. If you test the system with an app like Aida64 you can see both the latency and max memory bandwidth, so you should be able to guess, based on NPS per core, when the memory bus should saturate.
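The doubling test described above can be sketched in a few lines of Python: record the NPS at 1, 2, 4, ... threads, divide each by (single-thread NPS x thread count), and look for the first thread count where that ratio falls off. The function names and the 0.8 cutoff are arbitrary choices for this sketch, and the sample numbers are made up purely to show the mechanics; they are not measurements from any system in this thread.

```python
def scaling_report(nps_by_threads, threshold=0.8):
    """Return (threads, efficiency, ok) tuples, with efficiency relative to 1-thread NPS.

    threshold is an arbitrary cutoff for this sketch, not a recommended value.
    """
    base = nps_by_threads[1]  # single-thread NPS is the reference point
    report = []
    for threads in sorted(nps_by_threads):
        efficiency = nps_by_threads[threads] / (base * threads)
        report.append((threads, efficiency, efficiency >= threshold))
    return report

def first_drop(report):
    """Smallest thread count whose efficiency fell below the threshold, or None."""
    for threads, efficiency, ok in report:
        if not ok:
            return threads
    return None

# Made-up placeholder numbers (kN/s), NOT real measurements.
sample = {1: 1000, 2: 2000, 4: 3900, 8: 6000}
report = scaling_report(sample)
for threads, efficiency, ok in report:
    print(f"{threads:3d} threads: efficiency {efficiency:.2f} {'ok' if ok else 'DROP'}")
print("first drop at:", first_drop(report))
```

If the first drop shows up right where the threads spill onto a second NUMA node, that points at NUMA; if the single-thread number is already low, that points at latency instead.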
Regards,
Zenmastur
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
Yes of course I used a utility to see the individual temps/clock speeds and each node is 2800 MHz with the turbo speed boost.
As for $3200 for that ram...I spent all my savings on the cpus and the rest was supposed to be for ram 16x16GB (256 GB) of ECC DDR-3200 but that's on hold now since it's hard to find compatible ram in stock for this motherboard (as per Supermicro). I was supposed to buy it for under $2000 but it's a lot more than that now so I'm just waiting for the dollar to go back up and prices to relax. Just because I spent a lot of money on the cpu's doesn't mean I can spend so much on ram. And I don't need more than 256 GB as even that is overkill right now. I just want to fill all the slots that way I get octa memory bandwidth.
Anyway, I went back to Win'10 Enterprise. I added the other 4 dimms of ram I had to give me 8 total and 4 channel memory at 2166MHz. Now here's the strange thing. In Aquarium I'll analyze various positions and usually get about 170-180,000 kN/s with my latest compiled SF with LP added. But when I just run the exe directly and then bench 1024 256 26 I get the kind of results similar to before or even worse since I now get about 58,000 kN/s. But again, in Aquarium it's reasonable with 180,000 kN/s.
And yes I've already tried with asmFishW-2017-05-22_popcnt before but still never got anywhere near 230,000+ kN/s (I think it was about 150,000 kN/s that time).
One thing I noticed compared to SF is that with asmF I get 3.20 GHz per core whereas with SF it's 2.80 GHz per core. Temps during this time are 68C per cpu.
Also thanks for the ipman links but I'm well aware of the data on his site and I saw what others got with the same cpu's.
Thanks for your results. I don't know what system that is but looks like a single 64 core and those results are really good compared to what I'm getting with dual 7742's.
Almost no difference with LP. I know I have LP working on my system because when I start my compiled SF it reports 16 Mb Large Pages (and I can also tell by looking at the memory info for the SF process in the Resource Manager):
Stockfish 090420 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
info string Hash LargePages 16 Mb
info string Found 0 tablebases
Given these results, I just don't believe that performance would be about 50% slower with the 8 * 2166MHz ram than if I was to get 16 * 3200MHz ram. But I guess I hope I'm wrong because there's not much else that I can try that I haven't already.
Actually hang on. I just realized the 2nd results with LP enabled says large pages are NOT enabled. But when I run SF with LP it says it's enabled. I think this is because when I added myself to Lock pages in Memory using gpedit.msc I did it when I renamed my computer name but didn't reboot so it was showing as User1 instead of myhostname\User1. Still strange that SF showed large pages enabled.
Anyway I've removed and readded my account to Lock pages in memory and after rebooting I see LP enabled in asmFish but results were even worse:
and yet I continue to get 180,000 kN/s in many positions using Aquarium although that's with 32GB hash. From the start position Aquarium shows me 170,000 kN/s using SF with LP enabled.
Clock speed starts at 1500 but it is 2800MHz when SF bench is running, and 3200MHz when asmFish bench is running.
And with the 8 * 8 GB 2166MHz ram it seems I do get octal memory channel already. I thought I needed all dimm slots populated for that.
In Aquarium I just tried the asmFish 2017-05-22-popcnt engine and I got 248,000 kN/s with LP and HT enabled in many positions. This is more like it. And I just realized that the bench is only giving me poor results like 94,000 kN/s when I bench with only 1024 MB of hash. When I set hash to 32GB I get the same or better speeds than in Aquarium with the same engine:
yorkman wrote: ↑Fri Apr 10, 2020 3:14 am
In Aquarium I just tried the asmFish 2017-05-22-popcnt engine and I got 248,000 kN/s with LP and HT enabled in many positions. This is more like it. And I just realized that the bench is only giving me poor results like 94,000 kN/s when I bench with only 1024 MB of hash. When I set hash to 32GB I get the same or better speeds than in Aquarium with the same engine:
Yes, on my other dual Xeon E5-2696v3 I'd get about 10-15% more kN/s with LP enabled. On this system using Brainfish, with HT and LP enabled I get 37% more kN/s.
That means I go from 140,000 kN/s to 192,000 kN/s. Still missing about 38,000 kN/s, or 20%, though.
This also means that asmFishW on my system is a whopping 40% faster than Brainfish with the same configuration. It's too bad asmFish isn't updated anymore or I'd just go with that and not bother the SF team.
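The Brainfish percentages quoted above (140,000 to 192,000 kN/s) can be reproduced with a few lines of Python. The 230,000 kN/s target is an assumption here: it is taken from the "230,000+ kN/s" figure mentioned earlier in the thread, which matches the stated 38,000 kN/s gap.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

brainfish_before = 140_000  # kN/s, lower figure quoted in the post
brainfish_after = 192_000   # kN/s, with HT and LP enabled
target = 230_000            # kN/s, assumed target taken from earlier in the thread

gain = percent_change(brainfish_before, brainfish_after)
shortfall = target - brainfish_after
shortfall_pct = percent_change(brainfish_after, target)

print(f"gain: {gain:.0f}%")
print(f"shortfall: {shortfall} kN/s ({shortfall_pct:.0f}% above the current speed)")
```

Note that the 20% is measured against the current 192,000 kN/s; measured against the 230,000 target the same gap would be a smaller percentage.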
In any case, that goes back to NUMA/Processor Groups as the most likely culprit on this system. Can someone please look at this part of the code in SF?
bob wrote:That 5K nodes per second is a REAL restriction. Many were doing 5K nodes per second in the 70's and 80's. And far beyond. Without a GM being produced.
I am certain that the 2700 Elo at 2K nodes per second is a wild exaggeration of reality. Maybe 500K nodes per second, possible. Certainly not 2K.
2K is more than enough, 1K is better estimate.
Solving positions is not a metric for estimating 2700-elo playing performance.
5K nodes in 1980's and 200 MN/sec of DeepBlue were targeted to seek bad moves as hard as possible.
If the evaluation function was tuned to treat a Queen the same as a pawn (I exaggerate just to introduce the general idea), DeepBlue would search 120 billion nodes (in 10 minutes) to find all positions where it can sacrifice a Queen for a pawn.
DeepBlue put all his brute force to maximize his ill evaluation function.
yorkman wrote: ↑Fri Apr 10, 2020 3:14 am
In Aquarium I just tried the asmFish 2017-05-22-popcnt engine and I got 248,000 kN/s with LP and HT enabled in many positions.
{snip}
I wish I could say the same about latest SF with LP added:
===========================
Total time (ms) : 54464
Nodes searched : 11764651729
Nodes/second : 216007853
Quite a bit worse. NUMA/processor groups bug with this particular system?
That assumes that NPS is really important. asmFish is way behind SF in development.
With that lower NPS count, the current SF will still wipe the floor with asmFish.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.