Daniel Shawul wrote: ↑Fri Jul 19, 2019 4:49 am
I don't think lc0 consumes that much memory, and from my calculations it could go up to 17 hours of analysis with just 12 GB of RAM.
That's one of the reasons I decided to stick with 16 GB of RAM.
Daniel
1 node takes 250 bytes, or in other words 1GB is needed for 4M nodes.
(and also one NN cache entry takes 350 bytes. It doesn't grow with time, but it's a thing to consider when setting huge cache sizes).
So at 33333 nps it takes 1GB per 2 minutes, per hour it takes 30GB, and after 17 hours it needs 512GB of RAM.
If 12GB is enough for 17 hours (and it overflows into the swap partition), it seems that the average nps is around 780.
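crem's arithmetic above can be sketched in a few lines of Python (all constants are the figures from the post; the helper names are mine):

```python
# Sketch of the memory arithmetic above. All constants come from the
# post: ~250 bytes per node, and the nps figures being discussed.

BYTES_PER_NODE = 250

def ram_needed_gb(nps, hours):
    # RAM needed if every searched node stays in the tree
    return nps * hours * 3600 * BYTES_PER_NODE / 1e9

def sustainable_nps(ram_gb, hours):
    # Average nps a given amount of RAM can sustain for that long
    return ram_gb * 1e9 / (hours * 3600 * BYTES_PER_NODE)

print(ram_needed_gb(33333, 1))   # ~30 GB per hour at 33 knps
print(ram_needed_gb(33333, 17))  # ~510 GB after 17 hours
print(sustainable_nps(12, 17))   # ~780 nps fits in 12 GB for 17 h
```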
So if I divide that by a factor of (250 bytes / 24 bytes ≈ 10), it will take 51 GB for a 17-hour analysis. Also, I used an nps of 20 knps.
I did a 2-hour analysis of 10 positions for Dann Corbit a while ago.
So per hour it was taking 76 million nodes at an nps of 21 knps. So if I take 24 bytes per node, in 17 hours it uses up 28 GB.
I see my mistake now: the nodes per second (nps) doesn't actually tell you the number of nodes actually generated.
Unvisited nodes will increase memory consumption by a factor of the average branching factor ...
So if I divide that by a factor of (250 bytes / 24 bytes ≈ 10), it will take 51 GB for a 17-hour analysis. Also, I used an nps of 20 knps.
I did a 2-hour analysis of 10 positions for Dann Corbit a while ago.
So per hour it was taking 76 million nodes at an nps of 21 knps. So if I take 24 bytes per node, in 17 hours it uses up 28 GB.
That was written prior to edge-node separation. (The memory usage estimation is still roughly true though: an "internal node" included data from the edge pointing to it, and a "leaf node" only contained edge information but not the node itself.) Prior to edge-node separation, for every visited node all its child Nodes were also created even if they were not visited (so after visiting X nodes, 20*X node objects were in memory). We stored priors there.
Now it's roughly the same, but edges are stored together with the node.
Edge size is 4 bytes.
Node size is `80 bytes + number_possible_moves*edge_size`. With 30 possible moves in an average position, this gives 200.
Because of memory allocation overhead, memory fragmentation, blah blah blah, in reality it's more towards 250 bytes per node.
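A minimal sketch of this per-node estimate (the 1.25 overhead factor is my assumption, standing in for the allocator and fragmentation costs mentioned above, chosen so a 30-move position lands at ~250 bytes):

```python
# Per-node memory estimate as described above: a Node is 80 bytes plus
# 4 bytes per edge (one edge per legal move in the position).

NODE_BASE = 80      # bytes per Node object
EDGE_SIZE = 4       # bytes per edge
OVERHEAD = 1.25     # assumed allocator/fragmentation factor (~250/200)

def node_bytes(num_moves, overhead=OVERHEAD):
    return (NODE_BASE + num_moves * EDGE_SIZE) * overhead

print(node_bytes(30))                # ~250 bytes for a 30-move position
print(node_bytes(30, overhead=1.0))  # 200 bytes before overhead
```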
So this must have been a problem with the TensorRT library I am using in Scorpio. I used to get comparable nps with FP16 and FP32
as lc0 on a Volta chip, so I am not sure what is going on here. However, Scorpio still gets 35 knps with INT8, so it is OK for now.
I have also tried to ssh into the desktop while the screen is dim, since the GPU is dual-purpose right now. The effect on
nps is minimal, but I may want to stick in a cheap GPU, maybe a GTX 1650 for $150, on the second PCIe slot to handle the display, so the RTX can do the compute.
So this hardware combo, Ryzen 9 3900X + RTX 2070 Super, seems to be pretty good for Stockfish+lc0, for those interested.
It is now getting 46 knps with INT8 and 26 knps with FP16 using my net-20x256.uff network.
So I suspect the problem is TensorRT not being able to optimize Leela-style nets as well as my own format.
One thing I noticed is that the RTX 2070 Super, when using FP16, has 2x more flops with an FP16 accumulator than with an FP32 accumulator.
Daniel Shawul wrote: ↑Thu Jul 18, 2019 7:46 am
Success at last!!
…
…
...
I did a quick benchmark on the CPU (Ryzen 9 3900X) and GPU (RTX 2070 Super). Stockfish seems to scale linearly across the 12 cores with its lazy SMP implementation. So I got about 1.8 mnps on 1 core using the latest source compiled with gcc 7.4, and 21 mnps using all 12 cores. It goes up to 27 mnps if I use hyperthreading (24 threads). Similar scaling for Scorpio as well. No overclocking for this test, just the base clock of 3.8 GHz.
The more I look at the numbers you posted for SF, the more I think something isn't right. I'm not sure what, but the NPS seems unusually low and it really started to bug me. My first thought was that JEDEC memory timings for slower memory are being used. But even if slow 2133 memory timings were used, it would only slow the machine about 15% compared to DDR4 3200 CL16 RAM. Which makes me curious about which version of SF you used. Was it an Abrok Haswell compile of recent vintage? If so, which one? Or was it a POPCNT compile?
Maybe I'm just paranoid, but these numbers don't seem right, and if it were my machine I would be scouring the earth to determine what the problem was, especially since this is a new build. So you may want to look around a bit to make sure all the things that you normally can't see are actually set the way you believe they should be.
I'm going to a High Power Rocket launch tomorrow weather permitting so I won't be around until later in the day. But I am interested in determining if my intuition is correct.
Regards,
Zenmastur
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
Daniel Shawul wrote: ↑Thu Jul 18, 2019 7:46 am
Success at last!!
…
…
...
I did a quick benchmark on the CPU (Ryzen 9 3900X) and GPU (RTX 2070 Super). Stockfish seems to scale linearly across the 12 cores with its lazy SMP implementation. So I got about 1.8 mnps on 1 core using the latest source compiled with gcc 7.4, and 21 mnps using all 12 cores. It goes up to 27 mnps if I use hyperthreading (24 threads). Similar scaling for Scorpio as well. No overclocking for this test, just the base clock of 3.8 GHz.
The more I look at the numbers you posted for SF, the more I think something isn't right. I'm not sure what, but the NPS seems unusually low and it really started to bug me. My first thought was that JEDEC memory timings for slower memory are being used. But even if slow 2133 memory timings were used, it would only slow the machine about 15% compared to DDR4 3200 CL16 RAM. Which makes me curious about which version of SF you used. Was it an Abrok Haswell compile of recent vintage? If so, which one? Or was it a POPCNT compile?
I built it myself with gcc, but I did not do a profile build before. Now that I have done that, the single-core nps has increased to 2.2 mnps from 1.8 mnps.
And with 24 threads, I now get 31 million nps. I have also tried an Abrok compile for modern-linux and it gives the same nps.
However, I did the nps measurement from the start position only, which is probably the problem. If I do the "bench" command I get an average of 2.75 million nps. So using all 24 threads and bench, it goes up to 38 million nps:
./stockfish_19071415_x64_modern bench 512 24 28 default depth
Total time (ms) : 165299
Nodes searched : 6257478895
Nodes/second : 37855515
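As a sanity check, the reported Nodes/second figure follows directly from the other two lines of the bench output:

```python
# Recompute nps from the bench totals quoted above; integer division
# matches the whole-number figure Stockfish reports.

total_ms = 165299        # "Total time (ms)" from the bench output
nodes = 6257478895       # "Nodes searched" from the bench output

nps = nodes * 1000 // total_ms
print(nps)  # 37855515, matching the "Nodes/second" line
```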
I'm going to a High Power Rocket launch tomorrow weather permitting so I won't be around until later in the day. But I am interested in determining if my intuition is correct.
Daniel Shawul wrote: ↑Thu Jul 18, 2019 7:46 am
Success at last!!
…
…
...
I did a quick benchmark on the CPU (Ryzen 9 3900X) and GPU (RTX 2070 Super). Stockfish seems to scale linearly across the 12 cores with its lazy SMP implementation. So I got about 1.8 mnps on 1 core using the latest source compiled with gcc 7.4, and 21 mnps using all 12 cores. It goes up to 27 mnps if I use hyperthreading (24 threads). Similar scaling for Scorpio as well. No overclocking for this test, just the base clock of 3.8 GHz.
The more I look at the numbers you posted for SF, the more I think something isn't right. I'm not sure what, but the NPS seems unusually low and it really started to bug me. My first thought was that JEDEC memory timings for slower memory are being used. But even if slow 2133 memory timings were used, it would only slow the machine about 15% compared to DDR4 3200 CL16 RAM. Which makes me curious about which version of SF you used. Was it an Abrok Haswell compile of recent vintage? If so, which one? Or was it a POPCNT compile?
I built it myself with gcc, but I did not do a profile build before. Now that I have done that, the single-core nps has increased to 2.2 mnps from 1.8 mnps.
And with 24 threads, I now get 31 million nps. I have also tried an Abrok compile for modern-linux and it gives the same nps.
However, I did the nps measurement from the start position only, which is probably the problem. If I do the "bench" command I get an average of 2.75 million nps. So using all 24 threads and bench, it goes up to 38 million nps:
./stockfish_19071415_x64_modern bench 512 24 28 default depth
Total time (ms) : 165299
Nodes searched : 6257478895
Nodes/second : 37855515
I'm going to a High Power Rocket launch tomorrow weather permitting so I won't be around until later in the day. But I am interested in determining if my intuition is correct.
Have fun!
Daniel
That's a 40% improvement over the original numbers you gave. I was expecting numbers in the range of 35M nps to 43.5M nps, so these are definitely in the "GOOD" range. It makes me want to go out and buy one just so I can tweak on it!
I'm trying mightily to resist the urge.
How much money do you think you saved by building your own computer?
And are you happy with your component decision?
Regards,
Zenmastur
I am quite happy with the system! The CPU is very powerful (maybe half the power of the TCEC machine) and so is the RTX 2070 Super.
I think I may have saved $500, which we will know for sure once pre-built PCs with a Ryzen 9 and RTX 2070 Super show up.
This came at the right time, as I needed to buy either a desktop or a laptop -- glad I went for the desktop since it is way more powerful in
every aspect. Most of all, I enjoyed the experience a lot.
I made a miscalculation on RAM, so I will upgrade it to 32 GB or more at some point, but it works for now.
Daniel Shawul wrote: ↑Fri Jul 19, 2019 4:49 am
I don't think lc0 consumes that much memory, and from my calculations it could go up to 17 hours of analysis with just 12 GB of RAM.
That's one of the reasons I decided to stick with 16 GB of RAM.
Daniel
1 node takes 250 bytes, or in other words 1GB is needed for 4M nodes.
(and also one NN cache entry takes 350 bytes. It doesn't grow with time, but it's a thing to consider when setting huge cache sizes).
So at 33333 nps it takes 1GB per 2 minutes, per hour it takes 30GB, and after 17 hours it needs 512GB of RAM.
If 12GB is enough for 17 hours (and it overflows into the swap partition), it seems that the average nps is around 780.
The huge memory usage of Lc0 is a recurring theme on the LC0 sites,
but this is the first well-founded answer to it.
Why, dear "crem"?
Is it such a great secret?
I cannot believe the developers of Leela cannot make the RAM usage adjustable with a UCI parameter. Maybe it would decrease Leela's Elo to some degree, but this effect would be reduced with a timed and structured hash table, as in AB engines.
Letting Leela freeze when it consumes all of the system RAM is not an elegant solution to this issue.
corres wrote: ↑Sun Jul 21, 2019 5:43 pm
The huge memory usage of Lc0 is a recurring theme on the LC0 sites,
but this is the first well-founded answer to it.
Why, dear "crem"?
Is it such a great secret?
I cannot believe the developers of Leela cannot make the RAM usage adjustable with a UCI parameter. Maybe it would decrease Leela's Elo to some degree, but this effect would be reduced with a timed and structured hash table, as in AB engines.
Letting Leela freeze when it consumes all of the system RAM is not an elegant solution to this issue.
There already is a parameter, although there is some question about how accurate it is.
--ramlimit-mb=0..100000000
Maximum memory usage for the engine, in megabytes. The estimation is very rough,
and can be off by a lot. For example, multiple visits to a terminal node counted
several times, and the estimation assumes that all positions have 30 possible
moves. When set to 0, no RAM limit is enforced.
[UCI: RamLimitMb DEFAULT: 0 MIN: 0 MAX: 100000000]
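For illustration only (this is not lc0's actual estimator), a RAM limit translates into a rough tree-size budget if you apply the ~250 bytes/node figure from earlier in the thread:

```python
# Rough illustration of what a RamLimitMb setting means in tree size,
# using the ~250 bytes/node average quoted earlier in this thread.
# lc0's real estimator differs (and, per the help text, can be off by a lot).

BYTES_PER_NODE = 250  # assumed average from crem's estimate above

def approx_max_nodes(ram_limit_mb):
    return ram_limit_mb * 1_000_000 // BYTES_PER_NODE

print(approx_max_nodes(12_000))  # ~48M nodes fit in a 12 GB limit
```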