Time per gigabyte RAM with NN engines to exhaustion


jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: Time per gigabyte RAM with NN engines to exhaustion

Post by jp »

schack wrote: Fri Dec 06, 2019 6:01 am Is that not correct?
Memory will be the limiting factor in using an NN engine at long analysis times. When it runs out of memory, it just crashes.
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: Time per gigabyte RAM with NN engines to exhaustion

Post by jp »

[continued from above]
E.g., in a different thread, Jim said he analysed for 3.3 billion nodes and it used 185 GB (!!) of RAM.
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Time per gigabyte RAM with NN engines to exhaustion

Post by smatovic »

schack wrote: Fri Dec 06, 2019 6:01 am I was under the impression that NN engines didn't need nearly as much traditional hash space as AB engines. Is that not correct?
NN engines use a hash to store evaluations of positions; you can set its size via Lc0's parameters, e.g. NNCacheSize.

But MCTS needs to store the search tree in memory (similar to Komodo MCTS), so memory fills up relatively quickly.
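As a rough back-of-the-envelope sketch (the node rate and per-node size here are assumptions in the ballpark of this thread, not measurements), the tree's RAM footprint grows like this:

Code:

# Rough growth of the in-RAM MCTS tree.
# Illustrative assumptions, not measurements:
nodes_per_sec  = 50_000   # assumed Leela-class search speed
bytes_per_node = 200      # assumed per-node cost of the tree

gb_per_hour = nodes_per_sec * bytes_per_node * 3600 / 1e9
print(gb_per_hour)        # 36.0 -- tens of GB per hour of analysis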

--
Srdja
User avatar
Ovyron
Posts: 4556
Joined: Tue Jul 03, 2007 4:30 am

Re: Time per gigabyte RAM with NN engines to exhaustion

Post by Ovyron »

Why not just use hard drive space for that? Like, I have 3 GB RAM but 146 GB of free HD space; just use the latter.
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Time per gigabyte RAM with NN engines to exhaustion

Post by smatovic »

smatovic wrote: Fri Dec 06, 2019 8:14 am
schack wrote: Fri Dec 06, 2019 6:01 am I was under the impression that NN engines didn't need nearly as much traditional hash space as AB engines. Is that not correct?
NN engines use a hash to store evaluations of positions; you can set its size via Lc0's parameters, e.g. NNCacheSize.

But MCTS needs to store the search tree in memory (similar to Komodo MCTS), so memory fills up relatively quickly.

--
Srdja
Ovyron wrote: Fri Dec 06, 2019 8:45 am Why not just use hard drive space for that? Like, I have 3 GB RAM but 146 GB of free HD space; just use the latter.
Lc0 is still WIP; make a proposal on GitHub or Discord. Afaik Percival (dragontamer) came up with this idea too.

--
Srdja
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Time per gigabyte RAM with NN engines to exhaustion

Post by zullil »

Dann Corbit wrote: Wed Dec 04, 2019 11:52 pm With an RTX 2080, how fast does it consume RAM?
I have found that my 1080TI eats 32 GB in an hour, and the analysis slows to a crawl (and the machine starts to get balky).
If you have 2x RTX2080, what is the maximum time you can analyze before RAM is exhausted with 64 GB and with 128 GB?
Has anyone looked into this?
I think that there is a way to limit maximum RAM, but if you do that, how is the analysis impacted?
Lc0 provides info in its log file regarding how many nodes it can handle based on a specified NNCacheSize and RamLimitMb. For example, here's what I see with NNCacheSize = 100000000 and RamLimitMb = 100000. (My system has only 128 GB RAM.)

Code:

============= Log started. =============
1206 15:37:54.511214 140543876579776 ../../src/main.cc:40] Lc0 started.
1206 15:37:54.511252 140543876579776 ../../src/main.cc:41]        _
1206 15:37:54.511285 140543876579776 ../../src/main.cc:42] |   _ | |
1206 15:37:54.511298 140543876579776 ../../src/main.cc:43] |_ |_ |_| v0.23.0+git.02fc8e0 built Dec  2 2019
1206 15:37:54.514759 140543876579776 ../../src/utils/commandline.cc:45] Command line: ./lc0
1206 15:37:56.414327 140543876579776 ../../src/chess/uciloop.cc:132] >> uci
1206 15:37:56.414384 140543876579776 ../../src/chess/uciloop.cc:219] << id name Lc0 v0.23.0+git.02fc8e0
1206 15:37:56.414437 140543876579776 ../../src/chess/uciloop.cc:219] << id author The LCZero Authors.
1206 15:37:56.414601 140543876579776 ../../src/chess/uciloop.cc:219] << option name WeightsFile type string default <autodiscover>
1206 15:37:56.414625 140543876579776 ../../src/chess/uciloop.cc:219] << option name Backend type combo default cudnn var cudnn var cudnn-fp16 var random var check var roundrobin var multiplexing var demux
1206 15:37:56.414646 140543876579776 ../../src/chess/uciloop.cc:219] << option name BackendOptions type string default 
1206 15:37:56.414663 140543876579776 ../../src/chess/uciloop.cc:219] << option name Threads type spin default 2 min 1 max 128
1206 15:37:56.414680 140543876579776 ../../src/chess/uciloop.cc:219] << option name NNCacheSize type spin default 200000 min 0 max 999999999
1206 15:37:56.414697 140543876579776 ../../src/chess/uciloop.cc:219] << option name MinibatchSize type spin default 256 min 1 max 1024
1206 15:37:56.414714 140543876579776 ../../src/chess/uciloop.cc:219] << option name MaxPrefetch type spin default 32 min 0 max 1024
1206 15:37:56.414731 140543876579776 ../../src/chess/uciloop.cc:219] << option name LogitQ type check default false
1206 15:37:56.414754 140543876579776 ../../src/chess/uciloop.cc:219] << option name CPuct type string default 3.000000
1206 15:37:56.414772 140543876579776 ../../src/chess/uciloop.cc:219] << option name CPuctBase type string default 19652.000000
1206 15:37:56.414789 140543876579776 ../../src/chess/uciloop.cc:219] << option name CPuctFactor type string default 2.000000
1206 15:37:56.414806 140543876579776 ../../src/chess/uciloop.cc:219] << option name Temperature type string default 0.000000
1206 15:37:56.414822 140543876579776 ../../src/chess/uciloop.cc:219] << option name TempDecayMoves type spin default 0 min 0 max 100
1206 15:37:56.414839 140543876579776 ../../src/chess/uciloop.cc:219] << option name TempCutoffMove type spin default 0 min 0 max 1000
1206 15:37:56.414855 140543876579776 ../../src/chess/uciloop.cc:219] << option name TempEndgame type string default 0.000000
1206 15:37:56.414872 140543876579776 ../../src/chess/uciloop.cc:219] << option name TempValueCutoff type string default 100.000000
1206 15:37:56.414889 140543876579776 ../../src/chess/uciloop.cc:219] << option name TempVisitOffset type string default 0.000000
1206 15:37:56.414905 140543876579776 ../../src/chess/uciloop.cc:219] << option name DirichletNoise type check default false
1206 15:37:56.414922 140543876579776 ../../src/chess/uciloop.cc:219] << option name VerboseMoveStats type check default false
1206 15:37:56.414939 140543876579776 ../../src/chess/uciloop.cc:219] << option name FpuStrategy type combo default reduction var reduction var absolute
1206 15:37:56.414955 140543876579776 ../../src/chess/uciloop.cc:219] << option name FpuValue type string default 1.200000
1206 15:37:56.414972 140543876579776 ../../src/chess/uciloop.cc:219] << option name FpuStrategyAtRoot type combo default same var reduction var absolute var same
1206 15:37:56.414988 140543876579776 ../../src/chess/uciloop.cc:219] << option name FpuValueAtRoot type string default 1.000000
1206 15:37:56.415004 140543876579776 ../../src/chess/uciloop.cc:219] << option name CacheHistoryLength type spin default 0 min 0 max 7
1206 15:37:56.415021 140543876579776 ../../src/chess/uciloop.cc:219] << option name PolicyTemperature type string default 2.200000
1206 15:37:56.415037 140543876579776 ../../src/chess/uciloop.cc:219] << option name MaxCollisionEvents type spin default 32 min 1 max 1024
1206 15:37:56.415054 140543876579776 ../../src/chess/uciloop.cc:219] << option name MaxCollisionVisits type spin default 9999 min 1 max 1000000
1206 15:37:56.415070 140543876579776 ../../src/chess/uciloop.cc:219] << option name OutOfOrderEval type check default true
1206 15:37:56.415091 140543876579776 ../../src/chess/uciloop.cc:219] << option name StickyEndgames type check default true
1206 15:37:56.415142 140543876579776 ../../src/chess/uciloop.cc:219] << option name SyzygyFastPlay type check default true
1206 15:37:56.415162 140543876579776 ../../src/chess/uciloop.cc:219] << option name MultiPV type spin default 1 min 1 max 500
1206 15:37:56.415179 140543876579776 ../../src/chess/uciloop.cc:219] << option name PerPVCounters type check default false
1206 15:37:56.415195 140543876579776 ../../src/chess/uciloop.cc:219] << option name ScoreType type combo default centipawn var centipawn var centipawn_2018 var win_percentage var Q
1206 15:37:56.415214 140543876579776 ../../src/chess/uciloop.cc:219] << option name HistoryFill type combo default fen_only var no var fen_only var always
1206 15:37:56.415231 140543876579776 ../../src/chess/uciloop.cc:219] << option name ShortSightedness type string default 0.000000
1206 15:37:56.415247 140543876579776 ../../src/chess/uciloop.cc:219] << option name SyzygyPath type string default 
1206 15:37:56.415263 140543876579776 ../../src/chess/uciloop.cc:219] << option name Ponder type check default true
1206 15:37:56.415279 140543876579776 ../../src/chess/uciloop.cc:219] << option name UCI_Chess960 type check default false
1206 15:37:56.415295 140543876579776 ../../src/chess/uciloop.cc:219] << option name UCI_ShowWDL type check default false
1206 15:37:56.415312 140543876579776 ../../src/chess/uciloop.cc:219] << option name ConfigFile type string default lc0.config
1206 15:37:56.415328 140543876579776 ../../src/chess/uciloop.cc:219] << option name KLDGainAverageInterval type spin default 100 min 1 max 10000000
1206 15:37:56.415344 140543876579776 ../../src/chess/uciloop.cc:219] << option name MinimumKLDGainPerNode type string default 0.000000
1206 15:37:56.415361 140543876579776 ../../src/chess/uciloop.cc:219] << option name SmartPruningFactor type string default 1.330000
1206 15:37:56.415377 140543876579776 ../../src/chess/uciloop.cc:219] << option name RamLimitMb type spin default 0 min 0 max 100000000
1206 15:37:56.415393 140543876579776 ../../src/chess/uciloop.cc:219] << option name MoveOverheadMs type spin default 200 min 0 max 100000000
1206 15:37:56.415410 140543876579776 ../../src/chess/uciloop.cc:219] << option name Slowmover type string default 1.000000
1206 15:37:56.415426 140543876579776 ../../src/chess/uciloop.cc:219] << option name ImmediateTimeUse type string default 1.000000
1206 15:37:56.415442 140543876579776 ../../src/chess/uciloop.cc:219] << option name LogFile type string default 
1206 15:37:56.415462 140543876579776 ../../src/chess/uciloop.cc:219] << uciok
1206 15:38:11.134592 140543876579776 ../../src/chess/uciloop.cc:132] >> setoption name LogFile value Lc0.log
1206 15:38:27.542928 140543876579776 ../../src/chess/uciloop.cc:132] >> setoption name Backend value cudnn-fp16
1206 15:38:54.247437 140543876579776 ../../src/chess/uciloop.cc:132] >> setoption name NNCacheSize value 100000000
1206 15:39:25.256014 140543876579776 ../../src/chess/uciloop.cc:132] >> setoption name RamLimitMb value 100000
1206 15:39:33.824194 140543876579776 ../../src/chess/uciloop.cc:132] >> ucinewgame
1206 15:39:33.824926 140543876579776 ../../src/neural/loader.cc:209] Found pb network file: ./J13B.3-200
1206 15:39:34.564399 140543876579776 ../../src/neural/factory.cc:84] Creating backend [cudnn-fp16]...
1206 15:39:34.705520 140543876579776 ../../src/neural/cuda/network_cudnn.cc:723] CUDA Runtime version: 10.1.0
1206 15:39:34.705602 140543876579776 ../../src/neural/cuda/network_cudnn.cc:736] Cudnn version: 7.6.2
1206 15:39:34.705626 140543876579776 ../../src/neural/cuda/network_cudnn.cc:746] Latest version of CUDA supported by the driver: 10.1.0
1206 15:39:34.706284 140543876579776 ../../src/neural/cuda/network_cudnn.cc:754] GPU: GeForce RTX 2080 Ti
1206 15:39:34.706309 140543876579776 ../../src/neural/cuda/network_cudnn.cc:755] GPU memory: 10.7241 Gb
1206 15:39:34.706346 140543876579776 ../../src/neural/cuda/network_cudnn.cc:757] GPU clock frequency: 1635 MHz
1206 15:39:34.706358 140543876579776 ../../src/neural/cuda/network_cudnn.cc:758] GPU compute capability: 7.5
1206 15:39:40.448296 140543876579776 ../../src/chess/uciloop.cc:132] >> go nodes 1000
1206 15:39:40.448470 140543876579776 ../../src/mcts/stoppers/stoppers.cc:104] RAM limit 100000MB. Cache takes 31200MB. Remaining memory is enough for 344000000 nodes.
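For what it's worth, the sizes implied by that last log line work out to round numbers. This is arithmetic inferred from the log output, not anything taken from the Lc0 source:

Code:

# Sizes implied by "RAM limit 100000MB. Cache takes 31200MB.
# Remaining memory is enough for 344000000 nodes." (inferred, not from source)
ram_limit_mb  = 100_000
cache_entries = 100_000_000   # NNCacheSize
cache_mb      = 31_200
nodes         = 344_000_000

print(cache_mb * 1e6 / cache_entries)            # 312.0 bytes per cache entry
print((ram_limit_mb - cache_mb) * 1e6 / nodes)   # 200.0 bytes per tree node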
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Time per gigabyte RAM with NN engines to exhaustion

Post by dragontamer5788 »

smatovic wrote: Fri Dec 06, 2019 8:50 am Lc0 is still WIP; make a proposal on GitHub or Discord. Afaik Percival (dragontamer) came up with this idea too.
I believe it is possible to do an MCTS traversal with virtual loss on hard drives or SSDs, but I haven't discussed the full scope of my idea with anyone yet. I'm kind of busy with many other ideas... but I figure I might as well discuss the general idea here. Maybe someone else can flesh it out and maybe make it a reality.
Ovyron wrote: Fri Dec 06, 2019 8:45 am Why not just use hard drive space for that? Like, I have 3 GB RAM but 146 GB of free HD space; just use the latter.
Hard drives have random-access rates of ~200 to ~500 ops per second (depending on queue depth and other details). Traversing an MCTS tree of depth 10 entirely on disk would require 10 reads per traversal, limiting you to ~20, maybe ~50, traversals per second. 2 GB of nodes at 16 bytes per node and ~6 children per node would be able to cache a depth-10 search (6^10 ≈ 60M nodes × 16 bytes ≈ 1 GB), but reaching into depth 11 would start requiring the OS to thrash the hard drive (maybe 1 read per traversal to reach into depth 11, and then 2 hard-drive reads per traversal to reach depth 12).

So we're looking at an effective speed of ~200 MCTS traversals per second at depth ~11, then ~100 MCTS traversals per second at depth ~12, and so on, as more and more of the tree gets shifted onto super-slow hard-drive storage (~500 reads/writes per second) instead of DDR4 RAM (~20 million reads/writes per second).
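The same estimate in code (a toy model under this post's own assumptions: each tree level beyond what fits in RAM costs one random disk read per traversal):

Code:

# Toy model of the rates above. The IOPS figures are this post's own
# ballpark numbers, not benchmarks.
def traversals_per_sec(depth, depth_in_ram, iops):
    reads = depth - depth_in_ram   # random disk reads per traversal
    assert reads > 0, "subtree fits in RAM; disk is not the bottleneck"
    return iops / reads

HDD_IOPS, SSD_IOPS = 200, 100_000
for depth in (11, 12, 13):
    print(depth, traversals_per_sec(depth, 10, HDD_IOPS),
          traversals_per_sec(depth, 10, SSD_IOPS))
# depth 11: 200.0/s on HDD; depth 12: 100.0/s; depth 13: ~66.7/s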

--------

SSDs are different. SSDs manage ~100,000 reads/writes per second, roughly 200x slower than DDR4 RAM, but 100k IOPS is enough to maybe support the 50k nodes/second that LeelaZero currently achieves. A degree of tuning, and maybe custom programming, would be necessary to get it working correctly, but it should be possible.

----------

In either case, the goal should be to split the MCTS tree into parallelizable parts and process subsets of it in a way that minimizes I/O requests and maximizes DDR4 usage. Consider my usual notation for a traversal: P is the root, P1 is the 1st child of the root, P123 is the 3rd child of the 2nd child of the 1st child of the root.

If you are to assign 1 million virtual losses to an MCTS tree with root P, maybe 500k of those positions will be assigned to P1 (and P1's children). All of P's other children (P2, P3, P4, ...), as well as their children, can be written to SSD or hard-drive space and removed from RAM.

Your traversal enters P1 with 500k virtual losses to assign. Let's say P13 would be assigned 400k of these. You remove P1's remaining children (P11, P12, P14, P15, ...) from RAM (writing them to SSD or hard-drive space) and focus down on P13's path.

P13 now has 400k virtual losses to be assigned. Maybe this is small enough to fit in memory now, especially because we've removed so much from RAM. So you perform the 400k virtual losses as per normal MCTS-with-virtual-loss (e.g. P13126231 is assigned a visit; that visit goes to the GPU to be processed; the GPU responds with an answer; update positions P13126231 -> P1312623 -> P131262 -> P13126 -> P1312 -> P131 -> P13 with the usual MCTS visit rules, and carry on). Eventually, all 400k values are assigned and you can go back up the tree.

Upon going back up the tree, you hit P1 once more. You'll notice that you haven't assigned all 500k virtual losses yet: 100k still remain. The next 80k virtual losses might go down P12. So you remove the P13-branch from memory (writing it to SSD or Hard Drive), and traverse down P12 with 80k virtual losses.

Eventually, you return to the root P. 500k positions have yet to be assigned, and P1 has been completed. You remove P1 from memory (writing it to SSD or hard drive), and then enter P2 with maybe 300k virtual-loss positions to assign.
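In code, the scheme looks roughly like the sketch below. This is a hypothetical outline of the idea, not Lc0 code; Node, evict, load, and the prior-proportional split are all illustrative stand-ins:

Code:

# Hypothetical sketch of the eviction-based traversal described above.
RAM_NODE_BUDGET = 100_000          # pretend RAM comfortably holds this many nodes

class Node:
    def __init__(self, prior):
        self.prior = prior         # policy prior for the move leading here
        self.children = []
        self.in_ram = True

def evict(node):
    node.in_ram = False            # real code would serialize the subtree to disk

def load(node):
    node.in_ram = True             # real code would read the subtree back in

def subtree_size(node):
    return 1 + sum(subtree_size(c) for c in node.children)

def standard_mcts(node, visits):
    pass                           # placeholder: ordinary in-RAM MCTS w/ virtual loss

def distribute(node, losses):
    """Assign `losses` virtual-loss visits below `node`, keeping only the
    active branch (e.g. P -> P1 -> P13) in RAM and evicting the rest."""
    if losses == 0:
        return
    if subtree_size(node) + losses <= RAM_NODE_BUDGET:
        standard_mcts(node, losses)          # whole batch fits: run normally
        return
    for child in sorted(node.children, key=lambda c: -c.prior):
        share = round(losses * child.prior)  # stand-in for the real PUCT split
        if share == 0:
            continue
        for sibling in node.children:
            if sibling is not child:
                evict(sibling)               # P2, P3, ... go to SSD/HDD
        load(child)
        distribute(child, share)
        evict(child)                         # finished with this branch for now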

---------

This sort of traversal would minimize the hits to the hard drive while still providing an MCTS traversal of some kind. SSD flash storage might be fast enough to keep up with the ~50,000 positions/second that LeelaZero can consume, but hard drives are just way too slow on IOPS; a hard-drive program would need to account for that slow ~5 ms arm swing every time it reads/writes a new position on the HDD.

Still: because of how cheap HDDs are, it's probably worthwhile to write HDD-specific programs. It's going to be hard to stay within ~200 IOPS while still delivering the ~50,000 positions/second (or more) that LeelaZero can gobble up, but I think it's possible to write such a program.
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Time per gigabyte RAM with NN engines to exhaustion

Post by corres »

dragontamer5788 wrote: Fri Dec 06, 2019 10:26 pm
[full post, quoted in its entirety above, snipped]
There is an important thing you forgot:
The OS uses RAM first and only then the page file on the HDD or SSD. Moreover, it moves the oldest data in RAM out to the page file. So if you have an appropriately sized RAM, this does not cause an essential slowdown. Obviously, the longer the thinking time, the more RAM you need. If one analyzes a variation for many hours, one needs a lot of RAM and a page file of some 100 GB on HDD or SSD.
I think, based on some experience, we can solve this issue even with the present Lc0.
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: Time per gigabyte RAM with NN engines to exhaustion

Post by jp »

dragontamer5788 wrote: Fri Dec 06, 2019 10:26 pm This sort of traversal would minimize the hits to the hard drive while still providing an MCTS traversal of some kind. SSD flash storage might be fast enough to keep up with the ~50,000 positions/second that LeelaZero can consume, but hard drives are just way too slow
But I read that SSDs will be harmed by that sort of activity. Right?
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Time per gigabyte RAM with NN engines to exhaustion

Post by Dann Corbit »

jp wrote: Sat Dec 07, 2019 12:10 pm
dragontamer5788 wrote: Fri Dec 06, 2019 10:26 pm This sort of traversal would minimize the hits to the hard drive while still providing an MCTS traversal of some kind. SSD flash storage might be fast enough to keep up with the ~50,000 positions/second that LeelaZero can consume, but hard drives are just way too slow
But I read that SSDs will be harmed by that sort of activity. Right?
Most SSD devices have about the same endurance as a regular mechanical disk. The AORUS M.2 PCIe 4.0 SSDs that I bought have incredible endurance.

If you have a really fast machine and little memory, your SSD will get used like mad. So it is still an important consideration.
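For a sense of scale (all of these figures are assumptions, not the specs of any particular drive), sustained node writes at Leela-like rates are fairly gentle next to a typical TBW rating, ignoring write amplification and repeated evictions:

Code:

# Hypothetical endurance estimate; illustrative figures, not drive specs.
nodes_per_sec  = 50_000      # node rate quoted earlier in the thread
bytes_per_node = 200         # per-node size implied by the Lc0 log above
tbw_bytes      = 600e12      # assumed 600 TB-written endurance rating

write_rate = nodes_per_sec * bytes_per_node   # 10 MB/s of node writes
print(tbw_bytes / write_rate / 86_400)        # ~694 days of nonstop writing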
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.