about an idea to learn better for LC0


hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: about an idea to learn better for LC0

Post by hgm »

Michel wrote: So the resources poured into the experiment are never wasted.
It depends on how you define 'wasted'. Confirming results by others in general only helps to secure the fame for those others, and doesn't buy anything for yourself. Doing something new that works better is progress, and puts you in the hall of fame yourself.
Michel wrote: If one tries a different approach and it fails there is no conclusion.
If one, however, tries several approaches and elaborates on the one that works best, it is extremely unlikely to fail.
Michel wrote: There is another argument for following the A0 approach: comparing different alternatives in a statistically sound way is very hard and requires enormous resources.
Not necessarily. Sometimes a match of 10 games is enough to show with > 95% confidence which engine is best. Of course we are not interested in finding something that only works marginally better. We are interested in finding something that works orders of magnitude better. Like achieving with 3 residual blocks and 64 filters what otherwise would take 20 residual blocks and 256 filters.
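For concreteness, here is a minimal sketch of the statistics behind that claim (a hypothetical illustration, ignoring draws; nothing LC0-specific): under the null hypothesis that both engines are equally strong, every decisive game is a fair coin flip, so a lopsided score quickly becomes implausible.

```python
from math import comb

# Sketch of the "10 games can be enough" claim (draws ignored;
# hypothetical code, not part of LC0). Under the null hypothesis of
# equal strength, each decisive game is won with p = 0.5.
def p_value(wins, games, p=0.5):
    """Probability of at least `wins` wins in `games` games by chance."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(wins, games + 1))

print(p_value(9, 10))  # ~0.011, i.e. ~99% confidence the winner is stronger
```

So a 9-1 score already clears the 95% bar comfortably; a difference of the size described above would show up even faster.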
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: about an idea to learn better for LC0

Post by Henk »

Comparing two totally different networks is expensive if you first have to train them for many days before you can compare them. And even then you still don't know how they will progress in the future.
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: about an idea to learn better for LC0

Post by Robert Pope »

hgm wrote: If one, however, tries several approaches and elaborates on the one that works best, it is extremely unlikely to fail.

...
Not necessarily. Sometimes a match of 10 games is enough to show with > 95% confidence which engine is best. Of course we are not interested in finding something that only works marginally better. We are interested in finding something that works orders of magnitude better. Like achieving with 3 residual blocks and 64 filters what otherwise would take 20 residual blocks and 256 filters.
But there's the rub.

1. Trying a different approach takes as long to validate as the first. Even just a million games can take up to a week to run and process. And what do you know at the end? Only how that approach performed for the first million games of training. It may have reached that level faster or slower than the first approach, but that tells you nothing about the limit. So really, you have to take every approach extremely far to know if it is better or worse than another.

2. What idea do you have that is likely to work orders of magnitude better/faster? For two orders of magnitude, you could make 100 attempts and still come out ahead. But what if you didn't find anything that is orders of magnitude faster? Then you are a year further on and have absolutely nothing to show, because you didn't pursue any of them to completion.

3. There's nothing preventing you or anyone else from trying a better approach. Complaining that one team decided to take approach A instead of first testing B, C, D, and E, is a bit like complaining that Stockfish is spending its time on evaluation and extension tweaks instead of starting a research group to find the grand unifying theory of chess, which would make all chess engines obsolete.

I may think that the fastest way to train a deep net is to train on random 3-piece positions and gradually increase the number of pieces. Great. But if I want to do that, it's on me - I shouldn't expect other people to drop what they are doing to prove that it doesn't work nearly as well.
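For illustration only, a toy sketch of that curriculum idea (all names hypothetical; legality details such as checks and pawn placement are ignored):

```python
import random

# Toy sketch of the curriculum idea above (hypothetical, not an LC0
# feature): random piece placements with a growing piece count.
PIECES = "QRBNPqrbnp"  # uppercase = white, lowercase = black

def random_position(n_extra):
    """Two kings plus n_extra random men on distinct squares."""
    squares = random.sample(range(64), n_extra + 2)
    men = {squares[0]: "K", squares[1]: "k"}
    for sq in squares[2:]:
        men[sq] = random.choice(PIECES)
    return men

# Curriculum: start from 3-piece positions, then 4, 5, ...
for total_pieces in range(3, 33):
    batch = [random_position(total_pieces - 2) for _ in range(1024)]
    # train_step(net, batch)  # hypothetical training hook
```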
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: about an idea to learn better for LC0

Post by hgm »

The strategy you advocate is fundamentally flawed. The point is that different network topologies might have wildly different performance. In such a situation it pays to try things in parallel. Suppose there are two methods, A requiring effort 1 and B requiring effort 3, but I don't know which is the fast one and which the slow one. If I randomly pick one and stick to it, it will take me on average 2 units of work to get a solution. If I start working on both methods in parallel, I get the solution after spending effort 1 on A, and in the meantime I will also have wasted 1 unit of work on B. So also 2 units of work. The factor 3 is the cross-over point; when the methods differ by more than that factor, working on them in parallel will on average take less work.
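Spelled out, with f the unknown ratio between the slow and the fast effort:

```latex
% Crossover arithmetic from the paragraph above: efforts 1 and f, f unknown.
E_{\text{serial}} = \tfrac{1}{2}(1 + f), \qquad
E_{\text{parallel}} = 2 \cdot 1 = 2, \qquad
E_{\text{parallel}} < E_{\text{serial}} \iff f > 3 .
```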

Basically this is a depth-first vs breadth-first issue. Usually depth-first performs very poorly in vast search spaces, which is why Chess engines use iterative deepening to effectively turn it into a breadth-first search, so that in every new iteration you can start with the most promising line.

The AlphaZero network topology (3x3 convolutions) seems very ill-adapted to Chess, because many (if not most) moves in Chess are not contained in a 3x3 area. That means you need many layers in order to be able to see that Ra1 attacks Qa8. And even then you can only see it if lots of filters in the traversed layers are dedicated to transporting piece locations to the cell that can integrate them. A single layer with filters examining 8x1 or 1x8 areas, or diagonals, would do the same job far more efficiently; a factor of ~10 sounds plausible. This is way above the threshold where parallel search of alternatives outperforms serial search.
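A sketch of the arithmetic, plus what such line-shaped filters might look like (PyTorch used purely for illustration; the 12-plane encoding and the (1,8)/(8,1) kernels are assumptions, not LC0's actual layers):

```python
import torch
import torch.nn as nn

# Receptive-field arithmetic behind the Ra1-attacks-Qa8 example: each
# 3x3 convolution widens the receptive field by 2 squares, so covering
# a full 8-square file takes 4 stacked layers (1 -> 3 -> 5 -> 7 -> 9).
rf, layers = 1, 0
while rf < 8:
    rf, layers = rf + 2, layers + 1
print(layers)  # 4

# The hypothetical alternative: one layer of file- and rank-shaped
# filters sees a whole line at once (illustrative shapes only).
board = torch.zeros(1, 12, 8, 8)                   # assumed 12-plane encoding
rank_conv = nn.Conv2d(12, 32, kernel_size=(1, 8))  # spans a whole rank
file_conv = nn.Conv2d(12, 32, kernel_size=(8, 1))  # spans a whole file
print(rank_conv(board).shape, file_conv(board).shape)
```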

I am not trying to convince the LC0 team of anything. It is their project, and they can do what they want. Neither am I planning to do a similar project in my own way, for Chess. But that doesn't mean I have to think the LC0 team is doing a good job, or to agree with you when you say that what they do is the logical thing to do.
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: about an idea to learn better for LC0

Post by Robert Pope »

hgm wrote: The strategy you advocate is fundamentally flawed. The point is that different network topologies might have wildly different performance. In such a situation it pays to try things in parallel. Suppose there are two methods, A requiring effort 1 and B requiring effort 3, but I don't know which is the fast one and which the slow one. If I randomly pick one and stick to it, it will take me on average 2 units of work to get a solution. If I start working on both methods in parallel, I get the solution after spending effort 1 on A, and in the meantime I will also have wasted 1 unit of work on B. So also 2 units of work. The factor 3 is the cross-over point; when the methods differ by more than that factor, working on them in parallel will on average take less work.
Except A0 has demonstrated that one particular approach works reasonably well, so they aren't starting at some random unknown.

But here we have 100+ combinations of topologies and training-data approaches that may take 20-100 units of effort each. If we split our effort across all of them searching for the 20-unit one, then after 1000 units we aren't even halfway to finishing the best one, yet we could have completed the worst one 10 times over.

And what if the effort is 80-100 for best vs worst? Then you've increased your effort to roughly 10,000 units instead of 100 by going in parallel. And you can't drop bad performers early, because it may just be that their increased capability means they need more examples.
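A quick check of that scenario's numbers (same hypothetical figures as above):

```python
# Parallel vs serial cost when the best/worst spread is narrow.
n_combos = 100
e_best, e_worst = 80, 100          # hypothetical effort range

parallel = n_combos * e_best       # run everything until the best finishes
serial = (e_best + e_worst) / 2    # pick one at random and finish it

print(parallel, serial)  # 8000 vs 90 -> roughly the 10,000-vs-100 contrast
```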

If you want the fastest development, that's an alpha-beta searcher with a simple evaluation: 1500+ Elo in one day - orders of magnitude faster, so there's no point in neural nets in the first place.
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: about an idea to learn better for LC0

Post by Henk »

If you know what pattern you want to detect, you don't need a neural network at all. A neural network is only for doing the black-box stuff: you know only the input and the output, and the method is hidden/unknown.

The same goes for whether knights, pawns and kings are different from bishops, queens and rooks, or not.
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: about an idea to learn better for LC0

Post by hgm »

Robert Pope wrote: Except A0 has demonstrated that one particular approach works reasonably well, so they aren't starting at some random unknown.
I wouldn't go so far as to say that. All we know is that it worked. But it required an incredible amount of calculation effort. So perhaps we should say that they demonstrated it worked rather poorly. 20-40 residual blocks, each of several layers, seems unreasonably large for something as simple as a Chess evaluation.

And it doesn't seem they spent any effort on picking a network topology suitable for Chess, rather than a poor one. They just used what worked for Go. Which, in their situation (i.e. unlimited computational resources) was probably the smart thing to do. But using a Go-optimized system to do Chess can be considered a sort of random choice (as Go is a randomly chosen non-chess-like game).
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: about an idea to learn better for LC0

Post by Robert Pope »

hgm wrote:
Robert Pope wrote: Except A0 has demonstrated that one particular approach works reasonably well, so they aren't starting at some random unknown.
I wouldn't go so far as to say that. All we know is that it worked. But it required an incredible amount of calculation effort. So perhaps we should say that they demonstrated it worked rather poorly. 20-40 residual blocks, each of several layers, seems unreasonably large for something as simple as a Chess evaluation.

And it doesn't seem they spent any effort on picking a network topology suitable for Chess, rather than a poor one. They just used what worked for Go. Which, in their situation (i.e. unlimited computational resources) was probably the smart thing to do. But using a Go-optimized system to do Chess can be considered a sort of random choice (as Go is a randomly chosen non-chess-like game).
Fair enough. And I guess I would leave it at this: the goal of the Leela team seems to be to validate that a zero-knowledge approach does indeed work for chess, not to find the best or fastest or smallest way to use neural nets in chess. And once you start playing around with what works best for chess, you've strayed from that initial vision.
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: about an idea to learn better for LC0

Post by hgm »

Not really. If your goal is to show it works for Chess, you have to find at least one way to make it work, and although you might not need the very best way to do it, it would very much hinder you from reaching that goal at all if you were using a comparatively very poor way.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: about an idea to learn better for LC0

Post by gladius »

hgm wrote: The strategy you advocate is fundamentally flawed. The point is that different network topologies might have wildly different performance. In such a situation it pays to try things in parallel. Suppose there are two methods, A requiring effort 1 and B requiring effort 3, but I don't know which is the fast one and which the slow one. If I randomly pick one and stick to it, it will take me on average 2 units of work to get a solution. If I start working on both methods in parallel, I get the solution after spending effort 1 on A, and in the meantime I will also have wasted 1 unit of work on B. So also 2 units of work. The factor 3 is the cross-over point; when the methods differ by more than that factor, working on them in parallel will on average take less work.

Basically this is a depth-first vs breadth-first issue. Usually depth-first performs very poorly in vast search spaces, which is why Chess engines use iterative deepening to effectively turn it into a breadth-first search, so that in every new iteration you can start with the most promising line.

The AlphaZero network topology (3x3 convolutions) seems very ill-adapted to Chess, because many (if not most) moves in Chess are not contained in a 3x3 area. That means you need many layers in order to be able to see that Ra1 attacks Qa8. And even then you can only see it if lots of filters in the traversed layers are dedicated to transporting piece locations to the cell that can integrate them. A single layer with filters examining 8x1 or 1x8 areas, or diagonals, would do the same job far more efficiently; a factor of ~10 sounds plausible. This is way above the threshold where parallel search of alternatives outperforms serial search.

I am not trying to convince the LC0 team of anything. It is their project, and they can do what they want. Neither am I planning to do a similar project in my own way, for Chess. But that doesn't mean I have to think the LC0 team is doing a good job, or to agree with you when you say that what they do is the logical thing to do.
It definitely makes sense to experiment with different network topologies. That's one of the most fun parts of machine learning! The problem is just that it requires a pretty incredible amount of effort to do this right now. 3x3 convolutions are very well supported by the ecosystem, and have had huge optimizations done on them. When you start to stray outside that, it gets harder.

It's not like the effort to train the current network is wasted either - you could use these games as input to the training for the new network, and it would be better than using normal games (as you have the policies for all root moves to train towards).
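For readers unfamiliar with the format, a sketch of the AlphaZero-style training targets this refers to (assumed form, not LC0's actual training code):

```python
import torch
import torch.nn.functional as F

# AlphaZero-style loss sketch: each stored position carries the full
# MCTS visit-count distribution over moves, not just the move played,
# which is why old self-play games remain useful for a new network.
def az_loss(policy_logits, value, mcts_policy, game_result):
    # Cross-entropy against the visit-count distribution over all moves...
    policy_loss = -(mcts_policy * F.log_softmax(policy_logits, dim=1)).sum(1).mean()
    # ...plus mean-squared error against the final game result.
    value_loss = F.mse_loss(value.squeeze(1), game_result)
    return policy_loss + value_loss
```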

If someone is interested in machine learning and wants to experiment with this though, please do!