Progress on Loki

niel5946 · Post by **niel5946** » Sat May 29, 2021 8:06 pm

mvanthoor wrote: ↑Sat May 29, 2021 5:23 pm Thanks. Don't worry about embarrassment. If you put anything out there, someone, somewhere, will find things wrong with it. I had to remove a release myself. I wasn't really sure about the meager Elo and speed gain by the TT. In the end I posted a topic to see if anything was wrong. One of the comments about one piece of code was: "But that's all wrong! It should be..." Damn.

You're right. I actually find that embarrassment can sometimes be a motivator to get things done. If I had figured out by myself that Loki didn't have the right Elo for its feature set, the release of Loki 3.0 would probably have taken longer and not been as strong.

mvanthoor wrote: ↑Sat May 29, 2021 5:23 pm I wish I had more time to work on Rustic besides an evening hour here or there. Writing the book about the engine takes even more time than I thought as well. Not even talking about testing. It would be very helpful if I had another computer. (I'm waiting for AMD's AM5 socket, or a successor to Intel's X299. Then testing will be a LOT faster.) No matter; it'll get there.

I checked out your book some time ago, and I must say, I really like it. I think it's good that you, and many chess programmers in general, are so inclusive to newcomers as to document nearly all your/their work.
Regarding testing, I think the main thing that makes my development faster is that I test with SPRT at 10s+0.1s, whereas you test at longer, more realistic TCs.

mvanthoor wrote: ↑Sat May 29, 2021 5:23 pm PS: I just noticed you even seem to like my signature enough to copy it

Yeah

I figured I had to have one, and yours was nice and simple.

niel5946 · Post by **niel5946** » Sat May 29, 2021 8:11 pm

Easy training
Today, I implemented a command for running training from the command line. Its format is the following:

Code:

		Command format: "learn ...":
		Obligatory hyperparameters (in this order):
			- dataset: string
				The csv file containing all the training data.
			- epoch: int [1;+∞]
				Amount of iterations to run the optimization algorithm for.
			- batchsize: int [1;size of dataset]
				The amount of datapoints in a single gradient estimation.
			- loss: string, mse or aae.
				The loss function to use. Either mean squared error (mse) or average absolute error (aae). Note: This should be rather easy to expand upon.
			- threads: int [1;+∞]
				Threads to use. Note: A lot of threads will take up a big portion of memory, so one is advised to be conservative with this number.
		Optional hyperparameters:
			- eta: float [> 0.0], default = 0.01
				initial learning rate
			- eta_decay: float [>= 0.0], default = 0.0001
				Decay of the learning rate after each iteration (for example, eta_decay 0.5 will halve eta each iteration).
				Helpful to avoid passing over a minimum.
			- min_param: float [-∞;+∞], default = -2.0
				Minimum value of parameters if a new net should be trained (randomly initialized).
				Note: If a min_param is passed, a max_param should also be.
			- max_param: float [> min_param], default = 2.0
				Maximum value of parameters if a new net should be trained (randomly initialized).
				Note: If a max_param is passed, a min_param should also be.
			- format: CSV or BIN, default = BIN.
				The format that the network should be saved in. Either a .csv file or a .lnn binary file.
			- output: string, default = "LokiNet-<Date and time>.lnn/.csv"
				The output file that the network should be saved to.
				Note: This needs to match the output format if one is given.
			- net: string, default = ""
				An existing network to train further. If this isn't passed, the algorithm will randomly initialize a new network and train that.
		Example of command:
			learn dataset C:\\Users\\username\\trainingset.csv epoch 1000 batchsize 14500 loss mse threads 4 eta 0.001 eta_decay 0.1 output C:\\Users\\username\\output.lnn

			This will train a new network for 1000 epochs, with a batch size of 14500 and mean squared error as the loss function. It will use 4 threads, a learning
			rate of 0.001, a learning rate decay of 0.1, and save the network to C:\\Users\\username\\output.lnn.
			The training set used will be C:\\Users\\username\\trainingset.csv

There is one obscure bug though: in gcc-compiled executables, the output file name is parsed incorrectly, so it always ends up as "+".
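To make the keyword/value format above concrete, here is a minimal Python sketch of how such a command line could be parsed and validated. It is not Loki's actual parser (Loki's own code is not shown in this thread); the `LearnOptions` class and `parse_learn` function are hypothetical names, the defaults and ranges are copied from the help text, and the sketch assumes paths without spaces and does not enforce the order of the obligatory parameters.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LearnOptions:
    """Hypothetical container; field names mirror the keywords in the help text."""
    dataset: str
    epoch: int
    batchsize: int
    loss: str
    threads: int
    eta: float = 0.01
    eta_decay: float = 0.0001
    min_param: float = -2.0
    max_param: float = 2.0
    format: str = "BIN"
    output: Optional[str] = None  # None: caller falls back to the dated default name
    net: str = ""


REQUIRED = ("dataset", "epoch", "batchsize", "loss", "threads")
CASTS = {
    "epoch": int, "batchsize": int, "threads": int,
    "eta": float, "eta_decay": float, "min_param": float, "max_param": float,
}
KNOWN = {f.name for f in fields(LearnOptions)}


def parse_learn(command: str) -> LearnOptions:
    """Parse 'learn <keyword> <value> ...' into validated options."""
    tokens = command.split()  # assumption: no spaces inside file paths
    if not tokens or tokens[0] != "learn":
        raise ValueError("command must start with 'learn'")
    args = tokens[1:]
    if len(args) % 2 != 0:
        raise ValueError("every keyword needs exactly one value")

    values = {}
    for key, raw in zip(args[0::2], args[1::2]):
        if key not in KNOWN:
            raise ValueError(f"unknown keyword: {key}")
        values[key] = CASTS.get(key, str)(raw)

    missing = [k for k in REQUIRED if k not in values]
    if missing:
        raise ValueError(f"missing obligatory hyperparameters: {missing}")
    # The help text says min_param and max_param must be passed together.
    if ("min_param" in values) != ("max_param" in values):
        raise ValueError("min_param and max_param must be passed together")

    opts = LearnOptions(**values)
    if opts.epoch < 1 or opts.batchsize < 1 or opts.threads < 1:
        raise ValueError("epoch, batchsize and threads must be >= 1")
    if opts.loss not in ("mse", "aae"):
        raise ValueError("loss must be 'mse' or 'aae'")
    if opts.eta <= 0.0 or opts.eta_decay < 0.0:
        raise ValueError("eta must be > 0 and eta_decay must be >= 0")
    if opts.max_param <= opts.min_param:
        raise ValueError("max_param must be greater than min_param")
    if opts.format not in ("CSV", "BIN"):
        raise ValueError("format must be CSV or BIN")
    return opts
```

Passing the example command from the help text to `parse_learn` yields 1000 epochs, a batch size of 14500, `mse` loss, 4 threads, and the default `BIN` format. Validating each token explicitly like this also makes it easier to spot a value that was parsed into the wrong field.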

mvanthoor · Post by **mvanthoor** » Sat May 29, 2021 10:31 pm

niel5946 wrote: ↑Sat May 29, 2021 8:06 pm I checked out your book some time ago, and I must say, I really like it. I think it's good that you, and many chess programmers in general, are so inclusive to newcomers as to document nearly all your/their work.
Regarding testing, I think the main thing that makes my development faster is that I test with SPRT at 10s+0.1s, whereas you test at longer, more realistic TCs.

Thanks

I've just written an update again, about move ordering. (The "why move ordering" part.) Before proceeding further with the engine, I'm going to write that section because it's nice and simple, and self-contained. Maybe I'll even get it done this weekend. I'll have to write the part immediately after the feature is done, or it'll get behind schedule. I still have all the stuff from Alpha 1 to write.

With regard to testing: I've had it with the slower time controls already. I was testing at 1m+0.6s, against 10-12 engines, 50 games per engine. After adding the history heuristic and testing it in a bunch of positions (yes, it cuts down on the nodes), the result was basically 1 Elo. Strangely enough, some engines got BETTER against the version that included the killers and the history heuristic. It seems it's possible for an engine to sit at +30 Elo for half the match, and then end at -20 because it loses 80% of the second half. If engine X plays against Alpha 2, and the same X against the development version that should be +50 Elo, it could even be that X scores better against the development version.

Then I ran two gauntlets: one engine against all Rustic versions, 500 games per match at 5s+0.1s, with two different engines. Now it is very clear that the killer moves make the engine stronger, and the history heuristic does as well. I think I'm going to retest Alpha 2 and 1 as well. I won't be able to predict the CCRL rating anymore, but I would be able to at least give an estimate about the improvement of one version to the next.

Somewhere during the weekend I'll post a progress post myself.

niel5946 · Post by **niel5946** » Mon May 31, 2021 1:54 pm

mvanthoor wrote: ↑Sat May 29, 2021 10:31 pm Thanks I've just written an update again, about move ordering. (The "why move ordering" part.) Before proceeding further with the engine, I'm going to write that section because it's nice and simple, and self-contained. Maybe I'll even get it done this weekend. I'll have to write the part immediately after the feature is done, or it'll get behind schedule. I still have all the stuff from Alpha 1 to write.

I just checked out the update, and it looks great. I like the way you use human thought processes to motivate the use of move ordering.

mvanthoor wrote: ↑Sat May 29, 2021 10:31 pm Then I ran two gauntlets: one engine against all Rustic versions, 500 games per match at 5s+0.1s, with two different engines. Now it is very clear that the killer moves make the engine stronger, and the history heuristic does as well. I think I'm going to retest Alpha 2 and 1 as well. I won't be able to predict the CCRL rating anymore, but I would be able to at least give an estimate about the improvement of one version to the next.

I think this is the best way to go regarding determining improvement or regression of strength. Using very short time controls, you can get a lot of games played, which makes the results statistically more reliable. The only thing to keep in mind is to make sure the engines you're testing against are made to play with such fast time controls. Before releasing version 3.0.0, I tested against Raven and MadChess 2, and Erik Madsen then notified me about this exact problem. For some reason, I was lucky enough for the results to be at least partly reliable, but it could have gone the other way too.
What I do is test each change against the latest commit (best dev version so far), and use the SPRT result to determine whether or not it is an improvement. Then, before I release, I run a gauntlet tournament against a pool of other engines to get a more exact/reliable Elo gain.
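For readers unfamiliar with SPRT, here is a rough Python sketch of the stopping rule, using a normal approximation to the log-likelihood ratio. It is an illustration only, not the implementation of any particular testing tool: the function names, the hypotheses `elo0=0` and `elo1=5`, and the error rates `alpha=beta=0.05` are all assumptions chosen for the example.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score per game under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def sprt_llr(wins: int, draws: int, losses: int, elo0: float, elo1: float) -> float:
    """Approximate log-likelihood ratio of H1 (elo1) versus H0 (elo0).

    Uses the normal approximation
        LLR ~= N * (s1 - s0) * (2 * mean - s0 - s1) / (2 * var)
    where mean and var are the per-game sample mean and variance of the score.
    """
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    if var == 0.0:
        return 0.0  # all results identical: no usable variance estimate yet
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_decision(wins: int, draws: int, losses: int,
                  elo0: float = 0.0, elo1: float = 5.0,
                  alpha: float = 0.05, beta: float = 0.05) -> str:
    """Return 'accept H1', 'accept H0' or 'continue' (Wald's bounds)."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"
```

The match is checked after every game (or batch of games) and stops as soon as the ratio crosses either bound, which is why clearly good or clearly bad changes finish quickly while marginal ones need many games.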

amanjpro · Post by **amanjpro** » Mon May 31, 2021 2:11 pm

niel5946 wrote: ↑Mon May 31, 2021 1:54 pm What I do is that I test each change against the latest commit (best dev version so far), and use the SPRT result to determine whether or not it is an improvement. Then before I release, I run a gauntlet tournament against a pool of other engines to get a more exact/reliable elo gain.

I on my end have a bunch of test engines. For each significant change, I run a series of 200-game matches: first against the current best dev version (master); if it passes, against the latest stable version; if it passes that, against a test engine, then another, and so on. So, I basically have a fail-fast testing mechanism, which is very helpful for me, as I use a "long" time control of 1m+1s.

I have some problems though. For example, my master (aka current best dev version) is supposedly 50 Elo stronger than Zahak 3.0, and the dev version that I work on is 30 Elo stronger than master, but only 20 Elo stronger than Zahak 3.0. I honestly cannot understand this, which is why I involve other engines in testing.

Maybe I should use a faster time control, but I believe Zahak is not good at very fast time controls, as it relies more on pruning than on NPS to gain its Elo.

niel5946 · Post by **niel5946** » Mon May 31, 2021 2:16 pm

Training new networks
Yesterday, I tried running a training test to see if the network was able to learn successfully, and I was positively surprised.

Since it was just a test, I didn't bother using a big/good dataset, so I used one with 3.2M positions, that I got from Jay Honnold, and had Loki analyze them with HCE (not search because it needed to be quick). Then, with a batch size of 14500 and learning rate of 0.001, I ran the training for 11 epochs. Then, I made Loki - with the net loaded - analyze the starting position, and it gave the following PV:
[pgn]1. c4 h6 2. e3 d6 3. Nc3 Nc6 4. Nf3 Rb8 5. Rb1 a6 6. a3[/pgn]
Now, many of these moves do look idiotic, especially h6, Rb8 and Rb1, but I think the net is showing good potential. I am really impressed that, after only 11 epochs with a tiny dataset, the net has already learned that knights should be developed towards the center, that c4 is a good opening move, that a6 and h6 are good since they inhibit the bishops' movements, etc.

I also made it analyze the following position arising from the French Defense:
[d]rnbq1rk1/pppn1ppp/4p3/3pP3/1b1P4/2NB1N2/PPP2PPP/R1BQK2R w KQ - 5 7
Here, we know that Bxh7 is the best move (the Greek gift), and I like to check this position with Loki to see how effective the engine is at seeing sacrifices. The net didn't see it, but it did see O-O, which isn't a bad move either. I think the thing to take away from this is that the net has at least partly learned to recognize king safety, enough for it to consider castling.

Currently, I am generating a real training set with 60M positions from the lichess database (I don't know if this will be enough though). I will make Loki analyze it with quiescence or depth 1/2 and use these scores to train the network. This isn't nearly enough, but at the moment, I just want to see if I can get the net to play at Loki's current strength.
The good thing about using lichess for training the network is that a lot of the positions are really unbalanced. This is usually something one wants to avoid for normal tuning of the HCE since it'll make material dominate the evaluations, but for a network it is needed. This is because a normal HCE should be able to distinguish between a pretty balanced position and a totally losing one, but this is in no way guaranteed for neural networks that are randomly initialized.

I have made an estimate of the training time for five epochs with the 60M dataset, and it will probably take 3 weeks on my 8-thread machine.

I think I'm going to solve this problem by renting a server with 50+ threads for a week or so and using that instead.
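As a back-of-the-envelope check on the server plan, the snippet below scales the 3-week estimate from 8 to 50 threads. It is only a sketch: `scaled_training_days` is a hypothetical helper, and the `parallel_fraction` parameter (Amdahl's law) is an assumption, since how well the training actually parallelizes is not stated here.

```python
def scaled_training_days(baseline_days: float,
                         baseline_threads: int,
                         target_threads: int,
                         parallel_fraction: float = 1.0) -> float:
    """Estimate training time on a different thread count using Amdahl's law.

    Run time with t threads is modeled as proportional to
    (1 - p) + p / t, where p is the fraction of the work that parallelizes.
    p = 1.0 means perfectly linear scaling (an optimistic assumption).
    """
    serial = 1.0 - parallel_fraction
    baseline_cost = serial + parallel_fraction / baseline_threads
    target_cost = serial + parallel_fraction / target_threads
    return baseline_days * target_cost / baseline_cost


# 3 weeks (21 days) on 8 threads, moved to a 50-thread server.
ideal = scaled_training_days(21.0, 8, 50)             # perfect scaling
cautious = scaled_training_days(21.0, 8, 50, 0.9)     # 10% serial work (assumed)
```

With perfect scaling the five epochs would take about 3.4 days, so a one-week rental would be enough; if even 10% of the work were serial, the same job would take well over a week.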

P.S. I hadn't thought about the size of all these dataset files, so yesterday my machine ran out of disk space because of them.

niel5946 · Post by **niel5946** » Mon May 31, 2021 2:27 pm

amanjpro wrote: ↑Mon May 31, 2021 2:11 pm I on my end have a bunch of test engines. For each significant change, I run a 200 games long match with: first current best dev, (master) if it passes, against latest, if it passes, against a test engine, and the other, and the other... So, I basically have a fail fast testing mechanism. Which for me is very helpful, as I use a "long" time control of 1m+1s

The method of testing does sound good (especially the way you're making it fail fast), but I do have one concern: 200 games isn't nearly enough for any statistical significance of the results. That is why hyperbullet TCs are usually used.

amanjpro wrote: ↑Mon May 31, 2021 2:11 pm I have some problems though, for example my master (aka current best dev version) is supposedly 50 elos stronger than Zahak 3.0, and my dev version that I work on is 30 elos stronger than master, but only 20 elos stronger than Zahak 3.0. I honestly cannot understand this, that is why I involve other engines to test.

Well, considering you're only using 200 games to test the changes, I am not sure you can really rely on those results. With so few games, the current dev version could very well be either +30 from master or only +20 from 3.0.
Unless I am missing something?
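To put a number on this, the following Python sketch computes an Elo estimate with an approximate 95% margin from a win/draw/loss record. The function names are made up for the example, and the 30% draw rate used in the usage note is an assumption, not a figure from these matches.

```python
import math


def elo_from_score(score: float) -> float:
    """Convert an expected score in (0, 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, lower, upper) for a match result.

    The interval is a normal approximation: the per-game score variance is
    estimated from the results, and the score +/- z standard errors is then
    converted to Elo. z = 1.96 corresponds to roughly 95% confidence.
    """
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)

    def clamp(s: float) -> float:
        # Keep the score strictly inside (0, 1) so the logarithm is defined.
        return min(max(s, 1e-6), 1.0 - 1e-6)

    return (elo_from_score(clamp(score)),
            elo_from_score(clamp(score - z * stderr)),
            elo_from_score(clamp(score + z * stderr)))
```

For an even 200-game match with 30% draws, `elo_with_margin(70, 60, 70)` gives a margin of roughly ±40 Elo, so measured differences of +20 or +30 are well inside the noise. With 20000 games at the same ratios (`elo_with_margin(7000, 6000, 7000)`), the margin shrinks to about ±4 Elo.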

amanjpro wrote: ↑Mon May 31, 2021 2:11 pm Maybe I should use faster time control, but I believe Zahak is not good for very fast time controls, as it relies more on pruning than NPS to gain its elos

I would strongly advise you to do that. The more games the better, and this way you can test versions of Zahak against each other with reliable results, instead of always having to test against a big group of other engines.
NPS and Zahak's TC strength-scalability don't matter much this way, since both versions will be equally inhibited by the short time control. Therefore, the only difference would be the change you're testing.
There's one thing to keep in mind though: if you're testing depth-dependent pruning/reduction methods, you will have to account for that with the time control. If Zahak doesn't often get to depth 7 in 10s+0.1s, and that is where IID (as an example) kicks in, you need to make the TC longer to test that feature.

amanjpro · Post by **amanjpro** » Mon May 31, 2021 2:33 pm

niel5946 wrote: ↑Mon May 31, 2021 2:27 pm
amanjpro wrote: ↑Mon May 31, 2021 2:11 pm I on my end have a bunch of test engines. For each significant change, I run a 200 games long match with: first current best dev, (master) if it passes, against latest, if it passes, against a test engine, and the other, and the other... So, I basically have a fail fast testing mechanism. Which for me is very helpful, as I use a "long" time control of 1m+1s
The method of testing does sound good (especially the way you're making it fail fast), but I do have one concern: 200 games isn't nearly enough for any statistical significance of the results. That is why hyperbullet TCs are usually used.

amanjpro wrote: ↑Mon May 31, 2021 2:11 pm I have some problems though, for example my master (aka current best dev version) is supposedly 50 elos stronger than Zahak 3.0, and my dev version that I work on is 30 elos stronger than master, but only 20 elos stronger than Zahak 3.0. I honestly cannot understand this, that is why I involve other engines to test.
Well, considering you're only using 200 games to test the changes, I am not sure you can really rely on those results. With so few games, the current dev version could very well be either +30 from master or only +20 from 3.0.
Unless I am missing something?

amanjpro wrote: ↑Mon May 31, 2021 2:11 pm Maybe I should use faster time control, but I believe Zahak is not good for very fast time controls, as it relies more on pruning than NPS to gain its elos
I would strongly advise you to do that. The more games the better, and this way you can test versions of Zahak against each other with reliable results, instead of always having to test against a big group of other engines.
NPS and Zahak's TC strength-scalability don't matter much this way, since both versions will be equally inhibited by the short time control. Therefore, the only difference would be the change you're testing.
There's one thing to keep in mind though: if you're testing depth-dependent pruning/reduction methods, you will have to account for that with the time control. If Zahak doesn't often get to depth 7 in 10s+0.1s, and that is where IID (as an example) kicks in, you need to make the TC longer to test that feature.

It is 200 games against each engine/version. I have the results of the latest stable version against all test engines, and (for good dev versions) I eventually have games against all the test engines as well, so my conclusion is based on the sum. Per increment, I have around 2000 games or so.

mvanthoor · Post by **mvanthoor** » Mon May 31, 2021 2:42 pm

niel5946 wrote: ↑Mon May 31, 2021 1:54 pm I just checked out the update, and it looks great. I like the way you use human thought processes to motivate the use of move ordering.

More updates are coming. I've decided to write each part immediately after I merge it into master, and retrospectively add the parts about the move generator, board representation, etc...

niel5946 wrote: ↑Mon May 31, 2021 1:54 pm I think this is the best way to go regarding determining improvement or regression of strength. Using very short time controls, you can get a lot of games played which will be statistically more reliable

I did notice that some of the slower engines, such as WukongJS (in JavaScript), _completely_ tank in the faster time controls. I assume this is because in a longer time control, they have enough time to "catch up" on the search depth while a faster engine such as Rustic is struggling to complete the next depth (and will probably fail). In super fast time controls, the slower engines don't have this chance.

The engines that have more knowledge (in my gauntlet, MinimalChess) will suddenly perform MUCH better than on longer time controls.

niel5946 wrote: ↑Mon May 31, 2021 1:54 pm The only thing to keep in mind is to make sure the engines you're testing against are made to play with such fast time controls. Before releasing version 3.0.0, I tested against raven and Madchess 2, and Erik Madsen then notified me about this exact problem. For some reason, I was lucky enough for the results to be at least partly reliable, but it could have gone the other way too.

Eh... lol. I noticed. As I said, at the slower time controls, Rustic's great speed (compared to engines in the same stage of development) is a huge advantage, except if it's outsmarted by an engine that just knows more.

niel5946 wrote: ↑Mon May 31, 2021 1:54 pm What I do is that I test each change against the latest commit (best dev version so far), and use the SPRT result to determine whether or not it is an improvement. Then before I release, I run a gauntlet tournament against a pool of other engines to get a more exact/reliable elo gain.

I'm moving in the same direction.

mvanthoor · Post by **mvanthoor** » Mon May 31, 2021 2:50 pm

amanjpro wrote: ↑Mon May 31, 2021 2:11 pm Maybe I should use faster time control, but I believe Zahak is not good for very fast time controls, as it relies more on pruning than NPS to gain its elos

Correct. I use Zahak in my testing (0.2.1 and 0.3.0 for now), and while both are as strong as or stronger than Rustic Alpha 2 (0.3.0 is still stronger than the current dev version), they both completely tank on fast time controls. Rustic outperforms them by over +100 Elo.
