Similarity Report - d=1 move chooser rarity

chrisw · Post by **chrisw** » Tue Oct 01, 2019 4:20 pm

Below is a sorted table with a measure of how often the first move choice of an engine (at d=1) matches the first move choices of the general mass of (135 tested) engines. A low figure means the engine's first move choice tends to the unusual, a high figure that it tends to the usual (where usual means what the mass tends to do).
NB this is not an engine-engine test result; it's a one-engine-to-all-other-engines-at-once test result. Or, looking at it the other way round, a high figure for an engine means that lots of other engines agree with it.

The algorithm is fairly straightforward: for each move selection of an engine, it scores 1.0 for every agreeing move by any other engine. The result is divided by (count of engines x count of epds).
I might try splitting the "mass" up into two or three sub-groups indicated by the Simex similarity results and see if there is any further discrimination to be found.

Ethereal 11.50           0.239
Hiarcs 14                0.319
CRafty 20.1              0.372
ProDeo 2.2               0.376
Houdini 4                0.39
Ruffian 2                0.397
Rubichess 1.4            0.397
Hiarcs 13                0.402
Ethereal 11.00           0.406
Komodo  7                0.407
Komodo  8                0.408
Fruit 1.0                0.412
Ethereal 11.25           0.414
Fire 7.1                 0.424
Arasan 17                0.424
Andscacs 0.85            0.426
Andscacs 0.84            0.427
Andscacs 0.82            0.427
Andscacs 0.83            0.427
Komodo  6                0.428
Stockfish 10             0.429
Arasan 19                0.429
Arasan 18                0.429
Xiphos 0.4               0.429
Andscacs 0.81            0.43
Texel 1.07               0.43
Nemorino 2               0.432
Komodo CCT               0.432
Ethereal 9.00            0.433
Nemorino 1               0.435
Komodo  4                0.435
Komodo  3                0.435
Andscacs 0.87            0.435
Arasan 21.3              0.436
Cheng 3                  0.436
Komodo  5                0.436
Andscacs 0.93            0.437
Senpai 1                 0.437
Ethereal 10.55           0.437
Houdini 6                0.437
Gull 3                   0.439
Gull 1                   0.439
Ethereal 10.00           0.44
Ethereal 8.00            0.44
Ethereal 8.61            0.441
Schooner 2               0.442
Nirvana 2.0              0.442
Xiphos 0.5               0.442
Shredder 12              0.443
Xiphos 0.3               0.443
Ethereal 8.37            0.444
Ethereal 9.65            0.444
Ethereal 9.30            0.445
Komodo 12                0.445
Komodo  2                0.445
Xiphos 0.2               0.447
BrainFish                0.447
Fruit 1.5                0.448
Arasan 20                0.449
Stockfish  1             0.449
Fruit 2.3                0.45
Glaurung 2.2             0.451
Komodo 11                0.451
Houdini 5                0.453
Stockfish  9             0.453
Pedone 1.9               0.453
Nirvana 1.8              0.454
CRafty 24.1              0.454
Xiphos 0.1               0.455
AsmFish                  0.458
Rodent 3                 0.46
SugaR 1.0                0.46
CRafty 23.4              0.461
CRafty 23.3              0.461
Pedone 1.6               0.461
Komodo 10                0.462
CRafty 25.1              0.462
Stockfish  2             0.462
Wasp 3.75                0.463
Toga 3.0                 0.465
Gull 2                   0.465
Toga 1.2                 0.467
Hannibal 1.7             0.467
Fruit 2.0                0.468
Komodo  9                0.469
DiscoCheck               0.471
Equinox 3.20             0.472
Bouquet 1.8              0.472
Equinox 3.3              0.473
Toga 3.1                 0.475
Critter 1.4              0.475
Stockfish  8             0.476
Critter 1.6              0.476
Rybka 1                  0.476
Shredder 13              0.476
Stockfish  7             0.479
Bouquet 1.5              0.48
Houdini 1.5              0.483
Nirvana 2.4              0.483
Strelka 5                0.483
GAmbitFruit 1.1          0.483
Toga 1.3                 0.484
GAmbitFruit 1.0          0.484
Ippolit                  0.485
Houdini 1.0              0.485
Strelka 2                0.485
Doch 1.2                 0.485
Nirvana 2.2              0.485
Toga 4.1                 0.486
Robbolito                0.486
Critter 1.2              0.486
Komodo  1                0.487
Doch 0.98                0.488
Toga 1.0                 0.488
Wasp 3.00                0.489
Nirvana 2.3              0.489
Stockfish  3             0.49
Nirvana 2.1              0.49
Wasp 3.60                0.491
GAmbitFruit 4.b          0.492
Wasp 3.50                0.493
Sting 1.5                0.493
Wasp 1.01                0.494
Wasp 2.00                0.494
Wasp 2.60                0.494
Wasp 1.25                0.495
Stockfish  4             0.495
Fruit 2.1                0.495
Sting 9.6                0.495
Sting 8.8                0.495
GAmbitFruit 4.bx         0.496
Toga 1.1                 0.497
Stockfish  6             0.497
Stockfish  5             0.498
Fruit 2.2                0.498
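
The scoring rule described above the table can be sketched in a few lines of Python. This is only an illustration under assumptions: the `moves` layout (engine name mapped to a list of chosen moves, one per EPD position) and the name `mass_agreement` are hypothetical, and the divisor uses the number of *other* engines, which is one possible reading of "count of engines".

```python
from typing import Dict, List


def mass_agreement(moves: Dict[str, List[str]]) -> Dict[str, float]:
    """Score each engine by how often the other engines pick its move.

    moves[name][i] is the move engine `name` chose at d=1 for position i
    (hypothetical layout; the real data files are not shown in the thread).
    """
    names = list(moves)
    if len(names) < 2:
        raise ValueError("need at least two engines to compare")
    n_positions = len(moves[names[0]])
    scores = {}
    for name in names:
        others = [o for o in names if o != name]
        hits = 0
        for i in range(n_positions):
            # 1.0 for every other engine that chose the same move here
            hits += sum(1 for o in others if moves[o][i] == moves[name][i])
        # Assumption: normalise by the number of *other* engines
        scores[name] = hits / (len(others) * n_positions)
    return scores


if __name__ == "__main__":
    toy = {
        "A": ["e2e4", "g1f3", "d2d4"],
        "B": ["e2e4", "g1f3", "c2c4"],
        "C": ["e2e4", "b1c3", "c2c4"],
    }
    # Print lowest (most unusual) first, like the table in the post
    for name, score in sorted(mass_agreement(toy).items(), key=lambda kv: kv[1]):
        print(f"{name:<24} {score:.3f}")
```

Sorting ascending puts the most unusual move chooser at the top, which matches the ordering of the table.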

Ovyron · Post by **Ovyron** » Wed Oct 02, 2019 5:42 am

Thanks, this is a very interesting list. I consider Fizbo and Thinker the engines with some of the most original move choices I've ever seen, so it'd be interesting to see where they land.

Laskos · Post by **Laskos** » Wed Oct 02, 2019 10:50 am

Put Leela too.

chrisw · Post by **chrisw** » Wed Oct 02, 2019 11:10 am

Made some 'family' sub-groups of engines that tended to group together in the Simex network force-directed graphs.
Similarity scores are now based on the count of move matches of an engine with the moves of the sub-group. An engine that is itself one of the engines defining the sub-group will obviously find one fully matching line (its own), so that is a built-in bias in the results. Apart from that bias, I think the numbers ought to be directly comparable to the numbers from the mass.

50% or greater is the relatively arbitrary cutoff for inclusion in the matching list. Larger values mean more move matches.

1. Ippolit sub-group

Strelka 2 0.501
Toga 1.1 0.502
Gull 3 0.503
Fruit 2.2 0.504
Doch 1.2 0.505
Doch 0.98 0.507
Komodo  1 0.51
Houdini 4 0.52
Gull 1 0.525
Gull 2 0.615
Equinox 3.3 0.645
Equinox 3.20 0.646
Critter 1.6 0.67
Bouquet 1.8 0.686
Critter 1.4 0.688
Critter 1.2 0.69
Houdini 1.5 0.695
Strelka 5 0.7
Bouquet 1.5 0.704
Ippolit 0.718
Houdini 1.0 0.72
Robbolito 0.724

2. Fruity sub-group

Stockfish  3 0.5
CRafty 24.1 0.505
Gull 2 0.506
Wasp 2.00 0.507
Wasp 2.60 0.507
Sting 1.5 0.507
Equinox 3.3 0.508
Equinox 3.20 0.508
Sting 8.8 0.509
Sting 9.6 0.509
Wasp 1.25 0.511
CRafty 23.4 0.514
CRafty 23.3 0.514
Wasp 1.01 0.514
CRafty 25.1 0.515
Critter 1.6 0.531
Critter 1.4 0.533
Bouquet 1.8 0.537
Fruit 2.3 0.54
Critter 1.2 0.554
Bouquet 1.5 0.557
Houdini 1.5 0.561
Strelka 5 0.562
Houdini 1.0 0.569
Ippolit 0.572
Robbolito 0.574
Fruit 1.5 0.579
Doch 1.2 0.587
Komodo  1 0.588
Doch 0.98 0.589
Toga 3.0 0.597
Rybka 1 0.606
Toga 4.1 0.607
Fruit 2.0 0.612
Strelka 2 0.616
Toga 1.2 0.63
Toga 3.1 0.642
Toga 1.3 0.645
GAmbitFruit 1.1 0.653
GAmbitFruit 1.0 0.655
Toga 1.0 0.665
GAmbitFruit 4.b 0.665
GAmbitFruit 4.bx 0.67
Fruit 2.1 0.673
Fruit 2.2 0.68
Toga 1.1 0.682

3. Later Stockfishy

Wasp 3.00 0.502
Wasp 3.60 0.506
Wasp 3.50 0.506
Hannibal 1.7 0.508
Xiphos 0.4 0.518
Nirvana 2.4 0.527
Nirvana 2.2 0.529
Xiphos 0.5 0.53
Nirvana 2.3 0.533
Nirvana 2.1 0.533
Xiphos 0.2 0.533
Xiphos 0.3 0.535
Schooner 2 0.537
Stockfish  3 0.538
Fire 7.1 0.54
Sting 1.5 0.542
Sting 9.6 0.544
Sting 8.8 0.544
Stockfish  4 0.547
Stockfish 10 0.552
Houdini 5 0.554
Komodo 12 0.558
Houdini 6 0.559
Shredder 13 0.562
Komodo 10 0.567
Komodo 11 0.567
Stockfish  5 0.573
BrainFish 0.579
Stockfish  9 0.595
SugaR 1.0 0.598
AsmFish 0.6
Stockfish  6 0.601
Stockfish  8 0.615
Stockfish  7 0.616

4. Early Stockfishy

GAmbitFruit 1.0 0.501
Wasp 2.00 0.501
Wasp 2.60 0.501
GAmbitFruit 1.1 0.501
Ippolit 0.502
Robbolito 0.502
Houdini 1.0 0.503
Toga 3.1 0.504
Strelka 2 0.506
DiscoCheck 0.508
Wasp 1.25 0.508
Critter 1.2 0.509
Hannibal 1.7 0.509
Toga 1.0 0.51
Nirvana 2.4 0.51
Wasp 1.01 0.511
GAmbitFruit 4.b 0.511
Toga 4.1 0.515
GAmbitFruit 4.bx 0.515
Stockfish  6 0.516
Toga 1.3 0.517
Toga 1.1 0.52
Fruit 2.2 0.521
Fruit 2.1 0.521
Nirvana 2.2 0.521
Nirvana 2.3 0.522
Nirvana 2.1 0.531
Stockfish  5 0.544
Gull 1 0.552
Fruit 2.3 0.567
Stockfish  1 0.627
Stockfish  4 0.633
Glaurung 2.2 0.636
Stockfish  2 0.646
Stockfish  3 0.655
Sting 1.5 0.706
Sting 9.6 0.709
Sting 8.8 0.709

5. Middle Komodos

Komodo  9 0.6
Komodo  2 0.626
Komodo  8 0.659
Komodo CCT 0.673
Komodo  7 0.674
Komodo  3 0.686
Komodo  6 0.69
Komodo  4 0.701
Komodo  5 0.703

The relatively arbitrary sub-groups are formed out of the 135 test engines. Any engine whose name starts with an entry in the lists gets included. For example, 'Cri' finds all Critters.

ippo_list = ['Ipp', 'Cri', 'Rob', 'Bou', 'Strelka 5',
             'Houdini 1', 'Houdini 1.5', 'Houdini 4', 'Gul', 'Equ']

fruity_list = ['Ryb', 'Fru', 'Do', 'Komodo 1', 'Strelka 2', 'Strelka 5',
               'Tog', 'GAm', 'Ipp', 'Rob',
               'Houdini 1', 'Houdini 1.5', 'Houdini 4']

late_stockfish_list = ['Stockfish 3', 'Stockfish 4', 'Stockfish 5', 'Stockfish 6',
                       'Stockfish 7', 'Stockfish 8', 'Stockfish 9', 'Stockfish 10',
                       'Scho', 'Xip', 'Sti', 'Han', 'Komodo 12',
                       'Komodo 11', 'Komodo 10', 'Shredder 13', 'Sug',
                       'Houdini 5', 'Houdini 6', 'Asm', 'Bra', 'Fir']

early_stockfish_list = ['Gla', 'Stockfish 1', 'Stockfish 2', 'Sti', 'Fruit 2.3',
                        'Gull 1', 'Stockfish 3', 'Stockfish 4']

mid_komodo_list = ['Komodo 3', 'Komodo 4', 'Komodo 5', 'Komodo 6', 'Komodo 7',
                   'Komodo 8', 'Komodo 9']
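
As a rough sketch of how the prefix lists and the 50% cutoff might fit together (not the actual script, which is not shown): the helper names `normalise`, `select_group` and `group_scores`, the `moves` layout, and the collapsing of the padding spaces in names such as "Komodo  7" are all assumptions made for illustration.

```python
from typing import Dict, List

CUTOFF = 0.5  # the "relatively arbitrary" inclusion threshold from the post


def normalise(name: str) -> str:
    # Assumption: collapse padding spaces seen in names like "Komodo  7"
    return " ".join(name.split())


def select_group(engine_names: List[str], prefixes: List[str]) -> List[str]:
    """Return every engine whose (normalised) name starts with a prefix."""
    return [n for n in engine_names
            if any(normalise(n).startswith(p) for p in prefixes)]


def group_scores(moves: Dict[str, List[str]], prefixes: List[str]) -> Dict[str, float]:
    """Fraction of (group member, position) pairs that match each engine's move.

    A group member is also compared with itself, which reproduces the
    built-in bias mentioned in the post.
    """
    group = select_group(list(moves), prefixes)
    if not group:
        raise ValueError("no engine matches the given prefixes")
    scores = {}
    for name, chosen in moves.items():
        hits = sum(
            1
            for member in group
            for i, move in enumerate(chosen)
            if moves[member][i] == move
        )
        scores[name] = hits / (len(group) * len(chosen))
    # Keep only engines at or above the cutoff
    return {n: s for n, s in scores.items() if s >= CUTOFF}


if __name__ == "__main__":
    toy = {
        "Critter 1.2": ["e2e4", "g1f3"],
        "Critter 1.4": ["e2e4", "b1c3"],
        "Wasp 1.01": ["d2d4", "b1c3"],
    }
    print(group_scores(toy, ["Cri"]))
```

One thing the sketch makes visible: with plain prefix matching on normalised names, 'Komodo 1' would also select Komodo 10 to 12 and 'Stockfish 1' would also select Stockfish 10. Whether the original script behaves this way is not stated in the thread.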

chrisw · Post by **chrisw** » Wed Oct 02, 2019 11:19 am

Laskos wrote: ↑Wed Oct 02, 2019 10:50 am Put Leela too.

Can do if we have the engine-move data for Leela d=1 performance on the Don Dailey epd suite. I don't think Ed Schroeder (database central) has included any NN engines yet.
Most useful and interesting would be to run a batch of other NN engines plus a batch of Leela with various nets that have been developed over time.

Rebel · Post by **Rebel** » Wed Oct 02, 2019 11:20 am

Laskos wrote: ↑Wed Oct 02, 2019 10:50 am Put Leela too.

Leela at depth=1 makes no sense, only at a time control.

Reason: the play-outs.

Hence low similarity by definition.

Laskos · Post by **Laskos** » Wed Oct 02, 2019 11:44 am

Rebel wrote: ↑Wed Oct 02, 2019 11:20 am
Laskos wrote: ↑Wed Oct 02, 2019 10:50 am Put Leela too.
Leela at depth=1 makes no sense, only at a time control.

Reason: the play-outs.

Hence low similarity by definition.

No, low similarity is not by definition. It's low at fixed time too. Leela depth 1 IIRC is the same as Leela nodes 1, and it IS meaningful. Anyway, the eval of Leela itself is as strong as Stockfish at some depth 3 or even 4, so I don't know what more meaning you can get. In a sense, data for depth=1 are meaningless anyway, because what each engine does when using UCI depth=1 differs from one engine to another. I often use fixed time, and sometimes the scaling with this fixed time.

chrisw · Post by **chrisw** » Wed Oct 02, 2019 3:06 pm

Laskos wrote: ↑Wed Oct 02, 2019 11:44 am
No, low similarity is not by definition. It's low at fixed time too. Leela depth 1 IIRC is the same as Leela nodes 1, and it IS meaningful. Anyway, the eval of Leela itself is as strong as Stockfish at some depth 3 or even 4, so I don't know what more meaning you can get. In a sense, data for depth=1 are meaningless anyway, because what each engine does when using UCI depth=1 differs from one engine to another. I often use fixed time, and sometimes the scaling with this fixed time.

Well, the d=1 data is clearly not ‘meaningless’, since it can be mined and represented in human-readable terms in ways that pleasingly reflect reality, actual and anecdotal. Engines group. Development versions chain, and so on. Similarities are brought to life. Unexpected similarities have been confirmed. d=1 is supposed, according to the UCI protocol, to return a move from a search depth of 1 ply. Most engines appear to adhere to the standard; if they don’t, and it’s clear they don’t, Ed doesn’t include them. Btw, a broken d=1 is going to reduce similarity; it’s pretty unlikely to increase it unless the opposite engine in the pair being tested is broken in the same way (which would be a similarity in itself), so I think the Similarity Test is good for positive correlations despite that. Or, as Dawkins might say, “it works, bitches”. As I might say, it picks up, quite sensitively, in fact surprisingly sensitively, the use of parallel ideas in parallel ways. Trends within the data and the overall big-picture view are satisfyingly shown. Nobody pretends the data themselves should be looked at to N places of decimals, and we even managed to get away from the fixed line-in-the-sand model of old. Visual observation of connections, non-connections, groupings and trends allows and helps each critical observer to draw his/her own conclusions without having to reference some arbitrary, and actually politically defined, number. It’s those red-line numbers that have been rendered ‘meaningless’ by our meaningful, experimenter-unbiased, scientifically verifiable data.

There’s nothing to prevent testing a batch of NNs Policy. Inter-NN comparisons would presumably be valid. I have no idea how comparable the outputs would be for NN-AB, the data will tell us, I guess. I’m interested to see.

Laskos · Post by **Laskos** » Wed Oct 02, 2019 3:49 pm

chrisw wrote: ↑Wed Oct 02, 2019 3:06 pm
Well, the d=1 data is clearly not ‘meaningless’, since it can be mined and represented in human-readable terms in ways that pleasingly reflect reality, actual and anecdotal. Engines group. Development versions chain, and so on. Similarities are brought to life. Unexpected similarities have been confirmed. d=1 is supposed, according to the UCI protocol, to return a move from a search depth of 1 ply. Most engines appear to adhere to the standard; if they don’t, and it’s clear they don’t, Ed doesn’t include them. Btw, a broken d=1 is going to reduce similarity; it’s pretty unlikely to increase it unless the opposite engine in the pair being tested is broken in the same way (which would be a similarity in itself), so I think the Similarity Test is good for positive correlations despite that. Or, as Dawkins might say, “it works, bitches”. As I might say, it picks up, quite sensitively, in fact surprisingly sensitively, the use of parallel ideas in parallel ways. Trends within the data and the overall big-picture view are satisfyingly shown. Nobody pretends the data themselves should be looked at to N places of decimals, and we even managed to get away from the fixed line-in-the-sand model of old. Visual observation of connections, non-connections, groupings and trends allows and helps each critical observer to draw his/her own conclusions without having to reference some arbitrary, and actually politically defined, number. It’s those red-line numbers that have been rendered ‘meaningless’ by our meaningful, experimenter-unbiased, scientifically verifiable data.

There’s nothing to prevent testing a batch of NNs Policy. Inter-NN comparisons would presumably be valid. I have no idea how comparable the outputs would be for NN-AB, the data will tell us, I guess. I’m interested to see.

I prefer to use fixed time. If someone hides his output from Simex on purpose, he has difficulties with fixed time, as this very closely tracks strength (on the same hardware). One can heavily cripple an engine in a matter of an hour to make it look very different on Simex (change the PST), but this way he would heavily cripple the strength too. Authors usually do care about strength. Me, a complete idiot in chess programming, can have a 2600 CCRL Elo engine that is _very_weird_ under Simex, dissimilar to anything, derived in a matter of hours from Stockfish. So the goal of "strength" matters, and therefore the goal of "fixed time". That is, if we care about obfuscated output or unusual output. The "unusual output" is also important. IIRC, Fruit has some 5-10 times more nodes at fixed low depths than Stockfish. Also, nodes are counted differently by many engines. I don't see a better measure than fixed time with strength-oriented authors.

Yes, depth=1 Simex IS giving a good assessment, but ultimately, probably time=100ms on a reasonable core of a modern computer should be the standard.
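
For reference, the three search limits discussed in this thread (depth 1, nodes 1, and a fixed 100 ms) map onto standard UCI `go` arguments. The sketch below only builds the command strings for one EPD position and parses the reply; the helper names are hypothetical, and turning an EPD record into a FEN by appending dummy "0 1" move counters is an assumption about how the suite would be fed to an engine.

```python
from typing import List, Optional


def epd_to_fen(epd: str) -> str:
    """Keep the four EPD position fields and append dummy move counters."""
    fields = epd.split()[:4]
    if len(fields) < 4:
        raise ValueError(f"not an EPD record: {epd!r}")
    return " ".join(fields + ["0", "1"])


def uci_commands(epd: str, limit: str) -> List[str]:
    """Build the UCI lines for one position.

    `limit` is a standard UCI `go` argument such as "depth 1",
    "nodes 1" or "movetime 100" (movetime is in milliseconds).
    """
    return [f"position fen {epd_to_fen(epd)}", f"go {limit}"]


def parse_bestmove(line: str) -> Optional[str]:
    """Extract the move from an engine's 'bestmove ...' reply, else None."""
    parts = line.split()
    if len(parts) >= 2 and parts[0] == "bestmove":
        return parts[1]
    return None


if __name__ == "__main__":
    epd = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - bm e4;"
    for limit in ("depth 1", "nodes 1", "movetime 100"):
        print(uci_commands(epd, limit))
    print(parse_bestmove("bestmove e2e4 ponder e7e5"))
```

What an engine actually does on `go depth 1` is up to the engine, which is the point of disagreement above; the commands themselves are the same for every engine.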

chrisw · Post by **chrisw** » Wed Oct 02, 2019 6:42 pm

Laskos wrote: ↑Wed Oct 02, 2019 3:49 pm
I prefer to use fixed time. If someone hides his output from Simex on purpose, he has difficulties with fixed time, as this very closely tracks strength (on the same hardware). One can heavily cripple an engine in a matter of an hour to make it look very different on Simex (change the PST), but this way he would heavily cripple the strength too.

Gaming test suites is nothing new, and maybe Ed’s d=1 idea is open to future gaming, but we think nobody expected a d=1 test or even considered it, and especially didn’t consider the deeply buried information mine that testing over a hundred engines, including engine series, revealed.

Laskos wrote: ↑Wed Oct 02, 2019 3:49 pm Authors usually do care about strength. Me, a complete idiot in chess programming, can have a 2600 CCRL Elo engine that is _very_weird_ under Simex, dissimilar to anything, derived in a matter of hours from Stockfish. So the goal of "strength" matters, and therefore the goal of "fixed time". That is, if we care about obfuscated output or unusual output. The "unusual output" is also important. IIRC, Fruit has some 5-10 times more nodes at fixed low depths than Stockfish. Also, nodes are counted differently by many engines. I don't see a better measure than fixed time with strength-oriented authors.

Sure, if you want to measure strength then nothing beats playing matches at relatively normal times, but Simex, as designed by Don Dailey many years ago now, was designed to measure similarity. I’m not quite sure why he went to so much effort to find 8000-plus positions with a good spread of output moves. To get 8000 that keep their width against some reasonable level of search, one probably has to test and throw out many more positions, at what, 60 seconds a position? Say he tracks through 15,000 positions to get the 8352 (which is a curious number to end at): that would be 15,000 minutes or 250 hours. Not sure if there was multipv in those days, so multiply those numbers, either by using several engines or else by getting multiple searches out of one’s own engine. Okay, I am guessing and throwing numbers at it, but either way, it’s a long time to tie some hardware up.
Then a problem arises now: positions that looked wide ten years ago maybe don’t with the much deeper searches of nowadays. Do we do as you suggest and drastically cut back search time to some milliseconds? Then we run into another problem coming from the other direction: some engines do a lot more initialisation than others, so the short time they have available takes a big hit compared to quick initialisers.
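
The back-of-envelope figure above (15,000 candidate positions at 60 seconds each) is easy to replay. In this small sketch the function name and the `passes` multiplier, standing in for "several engines or multiple searches", are hypothetical, and the inputs are the guessed numbers from the post, not measured data.

```python
def suite_build_hours(candidates: int, seconds_per_position: float, passes: int = 1) -> float:
    """Hours of engine time to screen `candidates` positions `passes` times."""
    return candidates * seconds_per_position * passes / 3600.0


if __name__ == "__main__":
    # 15,000 positions at 60 s each, single pass
    print(suite_build_hours(15_000, 60))
    # Same workload repeated with three engines (illustrative multiplier only)
    print(suite_build_hours(15_000, 60, passes=3))
```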

Laskos wrote: ↑Wed Oct 02, 2019 3:49 pm Yes, depth=1 Simex IS giving a good assessment, but ultimately, probably time=100ms on a reasonable core of a modern computer should be the standard.

For strength tests, sure. For similarity, I’m not sure we have any good time-relative data, but the anecdotal thought-experiment danger is the results get dulled by search depth and can no longer discriminate well.
