Cluster analysis of similarity test

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Cluster analysis of similarity test

Post by michiguel »

This is the data Don sent me. It looks a bit noisier to me than what we had months ago (by visual inspection; I am not sure), but the signals are picked up. I wonder whether running the test for more than 100 ms would make it even better. I ran a bootstrap-like analysis (more similar to a "jackknife"), resampling 1000 times, each time taking a random half of the sample (500), and most of the branches are consistent. I can give the numbers later.

Two things need to be considered:
1) how the branches group with each other and how consistent they are (they are consistent if the branches exist and the groups remain together no matter how I resample the data)
2) how long the branches are, i.e. the distance.

This is just a test; a more thorough analysis could and should be done.

Since the output of Don's tool changed, I needed to rewrite the scripts.

https://sites.google.com/site/gaviotach ... &width=800

Miguel
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Cluster analysis of similarity test

Post by Laskos »

Sorry, I didn't follow your earlier results, I have to get used to the chart.
Do the distances on a branch mean something? Do the angles mean something?

Thanks,
Kai
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Cluster analysis of similarity test

Post by michiguel »

Laskos wrote:
Sorry, I didn't follow your earlier results, I have to get used to the chart.
Do the distances on a branch mean something? Do the angles mean something?

Thanks,
Kai
The length of each branch represents the "dissimilarity" between points.
The angles mean absolutely nothing. In fact, it is legitimate to put the plot in a drawing program and move the branches around, as long as you do not change the lengths or the nodes. For instance:

Code:

A
 \
  \        C
   \      /
    o----o
   /      \  
  /        \
 /          D
B


A
 \          D
  \        /
   \      /
    o----o
   /      \  
  /        C
 /          
B

A
 \          
  \        
   \      
    o----o--D
   /      \  
  /        C
 /          
B
All three drawings are identical; we can say they are three different "artistic" representations of the same tree. Assuming that each '/', '\' or '-' represents the same distance on paper, this tree corresponds to the distance matrix

Code:

   C  D  A  B
C  0  3  8  8
D     0  9  9
A        0  6
B           0
The lower the number, the more similar the individuals are.
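As a quick check of how the matrix follows from the drawing, here is a minimal Python sketch that adds up branch lengths along the path between each pair of leaves. The edge lengths are read off the first drawing (one unit per '/', '\' or '-'), and "u" and "v" are hypothetical names I made up for the two internal nodes marked 'o'. This is only an illustration, not the script used to build the trees.

```python
from itertools import combinations

# Edge lengths read off the drawing, one unit per '/', '\' or '-'.
# "u" and "v" are made-up names for the two internal 'o' nodes.
edges = {
    ("A", "u"): 3,
    ("B", "u"): 3,
    ("u", "v"): 4,
    ("C", "v"): 1,
    ("D", "v"): 2,
}

def build_adjacency(edges):
    """Turn the edge list into a symmetric neighbour map."""
    adj = {}
    for (a, b), w in edges.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    return adj

def path_length(adj, start, goal):
    """Length of the unique path between two nodes of a tree."""
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in adj[node].items():
            if nxt != parent:  # never walk back along the edge we came from
                stack.append((nxt, node, dist + w))
    raise ValueError("nodes are not connected")

adj = build_adjacency(edges)
matrix = {frozenset(p): path_length(adj, *p) for p in combinations("CDAB", 2)}

for x, y in combinations("CDAB", 2):
    print(x, y, matrix[frozenset((x, y))])
```

The printed pairs reproduce the upper triangle of the matrix: C-D 3, C-A 8, C-B 8, D-A 9, D-B 9, A-B 6. Moving branches around in a drawing program changes none of these sums, which is why the three drawings are the same tree.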

The other VERY important point to take into account is: how reliable is each of the branches? For instance, all the trees above are different from this one:

Code:

            D
  A        /
   \      /
    \    /
     o--o
    /    \  
   /      \
  /        C
 /
B
The distances are different, but the topology is identical. The fact that the topology is the same means that I can still conclude that (A,B) is one family and (C,D) is another one.
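One way to see "same topology, different distances" in numbers is the four-point property of tree distances: for four leaves, the pairing with the smallest sum of within-pair distances is the one separated by the internal branch, and half the gap to the other sums is the length of that branch. The sketch below applies it to both trees. The second set of distances is read off the last drawing (A 2, B 4, C 2, D 3, internal branch 2). This is a standard textbook check that I am using for illustration, not necessarily what my scripts do.

```python
def quartet_split(d):
    """Return the pairing of A, B, C, D with the smallest sum of
    within-pair distances, plus the implied internal branch length."""
    sums = {
        "AB|CD": d["AB"] + d["CD"],
        "AC|BD": d["AC"] + d["BD"],
        "AD|BC": d["AD"] + d["BC"],
    }
    ranked = sorted(sums, key=sums.get)
    best, runner_up = ranked[0], ranked[1]
    # For exact tree distances the two larger sums are equal, and they
    # exceed the smallest one by twice the internal branch length.
    internal = (sums[runner_up] - sums[best]) / 2
    return best, internal

# Distances from the matrix (first tree) and from the last drawing (second tree).
first_tree = {"AB": 6, "CD": 3, "AC": 8, "AD": 9, "BC": 8, "BD": 9}
second_tree = {"AB": 6, "CD": 5, "AC": 6, "AD": 7, "BC": 8, "BD": 9}

print(quartet_split(first_tree))   # ('AB|CD', 4.0)
print(quartet_split(second_tree))  # ('AB|CD', 2.0)
```

Both trees give the split AB|CD; only the length of the o--o branch differs (4 versus 2).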

So the important question is: how reliable is the connection, or branch, represented by o--o? What is the other possibility? If the branch is not reliable, this tree is not (statistically) significantly different from the following tree:

Code:

           D
  A       /
   \     /
    \   /
     \ /
      o
      |
      o
     / \  
    /   \
   /     C
  /
 B
where the branch o--o "collapsed" and the topology changed, meaning that (A,D) is one family and (B,C) is another.

To investigate the reliability of those branches, we can resample the data, recalculate the tree, and ask: how many times do I see that branch? Half of the time? Then it is not reliable at all, and I cannot conclude anything about the relationships among A, B, C and D. But if I see the same branch no matter how many times I resample the data, the branch is very reliable.
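The resampling idea can be sketched for the four-leaf case in a few lines of Python. Everything about the data here is an assumption made for illustration: I invent 1000 positions, let each of four engines pick one of 10 moves per position (A and B usually follow one "family" move, C and D another), and define the distance between two engines as the fraction of sampled positions where their moves differ. The real tool output and my scripts are not reproduced here. Each resample takes a random half (500) without replacement and records which pairing the distances support.

```python
import random

def distance(moves_x, moves_y, idx):
    """Fraction of the sampled positions where two engines chose different moves."""
    return sum(moves_x[i] != moves_y[i] for i in idx) / len(idx)

def quartet_split(moves, idx):
    """Pairing of A, B, C, D with the smallest sum of within-pair distances."""
    def d(x, y):
        return distance(moves[x], moves[y], idx)
    sums = {
        "AB|CD": d("A", "B") + d("C", "D"),
        "AC|BD": d("A", "C") + d("B", "D"),
        "AD|BC": d("A", "D") + d("B", "C"),
    }
    return min(sums, key=sums.get)

def make_fake_data(n_positions, rng):
    """Invented data: A and B usually play one family move, C and D another."""
    moves = {name: [] for name in "ABCD"}
    for _ in range(n_positions):
        family1, family2 = rng.randrange(10), rng.randrange(10)
        for name, base in (("A", family1), ("B", family1),
                           ("C", family2), ("D", family2)):
            # 80% of the time follow the family move, otherwise pick at random.
            move = base if rng.random() < 0.8 else rng.randrange(10)
            moves[name].append(move)
    return moves

rng = random.Random(0)
n_positions, n_resamples = 1000, 1000
moves = make_fake_data(n_positions, rng)

counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
for _ in range(n_resamples):
    # Half of the sample, drawn at random without replacement.
    idx = rng.sample(range(n_positions), n_positions // 2)
    counts[quartet_split(moves, idx)] += 1

print(counts)
```

The count for a pairing plays the role of the support number on a branch: a value near 1000 means the branch shows up in essentially every resample, while a value near 500 means the data cannot decide.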

These are the numbers I got (same tree I showed before). Look at the numbers that read 1000: each means I got that same branch 1000 out of 1000 times. I included an arrow pointing to the number showing that all the Komodos belong to the same family, no matter how I resample the data.

Code:

                                         very reliable!!
                                        \_______________/
                                               |
                                               |                 +------21-Kom1.2J
                                               |          +811.0-|
                                               |   +-1000-|      +------09-Kom4046
                                               V   |      |
                                            +-1000-|      +-------------11-Kom4046
                                            |      |
                                            |      +--------------------00-Kom1.0
                                            |
                                     +406.0-|                    +------17-Fire12
                                     |      |             +906.0-|
                                     |      |      +-1000-|      +------18-Robbo84
                                     |      |      |      |
                                     |      +907.0-|      +-------------01-Hou1.5
                                     |             |
                              +907.0-|             +--------------------19-Rybka3
                              |      |
                              |      |                           +------08-Strel18
                              |      |                    +813.0-|
                              |      |             +-1000-|      +------20-Rybka1
                              |      |             |      |
                              |      +-------876.0-|      +-------------14-Strel2
                       +-1000-|                    |
                       |      |                    +--------------------16-Rybka23
                       |      |
                       |      |                                  +------03-Fruit21
                       |      |             +---------------1000-|
                       |      |             |                    +------04-Frui231
                       |      |             |
                +938.0-|      +--------1000-|                    +------07-Shredde
                |      |                    |             +907.0-|
                |      |                    |      +562.0-|      +------10-gaviota
                |      |                    |      |      |
                |      |                    +438.0-|      +-------------12-Spike1.
         +-1000-|      |                           |
         |      |      |                           +--------------------02-Crit042
         |      |      |
  +------|      |      +------------------------------------------------06-SF1.6
  |      |      |
  |      |      +-------------------------------------------------------05-SF171
  |      |
  |      +--------------------------------------------------------------13-SF1.8JA
  |
  +---------------------------------------------------------------------15-SF191JA
EDIT: In this last plot, distances mean nothing; only the topology matters. It is complementary to the plot I showed in the original post.

Miguel