Cluster analysis of similarity test

Post by michiguel » Fri Dec 31, 2010 7:43 am

This is the data Don sent me. It looks a bit noisier to me than what we had months ago (by visual inspection; I am not sure), but the signals are picked up. I wonder whether running the test for more than 100 ms would make it even better. I ran a bootstrap-like analysis (closer to a "jackknife"), resampling the data 1000 times and each time taking a random half of the sample (500), and most of the branches are consistent. I can give the numbers later.

Two things need to be considered:
1) how the branches group with each other and how consistent they are (that is, whether the branches exist and the groups remain together no matter how I resample the data)
2) how long the branches are, i.e. distance.

This is just a test; a more thorough analysis could and should be done.

Since the output of Don's tool changed, I needed to rewrite the scripts.

https://sites.google.com/site/gaviotach ... &width=800

Miguel

Re: Cluster analysis of similarity test

Post by Laskos » Fri Dec 31, 2010 5:16 pm

Sorry, I didn't follow your earlier results, I have to get used to the chart.
Do the distances on a branch mean something? Do the angles mean something?

Thanks,
Kai

Re: Cluster analysis of similarity test

Post by michiguel » Fri Dec 31, 2010 6:41 pm

The length of each branch represents the "dissimilarity" between points.
The angles mean absolutely nothing. In fact, it is perfectly legitimate to put the plot in a drawing program and move the branches around, as long as you do not change the lengths and the nodes. For instance

Code: Select all

A
 \
  \        C
   \      /
    o----o
   /      \  
  /        \
 /          D
B


A
 \          D
  \        /
   \      /
    o----o
   /      \  
  /        C
 /          
B

A
 \          
  \        
   \      
    o----o--D
   /      \  
  /        C
 /          
B
All three drawings are identical, or we can say they are three different "artistic" representations of the same tree. Assuming that each '/', '\' or '-' represents the same distance on paper, this tree is formed with the matrix

Code: Select all

   C  D  A  B
C  0  3  8  8
D     0  9  9
A        0  6
B           0
The lower the numbers, the more similar the individuals are.
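
As a quick check of that matrix, here is a minimal Python sketch that stores the four-leaf tree as weighted branches and sums the branch lengths along the path between each pair of leaves. The names `u` and `v` for the two internal 'o' nodes and the helper `path_length` are made up for this illustration; the branch lengths are simply counted from the first drawing, one unit per character.

```python
from itertools import combinations

# Branch lengths read off the first drawing, one unit per '/', '\' or '-'.
# 'u' and 'v' are hypothetical names for the two internal 'o' nodes.
EDGES = {
    ("A", "u"): 3,
    ("B", "u"): 3,
    ("u", "v"): 4,
    ("C", "v"): 1,
    ("D", "v"): 2,
}

def neighbours(node):
    """Yield (other_node, length) for every branch touching `node`."""
    for (a, b), length in EDGES.items():
        if a == node:
            yield b, length
        elif b == node:
            yield a, length

def path_length(start, goal, came_from=None):
    """Sum of branch lengths on the unique path between two nodes of the tree."""
    if start == goal:
        return 0
    for nxt, length in neighbours(start):
        if nxt == came_from:
            continue  # do not walk back along the branch we arrived on
        rest = path_length(nxt, goal, start)
        if rest is not None:
            return length + rest
    return None  # goal is not reachable through this branch

# Pairwise leaf-to-leaf distances, keyed by the unordered pair of leaves.
distances = {
    frozenset(pair): path_length(*pair)
    for pair in combinations("CDAB", 2)
}

for pair in combinations("CDAB", 2):
    print(pair, distances[frozenset(pair)])
```

The printed values reproduce the matrix (C-D = 3, C-A = 8, and so on). The angles never enter the calculation, which is why redrawing the branches changes nothing.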

The other VERY important point to take into account is: how reliable is each of the branches? For instance, all the trees above are different from this one

Code: Select all

            D
  A        /
   \      /
    \    /
     o--o
    /    \  
   /      \
  /        C
 /
B
The trees differ because the distances are different, but the topology is identical. Since the topology is the same, I can still conclude that (A,B) is one family and (C,D) is another one.
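
One way to read the topology straight from a four-leaf distance matrix, without drawing anything, is the four-point condition: of the three ways to pair up the leaves, the pairing with the smallest sum of within-pair distances is the one separated by the internal branch. The sketch below applies it to the matrix given earlier. The function name `quartet_split` is hypothetical, and this is only an illustration, not necessarily how the plotted tree was built.

```python
def quartet_split(d, leaves=("A", "B", "C", "D")):
    """Return (pairing, internal_branch_length) for four leaves.

    `d` maps frozenset({x, y}) to the distance between x and y.
    """
    a, b, c, e = leaves
    pairings = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    sums = sorted(
        (d[frozenset(p)] + d[frozenset(q)], (p, q)) for p, q in pairings
    )
    (best_sum, best), (second_sum, _) = sums[0], sums[1]
    # For an exactly tree-like matrix the two larger sums are equal and
    # exceed the smallest one by twice the internal branch length.
    return best, (second_sum - best_sum) / 2

# The matrix from the example above.
D = {
    frozenset("CD"): 3, frozenset("CA"): 8, frozenset("CB"): 8,
    frozenset("DA"): 9, frozenset("DB"): 9, frozenset("AB"): 6,
}

split, inner = quartet_split(D)
print(split, inner)  # (('A', 'B'), ('C', 'D')) 4.0
```

The recovered internal length of 4 matches the four dashes of o----o in the first drawings. Changing the branch lengths changes that number, but the pairing stays (A,B) versus (C,D) as long as the topology is the same.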

So... the important point is: how reliable is the connection, or branch, represented by o--o? What is the other possibility? If the branch is not reliable, it means that this tree is not (statistically) significantly different from the following tree

Code: Select all

           D
  A       /
   \     /
    \   /
     \ /
      o
      |
      o
     / \  
    /   \
   /     C
  /
 B
where the branch o--o "collapsed" and now the topology changed, meaning that (A,D) is a family and (B,C) is another.

To investigate the reliability of those branches, we can resample the data, recalculate the tree, and ask: how many times do I see that branch? Half of the time? That is not reliable at all, and I cannot conclude anything about the relationship between A, B, C, and D. But if I see the same branch no matter how many times I resample the data, the branch becomes very reliable.
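
The resampling idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the data is taken to be one move per engine per position, the dissimilarity is taken to be the count of positions where two engines chose different moves, and the four engines and their moves are entirely made up. The function names are hypothetical too.

```python
import random
from itertools import combinations

def distance_matrix(moves, positions):
    """For each pair of engines, count the sampled positions where their moves differ."""
    return {
        frozenset((x, y)): sum(moves[x][i] != moves[y][i] for i in positions)
        for x, y in combinations(sorted(moves), 2)
    }

def groups_together(d, pair, others):
    """Four-point test: True if `pair` and `others` sit on opposite sides of the internal branch."""
    (a, b), (c, e) = pair, others
    same = d[frozenset((a, b))] + d[frozenset((c, e))]
    cross1 = d[frozenset((a, c))] + d[frozenset((b, e))]
    cross2 = d[frozenset((a, e))] + d[frozenset((b, c))]
    return same < cross1 and same < cross2

def branch_support(moves, pair, others, n_resamples=1000, seed=0):
    """Number of half-samples (drawn without replacement) in which the branch shows up."""
    rng = random.Random(seed)
    n_positions = len(next(iter(moves.values())))
    hits = 0
    for _ in range(n_resamples):
        sample = rng.sample(range(n_positions), n_positions // 2)
        d = distance_matrix(moves, sample)
        hits += groups_together(d, pair, others)
    return hits

# Made-up data: 1000 positions, moves coded as small integers.
# A and B agree with each other, C and D agree with each other, except on
# every fourth position, where all four engines pick different moves.
N = 1000
moves = {
    "A": [0 if i % 4 else 10 for i in range(N)],
    "B": [0 if i % 4 else 11 for i in range(N)],
    "C": [1 if i % 4 else 12 for i in range(N)],
    "D": [1 if i % 4 else 13 for i in range(N)],
}

print(branch_support(moves, ("A", "B"), ("C", "D")))  # 1000: seen in every resample
print(branch_support(moves, ("A", "D"), ("B", "C")))  # 0: never recovered
```

With real, noisy data the count would land somewhere between these two extremes, and a value near half of the resamples is the "not reliable at all" case described above.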

These are the numbers I got (same tree I showed before). Look at the numbers that read 1000: they mean I got the same branch 1000 out of 1000 times. I included an arrow pointing to a number that means that all Komodos belong to the same family, no matter how I resample the data.

Code: Select all

                                         very reliable!!
                                        \_______________/
                                               |
                                               |                 +------21-Kom1.2J
                                               |          +811.0-|
                                               |   +-1000-|      +------09-Kom4046
                                               V   |      |
                                            +-1000-|      +-------------11-Kom4046
                                            |      |
                                            |      +--------------------00-Kom1.0
                                            |
                                     +406.0-|                    +------17-Fire12
                                     |      |             +906.0-|
                                     |      |      +-1000-|      +------18-Robbo84
                                     |      |      |      |
                                     |      +907.0-|      +-------------01-Hou1.5
                                     |             |
                              +907.0-|             +--------------------19-Rybka3
                              |      |
                              |      |                           +------08-Strel18
                              |      |                    +813.0-|
                              |      |             +-1000-|      +------20-Rybka1
                              |      |             |      |
                              |      +-------876.0-|      +-------------14-Strel2
                       +-1000-|                    |
                       |      |                    +--------------------16-Rybka23
                       |      |
                       |      |                                  +------03-Fruit21
                       |      |             +---------------1000-|
                       |      |             |                    +------04-Frui231
                       |      |             |
                +938.0-|      +--------1000-|                    +------07-Shredde
                |      |                    |             +907.0-|
                |      |                    |      +562.0-|      +------10-gaviota
                |      |                    |      |      |
                |      |                    +438.0-|      +-------------12-Spike1.
         +-1000-|      |                           |
         |      |      |                           +--------------------02-Crit042
         |      |      |
  +------|      |      +------------------------------------------------06-SF1.6
  |      |      |
  |      |      +-------------------------------------------------------05-SF171
  |      |
  |      +--------------------------------------------------------------13-SF1.8JA
  |
  +---------------------------------------------------------------------15-SF191JA
EDIT: In this last plot, distances mean nothing; only the topology matters. This is complementary to the plot I showed in the original post.
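
For trees with more than four leaves, support numbers of this kind can be tallied by listing, for every resampled tree, the set of leaves under each node and counting how often each set recurs. The sketch below assumes rooted trees written as nested tuples and uses three made-up trees with hypothetical leaf names (`K1`, `K2`, `K3`, `R`, `S`). It illustrates the counting only and is not the script behind the plot.

```python
from collections import Counter

def clades(tree):
    """Return (leaves_under_tree, set_of_clades) for a nested-tuple tree.

    A leaf is a string; an internal node is a tuple of subtrees.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the group of leaves hanging from this node
    return leaves, found

def clade_support(trees):
    """Count in how many trees each clade (set of leaves under one node) occurs."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree)[1])
    return counts

# Three hypothetical trees, as if obtained from three resamples.
resampled = [
    ((("K1", "K2"), "K3"), ("R", "S")),
    ((("K1", "K3"), "K2"), ("R", "S")),
    ((("K1", "K2"), "K3"), ("R", "S")),
]

support = clade_support(resampled)
print(support[frozenset({"K1", "K2", "K3"})])  # 3: the family holds in every tree
print(support[frozenset({"K1", "K2"})])        # 2: the inner pairing is less stable
```

This mirrors how the plot reads: a family can be fully supported as a whole even when the grouping inside it is less certain.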

Miguel
