Blog post

Saturday, March 30, 2019

Examining my MyHeritage AutoClusters

Compared to other testing companies, MyHeritage has a lot of information about DNA matches displayed on the website. Unfortunately, the information I'm most interested in - the shared matches - is not available as a data download. 

Given the lack of access to the data, I was curious to see what the new MyHeritage AutoCluster tool (based on the technology of Evert-Jan Blom from Genetic Affairs) could tell me. I went to the MyHeritage website, set the tool going, and after a time received the results in my email.

The AutoCluster tool applies a clustering algorithm to your DNA shared matches and provides the output as a list and a matrix chart visualisation. On opening the visualisation there is an 'oooh!' moment as all the blocks slide into place. When the process finished, my AutoCluster matrix looked like this:


The goal of this, or any other clustering tool on offer, is to identify groups of people that likely descended from a common ancestor. Those potential groups can be identified by their colour and placement on the chart.


A short guide to reading a matrix chart

  • Names of my matches are listed down the left-hand side and repeated along the top of the matrix (I've blurred them for privacy).
  • If there is a filled block at the intersection of two names (one at the side and one at the top) then those two people are a shared match.
  • Coloured blocks indicate clusters (as defined by the algorithm used).
  • Some people have connections to more than one cluster. Look to the grey blocks to see where those linkages are.
  • If there are a lot of grey blocks between two clusters, then those clusters are probably related to each other. For example, the first (red) and fourth (green) groups are linked through several people.

Before I go on, I should say that I have only reviewed my own AutoCluster results. Other user experiences may differ. MyHeritage has made efforts to accommodate all the vastly varying DNA networks of its users when it creates these matrix charts, without requiring users to adjust any settings. That's got to be difficult!
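
For readers who like to see the idea in code, here is a minimal sketch of how a matrix like this is filled in from a list of shared-match pairs. The names and pairs are made up, and I've left the diagonal empty for simplicity. It illustrates the reading guide above only; it is not how MyHeritage builds its chart.

```python
# Hypothetical names and shared-match pairs, for illustration only.
names = ["Ann", "Bob", "Cat", "Dan"]
shared_pairs = [("Ann", "Bob"), ("Bob", "Cat")]


def build_matrix(names, shared_pairs):
    """Return a symmetric 0/1 grid: 1 where two people are a shared match."""
    index = {name: i for i, name in enumerate(names)}
    size = len(names)
    matrix = [[0] * size for _ in range(size)]
    for a, b in shared_pairs:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # the chart is mirrored across the diagonal
    return matrix


matrix = build_matrix(names, shared_pairs)

# Print a rough text version: '#' is a filled block, '.' is an empty one.
for name, row in zip(names, matrix):
    print(f"{name:>4} " + " ".join("#" if cell else "." for cell in row))
```

Dan has no filled blocks in his row, so in this toy example he has no shared matches with anyone else on the chart.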

First impressions

The first thing I noticed was that the matrix was very fragmented. That could be representative of my data but, having browsed my DNA matches, all those very small groups didn't feel quite right.

I liked the amount of information given for the thresholds used:
"Your AutoCluster analysis was generated using thresholds of 25 cM (minimum) and 350 cM (maximum). In addition, DNA Matches were required to share at least 15 cM with one another in order to be indicated with a colored or gray cell. A total number of 104 DNA Matches ended up in 26 clusters in the final analysis."

Matrix visualisations are limited in how many matches they can include in one view and still be readable. Filtering is necessary to limit the matches. The automatically selected thresholds seem reasonable.
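
To make the quoted thresholds concrete, here is a small sketch of that kind of filtering. The three threshold values come from the notes quoted above; the names, the cM figures, and the choice to treat the limits as inclusive are my assumptions. I don't know how MyHeritage handles the edge cases.

```python
MIN_CM = 25       # minimum cM shared with me (from the quoted notes)
MAX_CM = 350      # maximum cM shared with me (from the quoted notes)
PAIR_MIN_CM = 15  # minimum cM two matches must share with one another

# Hypothetical data: cM each match shares with me, and with each other.
my_matches = {"Ann": 129.0, "Bob": 40.2, "Cat": 22.0, "Dan": 410.0}
pair_cm = {("Ann", "Bob"): 31.0, ("Ann", "Cat"): 18.0, ("Bob", "Dan"): 14.2}


def apply_thresholds(my_matches, pair_cm):
    """Return the matches kept, and the pairs that would get a filled cell."""
    included = {name for name, cm in my_matches.items() if MIN_CM <= cm <= MAX_CM}
    cells = [
        pair
        for pair, cm in pair_cm.items()
        if cm >= PAIR_MIN_CM and pair[0] in included and pair[1] in included
    ]
    return included, cells


included, cells = apply_thresholds(my_matches, pair_cm)
print(sorted(included))
print(cells)
```

In this toy data Cat falls below the minimum and Dan is above the maximum, so only the Ann and Bob pair would get a cell, even though Ann and Cat share more than 15 cM with each other.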

I appreciated the list of 11 matches who had no shared matches at the thresholds used.

I was perturbed by the exclusion of 95 matches who both met the threshold and had shared matches:

"The following 95 matches met the inclusion criteria but ended up in singleton clusters without other members and are therefore excluded from the analysis as well."

95 matches in "singleton clusters"?! Why are there almost as many matches excluded for "singleton clusters" as there are matches actually included in the matrix? Just how aggressively does the algorithm chop up the groups?
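
A quick tally of the figures quoted in the notes puts that in proportion. This sketch assumes the three lists (in clusters, singletons, and no shared matches) don't overlap and together cover everyone who met the thresholds, which the notes don't state outright.

```python
# Counts quoted in the AutoCluster notes.
in_clusters = 104       # matches shown in the matrix
singletons = 95         # met the criteria but ended up in "singleton clusters"
no_shared_matches = 11  # no shared matches at the thresholds used

# Assumption: the three lists do not overlap.
with_shared = in_clusters + singletons
met_thresholds = with_shared + no_shared_matches


def percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(f"Met the thresholds: {met_thresholds}")
print(f"Singletons, of those with shared matches: {percent(singletons, with_shared)}%")
print(f"Shown in the matrix, of all who met the thresholds: {percent(in_clusters, met_thresholds)}%")
```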

As I read the long list of matches that had been excluded I was taken aback to see that my second closest match, at 129 cM shared, was among the "singleton clusters".

Digging deeper: A new view

If you've been following my blog, you'll know that network graphs are my favoured tool for understanding shared match relationships. Using the csv file provided with the output, I was able to wrangle the data into shape and create a network graph version of the AutoCluster matrix information.

I've aligned the group labels and colours with the matrix display (but made the second and third use of each colour darker for clarity). The numbers indicate the group in the AutoCluster result reading down the diagonal. The dot sizes reflect the amount of DNA I share with each person. Each line is a shared match relationship (the lines here are equivalent to the blocks in the matrix chart).

This was the result.



Looking at this graph, I retain my first impression that the algorithm is heavy-handed in breaking up the groups. For example, groups 1, 4 and two members of group 13 look like they should be together, as do 24-25, and 3-19-22.
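
One rough way to check which groups hang together is to treat each filled block as a line between two people and look for connected pieces. The sketch below uses made-up labels (the prefix is the AutoCluster group number) and a plain breadth-first search. It is a simple stand-in, not the method my graphing software or the AutoCluster algorithm uses, and it is cruder than judging by eye: a single line is enough to join two groups.

```python
from collections import defaultdict, deque

# Hypothetical shared-match pairs; the label prefix is the AutoCluster group.
edges = [
    ("g1-a", "g1-b"),
    ("g1-b", "g4-a"),
    ("g4-a", "g4-b"),
    ("g24-a", "g25-a"),
]


def connected_components(edges):
    """Group people who are linked, directly or through others."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in sorted(neighbours[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return components


for component in connected_components(edges):
    print(component)
```

In this toy data, the people labelled as groups 1 and 4 come out as one piece, and 24 and 25 as another.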

I'm relaxed about which group the closer match in group 11 is allocated to. That person would naturally "belong" in more than one cluster as they would likely match with groups of people with more distant ancestors from each side of our shared branch.

With this view, I also see that there are a few 'strings' of small groups. They include matches for whom, without more information, inclusion in one group or the next would be equally valid. That can't be helped when working with shared match information alone but is a reason to take care when looking at small groups in a matrix layout and track back to any other connected groups.

There is huge potential for refinement of matching groups with the data MyHeritage has - and that I'd like to get hold of as downloads! Information about the total size of the match between pairs, and whether the matches have a triangulated segment, could be very informative for group allocations.

How would this have looked if the other 95 matches were included? I suspect that the sensible breakup of some of the smaller groups would be clearer for a start.

Digging even deeper - segment data

Looking at the network version I created, my impression is that groups 1, 4, two people from 13 and my closest match in group 11 are connected densely enough that they should really be a single group. 




One of the website tools that I like on MyHeritage is the chromosome browser. I entered the names of matches in my proposed larger group into the tool in batches. I both started and ended with my closest match. The result was clear. All of the matches I identified had triangulated segments with me and each other on chromosome three. (I couldn't find one match from group 4 in my match list to make that comparison.)

I also checked the other members of groups 13 and 11 (outside the circle above). None of them had a shared segment with me at that location.
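
The chromosome browser does the triangulation check for you, but the underlying idea is simple enough to sketch: three people triangulate where the segment I share with A, the segment I share with B, and the segment A shares with B all overlap. The names, positions (in millions of base pairs) and function names below are invented for illustration. A real check would also require the overlap to be a reasonable length.

```python
def overlap(seg_a, seg_b):
    """Return the overlapping part of two (chromosome, start, end) segments, or None."""
    chrom_a, start_a, end_a = seg_a
    chrom_b, start_b, end_b = seg_b
    if chrom_a != chrom_b:
        return None
    start, end = max(start_a, start_b), min(end_a, end_b)
    return (chrom_a, start, end) if start < end else None


def triangulated(me_a, me_b, a_b):
    """Return the stretch shared by all three pairings, or None if there is none."""
    first = overlap(me_a, me_b)
    return overlap(first, a_b) if first else None


# Hypothetical segments: (chromosome, start, end).
me_ann = (3, 10.0, 42.0)
me_bob = (3, 18.0, 55.0)
ann_bob = (3, 15.0, 40.0)
me_cat = (7, 5.0, 20.0)

print(triangulated(me_ann, me_bob, ann_bob))  # all three overlap on chromosome 3
print(triangulated(me_ann, me_cat, ann_bob))  # different chromosome, so None
```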


As an aside, some of the pairs don't show as matched in the matrix or network graph (based on the matrix data) even though they clearly triangulate on a reasonably sized segment. This is because some of the pairs match at just below the total matching threshold that was used to filter the graph. This is a point to be aware of when interpreting any shared match information, or indeed any DNA information where some sort of threshold or cutoff has been used.


Conclusions

I have only reviewed my own results and they may not be typical of most users. There seems to be an overly aggressive breakup of groups. This has made the chart fragmented and harder to read and interpret than it otherwise would be. 

The excessive fragmentation of groups is also likely the reason that almost half of my relevant matches were assigned to "singleton clusters" and excluded. Some of my best and most useful matches have been excluded. I'm concerned that the baby has been thrown out with the bathwater here.

When using AutoClusters I would suggest that users should:

  • Read the notes. Take note of who's in and out.
  • Use the grey cells to check for connections between groups.
  • Don't assume that the matrix will include your best and closest matches. They could be excluded!

Remember also that the result reflects only a small proportion of your matches (less than 2% in my case). There is no doubt much more to be found in matching results. I've written to MyHeritage in the past and asked that they consider allowing downloads of shared match lists (including shared match cM amounts). This would allow for analysis and clustering of more matches and for different clustering techniques to be used for those who want to do their own analysis.

Overall, though, my feeling about the AutoCluster tool is that something is better than nothing. It is a helpful way to start identifying groups at the top end of your match list, but caution is needed.

Friday, March 15, 2019

My Thrulines improved! I doubt it was due to me


It’s true! Five days after messaging corrected information to other people with my Ancestor in their tree, my AncestryDNA Thrulines have improved. I no longer see my carefully researched Ancestor replaced with a ‘Potential Ancestor’ from other trees, who never actually existed.



While the desired result has occurred, I can’t claim that my experiment was anything to do with it. Out of the seventeen messages I sent, just three people responded (with thanks) and said they would update their trees.

Thrulines is a beta feature that is constantly changing. For example, I noticed when I logged in today that my ancestors were now grouped by generation (nice!). I’m wondering if maybe Ancestry has listened to user feedback and changed who they choose to display. Either way, I prefer what I am seeing now and a few interested people have better information for their trees, so it’s a win-win.

Sunday, March 10, 2019

Can I improve my Thrulines?


AncestryDNA’s new beta feature, Thrulines, takes the work out of cobbling together your DNA matches’ trees to try and work out where your connection is. Overall, I think it’s great! It has come up with connections that would have taken me hours to work out on my own.

Of course, it doesn’t always get it right.

I have one particular ‘Potential Ancestor’ suggestion that I know to be incorrect. What’s worse, it suggests replacing my good information about that ancestor with bad.

AncestryDNA Thrulines 'Potential Ancestor' card stamped 'Do not copy' and 'Denied'
Sorry Edward Flower Darcy, you never existed.

Some might get upset about a suggestion to replace careful research with something incorrect. I can’t say I’m one of them. I do my own research before adding anything to my tree and if a hint isn’t right, I ignore it. I had that incorrect name in my own tree for many years and know it came from a death certificate, reported by a child who would never have known their grandparent. Due to people marrying at unexpected times, and dying in unexpected places, the correct information wasn’t easy to find.

While I’m not upset, I would prefer to be given good hints. For each Ancestry tree that has picked up my new research, there are about 10 that still have the old information.

I wonder what the tipping point is for Ancestry to shift its suggestion?

As an experiment, I’ve sent a friendly message to 17 people who have the incorrect information in their tree and given them corrected information. It will be interesting to see how many respond to my message, and if the Thrulines suggestion changes.