Blog post

Friday, June 22, 2018

Quiz: AncestryDNA Shared Matches

AncestryDNA shared matches have some quirks that can be confusing.

Do you understand which shared matches relationships are in, and just as importantly, which are out?

Test your knowledge with this quiz!


Wednesday, March 14, 2018

Congress 2018 wrap-up

Four days - a busy blur of conference sessions and group gatherings for meals or photos. Now Congress 2018 has ended, and hundreds of delegates have returned home. I expect that like me they were sad to see it end, but ready for a break and a chance to put all they’d learnt into action.

Conference tag with ribbons attached, string of beads.

There was a good selection of both local and international speakers, but the speakers are only part of the experience. Jill Ball of GeniAus did an exceptional job of extending the community spirit and camaraderie that exists among genealogy bloggers to the non-blogging conference goers. Or at least that’s how it appeared to me, and I hope that’s how they felt about it!

I caught up with friends I had met online or at the Canberra conference in 2015, with my cousin who was also attending, and also made/met some new friends. I don’t want to name names or I will be sure to leave someone out.

I delivered my presentation on Visualising DNA Matches with Network Graphs on Sunday evening. The conference started on Friday so there were three days for my nerves to build, but also three days to settle in and feel like part of the genealogy community. Several people told me afterwards that they were keen to try graphing their DNA matches, or spoke to me about the insights they had already gained through doing so.

I’ve run through my notes and made a list of things to try, or thoughts to hang on to. Some of my top items:

  • Need to investigate the journals section of Trove.
  • Possible purchase: Farewell my Children by Richard E Reid (after hearing Pauleen Cass talk)
  • Why don’t I have a copy of Phillimore’s Atlas?! Must fix that (several talks prompted this thought).  
  • Need to take a proper look at DustyDocs.
  • Judy Russell (The Legal Genealogist) provided links to sites with public domain photos – bookmark them.
  • Freemason records! Now that I’ve learnt more about these I definitely want to follow up on the Freemasons in my family. 
  • Lewis’ gazetteer – get hold of that too.
  • Lisa Louise Cooke spoke about using Google Earth Pro. I realised I already have it on my computer and promptly lost several hours playing with it. She said that would happen…
  • A couple of blog tweaks I should probably make after hearing Jill Ball talk about Beaut Blogs.

One of the highlights was meeting international speaker, Judy Russell (The Legal Genealogist).

Shelley Crawford and Judy Russell

This is one of the few photos I have of people – I really should have taken more. Between lunches, dinners, group photos and other get togethers it felt like I had taken a million, but apparently not.

It was very disappointing to hear that none of the Societies have put their hand up to host the next Congress. I hope that we will hear good news on that front soon. I will be more than ready to go to another conference three years from now.

Saturday, March 3, 2018

Triangulation is the icing, not the cake

I’m seeing more and more DNA network graphing activity going on. I’m so pleased to see that there are tools being developed to make this type of approach widely available.

One concern I have with these new developments is the exclusive use of “triangulated” segments to link between two DNA matches. By triangulated segments I mean segments of DNA that you and two of your DNA matches all have in common.
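For readers who like to see the logic spelled out, here is a minimal Python sketch of that definition. It assumes each shared segment is recorded as a chromosome with start and end positions, and it only checks that the positions overlap. The names `Segment`, `overlap` and `triangulated` are made up for this illustration, and real tools are likely to apply extra rules (such as a minimum segment size) that are not shown here.

```python
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    chromosome: str
    start: int  # start position in base pairs
    end: int    # end position in base pairs


def overlap(a: Segment, b: Segment) -> Optional[Segment]:
    """Return the region two segments have in common, or None if there is none."""
    if a.chromosome != b.chromosome:
        return None
    start, end = max(a.start, b.start), min(a.end, b.end)
    return Segment(a.chromosome, start, end) if start < end else None


def triangulated(me_a: List[Segment], me_b: List[Segment],
                 a_b: List[Segment]) -> List[Segment]:
    """Regions where the me/A, me/B and A/B segments all overlap."""
    found = []
    for s1 in me_a:
        for s2 in me_b:
            common = overlap(s1, s2)
            if common is None:
                continue
            for s3 in a_b:
                region = overlap(common, s3)
                if region is not None:
                    found.append(region)
    return found


# Hypothetical example data: one segment for each of the three comparisons
me_a = [Segment("7", 10_000_000, 40_000_000)]
me_b = [Segment("7", 25_000_000, 60_000_000)]
a_b = [Segment("7", 20_000_000, 50_000_000)]

print(triangulated(me_a, me_b, a_b))
```

With this made-up data all three segments overlap between positions 25,000,000 and 40,000,000 on chromosome 7, so that region is reported as triangulated.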

Don't get me wrong - triangulation is a very good thing. If you have a triangulated DNA segment, there’s a very good chance that all three of you inherited it from the same ancestor (whoever that may be). Sticking to triangulated segments only is appealing and seems an intuitively sensible choice – they provide a degree of confidence because you know that the relationships you see are relevant to your ancestry.

My contention is that the addition of DNA relationships that don’t have triangulated segments is essential to find groups of mid range – say 2nd to 4th - cousins descended from a common ancestor among a set of matches.

The triangulated view

Below is a layout of triangulated segments extracted from Gedmatch using the Tier 1 triangulation report (chart produced with Gephi). Many of the groups here – particularly the large groups – are very distant relatives.

Notice the four pink dots? They are known cousins who all share a common ancestor. They match me and each other in the 1st cousin once removed to fourth cousin range. Only one of the six possible pairings of the four shows a triangulated segment! And that line is between the two more distant (to me) matches. If I didn’t know that all four of them had a common ancestor there would not be much in the chart that compelled me to pursue how those four people match.

Chart showing distinct separated clusters of dots and lines

The untriangulated view

Below is a different view of the data, taking a different approach.

Here I added in shared match information from Gedmatch’s “People who match one or both of two kits” report for all my matches over 20 centiMorgans (cM). This includes pairs of matches without any triangulated (with me) segments. In the chart I have limited the matches shown to those who share at least 20cM with me AND with each other. This is similar to but slightly more inclusive than Ancestry’s thresholds (there’s another post in what Ancestry does that I may write one day).

  • The blue lines indicate the match pair has at least one shared segment in common with me.
  • Grey lines indicate that the people at each end of the line match each other, but there is no overlap of segments between the pair of matches and me.
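The filtering and line colouring described above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the data layout, the names `shared_with_me`, `pairs` and `build_edges`, and the example numbers are all invented, and this is not the format of any Gedmatch report.

```python
MIN_CM = 20.0  # threshold used for both "with me" and "with each other"

# Hypothetical input: cM that each match shares with me
shared_with_me = {"A": 250.0, "B": 90.0, "C": 35.0, "D": 12.0}

# Hypothetical input: (match 1, match 2, cM they share with each other,
# whether the pair has a segment triangulated with me)
pairs = [
    ("A", "B", 400.0, True),
    ("A", "C", 28.0, False),
    ("B", "C", 15.0, False),
    ("C", "D", 60.0, True),
]


def build_edges(shared_with_me, pairs, min_cm=MIN_CM):
    """Keep pairs where both people share min_cm with me AND with each other."""
    edges = []
    for m1, m2, pair_cm, is_triangulated in pairs:
        if shared_with_me.get(m1, 0) < min_cm or shared_with_me.get(m2, 0) < min_cm:
            continue  # one of the pair is too distant from me
        if pair_cm < min_cm:
            continue  # the pair do not share enough with each other
        colour = "blue" if is_triangulated else "grey"
        edges.append((m1, m2, pair_cm, colour))
    return edges


edges = build_edges(shared_with_me, pairs)
for edge in edges:
    print(edge)
```

In this made-up data the A/B line is kept as blue and the A/C line as grey. B/C is dropped because the pair share less than 20cM with each other, and C/D is dropped because D shares less than 20cM with me.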

I needed to limit the connections between people on the amount of DNA they shared with each other in order to stop the number of links in the chart from becoming ridiculous – and I have no known endogamy.

I should also mention that in both these charts, thicker lines indicate larger shared cM amounts between the pairs of matches. The thickest lines are parent/child or sibling relationships. The size of the dot reflects the relationship with me. Larger dots are closer relatives.

Chart with sparse but interconnected dots and lines, with a few distinct clusters

Quite a different picture. While I’ve lost a lot of distant matches, there is now the suggestion of a grouping with my known cousins. The chart is more interlinked – some of these links may be coincidental relationships nothing to do with my tree.  I would look upon single links between clusters with suspicion but not dismiss them entirely.

There are some clusters entirely made up of “untriangulated” match pairs including relatives closer than 20cM to me who do NOT show up in the triangulated only version above. These are clusters that are close enough that I might be able to determine the common ancestor with a little digging.

Is what I am seeing with my four cousins a one-in-a-million random chance occurrence?

Chromosome browser view with blue and orange segment markers that don't overlap

I don’t think so. I suspect that there’s a higher chance of relatives in a researchable timeframe not sharing a triangulated segment than one may imagine.

Here’s another example – a Family Tree DNA chromosome browser view of two people who share one great-great grandparent with me. They are more closely related to each other. There is ample paper and other DNA evidence to say that the relationship is correct.

No stacked blue and orange lines = no triangulated segments.

Once again, if I didn’t already know about it, a connection between these two people is exactly what I would want to find in my data.

I would be interested to know if readers can find further examples of close matches that don’t triangulate in their data.

So if triangulated matches between closer relatives are so hard to come by, why those big clusters of distant triangulated relationships?

As each generation passes, you are less likely to inherit DNA from a particular ancestor. For a very distant ancestor you may have only one segment, if any. Each ancestor, however, has on average an increasing number of descendants with each generation. The chances of another descendant having the same inherited segment as you are slim… but there are a lot of other descendants. A small fraction of them do inherit that same segment. If they DNA test, they all match in common with each other on that one segment and become a cluster in the chart. You can see it when you look at the chromosome data for the matches in a big cluster – they all match in a big stack at one location.

Keeping only triangulated segments is cleaner and increases the chance that the relationship you see is due to a shared ancestor – but that doesn’t necessarily make them more helpful for research. There is a risk of losing close match information that could be researched, for the sake of distant match information beyond paper trail timeframes.

Finding the balance

A compromise position that trimmed off untriangulated relationships for distant relatives, but kept them where there was a close relationship, might be the answer.

The version of the graph below uses the same thresholds as the untriangulated chart (20cM shared with me, 20cM shared between match pairs), but then adds in all triangulated segments between pairs of people who each share 20 cM or more with me. This adds in a few more matches, and the addition of the less close triangulated lines supports some of the untriangulated clusters. I now have a good picture of that group of four known matches in pink. There is a winding path of untriangulated matches connecting several of the triangulated (and untriangulated) groups. While they complicate the picture they do alert me to the possibility that my tree may have intermarriage that I’m not aware of. It’s messy, but not necessarily a bad thing.
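One way to express this compromise rule is as a single test with separate thresholds for triangulated and untriangulated links. The Python sketch below is an assumption about how it could be written, not a description of any existing tool; the function name `keep_edge` and its parameter names are invented for the example.

```python
def keep_edge(cm_me_1, cm_me_2, pair_cm, is_triangulated,
              min_with_me=20.0,
              min_pair_untriangulated=20.0,
              min_pair_triangulated=0.0):
    """Decide whether a link between two matches belongs in the chart.

    Both people must share min_with_me cM with me. An untriangulated pair
    must also share min_pair_untriangulated cM with each other, while a
    triangulated pair is kept whatever the size of their shared total.
    """
    if cm_me_1 < min_with_me or cm_me_2 < min_with_me:
        return False
    if is_triangulated:
        return pair_cm >= min_pair_triangulated
    return pair_cm >= min_pair_untriangulated


# A pair sharing only 15cM with each other is kept if it triangulates with me...
print(keep_edge(90.0, 35.0, 15.0, is_triangulated=True))
# ...and dropped if it does not.
print(keep_edge(90.0, 35.0, 15.0, is_triangulated=False))
```

Keeping the two pair thresholds as separate settings makes it easy to experiment, for example by tightening the untriangulated threshold in a tree with a lot of intermarriage.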

Network chart showing interconnected lines, with a moderate number of distinct clusters

DNA products and datasets

I would like to see DNA matching datasets (or products made from them) with as many as possible of the following attributes:

  • Inclusion of close in-common-with relationships that don’t have triangulated segments.
  • Data on the strength of the total connection between pairs of matches (or edge filters using this information).
  • Ability to distinguish between match pairs with and without triangulated segments.
  • Ability to set different thresholds for triangulated and non-triangulated edges.
  • Inclusion of total match size for each match.

Triangulated segments are the icing, not the cake.

I hope that as more products and data extraction capabilities are developed some of these ideas will be incorporated. You can help by giving developers a push along these lines when you provide feedback about their products.



Friday, February 23, 2018

Getting ready for Congress 2018

The biggest event on Australia’s genealogy calendar is the triennial Australasian Congress on Genealogy and Heraldry and it’s only two weeks away (Friday 9 to Monday 12 March).

Travelling to another city to attend a genealogy conference takes time and money, and if you don’t know anyone it’s intimidating. Perhaps that’s why I had never felt moved to attend until three years ago when it was held in my home town. I enjoyed the conference immensely and got a lot from it. After that experience, I had no doubts about going to the next one.

There are going to be two big differences (that I know about) between my experience this time and last time. First, I’ll need to travel. Second, this time around I’ll be speaking at the conference which adds a few substantial to-do items and I’m sure will give me a new perspective on the event.

I’ve been reading Jill Ball’s (aka GeniAus) posts about preparing for Congress (and other conferences) with interest, and adding relevant items to my own checklist.

Let’s see how I’m doing with preparations:

  • Conference Registration: Done, as soon as registrations opened. I also paid for a seat at the conference dinner.
  • Work: Leave request submitted and approved.
  • Family: Leave request submitted and approved.
  • Accommodation: Booked and paid for. I’ve arranged to share rental of a small house near the venue with two other genealogists. It’s going to be fun!
  • Travel to Sydney: Booked. Although I usually prefer to take the train, this time I chose the bus. It’s quicker, a little cheaper, but most importantly the timetable is more flexible. I can return home at a civilised hour and get to work the next day in a fit state to do some work.
  • Travel within Sydney: I’m close enough to the venue that I will be able to walk. I’m sure I’ll appreciate a bit of exercise at the start and end of each day. I already have an Opal card from previous visits to Sydney for when I need to use public transport.
  • Devices: I’m planning on taking my phone and my laptop. I need to make sure any information I might want is synced to the laptop. Still to do.
  • Note taking: While I like technology for storage, I prefer to take notes on paper. I have a Whitelines note book with a hard cover that I plan to use. The pages are light grey with a white grid, and it comes with an app that will hide the grey background, resize and sync to wherever you want online. It will be easy to keep a soft copy of any of my scribbles that I think are worth keeping.
  • Contact cards: I’ve had a small batch of business cards printed up with details of this blog, various contact details for me, and family surnames I’m researching.
  • Blogger beads: If you’re not a blogger, you might not be aware of the trend at US genealogy conferences for bloggers to wear identifying beads. Jill Ball has imported this to Australia and it’s a fun way to break the ice at events. I’ve put my hand up for some. Thanks Jill!
  • Clothing: It’s too soon to pack my bags, but I’ve invested in some new comfortable shoes that I can test out and break in before the day. I’m not too worried about attire for the conference days, but I still need to work out what I will wear to the conference dinner.
  • Speech: I’ve submitted my handouts and slides to the organisers. All I have to do is continue to practice – and keep an eye on developments relating to my topic.

I think I’m as ready as I need to be at this stage.

Let the countdown commence!

Wednesday, January 3, 2018

Visualising Ancestry DNA matches-Part 10-Colour Coding

This is the tenth part of a series of posts about visualising Ancestry DNA matches with network graphs. You can find the index to the posts here. In this post, I’ll show you how to colour code your matches.

The material in this post is what I have been most looking forward to showing you. There is so much you can do with colour coding! I’ll provide a few ideas and examples, but would love to see what else you come up with. Tell me about it in the comments, or join the freshly minted Network Graphs for Genetic Genealogy Facebook group here.

What information can I colour code on?

You can colour code on whatever you want! If you can get it into a column you can colour code on it. For a start, here are some ideas with no data manipulation required (although you may need to load extra columns from your matches file):

  • Starred matches. Where do those people you were interested in fit?
  • Viewed matches. Immediately spot critical new matches.
  • Shared ancestor hints. Have you checked them all out?
  • Numerical information – eg SharedCM, Shared Segments – can be used to create a heat map to help spot clusters of closer or more distant matches.
  • Manually add a column with the branch that known matches belong to, and colour code on that. This can help to identify clusters from a particular part of your tree. I recommend only colouring matches that you know for sure belong to a particular branch.

If you’re able to use Excel or a database tool to manipulate the data yourself, even more options are available. For instance I have found it very useful to download the ‘ancestors’ file (using the DNAGedcom client) which contains a list of ancestors for your matches who have their DNA connected to a public tree:

  • Matches with a particular surname or surnames in their tree.
  • Matches with a particular place or places in their tree.

These examples don’t work so well with names like “Smith” – but are fantastic for finding clusters with less common names or from a particular region.
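If you are comfortable with a little scripting, the same kind of list can be built in Python instead of Excel. The sketch below is only an illustration: the column names (`matchid`, `surname`, `birthplace`) and the sample rows are invented and will not match the real ancestors file exactly, so check the headings in your own download first.

```python
import csv
import io

# Tiny stand-in for an ancestors file, with hypothetical column names
ANCESTORS_CSV = """matchid,surname,birthplace
m1,Trelawny,"St Ives, Cornwall, England"
m1,Smith,"Sydney, New South Wales, Australia"
m2,Jones,"Cardiff, Wales"
m3,Pascoe,"Truro, Cornwall, England"
"""


def matches_with(rows, column, needle):
    """Return the IDs of matches with an ancestor whose column contains needle."""
    needle = needle.lower()
    return {row["matchid"] for row in rows
            if needle in (row[column] or "").lower()}


rows = list(csv.DictReader(io.StringIO(ANCESTORS_CSV)))
cornish = matches_with(rows, "birthplace", "cornwall")
pascoe = matches_with(rows, "surname", "pascoe")

# Build a value for every match, ready to paste into a new column
all_matches = ["m1", "m2", "m3"]
flags = {m: ("Cornwall" if m in cornish else "") for m in all_matches}
print(flags)
```

Note that this is a simple text search, so searching for "wales" would also pick up "New South Wales". A more specific search term, or a check of the results by eye, helps avoid that kind of false hit.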

Get the settings right

Colour by vertex

The default setting, once groups have been created, is to colour by group.

In order to apply colours by person, we’ll need to tell NodeXL to ‘colour by vertex’ instead.

  • NodeXL Basic ribbon > Groups > Group Options…
  • Select “The colors specified in the Color column on the Vertices worksheet”

At this point all the dots will change to the default Vertex colour (black). If you want to return to colouring by group you can change back at any time by selecting “The colors specified in the Vertex Color column on the Groups worksheet”.

Prevent the nodes from moving 

Each time you change the colours you will need to refresh the graph to apply the change. The chart layout will be applied again, and the nodes will move.

If you like the nodes where they are and don’t want them moving about you can keep them in place:

  • Set the layout algorithm to “None”

OR

  • Highlight the nodes of interest and click the Lock button to lock them in place.
    (highlight them and click the Key button to allow them to move again when you refresh the layout, if desired).

Applying colour a few nodes at a time

Manual methods are useful if you only want to apply colour to a few nodes and don’t want or need to switch between different colour schemes.

The easiest method is to select a node or nodes from the chart using the Select tool.

  • Select the nodes of interest.
  • Choose a colour using the colour picker on the NodeXL Basic ribbon.
  • Click the Refresh Graph button to apply the changes.

OR

Enter a colour directly into the ‘Color’ column on the Vertices worksheet. If the column is not already visible you can show it on both the Edges and Vertices worksheets via the NodeXL Basic Ribbon > Workbook Columns button.

In the Color column:

  • Right click and select a colour using the “Select Color” menu option, or
  • Type in an RGB colour reference in the format R, G, B. For example, 0, 255, 255, or
  • Type in a CSS colour name. For example, DarkSeaGreen.

Click the Refresh Graph button to apply the changes.
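If you have a colour as a web-style hex code and want it in the R, G, B format mentioned above, the conversion is simple arithmetic. Here is a small Python sketch; the helper name `to_rgb_text` is made up for this example and is not part of NodeXL.

```python
def to_rgb_text(hex_colour: str) -> str:
    """Convert a hex colour such as '#00FFFF' to text in the form 'R, G, B'."""
    value = hex_colour.lstrip("#")
    if len(value) != 6:
        raise ValueError("expected a six digit hex colour, e.g. #00FFFF")
    # Each pair of hex digits is one channel: red, green, blue
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return f"{r}, {g}, {b}"


print(to_rgb_text("#00FFFF"))  # 0, 255, 255
print(to_rgb_text("#8FBC8F"))  # 143, 188, 143 (the CSS colour DarkSeaGreen)
```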

Apply colour in bulk – the real fun begins!

Applying colour (or other formatting choices) in bulk is very easy. If it’s in a column, you can colour code with it. It doesn’t matter how that information was entered in the column – loaded in, typed, derived by a formula – or what type of data it is. Pick one of the ideas I listed at the start of the post, and try it out.

  • Apply colour via the Autofill Columns button on the NodeXL Basic ribbon.
  • If you have previously applied colour (whether manually or by using this control) choose the option to “Clear Vertex Color Column Now” to start fresh.
  • Select the column to code on from the Vertex Colour dropdown box.
  • Check the settings under “Vertex Color Options….”.
    If you are colour coding on text values choose “Categories” from the dropdown box at the top left and click OK.
  • If you want to colour code using a numerical scale, choose “Numbers” and more options will appear.

View the legend

One of the useful features of automatic colour coding is that NodeXL will generate a legend for you.

  • Show the legend at the bottom of the chart, via the NodeXL Basic ribbon > Graph Elements button.

Change default node colour

Unfortunately NodeXL doesn’t allow you to choose the colours applied to each category. The first colour used is always a dark blue, which on my monitor is hard to distinguish from the default colour of black. It’s possible to change the default colour using the graph options.

  • Click the Graph Options button
  • Select a new colour by double clicking the colour swatch on the Vertices tab.

I encourage you to explore the other changes to default settings that are possible.


Example – Categories

Applying colour codes to categories really is as simple as selecting the column in a drop down box. This is a quick example of the type of investigation possible. Don’t forget – before you add new colours always use the option to clear the colour column or you might mix up your schemes.

Colour code matches with known branches

I manually added a column to the Vertices sheet labelled “Branch” and entered a surname indicating the branch for each person where the common ancestor is known. Then I clicked Autofill Columns and set my new Branch column as the vertex colour. My DNA results have a lot of very small groups. I can now easily see which branch six of them are connected to. It’s a start!

Kit with 125 4th or closer cousins (more distant cousins included in chart), cluster by connected component, Harel-Koren Fast Multiscale Layout with each group in its own box. Selected groups.

Colour code by side

I loaded both my own and my father’s matches into one file and then used a formula to mark each match as “Paternal” or “Maternal” in a new column depending on whether they shared DNA with my father. When I colour coded on the new “Side” column I could see that there was a clear division between groups, with a few strays. (Selected larger groups are shown for the sake of illustration).

This works with my tree as my branches are not inter-related and are generally from distinct populations. With a more interrelated tree it may highlight groups where it would be dangerous to make an assumption about side.
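The rule behind the “Side” column is easy to state: a match is marked “Paternal” if they appear in my father’s match list, and “Maternal” otherwise. I used a spreadsheet formula, but the same rule could be sketched in Python like this. The function name `label_side` and the match IDs are invented for the example.

```python
def label_side(my_matches, fathers_matches):
    """Label every match in the combined list as Paternal or Maternal."""
    fathers = set(fathers_matches)
    # Combine both lists, keeping the original order and dropping duplicates
    combined = list(dict.fromkeys(list(my_matches) + list(fathers_matches)))
    return {m: ("Paternal" if m in fathers else "Maternal") for m in combined}


# Hypothetical match IDs
my_matches = ["m1", "m2", "m3"]
fathers_matches = ["m2", "m4"]

sides = label_side(my_matches, fathers_matches)
print(sides)
```

Keep in mind that “Maternal” here only means “not found in my father’s list”, which is why the result needs to be read with the same caution as the chart itself.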

DNA matches colour coded by side (maternal, paternal)

Kit with 125 4th or closer cousins (more distant cousins included in chart), cluster by connected component, Harel-Koren Fast Multiscale Layout with each group in its own box. Selected groups.

Colour code a place

Now I want to see if I can dig in further.

One quarter of my father’s tree is from Cornwall. Many people have Cornish ancestry and following up on every possible Cornish lead could take me on any number of wild goose chases. Instead, using the ancestors file downloaded using the DNAGedcom client, I created a list of matches whose ancestors were born or died in Cornwall.

While there were occasional individuals highlighted here and there among my groups, one group stood out. This was a group where I had not confirmed any of the relationships – the only clue I had to go on is that they are matches to my father.

I would not expect every dot in a group to be coloured as not all matches have public trees on Ancestry. If you have made a list with places or names using the ancestors file, try also searching your matches on Ancestry itself. Chances are there will be some private trees among the results. You can add their matchIDs to the import list and make use of that information.  Yes, you read right. This is a way to squeeze some information from private trees!

Note also that only one of my closer matches is marked blue indicating Cornish ancestry in a public tree. It was the trees of distant matches, which I may never have looked at otherwise, that made the difference. 

DNA matches who have any ancestor born in Cornwall highlighted

Kit with 125 4th or closer cousins (more distant cousins included in chart), cluster by connected component, Harel-Koren Fast Multiscale Layout with each group in its own box. Selected groups.

Example – Numeric information

In earlier posts we used the SharedCM column to size the dots, so that closer relatives would have bigger dots. The human brain, however, is more able to pick out colour differences than size differences, so if you are focusing on groups around your closer matches, a heatmap type display might be useful.

We can use colour to make those close cousins stand out more – the eye tends to be drawn to warm colours. In this example, closer relatives are more orange and more distant matches will be a deep purple/blue.

  • Click the Autofill Columns button.
  • Set the Vertex colour to sharedCM.
  • Click the options button and choose Vertex Color Options…
  • Select Numbers in the dropdown.
  • Click Swap Colors so that closer matches will be more orange.
  • As I wanted all distant cousins to be blue I set the smallest number to 20cM.
  • I wanted all estimated 2nd cousins to be strongly orange, so I set the other extreme to 200cM.
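To make the effect of the two limits concrete, here is a rough Python sketch of a colour scale that is clamped at 20cM and 200cM. It is only an illustration of the idea: I don’t know exactly how NodeXL calculates its colours, and the function name `heat_colour` and the two end colours are my own choices for the example.

```python
def heat_colour(shared_cm, low=20.0, high=200.0,
                cold=(0, 0, 160), hot=(255, 140, 0)):
    """Map a shared cM value to an 'R, G, B' colour between cold and hot.

    Values at or below `low` get the cold colour, values at or above `high`
    get the hot colour, and values in between are blended in proportion.
    """
    clamped = min(max(shared_cm, low), high)
    t = (clamped - low) / (high - low)  # 0.0 at low, 1.0 at high
    r, g, b = (round(c + (h - c) * t) for c, h in zip(cold, hot))
    return f"{r}, {g}, {b}"


for cm in (10, 20, 65, 200, 900):
    print(cm, "->", heat_colour(cm))
```

Because of the clamping, a match sharing 10cM gets the same deep blue as one sharing 20cM, and a 900cM match is no more orange than a 200cM one. That keeps the full colour range for the part of the scale I am interested in.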


I used a kit with more interconnections than my own. The result is below. In this kit there are two groupings of closer cousins. The cousins in the centre of the graph have more connections, while relatives of the group on the left seem to be less well represented in the DNA testing population.

DNA match heatmap – closer cousins are more orange

Kit with 470 4th or closer cousins, cousins with <15cM shared excluded, Harel-Koren Fast Multiscale Layout to set start positions, followed by two applications of the Fruchterman-Reingold layout with repulsive force 1.0 and 3 iterations to increase the visual definition of the groups. Smaller unconnected components displayed separately at the bottom of the screen.


Where to from here?

This is the last post I have planned in this series focusing on Ancestry and NodeXL, but I doubt it will be my last post on the subject of network graphs. I’ve created a group on Facebook for discussion of Network Graphs for Genetic Genealogy. If you would like to have a conversation about what you’re doing with network graphs as they apply to genetic genealogy (regardless of the source of DNA matches or software used!) please comment below or better yet join the Facebook group.