Blog post

Wednesday, August 16, 2017

Visualising Ancestry DNA matches-Part 9-Combining kits

By now those of you playing along will have created a network analysis workbook using the NodeXL template, loaded your Ancestry DNA information, broken the tangle of matches into groups, experimented with the settings and found out how you could add additional relationships. Phew! See the index to previous posts if you’re just joining in.

Now the real fun begins!

A few readers have asked if it’s possible to combine kits together. The answer is Yes! Combining kits in one file is almost as easy as loading your own information, and can be very useful.

This post assumes that you manage more than one kit, or that the owner of another kit has provided you with their files. It also assumes that your kits aren’t so large that loading more information will make the file unworkable. Save before you try it. I manage two kits at present but you can add information for as many kits as you think your computer will handle.

I’ve loaded my kit and my father’s kit into one worksheet. A simple edit to the matches file before loading created a new column for my father’s sharedCM values.

image

I did some quick calculations to find out how many matches we have in common. I match 50% of my father’s 4th or closer cousins. Including all the distant cousins we have a combined total of 18,889 matches – only 15% of the grand total is shared. Exercise caution if adding distant cousins!

In-common-with file

The in-common-with file will add lines representing DNA connections to your graph.

Loading an in-common-with file will also add people who are related to the additional kit’s subject. If your goal is to research the family tree of the focus person (‘you’), the best kits to load are those belonging to relatives who have some of the same ancestors as you, but no ancestors that you don’t have.

Many of the people you ‘skipped’ are prime candidates:

  • Full siblings
  • Parents
  • Aunts and uncles
  • Grandparents

This doesn’t mean that you should never load the in-common-with file for someone who has ancestors you don’t. Combining a kit with a half sibling may help you work out which matches are ‘yours, mine, or ours’. 

If you don’t load the in-common-with file you can still load the matches file to place the sharedCM values side by side as I have.

Load the ICW file

Loading the in-common with file for additional kits is easy. Simply load it in exactly as you have done before.

  • NodeXL basic ribbon, Import button, From Open Workbook…
  • Select the file in the top box
  • Under Is Edge Column tick ‘matchid’ and ‘icwid’
  • Which edge column is Vertex 1: matchid
  • Which edge column is Vertex 2: icwid


Matches file

When loading matches for an additional kit the data loaded for shared matches will overwrite existing data.

The name and admin columns have the same information regardless of which kit they match so nothing is lost by reimporting these for another person. In fact, it’s better if you do import them, otherwise you won’t know who the new matches are. 

Columns such as range, sharedCM, note and matchURL differ from kit to kit. If you want to import any of these columns (I’d import sharedCM at minimum) you’ll need to make a few minor edits to the import file first.

Prepare the matches file

  • Open the match file m_AdditionalKitName.csv
  • Save a copy with a different name. m_AdditionalKitName_edited.csv will do.
  • The matchid, name and admin columns should be left alone.
  • For any other column you want to import, change the column header to indicate whose information it is.
    For example, ‘sharedCM’ might become ‘sharedCM John’. Keep it simple because next time you update the file you’ll need to enter it in exactly the same way.
  • Choose the first value in the testid column and change it to ‘zzz delete’. Then double click on the little square in the corner of the cell to copy it all the way down the sheet. This step isn’t strictly necessary but it only takes a few seconds and will make it easier to remove extra lines not needed for the graph.
    image
  • Save the file, but don’t close it yet.


Load the matches file

  • NodeXL basic ribbon, Import button, From Open Workbook…
  • Select the file in the top box
  • Under Is Edge Column tick ‘testid’ and ‘matchid’
  • Under Is Vertex 2 Property Column tick:
    • name
    • admin
    • any other columns you wish to import (remember if the column name matches a column already present the information will be overwritten)
  • Which edge column is Vertex 1: testid
  • Which edge column is Vertex 2: matchid


Remove unwanted matches

If you decided not to load the in-common with file, you may prefer to remove matches who don’t share DNA with you. You’ll find them at the bottom of the Vertices sheet. There won’t be any information in your own sharedCM column for those people.

Housekeeping

A few clean up tasks will make sure the graph is ready for more work.

Clean up the Edges

  • If you loaded an in-common-with file, remove duplicates (NodeXL ribbon, Prepare data button).
  • On the Edges worksheet, sort the Vertex 1 column from smallest to largest using the dropdown on the column header.
  • Filter the Vertex 1 column to only show ‘zzz delete’ entries.
    image
  • Highlight those lines and delete them.
  • Clear the filter afterwards.

Excel tips:

  • To quickly select a range of rows, select the top cell you want to include. With the Shift key held down, tap the End key and then the Down arrow.
  • To delete rows, move to the Home ribbon and click the Delete button. Choose either Delete Sheet Rows or Delete Table Rows.
    image

Clean up the Vertices

There should only be one row labelled  ‘zzz delete’ to get rid of and it will be at the very bottom of the Vertices sheet. Sort the column to find it if not. You can get rid of it, or just enter ‘Skip’.

Fix up the dot sizes

Earlier, we sized the dots according to the value in the sharedCM column so that we would have a visual indication of how close the relationship with the match is. Now that you have two (or more!) sharedCM columns it’s very likely that they are scattered with blank cells. All those dots will be the default dot size.

The easiest option is to set all the dots to the same size by using the Autofill columns button to clear the size column.

Personally, I prefer having larger and smaller dots. To fill in the blanks, I added a new column to the Vertices worksheet with a formula that returns the larger of the two sharedCM values. To do this I used the MAX function. The AVERAGE function might be a good option if you have loaded several siblings.

  • Add a column to the Vertices sheet by entering a new column heading in the first empty cell in the heading row. ‘New Size’ will do for a heading.
  • Select the first empty cell in the new column.
  • Move to the Home ribbon and change the cell format from ‘Text’ to ‘General’.
    image
  • Enter your preferred formula (see below if you need help). It should automatically fill in all the way down the table.

When you’re happy with the formula, use the Autofill columns button to transfer the content of your new column into the Vertex Size property.

Excel tip:

To enter the MAX or AVERAGE formula, start by typing in the formula name and an opening bracket:

=MAX( 

Then click on each cell (type a commas in between each click) that the calculation should use. You can enter as many elements as you want. Make sure you’re clicking in the same row as your formula. Finish off by entering a closing round bracket. It will look something like this:

=MAX([@sharedCM],[@[sharedCM Dad]])

Or type:

=MAX(AF3,AG3)

(check the cell references match your sheet).

Important note: Formulas and PC performance

Usually when you enter a formula in Excel it calculates so quickly that the result seems to pop up instantaneously. When you make a change in a worksheet any dependant cells (and their dependant cells and so on down the line) are recalculated in the blink of an eye.

We’ve just entered a formula all the way down a long table. This shouldn’t pose too much of a problem…. until it does. It might be when you run the grouping calculations again, or next time you load new data. With potentially tens of thousands of cells to recalculate those fractions of a second start to add up and Excel may stop responding.

There are two options to choose from that will lighten the load.

  1. Replace the formula with values: Highlight the column, Copy. Paste as values.
    image
    If you choose option 1, you’ll need to recreate the formulas when you load new data.

    OR

  2. Stop Excel from automatically calculating. You’ll find Calculation Options on the Formulas ribbon.
    If you do this you will need trigger recalculation of the worksheet yourself when required, either by pressing the Calculate Now button, or by pressing F9 on the keyboard.
    image
    The calculation choice will be saved with the worksheet. Be aware that any other worksheet that is open at the same time will also be affected, and the calculation choice saved for them as well. Also, the setting saved in the first workbook opened in any session is then applied to any other workbooks opened in the same session! It’s probably better to check the setting before you do anything with heavy calculations… and…. if you choose this option, remember what you have done! Formulas may look like they are working when you fill them in, but they won’t calculate correctly until you press F9.
    (In practice it’s not all quite so troublesome as it sounds).

Run clustering calculations

Did you read the important note about PC performance? Hopefully one column of formulas won’t be too much of a strain, but if you have any doubt please take one of the actions above, just in case!

Re-run the clustering algorithm of your choice and lay the graph out once more.

Explore!

In the next post I’ll show you how to colour code your matches.

5 comments:

  1. Thanks so much for these articles! Unfortunately, my laptop doesn't have enough processing power to deal with adding my sister's matches this way. What I've done instead is to treat mine as an update to hers -- so my info overrides hers, rather than the other way around. That works visually, but I lose the ability to see both of our match levels on the chart.

    My system also stalls if I try to update more than once, so when we have a number of new matches, I need to redo the whole chart!

    Is there a way to select only a few new matches and add them? That might not crash my Excel as abruptly.

    Thanks again!

    ReplyDelete
  2. There are various ways you could extract the new records, but they're are a bit complex.
    Do you have the problem when you load either file, or only when you load the match file?
    I assume you've cleaned out extra lines and made sure that no formulas are calculating as you import. The other thing that may help is to load the new records first when you open the worksheet, before you refresh the graph.

    ReplyDelete
  3. Thanks for all this wonderful help you have given.

    My computer grinds to a halt when I import a second match file - I have tests from five full siblings that I would like to combine. In your instructions above you say change the testid column to "zzz delete” and you show in the picture row 1 is left as "testid". later when you describe the import you say "Which edge column is Vertex 1: zzz delete" I dont see an option for zzz delete just testid - have I done something wrong here ou is this incorrect.
    I have stopped auto recalc and copied formulas to values but it just grinds to a halt and I have to crash out of the Excel task.

    I was wondering could you combine the match and icw files of my siblings prior to importing by doing the merging in excel and how I might go about this if it is possible?

    Thanks again for these wonderful tutorial style instructions.

    ReplyDelete
    Replies
    1. Whoops! That should have read 'testid'. I've corrected the post now.

      Have you tried loading the new records when you first open the file, before you click refresh graph?

      You could try combining the match and ICW files. Either cut and paste (which may be awkward with such large files) or explore Excel's query options on the data ribbon.

      There should be no issue in terms of importing with appending the ICW files one after the other. The only issue with the match file would be the treatment of columns where the data differs for each person. If you append the file without creating a new columns for sharedcM etc for the second person it will still import, but you will see a mixture of values some for the first person imported and some for the second without a way to tell which is which.

      Delete
  4. I'll try that - thanks. I think it's just the volume / memory issue for my far-from-new-laptop.

    ReplyDelete