
Friday, June 25, 2010

Drawing on other disciplines

Think of your family as a disease. They infect web pages and database records.

Think of individually opening and inspecting all those web pages and database records as a gold standard diagnostic test. The test is great, but you can't check every single page and record, can you?

What you need is a screening test to pick out the most likely candidates. You can administer the test via a search engine or the database search tools. Your job is to come up with the most appropriate screening test. One that identifies as many cases of the disease as possible, without overloading you with pages that have similar but unrelated symptoms (or family names!).

In medicine, the ability of a screening test to pick up all true cases is called its sensitivity, and its ability to pick up only true cases (that is, to leave out those who don't have the disease) is its specificity.
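For anyone who likes to see the arithmetic, here is a minimal sketch of the two measures using the standard definitions. The page counts in the example are invented purely for illustration; they are not from any real search.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of all relevant items that the screening test picked up."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Share of all irrelevant items that the screening test left out."""
    return true_negatives / (true_negatives + false_positives)


# Invented example: a search returns 8 of the 10 pages that really are
# about my family (2 missed), and wrongly returns 10 of 100 unrelated pages.
found_rate = sensitivity(true_positives=8, false_negatives=2)
left_out_rate = specificity(true_negatives=90, false_positives=10)
print(f"sensitivity: {found_rate:.2f}, specificity: {left_out_rate:.2f}")
```

In practice you rarely know the number of missed pages, so treat these as a way of thinking about a search rather than something you can measure exactly.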

So how can this help us when we are looking for records relating to our family?

Ideally, a medical screening test should be both sensitive and specific - that is, it should pick up all true cases, and not too much else. You don't want anyone with the disease to go undiagnosed, but you don't want to subject healthy people to diagnostic tests that may be uncomfortable, embarrassing, or even dangerous. Unfortunately, in a screening test there is always a trade-off between sensitivity and specificity. Otherwise it wouldn't be a screening test, it would be the diagnosis!

The same ideas apply when coming up with search terms to find your family. It's useful to remember that trade-off, and to think about the result you want to achieve.

I started thinking along these lines as I was testing out search terms for use in google alerts last week. Having blogged about a find I made while doing so, I was asked in the chat session for a course I'm doing about how I was setting up my searches. As I blathered on (I fear I did blather) part of my mind was thinking how much easier it would be to properly describe what I was doing, if only I could talk about what I was hoping to achieve in terms of the sensitivity and specificity of my search results.

Now I can.

So... I was trying out google searches for suitability as google alerts. If you are going to create a screening test, you need to know WHY you are screening. What do you hope to find? What are the repercussions if you don't identify every single case?

My reason for creating the alerts is to see new material about my family, even though I may not be actively researching that part of the tree. The impact of missing relevant pages through the alerts is low. I'm likely to come across them when I get around to researching that part of the family. I don't want to wade through irrelevant results each day, and I don't mind creating lots and lots and lots of alerts.

In other words, I'm more concerned about the specificity of my search results than the sensitivity. It would be different if I was looking for my family in a census. I would be much more concerned about finding every relevant record, and might have to sacrifice specificity in order to pick up spelling and transcription variations. Fortunately, in genealogy, unlike medicine, you can usually run lots of population screening tests!

How does this help?

Knowing why you are doing a search makes all the difference to knowing what search terms will get you a good result. I knew that from my starting point of a search on "couper", I wanted a big increase in specificity but didn't mind if I lost sensitivity.

Google these days seems to search on word variations for you, even without using "~" in front of a word. If I had wanted to increase the sensitivity of my test I could ask google to search on multiple names using | between words, which acts as "or". That is, "Couper|Cooper|Coupar|Cowper" will give me 100 million plus results, compared to 5.8 million with "Couper" alone.
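If you keep a list of surname variants, the "or" search above can be put together mechanically. This is only a sketch: the function name `or_query` is made up for illustration, and it simply joins the names with | as described above. Search operators change over time, so check that | still behaves this way before relying on it.

```python
def or_query(variants):
    """Join surname variants with | so the search matches any one of them."""
    # Drop blanks and surrounding spaces so a stray empty entry
    # doesn't produce a "||" in the middle of the query.
    cleaned = [name.strip() for name in variants if name.strip()]
    return "|".join(cleaned)


# The surname variants mentioned in the post.
print(or_query(["Couper", "Cooper", "Coupar", "Cowper"]))
```

Each variant you add raises sensitivity, at the cost of more unrelated pages to wade through.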

There are plenty of ways I can increase specificity of the search. Any additional information on the family might do the trick. The more unique to the family, and the more likely to be mentioned when discussing the family, the better.

Some ideas are place names, street addresses, family members' first names, occupations, employers, and year ranges (use ".." for the date range that should appear on the page, e.g. "1850..1935"). All these things can be used to increase the specificity of the search. "Genealogy" or "~genealogy" will also help narrow the results down.

When I searched on Couper and the place name Oakleigh, about 1 million results were returned, with a relevant result that I would want to appear in alerts on the front page. I also see that google has added in variations for "couper", giving me an unwanted increase in sensitivity. Suddenly, Oakleigh looks like the place to buy and sell "Mini Cooper S coupe"s, which isn't my interest at all!

"+Couper Oakleigh" looks better, and "+Couper Oakleigh ~genealogy" even better again. Just a pity that so many of the results are me, one way or another!
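With lots of alerts to set up, it can help to build the narrowed searches the same way each time. The sketch below assembles a query from the pieces discussed above: "+" in front of the surname to switch off variations, extra family-specific terms, an optional ".." year range and an optional "~genealogy". The function and parameter names are invented for illustration, and the operators are the ones that worked for me at the time of writing, so test a query by hand before turning it into an alert.

```python
def build_alert_query(surname, extra_terms=(), exact_surname=False,
                      year_range=None, add_genealogy=False):
    """Assemble a narrowed search string for use as an alert."""
    # "+" asks for the surname as typed, without spelling variations.
    parts = ["+" + surname if exact_surname else surname]
    # Place names, first names, occupations, employers and so on.
    parts.extend(extra_terms)
    if year_range is not None:
        start, end = year_range
        # ".." requests a number within this range on the page.
        parts.append(f"{start}..{end}")
    if add_genealogy:
        parts.append("~genealogy")
    return " ".join(parts)


# The two searches from the post, plus one with a year range added.
print(build_alert_query("Couper", ["Oakleigh"], exact_surname=True))
print(build_alert_query("Couper", ["Oakleigh"], exact_surname=True,
                        add_genealogy=True))
print(build_alert_query("Couper", ["Oakleigh"], exact_surname=True,
                        year_range=(1850, 1935)))
```

Every extra term trades away some sensitivity for specificity, which is the right direction for alerts but probably the wrong one for a census search.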

So far I have only set up half a dozen alerts, but google will allow me 1000.

Do you draw on ideas from other fields? 

Note: In case you are wondering, I don't have any medical qualifications, but I do have a post-graduate qualification in Public Health.
