Friday, June 25, 2010

Drawing on other disciplines

Think of your family as a disease. They infect web pages and database records.

Think of individually opening and inspecting all those web pages and database records as a gold standard diagnostic test. The test is great, but you can't check every single page and record, can you?

What you need is a screening test to pick out the most likely candidates. You can administer the test via a search engine or the database search tools. Your job is to come up with the most appropriate screening test. One that identifies as many cases of the disease as possible, without overloading you with pages that have similar but unrelated symptoms (or family names!).

In medicine, the ability of a screening test to pick up all true cases is called its sensitivity, and its ability to pick up only true cases is its specificity.

So how can this help us when we are looking for records relating to our family?

Ideally, a medical screening test should be both sensitive and specific - that is, it should pick up all true cases, and not too much else. You don't want anyone with the disease to go undiagnosed, but you don't want to subject healthy people to diagnostic tests that may be uncomfortable, embarrassing, or even dangerous. Unfortunately, in a screening test there is always a trade-off between sensitivity and specificity. Otherwise it wouldn't be a screening test, it would be the diagnosis!

The same ideas apply when coming up with search terms to find your family. It's useful to remember that trade-off, and to think about the result you want to achieve.
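To put some numbers on the trade-off, here's a minimal Python sketch that treats a search as a screening test. Everything in it is made up for illustration: the 20 numbered "pages", which four of them are really about the family, and what the two searches return. The function name screen_stats is hypothetical too.

```python
def screen_stats(all_pages, relevant, returned):
    """Return (sensitivity, specificity) for a search treated as a screening test.

    Assumes there is at least one relevant page and at least one irrelevant page,
    otherwise one of the divisions below would be by zero.
    """
    true_pos = len(relevant & returned)               # family pages the search found
    false_neg = len(relevant - returned)              # family pages the search missed
    false_pos = len(returned - relevant)              # unrelated pages the search returned
    true_neg = len(all_pages - relevant - returned)   # unrelated pages correctly left out
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Invented data: 20 pages in total, 4 of them really about the family.
all_pages = set(range(1, 21))
relevant = {1, 2, 3, 4}

# A broad search finds every family page but drags in 8 unrelated ones.
broad = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
# A narrow search misses one family page but returns only 1 unrelated page.
narrow = {1, 2, 3, 13}

print(screen_stats(all_pages, relevant, broad))   # (1.0, 0.5)
print(screen_stats(all_pages, relevant, narrow))  # (0.75, 0.9375)
```

The broad search is perfectly sensitive but only half as specific; the narrow one gives up a little sensitivity for a big gain in specificity.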


I started thinking along these lines as I was testing out search terms for use in google alerts last week. Having blogged about a find I made while doing so, I was asked in the chat session for a course I'm doing about how I was setting up my searches. As I blathered on (I fear I did blather) part of my mind was thinking how much easier it would be to properly describe what I was doing, if only I could talk about what I was hoping to achieve in terms of the sensitivity and specificity of my search results.

Now I can.

So... I was trying out google searches for suitability as google alerts. If you are going to create a screening test, you need to know WHY you are screening. What do you hope to find? What are the repercussions if you don't identify every single case?

My reason for creating the alerts is to see new material about my family, even though I may not be actively researching that part of the tree. The impact of missing relevant pages through the alerts is low. I'm likely to come across them when I get around to researching that part of the family. I don't want to wade through irrelevant results each day, and I don't mind creating lots and lots and lots of alerts.

In other words, I'm more concerned about the specificity of my search results than the sensitivity. It would be different if I was looking for my family in a census. I would be much more concerned about finding every relevant record and may have to sacrifice specificity in order to pick up spelling and transcription variations. Fortunately, in genealogy, unlike medicine, you can usually run lots of population screening tests!


How does this help?

Knowing why you are doing a search makes all the difference to knowing what search terms will get you a good result. I knew that from my starting point of a search on "couper", I wanted a big increase in specificity but didn't mind if I lost sensitivity.

Google these days seems to search on word variations for you, even without using "~" in front of a word. If I had wanted to increase the sensitivity of my test I could ask google to search on multiple names using | between words, which acts as "or". That is, "Couper|Cooper|Coupar|Cowper" will give me 100 million plus results, compared to 5.8 million with "Couper" alone.

There are plenty of ways I can increase specificity of the search. Any additional information on the family might do the trick. The more unique to the family, and the more likely to be mentioned when discussing the family, the better.

Some ideas are place names, street addresses, family members' first names, occupations, employers, and year ranges (use ".." for the date range that should appear on the page, e.g. "1850..1935"). All these things can be used to increase the specificity of the search. "Genealogy" or "~genealogy" will also help narrow the results down.

When I searched on Couper and the place name Oakleigh, about 1 million results were returned, with a relevant result that I would want to appear in alerts on the front page. I also see that google has added in variations for "couper", giving me an unwanted increase in sensitivity. Suddenly, Oakleigh looks like the place to buy and sell "Mini Cooper S coupe"s, which isn't my interest at all!

"+Couper Oakleigh" looks better, and "+Couper Oakleigh ~genealogy" even better again. Just a pity that so many of the results are me, one way or another!

So far I have only set up half a dozen alerts, but google will allow me 1000.
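If I do end up writing hundreds of alerts, a little script could assemble the search strings for me. Here's a rough Python sketch that uses only the operators mentioned above: | for "or", ".." for a year range, "~" in front of a related word, and "+" in front of a name to stop it being swapped for variations. The function name build_query and its parameters are invented for illustration, and the sketch only builds text. It doesn't check whether google still honours any of these operators.

```python
def build_query(surnames, extra_terms=(), year_range=None, related=None, exact=False):
    """Assemble a search string from the operators described in this post.

    surnames:    one or more spellings, joined with | (more sensitive)
    extra_terms: place names, first names, occupations etc. (more specific)
    year_range:  (start, end) tuple written as start..end
    related:     a word to prefix with ~, such as "genealogy"
    exact:       prefix a single surname with + to avoid variations
    """
    if exact and len(surnames) != 1:
        # Widening with | and pinning with + pull in opposite directions,
        # so this sketch only allows + on a single surname.
        raise ValueError("exact=True expects exactly one surname")

    name_part = "|".join(surnames)
    if exact:
        name_part = "+" + name_part

    parts = [name_part]
    parts.extend(extra_terms)
    if year_range is not None:
        start, end = year_range
        parts.append(f"{start}..{end}")
    if related:
        parts.append("~" + related)
    return " ".join(parts)


# More sensitive: several spellings of the surname.
print(build_query(["Couper", "Cooper", "Coupar", "Cowper"]))
# Couper|Cooper|Coupar|Cowper

# More specific: one spelling, a place name and a related word.
print(build_query(["Couper"], ["Oakleigh"], related="genealogy", exact=True))
# +Couper Oakleigh ~genealogy

# A year range as an extra narrowing term.
print(build_query(["Couper"], ["Oakleigh"], year_range=(1850, 1935)))
# Couper Oakleigh 1850..1935
```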

Do you draw on ideas from other fields? 


Note: In case you are wondering, I don't have any medical qualifications, but I do have a post-graduate qualification in Public Health.

Friday, June 18, 2010

First known burial in Oakleigh General Cemetery

I just found out that a plaque has been placed, commemorating the first known burial in Oakleigh General Cemetery. The person named, Christina Couper, was seven years old when she died in 1860, and happens to be my great-grand-aunt. Here's the media release. The release also mentions a booklet about the cemetery.

I've previously posted some pictures of the memorial park, on a Tombstone Tuesday post for Christina's nephew who was tragically killed when his horse bolted.


So, there are a few more to-do items for me. Get a photo of the plaque, and get a copy of that booklet!


I found the post when I was testing out search terms to set up a few more Google alerts. Looks like that search worked!

Monday, June 7, 2010

Another spur of the moment student

I was reading Geniaus' post, spur of the moment student, about how she signed up for a National Institute for Genealogical Studies course on "Australian and NZ Research".

Like Geniaus, I've been thinking for some time that I'd like to do some sort of genealogy study. I'm not sure how anything too formal would fit into my life at the moment though. It's hard for me to get time. So, the idea of a short course with little pressure, not too pricey, and specific to my research interests was very tempting...



So tempting, that I signed up too!

Friday, June 4, 2010

In defence of my honour!

Yesterday I read Tamura Jones' article about the MyHeritage Awards fiasco. I must say, I felt a little... besmirched.

For the record:
I am on that list and I received, and replied to, an email from MyHeritage prior to the list being published.

The email from MyHeritage included the following statement:
We were hoping to simply list you among the winners on our website, and to offer you a html badge to display on your website. There's no pressure with this, so if you don't want to have a badge on your site then you don't have to do that.
My interpretation of the message was that I could choose to participate or not. If I chose to participate, then they would send me a badge which I could choose to display, or not. There was no obligation. It was unclear if you had to reply to be included.

I did think it odd that they would send the email. Why not just issue the awards? Why seek permission to say you liked what someone was doing?

The thought crossed my mind that (although the message read very much like a form letter) I may have been contacted in advance because I had an email exchange with MyHeritage last November, initiated by them. They asked me what I thought of their site. Initially I ignored the message but they asked again - so I told them. I expressed some serious concerns arising from an incident a few years ago. I won't go into those concerns as in that November correspondence I felt that they took me seriously, looked into all that I said, and provided a plausible and possibly even reasonable explanation of what had happened.

After some thought, I replied and said that they could list me and send the badge code, but I may not display it.

When the list was published many of my favourites were included. Quite a few of them reacted with surprise at being listed. I don't doubt them. I wonder how widespread those emails actually were? It all seemed like a bit of fun, so long as you didn't take the name too seriously. My 'acceptance speech' post was intended to reflect that view. I decided to put up the badge for a while on the basis that, cynical marketing exercise or not, I thought there were many great blogs on the list. I've taken it down now that I've learnt what it may do to my reputation!

Can anyone recommend a good tarnish remover for the corners of my blog? ;-)

Tuesday, June 1, 2010

Data entry tally - June 2010

Back in February I described my backlog of 190 electronic files to process and enter into my database. To make myself actually enter all that data, I was going to report back regularly on my progress. So far I've only reported back once, but I have been diligently checking and recording my progress on the first of each month.

Today, the total number of files left to process is... (drum roll)... 225.

Sigh.

It just keeps getting bigger! On the other hand, the source list in my database is definitely getting longer. I am entering a lot of data, but am accumulating information even more quickly! Better luck next month...