Saturday, October 17, 2009

Getting more from my newspaper archive searches

I've been experimenting with Google's News Archive Search, which we were recently reminded of by Randy Seaver. I had this post in mind before I read his article, I swear!

Although I'll be talking about applying the Google News Archive Search to the National Library of Australia's (NLA's) newspaper archive site, I imagine my comments would apply to many of the newspaper archives indexed by Google.

I noticed that the NLA newspaper archive is picked up by the Google search, and I was interested to see how the site's own search and the Google search would compare. It's possible to make the comparison by adding an appropriate site restriction to the Google search term, for example site:.nla.gov.au
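As a rough sketch of the site restriction just described, here is how the search term could be put together in Python. The function name restrict_to_site is made up for illustration; only the site:.nla.gov.au restriction itself comes from the post.

```python
def restrict_to_site(terms: str, site: str = ".nla.gov.au") -> str:
    """Append a site: restriction to a free-text search term.

    The helper name is hypothetical; the resulting text is what
    would be typed into the search box.
    """
    return f"{terms.strip()} site:{site}"


print(restrict_to_site("STANNUS"))  # STANNUS site:.nla.gov.au
```

The same helper would work for any other archive by passing a different site value.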

I compared the search results I got from:

  1. the NLA site's own search (http://newspapers.nla.gov.au/), and
  2. the Google news archive search (http://news.google.com/archivesearch), limited to the NLA archive.

The first search I tried was the surname STANNUS. It's the surname I usually use for experimenting with new databases. It's common enough that I get hits, but rare enough that the number of hits doesn't overwhelm me. Also I have some idea of how most of the people returned (outside of the USA) connect to my tree, which is nice.

Running the search on the NLA site, I got 259 hits. The Google News Archive Search, limited to site:.nla.gov.au, gave me 61 hits.

This was about what I expected. The NLA seems to have added a lot of newspapers lately and it looked as though the Google indexing had not yet picked up the additional newspapers or changes to the archives (the NLA OCR results are user editable). I could see that Google had picked up older edits to the NLA archives, because a few I had made several months ago came up in the Google search.

Then I noticed something interesting in the Google results: in one of the hits, the OCR of the newspaper text had split the word STANNUS into STAN and NUS. Google picked it up as a hit; the NLA site didn't. (The Stannus referred to turns out to be my GG Uncle.)
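For anyone who keeps saved copies of OCR text, a split like STAN NUS can also be caught locally. The following is a minimal sketch, not a description of how Google or the NLA site actually match text: it assumes the only damage is stray spaces, line breaks or hyphens between letters, and the function name ocr_tolerant_pattern is hypothetical.

```python
import re


def ocr_tolerant_pattern(surname: str) -> re.Pattern:
    """Build a regex that matches the surname even if OCR inserted
    whitespace or hyphens between its letters (assumed damage only)."""
    letters = [re.escape(ch) for ch in surname]
    # Allow any run of whitespace or hyphens between consecutive letters.
    return re.compile(r"[\s\-]*".join(letters), re.IGNORECASE)


pattern = ocr_tolerant_pattern("STANNUS")
for sample in ["Mr STANNUS arrived", "Mr STAN NUS arrived", "Mr STANTON arrived"]:
    print(sample, "->", bool(pattern.search(sample)))
```

A pattern this loose can also match across two unrelated words, so any hits still need checking by eye.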

Further experimentation with a search on "Couper, Oakleigh, butcher" gave 151 results on the NLA site, the first of those being the story about the death of Leslie Couper Miller. There were more hits - 453 - on Google. That was a surprise. I could see that Google had also included hits for "Coupe" and "Coupar". It's a pity that there's no easy way (correct me if I'm wrong!) to find out what set of words Google searched on. I didn't see any Coopers or Cowpers in the Google results, and when I tried a search on Couper|Coupar|Coupe I still got 453 results (the "|" works as OR in the search term), so I don't think those other common name variations were included.

When I forced Google to look for exactly the search terms given (adding a + in front of each word, or putting quotation marks around each word, works here) I found only 31 results. They did not include the article about young Leslie's death.
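The search-term variations tried above (the "|" for OR, and a + or quotation marks for exact matching) can be generated with a few lines of Python. This is only a sketch for building the text to paste into the search box; the helper names are hypothetical, and the operators are the ones described in this post, which a search engine may treat differently at other times.

```python
def any_of(variants):
    """Join spelling variants with | so that any one of them may match."""
    return "|".join(variants)


def exact_quoted(words):
    """Put quotation marks around each word to ask for an exact match."""
    return " ".join(f'"{w}"' for w in words)


def exact_plus(words):
    """Put a + in front of each word, the other exact-match form."""
    return " ".join(f"+{w}" for w in words)


print(any_of(["Couper", "Coupar", "Coupe"]))            # Couper|Coupar|Coupe
print(exact_quoted(["Couper", "Oakleigh", "butcher"]))  # "Couper" "Oakleigh" "butcher"
print(exact_plus(["Couper", "Oakleigh", "butcher"]))    # +Couper +Oakleigh +butcher
```

Keeping a list of variants per surname makes it easy to rerun the same set of searches on each archive.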

All this will change my NLA newspapers search strategy, if only slightly. I will definitely still use the NLA (or other archive site) first, thanks to its better coverage and finer search options. I will then follow up with a search via Google, as it might pick up some name variations, or OCR errors, that I hadn't thought of.

If you find this interesting or (especially) if it helps you with your searches, please leave a comment!

4 comments:

  1. I get so frustrated with searching my husband's surname as it is "English". Think of how many hits you get with that word.
    I guess we have to exhaust all avenues of searching and we have to do it repeatedly to get the results we hope for. We just don't want to miss something.
    Good article.

  2. Thanks!
    I certainly can relate to searching on the family name "English". One of mine is "French"!

  3. Thanks for the tip and for sharing it on your blog.

  4. I picked up a document or 2 in Google's News Archive Search that NLA does not give me at all, using the same search name(s). I guess that sometimes words/letters are just not picked up for various reasons.

    So it's really worth searching both places for the same thing.
