Thursday, January 15, 2009

Funny Byproducts of Research

I've recently begun working full-throttle on my Master's thesis. The first stage requires crawling a number of pages and looking for certain HTML features within them. Unfortunately, the feature I'm interested in can be used in multiple ways, so I need to check manually that each page uses it the way I need. Luckily, I was able to build an interface that significantly speeds this up, but the whole process still takes several hours of clicking 'keep' or 'reject' buttons. On the bright side, I got to see first-hand what an eclectic collection of pages I've crawled:
  • several pages of sumo wrestler bios and statistics
  • lots of pages about the odds on horse races
  • many pages in Estonian
  • a lot of pages about finances, weather, and disaster information
It's a fun project, and hopefully I'll be able to blog more about it as I progress through each stage.
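
For the curious, an automatic first pass of the kind described above (narrowing crawled pages down to those that contain a given HTML feature at all, before the manual keep/reject step) might look roughly like the sketch below. This is only an illustration, not the actual thesis code: the feature isn't named in this post, so the `table` tag in the usage example is a stand-in, and `TagCounter`, `count_tag`, and `candidates` are hypothetical names. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times one tag name opens in an HTML document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()  # HTMLParser reports tag names in lowercase
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tag(html, tag):
    """Return the number of opening `tag` elements found in `html`."""
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count


def candidates(pages, tag):
    """Return the URLs whose HTML contains `tag` at least once.

    `pages` is assumed to be a dict mapping URL -> HTML source.
    Pages returned here would still need the manual keep/reject check.
    """
    return [url for url, html in pages.items() if count_tag(html, tag) > 0]


if __name__ == "__main__":
    # Placeholder data; "table" is a stand-in for the unnamed feature.
    pages = {
        "http://example.com/a": "<html><table><tr><td>1</td></tr></table></html>",
        "http://example.com/b": "<html><p>nothing here</p></html>",
    }
    print(candidates(pages, "table"))
```

A filter like this can only say that the feature is present, not how it is being used, which is why the manual review step would still be needed afterwards.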

