Friday, October 5, 2012

Short URLs, Big Data: Learning about the World in Realtime

This session was presented by Hilary Mason, the Chief Scientist of bitly. She is also the cofounder of a non-profit called hackNY. If you've read Freakonomics, I think you will really enjoy learning about Hilary's work. In my opinion, this was easily one of the best talks at GHC this year. Be sure to read all about it!

Hilary studies data, and gets to work with scientists from fields like sociology, physics, biology, etc. to do it! She gave a quote that describes her philosophy:

"The purpose of computing is insight, not numbers." ~ Richard Hamming, 1961

Building on this, Hilary talked about how her work is really about learning about humanity and increasing our ability to make better decisions about the world. I could really relate to thinking about CS as a means to an end, or as a way to model and think about the world, rather than focusing only on the code itself. She also talked about a shift in our thinking, from naive data (single pieces of information that are interesting on their own) to data that is interesting in a human context.

There have been recent advances in several areas of Big Data: scalability/clustering, algorithms, and data storage/analysis. Some fascinating examples of recent Big Data applications are:
  • Using heat maps to identify restaurants that were illegally disposing of used oil/grease (an organized crime problem) in New York. Using this data, the City of New York set up its own grease disposal company and approached the restaurants about selling their grease to it, eliminating the problem.
  • Using ambulance response data to find out why ambulance drivers parked in non-optimal locations (it turned out there were coffee shops there). Using this data resulted in making deals with coffee shops to entice ambulance drivers to park in better locations!
There are plenty of apps/startups using Big Data, like the "Dark Sky" app for iPhone, DataKind, and PatientsLikeMe. Hilary also highly recommends the OKCupid blog.

What are Data Scientists?
Data science is a mix of disciplines like math, comp sci, engineering, and curiosity. Hilary showed us how the overlaps of these disciplines contain nerds, but the intersection of all of them contains awesome nerds! It was quite funny. Data scientists are concerned with building mathematical models for the right questions. It's important to find the questions that matter.

Bitly is actually a spinoff from a failed product. Its first year was consumed mostly with building scalable systems. They now see tens of millions of URLs per day and hundreds of millions of clicks per day. Bitly's research goal is: Can we understand human social attention in real time? A few things they've learned are:
  • social experience in a given social environment varies by individual
  • attention is fickle
  • data needs to be normalized for it to be truthful and accurate
  • the frame changes the way we consume content
  • new geographies matter (e.g. the best time to get clicks on Twitter is different from Facebook)
  • the internet IS the real world (e.g. look at network data during the Arab Spring: one reflects the other)
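To make the normalization point in the list above a bit more concrete, here is a minimal Python sketch. It is not bitly's code: the function name click_share, the regions, and the click counts are all made up for illustration. The idea is simply that raw counts mostly reflect how much traffic a place generates overall, so each count is divided by its region's total before comparing.

```python
from collections import defaultdict


def click_share(clicks):
    """Turn raw {(region, topic): count} into each topic's share of its region's clicks.

    Hypothetical helper for illustration only.
    """
    # Total clicks per region, used as the normalizing baseline.
    totals = defaultdict(int)
    for (region, _topic), count in clicks.items():
        totals[region] += count

    # Divide each raw count by its region's total, skipping empty regions.
    return {
        (region, topic): count / totals[region]
        for (region, topic), count in clicks.items()
        if totals[region] > 0
    }


# Made-up numbers: a high-traffic region and a low-traffic one.
raw_clicks = {
    ("New York", "news"): 9000,
    ("New York", "sports"): 1000,
    ("Wyoming", "news"): 50,
    ("Wyoming", "sports"): 50,
}

shares = click_share(raw_clicks)
for key in sorted(shares):
    print(key, round(shares[key], 2))
```

In raw counts, New York has twenty times as many sports clicks as Wyoming. After normalizing, sports accounts for a much larger share of attention in Wyoming than in New York, which is the kind of difference raw totals hide.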
What people share on the internet is not what people read. An interesting insight Hilary shared was that what people share is part of a persona they would like to present, not who they actually are. People actually have a mix of identities that they combine for presentation to the world.

Engineering Process at Bitly
The process is as follows:
  1. Research offline
  2. Do fancy math - find the shortcut
  3. Design infrastructure
  4. Re-design to run at scale and speed
Hilary explained that, for a startup, it's important to understand 'when you've won': startups don't have two years to do academic research, so they need to iterate quickly and not allow themselves to go 'down a rabbit hole'. I think this is important for other business environments as well, and I can certainly see it applying to some of my work.

Steps two and three are generally done in Python, while steps three and four often involve C and Go.

Another interesting problem they are working on is Realtime Search. Rankings are done dynamically and can vary by the second! That's amazing.

They have a @GHCBot which tweets about items of interest to the GHC community, so be sure to check it out! For the Star Trek fans among us, there is also a Star Trek bot, which you'll have to let me know about if you find it. :-)
