Kaggles are good for you

Earlier this week I finished my first Kaggle competition. Kaggle is (primarily) a platform for holding data science competitions. If somebody has a pile of data and would like some predictive modeling done with it, they can work with Kaggle to unleash a horde of bored smart people on the problem.

The most famous data competition is probably the Netflix Prize. Netflix wanted to improve their recommendation engine, and outsourced the problem to the nameless masses, offering fame and fortune to the best among them.

The competition I just finished was cooked up by Marinexplore and Cornell University’s Bioacoustic Research Program. The task was to correctly identify specific whale calls in a collection of audio data gathered by a network of buoys. Here’s how I did:

[Image: WhaleResults]

I had just a few simple goals for this competition:

  • Actually compete. That is, submit a few increasingly accurate predictions that I could be relatively proud of.
  • Use R, almost exclusively. While I’ve occasionally used R for things in the past, I had never done an entire project with it. It’s hard to learn a tool when you’re only using it in very special circumstances.
  • Beat all of the benchmarks. Kaggle leaderboards usually show how well simple implementations of various approaches would score. In this case there were benchmarks for an “all zeros” submission as well as the existing algorithm used by the team at Cornell.

So how did I do? I didn’t make it into the top 50% of teams (I landed around the 45th percentile), but I did go 3 for 3 on my own goals.

I used the tuneR library to pull out a few numbers from each audio file and then ran a random forest. My score did improve with each submission, entirely as a result of pulling out numbers from smaller and smaller segments of the audio files.
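To make the "smaller and smaller segments" idea concrete, here is a rough sketch in Python rather than R. It is not my tuneR code: the function name, the two features (RMS amplitude and zero-crossing rate), and the synthetic clip are all assumptions chosen for illustration. It only shows how cutting a clip into more segments produces more, finer-grained numbers per file; the resulting rows could then be handed to any random forest implementation.

```python
import math

def segment_features(samples, n_segments):
    """Split a clip into n_segments roughly equal chunks and return
    [rms, zero_crossing_rate] for each chunk, flattened into one list.
    (Hypothetical helper; the features are illustrative assumptions.)"""
    features = []
    size = len(samples) / n_segments
    for i in range(n_segments):
        chunk = samples[int(i * size):int((i + 1) * size)]
        if not chunk:
            # Empty chunk (more segments than samples): pad with zeros.
            features.extend([0.0, 0.0])
            continue
        # Root-mean-square amplitude: a crude loudness measure.
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        # Fraction of adjacent sample pairs whose sign differs.
        crossings = sum(1 for a, b in zip(chunk, chunk[1:]) if (a < 0) != (b < 0))
        zcr = crossings / max(len(chunk) - 1, 1)
        features.extend([rms, zcr])
    return features

# Synthetic stand-in for one audio file (an assumption, not real data):
# one second of silence followed by one second of a 100 Hz tone.
rate = 2000
clip = [0.0] * rate + [math.sin(2 * math.pi * 100 * t / rate) for t in range(rate)]

coarse = segment_features(clip, 1)  # 2 numbers for the whole clip
fine = segment_features(clip, 4)    # 8 numbers, 2 per half-second segment
```

With one segment, the silence and the tone get averaged together. With four, the quiet first half and the loud second half show up as separate numbers, which is the kind of extra detail a classifier can pick up on.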

This was mostly a proof of concept for me: I pulled arbitrary data features I didn’t understand and put them through a standard classification method. I still don’t know anything about whales and didn’t learn anything new about machine learning. Hopefully I’ll take a more thoughtful approach in future competitions.