Earlier this week I finished my first Kaggle competition. Kaggle is (primarily) a platform for holding data science competitions. If somebody has a pile of data, and would like some predictive modeling done with it, they can work with Kaggle to unleash a horde of bored smart people on the problem.
The most famous data competition is probably the Netflix Prize. Netflix wanted to improve their recommendation engine, and outsourced the problem to the nameless masses, offering fame and fortune to the best among them.
The competition I just finished was cooked up by Marinexplore and Cornell University’s Bioacoustic Research Program. The task was to correctly identify specific whale calls in a collection of audio data gathered by a network of buoys. Here’s how I did:
I had just a few simple goals for this competition:
- Actually compete. That is, submit a few increasingly accurate predictions that I could be relatively proud of.
- Use R, almost exclusively. While I’ve occasionally used R for things in the past, I had never done an entire project with it. It’s hard to learn a tool when you’re only using it in very special circumstances.
- Beat all of the benchmarks. Kaggle leaderboards usually show how well simple implementations of various approaches would score. In this case there were benchmarks for an “all zeros” submission as well as the existing algorithm used by the team at Cornell.
So how did I do? I failed to crack the top 50% of teams (I finished around the 45th percentile), but I did go 3 for 3 on my own goals.
I used the tuneR library to pull out a few numbers from each audio file and then ran a random forest. My score did improve with each submission, entirely as a result of pulling out numbers from smaller and smaller segments of the audio files.
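My actual pipeline was R (tuneR plus a random forest), but the feature-extraction idea translates to a few lines in any language. Here's an illustrative Python sketch (not my competition code) of pulling simple summary statistics from smaller and smaller segments of a signal, which is all the "improvement" amounted to:

```python
from statistics import mean, stdev

def segment_features(samples, n_segments):
    """Split a 1-D audio signal into equal-length segments and return
    simple summary statistics (mean, std) for each segment."""
    seg_len = len(samples) // n_segments
    features = []
    for i in range(n_segments):
        seg = samples[i * seg_len:(i + 1) * seg_len]
        features.extend([mean(seg), stdev(seg)])
    return features

# Stand-in for one audio clip; finer segmentation means more
# features per clip for the classifier to chew on.
signal = [float(i % 10) for i in range(1000)]
coarse = segment_features(signal, 4)   # 8 features
fine = segment_features(signal, 16)    # 32 features
```

Each clip's feature vector then becomes one row in the training matrix handed to the classifier.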
This was mostly a proof of concept for me: I pulled arbitrary data features I didn’t understand and put them through a standard classification method. I still don’t know anything about whales and didn’t learn anything new about machine learning. Hopefully I’ll take a more thoughtful approach in future competitions.
Recently defeated gubernatorial candidate, and former WA Attorney General, Rob McKenna wrote an op-ed for the Seattle Times last week discussing ways that NW Republicans can start winning statewide races again.
He fully endorses the need to expand the GOP tent by reaching out to minority communities and young people. And he… well that’s about it. Although he does end the column by noting how close he came to winning (he lost by ~4%).
Here’s a summary of the policy prescriptions he outlines for winning over non-old-white voters:
Not everything has an API.
I haven’t had to do much web scraping in my life, and when I have, it’s been simple and hasn’t needed to be reproducible. But there are a few projects that have been floating around in my head that would benefit greatly from repeatedly collecting a lot of data straight from webpages. My search for a good scraping tool led me to the usual places (Stack Overflow and Quora) and I found Scrapy.
Scrapy is a Python-based screen scraping and web crawling framework that is available to fork on GitHub. I currently work on a Windows machine so, like most cool things, it was non-trivial to set up, but luckily they provide a straightforward installation guide with links to all of the dependencies you need to install. They also provide a nice tutorial to help you get a feel for the framework.
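Scrapy's real value is the crawling, scheduling, and pipeline machinery; the extraction step it wraps is conceptually simple. Here's a framework-free sketch of that core step using only Python's standard library (the HTML is made up for illustration):

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collect every href from anchor tags -- the kind of extraction
    a Scrapy spider's parse() callback does with selectors."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

page = '<html><body><a href="/whales">whales</a><a href="/buoys">buoys</a></body></html>'
scraper = LinkScraper()
scraper.feed(page)
print(scraper.links)  # -> ['/whales', '/buoys']
```

A framework earns its keep once you need to follow those links, throttle requests, and store results; doing all of that by hand is where ad-hoc scrapers fall apart.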
So that’s where I am now: everything is up and running, and I feel comfortable with the tutorial project. Now I just need to figure out how to use it for my own (currently ill-defined) projects. Hopefully I’ll be back here soon reporting on some cool results.
I recently started a wonderful course in Social Network Analysis (available on Coursera). There are many, many good things to be said about this course but I will save those until I have completed it.
For now, I just want to highlight a book that I found through this class. Network Science by Barabási is an introductory text to the field. Network science is the study of network representations of phenomena and their related models. While its foundation is graph theory (a field of mathematics), network science is interdisciplinary and draws methods and concepts from a wide variety of fields, ranging from sociology to physics. Like data science, its applications have grown dramatically in recent years thanks to cheap computing power and data collection & storage. I had actually hoped to title this post “Network Science is Data Science for people who had sex in high school”. While I’m pretty sure that would be a lie, the point is that network science is awesome and has a fascinating future in store for it.
Albert-László Barabási is one of the biggest names in the field. In 1999 he and Réka Albert published a paper on scale-free networks that has proven pivotal in launching the booming interest that network science has seen as an academic field over the last decade. Now, Barabási is working on a textbook aimed at exposing undergrads to this powerful field of study.
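The core mechanism in that 1999 paper is preferential attachment: each new node links to an existing node with probability proportional to that node's degree, so early nodes snowball into hubs. A minimal simulation (my illustration, not taken from the book), with one edge per new node:

```python
import random

def barabasi_albert(n, seed=0):
    """Grow a network one node at a time; each new node attaches one
    edge to an existing node chosen proportionally to its degree."""
    random.seed(seed)
    targets = [0, 1]        # node ids, repeated once per unit of degree
    edges = [(0, 1)]
    for new in range(2, n):
        old = random.choice(targets)  # degree-proportional pick
        edges.append((new, old))
        targets.extend([new, old])
    return edges

edges = barabasi_albert(1000)
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
# Heavy-tailed degrees: most nodes keep degree 1, a few hubs dominate.
print(max(degree.values()))
```

Repeating the `targets` list per unit of degree is the standard trick for sampling proportionally to degree without computing weights explicitly.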
It is a work in progress, but the first two chapters are currently available for free. So far, the content is at too low of a level to warrant what I expect the price to be (undergrad books are unconscionably expensive), but I am definitely looking forward to reading any additional chapters that are posted online. And I am glad that a high-quality, introductory book on this subject will be available soon.
“A single ear of wheat in a large field is as strange as a single [habitable] world in infinite space” – Metrodorus
This week I finished a course called Intro to Astrobiology by Professor Charles Cockell (of the UK Centre for Astrobiology at The University of Edinburgh) offered on Coursera.
Astrobiology is an ancient field of thought concerned with the origin, evolution and distribution of life in the universe. It pulls from many disciplines (chemistry, biology, astrophysics, etc.) to ask questions like:
- How/why/where did life begin on earth?
- What are the extreme limits of life (temperature, pressure, desiccation) on earth?
- Are these limits universal? Can life exist in ways we haven’t conceived?
- Is there life outside of earth? How can we go about finding it?
There are billions of galaxies, each with billions of stars, many having planets; and we are only beginning to have the technology capable of inspecting them.
Each week a handful of short lecture videos were released as well as a couple of multiple choice quizzes. Meant as an introductory/teaser course, the videos offered an overview of the many disparate parts of the subject. Professor Cockell clearly finds the field fascinating and did a good job of connecting the topics together.
Is life sustainable outside of the comfort of the earth?
This may not be answered for a long time. But eventually, the earth will be unable to sustain life; whether through our own actions or because of the expiration of our sun. So it is imperative for us to explore these issues.
Are we alone in the universe? The answer is profound either way.
I’m starting a long(er)-term project looking at congressional redistricting. This is the process done every decade (following the constitutionally mandated census) to redraw the boundaries of the districts represented in the House of Representatives. It can also be done, with less legitimate cause, at any time a state chooses.
There are a few reasons I find this topic interesting:
- Republicans maintained control of the House in the 2012 elections. They did this despite more Americans having voted for a Democratic representative than a Republican one. That’s just the way the cookie crumbles sometimes, but it’s been fun seeing people try to explain it and I’m interested in exploring this phenomenon more.
- I like the idea of multi-member districts. I was first exposed to this idea a few years ago by Matthew Yglesias, and it has stayed with me largely because it’s probably my only chance of ever becoming a congressman.
- It will finally give me an excuse to play with maps.
I’m not sure where this project is headed, but along the way we’ll get to play with these ideas as well as the Voting Rights Act, the “big house”, gerrymandering and so much more.
Yesterday, Super Bowl Sunday, I put out a few charts (found here. CLICK!) comparing the two contenders. It was a last minute attempt to see how the 49ers and Ravens stacked up against each other based on their regular season performance.
In December, I had started mulling over ideas for showing how dominant the Seahawks had been in the second half of the season, but they got knocked out of the playoffs before I pulled the trigger on a graphic. In that time I came across this difference chart by mbostock and it seemed like a really powerful way to compare teams over time:
I got the data I needed from Advanced NFL Stats. There were a couple of other data sets that I ran across in my research, but this had the entire season (play-by-play) in one handy csv.
I had to do a little bit of work to get the data into the shape I needed it:
- How do I handle time? The first decision was to ignore bye weeks completely. This means as you move along the x-axis, the stats don’t match up by date.
- The other hiccup was how to handle overtime, which each team encountered at least once this season. I decided to map the game clock (as given in the raw data) to proportion of the total game time for that week. So a play happening as the first half expired was marked as 0.5 most of the time, but somewhere around 0.35 – 0.4 for the games that went into overtime.
- There were some data quality issues in the points columns which required manual cleaning. It is entirely possible I missed something here.
- I calculated yards (for and against) from a combination of the yard line data and which team was on offense for the preceding and following plays. It is entirely likely I screwed this up for some edge cases.
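The overtime decision above boils down to one small transformation (this helper is my reconstruction of the rule, not the original code): divide elapsed game time by that week's total game length, so halftime lands at 0.5 in a regulation game but earlier when overtime stretched the game:

```python
def game_fraction(elapsed_minutes, overtime_minutes=0.0):
    """Map elapsed game time to a fraction of that week's total
    game length (60 regulation minutes plus any overtime)."""
    total = 60.0 + overtime_minutes
    return elapsed_minutes / total

print(game_fraction(30))      # halftime, regulation game: 0.5
print(game_fraction(30, 15))  # halftime, full-length overtime game: 0.4
```

With partial overtimes the halftime mark falls anywhere between those two values, which is why the post quotes a 0.35–0.4 range.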
I waited ’til the last minute to start coding this up and had to cut some corners. I’ll be revisiting it soon to address these issues:
- Have you ever heard of DRY coding? Yeah, I completely failed at it on this project.
- With only 30 minutes left before kickoff I gave up on messing with my html layout. I wussed out and used a table instead; it looks tacky (though not as bad as I’d thought it would) but got the job done.
- I’ve also, apparently, forgotten how to do bar charts quickly in d3.js.
Code available here.