I’ve always been impressed by some of the large-scale data mining studies out there. Specifically, I’ve been impressed by studies that mine our everyday interactions on Twitter, Facebook, or the media to find out some simple truths. For example, whether you say “Uh” or “Um” a) is partially determined by where you live, and b) carries over into your language on Twitter. That’s amazing.
Even a very simple study like “Uh vs. Um” takes a lot (I mean A LOT) of data. The researchers had a corpus of nearly 6 billion words, representing 600 million tweets collected over a year. That’s not easy to do, for a number of reasons.
- The tweets have to be geo-coded in order to map them to a specific location. Less than 2% of tweets are geo-coded.
- The Twitter API doesn’t make it easy to say “give me every geo-coded tweet”.
- Working with 600 million lines of data isn’t easy – let’s just say you’re not using Excel or SPSS.
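As a rough illustration of the first and third points, here is a minimal Python sketch that streams through tweets one line at a time (so the whole file never has to fit in memory) and keeps only the geo-coded ones. It assumes one JSON object per line, with a `coordinates` field holding a GeoJSON-style point (`[longitude, latitude]`) or `null`. The function name and the sample data are hypothetical, and this is not necessarily how the collection described in this post was done.

```python
import json
from typing import Iterable, Iterator


def geocoded_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield only the tweets that carry a point location, one at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if not isinstance(tweet, dict):
            continue
        # Assumption: geo-coded tweets have a GeoJSON-style "coordinates" object.
        coords = tweet.get("coordinates")
        if isinstance(coords, dict) and coords.get("type") == "Point":
            lon, lat = coords["coordinates"]
            yield {"text": tweet.get("text", ""), "lon": lon, "lat": lat}


# Made-up sample lines; a real run would pass an open file handle instead.
sample = [
    '{"text": "first day of school!", "coordinates": {"type": "Point", "coordinates": [-83.74, 42.28]}}',
    '{"text": "um, no location here", "coordinates": null}',
    "not json at all",
]
kept = list(geocoded_tweets(sample))
print(len(kept), "of", len(sample), "lines were geo-coded tweets")
```

Because the function is a generator, the same code works on a file with millions of lines: `for tweet in geocoded_tweets(open(path)): ...` processes one line at a time.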
Nonetheless, I’m interested in what can be learned by applying similar tools to education research. Along those lines, I’ve been dipping my toes into these waters to work through some of the technical problems outlined above. At the same time, I’m trying to think through what educational “research” (or is it just “poking around”?) might look like using these approaches. A lot of it is just figuring out what’s available in the data.
So far, in my early explorations, I’ve managed to collect about 600,000 geo-coded tweets in a week, do some basic analysis, and figure out how to generate some semi-pretty graphs. I’ve learned that 600k tweets isn’t enough to do anything serious yet. Big data needs to be VERY BIG to be really helpful.
But here’s a first look at an early attempt. I’ve tracked the “density” of education-related tweets across the U.S. in the last week. That is, the share of tweets in each county in which phrases like “edu”, “school”, “teacher”, or “student” appear. Look what the data show:
No immediate trend jumps out. Except perhaps that there is no big trend. That is, there are no big regions of the country where education debates are raging. But there are county-wide hotspots. And these seem to be quite “hot”. In the counties with the darkest color on the map, nearly 4% of the tweets are education-related! That’s a lot, considering the nationwide norm is around 1.0% – 1.5% (based upon the data here).
Maybe, like politics, all education debate is local.
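For anyone curious how a “density” like this can be computed, here is a minimal Python sketch under some stated assumptions: each tweet has already been assigned to a county (mapping coordinates to counties is a separate step not shown here), and a tweet counts as education-related if its text contains any of the keywords as a substring. The county labels and tweets below are made up for illustration.

```python
from collections import Counter

# Keywords taken from the description above; matching is case-insensitive.
KEYWORDS = ("edu", "school", "teacher", "student")


def is_education_related(text):
    """True if any keyword appears anywhere in the tweet text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def education_density(tweets):
    """Given (county, text) pairs, return {county: share of matching tweets}."""
    totals = Counter()
    hits = Counter()
    for county, text in tweets:
        totals[county] += 1
        if is_education_related(text):
            hits[county] += 1
    return {county: hits[county] / totals[county] for county in totals}


# Hypothetical sample data.
tweets = [
    ("County A", "My teacher is the best"),
    ("County A", "lunch time"),
    ("County A", "stuck in traffic again"),
    ("County A", "what a game"),
    ("County B", "rain all day"),
    ("County B", "need coffee"),
]
density = education_density(tweets)
print(density)
```

One caveat with plain substring matching: “edu” also appears inside unrelated words such as “schedule” or “reduce”, so a share computed this way will include some false positives.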
With more data, I’m reasonably certain the map would smooth itself out, with less sharp boundaries between colors. And some of the “missing data” (0% regions) would fill in. That’s been the trend so far as I’ve watched the data set grow from 100k to 600k tweets.
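A back-of-the-envelope way to see why more data should help: when a county has only a handful of tweets, its observed share is very noisy. The sketch below assumes a simple binomial model in which every tweet is independently education-related with probability 1.25% (the midpoint of the nationwide range above). That model and the tweet counts are assumptions for illustration, not something estimated from the data set.

```python
import math


def share_standard_error(p, n):
    """Standard error of an observed share when n tweets each match with probability p."""
    return math.sqrt(p * (1 - p) / n)


def prob_zero_hits(p, n):
    """Chance that none of n tweets match, i.e. the county shows up as 0%."""
    return (1 - p) ** n


P_ASSUMED = 0.0125  # assumed nationwide share, midpoint of 1.0% to 1.5%

for n in (50, 500, 5000):
    se = share_standard_error(P_ASSUMED, n)
    zero = prob_zero_hits(P_ASSUMED, n)
    print(f"{n:>5} tweets: share noise +/- {100 * se:.2f} points, "
          f"chance of a 0% county {100 * zero:.1f}%")
```

Under this model, a county with only 50 tweets has a bit more than a 50% chance of showing no education-related tweets at all, which may be why some 0% regions fill in as the data set grows.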
In summary, pretty COOL! No idea what it really means yet; I’m still poking around …