Racist Language on Twitter

For the last two months, I’ve been collecting geo-tagged tweets from across the lower 48 states. To date, the total is over 8 million unique tweets. As I’ve explored what is and isn’t in those tweets, I’ve seen some surprising things.

One of those surprises is how much of the talk on Twitter contains what might be considered racist language. For example, over 1% of those 8 million tweets contain some form of the n-word, which seems pretty high. Not all uses of the word are racist, but taken in aggregate (over 90,000 posts), something is not right.

What’s also surprising is how unequally the word is used across the country. The map on the right shows places where the word is used statistically significantly more often (red), significantly less often (blue), and at rates between those extremes. What you see is a band of much higher usage running from the mid-South through the Southeast and up to the Northeast. Other pockets appear in parts of the Midwest and on the West Coast. In this band of red, usage is much higher (2%-4% of tweets) than the overall average (1.13% of tweets).
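The post doesn’t say which statistical test produced the red and blue areas, so the following is only a minimal sketch of one common way to do it: a one-sample z-test comparing a region’s rate against the overall 1.13% average. The function name `classify_region`, the 1.96 cutoff (roughly a 5% two-sided significance level), and the tweet counts in the usage note are all assumptions made for illustration, not values from the data set.

```python
import math

def classify_region(term_tweets, total_tweets, overall_rate=0.0113, z_crit=1.96):
    """Compare one region's rate to the overall rate with a one-sample z-test.

    Hypothetical helper: overall_rate is the 1.13% average quoted in the post;
    z_crit is an assumed cutoff, not the one used for the actual map.
    """
    rate = term_tweets / total_tweets
    # Standard error of a proportion under the null hypothesis
    # that the region's true rate equals the overall rate.
    std_err = math.sqrt(overall_rate * (1 - overall_rate) / total_tweets)
    z = (rate - overall_rate) / std_err
    if z > z_crit:
        return "higher", z   # would be drawn red
    if z < -z_crit:
        return "lower", z    # would be drawn blue
    return "typical", z      # between the extremes
```

For example, a made-up region with 300 matching tweets out of 10,000 (a 3% rate) comes out as `"higher"`, while one with 113 out of 10,000 sits exactly on the average and comes out as `"typical"`. Regions with few tweets have a large standard error, so they need a much bigger gap from the average before they would be colored either way.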

I was unsure whether these data were “real” enough to trust until I saw a recent article in the Washington Post, “The most racist places in America, according to Google”. The article summarizes a research paper, “Association between an Internet-Based Measure of Area Racism and Black Mortality” (Chae et al. 2015).

Using an analysis of Google search terms, the researchers measure how prevalent searches containing the n-word are in different areas. Their result is very similar to the Twitter analysis presented above. Moreover, they report that areas “characterized by a one standard deviation greater level of area racism were associated with an 8.2% increase in the all-cause Black mortality rate, equivalent to over 30,000 deaths annually”. That is, these data are correlated with real-world impacts.

But of course, correlation isn’t causation. One possible explanation stems from the widespread use of the n-word in African American discourse, including in several forms of music. These uses are hardly considered racist. In fact, if you look at the population density of African Americans from the 2000 census on the right, it aligns almost perfectly with the Twitter map. In short, where African Americans live in higher concentration, use of the n-word is more concentrated too.

Source: http://commons.wikimedia.org/wiki/File:USA_2000_black_density.png

These data are troubling, very troubling indeed. However, for my own investigations into what’s of value in this Twitter data set, three things excite me. First, there is a wealth of interesting data worth studying in what people post publicly to the world. Second, these data can be corroborated by other sources. Third, these data are related to significant effects in the real world.
