Below is a memo for a data story from Mark Hansen's Data II class at Columbia Journalism School, in which I received honors. The process and story are described in depth below. The assignment was to conduct story-worthy R analysis of a Twitter dataset provided to us by Gilad Lotan, Chief Data Scientist at Betaworks. The analysis here is fresh and the product of my own independent work. It has not yet been pitched for publication.
The input datasets (besides Gilad's Twitter dataset) are accessible here.
My full R code is accessible here.
In examining the Twitter trending topics data, I was curious how the topics broke down as a comparison between cities. How “common” the conversation is between cities might get at some geographic distinctions of American politics and culture.
Many interesting metrics have been used to group cities together in a sense of commonality: population density, political affiliation, density of graduate degrees, firearm ownership, sports fandom, food cost, public expenditures, and on and on.
However, the ‘most trending’ topics, if they represent the major topics of conversation for a given city, offer a compelling metric for directly comparing the culture of one city to another.
Behind the Dataset
The attached code is heavily commented and the structure should be largely self-explanatory. Here, I’ll offer clarification and discussion on the code.
I limited the data to include only trending topics that appeared enough times to warrant being considered relevant. Many topics appear briefly each day and never appear again. By a very modest limitation, I hoped to remove irrelevant trends in a way that brought the data closer to the nebulous concept of “the conversation” actually occurring in a given city. Gilad Lotan suggested a ‘floor’ to the topic time windows.
I examined a couple of example cities, tried different forms of a floor, and settled on one that was fairly lenient. Trends are only included in the final list if, on a single day, they trend for at least 30 minutes overall. These trends could appear for one continuous half-hour or six separate five-minute windows throughout the day. Given the intra-day limitation, I think allowing for non-contiguous counting is reasonable. Here’s a brief table of the effects of different cutoffs:
The choice of 30 minutes is a judgment call. I want a floor high enough that only relevant topics enter, but not so high that too many topics are filtered out. Considering “round” cutoffs like those in the table above, eliminating half of the unique observations in the dataset seemed excessive, but putting the floor at only fifteen minutes seemed too short. A half-hour floor that eliminated a little over a third of the unique trends struck me as a decent compromise.
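The floor described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R of the attached code, and it assumes a hypothetical record layout of (city, trend, day, minutes in window). The real dataset's columns and window lengths are not shown in this memo, so every name and value below is a placeholder.

```python
from collections import defaultdict

# Hypothetical records: (city, trend, day, minutes_in_window).
# These rows are invented for illustration and are not from the real dataset.
windows = (
    [("Houston", "#Oscars", "2014-03-02", 5)] * 6      # six 5-minute windows, one day
    + [("Houston", "#Brief", "2014-03-02", 10),        # 10 minutes on one day...
       ("Houston", "#Brief", "2014-03-03", 25)]        # ...and 25 on another
    + [("Dallas", "#LongRun", "2014-03-02", 30)]       # one continuous half-hour
)


def apply_floor(windows, floor_minutes=30):
    """Return the (city, trend) pairs that trend for at least
    floor_minutes in total on at least one single day."""
    minutes_per_day = defaultdict(int)
    for city, trend, day, minutes in windows:
        # Non-contiguous windows on the same day are added together.
        minutes_per_day[(city, trend, day)] += minutes
    return {
        (city, trend)
        for (city, trend, _day), total in minutes_per_day.items()
        if total >= floor_minutes
    }


kept = apply_floor(windows)
```

In this toy input, "#Brief" totals 35 minutes across two days but never reaches 30 on a single day, so it is dropped. That is the intra-day behavior the floor is meant to have.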
With that basic dataset (“long_us” in the code), I set out to create the “ratios” dataset that gives the portion of one city’s unique trends that appear in another city’s list of unique trends. Considering each city as a “denominator” and “numerator” city, the ratio represents the number of trends that appear in both cities over the total number of unique trends in the denominator city.
Twitter collects trending topics for 48 American cities, meaning 2,256 (48*47) ordered pairs, each with its own ratio. Swapping the numerator and denominator cities doesn’t simply invert the ratio: the count of trends common to both cities stays the same, but it is divided by a different city’s total. To obtain this dataset, I have a nested pair of for loops that calculates each ratio, produces named subsets of “long_us” for each city (for reference – they’re not used in later code), and creates a summary stats dataset of all 48 cities (“sum”). These nested for loops take roughly five hours to run.
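To make the ratio definition concrete, here is a small sketch in Python (not the R used in the attached code). It assumes each city's unique trends are already collected into a set. The city labels and trends are hypothetical, and set intersection is one possible way to compute the overlap, not necessarily how the original loops do it.

```python
from itertools import permutations

# Hypothetical per-city sets of unique trends (after the 30-minute floor).
trends_by_city = {
    "A": {"t1", "t2", "t3", "t4"},
    "B": {"t1", "t2", "t5"},
    "C": {"t2", "t6"},
}


def pair_ratios(trends_by_city):
    """For every ordered (denominator, numerator) pair of distinct cities,
    return shared trends divided by the denominator city's unique trends."""
    ratios = {}
    for denom, numer in permutations(trends_by_city, 2):
        common = trends_by_city[denom] & trends_by_city[numer]
        ratios[(denom, numer)] = len(common) / len(trends_by_city[denom])
    return ratios


ratios = pair_ratios(trends_by_city)

# Cities A and B share two trends, but A has four unique trends and B has three,
# so the two directions of the same pair give different ratios.
ratio_a_b = ratios[("A", "B")]   # 2 / 4
ratio_b_a = ratios[("B", "A")]   # 2 / 3

# The same pairing logic applied to 48 cities yields 48 * 47 ordered pairs.
n_pairs_48 = len(list(permutations(range(48), 2)))
```

Because set intersection uses hashing, an approach along these lines would likely run far faster than hours on data of this size, though that is a guess about where the original runtime goes.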
[As a brief aside, I’d love any feedback on efficiency and if the long runtime here is a product of inefficient coding (I know the summary dataset could be calculated outside the for loop for instance). I’m completely new to R, so I imagine it can be improved. The runtime showed one minor downside of R as compared to SAS, from my personal experience.]
The summary dataset “sum” reduces the 2,256 pairs back to 48 denominator cities, giving the mean, median, maximum, and minimum ratio for each denominator city. I calculate this dataset within the outer for loop. I realize I could easily calculate it from the “ratios” dataset, and that may help the runtime. When I wrote the code, I was curious to play with the for loops and experiment to better understand it. I may have had a little too much fun with this coding. Seriously, I really enjoyed turning the corner on R and closing out this analysis.
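As a sketch of the alternative mentioned above, the summary can be derived from the finished ratios in a single pass. This is Python rather than R, the ratio values are made up, and the result is named `summary` here (instead of “sum”) only to avoid shadowing a Python built-in.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical ratios keyed by (denominator_city, numerator_city).
ratios = {
    ("A", "B"): 0.50, ("A", "C"): 0.25,
    ("B", "A"): 0.60, ("B", "C"): 0.40,
    ("C", "A"): 0.50, ("C", "B"): 0.70,
}


def summarize(ratios):
    """Collapse ordered pairs to one row of summary stats per denominator city."""
    by_denom = defaultdict(list)
    for (denom, _numer), ratio in ratios.items():
        by_denom[denom].append(ratio)
    return {
        city: {
            "mean": mean(values),
            "median": median(values),
            "max": max(values),
            "min": min(values),
        }
        for city, values in by_denom.items()
    }


summary = summarize(ratios)
```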
Interesting Initial Results
Besides the final results for each city, some broader trends emerged from examining this output dataset.
- Population appears to be an important factor. New York and Los Angeles, the two largest American cities, are clear outliers with the two lowest average ratios by far. With New York at 48% and Los Angeles at 51%, they stand out from the other 46 cities that span 55% to 70%. The overall average of averages is 63%.
- Speaking broadly, larger cities tended to have a more “unique” conversation and smaller cities tended to have a more “common” conversation (i.e. the larger the city, the lower the ratio). This relationship is borne out in greater detail in the regression discussion below.
- The ratios span from 40% (Dallas / Jackson) to 82% (Jackson / Norfolk).
- As Jackson shows, the total counts of unique trends in a given city vary widely and have a large impact on the ratios. Jackson has the second-highest total count of unique trends, with 15,164. Harrisburg has the highest with 15,359. Houston has the lowest count of 9,082. Crucially, these numbers serve as the “denominator counts” for all of their respective cities’ ratios. This point is also discussed in greater depth in the regression section below.
- The “common” or “numerator counts” vary as well though, perhaps lessening the impact the wide span of “denominator counts” has on the span of ratios. The raw numbers of trends in common between two cities stretch from 5,373 (Dallas and Salt Lake City, with ratios of 57% and 48%) to 11,290 (Harrisburg and Jackson, with ratios of 75% and 73%).
- Harrisburg and Jackson are strange outliers. They’re small cities, very small for this Twitter dataset in fact (49,000 and 172,000 residents, respectively). Despite that, Twitter included them in the list of 251 locations. That fact simply struck me as strange. They also behave unlike other small cities in the dataset, having the third and fourth lowest average ratios after New York and LA. One potential explanation is that they’re both state capitals: Harrisburg for Pennsylvania and Jackson for Mississippi – perhaps Twitter’s visibility or use in politics led to these cities’ inclusion in the list of tracked locations.
- Depending on how you want to spin it, New York is either “out of touch” or “ahead of the curve.” Not only does the Big Apple have the lowest mean and median ratio (mean and median differ very little across all cities), it’s the denominator city in twenty of the lowest forty-five pairings. Los Angeles, as mentioned before, is also a strong outlier after New York.
- Examining what cities New York has the least to most in common with gives the following ratios:
In examining the data, my first impression was that population size influenced a city’s average ratio. Larger cities had a more “unique” overall conversation on Twitter. However, large southern cities like San Antonio, Nashville, and St. Louis clearly bucked this trend. Given the geographic (and political) distinctions these cities have compared to coastal and northern cities, and the large share of Twitter’s conversation taken up by politics, I was interested in finding out if what I initially saw as a population effect was actually a political one.
So, I obtained city-level political orientation data and merged it with my Twitter dataset. The city politics measure comes from a March 2014 paper forthcoming in the American Political Science Review by political scientists Chris Tausanovitch (UCLA) and Christopher Warshaw (MIT) entitled “Representation in Municipal Government” (accessible on Dr. Tausanovitch’s website here).
The data is normalized across all cities so that a purely “liberal” city has a measure of -1 and a purely “conservative” city has a measure of 1. A ‘neutral’ city thus has a measure of zero. This variable is referred to as “city conservatism” in the APSR paper. I’ll adopt that name from here on.
Given the apparent impact of population, I also added city-level population data from the Census (accessible here) and merged that to the dataset.
This process is commented clearly in the code, leading to the dataset “reg.” After initial review and feedback from Mark Hansen, I used the log of population as a regressor. Plotting the three variables against each other yields this chart:
This chart appears to show a potential positive linear relationship between conservatism and mean (average Twitter commonality) as well as a negative linear relationship between log population and mean.
It’s important to note here that the political science dataset doesn’t fully overlap with the Twitter dataset. The paper only examines cities with populations over 250,000 and, as we saw with Harrisburg and Jackson, that isn’t the case with all 48 cities Twitter collects Trending Topics for. The final dataset with all three variables contains 36 cities (lost data is visible in the code as the “lost” dataset).
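The merge and its lost rows can be sketched as follows. This is a Python illustration, not the R in the attached code. The column names “mean,” “cons,” and “ln_pop” and the dataset names “reg” and “lost” follow the memo, but the city labels and every number are placeholders.

```python
import math

# Hypothetical inputs keyed by city name; all values are invented.
twitter_mean = {"City A": 0.48, "City B": 0.66, "City C": 0.58}
conservatism = {"City A": -0.4, "City B": 0.2}           # City C is not covered
population = {"City A": 8_000_000, "City B": 600_000, "City C": 50_000}


def build_reg(twitter_mean, conservatism, population):
    """Keep cities present in all three sources; report the rest as lost."""
    reg, lost = [], []
    for city in sorted(twitter_mean):
        if city in conservatism and city in population:
            reg.append({
                "city": city,
                "mean": twitter_mean[city],
                "cons": conservatism[city],
                "ln_pop": math.log(population[city]),  # natural log of population
            })
        else:
            lost.append(city)
    return reg, lost


reg, lost = build_reg(twitter_mean, conservatism, population)
```

Tracking the dropped cities explicitly makes it easy to check that the regression sample shrank for the expected reason (no conservatism score) and not because of mismatched city names.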
And so, the regression equation entails the following formula:
Mean = β0 + β1 * Conservatism + β2 * Log of Population + εi
--- which, in the data, appears as ---
mean = β0 + β1 * cons + β2 * ln_pop + εi
The final regression results bear out the apparent relationships at a statistically significant level. A city’s average commonality of trending topics with other cities is positively linked to conservatism (β = 0.077, standard error = 0.024, t-stat = 3.26) and negatively linked to log population (β = -0.029, standard error = 0.007, t-stat = -4.15).
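For readers who want to see how the coefficients, standard errors, and t-statistics fit together, here is a minimal ordinary least squares sketch in plain Python (the attached analysis uses R). The eight data points are invented, so the output will not match the figures reported above. The helper names `invert` and `ols` are illustrative, not taken from any library.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(y, columns):
    """OLS with an intercept. Returns (coefficients, standard errors, t-stats)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    xtx_inv = invert(xtx)
    beta = [sum(xtx_inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)       # residual variance
    se = [math.sqrt(sigma2 * xtx_inv[i][i]) for i in range(k)]
    t_stats = [b / s for b, s in zip(beta, se)]         # t = coefficient / std. error
    return beta, se, t_stats


# Made-up illustration data, not the 36 cities in "reg".
cons = [-0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.6]
ln_pop = [15.9, 15.2, 14.1, 13.5, 14.8, 13.0, 13.9, 12.6]
mean_ratio = [0.463, 0.501, 0.555, 0.583, 0.559, 0.624, 0.616, 0.669]

beta, se, t_stats = ols(mean_ratio, [cons, ln_pop])    # order: intercept, cons, ln_pop
```

Each t-statistic is simply the coefficient divided by its standard error, which is how the reported values above relate to one another.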
There were some early concerns about heteroskedasticity in the data, but the current dataset is not heteroskedastic (as determined by a studentized Breusch-Pagan test). In any case, the significance holds up with robust standard errors and the coefficients are (essentially) the same.
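The studentized (Koenker) form of the Breusch-Pagan test can be sketched as follows: regress the squared residuals on the same regressors and take n times the R-squared of that auxiliary regression as the test statistic. The Python below is an illustration on invented data, not the R call used in the analysis. The p-value shortcut holds only for exactly two regressors, where the chi-square survival function has the closed form exp(-x/2).

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(bi)] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fitted_values(y, X):
    """Least-squares fitted values of y on the design matrix X."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return [sum(b * v for b, v in zip(beta, row)) for row in X]


def studentized_bp(y, regressors):
    """Return (LM statistic, p-value) for the studentized Breusch-Pagan test.
    The p-value is computed only for two regressors (chi-square, 2 df)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in regressors] for i in range(n)]
    resid_sq = [(yi - fi) ** 2 for yi, fi in zip(y, fitted_values(y, X))]
    aux_fit = fitted_values(resid_sq, X)               # auxiliary regression
    mean_sq = sum(resid_sq) / n
    ss_tot = sum((v - mean_sq) ** 2 for v in resid_sq)
    ss_res = sum((v - f) ** 2 for v, f in zip(resid_sq, aux_fit))
    lm = n * (1 - ss_res / ss_tot)                     # n * R-squared
    p_value = math.exp(-lm / 2) if len(regressors) == 2 else None
    return lm, p_value


# Made-up illustration data, not the real "reg" dataset.
cons = [-0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.6]
ln_pop = [15.9, 15.2, 14.1, 13.5, 14.8, 13.0, 13.9, 12.6]
mean_ratio = [0.463, 0.501, 0.555, 0.583, 0.559, 0.624, 0.616, 0.669]

lm_stat, p_value = studentized_bp(mean_ratio, [cons, ln_pop])
```

A large p-value means the test fails to reject constant error variance, which is the sense in which the data above is described as not heteroskedastic.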
Mapping this relationship out gives the following plot of residuals vs. fitted line:
Here are normal and Q-Q plots of conservatism against the city mean variable:
Population, also a statistically significant predictor of city mean commonality of conversation, has an interesting trend to its data:
The linearity of the data in the conservatism Q-Q plot implies a normal distribution. The log population Q-Q plot appears linear, but has an interesting kink to the data around the center of the middle quartile.
Interrogating the Data
While the “mean” ratio of Twitter conversations may offer a proxy for how “common” the overall online conversation of one city is to that of other cities, the construction of the variable should be questioned to ensure that it actually reflects as much.
Of the highest concern to me, as mentioned earlier, is the denominator count. Each city’s ratios involve the same count of total unique trending topics in the denominator. These denominator counts vary from 9,082 (Houston) to 15,359 (Harrisburg). I want to ensure that, when the model predicts variation in the mean ratio, it’s not actually predicting variation in that denominator count.
Re-running the regression with denominator count as the dependent variable does not yield statistically significant results, however, implying that the findings are not driven by denominator counts.
Due to clear endogeneity concerns, I don’t include the denominator count in my regression. The variables “denom_count” and “mean” have a correlation coefficient of -0.79 (as expected; the larger the denominator, the lower the ratio).
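The correlation check above is a plain Pearson coefficient. A minimal Python sketch (the analysis itself is in R) with made-up numbers shows the expected negative sign when larger denominators go with lower mean ratios. The values are placeholders, not the real “denom_count” and “mean” columns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented values for illustration only.
denom_count = [9000, 11000, 13000, 15000]
mean_ratio = [0.70, 0.64, 0.60, 0.52]

r = pearson(denom_count, mean_ratio)
```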
The regression results offer interesting insights, but are also open to many different forms of interpretation.
The more conservative a city, the more it has in common with other cities. The more progressive a city, the more unique its conversation. Perhaps more liberal cities are more heterogeneous, more local, or produce more unique ideas. Or perhaps more conservative cities are more homogeneous, more attuned to a national conversation, or tend to participate in a common conversation. Ultimately these different potential interpretations redound to how we view Twitter’s Trending Topics and their meaning as representative of the nebulous concept of a city or country’s “conversation.”
I’d be interested in a number of different directions for further analysis.
- Bring in new data. There is a massive amount of research into attributes that predict political orientation: from large obvious factors like religious devotion and professional background to small, less obvious factors like retail buying behavior and entertainment preferences.
- Expand the analysis outside the United States. The code was built to encompass a single country, but could easily work with a reasonably sized international dataset. By changing a few lines in my SQL function, we could find out if New York’s Twitter conversation has more in common with Hong Kong or Houston, and other potentially revealing questions.
- Dig into the existing shared topics to see what appears most. By keeping the “freq” variable defined earlier, the code could find out not only what cities have the most in common, but what topics most strongly define that commonality.
- Weight the variables. Right now, this analysis determines the average ratio of two overall numbers. Those overall numbers are simply counts and as a result weight equally a topic that trends (for a given city) for half an hour and a topic that trends for a month. By weighting the ratio using this frequency variable, the analysis may yield sharper insight.
- Finally, it’s often said that we live in our own “echo chambers” of information and media consumption these days, particularly when it comes to politics. Geographically, America has sorted itself into physical communities that reflect political ideology. Americans are now less likely than ever to have a next-door neighbor from the opposing political party. Yet digital platforms like Twitter offer the opportunity to bridge that geographic divide. The overall relationship between political orientation and Twitter conversation commonality is clear in the regression results. I’d be interested in finding a way to go deeper for individual cities and measure how much each city “preaches to the choir.”
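The weighting idea in the list above could take many forms. One hypothetical version, sketched here in Python rather than R, weights each trend by how long it trended in the denominator city, so the ratio becomes the share of the denominator city's trending time spent on shared topics. The “freq” values below are invented, and this scheme is an assumption, not something the current code does.

```python
# Hypothetical "freq" values: minutes each trend spent trending in each city.
freq_by_city = {
    "A": {"t1": 600, "t2": 30, "t3": 30},
    "B": {"t1": 45, "t4": 300},
}


def unweighted_ratio(freq_by_city, denom, numer):
    """Shared trends over the denominator city's trends, each counted once."""
    d, n = freq_by_city[denom], freq_by_city[numer]
    return len(d.keys() & n.keys()) / len(d)


def weighted_ratio(freq_by_city, denom, numer):
    """Same ratio, but each trend counts in proportion to its time trending
    in the denominator city."""
    d, n = freq_by_city[denom], freq_by_city[numer]
    shared = d.keys() & n.keys()
    return sum(d[t] for t in shared) / sum(d.values())


plain = unweighted_ratio(freq_by_city, "A", "B")     # 1 of 3 trends shared
weighted = weighted_ratio(freq_by_city, "A", "B")    # 600 of 660 minutes shared
```

In this toy case the single shared topic dominates city A's trending time, so the weighted ratio is far higher than the simple count suggests. That gap is the kind of distinction weighting might surface.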