Analysing Twitter language use in R, an example with Catalan MPs

The goal of this exercise is to pull hundreds of thousands of tweets into R and analyze the language used on this social network. To do that, I used the accounts of MPs (Members of Parliament) of the Catalan regional Parliament, which is composed of 135 members. I focused on the use of Catalan and Spanish, as language is a key factor in Catalan politics. I’ll compare each party’s public discourse about languages with the MPs’ actual day-to-day use on Twitter.

1 Methods

First, I got the list of MPs’ Twitter accounts from journalist Albert Cuesta (gràcies!). It contains 121 accounts (out of a total of 135 MPs), of which 1 (Enric Millo, from PP) belongs to someone who is not currently an MP. So 15 Catalan MPs have no Twitter account. Using the twitteR package and Twitter’s API, I obtained the account data. One of the 120 MPs (David Pérez, from PSC-PSOE) has a private account, so I downloaded tweets from 119 MPs on 5 June 2017. Due to API limitations, I could download at most the last 3,200 tweets for each account, which I then merged with the user’s data. This sample should be large enough to give a reliable picture of each MP’s language use. Then I used the cld2 package, an R wrapper for Google’s Compact Language Detector 2, to analyze the text of every tweet and detect its language. The final data frame has 322,975 observations (tweets) and 34 variables (columns). Finally, I filtered out the tweets whose language couldn’t be detected, leaving a total of 283,784 tweets (12% lost).
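As a minimal sketch of the detection and filtering step, assuming the cld2 package is installed: the sample texts and the `tweets` data frame below are hypothetical stand-ins for the downloaded data, and the exact call used for the original analysis is not shown in this post. Only the column name `value` is taken from the code further down.

```r
library(cld2)

# Hypothetical sample texts standing in for the `text` column of the tweets
tweets <- data.frame(
  text = c(
    "Avui el Parlament ha aprovat la llei amb el suport de tots els grups.",
    "Hoy el Parlamento ha aprobado la ley con el apoyo de todos los grupos.",
    "ok"
  ),
  stringsAsFactors = FALSE
)

# lang_code = FALSE asks for full language names (e.g. "CATALAN")
# instead of ISO codes; undetected texts come back as NA
tweets$value <- detect_language(tweets$text, plain_text = TRUE, lang_code = FALSE)

# Keep only the tweets where a language was detected
detected <- tweets[!is.na(tweets$value), ]

# Share of tweets lost by the filter, using the totals reported above
lost_pct <- round((322975 - 283784) / 322975 * 100, 1)
lost_pct  # 12.1
```

Very short texts may not be classifiable at all, which is one plausible source of the undetected tweets.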

I’ll focus on Catalan and Spanish, as they are the main (practically the only) languages used by Catalan MPs on Twitter:

df %>% group_by(value) %>% 
  summarise(
    count = n(),
    percentage = round(count/nrow(df)*100, 2)
  ) %>% 
  arrange(desc(percentage))
## # A tibble: 50 × 3
##         value  count percentage
##        <fctr>  <int>      <dbl>
## 1     CATALAN 193489      68.18
## 2     SPANISH  75580      26.63
## 3     ENGLISH  11018       3.88
## 4    GALICIAN   2337       0.82
## 5      FRENCH    525       0.18
## 6  PORTUGUESE    184       0.06
## 7     ITALIAN    144       0.05
## 8      BASQUE    103       0.04
## 9   NORWEGIAN     97       0.03
## 10     GERMAN     61       0.02
## # ... with 40 more rows

Without an exhaustive exploration of the data, I can say that false positives in either Catalan or Spanish are minimal. English has a lot of false positives (mainly tweets in Spanish), which, together with its low count, led me to discard it. Tweets detected as Galician are, in fact, basically tweets in Spanish, so they are discarded too. French has few false positives, as does German, but there are so few tweets in these languages that analyzing them makes no sense, so I discarded them as well. The remaining languages can be considered false positives (errors).
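A spot check of this kind can be sketched as follows. This is only an illustration, assuming dplyr 1.0 or later for `slice_sample()`; the helper `spot_check()` and the toy `df` are hypothetical, and only the column names `screenName`, `text` and `value` follow the data frame used in this post.

```r
library(dplyr)

# Hypothetical toy data with the columns used in this post
df <- tibble(
  screenName = c("mp_one", "mp_two", "mp_three", "mp_four"),
  text       = c("Bon dia a tothom", "Buenos días a todos",
                 "Gracias por venir", "Moitas grazas"),
  value      = c("CATALAN", "SPANISH", "GALICIAN", "GALICIAN")
)

# Draw up to `n` random tweets labelled with one language, for manual reading
spot_check <- function(data, language, n = 5) {
  data %>%
    filter(value == language) %>%
    slice_sample(n = n) %>%
    select(screenName, text)
}

galician_sample <- spot_check(df, "GALICIAN")
galician_sample
```

Reading a handful of sampled tweets per language is enough to see whether a label such as GALICIAN is mostly picking up Spanish text.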

2 Language in Twitter’s account configuration

Before analyzing the language of the tweets, one important item is the Twitter interface language configured by the users (MPs). This data is provided by Twitter’s API and read directly from the account configuration in real time, so it should be free of detection errors (I tested this with my own account, changing the configuration 3 times):

p1.1 <- users %>% group_by(lang) %>% 
  summarise(
    count = n(),
    PerCent = count/nrow(users)*100
  ) %>% 
ggplot(aes(lang, PerCent)) +
  geom_bar(stat = "identity", width = 0.5) +
  labs(
    title = "MP's Twitter account language configuration, totals",
    x = "Language",
    y = "Percentage",
    caption = "Marc Belzunces (@marcbeldata). Data from Twitter API."
  )
#By_party, proportion

p1.2 <- users %>% group_by(party, lang) %>% 
  summarise(
    count = n()
  ) %>% 
ggplot(aes(party, count, fill = lang)) +
  geom_bar(stat = "identity", position = "fill", alpha = 0.8) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
  labs(
    title = "MP's Twitter account language configuration, by party",
    x = "Party",
    y = "Proportion",
    caption = "Marc Belzunces (@marcbeldata). Data from Twitter API."
  )

grid.arrange(p1.1, p1.2, ncol = 2)

[Figure: MPs’ Twitter account language configuration, totals (left) and by party (right)]

Overall (left figure), Catalan and Spanish are evenly matched as configuration languages. But when the data is split by political party, things change. No Ciutadans (Cs) MP has their account configured in Catalan (right figure). Cs is a unionist party that defends Catalan/Spanish bilingualism but is focused on the Spanish-speaking group in Catalonia, as is Partit Popular (PP). PSC and CSQP come next. They are also unionist parties, but they defend federalism in Spain and have a more linguistically mixed electorate (although mainly Spanish-speaking). In contrast, the Catalan pro-independence parties, CUP and JxSí, which are focused on the Catalan-speaking group, use more Catalan than Spanish. This data probably reflects the MPs’ native language.
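The proportions drawn by `position = "fill"` can also be computed as a table. The following is a minimal sketch on hypothetical toy data: the party labels and the language codes `"ca"` and `"es"` are assumptions about what the `lang` column contains.

```r
library(dplyr)

# Hypothetical toy version of the `users` data frame
users <- tibble(
  party = c("P1", "P1", "P1", "P2", "P2"),
  lang  = c("ca", "ca", "es", "es", "es")
)

# Count accounts per party and interface language,
# then turn the counts into within-party proportions
config_share <- users %>%
  count(party, lang, name = "count") %>%
  group_by(party) %>%
  mutate(proportion = count / sum(count)) %>%
  ungroup()

config_share
```

Having the numbers in a table makes it easy to quote exact shares instead of reading them off the stacked bars.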

3 Language use in Twitter’s Timeline

Now I analyze the data from the 283,784 tweets with a detected language. First, I plot the language of the tweets as a whole, with and without retweets:

p2.1 <- df %>% group_by(value) %>% 
  summarise(
    count = n(),
    percentage = count/nrow(df)*100
  ) %>% 
  filter(
    value == "CATALAN" | value == "SPANISH"
  ) %>% 
  ggplot(aes(value, percentage)) +
  geom_bar(stat = "identity", width = 0.5) +
  scale_y_continuous(breaks = seq(0, 70, by = 10)) +
  labs(
    title = "MP's Twitter language use (Catalan & Spanish), totals",
    subtitle = "Including Retweets",
    x = "Language",
    y = "Percentage",
    caption = "Marc Belzunces (@marcbeldata). Data from Twitter API."
  )

#Total, without retweet
df2 <- df %>% filter(
  isRetweet == FALSE
) 

p2.2 <- df2 %>%   group_by(value) %>% 
  summarise(
    count = n(),
    percentage = count/nrow(df2)*100
  ) %>% 
  filter(
    value == "CATALAN" | value == "SPANISH"
  ) %>% 
  ggplot(aes(value, percentage)) +
  geom_bar(stat = "identity", width = 0.5) +
  scale_y_continuous(breaks = seq(0, 70, by = 10)) +
  labs(
    title = "MP's Twitter language use (Catalan & Spanish), totals",
    subtitle = "Without Retweets",
    x = "Language",
    y = "Percentage",
    caption = "Marc Belzunces (@marcbeldata). Data from Twitter API."
  )

grid.arrange(p2.1, p2.2, ncol = 2)

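[Figure: MPs’ Twitter language use (Catalan & Spanish), totals, including retweets (left) and without retweets (right)]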

The two plots look the same, with Catalan being the language MPs prefer overall on Twitter. Since there is no difference between retweets and own tweets, retweets reflect the language MPs use when they write themselves: they follow media and user accounts in the same linguistic proportion in which they write. From here on, to avoid unnecessary plots, I use the data including retweets.
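The comparison can also be made numerically rather than by eye. This is a sketch on hypothetical toy data, where `language_share()` is a helper introduced only for illustration; the columns `value` and `isRetweet` are the ones used in the plotting code.

```r
library(dplyr)

# Hypothetical toy data: detected language and retweet flag per tweet
df <- tibble(
  value     = c("CATALAN", "CATALAN", "SPANISH", "CATALAN",
                "SPANISH", "CATALAN", "CATALAN", "SPANISH"),
  isRetweet = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Percentage of tweets per language within whatever subset is passed in
language_share <- function(data) {
  data %>%
    count(value) %>%
    mutate(percentage = round(n / sum(n) * 100, 1))
}

# Side by side: all tweets versus own tweets only, plus the gap in points
comparison <- inner_join(
  language_share(df),
  language_share(filter(df, !isRetweet)),
  by = "value",
  suffix = c("_all", "_own")
) %>%
  mutate(gap = percentage_all - percentage_own)

comparison
```

A `gap` column close to zero for both languages would support the reading that retweets mirror the MPs’ own writing.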

But, as before, when we look at this data by political party, things are different:

ca_es <- df %>% group_by(party, value) %>% 
  summarise(
    count = n()
  ) %>% 
  filter(
    value == "CATALAN" | value == "SPANISH"
  ) %>% 
  mutate(
    value = as.factor(as.character(value))
  )

p3.1 <- ca_es %>% 
  ggplot(aes(party, count, fill = value)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "MP's language used in Twitter",
    x = "Party",
    y = "Number of Tweets",
    caption = "Marc Belzunces (@marcbeldata). Data from Twitter API."
  )


p3.2 <- ggplot(ca_es, aes(party, count, fill = value)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
  labs(
    title = "MP's language used in Twitter",
    x = "Party",
    y = "Proportion of Tweets",
    caption = "Marc Belzunces (@marcbeldata). Data from Twitter API."
  )
  
grid.arrange(p3.1, p3.2, ncol = 2)

[Figure: MPs’ language used on Twitter by party, number of tweets (left) and proportion of tweets (right)]

On the left, the total number of tweets by language. The totals reflect the size of the political groups in the Catalan Parliament: JxSí has 62 MPs, Cs is the second largest with 25, followed by PSC (16), CSQP (11), PP (11) and CUP (10). On the right, the proportion of each language by party, where 3 groups can be clearly seen. Cs and PP mainly use Spanish (around 75%), followed by CSQP and PSC, which use Catalan in 75% of their tweets. The pro-independence parties (CUP and JxSí), the third group, use almost exclusively Catalan, in 95% of their tweets.
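One way to check how much of the left-hand plot is explained by group size is to divide each party’s tweet count by its number of accounts. This is a sketch on hypothetical toy data, not a result from the real data set; only the column names `party` and `screenName` come from the code in this post.

```r
library(dplyr)

# Hypothetical toy data: one row per tweet
df <- tibble(
  party      = c("P1", "P1", "P1", "P1", "P2", "P2", "P2"),
  screenName = c("a", "a", "b", "b", "c", "c", "c")
)

# Tweets, distinct accounts and tweets per account for each party
per_account <- df %>%
  group_by(party) %>%
  summarise(
    tweets             = n(),
    accounts           = n_distinct(screenName),
    tweets_per_account = tweets / accounts
  )

per_account
```

If the totals are driven mainly by group size, `tweets_per_account` should be much more even across parties than the raw counts.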

4 Top MPs by language use

Finally, I looked for the MPs who use the most Spanish, and therefore the least Catalan, in their Twitter timelines:

#MP ranking
#Selection of languages Catalan and Spanish
df2 <- df %>% filter(
  value == "CATALAN" | value == "SPANISH"
  )  

#Calculation of percentage for ca and es for every MP
mps <- levels(df2$screenName)

final <- tibble()

for (i in mps) {
  mp <- df2 %>% 
    filter(screenName == i)
  name <- str_c(mp$name[1], mp$party[1], sep = " ")
  total_tweets <- nrow(mp)
  ca_tw_percent <- round(as.numeric(count(filter(mp, value == "CATALAN"))/total_tweets*100), 2)
  es_tw_percent <- round(as.numeric(count(filter(mp, value == "SPANISH"))/total_tweets*100), 2)
  diff <- round(ca_tw_percent-es_tw_percent, 2)
  data <- cbind(name, total_tweets, ca_tw_percent, es_tw_percent, diff) %>% as_tibble()
  final <- rbind(final,data) %>% as_tibble()
}

#Top Spanish use
es_champions <- final %>% mutate(
  total_tweets = as.numeric(total_tweets),
  ca_tw = as.numeric(ca_tw_percent),
  es_tw = as.numeric(es_tw_percent),
  diff = as.numeric(diff)
) %>% 
  arrange(desc(es_tw))

head(es_champions, n = 20)
## # A tibble: 20 × 7
##                       name total_tweets ca_tw_percent es_tw_percent   diff
##                      <chr>        <dbl>         <chr>         <chr>  <dbl>
## 1           Andrea Levy PP         2407          2.58         97.42 -94.84
## 2    Fernando de Páramo Cs         2945          3.09         96.91 -93.82
## 3      Antonio Espinosa Cs          853          3.63         96.37 -92.74
## 4         Matías Alonso Cs         2646           5.4          94.6 -89.20
## 5     José María Espejo Cs         2479         11.62         88.38 -76.76
## 6         Jesús Galiano Cs          236         11.86         88.14 -76.28
## 7      Carmen de Rivera Cs         2476         15.27         84.73 -69.46
## 8        Carlos Sánchez Cs         1602         16.23         83.77 -67.54
## 9  Xavier García Albiol PP         2591         16.87         83.13 -66.26
## 10    Noemí de la Calle Cs         2684            18            82 -64.00
## 11     Carlos Carrizosa Cs         2366         18.22         81.78 -63.56
## 12         Sonia Sierra Cs         2525         19.17         80.83 -61.66
## 13       Inés Arrimadas Cs          393         19.34         80.66 -61.32
## 14     carles castillo PSC         2407         19.69         80.31 -60.62
## 15     David Mejía Ayra Cs         2702          20.1          79.9 -59.80
## 16     Fer SánchezCosta PP         2350         20.13         79.87 -59.74
## 17      Alfonso Sánchez Cs         2626         20.75         79.25 -58.50
## 18   Elisabeth Valencia Cs         2619         20.81         79.19 -58.38
## 19     mariajosegcuevas PP         1673         21.76         78.24 -56.48
## 20  Alejandro Fernández PP          171         22.81         77.19 -54.38
## # ... with 2 more variables: ca_tw <dbl>, es_tw <dbl>

And here are those who use the most Catalan, and therefore the least Spanish:

#Top Catalan use
ca_champions <- final %>% mutate(
  total_tweets = as.numeric(total_tweets),
  ca_tw_percent = as.numeric(ca_tw_percent),
  es_tw_percent = as.numeric(es_tw_percent),
  diff = as.numeric(diff)
) %>%  
  arrange(desc(ca_tw_percent))

head(ca_champions, n = 20)
## # A tibble: 20 × 5
##                         name total_tweets ca_tw_percent es_tw_percent
##                        <chr>        <dbl>         <dbl>         <dbl>
## 1         Neus Lloveras JxSí         2858         99.72          0.28
## 2       Meritxell Roigé JxSí         2844         99.68          0.32
## 3       Carme Forcadell JxSí         2040         99.61          0.39
## 4  BERNAT SOLÉ I BARRIL JxSí          493         99.59          0.41
## 5          Jordi Munell JxSí          605         99.34          0.66
## 6         David Bonvehí JxSí         2892         99.27          0.73
## 7              jrcasals JxSí         2906         99.21          0.79
## 8   Lluís Guinó Subirós JxSí         2625         99.16          0.84
## 9          Marta Pascal JxSí         2742         99.02          0.98
## 10  Josep Rull i Andreu JxSí         2902         98.97          1.03
## 11         Albert Batet JxSí         2743         98.91          1.09
## 12       Albert Batalla JxSí         2837         98.84          1.16
## 13     Maria Senserrich JxSí         2793         98.39          1.61
## 14         Marc Solsona JxSí         2534         98.34          1.66
## 15         Dolors Bassa JxSí         2783         98.24          1.76
## 16        Gabriela Serra CUP           51         98.04          1.96
## 17 Jordi Turull i Negre JxSí         2968         97.98          2.02
## 18           Neus Munté JxSí         2744         97.74          2.26
## 19      Violant Cervera JxSí         1746         97.71          2.29
## 20  Antoni Castellà #SÍ JxSí         2690         97.66          2.34
## # ... with 1 more variables: diff <dbl>
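The per-MP loop used for both rankings can also be written as a single grouped summary, which keeps the percentage columns numeric from the start and avoids the `as.numeric()` conversions. This is a sketch of an equivalent approach on hypothetical toy data, not the code that produced the tables above.

```r
library(dplyr)

# Hypothetical toy data with the columns the loop relies on
df2 <- tibble(
  screenName = c("a", "a", "a", "a", "b", "b"),
  name       = c("MP A", "MP A", "MP A", "MP A", "MP B", "MP B"),
  party      = c("P1", "P1", "P1", "P1", "P2", "P2"),
  value      = c("CATALAN", "CATALAN", "CATALAN", "SPANISH",
                 "SPANISH", "SPANISH")
)

final <- df2 %>%
  group_by(screenName) %>%
  summarise(
    # Label each MP as "name party", as the loop does with str_c()
    name          = paste(first(name), first(party)),
    total_tweets  = n(),
    # mean() of a logical vector is the share of TRUE values
    ca_tw_percent = round(mean(value == "CATALAN") * 100, 2),
    es_tw_percent = round(mean(value == "SPANISH") * 100, 2)
  ) %>%
  mutate(diff = round(ca_tw_percent - es_tw_percent, 2)) %>%
  arrange(desc(es_tw_percent))

final
```

Sorting with `arrange(desc(ca_tw_percent))` instead gives the ranking by Catalan use.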

As I showed in a previous post, voting in Catalonia is mainly driven by birthplace or, what amounts to the same thing, language. This data seems to agree with those previous conclusions: on Twitter, MPs use the language of their voters.

Written on June 6, 2017