Text Mining in R: most common words on Twitter by Catalan MPs

Text mining social networks is one of the more interesting things to do in data science. In the previous post I collected 323,000 tweets from all the MPs in the Catalan regional Parliament, focusing on language use. Here, I’ll analyze the most frequent words in Catalan and Spanish (95% of the tweets). I’m using the following R packages: tidyverse, grid, gridExtra and tidytext. As a reference, I’m using the book Text Mining with R.

1 Methods

First I load all the data from the previous post:

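#Packages used throughout the post
library(tidyverse)
library(tidytext)
library(grid)
library(gridExtra)

df <- as_tibble(readRDS("Tweets_lang.rds"))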

df <- df %>% select(
  text,
  created.x,
  screenName,
  value,
  party
) %>% 
  mutate(
    text = as.character(text),
    line = row_number()
  ) %>% 
  select(
    line, 
    text:party
    ) %>% 
  rename(
    date = created.x
  )
glimpse(df)
## Observations: 322,975
## Variables: 6
## $ line       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ text       <chr> "RT @jorditurull: Pel PP que el President @KRLS par...
## $ date       <fctr> 2017-06-01T17:39:18Z, 2017-05-29T10:48:53Z, 2017-0...
## $ screenName <fctr> rovirola_dolors, rovirola_dolors, rovirola_dolors,...
## $ value      <fctr> CATALAN, CATALAN, CATALAN, CATALAN, SPANISH, CATAL...
## $ party      <fctr> JxSí, JxSí, JxSí, JxSí, JxSí, JxSí, JxSí, JxSí, Jx...

Now the most important step. All the tweets are stored as full sentences in the text column. We need to split every sentence into words, which is what the unnest_tokens function does.

#Getting words
text_words <- df %>% 
  unnest_tokens(word,text)
text_words
## # A tibble: 5,979,576 × 6
##     line                 date      screenName   value  party        word
##    <int>               <fctr>          <fctr>  <fctr> <fctr>       <chr>
## 1      1 2017-06-01T17:39:18Z rovirola_dolors CATALAN   JxSí          rt
## 2      1 2017-06-01T17:39:18Z rovirola_dolors CATALAN   JxSí jorditurull
## 3      1 2017-06-01T17:39:18Z rovirola_dolors CATALAN   JxSí         pel
## 4      1 2017-06-01T17:39:18Z rovirola_dolors CATALAN   JxSí          pp
## 5      1 2017-06-01T17:39:18Z rovirola_dolors CATALAN   JxSí         que
## 6      1 2017-06-01T17:39:18Z rovirola_dolors CATALAN   JxSí          el
## 7      1 2017-06-01T17:39:18Z rovirola_dolors CATALAN   JxSí   president
## 8      1 2017-06-01T17:39:18Z rovirola_dolors CATALAN   JxSí        krls
## 9      1 2017-06-01T17:39:18Z rovirola_dolors CATALAN   JxSí       parli
## 10     1 2017-06-01T17:39:18Z rovirola_dolors CATALAN   JxSí          de
## # ... with 5,979,566 more rows
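To make the tokenization step easier to follow, here is a minimal sketch on a two-row toy tibble. The toy data and the names toy and toy_words are made up for illustration and are not part of the real dataset.

```r
library(dplyr)
library(tidytext)

# Toy data standing in for the tweets (not from the real dataset)
toy <- tibble(
  line = 1:2,
  text = c("Hello World!", "Text mining in R")
)

# One row per word; the remaining columns are repeated for each token
toy_words <- toy %>%
  unnest_tokens(word, text)

toy_words
```

With its default settings, unnest_tokens lowercases the text and drops punctuation, which is why "@KRLS" shows up as "krls" in the output above.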

With all the words extracted, we need to remove those that are very frequent but carry no special meaning: the stop words. I found some stop word lists for Catalan and Spanish on this webpage, but they are quite short, so I added more words. With the stop word list in hand, we use an anti_join to remove the stop words from the full word list.

#CATALAN
text_words_ca <- text_words %>% 
  filter(
    value == "CATALAN"
  )

#Stopwords in Catalan
stopwords_ca <- read_csv("stopwords_ca.csv", col_names = FALSE) %>% 
  rename(word = X1)

#Removing stop words from text
text_words_ca <- text_words_ca %>% 
  anti_join(stopwords_ca)

#SPANISH
text_words_es <- text_words %>% 
  filter(
    value == "SPANISH"
  )

#Stopwords in Spanish
stopwords_es <- read_csv("stopwords_es.csv", col_names = FALSE) %>% 
  rename(word = X1)

#Removing stop words from text
text_words_es <- text_words_es %>% 
  anti_join(stopwords_es)
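As a small illustration of how anti_join drops stop words, here is a sketch with made-up toy tibbles (toy_words and toy_stop are hypothetical, not the real lists):

```r
library(dplyr)

# Toy word list and toy stop word list (illustrative values only)
toy_words <- tibble(word = c("el", "parlament", "de", "catalunya"))
toy_stop  <- tibble(word = c("el", "de", "la"))

# Keep only the rows of toy_words whose word has no match in toy_stop
toy_clean <- toy_words %>%
  anti_join(toy_stop, by = "word")

toy_clean
```

Passing by = "word" explicitly is optional here, since word is the only shared column, but it avoids the message that dplyr prints when it has to guess the join column.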

Now we have all the data, split by language (Catalan or Spanish), with the stop words removed (at least the most frequent ones).

2 Frequency: most common words by language

In total I have 2 million words for Catalan and 758,000 words for Spanish (counting repetitions). Now we count the frequency of every unique word.

#Absolute freq for Catalan
freq_ca <- text_words_ca %>% 
  count(word, sort = TRUE) %>% 
  mutate(
    word = reorder(word,n)
  )

freq_ca <- freq_ca[1:25,]

#Absolute freq for Spanish
freq_es <- text_words_es %>% 
  count(word, sort = TRUE) %>% 
  mutate(
    word = reorder(word,n)
  )

freq_es <- freq_es[1:25,]
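The count and reorder steps can be checked on a toy example. This is only a sketch with made-up words (toy and toy_freq are hypothetical names). It shows that count(sort = TRUE) orders rows from most to least frequent, and that reorder turns word into a factor whose levels follow n, so that the bars in a flipped bar chart come out sorted by frequency.

```r
library(dplyr)

# Toy words (illustrative values only)
toy <- tibble(word = c("a", "b", "a", "c", "a", "b"))

toy_freq <- toy %>%
  count(word, sort = TRUE) %>%        # rows sorted by descending count
  mutate(word = reorder(word, n))     # factor levels sorted by ascending count

toy_freq
levels(toy_freq$word)
```

Without the reorder step, ggplot2 would order the bars alphabetically by word rather than by count.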

I’m selecting the 25 most frequent words to plot them:

p1.1 <- ggplot(freq_ca, aes(word, n)) +
  geom_col(fill = "lightgreen") +
  coord_flip() +
  labs(
    title = "Most common words in Catalan from Catalan Politicians",
    subtitle = "Among 323,000 tweets from all Catalan Parliament MP. Stop words removed",
    y = "Count",
    x = "Word",
    caption = "Marc Belzunces (@marcbeldata)"
  )
p1.2 <- ggplot(freq_es, aes(word, n)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(
    title = "Most common words in Spanish from Catalan Politicians",
    subtitle = "Among 323,000 tweets from all Catalan Parliament MP. Stop words removed",
    y = "Count",
    x = "Word",
    caption = "Marc Belzunces (@marcbeldata)"
  )

grid.arrange(p1.1, p1.2, ncol = 2)


There are clear differences depending on the language. In Catalan, self-government related words are the most common (Catalonia, Parliament and (Catalan) government make up the Top 3), while in Spanish the most common words relate more to the unionist parties and their leaders. In Catalan, Catalonia is the top word, while Spain does not appear directly in the Top 25, but its Catalan political euphemism does: l’estat (“the Spanish State”). In Spanish, Spain is the third most frequent word and Catalonia the seventh, while the Spanish Government (gobierno) is more frequent than the Catalan Parliament (parlament).

3 Most common words by party

Let’s see what happens if we analyze tweets by political party instead of by language. First I merge the Catalan and Spanish stop words, since I’m no longer splitting by language. Then I select the words for every party and plot the 25 most frequent.

#All the stop words, in Catalan and Spanish
stopwords_ca_es <- rbind(stopwords_ca, stopwords_es)

text_words_ca_es <- text_words %>% 
  anti_join(stopwords_ca_es)

#JxSí
jxsi_words <- text_words_ca_es %>% 
  filter(
    party == "JxSí"
  ) %>%
  count(word, sort = TRUE) %>% 
  mutate(
    word = reorder(word,n)
  ) %>% 
  arrange (desc(n)) %>% 
  top_n(25)

p2.1 <- ggplot(jxsi_words, aes(word, n)) +
  geom_col(fill = "#5EB5A1")+
  coord_flip() +
  labs(
    title = "Most common words by JxSí MP in Twitter",
    subtitle = "Among 323,000 tweets from all Catalan Parliament MP. Stop words removed",
    y = "Count",
    x = "Word",
    caption = "Marc Belzunces (@marcbeldata)"
  )

#Cs
cs_words <- text_words_ca_es %>% 
  filter(
    party == "Cs"
  ) %>%
  count(word, sort = TRUE) %>% 
  mutate(
    word = reorder(word,n)
  ) %>% 
  arrange (desc(n)) %>% 
  top_n(25)

p2.2 <- ggplot(cs_words, aes(word, n)) +
  geom_col(fill = "#FFA300")+
  coord_flip() +
  labs(
    title = "Most common words by Cs MP in Twitter",
    subtitle = "Among 323,000 tweets from all Catalan Parliament MP. Stop words removed",
    y = "Count",
    x = "Word",
    caption = "Marc Belzunces (@marcbeldata)"
  )

#PSC
psc_words <- text_words_ca_es %>% 
  filter(
    party == "PSC"
  ) %>%
  count(word, sort = TRUE) %>% 
  mutate(
    word = reorder(word,n)
  ) %>% 
  arrange (desc(n)) %>% 
  top_n(25)

p2.3 <- ggplot(psc_words, aes(word, n)) +
  geom_col(fill = "#E60000")+
  coord_flip() +
  labs(
    title = "Most common words by PSC MP in Twitter",
    subtitle = "Among 323,000 tweets from all Catalan Parliament MP. Stop words removed",
    y = "Count",
    x = "Word",
    caption = "Marc Belzunces (@marcbeldata)"
  )

#CSQP
csqp_words <- text_words_ca_es %>% 
  filter(
    party == "CSQP"
  ) %>%
  count(word, sort = TRUE) %>% 
  mutate(
    word = reorder(word,n)
  ) %>% 
  arrange (desc(n)) %>% 
  top_n(25)

p2.4 <- ggplot(csqp_words, aes(word, n)) +
  geom_col(fill = "#C1173E")+
  coord_flip() +
  labs(
    title = "Most common words by CSQP MP in Twitter",
    subtitle = "Among 323,000 tweets from all Catalan Parliament MP. Stop words removed",
    y = "Count",
    x = "Word",
    caption = "Marc Belzunces (@marcbeldata)"
  )

#PP
pp_words <- text_words_ca_es %>% 
  filter(
    party == "PP"
  ) %>%
  count(word, sort = TRUE) %>% 
  mutate(
    word = reorder(word,n)
  ) %>% 
  arrange (desc(n)) %>% 
  top_n(25)

p2.5 <- ggplot(pp_words, aes(word, n)) +
  geom_col(fill = "#4EA9FF")+
  coord_flip() +
  labs(
    title = "Most common words by PP MP in Twitter",
    subtitle = "Among 323,000 tweets from all Catalan Parliament MP. Stop words removed",
    y = "Count",
    x = "Word",
    caption = "Marc Belzunces (@marcbeldata)"
  )

#CUP
cup_words <- text_words_ca_es %>% 
  filter(
    party == "CUP"
  ) %>%
  count(word, sort = TRUE) %>% 
  mutate(
    word = reorder(word,n)
  ) %>% 
  arrange (desc(n)) %>% 
  top_n(25)

p2.6 <- ggplot(cup_words, aes(word, n)) +
  geom_col(fill = "#FFCC00")+
  coord_flip() +
  labs(
    title = "Most common words by CUP MP in Twitter",
    subtitle = "Among 323,000 tweets from all Catalan Parliament MP. Stop words removed",
    y = "Count",
    x = "Word",
    caption = "Marc Belzunces (@marcbeldata)"
  )

grid.arrange(p2.1,p2.2, p2.3,p2.4,p2.5,p2.6, ncol=2)
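The six per-party blocks above share the same pattern, so they could be folded into one helper. The following is only a sketch: the function top_words and its arguments party_name and k are hypothetical names, not part of the original analysis, and it assumes dplyr 1.0.0 or later for slice_max. It uses slice_max with with_ties = FALSE instead of top_n, because top_n keeps ties and can therefore return more than 25 rows.

```r
library(dplyr)

# Hypothetical helper (not in the original post):
# the k most frequent words for a single party.
top_words <- function(words, party_name, k = 25) {
  words %>%
    filter(party == party_name) %>%
    count(word, sort = TRUE) %>%
    slice_max(n, n = k, with_ties = FALSE) %>%  # exactly k rows at most
    mutate(word = reorder(word, n))             # order bars by count
}

# Toy tokens standing in for text_words_ca_es (illustrative values only)
toy <- tibble(
  party = c("A", "A", "A", "B", "B", "A"),
  word  = c("x", "y", "x", "x", "z", "z")
)

top_a <- top_words(toy, "A", k = 2)
top_a
```

With the real data, a call such as top_words(text_words_ca_es, "JxSí") would then stand in for each of the per-party pipelines.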


Basically, we can see that MPs use Twitter to talk about themselves and their parties. For every party the most frequent word is the party itself, except for the pro-independence coalition JxSí, where Catalonia comes first and the coalition second. These are followed by political words related to Catalan self-government and status (Catalonia, Spain, Spanish/Catalan Government, referendum/democracy).

This is a shallow analysis. Obviously, with a more careful word selection (removing party names and MP accounts, for example), we could focus more on political and social words. But the scope of this brief post is a first exploration of text mining.
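One possible way to do that deeper selection is to treat account handles and party names as extra stop words. This is a rough sketch on toy data: toy_words, account_words and party_words are hypothetical names, and the party list is a hand-written assumption that would need to be adapted.

```r
library(dplyr)

# Toy tokens standing in for text_words (illustrative values only)
toy_words <- tibble(
  screenName = c("mpone", "mptwo", "mpone"),
  word       = c("mptwo", "referendum", "cs")
)

# Extra "stop words": every account handle, lowercased like the tokens
account_words <- toy_words %>%
  distinct(screenName) %>%
  transmute(word = tolower(as.character(screenName)))

# Hand-written list of party names (an assumption, to be adapted)
party_words <- tibble(word = c("cs", "psc"))

# Drop tokens that are an account handle or a party name
toy_filtered <- toy_words %>%
  anti_join(bind_rows(account_words, party_words), by = "word")

toy_filtered
```

This only removes handles of accounts that appear in the dataset itself; mentions of other accounts would need their own list.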

Written on June 9, 2017