Text mining in R, analysing Catalan author Enric Vila

This is the second post about text mining in R. In the previous one, I analysed text from 323,000 tweets. Now I’m analysing one of my favourite authors in Catalan, Enric Vila, through the only two of his books available for Kindle, Londres-París-Barcelona: Viatge al cor de la tempesta and Un estiu a les trinxeres. Basically, I’m following the excellent book Text Mining with R.

1 Methods

I’m using the packages tidyverse, tidytext, igraph, ggraph and widyr.
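Loading them all at the top of the script (a minimal setup; install from CRAN whatever is missing):

#Loading the packages used throughout the post
library(tidyverse)
library(tidytext)
library(igraph)
library(ggraph)
library(widyr)

I’m importing the books’ text from txt files and joining them into one object: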

#Loading Enric Vila's books and removing empty lines
londres <- readLines(con = "Londres-Paris-Barcelona_ Viatge al cor de - Enric Vila.txt") %>%
  as_tibble() %>%
  filter(value != "") %>%
  mutate(book = "Londres-Paris-Barcelona") %>%
  rename(text = value)

trinxeres <- readLines(con = "Un estiu a les trinxeres - Enric Vila.txt") %>%
  as_tibble() %>%
  filter(value != "") %>%
  mutate(book = "Un estiu a les trinxeres") %>%
  rename(text = value)

vila_books <- rbind(londres, trinxeres)

Once imported, each paragraph of the books occupies one line:

str(vila_books)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3339 obs. of  2 variables:
##  $ text: chr  "© I. Montero Peláez" "Enric Vila és llicenciat en Història Contemporània i doctor en Periodisme. Ha publicat el llibre entrevista Què"| __truncated__ "«Si avui el món explotés molts de nosaltres només tindríem ulls per saber com quedaria Catalunya. La força de l"| __truncated__ "FÈLIX RIERA" ...
##  $ book: chr  "Londres-Paris-Barcelona" "Londres-Paris-Barcelona" "Londres-Paris-Barcelona" "Londres-Paris-Barcelona" ...
vila_books$text[14]
## [1] "Un altre hi hauria oposat resistència o, més ingènuament, hauria intentat marxar més lluny. Jo vaig obrir el Word i em vaig posar a escriure. No em va costar adonar-me que arribar a Londres amb ganes de treure’m uns quants morts de sobre em permetria mirar-me el país des de fora, a través d’un procés de globalització que no sé on portarà, però que segur que farà desaparèixer el món tal com jo l’havia conegut –i tal com vaig aprendre a estimar-lo."

Now we need to isolate all the words with the tidytext function unnest_tokens(), counting the frequency of every word:

#Tokenizing into words and counting frequencies by book
vila_words <- vila_books %>% 
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>% 
  ungroup()

#Total number of words per book
total_words <- vila_words %>% 
  group_by(book) %>% 
  summarize(
    total = sum(n)
  )

vila_words <- left_join(vila_words, total_words, by = "book")

vila_words
## # A tibble: 25,150 × 4
##                        book  word     n  total
##                       <chr> <chr> <int>  <int>
## 1   Londres-Paris-Barcelona    de  4522 106013
## 2   Londres-Paris-Barcelona   que  4338 106013
## 3   Londres-Paris-Barcelona    la  4254 106013
## 4   Londres-Paris-Barcelona     i  3117 106013
## 5  Un estiu a les trinxeres   que  3093  68446
## 6  Un estiu a les trinxeres    de  2836  68446
## 7  Un estiu a les trinxeres    la  2647  68446
## 8   Londres-Paris-Barcelona     a  2578 106013
## 9   Londres-Paris-Barcelona    el  2507 106013
## 10 Un estiu a les trinxeres    el  1842  68446
## # ... with 25,140 more rows

2 Most important words/concepts in Vila’s books

In the previous post, I used the approach of pure frequency, removing stop words. Here I’m using the tf-idf statistic (term frequency - inverse document frequency), which is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites (source: Text Mining with R). The idea behind this is that word frequencies tend to follow Zipf’s law: the frequency at which a word appears is inversely proportional to its rank. So the most important words of a document are those that appear above this expected frequency.
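For reference, tf-idf combines two quantities. With $N$ documents (here $N = 2$ books) and a term $t$ appearing in $n_t$ of them:

$$\textrm{tf-idf}(t, d) = \textrm{tf}(t, d) \cdot \textrm{idf}(t), \qquad \textrm{tf}(t, d) = \frac{n_{t,d}}{\textrm{total words in } d}, \qquad \textrm{idf}(t) = \ln \frac{N}{n_t}$$

With only two books, idf takes just two values: ln(2/2) = 0 for a word appearing in both books, and ln(2/1) ≈ 0.693 for a word exclusive to one of them. Both values show up in the tables below.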

Here is the distribution of Vila’s words according to Zipf’s law (dotted line):

#Rank of words; vila_words is already sorted by n, so row_number() gives the true rank
freq_by_rank <- vila_words %>% 
  group_by(book) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total)

#Zipf's law
freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = book)) + 
  geom_abline(intercept = -0.62, slope = -1.1, color = "gray50", linetype = 2) +
  geom_line(size = 1.2, alpha = 0.8) + 
  scale_x_log10() +
  scale_y_log10()

[Figure: term frequency vs. rank by book, on log-log scales, with Zipf’s law as a dotted line]

Enric Vila basically follows Zipf’s law, although he uses the most frequent words a little less than the law predicts. A deviation at the highest ranks is common, but here it is probably also affected by the fact that I’m using only two books.
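The dotted line’s intercept and slope are not pulled out of thin air: following the approach in Text Mining with R, they can be estimated by regressing log frequency on log rank over the middle section of the distribution (a sketch; the rank cutoffs 10 and 500 are judgment calls):

#Fitting the power law on the middle of the rank range
rank_subset <- freq_by_rank %>% 
  filter(rank < 500, rank > 10)

lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)

If the fit is good, the coefficients should be close to the intercept and slope hardcoded in geom_abline() above.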

So now the tf-idf statistic is obtained:

book_words <- vila_words %>%
  bind_tf_idf(word, book, n)
book_words
## # A tibble: 25,150 × 7
##                        book  word     n  total         tf   idf tf_idf
##                       <chr> <chr> <int>  <int>      <dbl> <dbl>  <dbl>
## 1   Londres-Paris-Barcelona    de  4522 106013 0.04265515     0      0
## 2   Londres-Paris-Barcelona   que  4338 106013 0.04091951     0      0
## 3   Londres-Paris-Barcelona    la  4254 106013 0.04012715     0      0
## 4   Londres-Paris-Barcelona     i  3117 106013 0.02940205     0      0
## 5  Un estiu a les trinxeres   que  3093  68446 0.04518891     0      0
## 6  Un estiu a les trinxeres    de  2836  68446 0.04143412     0      0
## 7  Un estiu a les trinxeres    la  2647  68446 0.03867282     0      0
## 8   Londres-Paris-Barcelona     a  2578 106013 0.02431777     0      0
## 9   Londres-Paris-Barcelona    el  2507 106013 0.02364804     0      0
## 10 Un estiu a les trinxeres    el  1842  68446 0.02691173     0      0
## # ... with 25,140 more rows

The table is ordered by rank (most frequent word first). book is where the word appears, n is the total number of appearances of the word in that book, and total is the total number of words in the book. tf is the term frequency proportion in the book, tf = n/total (e.g. for de in Londres-Paris-Barcelona, 4522/106013 ≈ 0.0427). idf, the inverse document frequency, and tf_idf are zero for these extremely common words, which appear in both books, while higher idf/tf_idf corresponds to words that are less frequent across the collection and, thus, more characteristic. Let’s see these characteristic words in Enric Vila’s books:

book_words %>% 
  select(-total) %>% 
  arrange(desc(tf_idf))
## # A tibble: 25,150 × 6
##                        book      word     n           tf       idf
##                       <chr>     <chr> <int>        <dbl>     <dbl>
## 1  Un estiu a les trinxeres       mas   140 0.0020454081 0.6931472
## 2  Un estiu a les trinxeres       ciu    63 0.0009204336 0.6931472
## 3  Un estiu a les trinxeres  consulta    51 0.0007451129 0.6931472
## 4  Un estiu a les trinxeres junqueras    46 0.0006720626 0.6931472
## 5   Londres-Paris-Barcelona     martí    68 0.0006414308 0.6931472
## 6   Londres-Paris-Barcelona   natàlia    52 0.0004905059 0.6931472
## 7   Londres-Paris-Barcelona     intel    51 0.0004810731 0.6931472
## 8  Un estiu a les trinxeres l’estatut    31 0.0004529118 0.6931472
## 9  Un estiu a les trinxeres   marisol    30 0.0004383017 0.6931472
## 10 Un estiu a les trinxeres        pp    28 0.0004090816 0.6931472
## # ... with 25,140 more rows, and 1 more variables: tf_idf <dbl>

A plot by book shows this better:

#Ordering words as factor levels by tf-idf so the plot sorts correctly
plot_vila <- book_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

plot_vila %>% 
  group_by(book) %>% 
  top_n(25, tf_idf) %>% 
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()

[Figure: top 25 words by tf-idf for each book]

Un estiu a les trinxeres is about politics in Catalonia. The first word (mas) is the former president of Catalonia, ciu his party (CiU), consulta was the unofficial referendum on independence, junqueras another political leader… The analysis gives a very good insight into the book. Londres-Paris-Barcelona is a more personal book, with Enric Vila’s ideas about cities, nations and women, and personal experiences from his stay in London while writing it. Here the first positions go to Vila’s friends (intel stands for intel·ligència; tidytext seems to have problems with some Catalan characters, splitting words at the interpunct).
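A possible workaround, which I haven’t tested, would be tokenizing with an explicit regular expression so that the interpunct and the apostrophe are kept inside words. This assumes the token = "regex" option of unnest_tokens() splits on the supplied pattern:

#Hypothetical fix: split on anything that is not a letter, an apostrophe or the interpunct
vila_books %>%
  unnest_tokens(word, text, token = "regex", pattern = "[^\\p{L}’·]+")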

3 Bigrams

Isolated words are very interesting, and their analysis shows the books’ subjects very well. But the relationships between words are probably more interesting, or at least provide more information. Here I analyse which words tend to follow other words immediately, that is, which words go together.

#Getting bigrams
vila_bigrams <- vila_books %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

#Sorting by n
vila_bigrams %>% 
  count(bigram, sort = TRUE)
## # A tibble: 92,146 × 2
##     bigram     n
##      <chr> <int>
## 1    de la  1345
## 2     a la   690
## 3   que no   470
## 4   que la   438
## 5   que el   393
## 6   de les   367
## 7  la seva   350
## 8    i que   313
## 9  que els   311
## 10    i la   285
## # ... with 92,136 more rows

As expected, the most frequent bigrams are made of stop words. We have to remove them. I’m using a limited stop word list for Catalan that I will have to improve in the near future, but for the purpose of this exercise it’s enough:

#Splitting bigrams into their two words
bigrams_separated <- vila_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

#Loading Catalan stop words
stopwords_ca <- read_csv("stopwords_ca.csv", col_names = FALSE) %>% 
  rename(word = X1)
## Parsed with column specification:
## cols(
##   X1 = col_character()
## )
#Removing stop words from bigrams
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stopwords_ca$word) %>%
  filter(!word2 %in% stopwords_ca$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE) %>% 
  unite(bigram, word1, word2, sep = " ")

bigram_counts
## # A tibble: 29,490 × 2
##              bigram     n
## *             <chr> <int>
## 1         em sembla    79
## 2      estats units    70
## 3          segle xx    66
## 4     president mas    49
## 5         nova york    39
## 6  l’estat espanyol    38
## 7          m’ha dit    34
## 8           em deia    31
## 9    guerra mundial    31
## 10    gran bretanya    28
## # ... with 29,480 more rows
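As an aside, since my Catalan stop word list is limited, the stopwords package might offer a ready-made alternative. I haven’t verified its Catalan coverage, so this is only a hypothetical substitute for the CSV above:

#Hypothetical alternative to the hand-made list, assuming the "stopwords-iso" source covers Catalan
library(stopwords)
stopwords_ca <- tibble(word = stopwords("ca", source = "stopwords-iso"))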

Here the bigrams make more sense than isolated words, although the list could still be improved. Now we generate the tf-idf statistic for the bigrams, in order to show the most important ones:

bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

bigrams_united
## # A tibble: 35,099 × 2
##                       book                 bigram
## *                    <chr>                  <chr>
## 1  Londres-Paris-Barcelona         montero peláez
## 2  Londres-Paris-Barcelona           peláez enric
## 3  Londres-Paris-Barcelona             enric vila
## 4  Londres-Paris-Barcelona història contemporània
## 5  Londres-Paris-Barcelona      llibre entrevista
## 6  Londres-Paris-Barcelona         pensa heribert
## 7  Londres-Paris-Barcelona       heribert barrera
## 8  Londres-Paris-Barcelona           barrera 2001
## 9  Londres-Paris-Barcelona          2001 l’assaig
## 10 Londres-Paris-Barcelona        l’assaig néstor
## # ... with 35,089 more rows
bigram_tf_idf <- bigrams_united %>%
  count(book, bigram) %>%
  bind_tf_idf(bigram, book, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf
## Source: local data frame [30,496 x 6]
## Groups: book [2]
## 
##                        book                  bigram     n           tf
##                       <chr>                   <chr> <int>        <dbl>
## 1  Un estiu a les trinxeres           president mas    49 0.0036643733
## 2  Un estiu a les trinxeres         cases regionals    14 0.0010469638
## 3  Un estiu a les trinxeres       l’oriol junqueras    14 0.0010469638
## 4  Un estiu a les trinxeres tribunal constitucional    14 0.0010469638
## 5  Un estiu a les trinxeres              lópez tena    13 0.0009721807
## 6  Un estiu a les trinxeres               cas pujol    11 0.0008226144
## 7  Un estiu a les trinxeres      justícia espanyola    10 0.0007478313
## 8   Londres-Paris-Barcelona          intel ligència    16 0.0007364109
## 9   Londres-Paris-Barcelona             vida urbana    16 0.0007364109
## 10  Londres-Paris-Barcelona           senyor cuervo    15 0.0006903852
## # ... with 30,486 more rows, and 2 more variables: idf <dbl>, tf_idf <dbl>

Bigrams make more sense than isolated words. mas could also refer to a typical medieval farmhouse, but with bigrams it’s clear that we are talking about the former president of Catalonia. Let’s see a plot, by book:

#Plot bigrams
plot_bigrams <- bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  mutate(bigram = factor(bigram, levels = rev(unique(bigram))))


plot_bigrams %>% 
  group_by(book) %>% 
  top_n(25, tf_idf) %>% 
  ungroup %>%
  ggplot(aes(bigram, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()

[Figure: top 25 bigrams by tf-idf for each book]

These are the frequencies of bigrams, but what about their relationships? Here I’m using the packages igraph and ggraph to build a network of words. More information can be found in Text Mining with R.

#Getting count to filter them for a clear plot
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_graph <- bigram_counts %>%
  filter(n > 7) %>%
  graph_from_data_frame()

#Network plot
set.seed(2016)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

[Figure: directed network of bigrams occurring more than 7 times]

4 Correlation of words in paragraphs

Bigrams are two words together, one word followed immediately by another. But words can also be related without being adjacent. For example, a book has several chapters, normally one per topic, so the words within a chapter are related. Here I’m analysing the correlations between words in fragments of text. As I don’t have the text divided into chapters, I’m splitting it into sections of 10 paragraphs (lines, as I showed at the beginning): section = row_number() %/% 10 assigns lines 1 to 9 to section 0, lines 10 to 19 to section 1, and so on, and the filter drops section 0. I’m analysing only one book, Londres-París-Barcelona.

#Splitting into sections of 10 lines, tokenizing and removing stop words
vila_section_words <- londres %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stopwords_ca$word)

word_pairs <- vila_section_words %>%
  pairwise_count(word, section, sort = TRUE)

word_pairs
## # A tibble: 8,965,318 × 3
##    item1 item2     n
##    <chr> <chr> <dbl>
## 1    món    em    98
## 2     em   món    98
## 3   seva    em    91
## 4     em  seva    91
## 5     em   ara    90
## 6    ara    em    90
## 7    tan    em    89
## 8    dir    em    89
## 9   seva   món    89
## 10   món  seva    89
## # ... with 8,965,308 more rows

So we have almost 9 million word pairs. I’m keeping only relatively common words and ordering the pairs by their correlation:

# we need to filter for at least relatively common words first
word_cors <- vila_section_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE)

word_cors
## # A tibble: 191,406 × 3
##      item1   item2 correlation
##      <chr>   <chr>       <dbl>
## 1    units  estats   0.7376910
## 2   estats   units   0.7376910
## 3     york    nova   0.7074085
## 4     nova    york   0.7074085
## 5       xx   segle   0.6737424
## 6    segle      xx   0.6737424
## 7     unió europea   0.5977442
## 8  europea    unió   0.5977442
## 9  mundial  guerra   0.5611101
## 10  guerra mundial   0.5611101
## # ... with 191,396 more rows
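The correlation reported by pairwise_cor() is, as Text Mining with R explains, the phi coefficient. For a pair of words, counting in how many sections each one does or does not appear:

$$\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}{\sqrt{n_{1\cdot}\,n_{0\cdot}\,n_{\cdot 1}\,n_{\cdot 0}}}$$

where $n_{11}$ is the number of sections containing both words, $n_{00}$ the number containing neither, $n_{10}$ and $n_{01}$ the number containing only one of them, and the denominator terms are the row and column totals of that 2×2 table.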

There are two recurring topics in Enric Vila’s work, Catalonia (catalunya) and women (dones). Let’s see which words correlate best with them across sections:

#Exploration
word_cors %>%
  filter(item1 %in% c("catalunya", "dones")) %>%
  group_by(item1) %>%
  top_n(20, correlation) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip()

[Figure: words most correlated with catalunya and dones]

Finally, I’m drawing a correlation network of words, probably the most interesting result of this exercise, showing clusters of related words, that is, the relationships between topics:

set.seed(2016)

word_cors %>%
  filter(correlation > .35) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

[Figure: network of words with pairwise correlation above 0.35]

We could also detect topics programmatically, but this is enough for one post.
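As a teaser for a possible follow-up, here is a minimal sketch of programmatic topic detection, casting the tidy counts into a document-term matrix and running LDA with the topicmodels package (not run here; k = 2 and the seed are arbitrary choices):

#Hypothetical topic-modelling follow-up
library(topicmodels)

vila_dtm <- vila_words %>% 
  cast_dtm(book, word, n)

vila_lda <- LDA(vila_dtm, k = 2, control = list(seed = 1234))
tidy(vila_lda, matrix = "beta")

I must admit it: text mining is a lot of fun!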

Written on June 10, 2017