Ethnic vote in Catalonia 1980-2015

In this little exercise I’m exploring the vote in Catalonia in terms of independence from/union to Spain, a key political issue in Catalonia for the last years, testing the hypothesis if this vote is conditioned by birthplace.

1 Methods

I’m using R programming language in RStudio, specifically the tidyverse packages. For those interested in this evolution of R programming language, I strongly recommend the R for Data Science book. Additionally, I’m using stringr package for string manipulation, and some other packages needed for mapping and plotting (rdgal, rgeos, maptools, ggmap and scales).

I’m using election data for all the contemporary Catalan regional elections, eleven in total: 1980, 1984, 1988, 1992, 1995, 1999, 2003, 2006, 2010, 2012 and 2015. Source and basic structure comes from the Statistical Institute of Catalonia (Idescat). So, let’s to aggregate all the data:

#Load Data
files <- list.files(".", pattern = "^t6011", full.names = T)

#Merge files
final <- tibble()

for (i in files) {
  #Finding rows to skip (variable) in reading files
  content <- as_tibble(readLines(i))
  skip_index <- as.numeric(str_which(content$value, "^Codi"))-1
  #Reading files
  df <- read_csv2(i, skip = skip_index, locale = locale(encoding = "ISO-8859-1"), na = "..")
  year <- str_sub(i, start = 11, end = 14)
  #Common part to all years
  df <- df %>% 
    mutate(
      Year = year,
      Codi = str_sub(Codi, 1, 5)
    ) %>% 
    rename(
      INE = Codi,
      Municipality = Literal,
      Cs = `C's`,
      Other = `Altres candidatures`
    )  %>% 
    filter(
      Municipality != "Catalunya" &
        !str_detect(Municipality, "^Resident")
    )
  #Non-2015 years (base structure)
  if (year != 2015) {
    df <- df %>% 
      mutate(
        JxSí = NA
      ) %>% 
      rename(
        CSQP = IC
      ) %>% 
      select(
        INE:Municipality,
        Year,
        CiU:CUP,
        JxSí,
        Other,
        Total
      )
  }
  #2015: JxSí and CSQP, no CiU ERC
  if (year == 2015) {
    df <- df %>% 
      mutate(
        CiU = NA,
        ERC = NA
      ) %>% 
      rename(
        CSQP = CatSíqueesPot
      ) %>% 
      select(
        INE:Municipality,
        Year,
        JxSí:ERC
      )
  }
  
final <- rbind(final, df) %>% 
    arrange(Municipality, Year)
}

The resulting data frame has 10,382 observations and 13 variables:

options(tibble.print_min = 22)
final
## # A tibble: 10,382 × 13
##      INE       Municipality  Year   CiU   PSC    PP  CSQP   ERC    Cs
##    <chr>              <chr> <chr> <int> <int> <int> <int> <int> <chr>
## 1  25001 Abella de la Conca  1980    50     4     0     1     2  <NA>
## 2  25001 Abella de la Conca  1984    64     0     0     0     3  <NA>
## 3  25001 Abella de la Conca  1988    40     0     0     1     2  <NA>
## 4  25001 Abella de la Conca  1992    46     3     3     0     3  <NA>
## 5  25001 Abella de la Conca  1995    49     6     4     1     7  <NA>
## 6  25001 Abella de la Conca  1999    56    12     6     0     6  <NA>
## 7  25001 Abella de la Conca  2003    53     6     0     6    12  <NA>
## 8  25001 Abella de la Conca  2006    37     8     3     8    15     0
## 9  25001 Abella de la Conca  2010    57     7     0     8     6     0
## 10 25001 Abella de la Conca  2012    47     4     0     7    17     0
## 11 25001 Abella de la Conca  2015    NA    11     5     3    NA     9
## 12 08001             Abrera  1980   411   407     6   389    81  <NA>
## 13 08001             Abrera  1984   700   726    91   214    49  <NA>
## 14 08001             Abrera  1988   602   732    57   199    44  <NA>
## 15 08001             Abrera  1992   733   849    83   147    88  <NA>
## 16 08001             Abrera  1995  1015   936   388   409   182  <NA>
## 17 08001             Abrera  1999   941  1754   384   124   179  <NA>
## 18 08001             Abrera  2003   883  1746   585   397   467  <NA>
## 19 08001             Abrera  2006   886  1497   561   439   476   139
## 20 08001             Abrera  2010  1434  1161   730   418   261   208
## 21 08001             Abrera  2012  1080  1236   962   782   571   599
## 22 08001             Abrera  2015    NA  1275   619   938    NA  1771
## # ... with 1.036e+04 more rows, and 4 more variables: CUP <chr>,
## #   JxSí <int>, Other <int>, Total <int>
glimpse(final)
## Observations: 10,382
## Variables: 13
## $ INE          <chr> "25001", "25001", "25001", "25001", "25001", "250...
## $ Municipality <chr> "Abella de la Conca", "Abella de la Conca", "Abel...
## $ Year         <chr> "1980", "1984", "1988", "1992", "1995", "1999", "...
## $ CiU          <int> 50, 64, 40, 46, 49, 56, 53, 37, 57, 47, NA, 411, ...
## $ PSC          <int> 4, 0, 0, 3, 6, 12, 6, 8, 7, 4, 11, 407, 726, 732,...
## $ PP           <int> 0, 0, 0, 3, 4, 6, 0, 3, 0, 0, 5, 6, 91, 57, 83, 3...
## $ CSQP         <int> 1, 0, 1, 0, 1, 0, 6, 8, 8, 7, 3, 389, 214, 199, 1...
## $ ERC          <int> 2, 3, 2, 3, 7, 6, 12, 15, 6, 17, NA, 81, 49, 44, ...
## $ Cs           <chr> NA, NA, NA, NA, NA, NA, NA, "0", "0", "0", "9", N...
## $ CUP          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "4", "18", NA...
## $ JxSí         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 57, NA, N...
## $ Other        <int> 19, 3, 4, 5, 3, 3, 0, 0, 5, 2, 4, 315, 90, 264, 1...
## $ Total        <int> 76, 70, 47, 60, 70, 83, 77, 71, 83, 81, 107, 1609...

Data is ordered by municipality (948 in 2015), and election year. NA data corresponds to the party Cs (Ciudadanos, first Catalan elections in 2006), CUP (Candidatura d’Unitat Popular, first Catalan elections in 2012) and those involving in JxSí, CiU and ERC, as CiU and ERC are not present in 2015, and JxSí is present in 2015 but not in the rest of years.

Now, vote in percentage and indy/unionist assignment are added:

#Percent
colnames(final)[4:13] <- str_c(colnames(final)[4:13], "_votes") 

final <- final %>% 
  rowwise %>% #rowwise will make sure the sum operation will occur on each row
  mutate(
    Indy_votes = sum(CiU_votes,ERC_votes,as.numeric(CUP_votes),JxSí_votes, na.rm = T),
    Unio_votes = sum(PSC_votes,PP_votes,CSQP_votes,as.numeric(Cs_votes), na.rm = T)
  ) %>% 
  mutate(
    CiU_percent = round(CiU_votes/Total_votes*100,2),
    PSC_percent = round(PSC_votes/Total_votes*100,2),
    PP_percent = round(PP_votes/Total_votes*100,2),
    CSQP_percent = round(CSQP_votes/Total_votes*100,2),
    ERC_percent = round(ERC_votes/Total_votes*100,2),
    Cs_percent = round(as.numeric(Cs_votes)/Total_votes*100,2),
    CUP_percent = round(as.numeric(CUP_votes)/Total_votes*100,2),
    JxSí_percent = round(JxSí_votes/Total_votes*100,2),
    Other_percent = round(Other_votes/Total_votes*100,2),
    Indy_percent = round(Indy_votes/Total_votes*100, 2),
    Unio_percent = round(Unio_votes/Total_votes*100, 2)
  )

948 municipalities are a lot of elements for Exploratory Data Analysis, so I’m adding also some extra administrative data, Comarques (Counties, 42 in total) and Vegueries (7 Catalan regions):

#Regions
vegueries <- read_csv("INE Codes Counties Vegueries.csv")
vegueries$INE <- str_pad(vegueries$INE, 5, "left", "0")

df <- left_join(final, vegueries, by = "INE") %>% 
  select(
    INE:Municipality,
    County:Vegueria,
    Year:Unio_percent
  )
options(tibble.print_min = 22)
df
## Source: local data frame [10,382 x 28]
## Groups: <by row>
## 
## # A tibble: 10,382 × 28
##      INE       Municipality         County    Vegueria  Year CiU_votes
##    <chr>              <chr>          <chr>       <chr> <chr>     <int>
## 1  25001 Abella de la Conca  Pallars Jussà Alt Pirineu  1980        50
## 2  25001 Abella de la Conca  Pallars Jussà Alt Pirineu  1984        64
## 3  25001 Abella de la Conca  Pallars Jussà Alt Pirineu  1988        40
## 4  25001 Abella de la Conca  Pallars Jussà Alt Pirineu  1992        46
## 5  25001 Abella de la Conca  Pallars Jussà Alt Pirineu  1995        49
## 6  25001 Abella de la Conca  Pallars Jussà Alt Pirineu  1999        56
## 7  25001 Abella de la Conca  Pallars Jussà Alt Pirineu  2003        53
## 8  25001 Abella de la Conca  Pallars Jussà Alt Pirineu  2006        37
## 9  25001 Abella de la Conca  Pallars Jussà Alt Pirineu  2010        57
## 10 25001 Abella de la Conca  Pallars Jussà Alt Pirineu  2012        47
## 11 25001 Abella de la Conca  Pallars Jussà Alt Pirineu  2015        NA
## 12 08001             Abrera Baix Llobregat   Barcelona  1980       411
## 13 08001             Abrera Baix Llobregat   Barcelona  1984       700
## 14 08001             Abrera Baix Llobregat   Barcelona  1988       602
## 15 08001             Abrera Baix Llobregat   Barcelona  1992       733
## 16 08001             Abrera Baix Llobregat   Barcelona  1995      1015
## 17 08001             Abrera Baix Llobregat   Barcelona  1999       941
## 18 08001             Abrera Baix Llobregat   Barcelona  2003       883
## 19 08001             Abrera Baix Llobregat   Barcelona  2006       886
## 20 08001             Abrera Baix Llobregat   Barcelona  2010      1434
## 21 08001             Abrera Baix Llobregat   Barcelona  2012      1080
## 22 08001             Abrera Baix Llobregat   Barcelona  2015        NA
## # ... with 1.036e+04 more rows, and 22 more variables: PSC_votes <int>,
## #   PP_votes <int>, CSQP_votes <int>, ERC_votes <int>, Cs_votes <chr>,
## #   CUP_votes <chr>, JxSí_votes <int>, Other_votes <int>,
## #   Total_votes <int>, Indy_votes <dbl>, Unio_votes <dbl>,
## #   CiU_percent <dbl>, PSC_percent <dbl>, PP_percent <dbl>,
## #   CSQP_percent <dbl>, ERC_percent <dbl>, Cs_percent <dbl>,
## #   CUP_percent <dbl>, JxSí_percent <dbl>, Other_percent <dbl>,
## #   Indy_percent <dbl>, Unio_percent <dbl>
write_csv(df, "Catalan Parliament 1980-2015.csv")

2 Exploratory Data Analysis

First, I explore the vote summatory of current indy parties (CiU, ERC, CUP and JxSí) for all the elections, by county. I’m using boxplot to show the dispersion of the votes through the years, ordered by their median.

#Reading tidied data
df <- read_csv("Catalan Parliament 1980-2015.csv")

#Selection of data
df2 <- df %>%
  select(
    INE,
    Municipality,
    County,
    Vegueria,
    Year,
    Indy_votes,
    Unio_votes,
    Total_votes
  ) %>% 
  filter(
    !is.na(County)
  )

#Group by County
county <- df2 %>% 
  group_by(County,Year) %>% 
  summarise(
    indy_votes = sum(Indy_votes, na.rm = T),
    unio_votes = sum(Unio_votes, na.rm = T),
    total_votes = sum(Total_votes, na.rm = T),
    indy_percent = round(indy_votes/total_votes*100, 2),
    unio_percent = round(unio_votes/total_votes*100, 2)
  ) %>% 
  arrange(County, Year, indy_percent)

#County as factor for the plot
county$County <- as.factor(county$County)

#Plot
ggplot(county) +
  geom_boxplot(aes(x = reorder(County, indy_percent, FUN = median), y = indy_percent)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(
    title = "Votes for indy parties in Catalonia 1980-2015, by county",
    caption = "Marc Belzunces. Data from Idescat.",
    x = "County (ordered by vote median)",
    y = "Vote (%) to Catalan indy parties"
  )

_config.yml

Vote dispersion is, in general, low. 34 of the counties has a median higher than 50% of the vote to indy parties. Only in 8 counties the median is lower than 50%. Let’s see a map:

#Map of median
#Load Base County map
shape <- readOGR(dsn = ".", layer = "Comarques_WGS84")

#You need to "fortify" the shape layer to see internal GIS fields as long and lat.
#For that, you need the maptools package, that it needs the rgeos package.
#Yeah, it's not easy. Good tutorial: http://www.kevjohnson.org/making-maps-in-r/

shape <- fortify(shape, region = "Comarca")

#Calculation of the median by county, and rename for the join
med <- county %>% 
  summarise(
    median = median(indy_percent)
  ) %>% 
  rename(
    id = County
  )

#Joining spatial layer with median data, by County ("id")
#Unfortunately, tidyverse doesn't work well, I use base R merge.
map <- merge(shape, med, "id")
#map <- left_join(shape, median, "id")

#Plot map
ggplot() +
  geom_polygon(data = map, aes(x = long, y = lat, group = group, fill = median)) +
  coord_map() +
  scale_fill_distiller(palette = "Blues", trans = "reverse",
                       breaks = pretty_breaks(n = 10)) +
  guides(fill = guide_legend(reverse = TRUE)) +
  theme_nothing(legend = TRUE) +
  labs(
    title = "Votes (%) for indy parties in Catalonia 1980-2015, by county",
    caption = "Marc Belzunces. Data from Idescat.",
    fill = ""
  )

_config.yml

The 8 counties with a median <50% are situated in the coast, around the Metropolitan Area of Barcelona, Tarragona and the coastal area in between. These areas are the most populated areas in Catalonia, as total votes show if we add them to the previous boxplot figure (in orange):

#Total votes plot
ggplot(county) +
  geom_boxplot(aes(x = reorder(County, indy_percent, FUN = median), y = indy_percent)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  geom_boxplot(aes(x = reorder(County, indy_percent, FUN = median), y = total_votes/10000),
               fill = "orange") +
  labs(
    title = "Votes for indy parties in Catalonia 1980-2015, by county",
    caption = "Marc Belzunces. Data from Idescat.",
    x = "County (ordered by vote median)",
    y = "Vote (%) to Catalan indy parties"
  ) +
  scale_y_continuous(
    sec.axis = sec_axis(~ . * 10000, name = "Total votes (orange)")
  )

_config.yml

Catalan birthrate is among the lowest in the World, and these areas are so populated due to 2 recent massive immigration events. First one, corresponding to Spanish immigration around the 1970s. Second one, from North Africa, Latin America and Europe, mainly, in 2000s. So, lower indy votes is in the most populated areas, and these areas concentrate people not born in Catalonia. So, we have high vote fidelity through time for indy parties as a block (with left and right options inside the block), as low dispersion shows, and apparently fewer indy votes in those counties with more immigration. Is some statistical relationship between place of birth and vote in terms of independence/union to Spain?

In orther to explore that, I’ll use demographics data from the Spanish National Statistics (INE). This data contains the birthplace percentage by area for every municipality. I’ll try to establish a statistical predictive model.

3 Predictive Statistical Model for indy vote

First, I’ll use demographic data for 2015. Unit of analysis is municipality:

#Demographics from Spanish INE
dem <- read_csv("Place of Birth Catalonia 2002-2015.csv")
#Selection of year and most relevant variables
dem <- dem %>% 
  filter(
    Year == 2015
  )  %>% 
  mutate(
    INE = str_pad(INE, 5, "left", "0")
  ) %>% 
  select(
    INE,
    Total.Population,
    contains("Cent")
  )

#Joining demographic data and political data
pol <- df %>% 
  filter(
    Year == 2015,
    !is.na(County)
  )

total <- left_join(pol, dem, by = "INE") %>% 
  select(
    INE:Vegueria,
    Total.Population,
    Indy_percent,
    contains(".Cent"),
    Born.Outside.Catalonia.PerCent
  ) %>% 
    rename(
      Total_Outside_Cat = Born.Outside.Catalonia.PerCent
  ) %>% 
  arrange(Municipality)

glimpse(total)
## Observations: 947
## Variables: 16
## $ INE               <chr> "25001", "08001", "25002", "25003", "08002",...
## $ Municipality      <chr> "Abella de la Conca", "Abrera", "Àger", "Agr...
## $ County            <chr> "Pallars Jussà", "Baix Llobregat", "Noguera"...
## $ Vegueria          <chr> "Alt Pirineu", "Barcelona", "Lleida", "Lleid...
## $ Total.Population  <int> 170, 12071, 613, 5491, 246, 828, 2465, 912, ...
## $ Indy_percent      <dbl> 70.09, 30.52, 80.00, 72.29, 85.28, 79.41, 74...
## $ Total.Spain.Cent  <dbl> 92.35, 90.74, 87.93, 82.79, 97.15, 82.85, 94...
## $ Catalonia.Cent    <dbl> 87.06, 67.29, 79.28, 73.37, 92.68, 73.19, 80...
## $ Other.CCAA.Cent   <dbl> 5.29, 23.44, 8.65, 9.42, 4.47, 9.66, 13.47, ...
## $ Rest.World.Cent   <dbl> 7.65, 9.26, 12.07, 17.21, 2.85, 17.15, 6.00,...
## $ Europe.Cent       <dbl> 5.29, 2.05, 8.16, 7.43, 2.85, 7.73, 1.05, 5....
## $ Africa.Cent       <dbl> 0.59, 3.62, 1.14, 6.92, 0.00, 4.59, 1.99, 2....
## $ America.Cent      <dbl> 1.76, 3.15, 2.77, 2.62, 0.00, 4.59, 2.84, 1....
## $ Asia.Cent         <dbl> 0.00, 0.44, 0.00, 0.24, 0.00, 0.24, 0.12, 0....
## $ Oceania.Cent      <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ Total_Outside_Cat <dbl> 12.94, 32.71, 20.72, 26.63, 7.32, 26.81, 19....

Selected variables are percentage of indy vote and birthplace (by continent, Catalonia, Spain (rest of) and a total born outside Catalonia) for every municipality. Minimum value for total born outside Catalonia is 0.83%, median 20.33% and a maximum of 59.90%.

Now I explore a linear model to find out if there is some significant correlation with birthplace and indy vote. In order to get the best model, I used the step() function, that choose the best model according to the Akaike Information Criterion (AIC) in a stepwise Algorithm. The initial model includes all the variables.

#Selection of numerical data
total2 <- total %>% 
  select(
    Indy_percent,
    Catalonia.Cent:Total_Outside_Cat
  )

#Linear model
lm.initial <- lm(Indy_percent ~ ., data = total2)
summary(lm.initial)
## 
## Call:
## lm(formula = Indy_percent ~ ., data = total2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.156  -4.257   0.752   5.419  55.928 
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)       3474.653   5856.229   0.593    0.553
## Catalonia.Cent     -33.784     58.562  -0.577    0.564
## Other.CCAA.Cent    -36.060     58.563  -0.616    0.538
## Rest.World.Cent      6.866     64.970   0.106    0.916
## Europe.Cent        -40.973     46.242  -0.886    0.376
## Africa.Cent        -40.671     46.246  -0.879    0.379
## America.Cent       -40.797     46.240  -0.882    0.378
## Asia.Cent          -42.154     46.252  -0.911    0.362
## Oceania.Cent       -37.776     46.734  -0.808    0.419
## Total_Outside_Cat       NA         NA      NA       NA
## 
## Residual standard error: 8.652 on 938 degrees of freedom
## Multiple R-squared:  0.7291, Adjusted R-squared:  0.7267 
## F-statistic: 315.5 on 8 and 938 DF,  p-value: < 2.2e-16
lm.aic <- step(lm.initial, direction = "both")
## Start:  AIC=4095.82
## Indy_percent ~ Catalonia.Cent + Other.CCAA.Cent + Rest.World.Cent + 
##     Europe.Cent + Africa.Cent + America.Cent + Asia.Cent + Oceania.Cent + 
##     Total_Outside_Cat
## 
## 
## Step:  AIC=4095.82
## Indy_percent ~ Catalonia.Cent + Other.CCAA.Cent + Rest.World.Cent + 
##     Europe.Cent + Africa.Cent + America.Cent + Asia.Cent + Oceania.Cent
## 
##                   Df Sum of Sq   RSS    AIC
## - Rest.World.Cent  1     0.836 70217 4093.8
## - Catalonia.Cent   1    24.912 70241 4094.2
## - Other.CCAA.Cent  1    28.382 70245 4094.2
## - Oceania.Cent     1    48.911 70265 4094.5
## - Africa.Cent      1    57.896 70274 4094.6
## - America.Cent     1    58.270 70275 4094.6
## - Europe.Cent      1    58.771 70275 4094.6
## - Asia.Cent        1    62.180 70279 4094.7
## <none>                         70216 4095.8
## 
## Step:  AIC=4093.83
## Indy_percent ~ Catalonia.Cent + Other.CCAA.Cent + Europe.Cent + 
##     Africa.Cent + America.Cent + Asia.Cent + Oceania.Cent
## 
##                   Df Sum of Sq   RSS    AIC
## - Oceania.Cent     1    55.410 70273 4092.6
## - Catalonia.Cent   1    67.250 70284 4092.7
## - Africa.Cent      1    67.312 70284 4092.7
## - America.Cent     1    67.774 70285 4092.7
## - Europe.Cent      1    68.400 70286 4092.8
## - Asia.Cent        1    72.621 70290 4092.8
## - Other.CCAA.Cent  1    75.483 70293 4092.8
## <none>                         70217 4093.8
## + Rest.World.Cent  1     0.836 70216 4095.8
## 
## Step:  AIC=4092.58
## Indy_percent ~ Catalonia.Cent + Other.CCAA.Cent + Europe.Cent + 
##     Africa.Cent + America.Cent + Asia.Cent
## 
##                   Df Sum of Sq   RSS    AIC
## - Catalonia.Cent   1    19.200 70292 4090.8
## - Africa.Cent      1    19.365 70292 4090.8
## - America.Cent     1    20.487 70293 4090.9
## - Europe.Cent      1    22.198 70295 4090.9
## - Asia.Cent        1    34.836 70307 4091.0
## - Other.CCAA.Cent  1    44.792 70317 4091.2
## <none>                         70273 4092.6
## + Oceania.Cent     1    55.410 70217 4093.8
## + Rest.World.Cent  1     7.335 70265 4094.5
## 
## Step:  AIC=4090.83
## Indy_percent ~ Other.CCAA.Cent + Europe.Cent + Africa.Cent + 
##     America.Cent + Asia.Cent
## 
##                     Df Sum of Sq    RSS    AIC
## - Africa.Cent        1         3  70294 4088.9
## - America.Cent       1        69  70361 4089.8
## <none>                            70292 4090.8
## + Catalonia.Cent     1        19  70273 4092.6
## + Total_Outside_Cat  1        19  70273 4092.6
## + Rest.World.Cent    1        15  70276 4092.6
## + Oceania.Cent       1         7  70284 4092.7
## - Asia.Cent          1       926  71217 4101.2
## - Europe.Cent        1      1584  71875 4109.9
## - Other.CCAA.Cent    1    136421 206713 5110.3
## 
## Step:  AIC=4088.87
## Indy_percent ~ Other.CCAA.Cent + Europe.Cent + America.Cent + 
##     Asia.Cent
## 
##                     Df Sum of Sq    RSS    AIC
## - America.Cent       1        70  70365 4087.8
## <none>                            70294 4088.9
## + Oceania.Cent       1         8  70287 4090.8
## + Africa.Cent        1         3  70292 4090.8
## + Rest.World.Cent    1         2  70292 4090.8
## + Catalonia.Cent     1         2  70292 4090.8
## + Total_Outside_Cat  1         2  70292 4090.8
## - Asia.Cent          1       966  71260 4099.8
## - Europe.Cent        1      1670  71964 4109.1
## - Other.CCAA.Cent    1    139735 210029 5123.4
## 
## Step:  AIC=4087.82
## Indy_percent ~ Other.CCAA.Cent + Europe.Cent + Asia.Cent
## 
##                     Df Sum of Sq    RSS    AIC
## <none>                            70365 4087.8
## + America.Cent       1        70  70294 4088.9
## + Rest.World.Cent    1        38  70326 4089.3
## + Catalonia.Cent     1        38  70326 4089.3
## + Total_Outside_Cat  1        38  70326 4089.3
## + Africa.Cent        1         4  70361 4089.8
## + Oceania.Cent       1         3  70362 4089.8
## - Asia.Cent          1      1190  71554 4101.7
## - Europe.Cent        1      1859  72224 4110.5
## - Other.CCAA.Cent    1    169655 240020 5247.8
summary(lm.aic)
## 
## Call:
## lm(formula = Indy_percent ~ Other.CCAA.Cent + Europe.Cent + Asia.Cent, 
##     data = total2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.978  -4.357   0.766   5.407  56.108 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     96.21504    0.67665 142.193  < 2e-16 ***
## Other.CCAA.Cent -2.29704    0.04817 -47.683  < 2e-16 ***
## Europe.Cent     -0.33961    0.06803  -4.992 7.12e-07 ***
## Asia.Cent       -1.62320    0.40653  -3.993 7.04e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.638 on 943 degrees of freedom
## Multiple R-squared:  0.7285, Adjusted R-squared:  0.7276 
## F-statistic: 843.4 on 3 and 943 DF,  p-value: < 2.2e-16

The best linear model is a multivariate model that includes the variables born in rest of spain (Other.CCAA.Cent), born in rest of Europe and born in Asia, contributing negatively to the indy vote. With a Multiple R Square of 0.7285 and a very small p-value, rejecting the Null hypothesis (There is no relationship between birthplace and vote). This is the plot of the resulting model:

#Ploting results
ggplot(total, aes(x = Indy_percent,
                  y = 96.21504+(-2.29704*Other.CCAA.Cent-0.33961*Europe.Cent-1.62320*Asia.Cent))) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(
    title = "Prediction model for indy vote in Catalonia by municipality",
    subtitle = "Multiple R-squared:  0.7285, Adjusted R-squared:  0.7276, p-value: < 2.2e-16.
y = 96.21504+(-2.29704*Other.CCAA.Cent-0.33961*Europe.Cent-1.62320*Asia.Cent)",
    x = "Indy Vote (%)",
    y = "Model",
    caption = "Marc Belzunces"
  )

_config.yml

Residuals, the difference between real and prediction results, have a normal distribution around 0 with a standard deviation of 8.62, with some extreme isolated residual.

#Residuals
resid <- as.tibble(lm.aic$residuals)

ggplot(resid, aes(value)) +
  geom_histogram(bins = 40) +
  labs(
    title = "Distribution of model residual values",
    caption = "Marc Belzunces"
  )

_config.yml

4 Conclusions

According to the model, in Catalonia the vote in terms of independence from/union with Spain is largely determined by birthplace (ethnic vote). Most populated areas, with around 40-50% of inhabitants born outside Catalonia (typically Spanish-speaking individuals) are those areas with fewer votes to indy parties. Rest of areas, with fewer population and most of people born in Catalonia (typically Catalan-speaking individuals) has a clear indy majority.

Predictive variables are born in rest of Spain, rest of Europe and Asian. Asian community are a significant group (specially from Pakistan) but probably most of them has no right to vote because they don’t have Spanish nationality. In contrast, Latin American, which are not included in the model and a significant number of them have the right to vote, are one of the largest immigrant communities in Catalonia and, according to polls belligerent with the idea of the independence of Catalonia. But, well, model show a prediction capability, not necessarily a explanatory capability (Correlation is not causation). Probably a more refined model could be found if as a unit of analysis are used, instead of municipalities (~1,000), census tracts (~5,000).

Written on May 28, 2017