Extreme Machine Learning. Prediction of Catalan native speakers by administrative level

This exercise is a deliberate worst-case scenario for machine learning: my aim was to test the results on a very small dataset (n=52). There is no census data on native speakers in the Catalan linguistic area in Spain. Indeed, from a data science perspective there is no really useful data about the Catalan language. I focus on this case because until 1950 virtually 100% of the population of the Catalan linguistic area were native Catalan speakers, but since 1950 the population has doubled through immigration. That, together with demographic and political factors, has made Catalan a minority language in most of its former linguistic area. Currently, we don’t have extensive quantitative data on the real situation or evolution of the Catalan language in its territory. The only usable data with associated geographic information, which allows it to be linked to census data, is the Demographic Survey 2007 of Catalonia, which contains data on first spoken language (native language/mother tongue) for some administrative levels. I used this data to build a statistical model with machine learning that predicts the percentage of native Catalan speakers in every unit of an administrative level. Here I show the results for the lowest administrative level in Spain, the sección censal (literally, census section), which typically has around 2,000 inhabitants.

1 Methods

I used data on native Catalan speakers from the Demographic Survey 2007 of Catalonia. I also used census data from the continuous municipal register statistics of the Spanish Statistical Institute (INE). For mapping, I used cartographic databases from INE and the Cartographic and Geological Institute of Catalonia (ICC). RStudio with R 3.2.4 was used to process the data, with the caret library for the machine learning. QGIS was used to merge the results with the cartographic databases, and CartoDB to publish the map.

I use native language as a synonym for first spoken language or mother tongue.

1.2 Machine Learning and selection of census variables

To train the model I initially selected 50 predictors (variables) from the INE databases for each of the 8,780 census sections of the Catalan linguistic area in Spain (Catalonia, Valencia, the Balearic Islands and a strip of Aragon). While the linguistic data is from 2007 (the only year available), the census data is from 2011. The reason is that 2011 is the first year for which country of birth for foreign residents by census section is available on the INE website.

After several tests, I kept 6 predictors, since this made the model simpler while its performance stayed more or less the same:

  • Born in the same Autonomous Community (CCAA)
  • Born in another Autonomous Community (CCAA)
  • Born in each continent except Oceania, that is Europe, Africa, America and Asia (numbers for Oceania were 0 or close to 0)

Some entries from the resulting dataset, as an example:

##   Same.CCAA Other.CCAA Europe Africa America Asia Catalan.Percent
## 1     70.64      18.32   3.88   4.32    2.56 0.27            51.2
## 2     59.06      16.62  10.80   6.95    6.06 0.50            32.1
## 3     71.37      16.82   2.47   4.92    4.03 0.38            47.0
## 4     69.31      14.50   9.98   1.20    4.72 0.28            55.5
## 5     63.84      21.46   7.40   2.69    4.44 0.17            50.3
## 6     70.33      20.44   2.28   3.81    2.79 0.33            44.5
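As a rough illustration of how rows like these could be derived, here is a minimal Python sketch that turns counts of residents by place of birth into percentage predictors. It is not the R code used for this post: the counts are made up, the `to_percentages` helper is hypothetical, and dividing by the sum of the six categories (ignoring Oceania) is an assumption.

```python
# Predictor names follow the columns of the table above.
PREDICTORS = ["Same.CCAA", "Other.CCAA", "Europe", "Africa", "America", "Asia"]

def to_percentages(counts):
    """Convert birthplace counts into percentages of the section's population.

    Assumption: the denominator is the sum of the six categories only.
    """
    total = sum(counts.get(name, 0) for name in PREDICTORS)
    if total == 0:
        raise ValueError("census section with no inhabitants")
    return {name: round(100.0 * counts.get(name, 0) / total, 2) for name in PREDICTORS}

# Made-up counts for one census section of 2,000 inhabitants.
section_counts = {"Same.CCAA": 1400, "Other.CCAA": 360, "Europe": 80,
                  "Africa": 90, "America": 60, "Asia": 10}
section_percent = to_percentages(section_counts)
print(section_percent)
```

Working with percentages rather than raw counts keeps census sections of different sizes comparable with the comarques and cities used for training.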

So, for the machine learning I had 6 predictors from the census data and only the 52 observations from the Demographic Survey of Catalonia: the 41 comarques (counties) and the 11 cities with more than 100,000 inhabitants in Catalonia (Badalona, Barcelona, l’Hospitalet de Llobregat, Mataró, Sabadell, Santa Coloma de Gramenet, Terrassa, Girona, Lleida, Reus and Tarragona).

2 Results

For the machine learning I used Random Forest (RF); a second goal was to test how RF behaves with this kind of data. 10-fold cross-validation (k = 10) was used to evaluate the model.

## Random Forest 
## 
## 48 samples
##  6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 42, 43, 43, 43, 43, 44, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   RMSE SD   Rsquared SD
##   2     6.513228  0.9204508  2.967623  0.10148635 
##   4     6.428330  0.9271926  2.392436  0.09172418 
##   6     6.375740  0.9323328  2.135887  0.07502963 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 6.
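To make the resampling step concrete, here is a minimal Python sketch of a plain 10-fold split of 48 samples. It only illustrates the idea: caret builds its folds in its own way, so the training-set sizes below (43 or 44) do not exactly match the 42 to 44 reported above. The `k_fold_indices` helper and the seed are hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(n_samples=48, k=10)
training_sizes = []
for held_out in folds:
    held_out_set = set(held_out)
    training = [i for i in range(48) if i not in held_out_set]
    # A model would be fitted on `training` and scored on `held_out` here;
    # the per-fold RMSE values are then averaged.
    training_sizes.append(len(training))

print(sorted(training_sizes))
```

Each observation is held out exactly once, so every fold's error is measured on only 4 or 5 comarques or cities, which is one reason the RMSE SD column is so large relative to the RMSE itself.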

With 52 samples in total, 48 of them used for training, only 4 remain for testing, which unfortunately is not enough to test the model properly:

##    testing[, 7] prediction.rf
## 24         56.7      63.27637
## 28         63.2      62.48968
## 45         28.9      19.63966
## 51         31.4      29.14161

Some predictions are pretty close, but others differ from the survey value by almost 10 percentage points.
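The size of these deviations can be summarised with a few lines of Python. This is only a worked calculation on the four test rows printed above, not part of the original R analysis.

```python
import math

# Survey values and Random Forest predictions copied from the four test rows above.
survey = [56.7, 63.2, 28.9, 31.4]
predicted = [63.27637, 62.48968, 19.63966, 29.14161]

errors = [p - s for p, s in zip(predicted, survey)]
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)
largest = max(abs(e) for e in errors)

print(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}, largest error = {largest:.2f}")
# prints: RMSE = 5.80, MAE = 4.70, largest error = 9.26
```

The test RMSE of about 5.8 percentage points is of the same order as the cross-validated RMSE of about 6.4, but with only four points this agreement should not be over-interpreted.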

You can explore the real census data and the model predictions for all 8,780 census sections in this map. Click on the arrows to enlarge the map; hovering the mouse pointer over the map shows the data:

3 Discussion

I have worked with demographic and political data on Catalonia for the last decade. As far as I can tell, despite the strong data limitations of the model, the big picture makes sense, at least for Catalonia. I don’t have as much experience with the other territories, but I really think the model makes sense there too. On the other hand, bear in mind that this is the first time (as far as I know) that anyone has attempted to quantify the number of native Catalan speakers by administrative level, so this is a starting point to improve on in the future.

Nevertheless, the model has strong limitations that must be kept in mind when interpreting the results. The main issues are:

-Percentage deviations. As I said above, there isn’t enough data to test the model properly (overfitting, real accuracy…). Deviations of up to 10 percentage points have been found between the survey and the model predictions. However, I don’t think this is extremely important for this kind of data. At worst, consider this a qualitative model rather than a quantitative one.

-Lower and higher prediction values. There are few values at the lower and higher ends to train the model. At the municipality level (not shown here), the prediction for Santa Coloma de Gramenet (survey: 6% native Catalan speakers) is 12%. Treat predicted values of around 15% as being closer to 5% than to 15%. The same may hold at the higher end: the highest value predicted by the model is 77%, and the real figures are probably higher, 85% or more.
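One general property of tree ensembles is relevant to this point: every prediction is an average of training target values, so it can never fall below the lowest or rise above the highest value seen in training. The Python sketch below illustrates this with a bagged ensemble of one-split regression trees (stumps), a simplified stand-in for a Random Forest. The data are made up and the function names are hypothetical; it is not the model used in this post.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree: choose the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical in this bootstrap sample
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def fit_bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Fit each stump on a bootstrap sample (drawn with replacement) of the data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return stumps

def predict(stumps, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)

# Made-up training data: x = % born in the same CCAA, y = % native Catalan speakers.
xs = [45.0, 52.0, 58.0, 63.0, 70.0, 76.0, 82.0, 88.0]
ys = [6.0, 18.0, 30.0, 41.0, 52.0, 63.0, 71.0, 77.0]
model = fit_bagged_stumps(xs, ys)

# Predictors far outside the training range still give predictions inside [6, 77].
low, high = predict(model, 20.0), predict(model, 99.0)
print(low, high)
```

Because leaves average several observations, predictions near the extremes tend to be pulled towards the middle of the training range, which is consistent with the compression described above.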

-Ratio of Catalan to Spanish native speakers among those born in the same Autonomous Community. This is a key factor. Before 1950, virtually 100% of the population of the studied area were native Catalan speakers, with few exceptions (the cities of Alacant and València, and some very small social groups). In the 60’s and 70’s the population of the area doubled through immigration from Spain, virtually all of it native Spanish speakers. So, at the beginning of this scenario, born in the same CCAA = 100% native Catalan speakers, and born in other CCAA = 100% native Spanish speakers. The descendants of both groups, all born in the same CCAA, have been passing on their own native language (a few adopting Catalan, at least in Catalonia), so the share of native Catalan speakers within the variable born in the same CCAA has been falling as it incorporates native Spanish speakers. This means that in rural areas (with less immigration from Spain) the model may be underestimating the real figures (as noted in the previous point), while in urban areas, where immigration from Spain was concentrated, the model may be overestimating them. This could be improved in the future by adding an urban/rural variable to train the model, but I think that with only 52 observations the model won’t improve for now.

-Border issue. The Catalan linguistic area in Spain is divided among several CCAA, so born in the same CCAA means born in Catalonia, Valencia, the Balearic Islands or Aragon, depending on the place of residence. This means that native Catalan speakers living in a different CCAA of the same linguistic area are treated by the model as born in other CCAA, on the same footing as native Spanish speakers from other CCAA. This clearly affects the border areas between Catalonia and Valencia (in the middle of the map), and especially the border area between Aragon and Catalonia (upper-left part of the map). Although there are historical ties between the territories in these areas (they speak the same language, and the geography favors these ties), the effect seems to affect the Valencian and Aragonese territories more than the Catalan ones. Born in other CCAA is clearly higher on the non-Catalan side of the border (hovering the mouse pointer over a place on the map shows the data). The most affected county (Baix Cinca, Aragon, with the city of Fraga) has very good connections with the city of Lleida (Catalonia, 25 km away) via 2 highways. Is the reason the higher taxes in Catalonia compared to Valencia or Aragon? On the Catalan side of the border, Tremp and Talarn also show a higher percentage of *born in other CCAA*. However, this case could be explained by the Spanish military base at Talarn. I think that here the variable really is showing people born outside the Catalan linguistic area.

I think this border problem could be solved. The Catalan Statistical Institute has data on place of birth broken down by CCAA for 2014 (not public for other years). This means that data on those born in other CCAA, broken down by the actual CCAA of birth, does exist, but unfortunately INE doesn’t make it public.
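If such a breakdown were available, the recoding could look roughly like the following Python sketch. It assumes a table of counts by CCAA of birth for each census section, which is exactly the data that is not public. The `split_other_ccaa` helper, the output names and the counts are hypothetical.

```python
# CCAA that lie wholly or partly in the Catalan linguistic area, as listed above.
# Aragon is only partly inside (a strip), so counting all of it as "inside" is a simplification.
LINGUISTIC_AREA = {"Catalonia", "Valencia", "Balearic Islands", "Aragon"}

def split_other_ccaa(residence_ccaa, born_by_ccaa):
    """Split 'born in other CCAA' counts into inside/outside the linguistic area."""
    inside = outside = 0
    for birth_ccaa, count in born_by_ccaa.items():
        if birth_ccaa == residence_ccaa:
            continue  # already covered by the 'born in the same CCAA' predictor
        if birth_ccaa in LINGUISTIC_AREA:
            inside += count
        else:
            outside += count
    return {"Other.CCAA.inside": inside, "Other.CCAA.outside": outside}

# Made-up counts for a census section located in Aragon.
example = split_other_ccaa(
    "Aragon",
    {"Aragon": 900, "Catalonia": 300, "Andalusia": 120, "Valencia": 30},
)
print(example)
# prints: {'Other.CCAA.inside': 330, 'Other.CCAA.outside': 120}
```

With two predictors instead of one, people born across the border in Catalonia would no longer be pooled with those born elsewhere in Spain.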

-Valencia and Alacant cities. Sociolinguists explain that linguistic substitution from Catalan to Spanish occurred in Alacant at the beginning of the 20th century, and likewise in Valencia city. My first option was to exclude these cities from the model, since in both of them born in the same CCAA is equivalent to born in other CCAA. But, as the model shows, even assuming that no linguistic substitution occurred, the model figures for native Catalan speakers are close to 0, so the extinction of the Catalan language in these cities could also be explained by recent immigration (over the last 50 years). So, keep this in mind while exploring the data for these two cities.

Based on the results of the model, here is my interpretation of the situation of the Catalan language in its linguistic area (please keep the previous discussion in mind):

4 Conclusions

As far as I know, this is the first attempt to quantify native Catalan speakers in all units of an administrative level. The model shown here has strong data limitations, but overall I think the big picture makes sense. With general census data we can get at least an idea of the number of native Catalan speakers by administrative level. In the future, the model could be improved by adding an urban/rural variable, by breaking down the variable born in other CCAA by actual CCAA of birth and, of course, with more linguistic data.

From a data science point of view, we need a variable of native language included in the official census, as other countries do. This would allow a lot of research.

For any comments, you can use Twitter (@marcbeldata).

Written on April 28, 2016