A few days ago I discovered the ggjoy package, an extension for ggplot2, on Twitter. Thanks to Ilya Kashnitsky, here I’m playing with ggjoy using, as usual, Catalan election data. As a former geologist, I find density plots great fun. I spent several years working with grain-size density plots and the sedimentological concepts associated with them. They are easily applicable to data science, like other techniques that geologists have used for a century (read the comment about the map at the end of the article).
This is the second post about text mining in R. In the previous one, I analysed text from 323,000 tweets. Now I’m analysing one of my favourite authors in Catalan, Enric Vila, using the only two of his books available for Kindle: Londres-París-Barcelona: Viatge al cor de la tempesta and Un estiu a les trinxeres. Basically, I’m following the excellent book Text Mining with R.
Text mining in social networks is one of the interesting things to do in data science. In the previous post I got 323,000 tweets from all the MPs in the Catalan regional Parliament, focusing on language use. Here, I’ll analyze the most frequent words in Catalan and Spanish (95% of the tweets). I’m using the following R packages: tidyverse, grid, gridExtra and tidytext. As a reference, I’m using the book Text Mining with R.
The goal of this exercise is to get hundreds of thousands of tweets through R and analyze the language used on this social network. To achieve that, I used the accounts of MPs (Members of Parliament) from the Catalan regional Parliament, which is composed of 135 members. I focused on the use of Catalan and Spanish, as language is a key factor in Catalan politics. I’ll compare each party’s public discourse about languages with the real, daily use on Twitter by its MPs.
With digitalization (I’m currently using zero paper), there is a lot of information in PDFs. It is a nice file format for printing and for reading on tablets (or annotating in GoodNotes), but it’s difficult to work with from a data science perspective. Here I show an example using my electricity bills. Nowadays utilities send bills by email and, generally, people later copy the relevant information manually into spreadsheets. In this example, I’ll read all the files, detect and extract the information, build a data table with it (a spreadsheet), and generate a mosaic of plots showing graphical information in more detail than any bill can show. All automatically.
This is the second post about the Catalan ethnic vote. In the first one, I defined the Catalan vote in terms of independence of Catalonia/union with Spain, using a model based on birthplace. In this second one, I’ll explore the time series of the vote for the Catalan ethnic parties (CiU, ERC, CUP and JxSí) in Catalan regional elections. As in the previous post, election data comes from the Statistical Institute of Catalonia (Idescat), processed with RStudio and plotted with ggplot2.
In this little exercise I’m exploring the vote in Catalonia in terms of independence from/union with Spain, a key political issue in Catalonia in recent years, testing the hypothesis that this vote is conditioned by birthplace.
I analyze data from the Modular Survey of Social Habits (EMHS 2010) from the Government of the Balearic Islands. This survey contains linguistic variables focused on the Catalan and Spanish languages, and it’s one of the few surveys in the Catalan linguistic area that includes this kind of information. The data show that both groups, Catalan native speakers (CatNS) and Spanish native speakers (EspNS), transmit their group language with high loyalty. As Catalan sociolinguists have pointed out, there is some language shift, with EspNS transmitting Catalan to their children. The data show this, but in very small figures. In addition to this shift, CatNS also exhibit language shift into Spanish of the same magnitude, a fact not previously pointed out, at least publicly. The main reason for the language shift seems to be the language of the couple in linguistically mixed couples, with significant differences between the Catalan and Spanish linguistic groups. In conclusion, the exogenous linguistic group (EspNS) shows the same behavior as the endogenous linguistic group (CatNS) in language transmission in the Balearic Islands. EspNS show significantly different behavior compared to the other exogenous linguistic groups, which show clearly less linguistic loyalty and more language shift. The survey didn’t include any variable about birth rate, a key factor in determining which of the linguistic groups are expanding demographically and which are not. Code to reproduce the analysis in R is provided.
This exercise represents an intentional worst-case scenario for machine learning, as my aim was to test the results on a scarce dataset (n=52). In the Catalan linguistic area in Spain there is no census data about native speakers. Indeed, there is no really useful data about the Catalan language from a data science perspective. I focus my attention on this case because until 1950 virtually 100% of the population of the Catalan linguistic area were Catalan native speakers, but from 1950 until now the population has doubled through immigration, which, together with demographic and political factors, has made Catalan a minority language in most of its former linguistic area. Currently, we don’t have extensive quantitative data to know the real situation or evolution of the Catalan language in its territory. The only usable data with associated geographic information that allows it to be connected to census data is the Demographic Survey 2007 of Catalonia, which contains data about first spoken language (native language/mother tongue) at some administrative levels. I used this data to build a statistical model with machine learning to predict the percentage of Catalan native speakers in every unit of an administrative level. Here I show the results for the lowest administrative level in Spain, the sección censal (direct translation: Census Section), which typically comprises around 2,000 inhabitants.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset). The goal of the project is to predict the manner in which they did the exercise, using any of the other variables as predictors. This was the course project for the Practical Machine Learning course, part of the Data Science Specialization by Johns Hopkins University on Coursera.
The study focuses on fuel consumption (MPG, miles per gallon): which variables are most significant and, specifically, whether manual or automatic transmission is better for MPG. We built a multivariable linear model that explains the relationship with MPG. This model shows that cars with manual transmission get more MPG than those with automatic transmission, which is also confirmed with a t-test and a boxplot. This was the course project for the Regression Models course, part of the Data Science Specialization by Johns Hopkins University on Coursera.
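The comparison described above can be sketched roughly as follows. This is a minimal sketch, not the code from the post: it assumes the data are R’s built-in mtcars dataset (where am is 0 for automatic and 1 for manual), and the covariates wt and hp in the model are illustrative choices rather than the ones actually selected in the study.

```r
# Assumption: the built-in mtcars dataset stands in for the study's data.
data(mtcars)

# Recode transmission (0 = automatic, 1 = manual) as a labelled factor.
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Welch two-sample t-test of MPG by transmission type.
tt <- t.test(mpg ~ am, data = cars)
print(tt)

# Multivariable linear model; wt and hp are hypothetical covariates.
fit <- lm(mpg ~ am + wt + hp, data = cars)
print(summary(fit)$coefficients)

# Boxplot of MPG by transmission type.
boxplot(mpg ~ am, data = cars, xlab = "Transmission", ylab = "MPG")
```

The t-test compares the two groups directly, while the linear model estimates the transmission effect after adjusting for the other variables included.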
In this project we’re going to analyze the ToothGrowth data in the R datasets package. This dataset records the effect of vitamin C on tooth growth in guinea pigs. The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice or ascorbic acid (a form of vitamin C, coded as VC). This was the second part of the course project for the Statistical Inference course, part of the Data Science Specialization by Johns Hopkins University on Coursera.
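As a quick illustration of the dataset’s structure, the snippet below loads ToothGrowth and summarizes the response by delivery method and dose. It is only a sketch: the variable name group_means and the choice of a Welch t-test on delivery method are assumptions for illustration, not necessarily the analysis carried out in the post.

```r
# ToothGrowth ships with base R's datasets package.
data(ToothGrowth)

# Columns: len (response), supp (OJ or VC), dose (0.5, 1, 2 mg/day).
str(ToothGrowth)

# Mean tooth length for each delivery method and dose combination.
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
print(group_means)

# Illustrative Welch t-test comparing the two delivery methods.
supp_test <- t.test(len ~ supp, data = ToothGrowth)
print(supp_test)
```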
In this project we will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. We set lambda = 0.2 for all of the simulations. We will investigate the distribution of averages of 40 exponentials over a thousand simulations. This was the first part of the course project for the Statistical Inference course, part of the Data Science Specialization by Johns Hopkins University on Coursera.
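The simulation described above can be sketched in a few lines of R. This is a minimal sketch, not the code from the post: the seed and the variable names (n_sims, sim_means) are assumptions chosen for illustration.

```r
set.seed(1)      # arbitrary seed, assumed here for reproducibility

lambda <- 0.2    # rate parameter
n      <- 40     # exponentials per average
n_sims <- 1000   # number of simulated averages

# Each simulation draws n exponentials and keeps their mean.
sim_means <- replicate(n_sims, mean(rexp(n, rate = lambda)))

# Theoretical values for the distribution of the averages.
theoretical_mean <- 1 / lambda               # 5
theoretical_sd   <- (1 / lambda) / sqrt(n)   # standard error of the mean

print(c(simulated = mean(sim_means), theoretical = theoretical_mean))
print(c(simulated = sd(sim_means), theoretical = theoretical_sd))

# The histogram of averages should look approximately normal.
hist(sim_means, breaks = 30, main = "Averages of 40 exponentials")
```

By the Central Limit Theorem, the averages should be centred near 1/lambda = 5, with a standard deviation near (1/lambda)/sqrt(40), about 0.79.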
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. This was the second course project for the Reproducible Research course, part of the Data Science Specialization by Johns Hopkins University on Coursera.
It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data. This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consist of two months of data from an anonymous individual, collected during October and November 2012, and include the number of steps taken in 5-minute intervals each day. This was my peer assessment for the Reproducible Research course, part of the Data Science Specialization by Johns Hopkins University on Coursera.