Retrieving data from PDFs in R, an example with utilities bills

With digitalization (I’m currently using zero paper), there is a lot of information in PDFs. A nice file format for printing and reading in tablets (or annotate in GoodNotes) but it’s difficult working with it from a Data Science perspective. Here I show an example using my electricity utility bills. Nowadays utilities send bills by email and, generally, people later copy manually the relevant information to spreadsheets. In this example, I’ll read all the files, detect and extract information, build a data table with it (a spreadsheet), and generate a mosaic of plot showing graphical information in more detail than any bill can show. All automatically.

1 Methods

I’m using the following packages:

library(tidyverse) #general use
library(pdftools) #read pdfs
library(stringr) #work with strings
library(lubridate) #work with dates

I have 14 bills in PDF from my company, PepeEnergy, a company that provides electricity from sources 100% renewables, and it’s one of the cheapest. The important information in the PDF is this:

_config.yml

Steps:

  • Reading every PDF with pdftools (txt <- pdf_txt(filename)).

  • Extracting the relevant information with stringr (str_extract(txt, pattern)). Examples of patterns:
    • ”.\……€ kW/día”, where every dot is a position and \. is the decimal dot.
    • ”.\……€ kWh”
    • …\…kWh”
    • “…días”
  • Generating a data table
  • Adding new variables from the data table to rebuild the bill information.
  • Doing that for every file in my folder (files <- file.list(“.”, pattern)), following a pattern in the bills filenames, that I always introduce manually to order them in Finder: bill’s date, company name, invoiced month.

2 Extracting information from the bills

Here is the code for the previous steps:

#List of utility bills
files <- list.files(".", pattern = "PepeEnergy ")

#Loop for all files
final <- tibble()

for (i in files) {
  #Reading pdf
  txt <- pdf_text(i)
  
  #Extraction of relevant data
  Date <- str_sub(i, start = 1, end = 10)
  Power_Price <- str_extract(txt, ".\\......€ kW/día")[1] %>%
    str_sub(1,7) %>% as.numeric()
  Electricity_Price <- str_extract(txt, ".\\......€ kWh")[1] %>%
    str_sub(1,7) %>% as.numeric()
  kWh <- str_extract(txt, "...\\...kWh")[1] %>%
    str_sub(1,6) %>% as.numeric()
  Power <- str_extract(txt, ".\\..kW")[1] %>%
    str_sub(1,3) %>% as.numeric()
  Days_Invoiced <- str_extract(txt, "...días")[1] %>%
    str_sub(1,2) %>% as.numeric()
  Electricity_Tax_PerCent <- str_extract(txt, ".\\.........%")[1] %>%
    str_sub(1,7) %>% as.numeric()
  Rent_Price <- str_extract(txt, ".\\......€/día")[1] %>%
    str_sub(1,7) %>% as.numeric()
  
  #Dataframe with all the data and generation of new variables
  df <- cbind(Date,Electricity_Price, Power_Price, kWh, Rent_Price, Electricity_Tax_PerCent,
              Days_Invoiced) %>% 
    as_tibble() %>% 
    mutate(
      Date = ymd(Date),
      Year = as.factor(year(Date)),
      Month = month(Date, label = TRUE),
      Electricity_Price = as.numeric(Electricity_Price),
      Power_Price = as.numeric(Power_Price),
      kWh = as.numeric(kWh),
      Rent_Price = as.numeric(Rent_Price),
      Electricity_Tax_PerCent = as.numeric(Electricity_Tax_PerCent),
      Days_Invoiced = as.numeric(Days_Invoiced),
      Cost_Rent = Rent_Price*Days_Invoiced,
      Cost_Power = Power*Power_Price*Days_Invoiced,
      Cost_Electricity = Electricity_Price*kWh,
      Cost_Electricity_Tax = Electricity_Tax_PerCent/100*(Cost_Power+Cost_Electricity),
      Total_Cost = Cost_Power+Cost_Electricity+Cost_Electricity_Tax+Cost_Rent,
      VAT_21_PerCent = Total_Cost*0.21,
      Daily_kWh = round(kWh/Days_Invoiced, 2),
      Final_Cost = round(Total_Cost*1.21,2),
      Daily_Cost = round(Final_Cost/Days_Invoiced, 2),
      Total_Cost_kWh= round(Final_Cost/kWh, 3)
    ) %>% 
    select(
      Date,
      Year,
      Month,
      kWh,
      Electricity_Price:VAT_21_PerCent,
      Daily_kWh,
      Daily_Cost,
      Total_Cost_kWh,
      Final_Cost
    )
  #All the data
  final <- rbind(final, df)
  }

So, I reconstructed a spreadsheet from my PDFs! Updated with new bills in a few seconds with the same code.

head(final)

## # A tibble: 6 × 19
##         Date   Year Month   kWh Electricity_Price Power_Price Rent_Price
##       <date> <fctr> <ord> <dbl>             <dbl>       <dbl>      <dbl>
## 1 2016-04-30   2016   Apr   127           0.09898     0.10394    0.02714
## 2 2016-05-31   2016   May   263           0.09141     0.10394    0.02667
## 3 2016-06-30   2016   Jun   168           0.09314     0.10394    0.02667
## 4 2016-07-31   2016   Jul   166           0.10149     0.10394    0.02645
## 5 2016-08-31   2016   Aug   199           0.10543     0.10394    0.02667
## 6 2016-09-30   2016   Sep   268           0.10648     0.10394    0.02667
## # ... with 12 more variables: Electricity_Tax_PerCent <dbl>,
## #   Days_Invoiced <dbl>, Cost_Rent <dbl>, Cost_Power <dbl>,
## #   Cost_Electricity <dbl>, Cost_Electricity_Tax <dbl>, Total_Cost <dbl>,
## #   VAT_21_PerCent <dbl>, Daily_kWh <dbl>, Daily_Cost <dbl>,
## #   Total_Cost_kWh <dbl>, Final_Cost <dbl>

glimpse(final)

## Observations: 14
## Variables: 19
## $ Date                    <date> 2016-04-30, 2016-05-31, 2016-06-30, 2...
## $ Year                    <fctr> 2016, 2016, 2016, 2016, 2016, 2016, 2...
## $ Month                   <ord> Apr, May, Jun, Jul, Aug, Sep, Oct, Nov...
## $ kWh                     <dbl> 127, 263, 168, 166, 199, 268, 272, 180...
## $ Electricity_Price       <dbl> 0.09898, 0.09141, 0.09314, 0.10149, 0....
## $ Power_Price             <dbl> 0.10394, 0.10394, 0.10394, 0.10394, 0....
## $ Rent_Price              <dbl> 0.02714, 0.02667, 0.02667, 0.02645, 0....
## $ Electricity_Tax_PerCent <dbl> 5.1127, 5.1127, 5.1127, 5.1127, 5.1127...
## $ Days_Invoiced           <dbl> 7, 33, 30, 31, 30, 30, 32, 31, 31, 31,...
## $ Cost_Rent               <dbl> 0.18998, 0.88011, 0.80010, 0.81995, 0....
## $ Cost_Power              <dbl> 3.201352, 15.092088, 13.720080, 14.177...
## $ Cost_Electricity        <dbl> 12.57046, 24.04083, 15.64752, 16.84734...
## $ Cost_Electricity_Tax    <dbl> 0.8063654, 2.0007487, 1.5014773, 1.586...
## $ Total_Cost              <dbl> 16.76816, 42.01378, 31.66918, 33.43091...
## $ VAT_21_PerCent          <dbl> 3.521313, 8.822893, 6.650527, 7.020491...
## $ Daily_kWh               <dbl> 18.14, 7.97, 5.60, 5.35, 6.63, 8.93, 8...
## $ Daily_Cost              <dbl> 2.90, 1.54, 1.28, 1.30, 1.50, 1.82, 1....
## $ Total_Cost_kWh          <dbl> 0.160, 0.193, 0.228, 0.244, 0.227, 0.2...
## $ Final_Cost              <dbl> 20.29, 50.84, 38.32, 40.45, 45.10, 54....

3 Ploting data in a mosaic

Basically, I’m doing that because I want to see graphically the data and its evolution through time. As I want a mosaic of plots, and unfortunately ggplot2 can’t do that itself, I’m using the function for multiple graphs found in Cookbook for R’s book:

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Now, I’m defining a series of plots that I’m interested in. Aesthetics is not important, as they are plots for Exploratory Data Analysis. This is the first version, so probably I’ll improve them in future.

#Plots
#Firts plot: time series

p1 <- ggplot(final, aes(Date, Electricity_Price)) +
  geom_line() +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE ) +
  labs(
    title = "kWh price evolution",
    y = "EUR/kWh"
  ) +
  scale_x_date(date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

#Third plot: Annual comparison

p3 <- ggplot(final, aes(Month, Electricity_Price, color = Year, group = Year)) +
  geom_line() +
  geom_point() +
  guides(color = guide_legend(reverse = TRUE)) +
  scale_color_manual(values = c("red", "blue")) +
  labs(
    title = "kWh price",
    x = "Month",
    y = "EUR/kWh",
    color = "Year"
  ) +
  theme(legend.position = "bottom")

#Seventh plot: Increment price kWh (%) by month

p7 <- final %>% 
  group_by(Month) %>% 
  summarise(
    Increment = (Electricity_Price[2] - Electricity_Price[1])/Electricity_Price[2]*100
  ) %>% 
  filter(
    !is.na(Increment)
  ) %>% 
  ggplot() +
  geom_bar(aes(Month, Increment), stat = "identity", width = 0.5) +
  labs(
    title = "Annual kWh price increase (%) ",
    y = "Percent"
  )  

#Eighth: evolution total price kWh

p8 <- ggplot(final, aes(Date, Total_Cost_kWh)) +
  geom_line() +
  geom_point() +
  geom_smooth(method = "loess", se = F) +
  labs(
    title = "Evolution of total kWh price",
    y = "EUR/kWh"
  ) +
  scale_x_date(date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

And now the mosaic for all these plots, with the function I mentioned before. I’m only showing 4 of 8 (privacy), with price evolution:

multiplot(p1, p8, p3, p7, cols = 2)

_config.yml

Written on June 3, 2017