The goal of this lab is to show how to access the Velib data provided by JCDecaux and how to visualize it. We will use R and RStudio to perform this task. We do not assume any prior knowledge of R, but you should be able to modify some code by the end of the tutorial.

A very short introduction to R

R is a domain-specific language dedicated to data analysis. It can be seen as an open-source implementation of S, a statistical programming language invented in 1976. It is widely used in both industry and academia. Its main strength is the huge number of packages available, while its main drawback is its in-memory processing design, which limits the size of the data it can handle. R nevertheless remains a very powerful tool for designing a data processing chain.

R is an interpreted language, typically accessed through a command-line interpreter. RStudio is an R-oriented IDE (Integrated Development Environment) that greatly eases the use of R. In this lab, we will use various packages from the R universe: jsonlite (as well as httr) to download the data, dplyr to format it and ggplot2 (as well as gridExtra and ggmap) to visualize it.

In RStudio, a console in which commands can be entered is available in the bottom left part of the interface. This lab is not a real introduction to R, but we are going to look at a few examples to understand the basic syntax by using R as a calculator:

1 + 1 #If we press Enter, R will give the following result
## [1] 2
sin(2) #We can use mathematical functions
## [1] 0.9092974
a  <- 2 #Store a variable
a #View it
## [1] 2
exp(a) #Use it
## [1] 7.389056

The most important command is, as in most interpreted languages, help, and we will probably use it a lot.

help(exp)

Note that in RStudio, you can also use F1 to ask for help.

Instead of working directly in the console, a much better way is to write a script in which the commands can be more easily edited. In RStudio, this can be done using the File menu.

We will see through this lab that R lets us define functions, store data sets in a dedicated format, the data frame, and analyze those data frames.
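
As a small illustration of these three ideas, here is a sketch on made-up data. The function name Occupancy and the toy values are ours and are not part of the Velib data.

```r
# A function: the value of the last expression is returned
Occupancy <- function(bikes, stands) {
  bikes / stands
}

# A data frame: one row per observation, one column per variable
Toy <- data.frame(
  station = c("A", "B", "C"),
  bikes   = c(5, 0, 12),
  stands  = c(20, 10, 24)
)

# An analysis step: add a computed column, then summarise it
Toy$occupancy <- Occupancy(Toy$bikes, Toy$stands)
mean(Toy$occupancy)
## [1] 0.25
```

The same pattern (define a function, build a data frame, compute on its columns) is what the rest of the lab applies to the real data.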

Package installation

The first step is to install the packages listed above, if this is not already done, using the Packages tab of RStudio (or install.packages if you use the console). Once this is done, we can load a package with library or call one of its functions directly with the package::function syntax.
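
From the console, the installation step might look like the sketch below. The helper name MissingPackages is ours; it only reports which packages are absent, and the actual installation is guarded so that it runs only in an interactive session with an internet connection.

```r
# Packages used in this lab
LabPackages <- c("jsonlite", "httr", "dplyr", "ggplot2", "gridExtra", "ggmap")

# Names of the packages that are not installed yet
MissingPackages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Install only what is missing, and only when run interactively
if (interactive() && length(MissingPackages(LabPackages)) > 0) {
  install.packages(MissingPackages(LabPackages))
}

# Two ways to call a function from a package, shown here with the
# stats package that ships with R
library("stats")
median(c(1, 5, 9))          # after library(), the bare name is enough
stats::median(c(1, 5, 9))   # the :: syntax works without library()
```

The :: form is handy when a function is used only once or when two packages export the same name.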

How to access JCDecaux data

The goal is to access the live Velib data provided by JCDecaux, as described on the web page https://developer.jcdecaux.com/#/opendata/vls?page=getstarted. The data can be accessed through a simple RESTful API documented on that site. The only requirement for the live data is an API key. You should request your own (this is free and immediate) and not use mine…

Let us store the key in a variable.

DecauxKey <- "da879af595184f071c181408b837b7da636f924f"
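
One possible way to keep your own key out of the script is to read it from an environment variable. This is only a sketch: the helper name ReadDecauxKey and the variable name JCDECAUX_API_KEY are our choices, not something defined by JCDecaux.

```r
# Read the key from an environment variable (the variable name is an assumption);
# it can be set in ~/.Renviron with a line such as JCDECAUX_API_KEY=your-key
ReadDecauxKey <- function(var = "JCDECAUX_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No API key found: set the ", var, " environment variable")
  }
  key
}
```

With the variable set, DecauxKey <- ReadDecauxKey() would replace the hard-coded assignment above.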

We then define two functions to retrieve the data using the API provided by JCDecaux:

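# Build the request URL for an endpoint such as "contracts" or "stations?contract=Paris"
UrlDecaux <- function(decaux, key) {
  # Append the key with '&' if the endpoint already has a query string, else with '?'
  if (grepl("?", decaux, fixed = TRUE)) {
    delim <- "&"
  } else {
    delim <- "?"
  }
  sprintf("https://api.jcdecaux.com/vls/v1/%s%sapiKey=%s", decaux, delim, key)
}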

GetJsonDecaux <- function(decaux, key = DecauxKey) {
  jsonlite::fromJSON(UrlDecaux(decaux,key), flatten = TRUE)
}
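
The function above lets jsonlite fetch the URL directly, so an invalid key or a network problem shows up as a rather opaque error. A possible variant, sketched here with httr, checks the HTTP status first. The name GetJsonDecauxChecked is hypothetical, and the sketch assumes that httr and jsonlite are installed.

```r
# Variant of GetJsonDecaux that fails loudly on HTTP errors
GetJsonDecauxChecked <- function(decaux, key, ...) {
  resp <- httr::GET(
    url   = paste0("https://api.jcdecaux.com/vls/v1/", decaux),
    query = list(apiKey = key, ...)   # httr builds the ?a=b&c=d part itself
  )
  httr::stop_for_status(resp)         # turn 4xx/5xx responses into R errors
  txt <- httr::content(resp, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(txt, flatten = TRUE)
}
```

A call such as GetJsonDecauxChecked("stations", key = DecauxKey, contract = "Paris") would then play the role of GetJsonDecaux("stations?contract=Paris"), with the query parameters passed as named arguments instead of being pasted into the endpoint.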

We can now use the API to retrieve the list of the contracts.

Contracts <- GetJsonDecaux("contracts")

We can explore this data frame by clicking on its name in the top right panel or by using the View function:

View(Contracts)

A data frame can be queried as if it were a database. For instance, we can use the filter function from dplyr to subset the data frame according to some logical condition.

library("dplyr")
filter(Contracts, commercial_name == "Velib")
##    name
## 1 Paris
##                                                                                                                                                                                                                                                                                                                                                                                      cities
## 1 Arcueil, Aubervilliers, Bagnolet, Boulogne Billancourt, Charenton, Clichy, Fontenay-sous-Bois, Gentilly, Issy les Moulineaux, Ivry, Joinville, Le Kremlin Bicêtre, Le Pré St Gervais, Les Lilas, Levallois-Perret, Malakoff, Montreuil, Montrouge, Neuilly, Nogent, Pantin, Paris, Puteaux, Saint Cloud, Saint Denis, Saint Mandé, Saint Maurice, Saint Ouen, Suresnes, Vanves, Vincennes
##   commercial_name country_code
## 1           Velib           FR

We are now ready to extract the name of the Velib contract using R's [[ ]] extraction syntax.

filter(Contracts, commercial_name == "Velib")[["name"]]
## [1] "Paris"

We obtain a vector of strings, here with a single element. We may now store this name in order to reuse it later.

DecauxContractName <- filter(Contracts, commercial_name == "Velib")[["name"]]
DecauxContractName
## [1] "Paris"
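
To see what the double brackets change, the sketch below compares the usual extraction syntaxes on a two-row toy data frame. The second row is made up, and the base R subsetting on the first line stands in for filter so that the example runs without dplyr.

```r
Toy <- data.frame(
  name            = c("Paris", "Toyville"),
  commercial_name = c("Velib", "ToyBike"),
  stringsAsFactors = FALSE
)

# Base R equivalent of filter(Toy, commercial_name == "Velib")
Velib <- Toy[Toy$commercial_name == "Velib", ]

Velib["name"]     # single brackets: still a data frame, with one column
Velib[["name"]]   # double brackets: the column itself, a character vector
Velib$name        # shorthand for the previous line
```

Only the last two forms return a plain vector that can be passed to sprintf when building the next request.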

We now know that the contract name for Velib is Paris and are thus ready to extract the Velib-specific data.

The Velib dataset

We may now retrieve the Velib data set with a variation of the previous commands:

Stations <- GetJsonDecaux(sprintf("stations?contract=%s",DecauxContractName))
str(Stations)
## 'data.frame':    1230 obs. of  13 variables:
##  $ number               : int  31705 10042 8020 1022 35014 20040 28002 15111 12124 9021 ...
##  $ name                 : chr  "31705 - CHAMPEAUX (BAGNOLET)" "10042 - POISSONNIÈRE - ENGHIEN" "08020 - METRO ROME" "01022 - RUE DE LA PAIX" ...
##  $ address              : chr  "RUE DES CHAMPEAUX (PRES DE LA GARE ROUTIERE) - 93170 BAGNOLET" "52 RUE D'ENGHIEN / ANGLE RUE DU FAUBOURG POISSONIERE - 75010 PARIS" "74 BOULEVARD DES BATIGNOLLES - 75008 PARIS" "37 RUE CASANOVA - 75001 PARIS" ...
##  $ banking              : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ bonus                : logi  TRUE FALSE TRUE FALSE FALSE FALSE ...
##  $ status               : chr  "CLOSED" "OPEN" "OPEN" "OPEN" ...
##  $ contract_name        : chr  "Paris" "Paris" "Paris" "Paris" ...
##  $ bike_stands          : int  50 33 44 37 25 26 60 24 55 22 ...
##  $ available_bike_stands: int  0 32 43 26 13 25 27 23 1 21 ...
##  $ available_bikes      : int  0 1 0 9 12 0 6 1 48 1 ...
##  $ last_update          : num  1.44e+12 1.44e+12 1.44e+12 1.44e+12 1.44e+12 ...
##  $ position.lat         : num  48.9 48.9 48.9 48.9 48.9 ...
##  $ position.lng         : num  2.42 2.35 2.32 2.33 2.41 ...

Let’s look at the first rows with head, get a compact overview with glimpse (from dplyr) and compute per-column statistics with summary.

head(Stations)
##   number                           name
## 1  31705   31705 - CHAMPEAUX (BAGNOLET)
## 2  10042 10042 - POISSONNIÈRE - ENGHIEN
## 3   8020             08020 - METRO ROME
## 4   1022         01022 - RUE DE LA PAIX
## 5  35014     35014 - DE GAULLE (PANTIN)
## 6  20040     20040 - PARC DE BELLEVILLE
##                                                              address
## 1      RUE DES CHAMPEAUX (PRES DE LA GARE ROUTIERE) - 93170 BAGNOLET
## 2 52 RUE D'ENGHIEN / ANGLE RUE DU FAUBOURG POISSONIERE - 75010 PARIS
## 3                         74 BOULEVARD DES BATIGNOLLES - 75008 PARIS
## 4                                      37 RUE CASANOVA - 75001 PARIS
## 5     139 AVENUE JEAN LOLIVE / MAIL CHARLES DE GAULLE - 93500 PANTIN
## 6                           57 & 36 RUE JULIEN LACROIX - 75020 PARIS
##   banking bonus status contract_name bike_stands available_bike_stands
## 1    TRUE  TRUE CLOSED         Paris          50                     0
## 2    TRUE FALSE   OPEN         Paris          33                    32
## 3    TRUE  TRUE   OPEN         Paris          44                    43
## 4    TRUE FALSE   OPEN         Paris          37                    26
## 5    TRUE FALSE   OPEN         Paris          25                    13
## 6    TRUE FALSE   OPEN         Paris          26                    25
##   available_bikes  last_update position.lat position.lng
## 1               0 1.435412e+12     48.86453     2.416171
## 2               1 1.435412e+12     48.87242     2.348395
## 3               0 1.435412e+12     48.88215     2.319860
## 4               9 1.435412e+12     48.86822     2.330494
## 5              12 1.435413e+12     48.89327     2.412716
## 6               0 1.435412e+12     48.87039     2.384222
glimpse(Stations)
## Observations: 1230
## Variables:
## $ number                (int) 31705, 10042, 8020, 1022, 35014, 20040, ...
## $ name                  (chr) "31705 - CHAMPEAUX (BAGNOLET)", "10042 -...
## $ address               (chr) "RUE DES CHAMPEAUX (PRES DE LA GARE ROUT...
## $ banking               (lgl) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE...
## $ bonus                 (lgl) TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, ...
## $ status                (chr) "CLOSED", "OPEN", "OPEN", "OPEN", "OPEN"...
## $ contract_name         (chr) "Paris", "Paris", "Paris", "Paris", "Par...
## $ bike_stands           (int) 50, 33, 44, 37, 25, 26, 60, 24, 55, 22, ...
## $ available_bike_stands (int) 0, 32, 43, 26, 13, 25, 27, 23, 1, 21, 39...
## $ available_bikes       (int) 0, 1, 0, 9, 12, 0, 6, 1, 48, 1, 26, 0, 1...
## $ last_update           (dbl) 1.435412e+12, 1.435412e+12, 1.435412e+12...
## $ position.lat          (dbl) 48.86453, 48.87242, 48.88215, 48.86822, ...
## $ position.lng          (dbl) 2.416171, 2.348395, 2.319860, 2.330494, ...
summary(Stations)
##      number          name             address           banking       
##  Min.   :  901   Length:1230        Length:1230        Mode :logical  
##  1st Qu.:10007   Class :character   Class :character   FALSE:2        
##  Median :15018   Mode  :character   Mode  :character   TRUE :1228     
##  Mean   :15730                                         NA's :0        
##  3rd Qu.:19114                                                        
##  Max.   :44102                                                        
##    bonus            status          contract_name       bike_stands   
##  Mode :logical   Length:1230        Length:1230        Min.   : 8.00  
##  FALSE:1097      Class :character   Class :character   1st Qu.:23.00  
##  TRUE :133       Mode  :character   Mode  :character   Median :30.00  
##  NA's :0                                               Mean   :32.72  
##                                                        3rd Qu.:40.75  
##                                                        Max.   :72.00  
##  available_bike_stands available_bikes  last_update         position.lat  
##  Min.   : 0.00         Min.   : 0.00   Min.   :1.435e+12   Min.   :48.81  
##  1st Qu.: 9.00         1st Qu.: 1.00   1st Qu.:1.435e+12   1st Qu.:48.84  
##  Median :19.00         Median : 5.00   Median :1.435e+12   Median :48.86  
##  Mean   :19.74         Mean   :11.36   Mean   :1.435e+12   Mean   :48.86  
##  3rd Qu.:29.00         3rd Qu.:18.00   3rd Qu.:1.435e+12   3rd Qu.:48.88  
##  Max.   :65.00         Max.   :66.00   Max.   :1.435e+12   Max.   :48.92  
##   position.lng  
##  Min.   :2.222  
##  1st Qu.:2.311  
##  Median :2.344  
##  Mean   :2.343  
##  3rd Qu.:2.376  
##  Max.   :2.479

We may note a few issues: the status and the contract name are stored as strings whereas they are better understood as factors, and the last_update column seems mysterious. After checking the documentation on the JCDecaux website, we learn that it is the number of milliseconds since the Unix epoch (1970-01-01). We can now modify the data frame with the mutate command from dplyr. It takes as input a data frame followed by a list of columns to add or replace, and outputs the modified data frame. Note that R (almost) never modifies the values of its arguments, so it is up to the user to store the new value!

Stations <- mutate(Stations, status = factor(status, levels = c("CLOSED","OPEN")))
Stations <- mutate(Stations, contract_name = factor(contract_name))
Stations <- mutate(Stations, date = as.POSIXct(last_update/1000, origin = "1970-01-01"))
StationsDate <- max(Stations[,'date'])
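
The two points made above (arguments are not modified in place, and last_update counts milliseconds) can be checked on a tiny example. This sketch uses made-up rows and base R's transform in place of mutate so that it runs without dplyr. The tz = "UTC" argument is our addition to make the printed time independent of the machine's time zone.

```r
Toy <- data.frame(
  status      = c("OPEN", "CLOSED", "OPEN"),
  last_update = c(1435412000000, 1435412060000, 1435413000000),
  stringsAsFactors = FALSE
)

# The result is printed but not stored: Toy itself is unchanged
transform(Toy, status = factor(status))
is.character(Toy$status)
## [1] TRUE

# Storing the result is what actually updates the data frame
Toy <- transform(Toy, status = factor(status, levels = c("CLOSED", "OPEN")))

# Milliseconds since 1970-01-01 become seconds, then a date-time
Toy$date <- as.POSIXct(Toy$last_update / 1000, origin = "1970-01-01", tz = "UTC")
format(Toy$date[1], "%Y-%m-%d %H:%M:%S", tz = "UTC")
## [1] "2015-06-27 13:33:20"
```

Without the division by 1000, the values would be read as seconds and land tens of thousands of years in the future, which is a quick way to spot the unit mistake.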

We can use summary to verify that our dataset appears to be clean.

summary(Stations)
##      number          name             address           banking       
##  Min.   :  901   Length:1230        Length:1230        Mode :logical  
##  1st Qu.:10007   Class :character   Class :character   FALSE:2        
##  Median :15018   Mode  :character   Mode  :character   TRUE :1228     
##  Mean   :15730                                         NA's :0        
##  3rd Qu.:19114                                                        
##  Max.   :44102                                                        
##    bonus            status     contract_name  bike_stands   
##  Mode :logical   CLOSED:  33   Paris:1230    Min.   : 8.00  
##  FALSE:1097      OPEN  :1197                 1st Qu.:23.00  
##  TRUE :133                                   Median :30.00  
##  NA's :0                                     Mean   :32.72  
##                                              3rd Qu.:40.75  
##                                              Max.   :72.00  
##  available_bike_stands available_bikes  last_update         position.lat  
##  Min.   : 0.00         Min.   : 0.00   Min.   :1.435e+12   Min.   :48.81  
##  1st Qu.: 9.00         1st Qu.: 1.00   1st Qu.:1.435e+12   1st Qu.:48.84  
##  Median :19.00         Median : 5.00   Median :1.435e+12   Median :48.86  
##  Mean   :19.74         Mean   :11.36   Mean   :1.435e+12   Mean   :48.86  
##  3rd Qu.:29.00         3rd Qu.:18.00   3rd Qu.:1.435e+12   3rd Qu.:48.88  
##  Max.   :65.00         Max.   :66.00   Max.   :1.435e+12   Max.   :48.92  
##   position.lng        date                    
##  Min.   :2.222   Min.   :2015-06-27 13:45:42  
##  1st Qu.:2.311   1st Qu.:2015-06-27 15:37:47  
##  Median :2.344   Median :2015-06-27 15:40:37  
##  Mean   :2.343   Mean   :2015-06-27 15:39:41  
##  3rd Qu.:2.376   3rd Qu.:2015-06-27 15:42:13  
##  Max.   :2.479   Max.   :2015-06-27 15:43:05

Data exploration

Now that we have a properly formatted data frame, we can try to visualize it. We will use the package ggplot2 to explore the data set. We start with a one-dimensional exploration using the qplot function, which has a quite straightforward syntax while being quite clever…

We start by visualizing individually (almost) all the variables:

library("ggplot2")
qplot(data = Stations, bike_stands)

qplot(data = Stations, available_bike_stands)

qplot(data = Stations, available_bikes)

qplot(data = Stations, bonus)

qplot(data = Stations, banking)

qplot(data = Stations, date)

Note that ggplot2 chooses a different visualization depending on the nature of the variable, continuous or nominal: a histogram for the former and a bar chart for the latter.
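
The choice that qplot makes automatically can also be written out with the explicit ggplot syntax, which recent ggplot2 releases recommend over qplot. The sketch below uses a made-up data frame with one numeric and one logical column. The object names and the binwidth are our choices.

```r
library("ggplot2")

Toy <- data.frame(
  bike_stands = c(20, 25, 25, 30, 40, 40, 40, 60),
  bonus       = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Continuous variable: a histogram, with an explicit bin width
PlotStands <- ggplot(Toy, aes(x = bike_stands)) +
  geom_histogram(binwidth = 10)

# Nominal variable: a bar chart counting each value
PlotBonus <- ggplot(Toy, aes(x = bonus)) +
  geom_bar()

print(PlotStands)
print(PlotBonus)
```

Replacing Toy with Stations gives explicit counterparts of the first and fourth qplot calls above.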

The values of bike_stands, available_bike_stands and available_bikes are expected to be related by \[ \textbf{bike_stands} = \textbf{available_bike_stands} + \textbf{available_bikes} \] We may want to check this relationship.

qplot(data = Stations, bike_stands, available_bikes)

qplot(data = Stations, bike_stands, available_bike_stands)

qplot(data = Stations, bike_stands, available_bikes + available_bike_stands)

We see an unexpected phenomenon! Is it common, and do we see other patterns? Answering is not as easy as it seems, because many stations can fall on the same point of the graph. We will use two tricks to work around this issue: jittering, in which each point is displaced by a small random amount, and transparency, which reveals overlapping points.

qplot(data = Stations, bike_stands, available_bikes + available_bike_stands, color = status, geom = "jitter")

# I() sets alpha and size as constants instead of mapping them to a legend
qplot(data = Stations, bike_stands, 
      available_bikes + available_bike_stands, 
      color = status, alpha = I(.25), size = I(1.5))

qplot(data = Stations, bike_stands, 
      available_bikes + available_bike_stands, 
      color = status, geom = "jitter", alpha = I(.25), size = I(1.5))
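
The plots suggest how often the relation fails, and a direct count can complement them. The sketch below runs on a made-up data frame with the same three column names. The column name gap is ours.

```r
Toy <- data.frame(
  bike_stands           = c(20, 30, 40, 25),
  available_bikes       = c( 5, 10,  0, 12),
  available_bike_stands = c(15, 18, 40, 13)
)

# Difference between the declared capacity and what is reported as available
Toy$gap <- Toy$bike_stands - (Toy$available_bikes + Toy$available_bike_stands)

sum(Toy$gap != 0)    # number of stations where the relation fails
## [1] 1
mean(Toy$gap != 0)   # share of such stations
## [1] 0.25
table(Toy$gap)       # distribution of the gap
```

Applied to Stations instead of Toy, the same three lines would say whether the unexpected points are rare exceptions or a common situation.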