Web Scraping with R

web scraping
data science
Author

Celâl Güney

Published

July 4, 2023

An introduction to web scraping in R

Web scraping is nowadays widespread in statistics and data science. This data collection method has become popular not only in computer and data science, but also in economics, sociology and the social sciences in general. The reasons for its success hardly need to be spelled out: the usefulness of web scraping, extracting data from whatever website we visit, is self-evident. Go to any Wikipedia page, and you can pull whatever data table it contains into your favorite statistical software. The potential of web scraping is huge, and it will be used increasingly in the social sciences.

The main goal of this post is to show the basics of web scraping in R using the rvest package. To do so, I draw extensively on Paul C. Bauer’s Computational Social Sciences manual, which explains in depth how to do web scraping in R.

Getting data: videogame prices

I will use web scraping here to get video-game prices in Switzerland. The first step of any web scraping task is to find a webpage from which you want to collect data. To collect video-game prices, I need a webpage from, for instance, a retail chain. In Switzerland, MediaMarkt is a well-known chain of consumer electronics stores, and most Swiss people who still buy or order physical copies of their favorite video games go to, or order from, MediaMarkt.

As an aside, at the time and for many years (at least until I was 18, in the 2010s), MediaMarkt used to sell video games slightly below market prices, and some stores in Geneva even sometimes sold copies before the release date. The situation has changed a lot since then, and the stores are now barely able to sell and distribute physical copies on day one (this was the case this year, at least, for Wo Long and the Dead Space remake).

I then search, on the MediaMarkt website, for the webpage dedicated to video games. The pages are organized by platform/console, so I choose to focus on PlayStation 5 games. I store the link in an object called url. The web page can then be read into R using read_html():

url <- "https://www.mediamarkt.ch/fr/category/_jeux-ps5-772119.html?searchParams=&sort=&view=PRODUCTGRID&page=1"
mediamarkt <- read_html(url)

The next step is to locate and select each video game’s price, as well as its title, using the Chrome extension SelectorGadget. This step is a bit of a pain because SelectorGadget is a Chrome-only extension, so if you don’t have Chrome, you have to install it and then add SelectorGadget to it. To use SelectorGadget, just click on the element you want to collect on the webpage and then copy the CSS selector that appears on the left of the SelectorGadget bar.

Using SelectorGadget

Then you just have to paste the selector into the html_nodes() function (here .info-wrapper for the price and .product-link for the title).

price <- html_nodes(mediamarkt, ".info-wrapper")
game <- html_nodes(mediamarkt, ".product-link")

Then, html_text() will extract the information we want:

price <- html_text(price, trim = TRUE)
game <- html_text(game, trim = TRUE)
head(price)
[1] "274.95\r\n\t\r\n\t\t\tTVA compris, plus frais d’expédition 0.-"
[2] "72.95\r\n\t\r\n\t\t\tTVA compris, plus frais d’expédition 0.-" 
[3] "67.95\r\n\t\r\n\t\t\tTVA compris, plus frais d’expédition 0.-" 
[4] "69.95\r\n\t\r\n\t\t\tTVA compris, plus frais d’expédition 0.-" 
[5] "73.95\r\n\t\r\n\t\t\tTVA compris, plus frais d’expédition 0.-" 
[6] "54.95\r\n\t\r\n\t\t\tTVA compris, plus frais d’expédition 0.-" 
head(game)
[1] "Marvel's Spider-Man 2: Édition Collector (CiaB) - PlayStation 5 - Allemand, Français, Italien"
[2] "Sports FC 24 : Édition Standard - PlayStation 5 - Allemand, Français, Italien"                
[3] "Marvel’s Spider-Man: Miles Morales - PlayStation 5 - Allemand, Français, Italien"             
[4] "STAR WARS Jedi : Survivor - PlayStation 5 - Allemand, Français, Italien"                      
[5] "F1 23 - PlayStation 5 - Allemand, Français, Italien"                                          
[6] "Ratchet & Clank : Rift Apart - PlayStation 5 - Allemand, Français, Italien"                   

It worked well for the game titles (even if they could be simplified further), but not so well for the prices. We need to extract the numbers from the text in the price object. This can be done with gsub() and a bit of regex:

price_final <- as.numeric(gsub("([0-9.]+).*$", "\\1", price))
# "([0-9.]+).*$" captures the leading run of digits and dots (the price) and drops everything after it,
# so only the number remains from the strings we got with html_text()
price_final
 [1] 274.95  72.95  67.95  69.95  73.95  54.95  44.95  69.95  62.95  59.95
[11]  69.95  74.95  54.95  40.95  72.95  29.95  69.95  14.00  72.95  26.95
[21]  22.00  24.00  74.95  89.95

Similarly, the video-game titles can be simplified with a bit of regex:

game_final <- gsub("- PlayStation 5 - Allemand.*", "", game)
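Note that this gsub() call leaves a trailing space after each title, and a few entries (such as the French-only Diablo IV listing) keep their suffix because it does not start with “- PlayStation 5 - Allemand”. As an optional extra cleanup, the trailing whitespace can be trimmed:

game_final <- trimws(game_final) # drop the trailing space left by gsub()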

And finally, prices and titles can be combined into a data frame:

data <- data.frame(game_final, price_final)
data %>% gt()
game_final price_final
Marvel's Spider-Man 2: Édition Collector (CiaB) 274.95
Sports FC 24 : Édition Standard 72.95
Marvel’s Spider-Man: Miles Morales 67.95
STAR WARS Jedi : Survivor 69.95
F1 23 73.95
Ratchet & Clank : Rift Apart 54.95
Elden Ring : Édition Standard 44.95
Street Fighter 6 69.95
Hogwarts Legacy 62.95
Need for Speed Unbound 59.95
Diablo IV 69.95
Marvel's Spider-Man 2 74.95
The Last of Us Part I 54.95
MotoGP 23 : Édition Day One 40.95
Gran Turismo 7 72.95
Der Herr der Ringe: Gollum 29.95
Diablo IV - PlayStation 5 - Français 69.95
New Tales from the Borderlands: Deluxe Edition 14.00
Madden NFL 24 - PlayStation 5 - English 72.95
Far Cry 6 26.95
Assassin's Creed Valhalla 22.00
Assassin's Creed : Valhalla - Édition Ragnarök 24.00
Wild Hearts 74.95
Mortal Kombat 1: Premium Edition 89.95

We have thus collected the title and price of every game on the first page of MediaMarkt’s PlayStation 5 listings.

data %>% 
  ggplot()+
  aes(y = reorder(game_final, -price_final), x = price_final)+
  geom_col()+
  geom_text(aes(label = price_final), size = 3, hjust = 1.4, color = "white")+
  theme_minimal(base_size = 11)+
  labs(title = "Playstation 5 games",
    y = "", x = "price in swiss francs")

Getting data on all pages

So far we have collected data only from the first page of the website, but ideally we want to collect data from all pages automatically. This can be done as follows.

First, create a function that automates most of the steps: reading the URL (read_html()), selecting the CSS nodes found with the Chrome extension SelectorGadget, and extracting the information of interest:

scrape_website <- function(url){
  
  website <- read_html(url)
  
  # select the title and price nodes with the CSS selectors found with SelectorGadget
  title <- html_nodes(website, ".product-link")
  price <- html_nodes(website, ".info-wrapper")
  
  # clean the titles and extract the numeric prices
  df <- data.frame(game = gsub("- PlayStation 5 - Allemand.*", "", html_text(title, trim = TRUE)),
                   price = as.numeric(gsub("([0-9.]+).*$", "\\1", html_text(price, trim = TRUE))))
  
  return(df)
}
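When looping over many pages, a single failed request (a timeout, a removed page) would stop everything. As a minimal safeguard, purely as a sketch, the function could be wrapped with purrr::possibly(), which returns a default value instead of throwing an error; safe_scrape is just an illustrative name:

# return an empty data frame instead of an error if a page fails to load
safe_scrape <- possibly(scrape_website,
                        otherwise = data.frame(game = character(), price = numeric()))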

But what about the URL? In our example, there are 21 pages in total, resulting in 21 different URLs. Fortunately, since those pages are simply a paginated list of products, their URLs are very similar. The only difference between them is the “page=#” at the end, with # the number of the page. We can thus define base_url as the URL pattern common to all 21 pages and then store the number of each page (the pages object):

base_url <- "https://www.mediamarkt.ch/fr/category/_jeux-ps5-772119.html?searchParams=&sort=&view=PRODUCTGRID&page="

pages <- seq(1, 21, by = 1)

Here is an example of how to extract the data for the first page using the function:

url <- paste(base_url, pages[1], sep = "")
datasets <- list()
datasets[[1]] <- scrape_website(url = url)

Finally, to scrape all the pages, Paul Bauer proposes the following for loop:

datasets <- list()
for (i in 1:length(pages)){
  
  # informative message about the progress of the loop
  message(i, '/', length(pages))
  
  # prepare the URL of page i
  url <- paste(base_url, pages[i], sep = "")
  
  # scrape the page and store the result
  datasets[[i]] <- scrape_website(url)
  
  # pause briefly between URL calls
  Sys.sleep(0.2)
}
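The loop leaves us with a list of data frames, one per page; they can then be stacked into a single data frame, for instance with dplyr::bind_rows():

data <- bind_rows(datasets) # combine the per-page data frames into one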

However, I propose to simplify the code by using the map_df() function from the purrr package:

url <- paste(base_url, pages, sep = "") # create a vector with the URL of every page

data <- url %>% 
  map_df(scrape_website) # apply scrape_website() to each URL and row-bind the results into a single data frame
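One thing this version drops compared with the for loop is the pause between requests. If you want to keep it, one option (a sketch, with scrape_slowly as an illustrative name) is purrr::slowly(), which adds a fixed delay before each call:

# same as above, but with a 0.2-second pause between URL calls
scrape_slowly <- slowly(scrape_website, rate = rate_delay(0.2))

data <- url %>% 
  map_df(scrape_slowly)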

We thus have all the data we wanted! Let’s now have a quick look at what we collected.

data %>% 
  summarise(mean = mean(price, na.rm = TRUE),
            median = median(price, na.rm = TRUE),
            sd = sd(price, na.rm = TRUE),
            min = min(price, na.rm = TRUE),
            max = max(price, na.rm = TRUE)
            ) %>% 
  gt(caption = "Video-games summary statistics (CHF)")
Video-games summary statistics (CHF)
mean median sd min max
37.23819 34.95 23.70447 5 300.95
data %>% 
  filter(price <= 90) %>% 
  ggplot()+
  aes(x = price)+
  geom_histogram(binwidth = 9, color = "white", fill = "darkblue", alpha = 0.8)+
  scale_x_continuous(breaks = seq(0, 90, by = 5))+
  theme_minimal(base_size = 13)+
  labs(title = "Video-game price",
       subtitle = "Switzerland, Playstation 5 games, July 2023",
       y = "", x = "Swiss Francs (CHF)")

Getting data from a webpage table

When browsing the internet, we often come across tables with no option to export them. Web scraping makes it easy to read those tables into R.

Here I will import data from a table on the results of the famous 1992 Swiss vote on joining the European Economic Area. The Swiss government website provides a table here. I am almost certain that the Swiss government also provides an Excel or CSV file for this, but since I am too lazy to search for it (in my defense, the page should allow the table to be exported directly to CSV or Excel), I will use web scraping to read this table into R and then extract the data into a data frame.

The first step is the same as above: read the URL to import the HTML code into R:

url <- "https://www.bk.admin.ch/ch/f/pore/va/19921206/can388.html"
page <- read_html(url)

Then, html_table() can directly parse the table information:

table <- html_table(page, header = TRUE)

data <- table[[1]] #extract the table dataframe from the list
head(data) %>% 
  gt()
Canton Electeurs Votants % Particip. Oui Non % Oui % Non
Zurich 768'126 618'209 80.48% 297'503 316'154 48.5% 51.5%
Berne 686'459 540'179 78.69% 255'224 281'026 47.6% 52.4%
Lucerne 224'458 181'614 80.91% 70'878 109'447 39.3% 60.7%
Uri 25'290 19'816 78.36% 4'943 14'728 25.1% 74.9%
Schwyz 77'278 64'315 83.23% 17'094 46'962 26.7% 73.3%
Obwald 20'713 16'940 81.78% 4'737 12'062 28.2% 71.8%

There is one problem here: the numbers contain either “'” or “%”. As a result, R reads them as character strings, and it will not be possible to convert them to numeric format unless all the ' and % symbols are removed.

data2 <- as_tibble(lapply(data, function(x) {gsub("'|%", "", x)})) # remove all ' and % from the values so they can be converted to numeric

data2 <- data %>% 
  map_df(function(x) {gsub("'|%", "", x)}) # same thing using map_df() from purrr

All the ' and % symbols are now removed. The last step is to convert the values to numeric:

data2[,-1] <- data2[,-1] %>% 
            map_df(as.numeric) ## convert all columns into numeric except for canton (first column)

str(data2)
tibble [27 × 8] (S3: tbl_df/tbl/data.frame)
 $ Canton     : chr [1:27] "Zurich" "Berne" "Lucerne" "Uri" ...
 $ Electeurs  : num [1:27] 768126 686459 224458 25290 77278 ...
 $ Votants    : num [1:27] 618209 540179 181614 19816 64315 ...
 $ % Particip.: num [1:27] 80.5 78.7 80.9 78.4 83.2 ...
 $ Oui        : num [1:27] 297503 255224 70878 4943 17094 ...
 $ Non        : num [1:27] 316154 281026 109447 14728 46962 ...
 $ % Oui      : num [1:27] 48.5 47.6 39.3 25.1 26.7 28.2 33.9 31.9 43.8 64.9 ...
 $ % Non      : num [1:27] 51.5 52.4 60.7 74.9 73.3 71.8 66.1 68.1 56.2 35.1 ...
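As an alternative, the cleaning and conversion can be done in a single step with readr::parse_number(), which strips the trailing % and, given a locale whose grouping mark is ', also handles the thousands separators (a sketch, assuming readr is loaded with the tidyverse):

# one-step alternative: parse every column except Canton directly to numeric
data2 <- data %>% 
  mutate(across(-Canton, ~ parse_number(.x, locale = locale(grouping_mark = "'"))))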