Data wrangling with R (1): Sheffield house prices

Background

This post gives an example of assembling and manipulating a data set in R, in particular:

  • some basic dplyr commands for manipulating data frames;
  • working with strings: selecting rows in a data frame that contain particular text within string variables;
  • combining two data frames.

The data set that we will construct can be used to visualise how house prices vary in different parts of Sheffield. I use this data set for teaching; I think it gives students an opportunity to produce a nice visualisation using the leaflet package, and the end result is quite striking.

The data

The UK Land Registry publishes data on house price sales each year. We’ll use the 2017 data file. We could read the file into R straight from the web

library(tidyverse)
house <- read_csv("http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2017.csv", col_names = FALSE)

though the file is quite large (about 180Mb), so I would suggest downloading the file separately, and then importing from your hard disk

There are no column headers in the file (so we use col_names = FALSE), but they are explained here. We’ll select the columns of interest only, and give them more informative names (the postcode column name will need to match the column name used for postcode in another data set.)

landRegistry %>%
  select(X2, X4, X5) ->
  houseSales
colnames(houseSales) <- c("Price", "Postcode", "Type")
houseSales
# A tibble: 1,055,011 x 3
    Price Postcode Type 
    <int> <chr>    <chr>
 1 125000 BD23 2BE F    
 2 195000 YO8 9GN  S    
 3 275000 DL8 4BD  D    
 4 715000 HG5 9HA  D    
 5 167500 YO12 6TW S    
 6 300000 YO18 8QA D    
 7 304000 YO10 4HN S    
 8 782000 YO30 7DZ D    
 9 315000 BD23 6QN D    
10 250000 YO13 0PY T    
# ... with 1,055,001 more rows

Next, we want to select house sales in Sheffield. We’ll use the central postcode districts only, which are S1 to S11 (see this map).

There are different ways to do this. One way is to make use of a regular expression (Chapter 14 of R for Data Science is a useful guide here).

(Sheffield <- paste("^S", 1:11, " ", sep = "", collapse = "|"))
[1] "^S1 |^S2 |^S3 |^S4 |^S5 |^S6 |^S7 |^S8 |^S9 |^S10 |^S11 "

(Read this as “a string starting with S1 followed by a space, or a string starting with S2 followed by a space, and so on”. The space is important, otherwise “S31” would count as containing “S3”.)

We can then search over this pattern using the function stringr::str_detect():

stringr::str_detect(c("S3 7RH", "S31 8AB", "WS1 4CD"), Sheffield)
[1]  TRUE FALSE FALSE

We will now extract the S1-S11 house sales, and we will also exclude any property type “Other” (coded as O):

houseSales %>%
  filter(stringr::str_detect(Postcode, Sheffield))  %>%
  filter(Type != "O") ->
  SheffieldHouseSales

SheffieldHouseSales
# A tibble: 5,495 x 3
    Price Postcode Type 
    <int> <chr>    <chr>
 1 128995 S2 1PL   T    
 2 250000 S10 2LT  T    
 3 194995 S2 3BY   S    
 4 236950 S7 2HJ   T    
 5 414950 S7 2HJ   S    
 6 620000 S7 2HJ   D    
 7 377450 S7 2HJ   T    
 8 140000 S3 8DX   F    
 9  99995 S5 8DF   S    
10 161000 S4 8BX   S    
# ... with 5,485 more rows

The postcode data

Next, we get the data to give us latitude and longitude for each postcode. These data are helpfully curated at https://www.doogal.co.uk: a bit of searching there will give a file we can import straight into R:

postcodeCoords <- read_csv("https://www.doogal.co.uk/AdministrativeAreasCSV.ashx?district=E08000019")

We’ll just select the columns we want

ShefPostcodeCoords <- postcodeCoords %>%
  select(Postcode, Latitude, Longitude)

ShefPostcodeCoords
# A tibble: 18,812 x 3
   Postcode Latitude Longitude
   <chr>       <dbl>     <dbl>
 1 S1 1AA       53.4     -1.46
 2 S1 1AB       53.4     -1.46
 3 S1 1AD       53.4     -1.47
 4 S1 1AH       53.4     -1.46
 5 S1 1AL       53.4     -1.46
 6 S1 1AU       53.4     -1.46
 7 S1 1AW       53.4     -1.48
 8 S1 1AY       53.4     -1.46
 9 S1 1BA       53.4     -1.47
10 S1 1BB       53.4     -1.48
# ... with 18,802 more rows

Finally, we can join the two data sets together

SheffieldDf <- inner_join(SheffieldHouseSales, ShefPostcodeCoords, by = "Postcode") 

SheffieldDf
# A tibble: 5,494 x 5
    Price Postcode Type  Latitude Longitude
    <int> <chr>    <chr>    <dbl>     <dbl>
 1 128995 S2 1PL   T         53.4     -1.42
 2 250000 S10 2LT  T         53.4     -1.49
 3 194995 S2 3BY   S         53.4     -1.45
 4 236950 S7 2HJ   T         53.3     -1.50
 5 414950 S7 2HJ   S         53.3     -1.50
 6 620000 S7 2HJ   D         53.3     -1.50
 7 377450 S7 2HJ   T         53.3     -1.50
 8 140000 S3 8DX   F         53.4     -1.48
 9  99995 S5 8DF   S         53.4     -1.48
10 161000 S4 8BX   S         53.4     -1.45
# ... with 5,484 more rows

Plotting the data

(SheffieldDf is the data set I give the students, so I won’t post the code for the analysis here, but I will show the end result, obtained using the leaflet package)

Figure 1: Mapping house prices in Sheffield. The blue circles show the most expensive house sales (highest 5%), and the red circles show the cheapest house sales (lowest 5%): the division across the city is striking.

Acknowledgements

House price data

Obtained from https://data.gov.uk/dataset/4c9b7641-cf73-4fd9-869a-4bfeed6d440e/hm-land-registry-price-paid-data

Contains HM Land Registry data (c) Crown copyright and database right 2018.

This data is licensed under the Open Government Licence v3.0.

Postcode data

Obtained from https://www.doogal.co.uk/AdministrativeAreasCSV.ashx?district=E08000019

Contains OS data (c) Crown copyright and database rights 2018

Contains Royal Mail data (c) Royal Mail copyright and database rights 2018

Contains National Statistics data (c) Crown copyright and database rights 2018

Related