Data wrangling with R (2): the gender pay gap


The UK government requires employers with 250 or more employees to provide data on gender pay inequality. Various statistics are reported; here I will investigate the proportion of female employees in each pay quartile.

About 10500 employers submitted reports in 2017, but we will look at the data from a single sector only: universities. Two questions of interest are

  • does the proportion of female employees vary by pay quartile? (For example, are female employees more likely to be in lower-paid jobs?)
  • Are there any differences between the ‘pre-92’ and ‘post-92’ institutions? (‘Post-92’ institutions are typically former polytechnics, which were awarded university status in 1992).

(Separately, we would want to know whether males and females are paid equally for doing the same job, but I don’t think there’s anything in this data set that can tell us this.)

The data wrangling problem

I want to extract university employers (about 100) from a gender-pay data set of about 10500 employers. I can obtain a list of universities from a separate source, but the problem is that the names used in the two data sets may be different, for example, “Sheffield University” in one data set, and “The University of Sheffield” in the other. I’ll need to search for universities in the gender-pay data set using some sort of approximate string matching. A second (simpler) problem will be how to restructure/relabel the data set, so that we can get the plot we want using ggplot2.

The gender pay gap data

The data are available at the UK government’s gender pay gap service. I’ve already had a look to see what columns I want, so I’ll select those straight away. There are some employers with “university” in the title that are NHS trusts, or students’ unions, so I’ll filter those out. The command tolower() is handy for converting text to lower case.

pay <- read_csv("") %>%
         ) %>%
  filter(!str_detect(tolower(EmployerName), "nhs"),
         !str_detect(tolower(EmployerName), "students'"),
         !str_detect(tolower(EmployerName), "students"))

Extracting the universities

To make a list of universities, I’ll use the register at the Office for Students. I downloaded an xlsx file separately, which you can get here. Again, I’ve had a look at the excel file in advance, to see if I need to skip at rows at the start, and to see what columns I want. For the analysis, I will only use “higher education providers” that can award research degrees.

universities <- readxl::read_xlsx("OfSRegisterData-2018-11-09-1610.xlsx",
                                  skip = 4) %>%
  select(c(1, 17, 18)) %>%
  rename("name" = 1, "highestDegree" = 2, "dateDegree" = 3) %>%
  filter(highestDegree == "Research") 

Approximate string matching

To illustrate the problem, an employer I would be trying to find in the pay data frame is

## [1] "University of Newcastle upon Tyne"

but, in the pay data frame this is actually listed as

## [1] "Newcastle University"

A function I can try is agrep(). This computes a measure of similarity between two strings called the Levenshtein distance, so I’ll actually make use of a function to calculate the distance directly: RecordLinkage::levenshteinSim(). This will give a value on a 0-1 scale, with 1 corresponding to a perfect match. There are a couple of things we can do to help: arrange the words inside each string in alphabetical order, and remove small words such as “the” and “of”. To illustrate how this might help, consider searching for “the university of southampton”, within a list containing “southampton university” and “the university of northampton”. After arranging the words in alphabetical order, we try

RecordLinkage::levenshteinSim("of southampton the university",
                              c("southampton university",
                                "of northampton the university"))
## [1] 0.7586207 0.9310345

Not what we want. But this would (obviously) work:

RecordLinkage::levenshteinSim("southampton university", 
                              c("southampton university", 
                                "northampton university"))
## [1] 1.0000000 0.9090909

There are also a few employers that include the title “Higher Education Corporation”, which I will try getting rid of to start with, making use of the function tm::removeWords(). I’ll also change everything to lower case.

universities$name <- tolower(universities$name) %>% 
 tm::removeWords("higher education corporation")

To arrange the words within a string in order, I need to define a function (thanks to stack overflow)

orderString <- function(x){
  x %>% 
    str_split(., ' ') %>% 
    lapply(., 'sort') %>%  
    lapply(., 'paste', collapse=' ') %>% 

Now I’ll apply the modifications to each string, and then loop through the names in universities, searching for the closest match in pay:

shortnamesUni <- orderString(tm::removeWords(tolower(universities$name),
                                             c("the", "of")))
shortnamesPay <- orderString(tm::removeWords(tolower(pay$EmployerName),
                                             c("the", "of")))

universitiesIndex <- maxDistance <- rep(0, nrow(universities))
for(i in 1:nrow(universities)){
  distances = RecordLinkage::levenshteinSim(shortnamesUni[i],
  universitiesIndex[i] <- which.max(distances)
  maxDistance[i] <- max(distances)

universitiesIndex should give the row numbers in pay that match the employer names in universities. I’ll check this, looking at worst matches first. (There are only around 100 universities to check in total, so it’s not hard to do this by eye. I’d need a better solution if I was trying to automate the whole process.)

matches <- data.frame(universities$name, 
matches %>%
  arrange(maxDistance) %>%
## 1                             st. george's hospital medical school
## 2 the chancellor, masters and scholars of the university of oxford
## 3             imperial college of science, technology and medicine
## 4                                university of newcastle upon tyne
## 5                                                 birkbeck college
## 6                       university of the west of england, bristol
##                                           pay.EmployerName.universitiesIndex.
## 1                                                   The Royal Hospital School
## 2 The Provost and Scholars of the Queen's College in the University of Oxford
## 3                               THE SCIENCE AND TECHNOLOGY FACILITIES COUNCIL
## 4                                                        Newcastle University
## 5                                                             Burnley College
## 6                                           University of the West of England
##   maxDistance
## 1   0.5277778
## 2   0.5483871
## 3   0.6200000
## 4   0.6451613
## 5   0.6875000
## 6   0.7428571

Looking at a few more rows, I spotted seven cases where the right employers weren’t found. We’ve now reached the limit of my ability, so I’ll search manually at, change the names in universities by hand, and repeat the code above.

universities$name[c(20, 22, 23, 31, 58, 64, 80)] <- c("Imperial College",
                                "St George's, University of London",
                                "The University of Oxford",
                                "Kingston University Higher Education Corporation",
                                "Southampton Solent University, Limited",
                                "Birkbeck College, University of London")

shortnamesUni <- orderString(tm::removeWords(tolower(universities$name),
                                             c("the", "of")))

for(i in 1:nrow(universities)){
  distances = RecordLinkage::levenshteinSim(shortnamesUni[i],
  universitiesIndex[i] <- which.max(distances)
  maxDistance[i] <- max(distances)

Adding in the university type

I’ll now add in a variable for whether each university is ‘pre-92’ or ‘post-92’. I can get this mostly from the dateDegree column (the year degree-awarding powers were granted), but there are a few missing values, and the situation with London universities is a little complicated, so I’ll make some changes manually.

universities$dateDegree <- as.numeric(universities$dateDegree)
index <- c(5, 12, 14, 20, 23, 31, 60, 63)
## [1] "university of cambridge"                             
## [2] "queen mary university of london"                     
## [3] "king's college london"                               
## [4] "Imperial College"                                    
## [5] "UCL"                                                 
## [6] "The University of Oxford"                            
## [7] "the london school of economics and political science"
## [8] "london school of hygiene and tropical medicine"
universities$dateDegree[index] <- 1900 
universities$dateDegree[universities$name == "university of brighton"] <- 1992

Finally, I can make the data set I wanted:

genderPay <- pay[universitiesIndex, ]
genderPay$institution <- ifelse(universities$dateDegree < 1992,

Plotting the data

One way to visualise the data is with a box plot of percentages of female employees, within each pay quartile, with separate boxes drawn for pre and post 92s. There are a couple of fiddly things to do here. First, I need to rearrange the data in a ‘tidy’ format: one dependent variable per row only, with a new variable (Quartile) to describe the pay quartile. gather() from the tidyr package will do this:

genderPay %>%
  gather(Quartile, FemalePercentage, -EmployerName, -institution )

Finally, I need to re-order the factor levels of Quartile, so that ggplot2 will plot them in the order I want, and then give shorter labels for use in the box plot.

payDfPlotting <-
  genderPay %>%
  gather(Quartile, FemalePercentage, -EmployerName, -institution ) %>%
  mutate(Quartile = factor(Quartile, 
            levels = c("FemaleLowerQuartile", "FemaleLowerMiddleQuartile",
                       "FemaleUpperMiddleQuartile", "FemaleTopQuartile"))) %>%
  mutate(Quartile = recode(Quartile, "FemaleLowerQuartile" = "first",
                           "FemaleLowerMiddleQuartile" = "second",
                           "FemaleUpperMiddleQuartile" = "third",
                           "FemaleTopQuartile" = "fourth"))

The end result

ggplot(payDfPlotting, aes(x = Quartile, y = FemalePercentage/  100)) +
  geom_boxplot(aes(colour = institution)) +
  labs(x = "Pay quartile (first: lowest 25% of salaries, fourth: highest 25% of salaries)", y = "Percentage of female employees", 
       caption = "Data source:")+
  scale_y_continuous(labels = scales::percent) 
Percentages of female employees within each pay quartile at UK universities (grouped into 'pre-92' and 'post-92' institutions). Low paid employees are more likely to be female; high paid employees are more likely to be male.

Figure 1: Percentages of female employees within each pay quartile at UK universities (grouped into ‘pre-92’ and ‘post-92’ institutions). Low paid employees are more likely to be female; high paid employees are more likely to be male.

The general trend of a decreasing percentage of female staff as salaries increase is clear to see. The contrast between pre and post-92s within the highest pay quartile is also clear.


Gender pay gap data from Contains public sector information licensed under the Open Government Licence v3.0.