Data Wrangling Workshop

Paul Gramieri and Jonathan Che
February 15, 2017

What is Data Wrangling?

Often, datasets need manipulation before we can analyze them. We want our data to look like this…

firstname lastname age state
Fred Flores 24 MA
Sally Smith 31 NY
Jim Jones 27 GA

But we often get data that looks more like this:

firstname lastname age state
Fred Flores Twenty Four MA
Smith, Sally 31 New York
Jim Jones '27' The Peach State

Basic Data Wrangling in R

Base R is helpful for some basic manipulation.

  • Finding size of data frame: use 'dim' function
  • Extract column from dataset: dataset$variable
  • Square bracket notation: object[row, column]

dataset[1,2] returns value in 1st row, 2nd column

dataset[3,] returns 3rd row

dataset[,4] returns 4th column

  • For row or column vectors…

vector[5] returns 5th item

Basic Data Wrangling in R (cont'd)

  • Finding type of object: use 'class' function
  • Common types of objects:

data frame, vector, matrix, character, numeric

  • Test type of object (will return 'TRUE' or 'FALSE'):

is.numeric(), is.vector(), is.data.frame(), etc

  • Convert type of object:

as.numeric(), as.vector(), as.data.frame(), etc

Packages and Basic Syntax

The 'dplyr' and 'tidyr' packages contain more functions to help us wrangle our data.

library('dplyr')
library('tidyr')

All functions follow the basic syntax:

function(dataset, argument)

See cheat sheet for list of functions. We'll go through some with an example.

Load in Data

We'll be using the 'CPS85' dataset from the 'mosaicData' package.

library('mosaicData')
dim(CPS85)
[1] 534  11

'select' function

We don't care about all 11 columns. Use 'select' to extract the columns you need.

CPS85.a <- CPS85 %>%
  select(wage, educ, sex, age, sector)
head(CPS85.a)
  wage educ sex age   sector
1  9.0   10   M  43    const
2  5.5   12   M  38    sales
3  3.8   12   F  22    sales
4 10.5   12   F  47 clerical
5 15.0   12   M  58    const
6  9.0   16   F  49 clerical

'filter' function

Use 'filter' to extract rows based on some arguments

CPS85.b <- CPS85 %>%
  select(wage, educ, sex, age, sector) %>%
  filter(sector == 'clerical',
         educ >= 12)   # At least a HS education
head(CPS85.b)
   wage educ sex age   sector
1 10.50   12   F  47 clerical
2  9.00   16   F  49 clerical
3  3.50   12   M  53 clerical
4  9.10   13   M  35 clerical
5 11.25   17   F  55 clerical
6  8.00   14   F  35 clerical

'mutate' function

We can also create new variables. Let's create one called 'work_years' that specifies number of years in the workforce.

CPS85.c <- CPS85.a %>%
  mutate(work_years = age - educ - 6)
head(CPS85.c)
  wage educ sex age   sector work_years
1  9.0   10   M  43    const         27
2  5.5   12   M  38    sales         20
3  3.8   12   F  22    sales          4
4 10.5   12   F  47 clerical         29
5 15.0   12   M  58    const         40
6  9.0   16   F  49 clerical         27

'arrange' function

To order our dataset in terms of a certain variable, we can use 'arrange'

CPS85.d <- CPS85.a %>%
  arrange(age)
head(CPS85.d)
   wage educ sex age   sector
1  3.35   12   M  18    other
2  3.35   12   F  18  service
3  5.00   12   F  18 clerical
4 10.00   16   F  18     prof
5  4.00   13   M  19    other
6  6.67   12   M  19    manuf

'group_by' and 'summarize'

If we want to find some summary statistics to compare distinct groups in the data, we can use 'group_by' and 'summarize'

CPS85.e <- CPS85 %>%
  group_by(sex) %>%
  summarize(avg_wage = mean(wage))
CPS85.e
# A tibble: 2 x 2
     sex avg_wage
  <fctr>    <dbl>
1      F 7.878857
2      M 9.994913

'gather' and 'spread' functions

Consider the following two data frames that display the same information about students' scores on two tests:

firstname lastname test1 test2
Fred Flores 91 82
Sally Smith 93 89
firstname lastname test score
Fred Flores test1 91
Sally Smith test1 93
Fred Flores test2 82
Sally Smith test2 89

'gather' and 'spread' functions

Sometimes we want data that looks like the first table, and sometimes we want data that looks like the second table. To go from the first table to the second table, we use the gather() function:

gather(key, value, cols_to_gather)

table1 %>% gather(test, score, test1:test2)
  firstname lastname  test score
1      Fred   Flores test1    91
2     Sally    Smith test1    93
3      Fred   Flores test2    82
4     Sally    Smith test2    89

'gather' and 'spread' functions

To go from the second table to the first table, we use the spread() function:

spread(key, value)

table2 %>% spread(test, score)
  firstname lastname test1 test2
1      Fred   Flores    91    82
2     Sally    Smith    93    89

Your turn!

Any questions? Remember, your cheat sheet is a good resource! Use it!

Use your cheat sheet to work through the worksheet.