Data Wrangling Workshop

Paul Gramieri and Jonathan Che
February 15, 2017

What is Data Wrangling?

Often, datasets need manipulation before we can analyze them. We want our data to look like this…

firstname	lastname	age	state
Fred	Flores	24	MA
Sally	Smith	31	NY
Jim	Jones	27	GA

But we often get data that looks more like this:

firstname	lastname	age	state
Fred Flores		Twenty Four	MA
	Smith, Sally	31	New York
Jim	Jones	'27'	The Peach State

Basic Data Wrangling in R

Base R is helpful for some basic manipulation.

Finding size of data frame: use 'dim' function
Extract column from dataset: dataset$variable
Square bracket notation: object[row, column]

dataset[1,2] returns value in 1st row, 2nd column

dataset[3,] returns 3rd row

dataset[,4] returns 4th column

For row or column vectors…

vector[5] returns 5th item

Basic Data Wrangling in R (cont'd)

Finding type of object: use 'class' function
Common types of objects:

data frame, vector, matrix, character, numeric

Test type of object (will return 'TRUE' or 'FALSE'):

is.numeric(), is.vector(), is.data.frame(), etc

Convert type of object:

as.numeric(), as.vector(), as.data.frame(), etc

Packages and Basic Syntax

The 'dplyr' and 'tidyr' packages contain more functions to help us wrangle our data.

library('dplyr')
library('tidyr')

All functions follow the basic syntax:

function(dataset, argument)

See cheat sheet for list of functions. We'll go through some with an example.

Load in Data

We'll be using the 'CPS85' dataset from the 'mosaicData' package.

library('mosaicData')
dim(CPS85)

[1] 534  11

'select' function

We don't care about all 11 columns. Use 'select' to extract the columns you need.

CPS85.a <- CPS85 %>%
  select(wage, educ, sex, age, sector)
head(CPS85.a)

  wage educ sex age   sector
1  9.0   10   M  43    const
2  5.5   12   M  38    sales
3  3.8   12   F  22    sales
4 10.5   12   F  47 clerical
5 15.0   12   M  58    const
6  9.0   16   F  49 clerical

'filter' function

Use 'filter' to extract rows based on some arguments

CPS85.b <- CPS85 %>%
  select(wage, educ, sex, age, sector) %>%
  filter(sector == 'clerical',
         educ >= 12)   # At least a HS education
head(CPS85.b)

   wage educ sex age   sector
1 10.50   12   F  47 clerical
2  9.00   16   F  49 clerical
3  3.50   12   M  53 clerical
4  9.10   13   M  35 clerical
5 11.25   17   F  55 clerical
6  8.00   14   F  35 clerical

'mutate' function

We can also create new variables. Let's create one called 'work_years' that specifies number of years in the workforce.

CPS85.c <- CPS85.a %>%
  mutate(work_years = age - educ - 6)
head(CPS85.c)

  wage educ sex age   sector work_years
1  9.0   10   M  43    const         27
2  5.5   12   M  38    sales         20
3  3.8   12   F  22    sales          4
4 10.5   12   F  47 clerical         29
5 15.0   12   M  58    const         40
6  9.0   16   F  49 clerical         27

'arrange' function

To order our dataset in terms of a certain variable, we can use 'arrange'

CPS85.d <- CPS85.a %>%
  arrange(age)
head(CPS85.d)

   wage educ sex age   sector
1  3.35   12   M  18    other
2  3.35   12   F  18  service
3  5.00   12   F  18 clerical
4 10.00   16   F  18     prof
5  4.00   13   M  19    other
6  6.67   12   M  19    manuf

'group_by' and 'summarize'

If we want to find some summary statistics to compare distinct groups in the data, we can use 'group_by' and 'summarize'

CPS85.e <- CPS85 %>%
  group_by(sex) %>%
  summarize(avg_wage = mean(wage))
CPS85.e

# A tibble: 2 x 2
     sex avg_wage
  <fctr>    <dbl>
1      F 7.878857
2      M 9.994913

'gather' and 'spread' functions

Consider the following two data frames that display the same information about students' scores on two tests:

firstname	lastname	test1	test2
Fred	Flores	91	82
Sally	Smith	93	89

firstname	lastname	test	score
Fred	Flores	test1	91
Sally	Smith	test1	93
Fred	Flores	test2	82
Sally	Smith	test2	89

'gather' and 'spread' functions

Sometimes we want data that looks like the first table, and sometimes we want data that looks like the second table. To go from the first table to the second table, we use the gather() function:

gather(key, value, cols_to_gather)

table1 %>% gather(test, score, test1:test2)

  firstname lastname  test score
1      Fred   Flores test1    91
2     Sally    Smith test1    93
3      Fred   Flores test2    82
4     Sally    Smith test2    89

'gather' and 'spread' functions

To go from the second table to the first table, we use the spread() function:

spread(key, value)

table2 %>% spread(test, score)

  firstname lastname test1 test2
1      Fred   Flores    91    82
2     Sally    Smith    93    89

Your turn!

Any questions? Remember, your cheat sheet is a good resource! Use it!

Use your cheat sheet to work through the worksheet.