Paul Gramieri and Jonathan Che
February 15, 2017
Often, datasets need manipulation before we can analyze them. We want our data to look like this…
firstname | lastname | age | state |
---|---|---|---|
Fred | Flores | 24 | MA |
Sally | Smith | 31 | NY |
Jim | Jones | 27 | GA |
But we often get data that looks more like this:
firstname | lastname | age | state |
---|---|---|---|
Fred Flores | Twenty Four | MA | |
Smith, Sally | 31 | New York | |
Jim | Jones | '27' | The Peach State |
Base R is helpful for some basic manipulation.
dataset[1,2] returns value in 1st row, 2nd column
dataset[3,] returns 3rd row
dataset[,4] returns 4th column
vector[5] returns 5th item
data frame, vector, matrix, character, numeric
is.numeric(), is.vector(), is.data.frame(), etc
as.numeric(), as.vector(), as.data.frame(), etc
The 'dplyr' and 'tidyr' packages contain more functions to help us wrangle our data.
library('dplyr')
library('tidyr')
All functions follow the basic syntax:
function(dataset, argument)
See cheat sheet for list of functions. We'll go through some with an example.
We'll be using the 'CPS85' dataset from the 'mosaicData' package.
library('mosaicData')
dim(CPS85)
[1] 534 11
We don't care about all 11 columns. Use 'select' to extract the columns you need.
CPS85.a <- CPS85 %>%
select(wage, educ, sex, age, sector)
head(CPS85.a)
wage educ sex age sector
1 9.0 10 M 43 const
2 5.5 12 M 38 sales
3 3.8 12 F 22 sales
4 10.5 12 F 47 clerical
5 15.0 12 M 58 const
6 9.0 16 F 49 clerical
Use 'filter' to extract rows based on some arguments
CPS85.b <- CPS85 %>%
select(wage, educ, sex, age, sector) %>%
filter(sector == 'clerical',
educ >= 12) # At least a HS education
head(CPS85.b)
wage educ sex age sector
1 10.50 12 F 47 clerical
2 9.00 16 F 49 clerical
3 3.50 12 M 53 clerical
4 9.10 13 M 35 clerical
5 11.25 17 F 55 clerical
6 8.00 14 F 35 clerical
We can also create new variables. Let's create one called 'work_years' that specifies number of years in the workforce.
CPS85.c <- CPS85.a %>%
mutate(work_years = age - educ - 6)
head(CPS85.c)
wage educ sex age sector work_years
1 9.0 10 M 43 const 27
2 5.5 12 M 38 sales 20
3 3.8 12 F 22 sales 4
4 10.5 12 F 47 clerical 29
5 15.0 12 M 58 const 40
6 9.0 16 F 49 clerical 27
To order our dataset in terms of a certain variable, we can use 'arrange'
CPS85.d <- CPS85.a %>%
arrange(age)
head(CPS85.d)
wage educ sex age sector
1 3.35 12 M 18 other
2 3.35 12 F 18 service
3 5.00 12 F 18 clerical
4 10.00 16 F 18 prof
5 4.00 13 M 19 other
6 6.67 12 M 19 manuf
If we want to find some summary statistics to compare distinct groups in the data, we can use 'group_by' and 'summarize'
CPS85.e <- CPS85 %>%
group_by(sex) %>%
summarize(avg_wage = mean(wage))
CPS85.e
# A tibble: 2 x 2
sex avg_wage
<fctr> <dbl>
1 F 7.878857
2 M 9.994913
Consider the following two data frames that display the same information about students' scores on two tests:
firstname | lastname | test1 | test2 |
---|---|---|---|
Fred | Flores | 91 | 82 |
Sally | Smith | 93 | 89 |
firstname | lastname | test | score |
---|---|---|---|
Fred | Flores | test1 | 91 |
Sally | Smith | test1 | 93 |
Fred | Flores | test2 | 82 |
Sally | Smith | test2 | 89 |
Sometimes we want data that looks like the first table, and sometimes we want data that looks like the second table. To go from the first table to the second table, we use the gather() function:
gather(key, value, cols_to_gather)
table1 %>% gather(test, score, test1:test2)
firstname lastname test score
1 Fred Flores test1 91
2 Sally Smith test1 93
3 Fred Flores test2 82
4 Sally Smith test2 89
To go from the second table to the first table, we use the spread() function:
spread(key, value)
table2 %>% spread(test, score)
firstname lastname test1 test2
1 Fred Flores 91 82
2 Sally Smith 93 89
Any questions? Remember, your cheat sheet is a good resource! Use it!
Use your cheat sheet to work through the worksheet.