5 Data management

The estimated amount of time to complete this chapter is 1-2 hours.

Often we do not receive the data files in a useful form but we will have to do some rearranging and transformation of the data in order to prepare for the statistical analyses. This includes subsetting datasets, merging datasets as well as defining new variables and deleting variables.

Kap 5.1-5.3 fungerer som en intro til hvordan man jonglerer med data i R. Det behoever du ikke bruge tid paa lige nu - men du vil helt sikkert faa glaede af det naar du skal i gang med at arbejde med data selv. Derudover introduceres du her til nogle funktioner, man ofte bruger. Er du presset paa tid nu, saa vent med disse kapitler. Kapitel 5.4 boer du dog gennemgaa inden kursusstart.

Forklar variabeltyper for sleepdata!

5.1 Merge two data sets

Information on our patients can occur in two or more data sets. E.g. consider a situation where one data set contains the information from sleepData and another data set contains individual specific information such as the following (the following data is for illustrative purposes and has absolutely nothing to do with the orginial data in the Cushny and Peebles paper):

sleepPatientInfo <- data.frame(ID = 1:10, BMI = 18:27, 
                               smoking = rep( c("Yes", "No"), 5)
                               )
sleepPatientInfo
   ID BMI smoking
1   1  18     Yes
2   2  19      No
3   3  20     Yes
4   4  21      No
5   5  22     Yes
6   6  23      No
7   7  24     Yes
8   8  25      No
9   9  26     Yes
10 10  27      No

The sleepPatientInfo thus contains information on BMI and whether the patient is smoking or not (we used the rep() (REPeat) function to repeat the vector c('Yes','No') 5 times to obtain a length of 10).

If we wish to add this new information onto the sleepData data frame, we can use the function merge(). To merge two data sets it is important to identify which variable(s) that connects the two data sets. In the sleepData and sleepPatientInformation example it is “ID”. “ID” is then called a key. Using the merge() function where “ID” is the key between the two data sets, we create a new data set sleepMerge that contains information from the sleepData and sleepPatientInformation data set:

sleepData <- sleep
sleepMerge <- merge( x = sleepData, y = sleepPatientInfo, by = "ID" )
head( sleepMerge )
  ID extra group BMI smoking
1  1   0.7     1  18     Yes
2  1   1.9     2  18     Yes
3 10   2.0     1  27      No
4 10   3.4     2  27      No
5  2  -1.6     1  19      No
6  2   0.8     2  19      No

From the Environment pane we see that the sleepMerge data contains 20 observations and 5 variables. I.e. it is a copy of the original sleepData with the two variables BMI and smoking from the sleepPatientInfo added.

Sometimes there may be more than one variable needed to merge two data sets properly (see e.g. statmethods.net to learn the syntax).

5.2 Subsetting

A subset of a data frame is a data frame containing only some of the rows or some of the columns (variables). In Chapter 3.3 we saw how we could define such smaller data sets using indexing. A useful built-in function is the subset() command that creates a subset of an existing data set based on one or more logical conditions on the variables in the data set. The function has the following structure:

subset(x, subset) 

Here x is a data set and subset is the condition (logical expression). How to use the function is shown below:

sleepData <- sleep 
drug1 <- subset(sleepData, group == 1)
drug2 <- subset(sleepData, group == 2)

Here the first subset only contains the data lines where the group variable equals the value 1. Note that we use double equal signs for comparisons (asking whether group “is equal to” 1). The second subset similarly contains the data lines where the group variable equals 2. The sleep data has 20 observations and from the Environment pane we see that we now have two datasets drug1 and drug2 each with 10 observations and 3 variables. Have a look at each of the datasets to see what they contain (using e.g. View()).

5.3 Long vs. wide format

A data set may contain several measurements on the same individuals. Such a data set often occur in either of the two formats: wide or long. We will in this section explain the differences between the two formats and show how to change from one to another.

We can represent the sleepData data set both in long and wide format. Both formats will contain the same information but in different forms. In the long format the data contains one row for each measurement and each patient contributes with several lines. In the wide format there is only one row for each patient and each measurement is in a separate column. The original format of the sleepData data set is the long format. Here long format refers to the fact that we have repeated measurements (multiple measurements on the same individual represented by ID) and the measurements are listed on top of each other instead of side-by-side. Often the long format is preferred when doing a statistical analysis, but this is not always the case!

5.3.1 From long to wide format

A wide format of the sleepData data set can be achieved by placing values of the extra column of group 2 next to the group 1 extra column and thereby reducing the data frame from 20 lines to 10 lines.

We may use the above two subsets drug1 and drug2 containing the group 1 respectively group 2 measurements (Chapter 5.2). In wide format we will need 10 lines (one for each individual) and three variables (the ID, the extra hours slept on drug 1 and the extra hours slept on drug 2).

From drug1 we will extract the ID and the extra hours slept on drug1. We rename the extra variable as we need a new name to make it clear that these measurements were taken while the patients were on drug 1. Lets denote this variable hoursDrug1. In practice we define the new variable hoursDrug1 as a copy of the extra variable and afterwards delete the extra variable:

drug1$hoursDrug1 <- drug1$extra
drug1$extra <- NULL
head( drug1 )
  group ID hoursDrug1
1     1  1        0.7
2     1  2       -1.6
3     1  3       -0.2
4     1  4       -1.2
5     1  5       -0.1
6     1  6        3.4

Note that we deleted the extra variable by assigning it the value NULL which is basically just an empty vector. We should also get rid of the group variable as we do not need that anymore. We could use drug1$group <- NULL as above, another way is to refer specifically to the column we wish to delete using indexing:

# Delete column no 1 
drug1 <- drug1[ ,-1]
head( drug1 )
  ID hoursDrug1
1  1        0.7
2  2       -1.6
3  3       -0.2
4  4       -1.2
5  5       -0.1
6  6        3.4

We could also have specified that we only wanted column no 2 and 3 as drug1 <- drug1[ ,2:3].

The same maneuver is performed for the drug2 data set:

drug2$hoursDrug2 <- drug2$extra
drug2$extra <- NULL
drug2$group <- NULL

Note that we carefully defined the two datasets, drug1 and drug2, to only have the ID variable in common in order to be able to ‘set the data sets side by side’. We may do that using the merge() function described above (Chapter 5.1):

sleepWide <- merge( drug1, drug2, by='ID')
sleepWide
   ID hoursDrug1 hoursDrug2
1   1        0.7        1.9
2  10        2.0        3.4
3   2       -1.6        0.8
4   3       -0.2        1.1
5   4       -1.2        0.1
6   5       -0.1       -0.1
7   6        3.4        4.4
8   7        3.7        5.5
9   8        0.8        1.6
10  9        0.0        4.6

Voila! However note that merge() has changed the order or the ID variable. If we want our data sorted by the ID number we sleepWide data by IDnum using the function order():

sleepWide <- sleepWide[ order(sleepWide$ID), ]
sleepWide
   ID hoursDrug1 hoursDrug2
1   1        0.7        1.9
3   2       -1.6        0.8
4   3       -0.2        1.1
5   4       -1.2        0.1
6   5       -0.1       -0.1
7   6        3.4        4.4
8   7        3.7        5.5
9   8        0.8        1.6
10  9        0.0        4.6
2  10        2.0        3.4

To see a few more examples of sorting, see e.g. statmethods.net.

Exercise: Order the data by ID using indexing instead of order(). You first need to run sleepWide <- merge( drug1, drug2, by='ID') to redefine the unsorted data frame. Now use indexing (Chapter 3.3) to have the appropriate order of the data.

Click here for a solution. In the unsorted data set we first need the first row, then the third, fourth etc until the 10’th row. The second row should be the last. I.e. we need the following order of the data lines:

sleepWide <- merge( drug1, drug2, by='ID')
sleepWide <- sleepWide[ c(1,3:10,2), ]
sleepWide
   ID hoursDrug1 hoursDrug2
1   1        0.7        1.9
3   2       -1.6        0.8
4   3       -0.2        1.1
5   4       -1.2        0.1
6   5       -0.1       -0.1
7   6        3.4        4.4
8   7        3.7        5.5
9   8        0.8        1.6
10  9        0.0        4.6
2  10        2.0        3.4


If you need to organize your own data from long to wide it might be worth having a closer look at the dcast() function in the reshape2() package.

5.3.2 From wide to long format

To reproduce the long format of the sleepData data set from the sleepWide data set, the two columns hoursDrug1 and hoursDrug2 must be placed on top of each other and we need a variable indicating which group the measurement belongs to (like the group variable in sleepData). This can be achieved by 1) separating hoursDrug1 and hoursDrug2 into two separate data frames, 2) adding a group variable to each data frame, 3) rename the variables of the data frames to contain the same variable names and 4) pasting the two data frames on top of each other.

# 1) make a copy of the wideData, delete hoursDrug2
wide1 <- sleepWide
wide1$hoursDrug2 <- NULL
# 2) adding the group variable
wide1$group <- 1

Renaming can be done by making a copy of hoursDrug1 to e.g. hoursDrug and then deleting the hoursDrug1 variable afterwards. Another solution is to use the command rename.variable() in the questionr package. We therefore need to install the package using the command install.packages('questionr') (see 4.7 for details on how to install packages) and then use

rename.variable( df, old, new )

with df being the data frame, old being the old name of the variable, new being the new name (the names need to be specified using quotation).

# 3) Renaming the variables
library(questionr)
wide1 <- rename.variable( wide1, old = "hoursDrug1", new = "hoursDrug" )
head( wide1 )
  ID hoursDrug group
1  1       0.7     1
2  2      -1.6     1
3  3      -0.2     1
4  4      -1.2     1
5  5      -0.1     1
6  6       3.4     1

We do the same for the hoursDrug2 variable

wide2 <- sleepWide
wide2$hoursDrug1 <- NULL
wide2$group <- 2
wide2 <- rename.variable( wide2, old = "hoursDrug2", new = "hoursDrug" )

The two data sets now contain identical variable names and we may place them on top of each other using rbind() (row bind):

# 4) 
sleepLong <- rbind( wide1, wide2 )
sleepLong
   ID hoursDrug group
1   1       0.7     1
2   2      -1.6     1
3   3      -0.2     1
4   4      -1.2     1
5   5      -0.1     1
6   6       3.4     1
7   7       3.7     1
8   8       0.8     1
9   9       0.0     1
10 10       2.0     1
11  1       1.9     2
12  2       0.8     2
13  3       1.1     2
14  4       0.1     2
15  5      -0.1     2
16  6       4.4     2
17  7       5.5     2
18  8       1.6     2
19  9       4.6     2
20 10       3.4     2

Voila! We now have a data set identical to the sleep data except that the original extra variable now has the name hoursDrug.

Exercise 1: Reorder the sleepLong data set such that the first variable is the ID variable, the second is the group and the third is hoursDrug.

Click here for a solution.

# we need the order corresponding 1st, 3rd and 2nd variable
sleepLong <- sleepLong[ , c(1,3,2)]
# or referring to the variable names
sleepLong <- sleepLong[ , c("ID","group","hoursDrug")]

Exercise 2: Reorder the sleepLong data such that the first two rows correspond to individual with ID=1, the third and fourth row to individual with ID=2 etc.

Click here for a solution.
We can sort the data according to the ID variable using order():

sleepLong <- sleepLong[ order(sleepLong$ID) , ]
head(sleepLong)
   ID group hoursDrug
1   1     1       0.7
11  1     2       1.9
2   2     1      -1.6
12  2     2       0.8
3   3     1      -0.2
13  3     2       1.1

Exercise 3: Reorder the sleepLong data such that the first row correspond to ID=1 and group=2, the second ID=1 and group=1, the third ID=2 and group=1 etc. I.e. a data frame sorted according to increasing ID and decreasing group within each individual. Consult statmethods.net to find out how to do that.

Click here for a solution.
We can sort the data according toincreasing ID and decreasing group within ID:

sleepLong <- sleepLong[ order(sleepLong$ID,-sleepLong$group) , ]
head(sleepLong)
   ID group hoursDrug
11  1     2       1.9
1   1     1       0.7
12  2     2       0.8
2   2     1      -1.6
13  3     2       1.1
3   3     1      -0.2



If you need to organize your own data from long to wide it might be worth having a closer look at the melt() function in the reshape2() package.

5.4 Defining new variables

We often need to derive new variables from the variables we have registered in a data set. In the quiz in section 3.4 we defined a new variable in the sleepData data set as we defined the increase in sleep in minutes from the original variable extra containing the increase in sleep in hours. In this chapter we show how to define various kinds of new variables based on the variables in the SundBy data set.

5.4.1 Calculation of new variables

The data SundBy is loaded into a data frame named d (as described in Chapter 4):

d <- read.csv('http://publicifsv.sund.ku.dk/~sr/sundby.csv')

We can calculate new variables based on the variables in the data set. For example we often need to log-transform our variables when performing a regression analysis. We will then need a new variable in the data set being the log of the variable of interest, for example the log of weight:

d$logwgt <- log( d$wgt )

Note that we may use $-notation to add a new variable to a data set as well as we use $-notation to refer to an existing variable.


In the video below (5:28 min)we show how to add a BMI-variable to the SundBy data set. The BMI is determined from weight and height using the formula \[\begin{eqnarray*} {\rm BMI} &=& \frac{\rm weight\ in\ kilo}{\rm (height\ in\ meters)^2} \end{eqnarray*}\]


In the video we defined the bmi variable in two different ways:

# Defining height in meters before calculating BMI
d$htms <- d$ht / 100
d$bmi <- d$wgt / d$htms^2

# Calculating height in meters inside the calculation - be careful with ()!
d$bmi <- d$wgt / (d$ht/100)^2

An important point, when defining new variables, is that we should always check the values of the new variables to make sure that we did the calculations right. In the video we carefully used the head() command every time we modified the code.

5.4.2 Grouped variables from numeric variables

Sometimes we wish to break a quantitative (numeric or integer) variable into groups. A group variable is also termed a categorical variable or a factor variable. For this purpose we may use the cut() function.

In the video below (5:23 min) we demonstrate how to make a grouped version of the BMI-variable with values “Underweight” / “Normal” / “Overweight” / “Obese” corresponding to the groups <= 20, 20-25, 25-30, > 30:


Click here to find the code produced in the video

d <- read.csv("http://publicifsv.sund.ku.dk/~sr/intro/datasets/sundby0_English.csv")
d$bmi <- d$wgt / (d$ht/100)^2
head(d)

summary( d$bmi )

d$bmiGrp <- cut( d$bmi, breaks=c(16,20,25,30,36),
                 labels=c('Underweight','Normal','Overweight','Obese'))
summary( d$bmiGrp )

Contents of the video:

The cut() function is used to split a numerical variable into a grouped variable based on its values. The function takes the following arguments:

cut(x, breaks, labels, right)

Here x is the numerical vector to be divided into groups. The argument breaks is a vector specifying the values defining the groups. Arguments labels and right are optional.

Having specified breaks=c(16,20,25,30,36) as in the video, the groups will be defined as (16,20], (20-25] etc. The open paranthesis ( is to be understood such that the value is not included in the interval, the bracket paranthesis such that the value is included in the interval. E.g. the second group (20-25] will contain values from > 20 (but not =20!) to 25 (25 included). To define the groups the other way around, i.e. including the lower value in each interval (e.g. [20,25)), use additional argument right=FALSE.

Warning: Always make sure that the minimum (maximum) value specified in the breaks are smaller (larger) than the smallest (largest) observed value - otherwise missing values will occur. Use e.g. a summary() of your variable to determine these values.

The label argument can be used to add informative names to the groups, e.g. labels=c('Underweight','Normal','Overweight','Obese'). There have to be as many names as groups and the names have to be specified using quotation around each name.

Quiz

How many people are normal weight in the SundBy data set?

Test your result and find multiple solutions here.

5.4.3 Binary variables

The ifelse() function can be used to create new variables with 2 levels. It has the following form:

ifelse(test, yes, no)

Here test is a logical statement that produces a TRUE/FALSE vector, if the statement is TRUE, the custom value for yes is returned and if the statement is FALSE, the custom value for no is returned. The yes and no arguments can be any type of element (numeric, integer, character), but they must be of the same type.

Say we wish to make an indicator that returns 1 if the patient is normal weight (BMI above 20 and below 25) and 0 otherwise. This can be done using the ifelse function:

d$normalBMI <- ifelse( (20 < d$bmi & d$bmi <= 25), 1, 0 ) 

Using the head() function, we find that the indicator seems to return the correct result:

head( d, 6 )
    id gender age  wgt  ht      bmi      bmiGrp normalBMI
1 1434      2  31 70.0 172 23.66144      Normal         1
2  978      1  28 68.0 170 23.52941      Normal         1
3  701      2  55   NA 165       NA        <NA>        NA
4  134      1  26 76.0 180 23.45679      Normal         1
5  292      2  21 60.0 175 19.59184 Underweight         0
6 1481      2  27 61.5 178 19.41043 Underweight         0

5.4.4 Combining groups

Sometimes we wish to group character vectors with a large number of categories into fewer categories. To do that we use a manual approach.

5.4.4.1 Combining two groups

There are only very few underweight people in the SundBy data and we may therefore wish to combine the underweight and normal weight people. We do that by making a copy of the bmiGrp variable named bmi3Grp and next assigning the underweight people the value Normal:

d$bmi3Grp <- d$bmiGrp
d$bmi3Grp[ d$bmiGrp == 'Underweight' ] <- 'Normal'

We should always perform a check of our new variable. We can do that using summary():

> summary( d$bmiGrp )
Underweight      Normal  Overweight       Obese        NA's 
         18         128          35          12           7 
> summary( d$bmi3Grp )
Underweight      Normal  Overweight       Obese        NA's 
          0         146          35          12           7 

Note that the bmi3Grp variable now as expected has 18+128=146 people registered as normal weight.

5.4.4.2 Combine several groups

Suppose we want a variable indicating whether a person is overweight or not. Here we may use the ifelse() function described above or we may use a manual approach:

d$obese[ d$bmiGrp == 'Obese' ] <- 'Obese'
d$obese[ d$bmiGrp %in% c('Underweight','Normal','Overweight') ] <- 'Not Obese'

We use %in% and not == when we wish to match with more than one element.

Note that we cannot follow the procedure used in Section 5.4.4.1 as we need a new level Not obese for our factor variable

> d$obese <- d$bmiGrp
> d$obese[ d$bmiGrp %in% c('Underweight','Normal weight','Overweight') ] <- 'Not obese'
Warning in `[<-.factor`(`*tmp*`, d$bmiGrp %in% c("Underweight", "Normal
weight", : invalid factor level, NA generated

We might also define the new variable in several steps instead of using %in%:

d$obese[ d$bmiGrp == 'Obese' ] <- 'Obese'
d$obese[ d$bmiGrp == 'Underweight' ] <- 'Not Obese'
d$obese[ d$bmiGrp == 'Normal weight' ] <- 'Not Obese'
d$obese[ d$bmiGrp == 'Overweight' ] <- 'Not Obese'

5.4.5 Renaming variables

There are several ways to rename a variable in a data frame. To show how to rename variables we will work on a copy of the SundBy dataset named d2 as the code in the following chapters will not work with the new variable names.

d2 <- read.csv('http://publicifsv.sund.ku.dk/~sr/sundby.csv')

The names() command gives the variable names for a data frame:

> names(d2)
[1] "id"     "gender" "age"    "wgt"    "ht"    

We can change the name of a variable by changing the names of the data frame. We may specify for which column to change the name, e.g.

> names(d2)[4]
[1] "wgt"
> names(d2)[4] <- 'weight'
> names(d2)
[1] "id"     "gender" "age"    "weight" "ht"    

We can also change the name of the variable using indexing (picking the element of the names for which the names of d2 equals 'ht')

> names(d2)[ names(d2)=='ht' ]
[1] "ht"
> names(d2)[ names(d2)=='ht' ] <- 'height'
> names(d2)
[1] "id"     "gender" "age"    "weight" "height"

An alternative solution is to make a copy of the variable and next to delete the original variable:

> d2$idno <- d2$id
> d2$id <- NULL
> head( d2, 3 )
  gender age weight height idno
1      2  31     70    172 1434
2      1  28     68    170  978
3      2  55     NA    165  701

Note that new variables are always added to the right in the data frame. If we want our new id-variable as the first variable in the dataset, we may specify that we want column number 5 to be the first, followed by column 1 to 4:

> d2 <- d2[ ,c(5,1:4)]
> head( d2, 3 )
  idno gender age weight height
1 1434      2  31     70    172
2  978      1  28     68    170
3  701      2  55     NA    165

5.5 Quiz

Before you start the quiz it is important that you understood how to use the functions in this chapter. We also expect that you have knowledge about handling missing values and some of the functions from other chapters. If you do not - read the content summaries and maybe the previous chapter + the chapther about vectors and data frames. We have prepared 4 questions to this chapter.


LAV IKKE DISSE QUIZZER ENDNU, DE ER IKKE HELT TILPASSET KAPITLET

5.5.1 Quiz question 1

Use the sleepMerge data set created in this chapter.

By help of the subset function, create a new data frame including only patients that belonged to Viborg or Odense hospital.

What logical statement should you use in your subset function?

Start the quiz here.

5.5.2 Quiz question 2

Use the sleepMerge data set created in this chapter.

Use the cut function to assign a grouped variable to the sleepMerge data set, call the new variable bmiGrp. The groups of BMI is defined as follows: Underweight: BMI is less than 18.5, Normal Weight: BMI is 18.5 to 25, Overweight: BMI is more than 25 to 30 and Obese: BMI is more than 30 to …

What vector is appropriate for the breaks argument?

Start the quiz here.

5.5.3 Quiz question 3

Use the sleepData or sleepMerge data set.

Create a new variable using the ifelse() function that indicates whether or not sleep increased.

How many times were sleep increased?

Start the quiz here.

5.5.4 Quiz question 4

The natural logarithm (log-function in R) returns NaN if you use it on a negative value. Try to add a new variable extraLog to the sleepData data set. The variable should be equal to the logarithm of the extra value if greater than 0 and NA otherwise.

What is the mean of the new column?

Start the quiz here.

bookdown::render_book(“index.Rmd”, “bookdown::gitbook”)