3 Data structures in R

The estimated amount of time to complete this chapter is 1-2.5 hours.

This chapter introduces data formats and structures in R, in particular how data sets are organized and how we access the values in a data set. You have probably worked with data sets in Excel. Data sets are organized in a similar way in R, namely in rows and columns. You will read about data sets and watch two videos. Finally, there is a quiz where you will be guided through some of the steps illustrated in the videos using a specific R data set.

Depending on your familiarity with R, we expect you to spend between 45 min and 3 hours on this chapter.

3.1 Data types

Every object in R is of a certain type. The type is found using the typeof() command, which returns the type of the object in the parentheses. Here we mention some of the most common data types.

Numeric

All calculations we have done so far have been with numbers of the numeric class. These include integers, decimal numbers, constants like \(\pi\) or \(e\) and many more. To check if an object, for instance \(\pi\), is numeric, write

> is.numeric(pi)
[1] TRUE

Note that numeric is not itself a type; rather, objects of several types belong to the numeric class. The type of \(\pi\) is the same as the type of 2:

> typeof(pi)
[1] "double"
> typeof(2)
[1] "double"

The double type is the most common numeric type, since R stores numbers as doubles unless told otherwise.
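As a small illustration of the point that numeric covers more than one type, the sketch below compares a plain number with one written with the suffix L, which asks R to store it as an integer. The values are chosen only for illustration.

```r
# A plain number is stored as a double
typeof(2)       # "double"
# The suffix L requests integer storage
typeof(2L)      # "integer"
# Both belong to the numeric class
is.numeric(2)   # TRUE
is.numeric(2L)  # TRUE
```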

Text values

Often a variable will not be a number but a word or a combination of letters with a certain meaning. Objects like these are of the character type. We create a character type object and then check that it actually is of the desired type.

> name <- "health_variable"
> typeof(name)
[1] "character"

Character strings can include symbols such as _, - and &, and they can also include digits.

> name <- "health_variable_1"
> typeof(name)
[1] "character"

Note that this can lead to confusion when a number, for instance 2.5, is saved as a character string instead of as a number:

> variable <- "2.5"
> typeof(variable)
[1] "character"

This is a common issue when working with real-life data. If we would like the variable to be treated as a numeric object of the double type, we can change the type by writing

> variable <- "2.5"
> variable <- as.numeric(variable)
> typeof(variable)
[1] "double"
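The conversion only works when the text actually looks like a number. The following minimal sketch uses a made-up text value, "unknown", to show what happens otherwise: as.numeric() returns the missing value NA and R issues a warning (silenced here with suppressWarnings() so the example runs quietly).

```r
# Hypothetical text value that cannot be read as a number
text_value <- "unknown"
# as.numeric() gives NA for such values (and normally prints a warning)
converted <- suppressWarnings(as.numeric(text_value))
converted         # NA
is.na(converted)  # TRUE
typeof(converted) # "double"
```

Checking for NA after a conversion is therefore a simple way to spot entries that were not valid numbers.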

Logical values

A logical value indicates whether something is TRUE or FALSE. To check if two things are equal in R, we have to use two equals signs, ==. For instance:

> 7 + 11 == 18
[1] TRUE
> 7 + 11 == exp(5)
[1] FALSE

As expected, the output tells us that the first statement is true while the second is false. To check if two things are not equal, write !=

> "variable" != "cariable"
[1] TRUE

Likewise, we can use operators such as <= (less than or equal to) and >= (greater than or equal to), along with < and >.

> 2 >= 3
[1] FALSE

It is worth noting that R can perform calculations with logical values as it stores TRUE values as 1 and FALSE values as 0.

> 3 + (5 == 5)
[1] 4
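Because TRUE counts as 1 and FALSE as 0, adding several comparisons counts how many of them hold. A minimal sketch with made-up comparisons (the name conditions_met is only illustrative):

```r
# Each comparison is TRUE or FALSE; adding them counts how many are TRUE
conditions_met <- (2 > 1) + (3 > 5) + (4 == 4)
conditions_met     # 2
# The underlying numeric values of the logical constants
as.numeric(TRUE)   # 1
as.numeric(FALSE)  # 0
```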

3.2 Vectors

If we have several objects of the same type, we can combine them in a vector by using the combine function, c(). The numbers 2, 5 and -3.5 are stored in a vector called y by writing

> y <- c(2, 5, -3.5)
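A vector can hold only one type. If values of different types are passed to c(), R converts them all to a single common type. The sketch below uses made-up values and illustrative names to show two common cases.

```r
# Mixing numbers and text: every element becomes character
mixed <- c(2, "five", -3.5)
typeof(mixed)                 # "character"

# Mixing numbers and logical values: TRUE/FALSE become 1/0
numbers_and_logicals <- c(2, TRUE, FALSE)
typeof(numbers_and_logicals)  # "double"
numbers_and_logicals          # 2 1 0
```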

How to do calculations using vectors is illustrated in the video below (9:45 min).

Code produced in the video:

# Author: Anne 
# Description: Basic data structures in R

# vectors
x <- c(1,0,1,0,1,1,1,1,0,0)

?c # first introduction to a function - how to get help
help(c)

(x+2)^2 # apply a calculation to all entries in a vector

height <- c(1.65, 1.79, 1.62, 1.87) # store more meaningful vectors
weight <- c(55.2, 89.7, 49.8, 92.0)

bmi <- weight/height^2 # use the vectors to calculate new information
bmi2 <- weight/(height^2) # same result: ^ is evaluated before /

FirstName <- c("Anne", "Anna", "Anders", "Andreas") # store characters
mathMajor <- c(TRUE, FALSE, TRUE, FALSE) # store logical values

typeof(FirstName) # find the type
FirstName
mathMajor

# indexes for vectors
FirstName
FirstName[3]
FirstName[c(1,2,3)]

FirstName[3] <- "Andre" # change an entry
FirstName

Contents of the video:

A vector is a collection of elements all of the same type. A coarse overview of the types includes:

  • Numeric (decimal numbers and integers)
    • E.g. 1, -1, 0, 3.98, 3.14 etc.
  • Logical (true or false indicators)
    • TRUE/FALSE and T/F
  • Character (names, levels etc.)
    • E.g. "Apple", "Pear", "A", "B" etc.

Besides having a type, a vector also has a length (i.e. the number of elements). The length can be found by:

length(vector)
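As a concrete instance, with a made-up vector named temperatures:

```r
# Four hypothetical measurements
temperatures <- c(36.6, 37.2, 38.1, 36.9)
length(temperatures)        # 4
# length() works the same way for character and logical vectors
length(c("Apple", "Pear"))  # 2
```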

To extract elements of a vector, one can use indexing. An example of indexing is shown below, where first element 1 and then elements 2 to 5 are extracted:

> x <- c(1,-1,0,4,5,9,57)
> x
[1]  1 -1  0  4  5  9 57
> x[1]
[1] 1
> x[2:5]
[1] -1  0  4  5

You can also extract based on conditions instead of position. Say that you want all of the values larger than 1:

> x[x>1]
[1]  4  5  9 57

x > 1 is a condition that gives a logical vector as output:

> x>1 
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

This logical output is what selects the correct elements in the vector x:

> x[c(FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE)] 
[1]  4  5  9 57
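Conditions can also be combined, counted and located. The sketch below reuses the vector x from above; the operators & ("and") and | ("or") and the functions sum() and which() are standard R, but the particular conditions are chosen only for illustration.

```r
x <- c(1, -1, 0, 4, 5, 9, 57)

# & means "and": values larger than 1 and smaller than 10
x[x > 1 & x < 10]   # 4 5 9
# | means "or": values smaller than 0 or larger than 10
x[x < 0 | x > 10]   # -1 57
# sum() counts how many elements satisfy the condition (TRUE counts as 1)
sum(x > 1)          # 4
# which() returns the positions of those elements
which(x > 1)        # 4 5 6 7
```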

3.3 Data frames

Data sets in R are saved in data frames. A data frame consists of vectors of equal length, organized in columns, which makes it two-dimensional. A data frame thus consists of rows (one for each subject in our study) and columns (one for each variable/vector). In this video (12 min) you are introduced to data frames: how they can be defined and how to locate specific values within a data frame.

Code produced in the video:

# Author: Anne 
# Description: Basic data structures in R

# vectors
x <- c(1,0,1,0,1,1,1,1,0,0)

?c # first introduction to a function - how to get help
help(c)

(x+2)^2 # apply a calculation to all entries in a vector

height <- c(1.65, 1.79, 1.62, 1.87) # store more meaningful vectors
weight <- c(55.2, 89.7, 49.8, 92.0)

bmi <- weight/height^2 # use the vectors to calculate new information
bmi2 <- weight/(height^2) # same result: ^ is evaluated before /

FirstName <- c("Anne", "Anna", "Anders", "Andreas") # store characters
mathMajor <- c(TRUE, FALSE, TRUE, FALSE) # store logical values

typeof(FirstName) # find the type
FirstName
mathMajor

# indexes for vectors
FirstName
FirstName[3]
FirstName[c(1,2,3)]

FirstName[3] <- "Andre" # change an entry
FirstName

# collecting data (data.frame)
roomies <- data.frame(FirstName,height,weight,mathMajor)
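roomies$age <- c(54,25,46,76) # add a new column

# age # fails with "object 'age' not found": age only exists inside roomies
roomies$age # collect the age variable from the roomies dataset

# indexes on a data frame
roomies
roomies[2] # second column, returned as a data frame
roomies[,2] # second column, returned as a vector
roomies[2,] # second row
roomies[2,"age"] # age of the second person
roomies[2,5] # same value: age is the fifth column
roomies$age[2] # same value again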

Contents of the video:

A data frame may contain different types of data, but all values within a single column must be of the same type. An example of a data frame named kids with columns of different types is the following:

> kids <- data.frame(subject = c(1,2,3,4), gender = c("F","M","F","M"), age = c(7, 5, 9, 2))
> kids
  subject gender age
1       1      F   7
2       2      M   5
3       3      F   9
4       4      M   2

For each of the 4 kids, the data set contains the subject id, gender and age. To find the number of rows (observations, here one per subject) and the number of columns (variables) we may use the command dim:

> dim(kids)
[1] 4 3

The data consists of four observations (kids) and there are three columns (variables) in total.
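The two numbers returned by dim can also be obtained one at a time. A short sketch using the standard functions nrow(), ncol() and names() on the kids data frame defined above:

```r
kids <- data.frame(subject = c(1, 2, 3, 4),
                   gender = c("F", "M", "F", "M"),
                   age = c(7, 5, 9, 2))

nrow(kids)   # 4, the number of rows (observations)
ncol(kids)   # 3, the number of columns (variables)
names(kids)  # "subject" "gender" "age"
```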

Extracting elements of a data frame can be done in multiple ways. To extract one particular element, the two following methods can be applied:

kids[i,j]
kids[i,"gender"]

Here i can be any number between 1 and the number of rows in the data frame (4 in our example), and j can be any number between 1 and the number of columns (3 in our example). Instead of a number, j can also be a column (variable) name, which must be enclosed in single ('') or double ("") quotation marks. The code above will not run as written, because i and j are not defined. To obtain the value in the 2nd row of the 2nd column, for example, write:

> kids[2,2]
[1] M
Levels: F M
> kids[2,"gender"]
[1] M
Levels: F M
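The Levels line in the output above appears when the gender column is stored as a factor, which was the default for text columns in data.frame() before R 4.0.0. From R 4.0.0 onwards text columns stay character unless stringsAsFactors = TRUE is given, and the same commands then print [1] "M". A short sketch of the difference, using the illustrative names kids_chr and kids_fct:

```r
# Text column kept as character (the default from R 4.0.0)
kids_chr <- data.frame(gender = c("F", "M", "F", "M"),
                       stringsAsFactors = FALSE)
# Text column converted to a factor (the default before R 4.0.0)
kids_fct <- data.frame(gender = c("F", "M", "F", "M"),
                       stringsAsFactors = TRUE)

class(kids_chr$gender)   # "character"
class(kids_fct$gender)   # "factor"
levels(kids_fct$gender)  # "F" "M"
```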

Extracting entire rows and columns is also possible:

#extract entire 2nd row
kids[2,]
#extract entire 2nd column
kids[,2]
kids[,"gender"]
kids$gender

The $ notation, used here to extract the values of the gender column, relies on gender being the name of one of the columns in the kids data set.

To extract e.g. the first 3 rows of the data set one can use a sequence. The sequence 1:3 generates the following:

> 1:3
[1] 1 2 3

Using the sequence on the kids data set extracts the first 3 rows:

> kids[1:3,]
  subject gender age
1       1      F   7
2       2      M   5
3       3      F   9

The first 3 columns can be extracted in the same manner: kids[,1:3]. The : operator creates a sequence between any two whole numbers, negative or positive. Try to play around a bit with positive and negative numbers!
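A few sequences to start from; the numbers are arbitrary. Note that a minus sign in front of the first number applies to that number only, so parentheses are needed to negate a whole sequence.

```r
-2:2     # -2 -1  0  1  2
5:1      # 5 4 3 2 1 (counts downwards)
-1:3     # -1  0  1  2  3 (starts at -1)
-(1:3)   # -1 -2 -3 (negates the sequence 1 2 3)
```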

Just as with vectors, you can also extract observations based on a specific condition on the rows. If we wish to see the females only, we may use:

> kids[ kids$gender=="F",]
  subject gender age
1       1      F   7
3       3      F   9
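Row conditions can be combined with each other and with a column selection. The following sketch, again on the kids data frame, is one possible way to do this; the names older_girls and boys_age are only illustrative.

```r
kids <- data.frame(subject = c(1, 2, 3, 4),
                   gender = c("F", "M", "F", "M"),
                   age = c(7, 5, 9, 2))

# Rows where both conditions hold: girls older than 7
older_girls <- kids[kids$gender == "F" & kids$age > 7, ]
older_girls$subject  # 3

# Rows chosen by a condition, column chosen by name: ages of the boys
boys_age <- kids[kids$gender == "M", "age"]
boys_age             # 5 2
```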

3.4 Quiz

R has several built-in data sets. In this quiz we will consider the data set named sleep, based on the paper by Cushny and Peebles (1905), "The action of optical isomers: II hyoscines", which compares the effect of two soporific drugs on the number of hours of sleep in a group of 10 patients.

In the sleep study, each patient was studied over several nights while given 1) Hyoscyamine, 2) Hyoscine or 3) no treatment. The average hours of sleep with each treatment were registered. We are interested in comparing each of the two treatments to no treatment as well as comparing the two active treatments.

The sleep data contains two measurements for each patient: The difference in average hours of sleep with Hyoscyamine compared to control and Hyoscine compared to control. The data has three variables: extra is the difference between hours of sleep on treatment and hours of sleep without treatment, group indicates the treatment (1=Hyoscyamine, 2=Hyoscine) and ID is the patient id number:

sleep
   extra group ID
1    0.7     1  1
2   -1.6     1  2
3   -0.2     1  3
4   -1.2     1  4
5   -0.1     1  5
6    3.4     1  6
7    3.7     1  7
8    0.8     1  8
9    0.0     1  9
10   2.0     1 10
11   1.9     2  1
12   0.8     2  2
13   1.1     2  3
14   0.1     2  4
15  -0.1     2  5
16   4.4     2  6
17   5.5     2  7
18   1.6     2  8
19   4.6     2  9
20   3.4     2 10

We note that the first 10 lines correspond to treatment group 1, the last 10 to group 2. Each patient contributes two measurements, one for each treatment. Access to the data set is obtained by typing and running:

sleepData <- sleep

Note that the data set sleepData now appears in your Environment. You can view the data in RStudio by typing sleepData or View(sleepData).
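For a quick look without printing every row, something like the following can be used. This is only a sketch; head(), names() and class() are standard R functions.

```r
sleepData <- sleep

head(sleepData, 3)  # the first three rows
names(sleepData)    # "extra" "group" "ID"
class(sleepData)    # "data.frame"
```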

In the quizzes you will be introduced to some new commands that are very useful when working with data frames. There are 4 quiz questions in total.

Quiz question 1

How many records (observations) does the data contain? How many variables?

Start the quiz here. You might find the answer to this quiz obvious - do the quiz anyway to learn a few more commands you may use to find the answer.

Having trouble accessing this quiz? Then enroll in the room on Absalon via this link: https://absalon.ku.dk/enroll/4DNDRY.

Quiz question 2

How many times was an increase in the average hours of sleep observed when comparing treatment to control? (I.e. how many observations in the extra column have a value > 0?)

Start the quiz here.

Quiz question 3

Add a new column called extraMinutes to your data set, containing the extra sleep calculated in minutes instead of hours.

Which of the following commands can be used to achieve this?

  1. sleep$extraMinutes <- sleep$extra*60
  2. sleepData$extraMinutes <- sleepData$extra/60
  3. sleepData[,"extraMinutes"] <- sleepData[,"extra"]*60
  4. sleepData[,"extraMinutes"] <- sleepData[,"extra"]/60

Start the quiz here.

Quiz question 4

Missing data values are represented by the value NA (Not Available).
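As a small illustration on a made-up vector (not the sleep data), NA can be detected with is.na(), and many summary functions return NA unless told to skip missing values:

```r
# A hypothetical vector with one missing value
hours <- c(6.5, NA, 8.0)

is.na(hours)               # FALSE TRUE FALSE
mean(hours)                # NA, because one value is unknown
mean(hours, na.rm = TRUE)  # 7.25, the mean of the two known values
```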

What happens when running the code:

sleepData[3,"extra"] <- NA
sleepData

Start the quiz here.