6 Descriptive analysis
The estimated amount of time to complete this chapter is 1-3 hours.
This chapter introduces some of the most common commands used for descriptive analysis. We show how to determine various descriptive statistics and how to calculate the confidence interval for the mean. The commands introduced in this chapter is used all the time.
Depending on your familiarity with R, we expect you to use between 1 and 4 hours on this chapter. If you are a bit experienced with R you might have fun playing around with your own functions as described in Chapter 6.5, if you are inexperienced and feel that the learning curve is steep, you may skip that chapter.
6.1 Tables
Tables are used to summarize the distribution of categorical variables. Below we illustrate how to make one-way and two-way tables using the table()
command.
6.1.1 One-way tables
In the video below we determine the number of males and females in the SundBy data set (how to load the data is described in Chapter 4). We also show how to determine proportions from a one-way table.
Click here to find the code produced in the video
d <- read.csv("http://publicifsv.sund.ku.dk/~sr/intro/datasets/sundby0_English.csv")
head(d)
# One-way table to count the number of males (code 1) and females
table( d$gender )
# One-way table to count the number of tall people (height > 180cm)
table( d$ht > 180 )
summary( d$ht )
table( d$ht > 180, useNA = 'ifany' )
tbl <- table( d$ht > 180 )
tbl
addmargins( tbl )
prop.table( tbl )
round( prop.table( tbl ), 2)
Contents of the video:
To produce a one-way table, the table()
command is used with one variable as the argument:
> table( d$gender )
1 2
89 111
We note that we have 89 males and 111 females.
We may also make tables based on a logical condition, e.g. counting the number of tall people as
> table( d$ht > 180 )
FALSE TRUE
151 44
You can add the additional argument useNA="ifany"
to count the number of missing values:
> table( d$ht > 180, useNA='ifany' )
FALSE TRUE <NA>
151 44 5
We do not get much information using the table()
command. Having made a table we may use the table for further calculations such as determining the total and the percentages:
> tbl <- table( d$ht > 180 )
> addmargins( tbl )
FALSE TRUE Sum
151 44 195
> prop.table( tbl )
FALSE TRUE
0.774359 0.225641
Among the 195 participants for which the height is registered, 23% are taller than 180 cm.
6.1.2 Two-way tables
The commands we use to work with two-way tables are the exact same as used for one-way tables, see the video below (5:25 min):
Click here to find the code produced in the video
d <- read.csv("http://publicifsv.sund.ku.dk/~sr/intro/datasets/sundby0_English.csv")
table( d$gender, d$ht > 180 )
table( d$gender, d$ht > 180, dnn=c('Gender','Height > 180cm') )
tbl <- table( d$gender, d$ht > 180, dnn=c('Gender','Height > 180cm') )
addmargins( tbl )
prop.table( tbl )
prop.table( tbl, margin=1 )
prop.table( tbl, margin=2 )
Contents of the video:
The table()
function can also be used with two variables separated by a comma to make a two-way table. The variable used as first argument will appear in the table rows, the variable used as second argument will appear in the columns:
> table( d$gender, d$ht > 180 )
FALSE TRUE
1 44 43
2 107 1
To ease readability of the table you can add names to the rows and columns using extra argument dnn
. As the table contains two variables, we need to give two names that are combined using c()
:
> table( d$gender, d$ht > 180, dnn=c('Gender', 'Height > 180cm') )
Height > 180cm
Gender FALSE TRUE
1 44 43
2 107 1
Be careful that the elements of dnn
have the same order as the variables used as input to table()
, hence the first variable has to correspond to the text given in the first element of dnn
. We find that 43 males and 1 female are taller than 180cm.
We may add the totals to the table using the addmargins()
command:
tbl <- table( d$gender, d$ht > 180, dnn=c('Gender', 'Height > 180cm') )
addmargins(tbl)
and define the total / row / column percentages using:
prop.table(tbl) # overall percentages
prop.table(tbl, margin=1) # row percentages
prop.table(tbl, margin=2) # column percentages
Quiz 6.1.1
In the SundBy data, how many participants weigh less than 60 kilos?
Solution
> table( d$wgt < 60 )
FALSE TRUE
164 32
i.e. 32 participants weigh less than 60 kilos.
Quiz 6.1.2
Determine the proportion of males resp. females that weigh less than 60 kilos.
Test your result here.
6.1.3 Two-way tables by hand
Some times we wish to analyze a table where the only data we have is the table itself. In this setting we do not have a data set with one line per individual to base the table upon.
Suppose we are given the following table (stolen from an online course in basic statistics, Concepts of Statistics). The page specifies: Researchers in the Physicians’ Health Study (1989) designed a randomized clinical trial to determine whether aspirin reduces the risk of heart attack. Researchers randomly assigned a large sample of healthy male physicians (22,071) to one of two groups. One group took a low dose of aspirin (325 mg every other day). The other group took a placebo. This was a double-blind experiment. The results of the study is given in the table below:
Heart attack | No heart attack | |
---|---|---|
Aspirin | 139 | 10898 |
Placebo | 239 | 10795 |
If we wish to analyze this table, we need to enter the numbers by hand into a table in R. This is done in two steps:
- The numbers in the table are entered into a vector (NB: if row and column numbers are given in the table you are considering, you should not enter these).
- The vector is turned into a table using the
matrix()
command, specifying the number of rows and columns.
> # 1. Enter the numbers in a vector:
> numbers <- c(139,239,10898,10795)
> # 2. Define a 2x2 table:
> tbl2 <- matrix( numbers, nrow=2, ncol=2 )
> tbl2
[,1] [,2]
[1,] 139 10898
[2,] 239 10795
We can not enter variable names to make the table more readable, but we may instead define row and column names of the table using additional argument dimnames
that takes as input a list
of row names and column names:
> tbl2 <- matrix( numbers, nrow=2, ncol=2,
+ dimnames=list( c('Aspirin','Placebo'), c('Heart attack','No heart attack')) )
> tbl2
Heart attack No heart attack
Aspirin 139 10898
Placebo 239 10795
If you want to flip rows and columns you may use extra argument byrow=T
(numbers will be entered into the table by row instead of by column). Be careful to also switch the row and column names:
> tbl2 <- matrix( numbers, nrow=2, ncol=2, byrow=T,
+ dimnames=list(c('Heart attack','No heart attack'),c('Aspirin','Placebo')))
> tbl2
Aspirin Placebo
Heart attack 139 239
No heart attack 10898 10795
Quiz 6.1.3
The table below, showing the distribution of diabetes (yes/no) on BMI groups, is taken from a study based on the Danish Nurse Cohort (all female).
Diabetes | No diabetes | Total | |
---|---|---|---|
Underweight (<18.5) | 8 | 393 | 401 |
Normal (18.5-25) | 349 | 13775 | 14124 |
Overweight (25-30) | 329 | 3991 | 4320 |
Obese (>30) | 187 | 841 | 1028 |
Total | 873 | 19000 | 19873 |
Enter the numbers into a table in R and determine 1) the percentage with diabetes for overweight women and 2) the percentage of obese women for women without diabetes.
Solution
The numbers are entered into a table (we carefully do not enter the row and column totals):
numbers <- c( 8,393, 349,13775, 329,3991, 187,841 )
tbl3 <- matrix( numbers, nrow=4, ncol=2, byrow=T,
dimnames = list( c('Underweight','Normal','Overweight','Obese'),
c('Diabetes','No diabetes')))
tbl3
Diabetes No diabetes
Underweight 8 393
Normal 349 13775
Overweight 329 3991
Obese 187 841
The percentage with diabetes for overweight women is found to be 7.6% using row percentages:
> prop.table( tbl3, margin=1)
Diabetes No diabetes
Underweight 0.01995012 0.9800499
Normal 0.02470971 0.9752903
Overweight 0.07615741 0.9238426
Obese 0.18190661 0.8180934
The percentage of obese women among non-diabetic women is found to be 4.4% using column percentages:
> prop.table( tbl3, margin=2)
Diabetes No diabetes
Underweight 0.009163803 0.02068421
Normal 0.399770905 0.72500000
Overweight 0.376861397 0.21005263
Obese 0.214203895 0.04426316
We may ask R to help us round the numbers:
> round( 100*prop.table( tbl3, margin=2), digits=1 )
Diabetes No diabetes
Underweight 0.9 2.1
Normal 40.0 72.5
Overweight 37.7 21.0
Obese 21.4 4.4
6.1.4 The Chi-Square Test
Based on the data in the table in Quiz 6.3.1 - do you think there is an association between BMI and diabetes? Maybe you remember that we can assess associations in contingency tables using the Chi-Square Test (more on this when the course starts).
The Chi-Square Test tests the hypothesis that there is no association between the two variables. To perform the test we use command chisq.test()
that takes as input a table:
> chisq.test( tbl3 )
Pearson's Chi-squared test
data: tbl3
X-squared = 702.53, df = 3, p-value < 2.2e-16
The p-value is < 2.2e-16 ( 2.2\(\times10^{-16}=0.00000000000000022\)), i.e. a very small number. The hypothesis of no association is therefore rejected. Ususally we don’t report p-values with more than four decimals and therefore summarize: There is an association between BMI and diabetes, p<0.0001.
Next step is to describe the association. More on that in your course.
6.2 Mean, standard deviation and confidence interval
We will now introduce functions to calculate mean and standard deviation of a quantitative variable. We further illustrate how to use these to calculate a 95% confidence interval for the mean.
6.2.1 Mean
To determine the mean of a variable x
we use the mean()
command that has the structure
mean( x, na.rm )
The extra argument na.rm
(Not Available ReMove) can be set to TRUE
or FALSE
(default), if FALSE
R will not remove those with missing values from the calculations in which case the mean is not defined. In that case we will have to specify na.rm=TRUE
to have R calculate the mean.
Using the SundBy data set we can determine the mean age as 41.1 years:
> mean( d$age )
[1] 41.08
Using the summary()
command we have previously (Chapter 4.4) seen the mean weight was 71 kilos. However when attempting to use the mean()
function to determine the mean we obtain:
> mean( d$wgt )
[1] NA
Because of the missing values in the weight variable wgt
R insists that the mean value is also missing (NA
). To force R to ignore the missing values we add set the extra argument na.rm
to TRUE
:
> mean( d$wgt, na.rm=TRUE )
[1] 70.93878
Note that we do not set quotation marks around the value TRUE
but it will also work with quotation. We could also have written a T
only: mean( d$wgt, na.rm=T )
6.2.2 Standard deviation
The structure of the function to determine the standard deviation (sd()
) is the exact same as for the mean
function:
sd( x, na.rm )
The standard deviation of the age resp. the weight variable
> sd( d$age )
[1] 18.29006
> sd( d$wgt, na.rm=TRUE)
[1] 12.18306
6.2.3 Confidence intervals
When the sample size (the number of observations \(n\)) is large we calculate the 95% confidence interval as
\[\begin{eqnarray*} {\rm mean} \pm 1.96 \times \frac{\rm standard \ deviation}{\sqrt{n}} \end{eqnarray*}\]The sample size \(n\) is the number of observations used for calculation of the mean and standard deviation. Warning: For “small” sample sizes we should not use the 1.96 in the calculation but rather a number a bit larger than 2, the exact number depending on the number of observations (found from a t-distribution). In the statistics course you will learn how to find that number, for now we will use 1.96 for practicing.
For the age
variable we found the mean and standard deviation above and we may therefore determine the upper limit of the 95% confidence interval as
> 41.08 + 1.96 * 18.29 / sqrt( 200 )
[1] 43.61486
However it is a bad habit to copy-paste numbers and instead we specify
> mean( d$age ) + 1.96 * sd( d$age ) / sqrt( 200 )
[1] 43.61487
> mean( d$age ) - 1.96 * sd( d$age ) / sqrt( 200 )
[1] 38.54513
i.e. from 38.5 to 43.6 years.
Do you remember the interpretation of the confidence interval? Otherwise it will be repeated during the course.
Quiz 6.2.1
Determine the 95% confidence interval for the mean weight (wgt
) in the SundBy data set. NB: Be careful when using the sample size \(n\) in the formula as we do not have the weight registered for all participants (\(n\)<200). Test your result here.
Quiz 6.2.2
It does not make much sense to calculate the mean weight for males and females combined as males and females differ in weight. Determine the mean weight including a 95% confidence interval for males and females separately.
Hint: Define a sub dataset containing the observations for males only (using the subset()
command described in Chapter 5.2, e.g. males <- subset(d, gender==1)
) and use this data set to determine mean and confidence interval for males. Perform the same maneuver for females.
Test your results here.
6.3 Common descriptive functions
Some of the most common commands used for making descriptive statistics is given in the table below:
Function | Explanation | Example |
range(x) |
Range (minimum and maximum) |
range( d$age ) |
median(x) |
Median |
median( d$age ) |
min( x ) |
Minimum |
min( d$age ) |
max(x) |
Maximum |
max( d$age ) |
quantile(x) |
Quantiles, defaults to 0/25/50/75/100%. To determine specific quantiles, e.g. 2.5%, use extra argument probs=0.025 .
|
quantile( d$age ) |
All of the above functions require extra argument na.rm=TRUE
in case the variable we are considering has missing values.
Quiz 6.3.1
Determine the median weight for males and females separately.
Solution
We use the median()
function on the subsets as in Quiz 6.2.2:
> females <- subset( d, gender==2)
> males <- subset(d, gender==1)
>
> median( females$wgt, na.rm=T )
[1] 65
> median( males$wgt, na.rm=T )
[1] 75.5
Note that we may also use the summary()
function, summary( females$wgt)
.
6.4 Performing calculations groupwise
Sometimes we have data sets where the data set can be grouped according to some criteria (e.g. gender or treatment group). We will often calculate summary statistics for each group separately as we did in Quiz 4.2 for the weight variable. The tapply()
function is very convenient for this purpose as it can be used to apply a function to subsets of a data set.
The structure of the tapply()
function is:
tapply( X, INDEX, FUN )
Here X
is the variable for which we will apply a function, INDEX
is the grouping variable and FUN
is the function we will apply.
To calculate the mean age for males and females separately we simply specify
tapply( d$age, d$gender, mean )
1 2
39.55056 42.30631
However if the variable in question has missing values
> tapply( d$wgt, d$gender, mean )
1 2
NA NA
tapply()
does not work but we can add the na.rm
argument to be used in the mean()
function:
> tapply( d$wgt, d$gender, mean, na.rm=T )
1 2
78.13372 65.31364
We may even perform the calculations split on several criteria, here gender and under / above 40 years by specifying the 2nd (INDEX
) argument as a “list” of the two grouping variables:
> tapply( d$wgt, list( d$gender, d$age < 40), mean, na.rm=T )
FALSE TRUE
1 79.50000 77.28302
2 65.65909 65.08333
I.e. for females under age 40, the average weight is 65.1 kilos.
Quiz 6.4.1
Determine the range (the difference between the largest and smallest value) of the weight for males and females separately. Make two solutions 1) using subsets as in the Quizzes 6.2.2/6.3.1 above) and 2) using tapply()
.
Test your results and find solutions here
6.5 Programming your own functions
Warning: This is a bit advanced - only read this chapter if you have become good friends with R, not are too overwhelmed by the coding and quite nerdy. Otherwise you can study this chapter later on when you have become more familiar with R (but you can definitely also live without).
Sometimes you might miss a function and fortunately it is not too difficult to define your own functions in R. The calculation of the confidence intervals above was a good exercise to practice playing around with numbers in R but the calculations would be easier if we could use a built-in confidence interval function. However such a function does not exist in R and we will therefore have to define it ourselves. Below we show how to define your own functions in a series of three videos.
The first video explains the basic principles for defining new functions. We define a function to calculate the mean that does not need the additional argument na.rm=T
in case the variable used for calculation contains missing values (4.11 min):
In the video, the function to calculate the mean was named mymean()
. It is important that we do not use names for our own functions that are identical to the built-in functions in R (otherwise some weird problems may arise …). If we had decided to name the new mean function mean()
it will not work (if you try, the mean()
command will not work and you will have to remove the mean
function from your environment (rm(mean)
) to make your program work again)!
The code produced in the video to define the mymean()
function is
mymean <- function( x ){
m = mean( x, na.rm=T )
return( m )
}
We can use our mymean()
function as mymean( x=d$wgt )
as we chose to name the argument x
in the function. Or simply
> mymean( d$wgt )
[1] 70.93878
In the next video we show how to define functions that determines more than one value. We define a function that returns the mean and the standard deviation (4.36 min):
The meansd()
function was defined as:
meansd <- function( x ){
m = mean( x, na.rm=T )
stddev = sd( x, na.rm=T )
outdata = data.frame( mean = m, sd = stddev )
return( outdata )
}
meansd( d$wgt )
mean sd
1 70.93878 12.18306
Returning a data frame (outdata
) instead of a number or vector is a simple way to add labels to the values thereby allowing for a more user friendly output.
Quiz 6.5.1
Define a command that takes as input a numeric variable (e.g. d$wgt
) and outputs the number of observations, minimum, maximum, median, mean and standard deviation.
Solution
mysummary <- function(x){
# it does not matter whether we use <- or =
n <- as.numeric( table( is.na(x) )[1] )
m = mean( x, na.rm=T )
mn = min( x, na.rm=T )
mx = max( x, na.rm=T )
md = median( x, na.rm=T)
stddev = sd( x, na.rm=T )
outdata = data.frame( n=n, min=mn, max=mx, median=md, mean=m, sd=stddev)
return( outdata )
}
mysummary( d$wgt )
In the last video we modify the code to produce a function ci()
for calculation of the confidence interval (7.50 min):
The ci()
function was defined as
ci <- function( x ){
n = as.numeric( table( is.na( x ) )[1] )
# print(n)
m = mean( x, na.rm=T )
stddev = sd( x, na.rm=T )
l = m - 1.96 * stddev / sqrt( n )
u = m + 1.96 * stddev / sqrt( n )
outdata = data.frame( n=n, mean = m, sd = stddev, lower=l, upper=u )
return( outdata )
}
ci( d$wgt )
n mean sd lower upper
1 196 70.93878 12.18306 69.23315 72.6444
We may even determine confidence intervals for several groups at the same time using tapply()
:
> tapply( d$wgt, d$gender, ci)
$`1`
n mean sd lower upper
1 86 78.13372 11.16387 75.77421 80.49323
$`2`
n mean sd lower upper
1 110 65.31364 9.775897 63.48673 67.14054
Voila!
Note that we did not need to add the na.rm
argument to tapply()
as above because our ci()
function handles the missing values. Disclaimer: tapply()
will not work with INDEX
set to a list of two grouping variables as tapply()
only works for functions that returns a single number (the ci()
returns a data set, i.e. several numbers).
Quiz 6.5.2
Define another version of theci()
command that uses the quantile of the approppriate t-distribution in the calculation of the confidence interval instead of the 1.96, i.e. using the formula
\[\begin{eqnarray*}
{\rm mean} \pm z_{0.975} \times \frac{\rm standard \ deviation}{\sqrt{n}}
\end{eqnarray*}\]
where \(z_{0.975}\) is determined as the 97.5th percentile of the t-distribution with \(n\)-1 degrees of freedom. The percentile can be determined using qt( 0.975, df)
(Quantile of T-distribution) with argument df
being the Degrees of Freedom (= \(n\)-1).
Solution
ci.correct <- function(x){
n <- as.numeric( table( is.na(x) )[1] )
m = mean( x, na.rm=T )
z = qt(0.975, n-1)
l = mean( x, na.rm=T ) - z * sd( x, na.rm=T ) / sqrt(n)
u = mean( x, na.rm=T ) + z * sd( x, na.rm=T ) / sqrt(n)
outdata = data.frame( n=n, mean=m, lower=l, upper=u )
return( outdata )
}
ci.correct( d$wgt )
This way of calculating the confidence interval is valid for small sample sizes also. More details will be given in the course.
Quiz 6.5.3
Define your own function named myrange
that determines the range as the difference between the maximum and minimum value (as you did in Quiz 6.4.1).
Solution
# We can define the function using the min() and max() commands ...
myrange <- function(x){
mn <- min( x, na.rm=T)
mx <- max( x, na.rm=T)
dif <- mx - mn
return( dif )
}
# ... or ...
myrange <- function(x){
dif <- max( x, na.rm=T) - min( x, na.rm=T)
return( dif )
}
# ... or using the range() command:
myrange <- function(x){
r <- range(x, na.rm=T)
dif <- r[2]-r[1]
return( dif )
}
# Range for females:
myrange( females$wgt )
# Range for both genders:
tapply( d$wgt, d$gender, myrange)