Chapter 2 Basic data structures in R
Before we jump into actual data analysis, it helps to review the common types of variables and how they are stored in R.
A variable is any characteristic, number, or quantity that can be measured or counted. A variable may also be called a data item. Age, sex, business income and expenses, country of birth, capital expenditure, class grades, eye colour and vehicle type are examples of variables. It is called a variable because the value may vary between data units in a population, and may change in value over time. (Australian Bureau of Statistics, ABS)
2.1 Types of variables: A taxonomy
2.1.1 Categorical variables: ordinal vs. nominal
Categorical variables have values that describe a ‘quality’ or ‘type’ or ‘category’. Therefore, categorical variables are qualitative variables and tend to be represented by a non-numeric value. The categories should be mutually exclusive (each observation falls in one category only) and exhaustive (all possible options are included).
Categorical variables may be further divided as being ordinal or nominal:
An ordinal variable can be logically ordered or ranked. Its categories can be ranked higher or lower than one another, but the ranking does not establish a numeric difference between categories. In other words, the intervals between levels of the variable are unknown. Examples of ordinal categorical variables include academic grades (e.g. A, B, C), clothing size (e.g. small, medium, large, extra large) and attitudes (e.g. strongly agree, agree, disagree, strongly disagree).
For example, survey participants are often asked to rate something. Subjective measurements of this kind are usually ordinal variables, e.g. a Likert rating scale or level of education (“< high school”, “high school”, “associate’s degree”).
We can assign numbers to the levels of an ordinal variable, but we should bear in mind that these variables are not numeric. For example, “strongly agree” and “neutral” cannot be averaged to give “agree”, even though you may assign 5 to “strongly agree” and 3 to “neutral”.
A nominal variable cannot be organised in a logical sequence. Examples of nominal categorical variables include sex, business type, eye colour, religion and brand.
The data collected for a categorical variable are qualitative data.
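As a minimal sketch of the ordinal case, the survey answers below are made up for illustration, and `answers`, `scale_levels` and `attitude` are hypothetical names. An ordered factor is one way to tell R that the categories have a ranking without pretending that they are numbers.

```r
# Hypothetical survey answers on a 5-point agreement scale
answers <- c("agree", "neutral", "strongly agree", "disagree", "agree")
scale_levels <- c("strongly disagree", "disagree", "neutral",
                  "agree", "strongly agree")

# An ordered factor records the ranking of the categories
attitude <- factor(answers, levels = scale_levels, ordered = TRUE)

# Order comparisons are meaningful for an ordinal variable
attitude >= "agree"

# Counting the answers in each category is meaningful too
table(attitude)

# The underlying integer codes are only labels for the ranks
as.integer(attitude)
```

Calling `mean(attitude)` returns `NA` with a warning, which matches the point that ordinal levels should not be averaged.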
2.1.2 Numeric variables: discrete or continuous
Numeric variables have values that describe a measurable quantity as a number, like ‘how many’ or ‘how much’. Therefore, numeric variables are quantitative variables. (ABS) They are also called interval/ratio variables, and the interval between adjacent values is equal: the difference between 1 kg and 2 kg is the same as the difference between 3 kg and 4 kg.
Numeric variables may be further divided as being either continuous or discrete:
A discrete variable consists of counts from a set of distinct whole values. A discrete variable cannot take the value of a fraction between one value and the next closest value. Examples of discrete variables include the number of registered cars, number of business locations, and number of children in a family, all of which are measured in whole units (e.g. 1, 2, 3 cars). (ABS)
A continuous variable can take any value between a certain set of real numbers. The value given to an observation for a continuous variable can include values as small as the instrument of measurement allows. Examples of continuous variables include height, time, age, and temperature. (ABS)
The data collected for a numeric variable are quantitative data.
The variable type determines (1) which statistical analyses are appropriate and (2) how we summarize the data with statistics and plots. Variables can be stored in R in different data types.
Nominal and ordinal variables can be stored as character vectors or as factors (with levels).
Interval/ratio data are stored as numbers, either as integer or as numeric (double, i.e. real numbers with decimals).
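The mapping from variable types to R data types can be sketched as follows. All values and variable names here (`eye_colour`, `size`, `n_children`, `height_cm`) are invented for illustration.

```r
# Hypothetical example values, one vector per variable type
eye_colour <- c("brown", "blue", "green", "brown")   # nominal
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)                        # ordinal
n_children <- c(0L, 2L, 1L, 3L)                       # discrete (L marks an integer)
height_cm <- c(172.5, 160.2, 181.0, 168.4)            # continuous

class(eye_colour)           # "character"
class(factor(eye_colour))   # "factor"
class(size)                 # "ordered" "factor"
class(n_children)           # "integer"
class(height_cm)            # "numeric"

# The levels of a factor are stored in the order we supplied
levels(size)
```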
If you have only one variable, you can store it in a vector. However, more often than not, you have a bunch of variables that need to be stored or imported as a matrix or data frame.
2.2 1D data structure: vectors
A vector is a sequence of data elements of the same basic type: integer, double, logical or character. All elements of a vector must be the same type.
2.2.1 Creating vectors
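a = 8:17                          # the integers from 8 to 17
b <- c(9, 10, 100, 38)            # a numeric vector
c = c(TRUE, FALSE, TRUE, FALSE)   # a logical vector
c = c(T, F, T, F)                 # same result: T and F are shorthand for TRUE and FALSE
d = c("TRUE", "FALSE", "FALSE")   # a character vector, because of the quotes
# You can change the type of a vector with the as.vector() function.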
as.vector(b, mode = "character")
## [1] "9" "10" "100" "38"
# When you put elements of different types in one vector, R automatically converts some of them so that the whole vector stays homogeneous.
e = c(9, 10, "ab", "cd")   # the numbers are converted to character strings
f = c(10, 11, T, F)        # TRUE and FALSE are converted to 1 and 0
c() is a function in R; it combines its arguments into a vector.
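To see the automatic conversion at work, we can inspect the class of the mixed vectors. This is a small sketch that recreates `e` and `f` so that it runs on its own.

```r
e <- c(9, 10, "ab", "cd")     # numbers mixed with strings
f <- c(10, 11, TRUE, FALSE)   # numbers mixed with logicals

class(e)   # "character": everything became a string
class(f)   # "numeric": TRUE became 1 and FALSE became 0

# Converting back to numbers: strings that are not numbers become NA.
# R issues a warning here, which suppressWarnings() silences.
suppressWarnings(as.numeric(e))
```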
There are some other basic functions in R that you can play with to generate vectors.
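A = 9:20 + 1                    # ":" is evaluated before "+", so this is 10, 11, ..., 21
B = seq(1, 10)                  # 1, 2, ..., 10
C = seq(1, 20, by = 2)          # the odd numbers from 1 to 19
D = rep(5, 4)                   # 5 5 5 5
E = rep(c(1, 2, 3), 4)          # 1 2 3 repeated four times
G = rep(c(1, 2, 3), each = 4)   # 1 1 1 1 2 2 2 2 3 3 3 3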
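A few variations on these generators may be worth trying. The sketch below uses hypothetical names (`x1`, `x2`, `steps`, `labels`) and only base R arguments: `length.out` for `seq()` and a vector-valued `times` for `rep()`.

```r
# Operator precedence: the colon binds more tightly than "+"
x1 <- 9:20 + 1      # (9:20) + 1, i.e. 10 to 21
x2 <- 9:(20 + 1)    # 9 to 21
length(x1)
length(x2)

# Ask for a fixed number of evenly spaced values instead of a step size
steps <- seq(0, 1, length.out = 5)
steps

# Repeat each element a different number of times
labels <- rep(c("a", "b"), times = c(2, 3))
labels
```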
# Now that you have a vector, you can do some maths.
max(a)
## [1] 17
min(a)
## [1] 8
range(a)
## [1] 8 17
sum(a)
## [1] 125
mean(a)
## [1] 12.5
median(a)
## [1] 12.5
quantile(a)
## 0% 25% 50% 75% 100%
## 8.00 10.25 12.50 14.75 17.00
sd(a)
## [1] 3.02765
round(sd(a), 2)
## [1] 3.03
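The functions above reduce a vector to one or a few numbers. Arithmetic and comparisons, by contrast, work element by element. The sketch below recreates `a` and `b` so that it runs on its own; `centred`, `shares` and `above_12` are hypothetical names.

```r
a <- 8:17
b <- c(9, 10, 100, 38)

a * 2                     # every element is doubled
centred <- a - mean(a)    # distance of each element from the mean
shares <- b / sum(b)      # each element as a share of the total

# A comparison returns a logical vector of the same length
above_12 <- a > 12
above_12

# TRUE counts as 1, so sum() counts how many elements pass the test
sum(above_12)
```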
2.2.2 Creating list objects
We can put vectors of different types (e.g., numeric, logical or character) and lengths in a list object.
list1 = list(a, b, c, d, e, f)
list1
## [[1]]
## [1] 8 9 10 11 12 13 14 15 16 17
##
## [[2]]
## [1] 9 10 100 38
##
## [[3]]
## [1] TRUE FALSE TRUE FALSE
##
## [[4]]
## [1] "TRUE" "FALSE" "FALSE"
##
## [[5]]
## [1] "9" "10" "ab" "cd"
##
## [[6]]
## [1] 10 11 1 0
# More often than not, we do not create lists ourselves, but we have to deal with them when we get output from statistical models.
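As an illustration of both points, the sketch below first builds a small named list (`scores` is a hypothetical name with made-up values) and then fits a linear model to the `cars` dataset that ships with R, purely to show that a model object is a list that can be taken apart in the same way.

```r
# A named list holding vectors of different types and lengths
scores <- list(id = 1:3, name = c("Ann", "Bob", "Cy"), passed = TRUE)

scores$name       # extract one element by name
scores[["id"]]    # double brackets also extract the element itself
scores[1]         # single brackets return a shorter list

# The result of lm() is a list as well
fit <- lm(dist ~ speed, data = cars)
is.list(fit)
names(fit)            # the components stored in the model object
fit$coefficients      # extract one component, as with any list
```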
2.3 2D data structures: matrices and data frames
Most of us have had some experience with the Excel spreadsheet. Data in a spreadsheet are arranged by rows and columns in a rectangular space. This is a typical 2-dimensional data structure. In R, there are two ways of storing tabular data like a spreadsheet: the matrix and the data frame.
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout in which all the elements must be of the same type (e.g., numeric or character).
A data frame is similar to a matrix in shape and differs only in that different types of data (e.g. numeric, factor, character) can co-exist in different columns. Thus, in data analysis, we use data frames more often than matrices.
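The difference can be seen in a small sketch. The names `m`, `mixed` and `df` are hypothetical, and the values are arbitrary.

```r
# A matrix: all elements share one type; values are filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)
m[2, 3]    # row 2, column 3

# Mixing numbers and strings in a matrix converts everything to character
mixed <- cbind(1:2, c("a", "b"))
typeof(mixed)

# A data frame keeps a separate type for each column
df <- data.frame(n = 1:2, s = c("a", "b"))
sapply(df, class)
```

The class reported for column `s` depends on the R version: it is a factor before R 4.0.0 and character from 4.0.0 onwards, unless `stringsAsFactors` is set explicitly.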
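# Let's generate a data frame from scratch.
# rnorm() draws random numbers, so your values will differ from the output shown here.
id = seq(1, 40)
gender = rep(c("male", "female"), 20)   # 40 values, one per id
maths = rnorm(40, mean = 70, sd = 5)
english = rnorm(40, mean = 80, sd = 9)
music = rnorm(40, mean = 75, sd = 10)
pe = rnorm(40, mean = 86, sd = 12)
# stringsAsFactors = TRUE stores gender as a factor, which was the default before R 4.0.0.
df1 = data.frame(id, gender, maths, english, stringsAsFactors = TRUE)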
Now let’s explore the data frame we just created.
str(df1)
## 'data.frame': 40 obs. of 4 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "female","male": 2 1 2 1 2 1 2 1 2 1 ...
## $ maths : num 70 71.5 71.7 67.3 71 ...
## $ english: num 93 55.1 62.5 84.8 73.6 ...
summary(df1)
## id gender maths english
## Min. : 1.00 female:20 Min. :60.57 Min. :55.13
## 1st Qu.:10.75 male :20 1st Qu.:66.43 1st Qu.:71.86
## Median :20.50 Median :70.12 Median :79.29
## Mean :20.50 Mean :69.73 Mean :78.49
## 3rd Qu.:30.25 3rd Qu.:71.68 3rd Qu.:85.50
## Max. :40.00 Max. :83.00 Max. :97.55
nrow(df1)
## [1] 40
ncol(df1)
## [1] 4
attributes(df1)
## $names
## [1] "id" "gender" "maths" "english"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
2.3.1 What if I want to change column names or add a variable to the data frame?
df2 = data.frame(id = id, gender = gender, maths = maths, english = english)
df2 = cbind(df2, pe)   # add pe as a new column
colnames(df2) = c("ID", "SEX", "MATHS", "ENGLISH", "PE")   # rename all columns at once
head(df2)
## ID SEX MATHS ENGLISH PE
## 1 1 male 69.95176 93.00191 97.15277
## 2 2 female 71.54677 55.12660 75.79006
## 3 3 male 71.67045 62.54689 81.67235
## 4 4 female 67.25445 84.84459 103.52895
## 5 5 male 70.97599 73.64291 99.84611
## 6 6 female 68.06792 86.20590 86.53996
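The same two tasks can also be done with `$` and `names()`. This is a sketch on a small made-up data frame (`scores` is a hypothetical name), not a replacement for the approach above.

```r
# A small hypothetical data frame
scores <- data.frame(id = 1:3, maths = c(70, 65, 80))

# Add a new column by assigning to a name that does not exist yet
scores$pe <- c(90, 85, 77)

# Rename a single column by matching its current name
names(scores)[names(scores) == "maths"] <- "MATHS"

names(scores)
```

Renaming by name rather than by position keeps working if the column order changes later.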
2.3.2 Subsetting dataframes
We all know how to select part of an Excel spreadsheet by clicking and dragging the mouse. In R, when we want to select part of a data frame, we use the pattern dataframe[row, column]. Leaving the row or the column position empty selects all rows or all columns.
There are various ways we can use this formula and believe it or not, you will love them!
# the complete dataset
df2
## ID SEX MATHS ENGLISH PE
## 1 1 male 69.95176 93.00191 97.15277
## 2 2 female 71.54677 55.12660 75.79006
## 3 3 male 71.67045 62.54689 81.67235
## 4 4 female 67.25445 84.84459 103.52895
## 5 5 male 70.97599 73.64291 99.84611
## 6 6 female 68.06792 86.20590 86.53996
## 7 7 male 65.51432 68.75287 90.03359
## 8 8 female 71.10533 84.33375 107.47870
## 9 9 male 69.06481 87.63462 80.22279
## 10 10 female 70.33877 70.27196 97.11310
## 11 11 male 71.27370 70.79533 100.30879
## 12 12 female 63.53398 80.83623 51.17344
## 13 13 male 67.55812 79.26045 82.70119
## 14 14 female 68.51564 87.19538 91.07716
## 15 15 male 60.56896 88.67801 79.43153
## 16 16 female 61.85224 67.14054 101.25958
## 17 17 male 72.26608 90.73062 83.97038
## 18 18 female 74.21551 77.13889 87.51114
## 19 19 male 71.16397 75.92932 83.36162
## 20 20 female 77.85834 82.52012 93.73073
## 21 21 male 77.71301 72.21872 75.34537
## 22 22 female 73.28516 82.30036 84.19643
## 23 23 male 71.16284 63.91421 87.02147
## 24 24 female 66.16200 87.26578 111.01689
## 25 25 male 71.69465 68.37152 88.13204
## 26 26 female 64.79651 74.97961 92.63813
## 27 27 male 66.07760 82.67037 78.19265
## 28 28 female 68.29584 90.07613 78.00366
## 29 29 male 70.65870 63.67181 86.14422
## 30 30 female 70.29251 81.79678 75.04130
## 31 31 male 69.75765 73.11760 86.93607
## 32 32 female 64.55882 76.38809 106.71705
## 33 33 male 74.25832 85.26931 74.50730
## 34 34 female 66.52137 79.95632 95.12957
## 35 35 male 66.15673 73.23804 94.83521
## 36 36 female 62.95305 79.02947 85.48000
## 37 37 male 73.84790 97.54913 111.18145
## 38 38 female 69.65016 67.65983 79.77146
## 39 39 male 73.93870 94.12206 104.65158
## 40 40 female 83.00292 79.31134 76.59283
df2[2:5, ] # from row 2 to row 5
## ID SEX MATHS ENGLISH PE
## 2 2 female 71.54677 55.12660 75.79006
## 3 3 male 71.67045 62.54689 81.67235
## 4 4 female 67.25445 84.84459 103.52895
## 5 5 male 70.97599 73.64291 99.84611
df2[ , 1:2] # select column 1 to 2
## ID SEX
## 1 1 male
## 2 2 female
## 3 3 male
## 4 4 female
## 5 5 male
## 6 6 female
## 7 7 male
## 8 8 female
## 9 9 male
## 10 10 female
## 11 11 male
## 12 12 female
## 13 13 male
## 14 14 female
## 15 15 male
## 16 16 female
## 17 17 male
## 18 18 female
## 19 19 male
## 20 20 female
## 21 21 male
## 22 22 female
## 23 23 male
## 24 24 female
## 25 25 male
## 26 26 female
## 27 27 male
## 28 28 female
## 29 29 male
## 30 30 female
## 31 31 male
## 32 32 female
## 33 33 male
## 34 34 female
## 35 35 male
## 36 36 female
## 37 37 male
## 38 38 female
## 39 39 male
## 40 40 female
df2[ , c("ENGLISH", "PE")] # select by column names
## ENGLISH PE
## 1 93.00191 97.15277
## 2 55.12660 75.79006
## 3 62.54689 81.67235
## 4 84.84459 103.52895
## 5 73.64291 99.84611
## 6 86.20590 86.53996
## 7 68.75287 90.03359
## 8 84.33375 107.47870
## 9 87.63462 80.22279
## 10 70.27196 97.11310
## 11 70.79533 100.30879
## 12 80.83623 51.17344
## 13 79.26045 82.70119
## 14 87.19538 91.07716
## 15 88.67801 79.43153
## 16 67.14054 101.25958
## 17 90.73062 83.97038
## 18 77.13889 87.51114
## 19 75.92932 83.36162
## 20 82.52012 93.73073
## 21 72.21872 75.34537
## 22 82.30036 84.19643
## 23 63.91421 87.02147
## 24 87.26578 111.01689
## 25 68.37152 88.13204
## 26 74.97961 92.63813
## 27 82.67037 78.19265
## 28 90.07613 78.00366
## 29 63.67181 86.14422
## 30 81.79678 75.04130
## 31 73.11760 86.93607
## 32 76.38809 106.71705
## 33 85.26931 74.50730
## 34 79.95632 95.12957
## 35 73.23804 94.83521
## 36 79.02947 85.48000
## 37 97.54913 111.18145
## 38 67.65983 79.77146
## 39 94.12206 104.65158
## 40 79.31134 76.59283
df2[c(1,2,3), ] # select the first three rows
## ID SEX MATHS ENGLISH PE
## 1 1 male 69.95176 93.00191 97.15277
## 2 2 female 71.54677 55.12660 75.79006
## 3 3 male 71.67045 62.54689 81.67235
df2[seq(1, 40, 2), ] # select every other row, starting from row 1
## ID SEX MATHS ENGLISH PE
## 1 1 male 69.95176 93.00191 97.15277
## 3 3 male 71.67045 62.54689 81.67235
## 5 5 male 70.97599 73.64291 99.84611
## 7 7 male 65.51432 68.75287 90.03359
## 9 9 male 69.06481 87.63462 80.22279
## 11 11 male 71.27370 70.79533 100.30879
## 13 13 male 67.55812 79.26045 82.70119
## 15 15 male 60.56896 88.67801 79.43153
## 17 17 male 72.26608 90.73062 83.97038
## 19 19 male 71.16397 75.92932 83.36162
## 21 21 male 77.71301 72.21872 75.34537
## 23 23 male 71.16284 63.91421 87.02147
## 25 25 male 71.69465 68.37152 88.13204
## 27 27 male 66.07760 82.67037 78.19265
## 29 29 male 70.65870 63.67181 86.14422
## 31 31 male 69.75765 73.11760 86.93607
## 33 33 male 74.25832 85.26931 74.50730
## 35 35 male 66.15673 73.23804 94.83521
## 37 37 male 73.84790 97.54913 111.18145
## 39 39 male 73.93870 94.12206 104.65158
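Rows can also be selected by a condition instead of by position. The sketch below uses a small made-up data frame (`scores`, with invented marks) so that the results are easy to check by eye; the same pattern should carry over to the data frame used above.

```r
# A small hypothetical data frame
scores <- data.frame(
  ID = 1:6,
  SEX = rep(c("male", "female"), 3),
  MATHS = c(70, 72, 65, 68, 75, 81)
)

scores$MATHS    # a single column, returned as a vector

# A logical vector in the row position keeps the rows where it is TRUE
females <- scores[scores$SEX == "female", ]
females

# Conditions on rows and names for columns can be combined
high_maths <- scores[scores$MATHS > 70, c("ID", "MATHS")]
high_maths

# subset() expresses the same idea without repeating the data frame name
top_females <- subset(scores, SEX == "female" & MATHS > 70)
top_females
```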
2.4 Summary
Dimensions | Homogeneous | Heterogeneous |
---|---|---|
1D | Atomic vector | List |
2D | Matrix | Data frame |
nD | Array | |
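The array in the last row of the table extends the matrix to more than two dimensions. A minimal sketch, where `arr` is a hypothetical name and the values are arbitrary:

```r
# A 2 x 3 x 4 array: four 2 x 3 "sheets" stacked on top of each other
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)

# Indexing uses one position per dimension: [row, column, sheet]
arr[2, 3, 4]

# Leaving positions empty selects everything along those dimensions
first_sheet <- arr[, , 1]
first_sheet
```

Like a vector or a matrix, an array is homogeneous: all of its elements share one type.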