Chapter 2 Basic data structures in R
Before we jump into actual data analysis, it helps to consider the common types of variables and how they are stored in R.
A variable is any characteristic, number, or quantity that can be measured or counted. A variable may also be called a data item. Age, sex, business income and expenses, country of birth, capital expenditure, class grades, eye colour and vehicle type are examples of variables. It is called a variable because the value may vary between data units in a population, and may change in value over time. (Australian Bureau of Statistics, ABS)
2.1 Types of variables: A taxonomy
2.1.1 Categorical variables: ordinal vs. nominal
Categorical variables have values that describe a ‘quality’ or ‘type’ or ‘category’. Therefore, categorical variables are qualitative variables and tend to be represented by a non-numeric value. The categories should be mutually exclusive (each observation falls in exactly one category) and exhaustive (the categories cover all possible options).
Categorical variables may be further divided as being ordinal or nominal:
An ordinal variable can be logically ordered or ranked. The categories associated with ordinal variables can be ranked higher or lower than one another, but do not necessarily establish a numeric difference between each category. In other words, the intervals between levels of the variable are unknown. Examples of ordinal categorical variables include academic grades (e.g. A, B, C), clothing size (e.g. small, medium, large, extra large) and attitudes (e.g. strongly agree, agree, disagree, strongly disagree).
For example, survey participants are often asked to rate something, and subjective measurements of this kind are usually ordinal variables, e.g. a Likert rating scale or level of education (“< high school”, “high school”, “associate’s degree”).
We can assign numbers to the levels of an ordinal variable, but we should bear in mind that these variables are not numeric. For example, “strongly agree” and “neutral” cannot be averaged to give “agree”, even though you may assign 5 to “strongly agree” and 3 to “neutral”.
A nominal variable is not able to be organised in a logical sequence. Examples of nominal categorical variables include sex, business type, eye colour, religion and brand.
The data collected for a categorical variable are qualitative data.
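In R, the nominal/ordinal distinction can be sketched with factor(). The example below is a minimal illustration; the variable names (eye_colour, agreement) and their values are made up for this sketch and do not come from a real data set.

```r
# Nominal variable: categories without an inherent order (made-up values)
eye_colour <- factor(c("brown", "blue", "green", "brown"))
levels(eye_colour)           # the distinct categories
is.ordered(eye_colour)       # FALSE

# Ordinal variable: categories with an explicit order (made-up values)
agreement <- factor(
  c("agree", "neutral", "strongly agree", "disagree"),
  levels = c("strongly disagree", "disagree", "neutral",
             "agree", "strongly agree"),
  ordered = TRUE
)
is.ordered(agreement)        # TRUE
agreement[1] > agreement[2]  # TRUE: "agree" ranks above "neutral"

# The underlying integer codes exist, but averaging them is not meaningful
as.integer(agreement)        # 4 3 5 2
```

Comparisons such as > only work on ordered factors; for a nominal factor R returns NA with a warning, which matches the idea that nominal categories have no ranking.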
2.1.2 Numeric variables: discrete or continuous
Numeric variables have values that describe a measurable quantity as a number, like ‘how many’ or ‘how much’. Therefore, numeric variables are quantitative variables. (ABS) They are also called interval/ratio variables, and the interval between adjacent values is equal: the interval between 1 kg and 2 kg is the same as that between 3 kg and 4 kg.
Numeric variables may be further divided as being either continuous or discrete:
A discrete variable consists of counts from a set of distinct whole values. A discrete variable cannot take the value of a fraction between one value and the next closest value. Examples of discrete variables include the number of registered cars, number of business locations, and number of children in a family, all of which are measured as whole units (e.g. 1, 2, 3 cars). (ABS)
A continuous variable can take any value between a certain set of real numbers. The value given to an observation for a continuous variable can include values as small as the instrument of measurement allows. Examples of continuous variables include height, time, age, and temperature. (ABS)
The data collected for a numeric variable are quantitative data.
The variable type determines (1) which statistical analyses are appropriate and (2) how we summarize the data with statistics and plots. Variables can be stored in R as different data types.
Nominal and ordinal variables can be stored as character vectors or as factors (with levels).
Interval/ratio data are stored as numbers, either as integer or as numeric (double, i.e. real numbers with decimals).
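As a rough sketch of this mapping, the lines below store one made-up variable of each kind and ask R how it is stored. All names and values (brand, size, children, height) are hypothetical and chosen only for illustration.

```r
# Made-up values, one vector per storage type
brand    <- c("A", "B", "A")                 # nominal, stored as character
size     <- factor(c("small", "large", "medium"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)           # ordinal, stored as ordered factor
children <- c(0L, 2L, 1L)                    # discrete, stored as integer
height   <- c(1.62, 1.80, 1.75)              # continuous, stored as double

class(brand)      # "character"
class(size)       # "ordered" "factor"
class(children)   # "integer"
class(height)     # "numeric"
typeof(height)    # "double"
```

The L suffix asks R for an integer; without it, whole numbers typed at the console are stored as double.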
If you have only one variable, you can store it in a vector. However, more often than not, you have a bunch of variables that need to be stored or imported as a matrix or data frame.
2.2 1D data structure: vectors
A vector is a sequence of data elements of the same basic type: integer, double, logical or character. All elements of a vector must be the same type.
2.2.1 Creating vectors
a = 8:17
b <- c(9, 10, 100, 38)
c = c(TRUE, FALSE, TRUE, FALSE)
c = c(T, F, T, F) # T and F are shorthand for TRUE and FALSE; they can be overwritten, so TRUE/FALSE are safer
d = c("TRUE", "FALSE", "FALSE")
# You can change the type of a vector with the as.vector() function.
as.vector(b, mode = "character")
## [1] "9" "10" "100" "38"
# When you put elements of different types in one vector, R will automatically convert some elements to keep the whole vector homogeneous.
e = c(9,10, "ab", "cd")
f = c(10, 11, T, F)
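To see which type R settled on, typeof() can be called on the results. This is a small sketch that repeats the two vectors above; the final line is an extra illustration of what happens when the conversion is reversed.

```r
e <- c(9, 10, "ab", "cd")    # numbers and strings together
f <- c(10, 11, TRUE, FALSE)  # numbers and logicals together

typeof(e)  # "character": the numbers were converted to strings
typeof(f)  # "double": TRUE became 1 and FALSE became 0
e
f

# Converting back can fail for some elements, which then become NA
suppressWarnings(as.numeric(e))  # 9 10 NA NA
```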
c() is a function in R that combines its arguments into a vector.
There are some other basic functions in R that you can play with to generate vectors.
A = 9:20 + 1
B = seq(1, 10)
C = seq(1, 20, by = 2)
D = rep(5, 4)
E = rep(c(1, 2, 3), 4)
G = rep(c(1, 2, 3), each = 4)
# Now that you have a vector, you can do some maths.
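To see what these helper functions return, the sketch below repeats a few of the calls above and notes the results in comments. One detail worth knowing: the : operator is evaluated before +, so 9:20 + 1 adds 1 to every element of 9:20.

```r
A <- 9:20 + 1                   # same as (9:20) + 1
C <- seq(1, 20, by = 2)         # odd numbers; stops before exceeding 20
E <- rep(c(1, 2, 3), 4)         # the whole vector repeated 4 times
G <- rep(c(1, 2, 3), each = 4)  # each element repeated 4 times

A          # 10 11 12 13 14 15 16 17 18 19 20 21
C          # 1 3 5 7 9 11 13 15 17 19
E          # 1 2 3 1 2 3 1 2 3 1 2 3
G          # 1 1 1 1 2 2 2 2 3 3 3 3
length(A)  # 12
```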
max(a)
## [1] 17
min(a)
## [1] 8
range(a)
## [1] 8 17
sum(a)
## [1] 125
mean(a)
## [1] 12.5
median(a)
## [1] 12.5
quantile(a)
## 0% 25% 50% 75% 100%
## 8.00 10.25 12.50 14.75 17.00
sd(a)
## [1] 3.02765
round(sd(a), 2)
## [1] 3.03
2.2.2 Creating list objects
We can put vectors of different types (e.g., numeric, logical or character) and different lengths in a list object.
list1 = list(a, b, c, d, e, f)
list1
## [[1]]
## [1] 8 9 10 11 12 13 14 15 16 17
##
## [[2]]
## [1] 9 10 100 38
##
## [[3]]
## [1] TRUE FALSE TRUE FALSE
##
## [[4]]
## [1] "TRUE" "FALSE" "FALSE"
##
## [[5]]
## [1] "9" "10" "ab" "cd"
##
## [[6]]
## [1] 10 11 1 0
# More often than not, we do not create lists ourselves but have to deal with them when we get output from statistical models.
2.3 2D data structures: matrices and data frames
Most of us have had some experience with Excel spreadsheets. Data in a spreadsheet are arranged in rows and columns in a rectangular space. This is a typical two-dimensional data structure. In R, there are two ways of storing tabular data like a spreadsheet: the matrix and the data frame.
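As a minimal sketch of working with lists, the lines below build a small named list and then pull components out of a fitted model. The student list is made up for illustration; the model uses cars, a data set that ships with R, purely as a convenient example.

```r
# A small named list (names and values are made up for illustration)
student <- list(name = "Alex", scores = c(70, 85, 90), passed = TRUE)

student$scores            # access a component by name with $
student[["name"]]         # access by name with [[ ]]
student[[3]]              # access by position
class(student["scores"])  # single brackets return a list, not the element

# Model output is a list too: a linear model on the built-in cars data
fit <- lm(dist ~ speed, data = cars)
is.list(fit)              # TRUE
names(fit)[1:3]           # the first few components
fit$coefficients          # extract one component
```

The difference between [ ] and [[ ]] is a frequent source of confusion: [ ] keeps the list wrapper, while [[ ]] and $ return the element itself.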
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout in which all the elements must be of the same type (e.g., numeric or character).
A data frame is similar to a matrix in shape and differs only in that different types of data (e.g. numeric, factor, character) can co-exist in different columns. Thus, in data analysis, we use data frames more often than matrices.
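The difference can be sketched with a tiny example. The object names (m, m_chr, d) and values below are made up for illustration.

```r
# A 2 x 3 matrix, filled column by column (the default)
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)         # 2 3
typeof(m)      # "integer"

# Adding a character column coerces every element to character
m_chr <- cbind(m, c("a", "b"))
typeof(m_chr)  # "character"

# A data frame keeps each column's own type
d <- data.frame(n = 1:2, label = c("a", "b"))
sapply(d, class)
```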
# Let's generate a dataframe from scratch.
id = seq(1, 40)
gender = rep(c("male", "female"), 20) # 20 repetitions give 40 values, one per id
maths = rnorm(40, mean = 70, sd = 5)
english = rnorm(40, mean = 80, sd = 9)
music = rnorm(40, mean = 75, sd = 10)
pe = rnorm(40, mean = 86, sd = 12)
df1 = data.frame (id, gender, maths, english)
Now let’s explore the data frame we just created.
str(df1)
## 'data.frame': 40 obs. of 4 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "female","male": 2 1 2 1 2 1 2 1 2 1 ...
## $ maths : num 75 66.8 66.2 66.8 69.7 ...
## $ english: num 70 69.7 74 90.9 88.2 ...
summary(df1)
## id gender maths english
## Min. : 1.00 female:20 Min. :56.35 Min. :53.08
## 1st Qu.:10.75 male :20 1st Qu.:68.46 1st Qu.:71.11
## Median :20.50 Median :69.89 Median :77.31
## Mean :20.50 Mean :70.41 Mean :77.20
## 3rd Qu.:30.25 3rd Qu.:73.45 3rd Qu.:82.20
## Max. :40.00 Max. :78.41 Max. :92.77
nrow(df1)
## [1] 40
ncol(df1)
## [1] 4
attributes(df1)
## $names
## [1] "id" "gender" "maths" "english"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
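The str() output above lists gender as a Factor, which is what data.frame() produced by default in R versions before 4.0.0. From R 4.0.0 onwards, strings stay as character unless stringsAsFactors = TRUE is set. The sketch below makes the choice explicit, using a short made-up gender vector.

```r
gender <- rep(c("male", "female"), 3)

# Explicitly request a factor column (matches the output printed above)
d_fct <- data.frame(gender, stringsAsFactors = TRUE)
class(d_fct$gender)   # "factor"

# Explicitly keep the strings as character
d_chr <- data.frame(gender, stringsAsFactors = FALSE)
class(d_chr$gender)   # "character"

# A character column can be converted afterwards
d_chr$gender <- factor(d_chr$gender)
levels(d_chr$gender)  # "female" "male"
```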
2.3.1 What if I want to change column names or add a variable to the data frame?
df2 = data.frame (id = id, gender = gender, maths = maths, english = english)
df2 = cbind(df2, pe)
colnames(df2) = c("ID", "SEX","MATHS","ENGLISH","PE")
head(df2)
## ID SEX MATHS ENGLISH PE
## 1 1 male 75.01796 70.00829 81.45317
## 2 2 female 66.76527 69.69189 97.60515
## 3 3 male 66.20430 74.01194 42.85854
## 4 4 female 66.76040 90.87320 81.41710
## 5 5 male 69.72038 88.19927 68.82476
## 6 6 female 78.40910 77.38686 78.27539
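Besides cbind() and colnames(), columns can be added with $ and renamed one at a time. The following is a minimal sketch on a small made-up data frame (scores); the seed value is arbitrary and only makes the random numbers reproducible.

```r
set.seed(1)  # arbitrary seed so the random scores are reproducible
scores <- data.frame(id = 1:4, maths = rnorm(4, mean = 70, sd = 5))

# Add a new column with $
scores$pe <- rnorm(4, mean = 86, sd = 12)

# Add a column computed from existing ones
scores$total <- scores$maths + scores$pe

# Rename a single column by name, leaving the others untouched
names(scores)[names(scores) == "id"] <- "ID"

names(scores)  # "ID" "maths" "pe" "total"
```

Renaming by name rather than by position avoids mistakes when the column order changes.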
2.3.2 Subsetting dataframes
We all know how to select part of an Excel spreadsheet by clicking and dragging the mouse. In R, when we want to select part of a data frame, we use the syntax dataframe[row, column]. Leaving row or column empty selects all rows or all columns.
There are various ways to use this syntax and, believe it or not, you will love them!
# the complete dataset
df2
## ID SEX MATHS ENGLISH PE
## 1 1 male 75.01796 70.00829 81.45317
## 2 2 female 66.76527 69.69189 97.60515
## 3 3 male 66.20430 74.01194 42.85854
## 4 4 female 66.76040 90.87320 81.41710
## 5 5 male 69.72038 88.19927 68.82476
## 6 6 female 78.40910 77.38686 78.27539
## 7 7 male 74.63351 72.84907 103.53668
## 8 8 female 70.20879 70.72695 75.82843
## 9 9 male 70.05682 73.92994 86.68846
## 10 10 female 72.83907 66.40915 82.75331
## 11 11 male 74.37074 78.15143 83.99379
## 12 12 female 59.28214 92.76303 83.46747
## 13 13 male 68.82831 57.60829 86.23941
## 14 14 female 67.07097 53.08478 92.80207
## 15 15 male 73.13889 81.69466 102.96310
## 16 16 female 70.93309 68.88525 88.10361
## 17 17 male 72.74304 68.15478 96.23987
## 18 18 female 72.25669 77.23339 109.31729
## 19 19 male 70.73052 71.25089 96.15179
## 20 20 female 75.22630 78.78715 67.72595
## 21 21 male 68.82996 76.23197 97.91007
## 22 22 female 76.96548 80.55444 92.41106
## 23 23 male 69.10837 81.14878 101.00206
## 24 24 female 69.59531 84.46368 84.35711
## 25 25 male 68.39574 81.23285 74.69256
## 26 26 female 75.03074 73.41064 87.77659
## 27 27 male 56.35006 88.63401 90.43346
## 28 28 female 69.08994 71.15803 86.71653
## 29 29 male 75.28302 81.68865 90.50612
## 30 30 female 69.57670 87.54575 97.37418
## 31 31 male 72.13060 83.42167 58.46628
## 32 32 female 69.68822 70.94917 73.62052
## 33 33 male 64.59905 64.67354 80.07683
## 34 34 female 69.48179 78.74199 67.77300
## 35 35 male 75.32829 81.78720 81.81599
## 36 36 female 68.47001 87.89925 99.06371
## 37 37 male 68.41000 89.30826 88.41365
## 38 38 female 68.44393 92.77329 79.07766
## 39 39 male 75.69908 75.21132 78.48825
## 40 40 female 70.69449 75.45890 87.14656
df2[2:5, ] # from row 2 to row 5
## ID SEX MATHS ENGLISH PE
## 2 2 female 66.76527 69.69189 97.60515
## 3 3 male 66.20430 74.01194 42.85854
## 4 4 female 66.76040 90.87320 81.41710
## 5 5 male 69.72038 88.19927 68.82476
df2[ , 1:2] # select column 1 to 2
## ID SEX
## 1 1 male
## 2 2 female
## 3 3 male
## 4 4 female
## 5 5 male
## 6 6 female
## 7 7 male
## 8 8 female
## 9 9 male
## 10 10 female
## 11 11 male
## 12 12 female
## 13 13 male
## 14 14 female
## 15 15 male
## 16 16 female
## 17 17 male
## 18 18 female
## 19 19 male
## 20 20 female
## 21 21 male
## 22 22 female
## 23 23 male
## 24 24 female
## 25 25 male
## 26 26 female
## 27 27 male
## 28 28 female
## 29 29 male
## 30 30 female
## 31 31 male
## 32 32 female
## 33 33 male
## 34 34 female
## 35 35 male
## 36 36 female
## 37 37 male
## 38 38 female
## 39 39 male
## 40 40 female
df2[ , c("ENGLISH", "PE")] # select by column names
## ENGLISH PE
## 1 70.00829 81.45317
## 2 69.69189 97.60515
## 3 74.01194 42.85854
## 4 90.87320 81.41710
## 5 88.19927 68.82476
## 6 77.38686 78.27539
## 7 72.84907 103.53668
## 8 70.72695 75.82843
## 9 73.92994 86.68846
## 10 66.40915 82.75331
## 11 78.15143 83.99379
## 12 92.76303 83.46747
## 13 57.60829 86.23941
## 14 53.08478 92.80207
## 15 81.69466 102.96310
## 16 68.88525 88.10361
## 17 68.15478 96.23987
## 18 77.23339 109.31729
## 19 71.25089 96.15179
## 20 78.78715 67.72595
## 21 76.23197 97.91007
## 22 80.55444 92.41106
## 23 81.14878 101.00206
## 24 84.46368 84.35711
## 25 81.23285 74.69256
## 26 73.41064 87.77659
## 27 88.63401 90.43346
## 28 71.15803 86.71653
## 29 81.68865 90.50612
## 30 87.54575 97.37418
## 31 83.42167 58.46628
## 32 70.94917 73.62052
## 33 64.67354 80.07683
## 34 78.74199 67.77300
## 35 81.78720 81.81599
## 36 87.89925 99.06371
## 37 89.30826 88.41365
## 38 92.77329 79.07766
## 39 75.21132 78.48825
## 40 75.45890 87.14656
df2[c(1,2,3), ] #select the first three rows
## ID SEX MATHS ENGLISH PE
## 1 1 male 75.01796 70.00829 81.45317
## 2 2 female 66.76527 69.69189 97.60515
## 3 3 male 66.20430 74.01194 42.85854
df2[seq(1, 40, 2), ] # select every other row (rows 1, 3, ..., 39)
## ID SEX MATHS ENGLISH PE
## 1 1 male 75.01796 70.00829 81.45317
## 3 3 male 66.20430 74.01194 42.85854
## 5 5 male 69.72038 88.19927 68.82476
## 7 7 male 74.63351 72.84907 103.53668
## 9 9 male 70.05682 73.92994 86.68846
## 11 11 male 74.37074 78.15143 83.99379
## 13 13 male 68.82831 57.60829 86.23941
## 15 15 male 73.13889 81.69466 102.96310
## 17 17 male 72.74304 68.15478 96.23987
## 19 19 male 70.73052 71.25089 96.15179
## 21 21 male 68.82996 76.23197 97.91007
## 23 23 male 69.10837 81.14878 101.00206
## 25 25 male 68.39574 81.23285 74.69256
## 27 27 male 56.35006 88.63401 90.43346
## 29 29 male 75.28302 81.68865 90.50612
## 31 31 male 72.13060 83.42167 58.46628
## 33 33 male 64.59905 64.67354 80.07683
## 35 35 male 75.32829 81.78720 81.81599
## 37 37 male 68.41000 89.30826 88.41365
## 39 39 male 75.69908 75.21132 78.48825
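Rows can also be selected by a condition instead of by position. The sketch below uses a small made-up data frame (d) with the same column names as df2, so that the results are easy to check by eye.

```r
# A small made-up data frame with the same column names as df2
d <- data.frame(
  ID    = 1:6,
  SEX   = rep(c("male", "female"), 3),
  MATHS = c(75, 67, 66, 67, 70, 78)
)

d[d$SEX == "female", ]                      # rows where SEX is "female"
d[d$MATHS > 70, c("ID", "MATHS")]           # filter rows and pick columns
d[d$SEX == "male" & d$MATHS >= 70, ]        # combine conditions with &
subset(d, MATHS > 70, select = c(ID, SEX))  # the same idea with subset()

d$MATHS       # $ returns one column as a vector
d[["MATHS"]]  # so does [[ ]]
d[-1, ]       # a negative index drops rows
```

The condition inside the brackets is a logical vector with one TRUE or FALSE per row; only the rows marked TRUE are kept.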
2.4 Summary
Dimensions | Homogeneous | Heterogeneous |
---|---|---|
1D | Atomic vector | List |
2D | Matrix | Data frame |
nD | Array | |
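
Each cell of the table can be checked in R with the matching is.*() function. The objects below are made up for illustration, one per structure.

```r
v   <- c(1.5, 2.5, 3.5)                      # 1D, homogeneous
l   <- list(1L, "a", TRUE)                   # 1D, heterogeneous
m   <- matrix(1:6, nrow = 2)                 # 2D, homogeneous
d   <- data.frame(x = 1:2, y = c("a", "b"))  # 2D, heterogeneous
arr <- array(1:24, dim = c(2, 3, 4))         # nD, homogeneous

is.atomic(v)      # TRUE
is.list(l)        # TRUE
is.matrix(m)      # TRUE
is.data.frame(d)  # TRUE
is.list(d)        # TRUE: a data frame is a list of equal-length columns
dim(arr)          # 2 3 4
```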