Chapter 2 Basic data structures in R

Before we jump into actual data analysis, it helps to first think about the common types of variables and how they are stored in R.

A variable is any characteristic, number, or quantity that can be measured or counted. A variable may also be called a data item. Age, sex, business income and expenses, country of birth, capital expenditure, class grades, eye colour and vehicle type are examples of variables. It is called a variable because the value may vary between data units in a population, and may change in value over time. (Australian Bureau of Statistics, ABS)

2.1 Types of variables: A taxonomy

2.1.1 Categorical variables: ordinal vs. nominal

Categorical variables have values that describe a ‘quality’, ‘type’ or ‘category’. Categorical variables are therefore qualitative variables and tend to be represented by non-numeric values. The categories should be mutually exclusive (each observation falls in exactly one category) and exhaustive (the categories cover all possible options).

Categorical variables may be further divided as being ordinal or nominal:

An ordinal variable can be logically ordered or ranked. Its categories can be ranked higher or lower than one another, but they do not necessarily establish a numeric difference between categories. In other words, the intervals between the levels of the variable are unknown. Examples of ordinal categorical variables include academic grades (e.g. A, B, C), clothing size (e.g. small, medium, large, extra large) and attitudes (e.g. strongly agree, agree, disagree, strongly disagree).

For example, survey participants are often asked to rate something. Subjective measurements of this kind are often ordinal variables, e.g. a Likert rating scale or level of education (“< high school”, “high school”, “associate’s degree”).

We can assign numbers to the levels of an ordinal variable, but we should bear in mind that these variables are not numeric. For example, “strongly agree” and “neutral” cannot be averaged into an “agree”, even though you may assign 5 to “strongly agree” and 3 to “neutral”.

A nominal variable cannot be organised in a logical sequence. Examples of nominal categorical variables include sex, business type, eye colour, religion and brand.

The data collected for a categorical variable are qualitative data.

2.1.2 Numeric variables: discrete or continuous

Numeric variables have values that describe a measurable quantity as a number, like ‘how many’ or ‘how much’. Therefore, numeric variables are quantitative variables. (ABS) They are also called interval/ratio variables, and the interval between adjacent numbers is equal: the interval between 1 kg and 2 kg is the same as that between 3 kg and 4 kg.

Numeric variables may be further divided as being either continuous or discrete:

A discrete variable consists of counts from a set of distinct whole values. A discrete variable cannot take the value of a fraction between one value and the next closest value. Examples of discrete variables include the number of registered cars, the number of business locations, and the number of children in a family, all of which are measured in whole units (e.g. 1, 2, 3 cars). (ABS)

A continuous variable can take any value within a certain range of real numbers. The value given to an observation for a continuous variable can include values as small as the instrument of measurement allows. Examples of continuous variables include height, time, age, and temperature. (ABS)

The data collected for a numeric variable are quantitative data.

The variable type determines (1) which statistical analyses are appropriate and (2) how we summarize the data with statistics and plots.

Figure: Taxonomy of variables

Variables can be stored in R in different data types.

Nominal and ordinal variables can be stored as character vectors or as factors (with levels).
Interval/ratio data are stored as numbers, either as integer or as numeric (double, i.e. real numbers that may have decimals).
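As a minimal sketch of these storage choices, a nominal variable can be held as a factor, an ordinal variable as an ordered factor, and counts or measurements as integer or double vectors. The variable names and values below are made up for illustration only.

```r
# Hypothetical example data; names and values are illustrative only.
eye_colour <- factor(c("brown", "blue", "brown", "green"))    # nominal

size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)                                 # ordinal

n_children <- c(0L, 2L, 1L, 3L)                # discrete: integer
height_cm  <- c(172.5, 160.2, 181.0, 167.8)    # continuous: double

class(eye_colour)    # "factor"
class(size)          # "ordered" "factor"
typeof(n_children)   # "integer"
typeof(height_cm)    # "double"

# Ordered factors support rank comparisons between levels.
size[1] < size[2]    # TRUE: "small" ranks below "large"

# summary() adapts to the type: counts for factors, quantiles for numbers.
summary(size)
summary(height_cm)
```

Note that the ordered factor can be compared by rank but is still not a number, which matches the earlier point that ordinal levels should not be averaged.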

If you have only one variable, you can store it in a vector. However, more often than not, you have a bunch of variables that need to be stored or imported as a matrix or data frame.

2.2 1D data structure: vectors

A vector is a sequence of data elements of the same basic type: integer, double, logical or character. All elements of a vector must be the same type.
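To check which basic type a vector has, we can call typeof() on it. The following is a small sketch with arbitrary values; the vector names are made up for illustration.

```r
# One illustrative vector per basic type (values are arbitrary).
v_int <- 1:3                   # integer
v_dbl <- c(1.5, 2, 3.25)       # double
v_lgl <- c(TRUE, FALSE, NA)    # logical
v_chr <- c("a", "b", "c")      # character

typeof(v_int)   # "integer"
typeof(v_dbl)   # "double"
typeof(v_lgl)   # "logical"
typeof(v_chr)   # "character"

length(v_chr)       # 3: the number of elements
is.numeric(v_int)   # TRUE: integer and double both count as numeric
```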

2.2.1 Creating vectors

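a = 8:17                          # the integers from 8 to 17

b <- c(9, 10, 100, 38)            # a numeric vector; both <- and = assign a value

c = c(TRUE, FALSE, TRUE, FALSE)   # a logical vector

c = c(T, F, T, F)                 # the same vector; T and F are shorthand for TRUE and FALSE

d = c("TRUE", "FALSE", "FALSE")   # a character vector: the quotes make these strings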

# You can change the type of a vector with the as.vector function.

as.vector(b, mode = "character")

## [1] "9"   "10"  "100" "38"

# When you put elements of different types in one vector, R will automatically change the type of some elements to keep the whole vector homogeneous.

e = c(9,10, "ab", "cd")

f = c(10, 11, T, F)
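The coercion can be inspected directly. In the sketch below (the variable names are illustrative), mixing numbers with strings yields a character vector, while mixing numbers with logical values yields numbers, with TRUE becoming 1 and FALSE becoming 0.

```r
mixed_chr <- c(9, 10, "ab", "cd")
mixed_num <- c(10, 11, TRUE, FALSE)

typeof(mixed_chr)   # "character"
mixed_chr           # "9" "10" "ab" "cd"

typeof(mixed_num)   # "double"
mixed_num           # 10 11 1 0

# Converting back can fail for some elements: strings that are not numbers
# become NA (R also issues a warning, suppressed here to keep output clean).
suppressWarnings(as.numeric(mixed_chr))   # 9 10 NA NA
```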

c() is the R function that combines its arguments into a vector.

There are some other basic functions in R that you can play with to generate vectors.

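A = 9:20 + 1                    # add 1 to every element of 9:20

B = seq(1, 10)                  # a sequence from 1 to 10

C = seq(1, 20, by = 2)          # a sequence from 1 up to at most 20, in steps of 2

D = rep(5, 4)                   # repeat the value 5 four times

E = rep(c(1, 2, 3), 4)          # repeat the whole vector four times

G = rep(c(1, 2, 3), each = 4)   # repeat each element four times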
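For reference, here is what calls of this kind return. The sketch uses shorter repetitions and its own object names (`rep_whole`, `rep_each`) so that it does not overwrite the vectors created above; `length.out` is a further argument of seq() that fixes the number of elements instead of the step size.

```r
seq(1, 20, by = 2)           # 1 3 5 7 9 11 13 15 17 19
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

rep_whole <- rep(c(1, 2, 3), times = 2)   # repeat the whole vector
rep_each  <- rep(c(1, 2, 3), each = 2)    # repeat each element in turn

rep_whole   # 1 2 3 1 2 3
rep_each    # 1 1 2 2 3 3

# The colon operator is evaluated before +, so 1 is added to every element.
9:12 + 1    # 10 11 12 13
```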

# Now that you have a vector, you can do some Maths.

max(a)

## [1] 17

min(a)

## [1] 8

range(a)

## [1]  8 17

sum(a)

## [1] 125

mean(a)

## [1] 12.5

median(a)

## [1] 12.5

quantile(a)

##    0%   25%   50%   75%  100% 
##  8.00 10.25 12.50 14.75 17.00

sd(a)

## [1] 3.02765

round(sd(a), 2)

## [1] 3.03
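Beyond these summary functions, arithmetic and comparisons in R work element by element, and most summary functions need to be told how to handle missing values. A short sketch follows; the vector `scores` is made up for illustration.

```r
scores <- c(8, 12, 15, NA, 17)

scores * 2     # 16 24 30 NA 34: applied to every element
scores > 10    # FALSE TRUE TRUE NA TRUE

mean(scores)                  # NA: a missing value propagates
mean(scores, na.rm = TRUE)    # 13: missing values are dropped first

# Keep only the non-missing elements greater than 10.
scores[!is.na(scores) & scores > 10]   # 12 15 17
```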

2.2.2 Creating list objects

We can put vectors of different types (e.g., numeric, logical or character) and lengths in a list object.

list1 = list(a, b, c, d, e, f)

list1

## [[1]]
##  [1]  8  9 10 11 12 13 14 15 16 17
## 
## [[2]]
## [1]   9  10 100  38
## 
## [[3]]
## [1]  TRUE FALSE  TRUE FALSE
## 
## [[4]]
## [1] "TRUE"  "FALSE" "FALSE"
## 
## [[5]]
## [1] "9"  "10" "ab" "cd"
## 
## [[6]]
## [1] 10 11  1  0

# More often than not, we do not make lists ourselves but have to deal with them when we get output from statistical models.
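To illustrate both points, the sketch below builds a small named list and then inspects a fitted linear model, which R returns as a list. It assumes the `cars` dataset that ships with R; the list `student` is made up.

```r
student <- list(name = "Kim", scores = c(70, 85, 90), enrolled = TRUE)

student$scores       # extract one element by name
student[["name"]]    # the same idea, using [[ ]]
student[1]           # single brackets return a smaller list, not the element

# Model output is a list: lm() fitted to the built-in cars data.
fit <- lm(dist ~ speed, data = cars)
is.list(fit)         # TRUE
names(fit)[1:3]      # the first few components, including "coefficients"
fit$coefficients     # the intercept and the slope
```

Extracting components by name in this way is how values such as coefficients are usually pulled out of model objects.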

2.3 2D data structures: matrices and data frames

Most of us have had some experience with Excel spreadsheets. Data in a spreadsheet are arranged in rows and columns in a rectangular space. This is a typical two-dimensional data structure. In R, there are two ways of forming tabular data like a spreadsheet: the matrix and the data frame.

A matrix is a collection of data elements arranged in a two-dimensional rectangular layout in which all the elements must be of the same type (e.g., numeric or character).

A data frame has the same rectangular shape as a matrix but differs in that different types of data (e.g. numeric, factor, character) can co-exist in different columns; within a single column, all values still share one type. Thus, in data analysis, we use data frames more often than matrices.
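The difference shows up as soon as types are mixed. In the minimal sketch below (the object names are illustrative), a matrix coerces everything to one type, while a data frame keeps each column's own type.

```r
# A numeric matrix: 2 rows, 3 columns, filled column by column.
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)     # 2 3
m[2, 3]    # 6: row 2, column 3

# Mixing types in a matrix coerces every element to character.
m_mixed <- cbind(c(1, 2), c("a", "b"))
typeof(m_mixed)   # "character"

# A data frame keeps each column's own type.
# stringsAsFactors = FALSE is spelled out so the result is the same in all R versions.
d <- data.frame(num = c(1, 2), chr = c("a", "b"), stringsAsFactors = FALSE)
sapply(d, class)  # num: "numeric", chr: "character"
```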

# Let's generate a dataframe from scratch.

id = seq(1, 40)

gender = rep(c("male", "female"), 20)  # 40 alternating values, one per id

# rnorm() draws random numbers, so your values will differ from the output shown below.

maths = rnorm(40, mean = 70, sd = 5)

english = rnorm(40, mean = 80, sd = 9)

music = rnorm(40, mean = 75, sd = 10)

pe = rnorm(40, mean = 86, sd = 12)

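# stringsAsFactors = TRUE stores gender as a factor, as in the output below (no longer the default since R 4.0.0).

df1 = data.frame(id, gender, maths, english, stringsAsFactors = TRUE)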

Now let’s explore the data frame we just created.

str(df1)

## 'data.frame':    40 obs. of  4 variables:
##  $ id     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender : Factor w/ 2 levels "female","male": 2 1 2 1 2 1 2 1 2 1 ...
##  $ maths  : num  70 71.5 71.7 67.3 71 ...
##  $ english: num  93 55.1 62.5 84.8 73.6 ...

summary(df1)

##        id           gender       maths          english     
##  Min.   : 1.00   female:20   Min.   :60.57   Min.   :55.13  
##  1st Qu.:10.75   male  :20   1st Qu.:66.43   1st Qu.:71.86  
##  Median :20.50               Median :70.12   Median :79.29  
##  Mean   :20.50               Mean   :69.73   Mean   :78.49  
##  3rd Qu.:30.25               3rd Qu.:71.68   3rd Qu.:85.50  
##  Max.   :40.00               Max.   :83.00   Max.   :97.55

nrow(df1)

## [1] 40

ncol(df1)

## [1] 4

attributes(df1)

## $names
## [1] "id"      "gender"  "maths"   "english"
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

2.3.1 What if I want to change column names or add a variable to the data frame?

df2 = data.frame (id = id, gender = gender, maths = maths, english = english)

df2 = cbind(df2, pe)

colnames(df2) = c("ID", "SEX","MATHS","ENGLISH","PE")

head(df2)

##   ID    SEX    MATHS  ENGLISH        PE
## 1  1   male 69.95176 93.00191  97.15277
## 2  2 female 71.54677 55.12660  75.79006
## 3  3   male 71.67045 62.54689  81.67235
## 4  4 female 67.25445 84.84459 103.52895
## 5  5   male 70.97599 73.64291  99.84611
## 6  6 female 68.06792 86.20590  86.53996
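The same two tasks can also be done one column at a time. The following sketch works on a small made-up data frame (`toy`) rather than on `df2`, so that it can be run on its own.

```r
toy <- data.frame(id = 1:3, maths = c(70, 65, 80))

# Add a variable with $ (an alternative to cbind()).
toy$english <- c(88, 79, 91)

# Derive a new variable from existing columns.
toy$total <- toy$maths + toy$english

# Rename a single column without retyping all the names.
names(toy)[names(toy) == "id"] <- "ID"

names(toy)   # "ID" "maths" "english" "total"
toy$total    # 158 144 171
```

Renaming by matching the old name is less error-prone than assigning a full vector of names, because it does not depend on the column order.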

2.3.2 Subsetting dataframes

We all know how to select part of an Excel spreadsheet by clicking and dragging the mouse. In R, when we want to select part of a data frame, we use the pattern dataframe[row, column]. Leaving the row or the column position empty selects all rows or all columns, respectively.

There are various ways we can use this formula and believe it or not, you will love them!

# the complete dataset

df2

##    ID    SEX    MATHS  ENGLISH        PE
## 1   1   male 69.95176 93.00191  97.15277
## 2   2 female 71.54677 55.12660  75.79006
## 3   3   male 71.67045 62.54689  81.67235
## 4   4 female 67.25445 84.84459 103.52895
## 5   5   male 70.97599 73.64291  99.84611
## 6   6 female 68.06792 86.20590  86.53996
## 7   7   male 65.51432 68.75287  90.03359
## 8   8 female 71.10533 84.33375 107.47870
## 9   9   male 69.06481 87.63462  80.22279
## 10 10 female 70.33877 70.27196  97.11310
## 11 11   male 71.27370 70.79533 100.30879
## 12 12 female 63.53398 80.83623  51.17344
## 13 13   male 67.55812 79.26045  82.70119
## 14 14 female 68.51564 87.19538  91.07716
## 15 15   male 60.56896 88.67801  79.43153
## 16 16 female 61.85224 67.14054 101.25958
## 17 17   male 72.26608 90.73062  83.97038
## 18 18 female 74.21551 77.13889  87.51114
## 19 19   male 71.16397 75.92932  83.36162
## 20 20 female 77.85834 82.52012  93.73073
## 21 21   male 77.71301 72.21872  75.34537
## 22 22 female 73.28516 82.30036  84.19643
## 23 23   male 71.16284 63.91421  87.02147
## 24 24 female 66.16200 87.26578 111.01689
## 25 25   male 71.69465 68.37152  88.13204
## 26 26 female 64.79651 74.97961  92.63813
## 27 27   male 66.07760 82.67037  78.19265
## 28 28 female 68.29584 90.07613  78.00366
## 29 29   male 70.65870 63.67181  86.14422
## 30 30 female 70.29251 81.79678  75.04130
## 31 31   male 69.75765 73.11760  86.93607
## 32 32 female 64.55882 76.38809 106.71705
## 33 33   male 74.25832 85.26931  74.50730
## 34 34 female 66.52137 79.95632  95.12957
## 35 35   male 66.15673 73.23804  94.83521
## 36 36 female 62.95305 79.02947  85.48000
## 37 37   male 73.84790 97.54913 111.18145
## 38 38 female 69.65016 67.65983  79.77146
## 39 39   male 73.93870 94.12206 104.65158
## 40 40 female 83.00292 79.31134  76.59283

df2[2:5, ] # from row 2 to row 5

##   ID    SEX    MATHS  ENGLISH        PE
## 2  2 female 71.54677 55.12660  75.79006
## 3  3   male 71.67045 62.54689  81.67235
## 4  4 female 67.25445 84.84459 103.52895
## 5  5   male 70.97599 73.64291  99.84611

df2[ , 1:2] # select column 1 to 2

##    ID    SEX
## 1   1   male
## 2   2 female
## 3   3   male
## 4   4 female
## 5   5   male
## 6   6 female
## 7   7   male
## 8   8 female
## 9   9   male
## 10 10 female
## 11 11   male
## 12 12 female
## 13 13   male
## 14 14 female
## 15 15   male
## 16 16 female
## 17 17   male
## 18 18 female
## 19 19   male
## 20 20 female
## 21 21   male
## 22 22 female
## 23 23   male
## 24 24 female
## 25 25   male
## 26 26 female
## 27 27   male
## 28 28 female
## 29 29   male
## 30 30 female
## 31 31   male
## 32 32 female
## 33 33   male
## 34 34 female
## 35 35   male
## 36 36 female
## 37 37   male
## 38 38 female
## 39 39   male
## 40 40 female

df2[ , c("ENGLISH", "PE")] # select by column names

##     ENGLISH        PE
## 1  93.00191  97.15277
## 2  55.12660  75.79006
## 3  62.54689  81.67235
## 4  84.84459 103.52895
## 5  73.64291  99.84611
## 6  86.20590  86.53996
## 7  68.75287  90.03359
## 8  84.33375 107.47870
## 9  87.63462  80.22279
## 10 70.27196  97.11310
## 11 70.79533 100.30879
## 12 80.83623  51.17344
## 13 79.26045  82.70119
## 14 87.19538  91.07716
## 15 88.67801  79.43153
## 16 67.14054 101.25958
## 17 90.73062  83.97038
## 18 77.13889  87.51114
## 19 75.92932  83.36162
## 20 82.52012  93.73073
## 21 72.21872  75.34537
## 22 82.30036  84.19643
## 23 63.91421  87.02147
## 24 87.26578 111.01689
## 25 68.37152  88.13204
## 26 74.97961  92.63813
## 27 82.67037  78.19265
## 28 90.07613  78.00366
## 29 63.67181  86.14422
## 30 81.79678  75.04130
## 31 73.11760  86.93607
## 32 76.38809 106.71705
## 33 85.26931  74.50730
## 34 79.95632  95.12957
## 35 73.23804  94.83521
## 36 79.02947  85.48000
## 37 97.54913 111.18145
## 38 67.65983  79.77146
## 39 94.12206 104.65158
## 40 79.31134  76.59283

df2[c(1,2,3), ] #select the first three rows

##   ID    SEX    MATHS  ENGLISH       PE
## 1  1   male 69.95176 93.00191 97.15277
## 2  2 female 71.54677 55.12660 75.79006
## 3  3   male 71.67045 62.54689 81.67235

df2[seq(1, 40, 2), ] # select every other row, starting from row 1

##    ID  SEX    MATHS  ENGLISH        PE
## 1   1 male 69.95176 93.00191  97.15277
## 3   3 male 71.67045 62.54689  81.67235
## 5   5 male 70.97599 73.64291  99.84611
## 7   7 male 65.51432 68.75287  90.03359
## 9   9 male 69.06481 87.63462  80.22279
## 11 11 male 71.27370 70.79533 100.30879
## 13 13 male 67.55812 79.26045  82.70119
## 15 15 male 60.56896 88.67801  79.43153
## 17 17 male 72.26608 90.73062  83.97038
## 19 19 male 71.16397 75.92932  83.36162
## 21 21 male 77.71301 72.21872  75.34537
## 23 23 male 71.16284 63.91421  87.02147
## 25 25 male 71.69465 68.37152  88.13204
## 27 27 male 66.07760 82.67037  78.19265
## 29 29 male 70.65870 63.67181  86.14422
## 31 31 male 69.75765 73.11760  86.93607
## 33 33 male 74.25832 85.26931  74.50730
## 35 35 male 66.15673 73.23804  94.83521
## 37 37 male 73.84790 97.54913 111.18145
## 39 39 male 73.93870 94.12206 104.65158
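Rows can also be selected by a condition instead of by position. The sketch below uses a small made-up data frame (`toy_scores`) with the same column names as `df2`, so that it is self-contained.

```r
toy_scores <- data.frame(
  ID    = 1:6,
  SEX   = rep(c("male", "female"), 3),
  MATHS = c(69, 72, 58, 81, 75, 66)
)

# Rows where a condition is TRUE (the empty column slot keeps all columns).
toy_scores[toy_scores$MATHS > 70, ]

# Combine conditions with & (and) or | (or).
toy_scores[toy_scores$SEX == "female" & toy_scores$MATHS > 70, ]

# A single column as a vector, via $ or [[ ]].
toy_scores$MATHS
mean(toy_scores[["MATHS"]])

# Selecting one column with [ , ] drops to a vector unless drop = FALSE.
is.data.frame(toy_scores[, "MATHS"])                 # FALSE
is.data.frame(toy_scores[, "MATHS", drop = FALSE])   # TRUE
```

The logical vector inside the brackets has one TRUE or FALSE per row, and only the rows marked TRUE are kept.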

2.4 Summary

Dimensions	Homogeneous	Heterogeneous
1D	Atomic Vector	List
2D	Matrix	Data frame
nD	Array
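Arrays are the only structure in this table that was not shown earlier. As a minimal sketch with arbitrary values, an array is a vector with a dim attribute, and a matrix is the two-dimensional special case.

```r
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers

dim(arr)        # 2 3 4
arr[1, 2, 3]    # 15: row 1, column 2, layer 3

# A matrix is a 2D array.
is.array(matrix(1:4, nrow = 2))   # TRUE
```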