Chapter 4 Variation
Variation is the tendency of the values of a variable to change from measurement to measurement. (Wickham & Grolemund)
4.1 Categorical variable
The best way to characterizing categorical variables is via the frequency. The frequency is the number of times a particular value for a variable (data item) has been observed to occur.
4.1.1 Absolute v.s relative frequency?
The frequency of a value can be expressed in different ways:
The absolute frequency describes the number of times a particular value for a variable occurs. Or simply put, counts.
The relative frequency describes the number of times a particular value for a variable occurs in relation to the total number of values for that variable.It is calculated by dividing the absolute frequency by the total number of values for the variable. Ratios, rates, proportions and percentages are different ways of expressing relative frequencies.
A ratio compares the frequency of one value for a variable with another value for the variable (1st value : 2nd value).
For example, in a total of 20 coin tosses where there are 12 heads and 8 tails, the ratio of heads to tails is 12:8. Alternatively, the ratio of tails to heads is 8:12.
- A rate is a measurement of one value for a variable in relation to another measured quantity.
For example, in a total of 20 coin tosses where there are 12 heads and 8 tails, the rate is 12 heads per 20 coin tosses. Alternatively, the rate is 8 tails per 20 coin tosses.
- A proportion describes the share of one value for a variable in relation to a whole.It is calculated by dividing the number of times a particular value for a variable has been observed, by the total number of values in the population.
For example, in a total of 20 coin tosses where there are 12 heads and 8 tails, the proportion of heads is 0.6 (12 divided by 20). Alternatively, the proportion of tails is 0.4 (8 divided by 20).
- A percentage expresses a value for a variable in relation to a whole population as a fraction of one hundred. The percentage total of an entire dataset should always add up to 100, as 100% represents the total. A percentage is calculated by dividing the number of times a particular value for a variable has been observed, by the total number of observations in the population, then multiplying this number by 100.
For example, in a total of 20 coin tosses where there are 12 heads and 8 tails, the percentage of heads is 60% (12 divided by 20, multiplied by 100). Alternatively, the percentage of tails is 40% (8 divided by 20, multiplied by 100). (ABS)
4.1.2 Frequency distributions
Frequency distributions are visual displays that organise and present frequency counts so that the information can be interpreted more easily.
A frequency distribution of data can be shown in a table or graph. Some common methods of showing frequency distributions include frequency tables, bar charts or histograms.
4.1.2.1 Frequency Tables
A frequency table is a simple way to display the number of occurrences of a particular value or characteristic.
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
#library(Hmisc)
describe(diamonds)
## diamonds
##
## 10 Variables 53940 Observations
## ---------------------------------------------------------------------------
## carat
## n missing distinct Info Mean Gmd .05 .10
## 53940 0 273 0.999 0.7979 0.5122 0.30 0.31
## .25 .50 .75 .90 .95
## 0.40 0.70 1.04 1.51 1.70
##
## lowest : 0.20 0.21 0.22 0.23 0.24, highest: 4.00 4.01 4.13 4.50 5.01
## ---------------------------------------------------------------------------
## cut
## n missing distinct
## 53940 0 5
##
## Value Fair Good Very Good Premium Ideal
## Frequency 1610 4906 12082 13791 21551
## Proportion 0.030 0.091 0.224 0.256 0.400
## ---------------------------------------------------------------------------
## color
## n missing distinct
## 53940 0 7
##
## Value D E F G H I J
## Frequency 6775 9797 9542 11292 8304 5422 2808
## Proportion 0.126 0.182 0.177 0.209 0.154 0.101 0.052
## ---------------------------------------------------------------------------
## clarity
## n missing distinct
## 53940 0 8
##
## Value I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
## Frequency 741 9194 13065 12258 8171 5066 3655 1790
## Proportion 0.014 0.170 0.242 0.227 0.151 0.094 0.068 0.033
## ---------------------------------------------------------------------------
## depth
## n missing distinct Info Mean Gmd .05 .10
## 53940 0 184 0.999 61.75 1.515 59.3 60.0
## .25 .50 .75 .90 .95
## 61.0 61.8 62.5 63.3 63.8
##
## lowest : 43.0 44.0 50.8 51.0 52.2, highest: 72.2 72.9 73.6 78.2 79.0
## ---------------------------------------------------------------------------
## table
## n missing distinct Info Mean Gmd .05 .10
## 53940 0 127 0.98 57.46 2.448 54 55
## .25 .50 .75 .90 .95
## 56 57 59 60 61
##
## lowest : 43.0 44.0 49.0 50.0 50.1, highest: 71.0 73.0 76.0 79.0 95.0
## ---------------------------------------------------------------------------
## price
## n missing distinct Info Mean Gmd .05 .10
## 53940 0 11602 1 3933 4012 544 646
## .25 .50 .75 .90 .95
## 950 2401 5324 9821 13107
##
## lowest : 326 327 334 335 336, highest: 18803 18804 18806 18818 18823
## ---------------------------------------------------------------------------
## x
## n missing distinct Info Mean Gmd .05 .10
## 53940 0 554 1 5.731 1.276 4.29 4.36
## .25 .50 .75 .90 .95
## 4.71 5.70 6.54 7.31 7.66
##
## lowest : 0.00 3.73 3.74 3.76 3.77, highest: 10.01 10.02 10.14 10.23 10.74
## ---------------------------------------------------------------------------
## y
## n missing distinct Info Mean Gmd .05 .10
## 53940 0 552 1 5.735 1.269 4.30 4.36
## .25 .50 .75 .90 .95
## 4.72 5.71 6.54 7.30 7.65
##
## Value 0.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5
## Frequency 7 5 1731 12305 7817 5994 6742 9260 4298 3402
## Proportion 0.000 0.000 0.032 0.228 0.145 0.111 0.125 0.172 0.080 0.063
##
## Value 8.0 8.5 9.0 9.5 10.0 10.5 32.0 59.0
## Frequency 1635 652 69 14 6 1 1 1
## Proportion 0.030 0.012 0.001 0.000 0.000 0.000 0.000 0.000
## ---------------------------------------------------------------------------
## z
## n missing distinct Info Mean Gmd .05 .10
## 53940 0 375 1 3.539 0.7901 2.65 2.69
## .25 .50 .75 .90 .95
## 2.91 3.53 4.04 4.52 4.73
##
## Value 0.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
## Frequency 20 1 2 3 8807 13809 9474 13682 5525 2352
## Proportion 0.000 0.000 0.000 0.000 0.163 0.256 0.176 0.254 0.102 0.044
##
## Value 5.5 6.0 6.5 7.0 8.0 32.0
## Frequency 237 20 5 1 1 1
## Proportion 0.004 0.000 0.000 0.000 0.000 0.000
## ---------------------------------------------------------------------------
contents(diamonds)
##
## Data frame:diamonds 53940 observations and 10 variables Maximum # NAs:0
##
##
## Levels Class Storage
## carat double
## cut 5 ordered integer
## color 7 ordered integer
## clarity 8 ordered integer
## depth double
## table double
## price integer
## x double
## y double
## z double
##
## +--------+---------------------------------+
## |Variable|Levels |
## +--------+---------------------------------+
## | cut |Fair,Good,Very Good,Premium,Ideal|
## +--------+---------------------------------+
## | color |D,E,F,G,H,I,J |
## +--------+---------------------------------+
## | clarity|I1,SI2,SI1,VS2,VS1,VVS2,VVS1,IF |
## +--------+---------------------------------+
Categorical variables are usually stored as factors or characters. You can use count() function or bar chart to explore the distribution.
4.1.2.2 Bar chart
A bar chart is a type of graph in which each column (plotted either vertically or horizontally) represents a categorical variable or a discrete ungrouped numeric variable. It is used to compare the frequency (count) for a category or characteristic with another category or characteristic.(ABS)
How to interprate:
In a bar chart, the bar height (if vertical) or length (if horizontal) shows the frequency for each category or characteristic.
The distribution of the dataset is not important because the columns each represent an individual category or characteristic rather than intervals for a continuous measurement. Therefore, gaps are included between each bar and each bar can be arranged in any order without affecting the data.
count(diamonds, cut)
## # A tibble: 5 x 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
count(diamonds, color)
## # A tibble: 7 x 2
## color n
## <ord> <int>
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
ggplot(diamonds)+
geom_bar(aes(x = cut))
ggplot(diamonds)+
geom_bar(aes(x = color))
Try this with clarity!
library(languageR)
##
## Attaching package: 'languageR'
## The following object is masked _by_ '.GlobalEnv':
##
## english
summary(lexdec)
## Subject RT Trial Sex NativeLanguage
## A1 : 79 Min. :5.829 Min. : 23 F:1106 English:948
## A2 : 79 1st Qu.:6.215 1st Qu.: 64 M: 553 Other :711
## A3 : 79 Median :6.346 Median :106
## C : 79 Mean :6.385 Mean :105
## D : 79 3rd Qu.:6.502 3rd Qu.:146
## I : 79 Max. :7.587 Max. :185
## (Other):1185
## Correct PrevType PrevCorrect Word
## correct :1594 nonword:855 correct :1542 almond : 21
## incorrect: 65 word :804 incorrect: 117 ant : 21
## apple : 21
## apricot : 21
## asparagus: 21
## avocado : 21
## (Other) :1533
## Frequency FamilySize SynsetCount Length
## Min. :1.792 Min. :0.0000 Min. :0.6931 Min. : 3.000
## 1st Qu.:3.951 1st Qu.:0.0000 1st Qu.:1.0986 1st Qu.: 5.000
## Median :4.754 Median :0.0000 Median :1.0986 Median : 6.000
## Mean :4.751 Mean :0.7028 Mean :1.3154 Mean : 5.911
## 3rd Qu.:5.652 3rd Qu.:1.0986 3rd Qu.:1.6094 3rd Qu.: 7.000
## Max. :7.772 Max. :3.3322 Max. :2.3026 Max. :10.000
##
## Class FreqSingular FreqPlural DerivEntropy
## animal:924 Min. : 4.0 Min. : 0.0 Min. :0.0000
## plant :735 1st Qu.: 23.0 1st Qu.: 19.0 1st Qu.:0.0000
## Median : 69.0 Median : 49.0 Median :0.0370
## Mean : 132.1 Mean :109.7 Mean :0.3856
## 3rd Qu.: 146.0 3rd Qu.:132.0 3rd Qu.:0.6845
## Max. :1518.0 Max. :854.0 Max. :2.2641
##
## Complex rInfl meanRT SubjFreq
## complex: 210 Min. :-1.3437 Min. :6.245 Min. :2.000
## simplex:1449 1st Qu.:-0.3023 1st Qu.:6.322 1st Qu.:3.160
## Median : 0.1900 Median :6.364 Median :3.880
## Mean : 0.2845 Mean :6.379 Mean :3.911
## 3rd Qu.: 0.6385 3rd Qu.:6.420 3rd Qu.:4.680
## Max. : 4.4427 Max. :6.621 Max. :6.040
##
## meanSize meanWeight BNCw BNCc
## Min. :1.323 Min. :0.8244 Min. : 0.02229 Min. : 0.0000
## 1st Qu.:1.890 1st Qu.:1.4590 1st Qu.: 1.64921 1st Qu.: 0.1625
## Median :3.099 Median :2.7558 Median : 3.32071 Median : 0.6500
## Mean :2.891 Mean :2.5516 Mean : 7.37800 Mean : 5.0351
## 3rd Qu.:3.711 3rd Qu.:3.4178 3rd Qu.: 7.10943 3rd Qu.: 2.9248
## Max. :4.819 Max. :4.7138 Max. :79.17324 Max. :83.1949
##
## BNCd BNCcRatio BNCdRatio
## Min. : 0.000 Min. :0.00000 Min. :0.0000
## 1st Qu.: 1.188 1st Qu.:0.09673 1st Qu.:0.5551
## Median : 3.800 Median :0.27341 Median :0.9349
## Mean : 12.995 Mean :0.45834 Mean :1.5428
## 3rd Qu.: 10.451 3rd Qu.:0.55550 3rd Qu.:2.1315
## Max. :241.561 Max. :8.29545 Max. :6.3458
##
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
4.2 Continous variable
A continuous variable can take any of an infinite set of ordered values.
4.2.1 Central tendency
A measure of central tendency attempts to describe a whole set of data with a single value that represents the middle or centre of its distribution.
There are three main measures of central tendency: the mode, the median, and the mean. Each of these measures describes a different indication of the typical or central value in the distribution.
4.2.1.1 Mode
- The mode is the most commonly occurring value in a distribution.
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
*Advantage:
The mode has an advantage over the median and the mean as it can be found for both numerical and categorical (non-numerical) data.
*Limitations:
- In some distributions, the mode may not reflect the centre of the distribution very well.
it is easy to see that the centre of the distribution is 57 years, but the mode is lower, at 54 years. 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
more than one mode for the same distribution of data, bi-modal, or multi-modal. The presence of more than one mode can limit the ability of the mode in describing the centre or typical value of the distribution because a single value to describe the centre cannot be identified.
In some cases, particularly where the data are continuous, the distribution may have no mode at all (i.e. if all values are different).In cases such as these, it may be better to consider using the median or mean, or group the data in to appropriate intervals, and find the modal class.
4.2.1.2 Median
- The median is the middle value in distribution when the values are arranged in ascending or descending order.
The median divides the distribution in half (there are 50% of observations on either side of the median value). In a distribution with an odd number of observations, the median value is the middle value.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the mean of the two middle values. In the following distribution, the two middle values are 56 and 57, therefore the median equals 56.5 years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
*Advantage:
The median is less affected by outliers and skewed data than the mean, and is usually the preferred measure of central tendency when the distribution is not symmetrical.
*Limitation:
The median cannot be identified for categorical nominal data, as it cannot be logically ordered.
4.2.1.3 Mean
The mean is the sum of the value of each observation in a dataset divided by the number of observations. This is also known as the arithmetic average.
Advantage
The mean can be used for both continuous and discrete numeric data.
- Limitations
The mean cannot be calculated for categorical data, as the values cannot be summed.
As the mean includes every value in the distribution the mean is influenced by outliers and skewed distributions.
The population mean is indicated by the Greek symbol µ (pronounced ‘mu’). When the mean is calculated on a distribution from a sample it is indicated by the symbol x̅ (pronounced X-bar).
4.2.2 Measures of Spread
Dataset A: 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8 Dataset B: 1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11
The measures of central tendency for both datasets above are the same.However, if we look at the spread of the values, we can see that Dataset B is more dispersed than Dataset A. Used together, the measures of central tendency and measures of spread help us to better understand the data
Summarising the dataset can help us understand the data, especially when the dataset is large. Measures of spread summarise the data in a way that shows how scattered the values are and how much they differ from the mean value.
The spread of the values can be measured for quantitative data, as the variables are numeric and can be arranged into a logical order with a low end value and a high end value.(ABS)
Measures of spread include the range, quartiles and the interquartile range, variance and standard deviation.(ABS)
dataset1 = c(4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8)
dataset2 = c(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11)
dataset = data.frame(dataset1, dataset2)
dataset = gather(dataset, type, data)
ggplot(dataset, aes(data, fill = type)) + geom_bar(position=position_dodge())
#### Range The range is the difference between the smallest value and the largest value in a dataset.
range(dataset1)
## [1] 4 8
range(dataset2)
## [1] 1 11
4.2.2.1 Quartiles
Quartiles divide an ordered dataset into four equal parts, and refer to the values of the point between the quarters. A dataset may also be divided into quintiles (five equal parts) or deciles (ten equal parts).
The lower quartile (Q1) is the point between the lowest 25% of values and the highest 75% of values. It is also called the 25th percentile.
The second quartile (Q2) is the middle of the data set. It is also called the 50th percentile, or the median.
The upper quartile (Q3) is the point between the lowest 75% and highest 25% of values. It is also called the 75th percentile.
4.2.2.2 The interquartile range
The interquartile range (IQR) is the difference between the upper (Q3) and lower (Q1) quartiles, and describes the middle 50% of values when ordered from lowest to highest. The IQR is often seen as a better measure of spread than the range as it is not affected by outliers.
quantile(dataset1)
## 0% 25% 50% 75% 100%
## 4 5 6 7 8
quantile(dataset2)
## 0% 25% 50% 75% 100%
## 1.00 3.75 6.00 8.25 11.00
4.2.2.3 Variance vs. standard deviation
The variance and the standard deviation are measures of how close each observed data value is to the mean value.
In datasets with a small spread all values are very close to the mean, resulting in a small variance and standard deviation. Where a dataset is more dispersed, values are spread further away from the mean, leading to a larger variance and standard deviation. The smaller the variance and standard deviation, the more the mean value is indicative of the whole dataset. Therefore, if all values of a dataset are the same, the standard deviation and variance are zero.
The standard deviation is the square root of the variance.
The standard deviation of a normal distribution enables us to calculate confidence intervals. In a normal distribution, about 68% of the values are within one standard deviation either side of the mean and about 95% of the scores are within two standard deviations of the mean.
var(dataset1)
## [1] 1.272727
sqrt(var(dataset1))
## [1] 1.128152
sd(dataset1)
## [1] 1.128152
4.2.2.4 Five-number summary
The five-number summary is a set of descriptive statistics that provide information about a dataset. It consists of the five most important sample percentiles:
the sample minimum (smallest observation)
the lower quartile or first quartile
the median (the middle value)
the upper quartile or third quartile
the sample maximum (largest observation)
The five-number summary provides a concise summary of the distribution of the observations. Reporting five numbers avoids the need to decide on the most appropriate summary statistic. The five-number summary gives information about the location (from the median), spread (from the quartiles) and range (from the sample minimum and maximum) of the observations.
Since it reports order statistics (rather than, say, the mean), the five-number summary is appropriate for ordinal measurements, as well as interval and ratio measurements. (from wikipedia)
How to Find a Five-Number Summary:
Step 1: Put your numbers in ascending order (from smallest to largest). For this particular data set, the order is: > Example: 1,2,5,6,7,9,12,15,18,19,27
Step 2: Find the minimum and maximum for your data set. Now that your numbers are in order, this should be easy to spot. In the example in step 1, the minimum (the smallest number) is 1 and the maximum (the largest number) is 27.
Step 3: Find the median. The median is the middle number. If you aren’t sure how to find the median, see: How to find the mean mode and median.
Step 4: Place parentheses around the numbers above and below the median.(This is not technically necessary, but it makes Q1 and Q3 easier to find). (1,2,5,6,7),9,(12,15,18,19,27).
Step 5: Find Q1 and Q3. Q1 can be thought of as a median in the lower half of the data, and Q3 can be thought of as a median for the upper half of data. (1,2,5,6,7), 9, ( 12,15,18,19,27).
-Step 6: Write down your summary found in the above steps. minimum=1, Q1 =5, median=9, Q3=18, and maximum=27.
summary(diamonds$carat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2000 0.4000 0.7000 0.7979 1.0400 5.0100
4.2.2.5 Visualization: Boxplot & violin plot
The five-number summary can be represented graphically using a boxplot.
Boxplots have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles. Outliers may be plotted as individual points. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data.
box plots typically graph six data points:
The lowest value, excluding outliers
The first quartile (this is the 25th percentile, or median of all the numbers below the median)
The median value (equivalent to the 50th percentile)
The third quartile (this is the 75th percentile, or median of all the numbers above the median)
The highest value, excluding outliers
Outliers
# r base function
boxplot(diamonds$carat)
#ggplot version
ggplot(diamonds, aes(x="carat", y=carat))+
geom_boxplot()
A violin plot is a method of plotting numeric data. Typically a violin plot will include all the data that is in a box plot: a marker for the median of the data; a box or marker indicating the interquartile range; and possibly all sample points, if the number of samples is not too high.
It is similar to a boxplot, with the addition of a rotated kernel density plot on each side.The difference is particularly useful when the data distribution is multimodal (more than one peak). In this case a violin plot shows the presence of different peaks, their position and relative amplitude.
knitr::include_graphics("img/violinplot.png")
#source:https://mode.com/blog/violin-plot-examples
#violin plot
ggplot(diamonds, aes(x="carat", y=carat))+
geom_violin()
ggplot(diamonds, aes(x="price", y=price))+
geom_violin()
### Distributions You can use a histogram and other descriptive stats to characterize its distribution.
4.2.2.6 Histogram
A histogram is a type of graph in which each column represents a numeric variable, in particular that which is continuous and/or grouped.It shows the distribution of all observations in a quantitative dataset. It is useful for describing the shape, centre and spread to better understand the distribution of the dataset.
hist(rbeta(10000,5,2))
hist(rbeta(10000,2,5))
hist(rbeta(10000,5,5))
How to interprate:
The height of the column shows the frequency for a specific range of values.
Columns are usually of equal width, however a histogram may show data using unequal ranges (intervals) and therefore have columns of unequal width.
The values represented by each column must be mutually exclusive and exhaustive. Therefore, there are no spaces between columns and each observation can only ever belong in one column.
It is important that there is no ambiguity in the labelling of the intervals on the x-axis for continuous or grouped data (e.g. 0 to less than 10, 10 to less than 20, 20 to less than 30).
# histogram
ggplot(diamonds)+
geom_histogram(aes(carat), binwidth = 0.1)
# histogram zoom in y axsis
ggplot(diamonds)+
geom_histogram(aes(carat), binwidth = 0.1)+
coord_cartesian(ylim = c(0,50))
# histogram zoom in x axsis
ggplot(diamonds[ which(diamonds$carat < 3), ])+
geom_histogram(aes(carat), binwidth = 0.1)
diamonds%>%
filter(carat <3)%>%
ggplot()+
geom_histogram(aes(carat), binwidth = 0.1)
# density plot
ggplot(diamonds,aes(carat))+
geom_density(kernel = "gaussian")
# area
ggplot(diamonds,aes(carat))+
geom_area(stat = "bin")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# dotplot
ggplot(diamonds,aes(carat))+
geom_dotplot()
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
# freqpoly
ggplot(diamonds, aes(carat))+
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
4.2.2.7 Symmetrical distributions:
When a distribution is symmetrical, the mode, median and mean are all in the middle of the distribution. The following graph shows a larger retirement age dataset with a distribution which is symmetrical. The mode, median and mean all equal 58 years.
4.2.2.8 Skewed distributions:
When a distribution is skewed, the mode remains the most commonly occurring value, the median remains the middle value in the distribution, but the mean is generally ‘pulled’ in the direction of the tails. In a skewed distribution, the median is often a preferred measure of central tendency, as the mean is not usually in the middle of the distribution.
A distribution is said to be positively or right skewed when the tail on the right side of the distribution is longer than the left side. In a positively skewed distribution it is common for the mean to be ‘pulled’ toward the right tail of the distribution. Although there are exceptions to this rule, generally, most of the values, including the median value, tend to be less than the mean value.
The following graph shows a larger retirement age data set with a distribution which is right skewed. The data has been grouped into classes, as the variable being measured (retirement age) is continuous. The mode is 54 years, the modal class is 54-56 years, the median is 56 years and the mean is 57.2 years.A distribution is said to be negatively or left skewed when the tail on the left side of the distribution is longer than the right side. In a negatively skewed distribution, it is common for the mean to be ‘pulled’ toward the left tail of the distribution. Although there are exceptions to this rule, generally, most of the values, including the median value, tend to be greater than the mean value.
The following graph shows a larger retirement age dataset with a distribution which left skewed. The mode is 65 years, the modal class is 63-65 years, the median is 63 years and the mean is 61.8 years.
4.2.2.9 QQ Plots
QQ Plots (Quantile-Quantile plots) are plots of two quantiles against each other. A quantile is a fraction where certain values fall below that quantile. For example, the median is a quantile where 50% of the data fall below that point and 50% lie above it. The purpose of QQ plots is to find out if two sets of data come from the same distribution.
A 45 degree angle is plotted on the Q Q plot; if the two data sets come from a common distribution, the points will fall on that reference line.(from Wikipedia)
# Solution 1
qplot(sample = carat, data = diamonds)
# solution 2
ggplot(diamonds)+
geom_qq(aes(sample = carat))+
geom_qq_line(aes(sample = carat))
4.2.3 Confidence interval
(Taken from Summary and Analysis of Extension Program Evaluation in R by Salvatore S. Mangiafico)
A confidence interval is a range in which it is estimated the true population value lies. A confidence interval is used to indicate how accurate a calculated statistic is likely to be. Confidence intervals can be calculated for a variety of statistics, such as the mean, median, or slope of a linear regression.
Most of the statistics we use assume we are analyzing a sample which we are using to represent a larger population.
4.2.3.1 Statistics and parameters
When we calculate the sample mean, the result is a statistic. It’s an estimate of the population mean, but our calculated sample mean would vary depending on our sample.
In theory, there is a mean for the population of interest, and we consider this population mean a parameter. Our goal in calculating the sample mean is estimating the population parameter.
4.2.3.2 Point estimates and confidence intervals
Our sample mean is a point estimate for the population parameter. A point estimate is a useful approximation for the parameter, but considering the confidence interval for the estimate gives us more information.
As a definition of confidence intervals, if we were to sample the same population many times and calculated a sample mean and a 95% confidence interval each time, then 95% of those intervals would contain the actual population mean.
if we want to compare the means of two groups to see if they are statistically different, we will use a t-test, or similar test, calculate a p-value, and draw a conclusion. An alternative approach would be to construct 95% or 99% confidence intervals about the mean for each group. If the confidence intervals of the two means don’t overlap, we are justified in calling them statistically different.
4.2.3.3 Methods
The traditional method is appropriate for normally distributed data or with large sample sizes. It produces an interval that is symmetric about the mean.
For skewed data, confidence intervals by bootstrapping may be more reliable.For routine use, I recommend using bootstrapped confidence intervals, particularly the BCa or percentile methods.
The groupwiseMean function in the rcompanion package can produce confidence intervals both by traditional and bootstrap methods, for grouped and ungrouped data.
4.2.3.3.1 The traditional method
The data must be housed in a data frame. By default, the function reports confidence intervals by the traditional method.
In the groupwiseMean function, the measurement and grouping variables can be indicated with formula notation, with the measurement variable on the left side of the tilde (~), and grouping variables on the right.
The confidence level is indicated by, e.g., the conf = 0.95 argument. The digits option indicates the number of significant digits to which the output is rounded. Note that in the output, the means and other statistics are rounded to 3 significant figures.
#install.packages("rcompanion")
library(rcompanion)
dataset3 = data.frame(dataset1)
groupwiseMean(dataset1 ~ 1,
data = dataset3,
conf = 0.95,
digits = 3)
## .id n Mean Conf.level Trad.lower Trad.upper
## 1 <NA> 12 6 0.95 5.28 6.72
groupwiseMean(data ~ type,
data = dataset,
conf = 0.95,
digits = 3)
## type n Mean Conf.level Trad.lower Trad.upper
## 1 dataset1 12 6 0.95 5.28 6.72
## 2 dataset2 12 6 0.95 3.99 8.01
4.2.3.3.2 The Bootstrapped method
In the groupwiseMean function, the type of confidence interval is requested by setting certain options to TRUE. These options are traditional, normal, basic, percentile and bca. The boot option reports an optional statistic, the mean by bootstrap. The R option indicates the number of iterations to calculate each bootstrap statistic.
groupwiseMean(data ~ type,
data = dataset,
conf = 0.95,
digits = 3,
R = 10000,
boot = TRUE,
traditional = FALSE,
normal = FALSE,
basic = FALSE,
percentile = FALSE,
bca = TRUE)
## type n Mean Boot.mean Conf.level Bca.lower Bca.upper
## 1 dataset1 12 6 6 0.95 5.33 6.50
## 2 dataset2 12 6 6 0.95 4.25 7.67
4.2.4 Unusual values
For unusual values in the dataset, you can do one of the two things:
drop the entire row with the strange values
replace the unusual values with missing values
unusual = diamonds %>%
filter(y < 3 | y > 20)%>%
arrange(y)
unusual
## # A tibble: 9 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Very Good H VS2 63.3 53 5139 0 0 0
## 2 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 3 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 4 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 5 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 6 0.71 Good F SI2 64.1 60 2130 0 0 0
## 7 0.71 Good F SI2 64.1 60 2130 0 0 0
## 8 0.51 Ideal E VS1 61.8 55 2075 5.15 31.8 5.12
## 9 2 Premium H SI2 58.9 57 12210 8.09 58.9 8.06
# drop the entire row with the strange values
diamonds_new = diamonds%>%
filter(between(y,3,20))
head(arrange(diamonds_new,y))
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.2 Premium D VS2 62.3 60 367 3.73 3.68 2.31
## 2 0.2 Premium F VS2 62.6 59 367 3.73 3.71 2.33
## 3 0.2 Very Good E VS2 63.4 59 367 3.74 3.71 2.36
## 4 0.2 Premium D VS2 61.7 60 367 3.77 3.72 2.31
## 5 0.2 Ideal E VS2 62.2 57 367 3.76 3.73 2.33
## 6 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
# replace the unusual values with missing values
diamonds_replace = diamonds %>%
mutate(y2 = ifelse(y<3 | y> 20, NA, y))
head(arrange(diamonds_replace, y ))
## # A tibble: 6 x 11
## carat cut color clarity depth table price x y z y2
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 Very Good H VS2 63.3 53 5139 0 0 0 NA
## 2 1.14 Fair G VS1 57.5 67 6381 0 0 0 NA
## 3 1.56 Ideal G VS2 62.2 54 12800 0 0 0 NA
## 4 1.2 Premium D VVS1 62.1 59 15686 0 0 0 NA
## 5 2.25 Premium H SI2 62.8 59 18034 0 0 0 NA
## 6 0.71 Good F SI2 64.1 60 2130 0 0 0 NA