Chapter 5 Covariation

While variation describes the behavior within a single variable, covariation describes the behavior between variables.

Covariation is the tendency for the values of two or more variables to vary together in a related way.

5.1 Two categorical variables

5.1.1 Contingency table

To characterize the covariation of two categorical variables, we can use a contingency table, which displays the frequency of each combination of levels.

table(diamonds$cut, diamonds$color)
##            
##                D    E    F    G    H    I    J
##   Fair       163  224  312  314  303  175  119
##   Good       662  933  909  871  702  522  307
##   Very Good 1513 2400 2164 2299 1824 1204  678
##   Premium   1603 2337 2331 2924 2360 1428  808
##   Ideal     2834 3903 3826 4884 3115 2093  896
table(diamonds$cut, diamonds$color, diamonds$clarity)
## , ,  = I1
## 
##            
##                D    E    F    G    H    I    J
##   Fair         4    9   35   53   52   34   23
##   Good         8   23   19   19   14    9    4
##   Very Good    5   22   13   16   12    8    8
##   Premium     12   30   34   46   46   24   13
##   Ideal       13   18   42   16   38   17    2
## 
## , ,  = SI2
## 
##            
##                D    E    F    G    H    I    J
##   Fair        56   78   89   80   91   45   27
##   Good       223  202  201  163  158   81   53
##   Very Good  314  445  343  327  343  200  128
##   Premium    421  519  523  492  521  312  161
##   Ideal      356  469  453  486  450  274  110
## 
## , ,  = SI1
## 
##            
##                D    E    F    G    H    I    J
##   Fair        58   65   83   69   75   30   28
##   Good       237  355  273  207  235  165   88
##   Very Good  494  626  559  474  547  358  182
##   Premium    556  614  608  566  655  367  209
##   Ideal      738  766  608  660  763  504  243
## 
## , ,  = VS2
## 
##            
##                D    E    F    G    H    I    J
##   Fair        25   42   53   45   41   32   23
##   Good       104  160  184  192  138  110   90
##   Very Good  309  503  466  479  376  274  184
##   Premium    339  629  619  721  532  315  202
##   Ideal      920 1136  879  910  556  438  232
## 
## , ,  = VS1
## 
##            
##                D    E    F    G    H    I    J
##   Fair         5   14   33   45   32   25   16
##   Good        43   89  132  152   77  103   52
##   Very Good  175  293  293  432  257  205  120
##   Premium    131  292  290  566  336  221  153
##   Ideal      351  593  616  953  467  408  201
## 
## , ,  = VVS2
## 
##            
##                D    E    F    G    H    I    J
##   Fair         9   13   10   17   11    8    1
##   Good        25   52   50   75   45   26   13
##   Very Good  141  298  249  302  145   71   29
##   Premium     94  121  146  275  118   82   34
##   Ideal      284  507  520  774  289  178   54
## 
## , ,  = VVS1
## 
##            
##                D    E    F    G    H    I    J
##   Fair         3    3    5    3    1    1    1
##   Good        13   43   35   41   31   22    1
##   Very Good   52  170  174  190  115   69   19
##   Premium     40  105   80  171  112   84   24
##   Ideal      144  335  440  594  326  179   29
## 
## , ,  = IF
## 
##            
##                D    E    F    G    H    I    J
##   Fair         3    0    4    2    0    0    0
##   Good         9    9   15   22    4    6    6
##   Very Good   23   43   67   79   29   19    8
##   Premium     10   27   31   87   40   23   12
##   Ideal       28   79  268  491  226   95   25
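The blank headers in the output above (for example `, ,  = I1`) appear because table() was given unnamed vectors. A minimal sketch, assuming only base R and the diamonds data from ggplot2: passing the vectors as named arguments labels the dimensions. The object name tab is ours, chosen for illustration.

```r
library(ggplot2)  # provides the diamonds data set

# Named arguments become the dimension names of the resulting table
tab <- table(cut = diamonds$cut, color = diamonds$color)

names(dimnames(tab))  # the dimension labels
tab
```

The xtabs() calls that follow get the same labels from the formula, without naming each vector by hand.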
xtabs(~ cut + color, diamonds)
##            color
## cut            D    E    F    G    H    I    J
##   Fair       163  224  312  314  303  175  119
##   Good       662  933  909  871  702  522  307
##   Very Good 1513 2400 2164 2299 1824 1204  678
##   Premium   1603 2337 2331 2924 2360 1428  808
##   Ideal     2834 3903 3826 4884 3115 2093  896
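Row and column totals often help when reading a contingency table. As a small sketch (the object names counts and with_totals are ours), base R's addmargins() appends a `Sum` row and column to the two-way table above:

```r
library(ggplot2)  # provides the diamonds data set

counts <- xtabs(~ cut + color, data = diamonds)

# addmargins() adds a "Sum" row and a "Sum" column by default
with_totals <- addmargins(counts)
with_totals
```

The bottom-right cell is the grand total, which should equal the number of rows in diamonds.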
xtabs(~ cut + color + clarity, diamonds)
## , , clarity = I1
## 
##            color
## cut            D    E    F    G    H    I    J
##   Fair         4    9   35   53   52   34   23
##   Good         8   23   19   19   14    9    4
##   Very Good    5   22   13   16   12    8    8
##   Premium     12   30   34   46   46   24   13
##   Ideal       13   18   42   16   38   17    2
## 
## , , clarity = SI2
## 
##            color
## cut            D    E    F    G    H    I    J
##   Fair        56   78   89   80   91   45   27
##   Good       223  202  201  163  158   81   53
##   Very Good  314  445  343  327  343  200  128
##   Premium    421  519  523  492  521  312  161
##   Ideal      356  469  453  486  450  274  110
## 
## , , clarity = SI1
## 
##            color
## cut            D    E    F    G    H    I    J
##   Fair        58   65   83   69   75   30   28
##   Good       237  355  273  207  235  165   88
##   Very Good  494  626  559  474  547  358  182
##   Premium    556  614  608  566  655  367  209
##   Ideal      738  766  608  660  763  504  243
## 
## , , clarity = VS2
## 
##            color
## cut            D    E    F    G    H    I    J
##   Fair        25   42   53   45   41   32   23
##   Good       104  160  184  192  138  110   90
##   Very Good  309  503  466  479  376  274  184
##   Premium    339  629  619  721  532  315  202
##   Ideal      920 1136  879  910  556  438  232
## 
## , , clarity = VS1
## 
##            color
## cut            D    E    F    G    H    I    J
##   Fair         5   14   33   45   32   25   16
##   Good        43   89  132  152   77  103   52
##   Very Good  175  293  293  432  257  205  120
##   Premium    131  292  290  566  336  221  153
##   Ideal      351  593  616  953  467  408  201
## 
## , , clarity = VVS2
## 
##            color
## cut            D    E    F    G    H    I    J
##   Fair         9   13   10   17   11    8    1
##   Good        25   52   50   75   45   26   13
##   Very Good  141  298  249  302  145   71   29
##   Premium     94  121  146  275  118   82   34
##   Ideal      284  507  520  774  289  178   54
## 
## , , clarity = VVS1
## 
##            color
## cut            D    E    F    G    H    I    J
##   Fair         3    3    5    3    1    1    1
##   Good        13   43   35   41   31   22    1
##   Very Good   52  170  174  190  115   69   19
##   Premium     40  105   80  171  112   84   24
##   Ideal      144  335  440  594  326  179   29
## 
## , , clarity = IF
## 
##            color
## cut            D    E    F    G    H    I    J
##   Fair         3    0    4    2    0    0    0
##   Good         9    9   15   22    4    6    6
##   Very Good   23   43   67   79   29   19    8
##   Premium     10   27   31   87   40   23   12
##   Ideal       28   79  268  491  226   95   25
ftable(xtabs(~ cut + color + clarity, diamonds))
##                 clarity   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF
## cut       color                                                
## Fair      D                4   56   58   25    5    9    3    3
##           E                9   78   65   42   14   13    3    0
##           F               35   89   83   53   33   10    5    4
##           G               53   80   69   45   45   17    3    2
##           H               52   91   75   41   32   11    1    0
##           I               34   45   30   32   25    8    1    0
##           J               23   27   28   23   16    1    1    0
## Good      D                8  223  237  104   43   25   13    9
##           E               23  202  355  160   89   52   43    9
##           F               19  201  273  184  132   50   35   15
##           G               19  163  207  192  152   75   41   22
##           H               14  158  235  138   77   45   31    4
##           I                9   81  165  110  103   26   22    6
##           J                4   53   88   90   52   13    1    6
## Very Good D                5  314  494  309  175  141   52   23
##           E               22  445  626  503  293  298  170   43
##           F               13  343  559  466  293  249  174   67
##           G               16  327  474  479  432  302  190   79
##           H               12  343  547  376  257  145  115   29
##           I                8  200  358  274  205   71   69   19
##           J                8  128  182  184  120   29   19    8
## Premium   D               12  421  556  339  131   94   40   10
##           E               30  519  614  629  292  121  105   27
##           F               34  523  608  619  290  146   80   31
##           G               46  492  566  721  566  275  171   87
##           H               46  521  655  532  336  118  112   40
##           I               24  312  367  315  221   82   84   23
##           J               13  161  209  202  153   34   24   12
## Ideal     D               13  356  738  920  351  284  144   28
##           E               18  469  766 1136  593  507  335   79
##           F               42  453  608  879  616  520  440  268
##           G               16  486  660  910  953  774  594  491
##           H               38  450  763  556  467  289  326  226
##           I               17  274  504  438  408  178  179   95
##           J                2  110  243  232  201   54   29   25
diamonds %>%
  group_by(cut, color)%>%
  count( )%>%
  group_by(cut)%>%
  mutate(sum = sum(n))%>%
  mutate(proportion = n/sum,
         percentage = (n/sum)*100)
## # A tibble: 35 x 6
## # Groups:   cut [5]
##    cut   color     n   sum proportion percentage
##    <ord> <ord> <int> <int>      <dbl>      <dbl>
##  1 Fair  D       163  1610     0.101       10.1 
##  2 Fair  E       224  1610     0.139       13.9 
##  3 Fair  F       312  1610     0.194       19.4 
##  4 Fair  G       314  1610     0.195       19.5 
##  5 Fair  H       303  1610     0.188       18.8 
##  6 Fair  I       175  1610     0.109       10.9 
##  7 Fair  J       119  1610     0.0739       7.39
##  8 Good  D       662  4906     0.135       13.5 
##  9 Good  E       933  4906     0.190       19.0 
## 10 Good  F       909  4906     0.185       18.5 
## # ... with 25 more rows
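The within-cut proportions computed above with dplyr can also be obtained in base R. The following is a sketch rather than the only way to do it; it assumes the diamonds data from ggplot2, and the names counts and row_props are ours.

```r
library(ggplot2)  # provides the diamonds data set

counts <- table(cut = diamonds$cut, color = diamonds$color)

# margin = 1 divides each cell by its row total,
# so the proportions within each cut sum to 1
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)
```

Using margin = 2 instead would give proportions within each color.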

5.1.2 Tile plot

You can visualize the frequency table with a tile plot.

diamonds%>% 
  count(color, cut)%>%
  ggplot(aes(color, cut))+
  geom_tile(aes(fill=n))

ggplot(diamonds)+
  geom_count(aes(cut, color))
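Because the cuts differ greatly in size, raw counts mostly show that Ideal diamonds are common. One possible variation, sketched here with names of our own choosing (cut_color_props, prop, p_tile), fills the tiles with the proportion of each color within a cut:

```r
library(dplyr)
library(ggplot2)

# Proportion of each color within each cut
cut_color_props <- diamonds %>%
  count(cut, color) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Same tile layout as before, but filled by within-cut proportion
p_tile <- ggplot(cut_color_props, aes(color, cut)) +
  geom_tile(aes(fill = prop))
p_tile
```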

5.2 Categorical + continuous variable

The covariation of a categorical and a continuous variable can be visualized and explored by treating the categorical variable as a grouping factor. We can then apply, within each group, the methods we learned for a single continuous variable.

5.2.1 Summary table

R offers a number of ways to summarize a continuous variable (for example, its mean and standard deviation) as a function of one or more grouping variables.

# solution 1
with(diamonds, tapply(price, cut, mean))
##      Fair      Good Very Good   Premium     Ideal 
##  4358.758  3928.864  3981.760  4584.258  3457.542
with(diamonds, tapply(price, list(cut, color, clarity), mean))
## , , I1
## 
##                  D        E        F        G        H        I        J
## Fair      7383.000 2095.222 2543.514 3187.472 4212.962 3501.000 5795.043
## Good      3490.750 4398.130 2569.526 3195.789 3849.714 4175.444 3794.500
## Very Good 2622.800 3443.545 4252.923 3194.812 5258.833 6045.125 4478.375
## Premium   3818.750 3199.267 3554.559 4051.522 3904.348 5044.625 4577.231
## Ideal     3526.923 3559.389 3903.452 4044.438 5415.184 4103.294 9454.000
## 
## , , SI2
## 
##                  D        E        F        G        H        I        J
## Fair      4355.143 4172.385 4520.112 5665.150 6022.407 6658.022 5131.815
## Good      3595.296 3785.490 4426.786 4776.411 5529.778 6933.012 5306.113
## Very Good 4425.459 4279.447 4249.758 4699.269 6112.414 6621.600 5992.898
## Premium   4351.086 4489.931 4747.090 5617.205 6718.946 7148.484 7550.286
## Ideal     3142.048 3891.303 4335.508 4612.086 5589.473 7191.912 6555.173
## 
## , , SI1
## 
##                  D        E        F        G        H        I        J
## Fair      4273.345 3901.154 3784.687 3579.362 5195.800 4574.967 4553.929
## Good      3021.173 3162.132 3261.454 4129.329 4179.285 4742.945 4627.625
## Very Good 3234.931 3228.176 3574.292 3481.871 4933.945 5195.302 5026.544
## Premium   3236.378 3362.625 4040.467 4303.348 5707.722 6092.093 5726.579
## Ideal     2490.459 2883.808 3710.322 3441.108 4769.988 5178.565 5115.675
## 
## , , VS2
## 
##                  D        E        F        G        H        I        J
## Fair      4512.880 3041.714 3400.472 5384.444 5110.927 3856.125 4067.826
## Good      3588.462 3772.019 3790.543 4140.714 4433.043 5956.564 4803.167
## Very Good 3145.194 3329.497 3995.944 4426.816 4620.221 5754.642 5325.549
## Premium   2919.357 3070.394 4221.467 4556.255 5553.876 7156.346 6175.559
## Ideal     2111.927 2163.324 3317.205 4310.035 4039.126 4663.384 4867.134
## 
## , , VS1
## 
##                  D        E        F        G        H        I        J
## Fair      2921.200 3307.929 4103.061 3497.622 4604.750 4500.480 5906.188
## Good      3556.581 3712.775 2787.508 4302.428 3819.117 4597.165 3662.827
## Very Good 2955.480 3089.358 3880.802 3770.150 3750.198 5276.971 4339.592
## Premium   4178.046 3721.695 4758.038 4435.823 3949.336 5339.367 5817.261
## Ideal     2576.040 2175.798 3504.002 4116.918 3613.325 3944.422 4734.428
## 
## , , VVS2
## 
##                  D        E        F        G        H        I        J
## Fair      3607.000 3119.308 4018.200 3099.059 3481.727 2994.625 2998.000
## Good      2345.640 3390.154 3192.360 3310.467 2428.000 2758.000 4371.154
## Very Good 2615.298 2041.685 3461.912 3711.785 2768.145 3059.887 5960.448
## Premium   3888.436 2940.942 4099.466 4323.571 2651.263 3190.768 6423.353
## Ideal     3619.014 2556.335 3323.629 3795.651 2591.156 2858.680 4121.926
## 
## , , VVS1
## 
##                  D        E        F        G        H        I        J
## Fair      4473.000 4115.333 4679.800 2216.333 4115.000 4194.000 1691.000
## Good      2586.231 1905.953 2189.514 2705.195 1719.710 2650.955 4633.000
## Very Good 2987.731 1997.447 2826.540 2719.332 2042.191 2056.420 3175.526
## Premium   3771.000 2699.857 3969.325 2933.655 1453.759 1831.083 7244.375
## Ideal     2705.778 2205.519 2611.234 2909.199 1915.985 2034.397 2000.172
## 
## , , IF
## 
##                   D        E        F        G        H        I        J
## Fair       1619.667       NA 2344.000 1488.000       NA       NA       NA
## Good      10030.333 1519.222 3132.867 4060.136 5948.750 1749.333 2738.000
## Very Good 10298.261 4332.744 4677.075 3525.241 2647.690 4093.895 1074.125
## Premium    9056.500 4525.444 3617.581 3311.115 3384.750 2358.565 7026.000
## Ideal      6567.179 3258.937 2153.709 2206.031 1982.765 1502.621 2489.000
# solution 2
# install.packages("doBy")
library(doBy)
data = as.data.frame(diamonds)
head(summaryBy(price ~ cut + clarity + color , data = data, FUN = mean))
##    cut clarity color price.mean
## 1 Fair      I1     D   7383.000
## 2 Fair      I1     E   2095.222
## 3 Fair      I1     F   2543.514
## 4 Fair      I1     G   3187.472
## 5 Fair      I1     H   4212.962
## 6 Fair      I1     I   3501.000
head(summaryBy(price  + carat ~ cut + clarity + color , data = data, FUN = mean))
##    cut clarity color price.mean carat.mean
## 1 Fair      I1     D   7383.000  1.8775000
## 2 Fair      I1     E   2095.222  0.9688889
## 3 Fair      I1     F   2543.514  1.0234286
## 4 Fair      I1     G   3187.472  1.2264151
## 5 Fair      I1     H   4212.962  1.4986538
## 6 Fair      I1     I   3501.000  1.3229412
head(summaryBy(price  + carat ~ cut +  color , data = data, FUN = c(mean, sd)))
##    cut color price.mean carat.mean price.sd  carat.sd
## 1 Fair     D   4291.061  0.9201227 3286.114 0.4054185
## 2 Fair     E   3682.312  0.8566071 2976.652 0.3645848
## 3 Fair     F   3827.003  0.9047115 3223.303 0.4188899
## 4 Fair     G   4239.255  1.0238217 3609.644 0.4927241
## 5 Fair     H   5135.683  1.2191749 3886.482 0.5482389
## 6 Fair     I   4685.446  1.1980571 3730.271 0.5219776
# solution 3
diamonds%>%
  group_by(cut, clarity, color)%>%
  summarise(mean = mean(price),
            sd = sd(price))
## # A tibble: 276 x 5
## # Groups:   cut, clarity [40]
##    cut   clarity color  mean    sd
##    <ord> <ord>   <ord> <dbl> <dbl>
##  1 Fair  I1      D     7383  5899.
##  2 Fair  I1      E     2095.  824.
##  3 Fair  I1      F     2544. 2227.
##  4 Fair  I1      G     3187. 2598.
##  5 Fair  I1      H     4213. 3149.
##  6 Fair  I1      I     3501  2157.
##  7 Fair  I1      J     5795. 4594.
##  8 Fair  SI2     D     4355. 3260.
##  9 Fair  SI2     E     4172. 3055.
## 10 Fair  SI2     F     4520. 3627.
## # ... with 266 more rows
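A further base R option, not among the three solutions above, is aggregate() with a formula. This is a minimal sketch assuming the diamonds data from ggplot2; the result names price_by_cut and price_by_cut_color are ours.

```r
library(ggplot2)  # provides the diamonds data set

# Mean price for each cut
price_by_cut <- aggregate(price ~ cut, data = diamonds, FUN = mean)
price_by_cut

# Mean price for each combination of cut and color
price_by_cut_color <- aggregate(price ~ cut + color, data = diamonds, FUN = mean)
head(price_by_cut_color)
```

The per-cut means should agree with the tapply() result from solution 1.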

5.2.2 Central tendency (mean): Bar plots

#bar plot
diamonds%>%
  group_by(cut)%>%
  summarise(mean = mean(price))%>%
  ggplot(aes(cut, mean))+
  geom_bar(stat="identity")

diamonds%>%
  group_by(cut)%>%
  summarise(mean = mean(price))%>%
  ggplot(aes(cut, mean, fill = cut))+
  geom_bar(stat="identity")

diamonds%>%
  group_by(cut)%>%
  summarise(mean = mean(price),
            sd = sd(price))%>%
  ggplot(aes(cut, mean))+
  geom_bar(stat="identity")+
  geom_errorbar(aes(ymin = mean - sd,
                    ymax = mean + sd),
                width = .2, size = 0.7, position = position_dodge(.9))
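The error bars above show one standard deviation, which describes the spread of prices within each cut. To show the uncertainty of the mean itself, one could plot the standard error instead. The sketch below assumes ggplot2 3.3.0 or later (for the `fun` argument) and lets stat_summary() do the summarising; the name p_se is ours.

```r
library(ggplot2)

p_se <- ggplot(diamonds, aes(cut, price)) +
  # bar height = mean price per cut
  stat_summary(fun = mean, geom = "bar") +
  # mean_se() returns the mean plus/minus one standard error
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2)
p_se
```

With tens of thousands of diamonds the standard errors are very small, so these bars will look much shorter than the standard-deviation bars.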

Spread: boxplot

# boxplot

ggplot(diamonds, aes(cut, price))+
  geom_boxplot()

5.2.3 Distribution: density plot

We can color-code a frequency polygon or density plot by the grouping factor to compare the distribution of the continuous variable across groups.

library(tidyverse)

ggplot(diamonds, aes(price))+
  geom_freqpoly(binwidth = 500)

ggplot(diamonds, aes(price))+
  geom_freqpoly(aes(color = cut), binwidth = 500)

# standardized count where the area under each frequency polygon is one

ggplot(diamonds, aes(x = price, y = after_stat(density)))+
  geom_freqpoly(aes(color = cut), binwidth = 500)
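The plots above are frequency polygons, which are binned counts. A smoothed alternative is a kernel density estimate, sketched below with geom_density() and its default bandwidth; the name p_density is ours.

```r
library(ggplot2)

# One smoothed density curve per cut; each curve integrates to one
p_density <- ggplot(diamonds, aes(price)) +
  geom_density(aes(color = cut))
p_density
```

Because each curve has unit area, the groups can be compared by shape even though they differ greatly in size.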

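The boxplots below use the mpg data set from ggplot2, which contains the following variables:

manufacturer = manufacturer name

model = model name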

displ = engine displacement, in litres

year = year of manufacture

cyl = number of cylinders

trans = type of transmission

drv = type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive

cty = city miles per gallon

hwy = highway miles per gallon

fl = fuel type

class = “type” of car

summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
ggplot(mpg)+
  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))

ggplot(mpg)+
  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))+
  coord_flip()
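In the boxplots above, reorder(class, hwy, FUN = median) sorts the levels of class by their median highway mileage. The ordering can be checked numerically with a short sketch such as the following, where the name class_medians is ours:

```r
library(ggplot2)  # provides the mpg data set

# Median highway mileage per class, sorted from lowest to highest
class_medians <- sort(tapply(mpg$hwy, mpg$class, median))
class_medians
```

The names of class_medians should appear in the same order as the boxes along the axis.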

5.3 Two continuous variables

5.3.1 Scatter plots

The most common way to visualize the covariation of two continuous variables is a scatter plot.

ggplot(diamonds)+
  geom_point(aes(carat, price))

# add transparency
ggplot(diamonds)+
  geom_point(aes(carat, price), alpha = 1/100)
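A scatter plot shows the shape of the relationship; a correlation coefficient condenses its strength into a single number. As a complement to the plots, one could compute it as sketched below. The names r_pearson and r_spearman are ours.

```r
library(ggplot2)  # provides the diamonds data set

# Pearson measures linear association; Spearman uses ranks,
# so it only assumes a monotonic relationship
r_pearson  <- cor(diamonds$carat, diamonds$price)
r_spearman <- cor(diamonds$carat, diamonds$price, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

Both coefficients lie between -1 and 1; values near 1 indicate that price tends to rise with carat.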

Bin one or both continuous variables

Sometimes we can bin one or both continuous variables to convert them into categorical variables. In those cases, we apply what we learned about categorical variables, such as tile plots or boxplots.

#bin one variable
ggplot(diamonds,aes(carat, price))+
  geom_boxplot(aes(group = cut_width(carat, 0.1)))

ggplot(diamonds,aes(carat, price))+
  geom_boxplot(aes(group = cut_width(carat, 0.5)))
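With cut_width() every bin has the same width, so bins at large carat values contain very few diamonds. A possible alternative, sketched here, is ggplot2's cut_number(), which builds bins holding roughly the same number of observations. The choice of 10 bins and the name p_equal_n are ours.

```r
library(ggplot2)

# Ten bins with approximately equal numbers of diamonds in each
p_equal_n <- ggplot(diamonds, aes(carat, price)) +
  geom_boxplot(aes(group = cut_number(carat, 10)))
p_equal_n
```

The boxes now vary in width along the carat axis, because densely populated ranges are split into narrower bins.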

# bin two variables
ggplot(diamonds)+
  geom_bin2d(aes(carat, price))

#install.packages("hexbin")
ggplot(diamonds)+
  geom_hex(aes(carat, price))
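To make the link to the categorical tools explicit, the binning that geom_bin2d() performs internally can also be done by hand and passed to a tile plot. The sketch below is one way to do this; the bin widths (0.5 carat, 2500 price units) and the names binned, carat_bin, price_bin and p_binned are our assumptions.

```r
library(dplyr)
library(ggplot2)

# Convert both continuous variables to categorical bins, then count
binned <- diamonds %>%
  mutate(carat_bin = cut_width(carat, 0.5),
         price_bin = cut_width(price, 2500)) %>%
  count(carat_bin, price_bin)

# The binned counts are an ordinary frequency table, drawn as tiles
p_binned <- ggplot(binned, aes(carat_bin, price_bin)) +
  geom_tile(aes(fill = n))
p_binned
```

The counts in binned sum to the number of rows in diamonds, since every diamond falls in exactly one pair of bins.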