1 Loading data / データの読み込み

タイタニック号乗客データを読み込む．

titanic <- read.csv("https://raw.githubusercontent.com/kurodaecon/bs/main/data/titanic3_csv.csv")

2 Qualitative data / 質的データ

性別変数 sex を例に利用．

head(titanic$sex)

## [1] "female" "male"   "female" "male"   "female" "male"

2.1 Frequency distribution table / 度数分布表

table(ベクトル) で1変数の度数分布表を作成できる．

table(titanic$sex)

## 
## female   male 
##    466    843

female と male それぞれを数える場合．

head(titanic$sex == "female")  # logical ... TRUE or FALSE

## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE

sum(titanic$sex == "female")  # the number of TRUE = the number of female

## [1] 466

sum(titanic$sex == "male")  # the number of male

## [1] 843

2.1.1 Count missing values / 欠損値をカウントする

この table 関数を利用してデータに欠損値 (NA) があるかを確認するとよい．欠損値である場合に TRUE を返す関数 is.na を併用する．

is.na(NA)

## [1] TRUE

is.na(c(1, 10, NA, 1000))

## [1] FALSE FALSE  TRUE FALSE

table(is.na(c(1, 10, NA, 1000)))

## 
## FALSE  TRUE 
##     3     1

table(is.na(titanic$sex))

## 
## FALSE 
##  1309

2.2 Bar chart / 棒グラフ

barplot(度数分布表または度数ベクトル) で作成．

barplot(table(titanic$sex))

2.3 Pie chart / 円グラフ

棒グラフと同様に pie(度数分布表または度数ベクトル) で作成．

pie(table(titanic$sex))

2.3.1 Add percentage to labels

ラベルに「%」を追加する．

titanic_sex_table <- table(titanic$sex)
titanic_sex_label <- paste(names(titanic_sex_table), " ", 
                     round(100 * titanic_sex_table / sum(titanic_sex_table), 1), "%", sep = "")
pie(titanic_sex_table, labels = titanic_sex_label)

このコードはどのように動いているのか？

paste("female", " ", 35.6, "%", sep = "")

## [1] "female 35.6%"

titanic_sex_table / sum(titanic_sex_table)

## 
##    female      male 
## 0.3559969 0.6440031

round(100 * titanic_sex_table / sum(titanic_sex_table), 1)

## 
## female   male 
##   35.6   64.4

注：ちょっとしたグラフを作成する程度ならExcelの方が便利という場合もよくある．

2.4 Contingency table / 分割表

2つの質的変数の分割表（クロス集計表）も table 関数で作成できる．

table(1つ目のベクトル, 2つ目のベクトル) のように2つの変数を引数として指定する．

table(titanic$sex, titanic$survived)

##         
##            0   1
##   female 127 339
##   male   682 161

table(titanic[, c("sex", "survived")])  # same as above

##         survived
## sex        0   1
##   female 127 339
##   male   682 161

セルを1つずつカウントすることもできる．

sum(titanic$sex == "female" & titanic$survived == 0)

## [1] 127

sum(titanic$sex == "female" & titanic$survived == 1)

## [1] 339

sum(titanic$sex == "male" & titanic$survived == 0)

## [1] 682

sum(titanic$sex == "male" & titanic$survived == 1)

## [1] 161

2.5 Mode / 最頻値

出港地 embarked を例として．

C＝Cherbourg（仏・シェルブール）, Q＝Queenstown（アイルランド）, S＝Southampton（英）

table(titanic$embarked)

## 
##       C   Q   S 
##   2 270 123 914

table(titanic$embarked == "")  # 中身のない空の観測値がある

## 
## FALSE  TRUE 
##  1307     2

names(which.max(table(titanic$embarked)))

## [1] "S"

3 Quantitative data / 量的データ

年齢変数 age を例に利用．欠損値もある．

head(titanic$age)

## [1] 29.00  0.92  2.00 30.00 25.00 48.00

table(is.na(titanic$age))

## 
## FALSE  TRUE 
##  1046   263

3.1 Histogram / ヒストグラム

hist(ベクトル) で作成． breaks 引数でセルの数を調整できる．

hist(titanic$age)

hist(titanic$age, breaks = 30)  # set the number of cells

度数ではなく相対度数（全体に占める割合）で表示する場合は freq = FALSE という引数を追加する．

（おまけ）滑らかな曲線をフィットさせるには density 関数でカーネル密度推定を行って lines 関数で重ね描きする．

hist(titanic$age, freq = FALSE)
lines(density(titanic$age, na.rm = TRUE))

3.2 Frequency distribution table of quantitative data / 数量データの度数分布表

[0, 20), [20, 60), [60, 100) の3区間で作成する．そのために，cut 関数を利用して実数データを上の3区間に変換する．

breaks 引数で区間の区切り値のベクトルを指定
「[0, 20)」は 0 以上 20 未満．右は閉じない（左側が閉じる）ので right = FALSE 引数を追加

titanic_age_interval <- cut(titanic$age, breaks = c(0, 20, 40, 60, 100), right = FALSE)
# [right = F] means intervals should not be closed on the right
table(titanic_age_interval)

## titanic_age_interval
##   [0,20)  [20,40)  [40,60) [60,100) 
##      225      576      205       40

table(titanic_age_interval) / sum(table(titanic_age_interval))  # relative frequency

## titanic_age_interval
##     [0,20)    [20,40)    [40,60)   [60,100) 
## 0.21510516 0.55066922 0.19598470 0.03824092

同様に，分割表は以下のように作成できる．

table(titanic_age_interval, titanic$survived)

##                     
## titanic_age_interval   0   1
##             [0,20)   119 106
##             [20,40)  351 225
##             [40,60)  121  84
##             [60,100)  28  12

3.3 Mean / 平均

標本平均

\[ \bar{x} = \frac{1}{n} \sum_i x_i \]

x <- c(1, 2, 6)
1 + 2 + 6  # sum (1+2+6=9)

## [1] 9

sum(x)  # sum

## [1] 9

sum(x) / 3  # definition of mean

## [1] 3

mean(x)

## [1] 3

mean(c(1, 2, 6))  # same as above

## [1] 3

3.3.1 Example of age variable of Titanic data / タイタニック号乗客データの年齢変数

欠損値 NA があると計算結果も NA になる．その場合は NA を除外して平均を計算するために na.rm = TRUE 引数を追加する．

mean(titanic$age)

## [1] NA

mean(titanic$age, na.rm = TRUE)

## [1] 29.88114

3.4 Quantiles, median / 分位数，中央値

quantile 関数は，デフォルトでは 0%（最小値）, 25%（第1四分位数）, 50%（中央値）, 75%（第3四分位数）, 100%（最大値）を出力．

quantile(c(1, 2, 4, 7, 8, 11, 13, 13, 15, 16, 18))  # returns quartiles as default 四分位数

##   0%  25%  50%  75% 100% 
##  1.0  5.5 11.0 14.0 18.0

quantile(titanic$age, na.rm = TRUE)

##    0%   25%   50%   75%  100% 
##  0.17 21.00 28.00 39.00 80.00

summary(titanic$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.17   21.00   28.00   29.88   39.00   80.00     263

summary関数も四分位点を報告するが，quantile関数を使った場合と少し値が違う場合もある．

probs 引数で percentile を指定できる．

seq(0.1, 0.9, by = 0.1)  # 0.1, 0.2, ..., 0.9

## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

quantile(titanic$age, probs = seq(0.1, 0.9, by = 0.1), na.rm = TRUE)  # deciles 十分位数

## 10% 20% 30% 40% 50% 60% 70% 80% 90% 
##  14  19  22  25  28  31  36  42  50

quantile(titanic$age, probs = 0.35, na.rm = TRUE)  # 35th percentile

## 35% 
##  24

median(titanic$age, na.rm = TRUE)  # median 中央値

## [1] 28

3.5 Maximum, Minimum, Range / 最大値，最小値，範囲

max(titanic$age, na.rm = TRUE); min(titanic$age, na.rm = TRUE)

## [1] 80

## [1] 0.17

max(titanic$age, na.rm = TRUE) - min(titanic$age, na.rm = TRUE)  # range

## [1] 79.83

3.6 Variance and standard deviation / 分散と標準偏差

母分散，母標準偏差（\(\mu\) は母平均）

\[ \sigma^2 = \frac{1}{N} \sum_i (x_i - \mu)^2, \quad \sigma = \sqrt{\sigma^2} \]

標本分散，標本標準偏差

\[ s^2 = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2, \quad s = \sqrt{s^2} \]

x <- c(1, 2, 6)
x - 3

## [1] -2 -1  3

x - mean(x)

## [1] -2 -1  3

(x - mean(x))^2

## [1] 4 1 9

mean((x - mean(x))^2)  # definition of population variance

## [1] 4.666667

sum((x - mean(x))^2) / 3  # same as above

## [1] 4.666667

sum((x - mean(x))^2) / (3-1)  # definition of sample variance

## [1] 7

var(x)  # variance defined as sample variance

## [1] 7

sqrt(var(x))  # definition of standard deviation

## [1] 2.645751

sd(x)

## [1] 2.645751

3.6.1 Example of age variable of Titanic data / タイタニック号乗客データの年齢変数

欠損値 NA があると計算結果も NA になる．その場合は NA を除外して平均を計算するために na.rm = TRUE 引数を追加する．

var(titanic$age)

## [1] NA

var(titanic$age, na.rm = TRUE)

## [1] 207.7488

sd(titanic$age, na.rm = TRUE)

## [1] 14.41349

3.6.2 Why is the denominator of the sample variance \(n-1\)?

なぜ標本分散の分母は \(n-1\) なのか？

平均1，分散1の乱数を生成する rnorm 関数を使ってシミュレーションを行って確認してみる．

var(rnorm(n = 5, mean = 0, sd = 1))

## [1] 0.740988

シミュレーション

rnorm(n = サンプルサイズ, mean = 平均, sd = 標準偏差) 関数で正規分布と呼ばれる確率分布に従う乱数を生成
- 標準偏差を sd = 1 と設定し，分散の平均値が 1 に近いかどうかを確認する
- シミュレーション結果の再現性のために set.seed 関数を使って乱数を固定する
for 文を利用して「乱数の生成，分散を計算，記録」を1000回繰り返す
分散は母分散と標本分散の2種類を計算し，rnorm_pop_var, rnorm_sample_var それぞれのベクトルに記録

rnorm_pop_var <- rnorm_sample_var <- NULL
for (i in 1:1000) {
  # Step 1. 乱数の生成
  set.seed(i)  # 再現性のため
  rnorm_i <- rnorm(n = 5, mean = 0, sd = 1)
  # Step 2. 分散を計算
  pop_var_i <- sum((rnorm_i - mean(rnorm_i))^2) / length(rnorm_i)
  sample_var_i <- sum((rnorm_i - mean(rnorm_i))^2) / (length(rnorm_i)-1)
  # Step 3. 記録
  rnorm_pop_var <- c(rnorm_pop_var, pop_var_i)
  rnorm_sample_var <- c(rnorm_sample_var, sample_var_i)
}
summary(rnorm_pop_var)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02162 0.40059 0.71008 0.82542 1.11389 3.88203

summary(rnorm_sample_var)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02703 0.50074 0.88761 1.03178 1.39237 4.85254

中級者向け：以下のように sapply 関数を使うことで for 文を使わずに計算できる．

sapply(X = ベクトルやリスト, FUN = 関数)：X の各要素に対して関数を適用し，実行結果をベクトルで返す
シンプルな演算を大量のデータに対して行う場合，for 文よりも計算速度が速い可能性が高い

summary(sapply(X = 1:1000, FUN = function (x) {
  rnorm_i <- rnorm(n = 5, mean = 0, sd = 1)
  sum((rnorm_i - mean(rnorm_i))^2) / length(rnorm_i)  # population variance 
}))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02154 0.37973 0.67805 0.78518 1.06479 4.61459

3.6.3 Box plot / 箱ひげ図

boxplot(ベクトル) で作成．

boxplot(titanic$age)

男女別に箱ひげ図を描く場合は，boxplot(女性の年齢ベクトル, 男性の年齢ベクトル) のように指定する．

titanic_age_female <- titanic$age[titanic$sex == "female"]  # age of female 
titanic_age_male <- titanic$age[titanic$sex == "male"]  # age of male 
boxplot(titanic_age_female, titanic_age_male, names = c("Female", "Male"), ylab = "Age")

boxplot(連続変数 ~ グループ変数, data = データフレーム名) でもグループで分けた箱ひげ図を作成できる．

boxplot(age ~ sex, titanic)

3.7 Scatter plot / 散布図

plot(1つ目のベクトル, 2つ目のベクトル) で作成．

plot(x = titanic$age, y = titanic$fare, xlab = "Age", ylab = "Fare", pch = 20)

3.7.1 Scale does matter

同じデータだが縦軸（Y軸）のスケールが異なる例．

tempx <- 1:10
set.seed(1)
tempy <- tempx + rnorm(10, sd = 4)
par(mfrow = c(1, 2))
plot(tempx, tempy, xlab = "x", ylab = "y", pch = 20)
plot(tempx, tempy, xlab = "x", ylab = "y", pch = 20, ylim = c(-50, 50))

左の図は正の相関を示しているように見えるが，右の図は弱い正の相関（または無相関）に見えるだろう．

3.7.2 Non-linear relationship

Y軸のみ対数変換する場合は log = "y"，X軸とY軸の両方を対数変換する場合は log = "xy" を引数に追加する．

plot(x = titanic$age, y = titanic$fare, xlab = "Age", ylab = "Fare", pch = 20, log = "y")

table(titanic$fare <= 0)  # omitted from the plot

## 
## FALSE  TRUE 
##  1291    17

3.7.3 Data points overlap: 透過度を指定してプロット

col = "red" のように引数を追加することでマーカーの色を指定できる．

col = rgb(red, green, blue) のように引数を追加すれば，RGB形式で色を指定できる．

col = rgb(red, green, blue, alpha) のように引数を追加すれば，RGB形式で色を指定したうえに，透過水準を alpha 値（0以上1以下）として指定できる．

plot(x = titanic$age, y = titanic$fare, xlab = "Age", ylab = "Fare", pch = 20, col = rgb(0, 0, 0, .3))

3.7.4 Scatter plot with jitter: データポイントをずらしてプロット

Without jitter

plot(x = titanic$pclass, y = titanic$parch, xlab = "Ticket class", 
     ylab = "# of parents and children", pch = 20)

With jitter

plot(x = jitter(titanic$pclass), y = jitter(titanic$parch), xlab = "Ticket class", 
     ylab = "# of parents and children", pch = 20, cex = .5)

3.7.5 Binned scatter plot: Bin ごとの平均値をプロット

Raw plot

plot(x = titanic$age, y = titanic$survived, xlab = "Age", ylab = "Survived", pch = 20)

Binned plot

titanic_with_age <- titanic[!is.na(titanic$age), ]  # remove obs. with no age 
titanic_age_6 <- cut(titanic_with_age$age, breaks = 6)  # convert age into 6 intervals
levels(titanic_age_6)

## [1] "(0.0902,13.5]" "(13.5,26.8]"   "(26.8,40.1]"   "(40.1,53.4]"  
## [5] "(53.4,66.7]"   "(66.7,80.1]"

titanic_mean_age <- c((0.0902+13.5)/2, (13.5+26.8)/2, (26.8+40.1)/2, 
                      (40.1+53.4)/2, (53.4+66.7)/2, (66.7+80.1)/2)
titanic_mean_survived <- c(
 mean(titanic_with_age$survived[titanic_age_6 == levels(titanic_age_6)[1]]),
 mean(titanic_with_age$survived[titanic_age_6 == levels(titanic_age_6)[2]]),
 mean(titanic_with_age$survived[titanic_age_6 == levels(titanic_age_6)[3]]),
 mean(titanic_with_age$survived[titanic_age_6 == levels(titanic_age_6)[4]]),
 mean(titanic_with_age$survived[titanic_age_6 == levels(titanic_age_6)[5]]),
 mean(titanic_with_age$survived[titanic_age_6 == levels(titanic_age_6)[6]]) )
plot(x = titanic_mean_age, y = titanic_mean_survived, 
     xlab = "Age (binned)", ylab = "Survival rate", pch = 20)

注：Binned plot を計算するための便利なパッケージが幾つかあるため，実際に描画する際にはそれを利用するとよい．

3.7.6 Scatterplot Matrix / 散布図行列

3つ以上の連続変数の散布図の組み合わせを行列形式で表示する． Draftman’s display や Pair plot とも呼ばれる．

pairs(swiss[, c("Fertility", "Examination", "Education")])

3.8 Covariance and correlation / 共分散と相関係数

母共分散

\[ \sigma_{x, y} = \frac{1}{N} \sum_i (x_i - \mu_{x})(y_i - \mu_{y}) \]

標本共分散

\[ Cov(x, y) = \frac{1}{n-1} \sum_i (x_i - \bar{x})(y_i - \bar{y}) \]

x <- c(1, 2, 6)
y <- c(1, 3, 4)
plot(x, y)  # scatter plot

(x - mean(x)) * (y - mean(y))

## [1]  3.3333333 -0.3333333  4.0000000

sum((x - mean(x)) * (y - mean(y))) / 3  # definition of population covariance

## [1] 2.333333

sum((x - mean(x)) * (y - mean(y))) / (3-1)  # definition of sample covariance

## [1] 3.5

cov(x, y)  # sample covariance

## [1] 3.5

相関係数

\[ Cor(x, y) = \frac{Cov(x, y)}{\sqrt{s^2_x} \cdot \sqrt{s^2_y}} \]

cov(x, y) / (sd(x) * sd(y))  # definition of correlation coef.

## [1] 0.8660254

cor(x, y)  # correlation coefficient

## [1] 0.8660254

相関係数が 1 や -1 になるのは，一方の変数がもう一方の変数の定数倍・定数加算によって得られるような場合．

x3 <- x * 2 + 3
x4 <- x * (-0.5) + 3
par(mfrow = c(1, 2))
plot(x, x3); abline(lm(x3 ~ x), col = "red")  # Add regression line (red line) / 回帰直線を追加
plot(x, x4); abline(lm(x4 ~ x), col = "red")

cor(x, x3)

## [1] 1

cor(x, x4)

## [1] -1

3.8.1 Anscombe’s quartet / アンスコムの数値例

See https://en.wikipedia.org/wiki/Anscombe%27s_quartet

変数の関係が互いに異なるが平均・分散・相関係数が概ね同じ {x, y} という変数の組が4セット含まれる．

str(anscombe)  # display structure of object

## 'data.frame':    11 obs. of  8 variables:
##  $ x1: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x2: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x3: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x4: num  8 8 8 8 8 8 8 19 8 8 ...
##  $ y1: num  8.04 6.95 7.58 8.81 8.33 ...
##  $ y2: num  9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
##  $ y3: num  7.46 6.77 12.74 7.11 7.81 ...
##  $ y4: num  6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...

summary(anscombe)

##        x1             x2             x3             x4           y1        
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260  
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315  
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580  
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570  
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840  
##        y2              y3              y4        
##  Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
##  1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
##  Median :8.140   Median : 7.11   Median : 7.040  
##  Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
##  3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
##  Max.   :9.260   Max.   :12.74   Max.   :12.500

# Standard deviation of x 
c(sd(anscombe$x1), sd(anscombe$x2), sd(anscombe$x3), sd(anscombe$x4))

## [1] 3.316625 3.316625 3.316625 3.316625

# Standard deviation of y 
c(sd(anscombe$y1), sd(anscombe$y2), sd(anscombe$y3), sd(anscombe$y4))

## [1] 2.031568 2.031657 2.030424 2.030579

# Correlation of x and y 
c(cor(anscombe$x1, anscombe$y1), cor(anscombe$x2, anscombe$y2), 
  cor(anscombe$x3, anscombe$y3), cor(anscombe$x4, anscombe$y4))

## [1] 0.8164205 0.8162365 0.8162867 0.8165214

par(mfrow = c(2, 2))
plot(anscombe$x1, anscombe$y1)
plot(anscombe$x2, anscombe$y2)
plot(anscombe$x3, anscombe$y3)
plot(anscombe$x4, anscombe$y4)

この例は，記述統計だけを見るのではなく，生のデータを描画することの重要性を我々に教えてくれる．

See also: Same Stats, Different Graphs https://www.research.autodesk.com/publications/same-stats-different-graphs/

3.8.2 Example of age variable of Titanic data / タイタニック号乗客データの年齢変数

少なくとも一方の変数に欠損値 NA がある場合は，欠損値がない観測値のみを使って相関係数を計算するために use = "complete.obs" 引数を追加する．

cor(titanic$age, titanic$fare, use = "complete.obs")

## [1] 0.1787399

複数の変数に対して相関係数行列を計算することもできる．

cor(titanic[, c("age", "fare", "parch")], use = "complete.obs")

##              age      fare      parch
## age    1.0000000 0.1787399 -0.1502409
## fare   0.1787399 1.0000000  0.2167232
## parch -0.1502409 0.2167232  1.0000000

4 Take home messages

度数分布表は table(ベクトル)
ヒストグラムは hist(ベクトル)
平均は mean(ベクトル)
散布図は plot(x = X軸のベクトル, y = Y軸のベクトル)
相関係数は cor(1つ目のベクトル, 2つ目のベクトル)

Data Analysis Using Statistical Packages: Descriptive Statistics

Sho Kuroda / 黒田翔

2024年3月 (Last update: 2024年3月)