1 Loading data / データの読み込み

Load the Titanic passenger data.

タイタニック号乗客データを読み込む.

titanic <- read.csv("https://raw.githubusercontent.com/kurodaecon/bs/main/data/titanic3_csv.csv")
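After loading, it usually helps to check the size and column types of the data frame before analysing it. The following is a minimal sketch that uses an inline CSV string as a stand-in for the remote file, so it runs without network access; the three rows and the name toy are made up for illustration.

```r
# Stand-in for the remote CSV: three made-up rows using a few of the same column names.
csv_text <- "sex,age,survived
female,29,1
male,NA,0
male,40,0"
toy <- read.csv(text = csv_text)  # read.csv also accepts a string via the text argument

dim(toy)    # number of rows and columns
str(toy)    # column names and types
head(toy)   # first rows
```

The same three calls can be applied to titanic once it has been loaded.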

2 Qualitative data / 質的データ

Use the sex variable as an example.

性別変数 sex を例に利用.

head(titanic$sex)
## [1] "female" "male"   "female" "male"   "female" "male"

2.1 Frequency distribution table / 度数分布表

You can create a frequency distribution table for a single variable using table(vector).

table(ベクトル) で1変数の度数分布表を作成できる.

table(titanic$sex)
## 
## female   male 
##    466    843
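Relative frequencies are often easier to read than raw counts. A small sketch, assuming the counts printed above are typed in by hand (sex_counts and sex_share are illustrative names):

```r
# Counts copied from the table(titanic$sex) output above.
sex_counts <- c(female = 466, male = 843)

sex_share <- prop.table(sex_counts)  # same as sex_counts / sum(sex_counts)
round(sex_share, 3)
```

With the real data, prop.table(table(titanic$sex)) should give the same proportions in one step.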

2.1.1 Count missing values / 欠損値をカウントする

It is a good idea to use the table function to check whether the data contain any missing values (NA).

Combine it with the is.na function, which returns TRUE for missing values.

この table 関数を利用してデータに欠損値 (NA) があるかを確認するとよい. 欠損値である場合に TRUE を返す関数 is.na を併用する.

is.na(NA)
## [1] TRUE
is.na(123)
## [1] FALSE
is.na(c(1, 10, NA, 1000))
## [1] FALSE FALSE  TRUE FALSE
table(is.na(c(1, 10, NA, 1000)))
## 
## FALSE  TRUE 
##     3     1
table(is.na(titanic$sex))
## 
## FALSE 
##  1309
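To check every column at once rather than one variable at a time, is.na can be applied to a whole data frame. A minimal sketch on a made-up data frame (toy and its values are hypothetical):

```r
# Toy data frame with made-up values; NA marks a missing entry.
toy <- data.frame(
  sex = c("female", "male", "male", NA),
  age = c(29, NA, 40, NA)
)

na_per_column <- colSums(is.na(toy))   # number of NAs in each column
na_per_column

table(toy$sex, useNA = "ifany")        # adds an <NA> category when NAs are present
```

The useNA argument is an alternative to wrapping the vector in is.na, because it keeps the ordinary categories and the NA count in one table.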

2.2 Bar chart / 棒グラフ

Create it using barplot(frequency distribution table or frequency vector).

barplot(度数分布表または度数ベクトル) で作成.

barplot(table(titanic$sex))

2.3 Pie chart / 円グラフ

As with a bar chart, create it using pie(frequency distribution table or frequency vector).

棒グラフと同様に pie(度数分布表または度数ベクトル) で作成.

pie(table(titanic$sex))

2.4 Contingency table / 分割表

We can also create a contingency table (cross-tabulation) for two categorical variables using the table function.

Specify the two variables as arguments, like table(first vector, second vector).

2つの質的変数の分割表(クロス集計表)も table 関数で作成できる.

table(1つ目のベクトル, 2つ目のベクトル) のように2つの変数を引数として指定する.

table(titanic$sex, titanic$survived)
##         
##            0   1
##   female 127 339
##   male   682 161
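Row proportions make the two groups easier to compare than raw counts. The sketch below rebuilds the table above from its printed counts (the names tab and row_share are illustrative) and converts each row to proportions:

```r
# Counts copied from the table(titanic$sex, titanic$survived) output above.
tab <- as.table(matrix(
  c(127, 682, 339, 161), nrow = 2,
  dimnames = list(sex = c("female", "male"), survived = c("0", "1"))
))

row_share <- prop.table(tab, margin = 1)  # margin = 1: each row sums to 1
round(row_share, 3)

addmargins(tab)                           # append row and column totals
```

Using margin = 2 instead would give proportions within each column.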

3 Quantitative data / 量的データ

Use the age variable (the age of each passenger) as an example. Note that it contains missing values.

乗客の年齢を表す変数 age を例に利用. 欠損値もある.

head(titanic$age)
## [1] 29.00  0.92  2.00 30.00 25.00 48.00
table(is.na(titanic$age))
## 
## FALSE  TRUE 
##  1046   263

3.1 Histogram / ヒストグラム

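Create a histogram using hist(vector). We can adjust the number of bins using the breaks argument; a single number is treated as a suggestion, because R adjusts the breakpoints to "pretty" values.

hist(ベクトル) で作成. breaks 引数でセルの数を調整できる. 1つの数値を指定した場合は目安として扱われ,区切り値は切りのよい値に調整される.

hist(titanic$age)

hist(titanic$age, breaks = 30)  # suggest about 30 cells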

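To display density instead of the count, add the argument freq = FALSE. The bar areas then sum to 1, so the height of each bar is its relative frequency (the proportion of the total) divided by the bin width.

度数ではなく密度で表示する場合は freq = FALSE という引数を追加する. このとき棒の面積の合計が 1 になり,各棒の高さは相対度数(全体に占める割合)を階級幅で割った値になる.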

hist(titanic$age, freq = FALSE)
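One way to see what freq = FALSE changes is to compute the histogram without drawing it. The sketch below uses a made-up vector ages (not the Titanic data) and fixed breaks of width 10. It suggests that the heights reported by hist are densities, that is, relative frequencies divided by the bin width, so the two coincide only when the bin width is 1.

```r
# Made-up ages, for illustration only.
ages <- c(2, 5, 18, 22, 25, 27, 31, 34, 38, 45, 52, 67)

h <- hist(ages, breaks = seq(0, 70, by = 10), plot = FALSE)  # compute bins without drawing

bin_width <- diff(h$breaks)             # 10 for every bin here
rel_freq  <- h$counts / sum(h$counts)   # share of observations in each bin

sum(rel_freq)                           # relative frequencies sum to 1
sum(h$density * bin_width)              # bar areas sum to 1
all.equal(h$density, rel_freq / bin_width)
```

To draw bars whose heights are relative frequencies, one option is barplot(rel_freq).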

3.2 Frequency distribution table of quantitative data / 数量データの度数分布表

Create the frequency distribution table using the three intervals [0, 20), [20, 60), and [60, 100). To do this, use the cut function to convert the continuous data into these intervals.

  • Specify the vector of breakpoints using the breaks argument.
  • The interval ‘[0, 20)’ includes 0 and excludes 20. By default, cut closes intervals on the right, so add the argument right = FALSE to close them on the left instead.

[0, 20), [20, 60), [60, 100) の3区間で作成する. そのために,cut 関数を利用して実数データを上の3区間に変換する.

  • breaks 引数で区間の区切り値のベクトルを指定
  • 「[0, 20)」は 0 以上 20 未満.右は閉じない(左側が閉じる)ので right = FALSE 引数を追加

titanic_age_interval <- cut(titanic$age, breaks = c(0, 20, 60, 100), right = FALSE)
# right = FALSE means intervals are closed on the left and open on the right
table(titanic_age_interval)
## titanic_age_interval
##   [0,20)  [20,60) [60,100) 
##      225      781       40
table(titanic_age_interval) / sum(table(titanic_age_interval))  # relative frequency
## titanic_age_interval
##     [0,20)    [20,60)   [60,100) 
## 0.21510516 0.74665392 0.03824092
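The effect of right = FALSE is easiest to see with values that sit exactly on a breakpoint. A minimal sketch with made-up ages (the names ages, brks, left_closed and right_closed are illustrative):

```r
ages <- c(0.5, 19, 20, 35, 59.9, 60, 80)   # made-up ages; 20 and 60 sit on breakpoints
brks <- c(0, 20, 60, 100)

left_closed  <- cut(ages, breaks = brks, right = FALSE)  # [0,20) [20,60) [60,100)
right_closed <- cut(ages, breaks = brks)                 # (0,20] (20,60] (60,100]

table(left_closed)               # 20 is counted in [20,60), 60 in [60,100)
table(right_closed)              # 20 is counted in (0,20], 60 in (20,60]
prop.table(table(left_closed))   # relative frequencies in one call
```

prop.table gives the same result as dividing the table by its sum.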

3.3 Sample mean / 標本平均

\[ \bar{x} = \frac{1}{n} \sum_i x_i \]

x <- c(1, 2, 6)
mean(x)
## [1] 3

3.3.1 Example of age variable of Titanic data / タイタニック号乗客データの年齢変数

If there are missing values (NA), the result will also be NA. To exclude NA values and calculate the mean, add the na.rm = TRUE argument.

欠損値 NA があると計算結果も NA になる. その場合は NA を除外して平均を計算するために na.rm = TRUE 引数を追加する.

mean(titanic$age)
## [1] NA
mean(titanic$age, na.rm = TRUE)
## [1] 29.88114
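It may help to see what na.rm = TRUE does under the hood. In the sketch below, on a made-up vector x, the missing entry is dropped and the sum is divided by the number of non-missing values, not by the full length:

```r
x <- c(1, 2, NA, 6)   # made-up values with one missing entry

mean(x)                                 # NA, because one value is missing
mean(x, na.rm = TRUE)                   # NA dropped before averaging
mean(x[!is.na(x)])                      # the same calculation, written out
sum(x, na.rm = TRUE) / sum(!is.na(x))   # divide by the number of non-missing values
```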

3.4 Quantiles, median / 分位数,中央値

The quantile function, by default, outputs the values corresponding to 0% (minimum), 25% (first quartile), 50% (median), 75% (third quartile), and 100% (maximum).

quantile 関数は,デフォルトでは 0%(最小値), 25%(第1四分位数), 50%(中央値), 75%(第3四分位数), 100%(最大値) を出力.

quantile(c(1, 2, 4, 7, 8, 11, 13, 13, 15, 16, 18))
##   0%  25%  50%  75% 100% 
##  1.0  5.5 11.0 14.0 18.0
quantile(titanic$age, na.rm = TRUE)
##    0%   25%   50%   75%  100% 
##  0.17 21.00 28.00 39.00 80.00
summary(titanic$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.17   21.00   28.00   29.88   39.00   80.00     263

We can also specify percentiles using the probs argument.

probs 引数で percentile を指定できる.

quantile(titanic$age, probs = 0.35, na.rm = TRUE)  # 35th percentile
## 35% 
##  24
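The probs argument also accepts a vector, and two related helpers exist for common cases. A short sketch using the same small vector as the first quantile example above:

```r
x <- c(1, 2, 4, 7, 8, 11, 13, 13, 15, 16, 18)

quantile(x, probs = c(0.1, 0.5, 0.9))   # several percentiles in one call
median(x)                               # equals the 50% quantile
IQR(x)                                  # 75% quantile minus 25% quantile
```

For data with missing values, median and IQR take na.rm = TRUE in the same way as quantile.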

3.5 Maximum, Minimum, Range / 最大値,最小値,範囲

max(titanic$age, na.rm = TRUE); min(titanic$age, na.rm = TRUE)
## [1] 80
## [1] 0.17
max(titanic$age, na.rm = TRUE) - min(titanic$age, na.rm = TRUE)  # range
## [1] 79.83
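As an alternative to calling max and min separately, the range function returns both at once. A minimal sketch on a made-up vector whose extremes match the output above:

```r
x <- c(0.17, 29, NA, 80)   # made-up values; only the minimum and maximum match the data

r <- range(x, na.rm = TRUE)   # c(minimum, maximum)
r
diff(r)                       # maximum minus minimum
```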

3.6 Variance and standard deviation / 分散と標準偏差

Population variance and standard deviation (\(\mu\) is population mean) / 母分散,母標準偏差 (\(\mu\) は母平均)

\[ \sigma^2 = \frac{1}{N} \sum_i (x_i - \mu)^2, \quad \sigma = \sqrt{\sigma^2} \]

Sample variance and standard deviation / 標本分散,標本標準偏差

\[ s^2 = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2, \quad s = \sqrt{s^2} \]

x <- c(1, 2, 6)
var(x)  # sample variance
## [1] 7
sd(x)  # sample standard deviation
## [1] 2.645751
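Note that var and sd use the n - 1 divisor. If the population formula (divisor N) is needed, it can be computed by hand. A minimal sketch, treating the three values as the whole population purely for illustration (sample_var and pop_var are illustrative names):

```r
x <- c(1, 2, 6)
n <- length(x)

sample_var <- sum((x - mean(x))^2) / (n - 1)   # divisor n - 1, matches var(x)
pop_var    <- sum((x - mean(x))^2) / n         # divisor n, treating x as the whole population

sample_var
pop_var
sqrt(pop_var)                                  # population standard deviation
```

Equivalently, pop_var is var(x) * (n - 1) / n.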

3.6.1 Example of age variable of Titanic data / タイタニック号乗客データの年齢変数

var(titanic$age)
## [1] NA
var(titanic$age, na.rm = TRUE)
## [1] 207.7488
sd(titanic$age, na.rm = TRUE)
## [1] 14.41349

3.6.2 Box plot / 箱ひげ図

Create it using boxplot(vector).

boxplot(ベクトル) で作成.

boxplot(titanic$age) 

To draw separate box plots for each sex, use boxplot(continuous variable ~ group variable, data = dataframe name).

男女別に箱ひげ図を描く場合は,boxplot(連続変数 ~ グループ変数, data = データフレーム名) で作成できる.

boxplot(age ~ sex, data = titanic)
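The numbers behind a box plot can be inspected without drawing it. The sketch below uses R's built-in ToothGrowth dataset as a stand-in for the Titanic data so that it runs offline; the variables len and supp belong to that dataset, not to titanic.

```r
# plot = FALSE returns the summary statistics instead of drawing the plot.
b <- boxplot(len ~ supp, data = ToothGrowth, plot = FALSE)

b$names   # group labels
b$n       # number of observations in each group
b$stats   # per group: lower whisker, lower hinge, median, upper hinge, upper whisker
```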

3.7 Scatter plot / 散布図

Create it using plot(first vector, second vector).

plot(1つ目のベクトル, 2つ目のベクトル) で作成.

plot(x = titanic$age, y = titanic$fare, xlab = "Age", ylab = "Fare", pch = 20)

3.7.1 Scatterplot Matrix / 散布図行列

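Display the scatter plots for every pair of three or more continuous variables in a matrix layout. This is also known as a draftsman’s display or pair plot. The example uses swiss, a dataset built into R.

3つ以上の連続変数の散布図の組み合わせを行列形式で表示する. Draftsman’s display や Pair plot とも呼ばれる. 例では R に組み込まれている swiss データを利用する.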

pairs(swiss[, c("Fertility", "Examination", "Education")])

3.8 Covariance and correlation / 共分散と相関係数

Population covariance / 母共分散

\[ \sigma_{x, y} = \frac{1}{N} \sum_i (x_i - \mu_{x})(y_i - \mu_{y}) \]

Sample covariance / 標本共分散

\[ Cov(x, y) = \frac{1}{n-1} \sum_i (x_i - \bar{x})(y_i - \bar{y}) \]

x <- c(1, 2, 6)
y <- c(1, 3, 4)
plot(x, y)  # scatter plot

cov(x, y)  # sample covariance
## [1] 3.5

Correlation coefficient / 相関係数

\[ Cor(x, y) = \frac{Cov(x, y)}{\sqrt{s^2_x} \cdot \sqrt{s^2_y}} \]

cor(x, y)
## [1] 0.8660254
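To connect the formulas with the built-in functions, the covariance and correlation can be computed step by step. A minimal sketch using the same x and y as above (cov_xy and cor_xy are illustrative names):

```r
x <- c(1, 2, 6)
y <- c(1, 3, 4)
n <- length(x)

cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)   # sample covariance
cor_xy <- cov_xy / (sd(x) * sd(y))                       # correlation coefficient

all.equal(cov_xy, cov(x, y))   # agrees with cov()
all.equal(cor_xy, cor(x, y))   # agrees with cor()
```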

3.8.1 Example of age and fare variables of Titanic data / タイタニック号乗客データの年齢変数と運賃変数

If either variable has missing values (NA), add the use = "complete.obs" argument to calculate the correlation coefficient using only observations without missing values.

少なくとも一方の変数に欠損値 NA がある場合は,欠損値がない観測値のみを使って相関係数を計算するために use = "complete.obs" 引数を追加する.

cor(titanic$age, titanic$fare, use = "complete.obs")
## [1] 0.1787399

We can also calculate a correlation matrix for multiple variables.

複数の変数に対して相関係数行列を計算することもできる.

cor(titanic[, c("age", "fare", "parch")], use = "complete.obs")
##              age      fare      parch
## age    1.0000000 0.1787399 -0.1502409
## fare   0.1787399 1.0000000  0.2167232
## parch -0.1502409 0.2167232  1.0000000
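With more than two variables, use = "complete.obs" drops every row that has a missing value in any of the selected columns. A related option, use = "pairwise.complete.obs", instead keeps for each pair the rows that are complete for that pair. The sketch below compares the two on a made-up data frame (toy and its values are hypothetical):

```r
# Made-up data: a is missing in row 5, c is missing in row 2.
toy <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 1, 4, 3, 5),
  c = c(1, NA, 2, 5, 4)
)

m_complete <- cor(toy, use = "complete.obs")           # rows 1, 3, 4 only
m_pairwise <- cor(toy, use = "pairwise.complete.obs")  # each pair uses its own complete rows

m_complete["a", "b"]   # based on rows 1, 3, 4
m_pairwise["a", "b"]   # based on rows 1 to 4, so it can differ
```

Which option is appropriate depends on the analysis; complete.obs keeps every coefficient based on the same set of observations.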

4 Take-home messages

en:

  • Frequency distribution table: table(vector)
  • Histogram: hist(vector)
  • Mean: mean(vector)
  • Scatter plot: plot(x = X-axis vector, y = Y-axis vector)
  • Correlation coefficient: cor(first vector, second vector)

ja:

  • 度数分布表は table(ベクトル)
  • ヒストグラムは hist(ベクトル)
  • 平均は mean(ベクトル)
  • 散布図は plot(x = X軸のベクトル, y = Y軸のベクトル)
  • 相関係数は cor(1つ目のベクトル, 2つ目のベクトル)