Load the Titanic passenger data.
titanic <- read.csv("https://raw.githubusercontent.com/kurodaecon/bs/main/data/titanic3_csv.csv")
Use the sex variable as an example.
head(titanic$sex)
## [1] "female" "male" "female" "male" "female" "male"
You can create a frequency distribution table for a single variable using table(vector).
table(titanic$sex)
##
## female male
## 466 843
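The counts can also be expressed as proportions. The following is a minimal sketch, assuming a small made-up vector (sex_example is hypothetical and is not part of the Titanic data); prop.table divides each count by the total.

```r
# Hypothetical toy vector standing in for titanic$sex
sex_example <- c("female", "male", "female", "male", "male")

freq <- table(sex_example)  # counts per category
prop <- prop.table(freq)    # counts divided by the total, so the shares sum to 1

freq
prop
```

The same call should work on table(titanic$sex) once the data are loaded.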
It is a good idea to use the table function to check whether the data contain missing values (NA). Combine it with the is.na function, which returns TRUE for missing values.
is.na(NA)
## [1] TRUE
is.na(123)
## [1] FALSE
is.na(c(1, 10, NA, 1000))
## [1] FALSE FALSE TRUE FALSE
table(is.na(c(1, 10, NA, 1000)))
##
## FALSE TRUE
## 3 1
table(is.na(titanic$sex))
##
## FALSE
## 1309
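To check every column at once rather than one variable at a time, one possible approach is to sum the logical matrix that is.na returns for a data frame. This is a sketch on a made-up data frame (df_example and its values are hypothetical).

```r
# Hypothetical toy data frame with a few missing values
df_example <- data.frame(
  age  = c(29, NA, 2, NA),
  fare = c(211.3, 151.6, NA, 151.6),
  sex  = c("female", "male", "female", "male")
)

# is.na() on a data frame gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(df_example))
na_counts
```

Applied to the titanic data frame, the same call would report the number of missing values in each variable.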
Create a bar chart using barplot(frequency distribution table or frequency vector).
barplot(table(titanic$sex))
As with the bar chart, create a pie chart using pie(frequency distribution table or frequency vector).
pie(table(titanic$sex))
We can also create a contingency table (cross-tabulation) for two categorical variables using the table function. Specify the two variables as arguments, as in table(first vector, second vector). The first vector defines the rows and the second defines the columns.
table(titanic$sex, titanic$survived)
##
## 0 1
## female 127 339
## male 682 161
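Raw counts are hard to compare when the groups differ in size, so it can help to convert each row to proportions. The sketch below rebuilds the table from the counts printed above (the object names tab and row_share are chosen here for illustration) and uses prop.table with margin = 1.

```r
# Counts copied from the contingency table above (rows: sex, columns: survived)
tab <- matrix(c(127, 682, 339, 161), nrow = 2,
              dimnames = list(sex = c("female", "male"), survived = c("0", "1")))

# margin = 1 divides each cell by its row total, so every row sums to 1
row_share <- prop.table(tab, margin = 1)
round(row_share, 3)
```

The column for survived = 1 then gives the survival proportion within each sex: 339/466 for females and 161/843 for males.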
Use the age variable (the age of each passenger) as an example. Note that it contains missing values.
head(titanic$age)
## [1] 29.00 0.92 2.00 30.00 25.00 48.00
table(is.na(titanic$age))
##
## FALSE TRUE
## 1046 263
Create a histogram using hist(vector). We can adjust the number of bins using the breaks argument; when breaks is a single number, R treats it as a suggestion and may choose a nearby number of bins.
hist(titanic$age)
hist(titanic$age, breaks = 30) # suggest about 30 bins
To display density instead of counts, add the argument freq = FALSE. The bar heights are then scaled so that the total area of the bars equals 1; each bar’s area, not its height, is the relative frequency (the proportion of the total) of that bin.
hist(titanic$age, freq = FALSE)
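With freq = FALSE the vertical axis shows density, so a bar’s height times its width gives that bin’s share of the observations. The following sketch checks this on a made-up vector (ages_example is hypothetical) by inspecting the object hist returns, with plot = FALSE to skip drawing.

```r
# Hypothetical toy ages, not taken from the Titanic data
ages_example <- c(2, 15, 22, 25, 30, 31, 38, 45, 52, 70)

# plot = FALSE returns the histogram object without drawing it
h <- hist(ages_example, breaks = c(0, 20, 40, 60, 80), plot = FALSE)

h$counts                                    # observations per bin
h$density                                   # counts / (n * bin width)
rel_freq <- h$density * diff(h$breaks)      # bar area = relative frequency
sum(rel_freq)                               # the areas add up to 1
```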
Build the frequency distribution over the three intervals [0, 20), [20, 60), and [60, 100). To do this, use the cut function to convert the continuous data into these intervals.
Specify the vector of interval break points with the breaks argument.
Add the right = FALSE argument so that each interval is closed on the left and open on the right.
titanic_age_interval <- cut(titanic$age, breaks = c(0, 20, 60, 100), right = FALSE)
# right = FALSE means the intervals are not closed on the right
table(titanic_age_interval)
## titanic_age_interval
## [0,20) [20,60) [60,100)
## 225 781 40
table(titanic_age_interval) / sum(table(titanic_age_interval)) # relative frequency
## titanic_age_interval
## [0,20) [20,60) [60,100)
## 0.21510516 0.74665392 0.03824092
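The effect of right = FALSE is easiest to see at the break points themselves. This is a small sketch on made-up values (ages_boundary is hypothetical) that compares both settings.

```r
# Hypothetical values sitting on or next to the break points
ages_boundary <- c(19, 20, 59, 60)

# right = FALSE: intervals are [a, b), so 20 falls into [20,60)
left_closed <- cut(ages_boundary, breaks = c(0, 20, 60, 100), right = FALSE)

# default (right = TRUE): intervals are (a, b], so 20 falls into (0,20]
right_closed <- cut(ages_boundary, breaks = c(0, 20, 60, 100))

data.frame(age = ages_boundary, left_closed, right_closed)
```

Note that with the default setting a value equal to the lowest break (here 0) would become NA unless include.lowest = TRUE is added.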
\[ \bar{x} = \frac{1}{n} \sum_i x_i \]
x <- c(1, 2, 6)
mean(x)
## [1] 3
If the data contain missing values (NA), the result is also NA. To exclude the NA values and calculate the mean of the remaining observations, add the na.rm = TRUE argument.
mean(titanic$age)
## [1] NA
mean(titanic$age, na.rm = TRUE)
## [1] 29.88114
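To connect the formula for the mean with na.rm = TRUE, the sketch below computes the mean by hand on a made-up vector with one missing value (x_na is hypothetical). It assumes that dropping the NA first and then applying the formula is what na.rm = TRUE amounts to for mean.

```r
# Hypothetical vector with one missing value
x_na <- c(1, 2, NA, 6)

observed <- x_na[!is.na(x_na)]                     # keep only non-missing values
mean_by_hand <- sum(observed) / length(observed)   # (1/n) * sum of x_i

mean_by_hand
mean(x_na, na.rm = TRUE)                           # same value
```

Here n is the number of non-missing observations (3), not the length of the original vector (4).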
By default, the quantile function outputs the values corresponding to 0% (minimum), 25% (first quartile), 50% (median), 75% (third quartile), and 100% (maximum).
quantile(c(1, 2, 4, 7, 8, 11, 13, 13, 15, 16, 18))
## 0% 25% 50% 75% 100%
## 1.0 5.5 11.0 14.0 18.0
quantile(titanic$age, na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 0.17 21.00 28.00 39.00 80.00
summary(titanic$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.17 21.00 28.00 29.88 39.00 80.00 263
We can also request specific percentiles using the probs argument, which takes probabilities between 0 and 1.
quantile(titanic$age, probs = 0.35, na.rm = TRUE) # 35th percentile
## 35%
## 24
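The probs argument also accepts a vector, so several percentiles can be requested in one call. The sketch below reuses the small numeric vector from the earlier quantile example and adds the interquartile range via IQR; the object names are chosen here for illustration.

```r
# Same small vector as in the quantile example above
x <- c(1, 2, 4, 7, 8, 11, 13, 13, 15, 16, 18)

# 10th, 50th, and 90th percentiles in a single call
pcts <- quantile(x, probs = c(0.1, 0.5, 0.9))
pcts

# Interquartile range: third quartile minus first quartile
iqr <- IQR(x)
iqr
```

For titanic$age, na.rm = TRUE would be needed in both calls because of the missing values.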
max(titanic$age, na.rm = TRUE); min(titanic$age, na.rm = TRUE)
## [1] 80
## [1] 0.17
max(titanic$age, na.rm = TRUE) - min(titanic$age, na.rm = TRUE) # range
## [1] 79.83
Population variance and standard deviation (\(\mu\) is the population mean)
\[ \sigma^2 = \frac{1}{N} \sum_i (x_i - \mu)^2, \quad \sigma = \sqrt{\sigma^2} \]
Sample variance and standard deviation
\[ s^2 = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2, \quad s = \sqrt{s^2} \]
x <- c(1, 2, 6)
var(x) # sample variance
## [1] 7
sd(x) # standard deviation
## [1] 2.645751
var(titanic$age)
## [1] NA
var(titanic$age, na.rm = TRUE)
## [1] 207.7488
sd(titanic$age, na.rm = TRUE)
## [1] 14.41349
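The var and sd functions use the \(n - 1\) denominator of the sample formulas. As a check, the following sketch recomputes both versions by hand for x = c(1, 2, 6); treating x as a whole population in the second calculation is an assumption made only for illustration.

```r
x <- c(1, 2, 6)
n <- length(x)

ss <- sum((x - mean(x))^2)   # sum of squared deviations: 4 + 1 + 9 = 14

sample_var <- ss / (n - 1)   # matches var(x)
pop_var    <- ss / n         # population formula, treating x as the whole population

sample_var
sqrt(sample_var)             # matches sd(x)
pop_var
```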
Create a box plot using boxplot(vector).
boxplot(titanic$age)
To draw separate box plots for each sex, use boxplot(continuous variable ~ group variable, data = data frame name).
boxplot(age ~ sex, data = titanic)
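The numbers behind a box plot can be inspected without drawing it. This is a sketch using boxplot.stats on a made-up vector (x_example is hypothetical, with 60 added as a deliberately extreme value); it assumes the default whisker rule of 1.5 times the box length.

```r
# Hypothetical toy vector with one extreme value
x_example <- c(1, 2, 4, 7, 8, 11, 13, 13, 15, 16, 18, 60)

bs <- boxplot.stats(x_example)

bs$stats  # lower whisker end, lower hinge, median, upper hinge, upper whisker end
bs$out    # points farther than 1.5 box lengths from the box, drawn individually
```

In this example the upper whisker stops at 18, the largest value that is not flagged, and 60 is reported separately.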
Create a scatter plot using plot(first vector, second vector).
plot(x = titanic$age, y = titanic$fare, xlab = "Age", ylab = "Fare", pch = 20)
Display the scatter plots for every pair of three or more continuous variables in a matrix layout. This is also known as a draftsman’s display or pair plot. The example uses swiss, a dataset built into R.
pairs(swiss[, c("Fertility", "Examination", "Education")])
Population covariance
\[ \sigma_{x, y} = \frac{1}{N} \sum_i (x_i - \mu_{x})(y_i - \mu_{y}) \]
Sample covariance
\[ Cov(x, y) = \frac{1}{n-1} \sum_i (x_i - \bar{x})(y_i - \bar{y}) \]
x <- c(1, 2, 6)
y <- c(1, 3, 4)
plot(x, y) # scatter plot
cov(x, y) # sample covariance
## [1] 3.5
Correlation coefficient
\[ Cor(x, y) = \frac{Cov(x, y)}{\sqrt{s^2_x} \cdot \sqrt{s^2_y}} \]
cor(x, y)
## [1] 0.8660254
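Both values can be reproduced directly from the formulas. The sketch below recomputes the sample covariance and the correlation coefficient by hand for the same x and y; the intermediate object names are chosen here for illustration.

```r
x <- c(1, 2, 6)
y <- c(1, 3, 4)
n <- length(x)

# Sample covariance: sum of products of deviations, divided by n - 1
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation: covariance divided by the product of the sample standard deviations
cor_by_hand <- cov_by_hand / (sd(x) * sd(y))

cov_by_hand
cor_by_hand
```

Because the correlation divides out the scale of both variables, it always lies between -1 and 1, unlike the covariance.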
If either variable has missing values (NA), add the use = "complete.obs" argument to calculate the correlation coefficient using only the observations in which neither variable is missing.
cor(titanic$age, titanic$fare, use = "complete.obs")
## [1] 0.1787399
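To see what use = "complete.obs" does, the following sketch filters the pairs manually with complete.cases and compares the result. The two vectors are made up for illustration (age_example and fare_example are hypothetical).

```r
# Hypothetical toy vectors, each with a missing value in a different position
age_example  <- c(22, NA, 35, 40, 58)
fare_example <- c(7.25, 71.3, NA, 8.05, 26.6)

cor(age_example, fare_example)                  # NA: the default does not drop NAs

# Keep only the pairs where both values are present
keep <- complete.cases(age_example, fare_example)
r_manual  <- cor(age_example[keep], fare_example[keep])
r_builtin <- cor(age_example, fare_example, use = "complete.obs")

c(manual = r_manual, builtin = r_builtin)       # the two values agree
```

Only three of the five pairs are used here, so the effective sample size is smaller than the length of the vectors.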
We can also calculate a correlation matrix for multiple variables. With use = "complete.obs", a row is dropped if any of the selected variables is missing in that row.
cor(titanic[, c("age", "fare", "parch")], use = "complete.obs")
## age fare parch
## age 1.0000000 0.1787399 -0.1502409
## fare 0.1787399 1.0000000 0.2167232
## parch -0.1502409 0.2167232 1.0000000
table(vector)
hist(vector)
mean(vector)
plot(x = X-axis vector, y = Y-axis vector)
cor(first vector, second vector)