Rmd file corresponding to this page: GitHub

1 Packages in R

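Packages fall broadly into three types:

Packages that are installed automatically with R and are available by default when R starts (base::c, utils::str, etc.)
Packages that are installed automatically with R but are not loaded by default when R starts (MASS::Boston, Matrix::Cholesky, etc.)
- Either load the package with library(package_name) and call function_name, or skip loading and call package_name::function_name
Packages that users must install themselves before use (tidyverse, data.table, etc.)
- Install with install.packages("package_name") (run once only), then load with library(package_name) (run every time R is started)
- If install.packages does not work in RStudio, first check that the PC is connected to the internet, then change the CRAN repository and retry: Tools > Global Options > Packages > Primary CRAN repository, switching to "Japan (Tokyo)" or similar (see: Setting CRAN repository options)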

Note: Because the same function name can be used by more than one package, some people prefer to always specify the package with the package_name::function_name form.
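As a small illustration of such a name clash: both stats (loaded by default) and dplyr export a function called filter, and the two do unrelated things. The sketch below assumes dplyr is installed; the vector x and the data frame df are made-up examples.

```r
# Two unrelated functions share the name `filter`
x <- c(1, 2, 3, 4, 5)     # made-up example vector
df <- data.frame(x = x)   # made-up example data frame

# stats::filter applies a linear filter (here: a 3-point moving average)
ma <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter keeps the rows of a data frame that satisfy a condition
big <- dplyr::filter(df, x > 3)

ma
big
```

Writing the package name explicitly makes it clear which of the two is meant, regardless of which packages happen to be loaded.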

1.1 `tidyverse` package

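Functions from the tidyverse package are commonly used for organizing and manipulating data.

Strictly speaking, tidyverse bundles several packages, such as dplyr (pronounced “dee-ply-er”) and ggplot2
Installing or loading tidyverse installs or loads dplyr, ggplot2, and the others at the same time

Install the package once only, after installing R.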

install.packages("tidyverse")  # run for the first time only

After installation, load the package into R. Unlike installation, this must be run every time R (or RStudio) is started.

library(tidyverse)
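The install-once, load-every-time pattern can also be written so that a script installs the package only when it is missing. This is a minimal sketch; is_installed is a hypothetical helper name introduced here, not a base R function.

```r
# Hypothetical helper: TRUE if the package can be found on this machine
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Install only when the package is missing (needs an internet connection)
if (!is_installed("tidyverse")) {
  install.packages("tidyverse")
}

# Loading is still required in every new R session
library(tidyverse)
```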

1.1.1 `dplyr` package

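A package for data handling (also called data wrangling).

filter … extract rows that match a condition
select … extract columns (variables) that match a condition
mutate … create new columns (variables)
rename … change the names of columns (variables)
left_join … merge two datasets
summarise … summarize the data
group_by … group rows
- Note: group_by is rarely used on its own; it is usually combined with the summarise function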

1.1.2 `ggplot2` package

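A package for data visualization. Its grammar takes some getting used to compared with functions available by default such as plot, but once familiar it produces figures clean enough for use in papers.

The plot function draws a graph by itself, whereas drawing with the ggplot2 package takes two steps: first the ggplot function specifies the data and the aesthetics (the variables mapped to the X axis, the Y axis, color, and size), and then a geom_xxx function specifies the geometry (points, lines, bars, etc.).

For details, see Kabacoff (2024) Modern Data Visualization with R, among others.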

2 Example of The Beatles data frame

2.1 Create “tibble” object

The tibble::tibble function creates a data-frame-type object that is somewhat easier to work with than a data frame created by data.frame.

beatles <- tibble::tibble(
  name = c("John", "Paul", "George", "Ringo"), 
  born = c(1940, 1942, 1943, 1940),
  decease = c(1980, NA, NA, NA), 
  height = c(179, 180, 178, 170)
  )
beatles

## # A tibble: 4 × 4
##   name    born decease height
##   <chr>  <dbl>   <dbl>  <dbl>
## 1 John    1940    1980    179
## 2 Paul    1942      NA    180
## 3 George  1943      NA    178
## 4 Ringo   1940      NA    170

2.2 `filter`

Specify the condition for the rows (observations) to keep.

beatles %>% filter(born >= 1941)

## # A tibble: 2 × 4
##   name    born decease height
##   <chr>  <dbl>   <dbl>  <dbl>
## 1 Paul    1942      NA    180
## 2 George  1943      NA    178

To assign the dataset obtained after filter to a new object, use <- (overwriting the original is also possible; the same applies below).

beatles_ver2 <- beatles %>% filter(born >= 1941)
beatles_ver2

## # A tibble: 2 × 4
##   name    born decease height
##   <chr>  <dbl>   <dbl>  <dbl>
## 1 Paul    1942      NA    180
## 2 George  1943      NA    178

Multiple conditions can also be specified.

beatles %>% filter(born >= 1941 & height < 180)

## # A tibble: 1 × 4
##   name    born decease height
##   <chr>  <dbl>   <dbl>  <dbl>
## 1 George  1943      NA    178
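Conditions can be combined in other ways as well. The sketch below re-creates the beatles tibble from above so that it runs on its own, and shows an OR condition, a membership test with %in%, and how filter treats NA. It assumes dplyr and tibble are installed; the object names on the left-hand side are made up for illustration.

```r
library(dplyr)
library(tibble)

beatles <- tibble(
  name = c("John", "Paul", "George", "Ringo"),
  born = c(1940, 1942, 1943, 1940),
  decease = c(1980, NA, NA, NA),
  height = c(179, 180, 178, 170)
)

# OR condition: born in 1943, or shorter than 175
either <- beatles %>% filter(born == 1943 | height < 175)

# Membership test with %in%
john_paul <- beatles %>% filter(name %in% c("John", "Paul"))

# A comparison with NA evaluates to NA, and filter() drops those rows
after_1970 <- beatles %>% filter(decease > 1970)

# Use is.na() to select the rows with a missing value explicitly
no_decease <- beatles %>% filter(is.na(decease))
```

Because rows whose condition evaluates to NA are dropped silently, it is worth checking for missing values before filtering on a variable.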

2.3 `select`

Specify the names of the columns (variables) to keep.

beatles %>% select(name)

## # A tibble: 4 × 1
##   name  
##   <chr> 
## 1 John  
## 2 Paul  
## 3 George
## 4 Ringo

beatles %>% select(name, born)

## # A tibble: 4 × 2
##   name    born
##   <chr>  <dbl>
## 1 John    1940
## 2 Paul    1942
## 3 George  1943
## 4 Ringo   1940

Writing a minus sign in front of a variable name ("-variable_name") keeps every variable except that one.

beatles %>% select(-height)

## # A tibble: 4 × 3
##   name    born decease
##   <chr>  <dbl>   <dbl>
## 1 John    1940    1980
## 2 Paul    1942      NA
## 3 George  1943      NA
## 4 Ringo   1940      NA
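Besides listing names one by one, select accepts column ranges and helper functions. The following sketch re-creates the beatles tibble so that it is self-contained and assumes dplyr (version 1.0.0 or later, for where) and tibble are installed; the result names are made up.

```r
library(dplyr)
library(tibble)

beatles <- tibble(
  name = c("John", "Paul", "George", "Ringo"),
  born = c(1940, 1942, 1943, 1940),
  decease = c(1980, NA, NA, NA),
  height = c(179, 180, 178, 170)
)

# A range of adjacent columns, from name to decease
range_cols <- beatles %>% select(name:decease)

# Columns whose names start with "b"
b_cols <- beatles %>% select(starts_with("b"))

# Columns selected by type
num_cols <- beatles %>% select(where(is.numeric))

# Select and rename in a single step (new_name = old_name)
renamed <- beatles %>% select(member = name, height)
```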

2.4 `mutate`

Create a new column (variable).

beatles %>% mutate(primary_role = c("vocal", "vocal", "guitar", "drum"))

## # A tibble: 4 × 5
##   name    born decease height primary_role
##   <chr>  <dbl>   <dbl>  <dbl> <chr>       
## 1 John    1940    1980    179 vocal       
## 2 Paul    1942      NA    180 vocal       
## 3 George  1943      NA    178 guitar      
## 4 Ringo   1940      NA    170 drum

A new variable can also be created from existing variables.

beatles %>% mutate(age_at_debut = 1962 - born)

## # A tibble: 4 × 5
##   name    born decease height age_at_debut
##   <chr>  <dbl>   <dbl>  <dbl>        <dbl>
## 1 John    1940    1980    179           22
## 2 Paul    1942      NA    180           20
## 3 George  1943      NA    178           19
## 4 Ringo   1940      NA    170           22

2.5 `rename`

Change the name of a column (variable).

beatles %>% rename(birth_year = born)

## # A tibble: 4 × 4
##   name   birth_year decease height
##   <chr>       <dbl>   <dbl>  <dbl>
## 1 John         1940    1980    179
## 2 Paul         1942      NA    180
## 3 George       1943      NA    178
## 4 Ringo        1940      NA    170

2.6 `left_join`

Merge with another dataset.

beatles_primary_role <- tibble::tibble(name = c("John", "Paul", "George", "Ringo"), 
                                       primary_role = c("vocal", "vocal", "guitar", "drum"))
beatles_primary_role

## # A tibble: 4 × 2
##   name   primary_role
##   <chr>  <chr>       
## 1 John   vocal       
## 2 Paul   vocal       
## 3 George guitar      
## 4 Ringo  drum

beatles %>% left_join(beatles_primary_role, by = "name")

## # A tibble: 4 × 5
##   name    born decease height primary_role
##   <chr>  <dbl>   <dbl>  <dbl> <chr>       
## 1 John    1940    1980    179 vocal       
## 2 Paul    1942      NA    180 vocal       
## 3 George  1943      NA    178 guitar      
## 4 Ringo   1940      NA    170 drum

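2.6.1 If either data frame contains NA

When some rows exist in only one of the two data frames, the rows of the primary data frame (the one being merged into) are kept as they are.

Note: The function is called left_join because the merge follows the dataset that appears on the left-hand side in the code. To follow the right-hand side use right_join; inner_join merges on the intersection of the keys, and full_join merges on their union.

Case 1. The primary-role dataset is missing a row.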

beatles_primary_role_wo_paul <- tibble::tibble(name = c("John", "George", "Ringo"), 
                                               primary_role = c("vocal", "guitar", "drum"))
beatles_primary_role_wo_paul

## # A tibble: 3 × 2
##   name   primary_role
##   <chr>  <chr>       
## 1 John   vocal       
## 2 George guitar      
## 3 Ringo  drum

beatles %>% left_join(beatles_primary_role_wo_paul, by = "name")

## # A tibble: 4 × 5
##   name    born decease height primary_role
##   <chr>  <dbl>   <dbl>  <dbl> <chr>       
## 1 John    1940    1980    179 vocal       
## 2 Paul    1942      NA    180 <NA>        
## 3 George  1943      NA    178 guitar      
## 4 Ringo   1940      NA    170 drum

Case 2. Conversely, the primary data frame is missing a row.

beatles %>% 
  filter(name != "Paul") %>% 
  left_join(beatles_primary_role, by = "name")

## # A tibble: 3 × 5
##   name    born decease height primary_role
##   <chr>  <dbl>   <dbl>  <dbl> <chr>       
## 1 John    1940    1980    179 vocal       
## 2 George  1943      NA    178 guitar      
## 3 Ringo   1940      NA    170 drum
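The four join types mentioned in the note can be compared side by side on two small tables that each lack one member. This is a sketch under the assumption that dplyr and tibble are installed; x and y are made-up tables, and anti_join is added only as a convenient way to list the keys that find no match.

```r
library(dplyr)
library(tibble)

# x lacks Ringo, y lacks Paul
x <- tibble(name = c("John", "Paul", "George"),
            born = c(1940, 1942, 1943))
y <- tibble(name = c("John", "George", "Ringo"),
            primary_role = c("vocal", "guitar", "drum"))

left  <- left_join(x, y, by = "name")   # all rows of x; Paul gets NA
right <- right_join(x, y, by = "name")  # all rows of y; Ringo gets NA
inner <- inner_join(x, y, by = "name")  # only names found in both
full  <- full_join(x, y, by = "name")   # names found in either

# Rows of x with no match in y
unmatched <- anti_join(x, y, by = "name")
```

Checking the unmatched keys before merging helps to tell apart rows that are genuinely absent from rows that fail to match because of a spelling difference in the key.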

2.7 `summarise`

Summarize the data.

beatles %>% summarise(mean_height = mean(height), 
                      std_dev_height = sd(height),
                      sample_size = n())

## # A tibble: 1 × 3
##   mean_height std_dev_height sample_size
##         <dbl>          <dbl>       <int>
## 1        177.           4.57           4

2.8 `group_by`

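Combined with the summarise function, group_by makes it possible to compute summary statistics or run other statistical operations for each group. The standard deviation is NA for groups that contain only one observation.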

beatles %>% 
  group_by(born) %>% 
  summarise(mean_height = mean(height), 
            std_dev_height = sd(height),
            sample_size = n())

## # A tibble: 3 × 4
##    born mean_height std_dev_height sample_size
##   <dbl>       <dbl>          <dbl>       <int>
## 1  1940        174.           6.36           2
## 2  1942        180           NA              1
## 3  1943        178           NA              1
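group_by also works with mutate, in which case the group-level value is attached to every row instead of collapsing the data. The sketch below re-creates the relevant columns of the beatles tibble and assumes dplyr and tibble are installed; the new column names are made up.

```r
library(dplyr)
library(tibble)

beatles <- tibble(
  name = c("John", "Paul", "George", "Ringo"),
  born = c(1940, 1942, 1943, 1940),
  height = c(179, 180, 178, 170)
)

# Attach the mean height of each birth-year group to every member,
# then compute each member's deviation from that group mean
beatles_dev <- beatles %>%
  group_by(born) %>%
  mutate(mean_height_cohort = mean(height),
         diff_from_cohort = height - mean_height_cohort) %>%
  ungroup()  # remove the grouping so later steps are not affected

beatles_dev

# sd() of a single value is NA, which is why one-member groups show NA
sd(180)
```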

2.9 Scatter plot

First, run ggplot alone. Only a canvas with no data drawn on it is displayed.

ggplot(data = beatles, mapping = aes(x = born, y = height))

Add a layer of points to it with the geom_point function.

ggplot(data = beatles, mapping = aes(x = born, y = height)) + 
  geom_point()

Change the axis label.

ggplot(data = beatles, mapping = aes(x = born, y = height)) + 
  geom_point() + 
  xlab("Year of birth")

Change the theme of the canvas.

ggplot(data = beatles, mapping = aes(x = born, y = height)) + 
  geom_point() + 
  theme_classic()
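Because a plot is built up by adding layers with +, it can be stored in an object and extended step by step. The following is a minimal sketch that assumes ggplot2 is installed; the object names p, p_points, and p_final are made up, and the data frame repeats only the two columns needed here.

```r
library(ggplot2)

beatles <- data.frame(
  born = c(1940, 1942, 1943, 1940),
  height = c(179, 180, 178, 170)
)

# Step 1: data and aesthetics only (an empty canvas)
p <- ggplot(data = beatles, mapping = aes(x = born, y = height))

# Step 2: add the geometry
p_points <- p + geom_point()

# Step 3: add the axis label and the theme together
p_final <- p_points +
  xlab("Year of birth") +
  theme_classic()

p_final  # printing the object draws the plot
```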

3 Example of `swiss` data

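The datasets built into R at installation can be listed with data().

One of them, the "swiss" data, is used here to review the basic usage of dplyr and ggplot2.

To keep things easy to follow, only the first 6 rows (provinces) and the first 4 columns (variables) of the swiss dataset are extracted and defined as a new dataset named "swiss2".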

swiss2 <- swiss[1:6, 1:4]  # Extract a portion of `swiss` dataset
swiss2

##              Fertility Agriculture Examination Education
## Courtelary        80.2        17.0          15        12
## Delemont          83.1        45.1           6         9
## Franches-Mnt      92.5        39.7           5         5
## Moutier           85.8        36.5          12         7
## Neuveville        76.9        43.5          17        15
## Porrentruy        76.1        35.3           9         7

3.1 `filter`, `select`, `mutate`, `rename`, and `summarise`

swiss2 %>% filter(Fertility > 80)  # Keep only provinces with a "Fertility" greater than 80

##              Fertility Agriculture Examination Education
## Courtelary        80.2        17.0          15        12
## Delemont          83.1        45.1           6         9
## Franches-Mnt      92.5        39.7           5         5
## Moutier           85.8        36.5          12         7

filter(swiss2, Fertility > 80)  # same as above

##              Fertility Agriculture Examination Education
## Courtelary        80.2        17.0          15        12
## Delemont          83.1        45.1           6         9
## Franches-Mnt      92.5        39.7           5         5
## Moutier           85.8        36.5          12         7

swiss2 %>% filter(Fertility > 80, Examination > 10)  # multiple conditions

##            Fertility Agriculture Examination Education
## Courtelary      80.2        17.0          15        12
## Moutier         85.8        36.5          12         7

swiss2 %>% select(Fertility, Examination)  # select specified variables

##              Fertility Examination
## Courtelary        80.2          15
## Delemont          83.1           6
## Franches-Mnt      92.5           5
## Moutier           85.8          12
## Neuveville        76.9          17
## Porrentruy        76.1           9

swiss2 %>% mutate(Exam_Edu = (Examination + Education)/2)  # creates new variable as the mean

##              Fertility Agriculture Examination Education Exam_Edu
## Courtelary        80.2        17.0          15        12     13.5
## Delemont          83.1        45.1           6         9      7.5
## Franches-Mnt      92.5        39.7           5         5      5.0
## Moutier           85.8        36.5          12         7      9.5
## Neuveville        76.9        43.5          17        15     16.0
## Porrentruy        76.1        35.3           9         7      8.0

swiss2 %>% rename(Exam = Examination)  # new name is `Exam`, old name is `Examination`

##              Fertility Agriculture Exam Education
## Courtelary        80.2        17.0   15        12
## Delemont          83.1        45.1    6         9
## Franches-Mnt      92.5        39.7    5         5
## Moutier           85.8        36.5   12         7
## Neuveville        76.9        43.5   17        15
## Porrentruy        76.1        35.3    9         7

swiss2 %>% summarise(mean_fert = mean(Fertility), sd_fert = sd(Fertility), 
                     median_fert = median(Fertility), mean_educ = mean(Education))

##   mean_fert  sd_fert median_fert mean_educ
## 1  82.43333 6.145459       81.65  9.166667

Without summarise, the same values are computed as follows.

mean(swiss2$Fertility); sd(swiss2$Fertility); median(swiss2$Fertility); mean(swiss2$Education)

## [1] 82.43333

## [1] 6.145459

## [1] 81.65

## [1] 9.166667

3.2 `group_by` and `summarise`

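The provinces can also be split into groups by whether fertility is high or low, and statistics such as the mean and the variance computed for each group.

Create a variable high_fertility_province that indicates whether fertility is high (above the sample mean), and pass it to the group_by function.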

swiss2 %>%
  mutate(high_fertility_province = Fertility > mean(Fertility)) %>%
  group_by(high_fertility_province) %>%
  summarise(mean_fert = mean(Fertility), mean_exam = mean(Examination))

## # A tibble: 2 × 3
##   high_fertility_province mean_fert mean_exam
##   <lgl>                       <dbl>     <dbl>
## 1 FALSE                        77.7     13.7 
## 2 TRUE                         87.1      7.67

As before, computing the same quantities without the group_by and summarise functions looks like the following. The convenience and readability of group_by should be clear.

mean(swiss2$Fertility[swiss2$Fertility > mean(swiss2$Fertility)])

## [1] 87.13333

mean(swiss2$Fertility[swiss2$Fertility <= mean(swiss2$Fertility)])

## [1] 77.73333

mean(swiss2$Examination[swiss2$Fertility > mean(swiss2$Fertility)])

## [1] 7.666667

mean(swiss2$Examination[swiss2$Fertility <= mean(swiss2$Fertility)])

## [1] 13.66667

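When a dataset processed as above is to be used in later analysis, it can either overwrite the original dataset or be stored as a new dataset. Note that overwriting removes the original data from the R session.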

swiss2_new <- swiss2 %>% filter(Fertility > 90)  # create new
swiss2 <- swiss2 %>% filter(Fertility > 90)  # overwrite

3.3 `left_join`

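When there are two datasets, they can be merged with the left_join function.

Example: merge a population variable into the swiss data.

Use the province name as the merge key
In the swiss dataset the province names are stored as rownames rather than as a column, so create a prov variable holding the province name in the data being merged into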

swiss2 <- swiss[1:6, 1:4]
swiss2b <- swiss2 %>% mutate(prov = rownames(swiss2))  # add province names as a column "prov"
swiss2b

##              Fertility Agriculture Examination Education         prov
## Courtelary        80.2        17.0          15        12   Courtelary
## Delemont          83.1        45.1           6         9     Delemont
## Franches-Mnt      92.5        39.7           5         5 Franches-Mnt
## Moutier           85.8        36.5          12         7      Moutier
## Neuveville        76.9        43.5          17        15   Neuveville
## Porrentruy        76.1        35.3           9         7   Porrentruy
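The same conversion can also be done with tibble::rownames_to_column, which exists for exactly this purpose. This is an alternative sketch, assuming the tibble package is installed; the object name swiss2c is made up. Unlike the mutate version above, it places the new column first and resets the row names to 1, 2, ….

```r
library(tibble)

swiss2 <- swiss[1:6, 1:4]

# Move the row names into a regular column called "prov"
swiss2c <- rownames_to_column(swiss2, var = "prov")
swiss2c
```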

Create a dataset swiss2_pop containing a (hypothetical) population variable.

swiss2_pop <- data.frame(prov = c("Courtelary", "Delemont", "Franches-Mnt", 
                                  "Moutier", "Neuveville", "Porrentruy"),
                         pop = c(100, 200, 300, 400, 500, 600))
swiss2_pop

##           prov pop
## 1   Courtelary 100
## 2     Delemont 200
## 3 Franches-Mnt 300
## 4      Moutier 400
## 5   Neuveville 500
## 6   Porrentruy 600

Merge.

swiss2b %>% left_join(swiss2_pop, by = "prov")

##   Fertility Agriculture Examination Education         prov pop
## 1      80.2        17.0          15        12   Courtelary 100
## 2      83.1        45.1           6         9     Delemont 200
## 3      92.5        39.7           5         5 Franches-Mnt 300
## 4      85.8        36.5          12         7      Moutier 400
## 5      76.9        43.5          17        15   Neuveville 500
## 6      76.1        35.3           9         7   Porrentruy 600

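In the example above, the provinces appear in the same order in swiss2b and in swiss2_pop, but the merge also works when the order differs. It also works when swiss2_pop has no data for some province; the missing value becomes NA.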

swiss2_pop <- data.frame(prov = c("Porrentruy", "Courtelary", "Delemont", 
                                  "Franches-Mnt", "Neuveville"),
                         pop = c(600, 100, 200, 300, 500))
swiss2_pop

##           prov pop
## 1   Porrentruy 600
## 2   Courtelary 100
## 3     Delemont 200
## 4 Franches-Mnt 300
## 5   Neuveville 500

swiss2b %>% left_join(swiss2_pop, by = "prov")

##   Fertility Agriculture Examination Education         prov pop
## 1      80.2        17.0          15        12   Courtelary 100
## 2      83.1        45.1           6         9     Delemont 200
## 3      92.5        39.7           5         5 Franches-Mnt 300
## 4      85.8        36.5          12         7      Moutier  NA
## 5      76.9        43.5          17        15   Neuveville 500
## 6      76.1        35.3           9         7   Porrentruy 600

3.4 Scatter plot using `ggplot2`

Draw a scatter plot in the same way as for the Beatles data.

ggplot(data = swiss, mapping = aes(x = Education, y = Fertility)) + 
  geom_point()

To color the markers by the percentage of Catholics (%), specify it inside aes.

ggplot(data = swiss, mapping = aes(x = Education, y = Fertility, color = Catholic)) + 
  geom_point()

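When the swiss data is processed before plotting, the code can also be written as below. In that case, the data piped in immediately before the ggplot function is used as its first argument, data.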

swiss %>% 
  filter(Education <= 20) %>% 
  ggplot(mapping = aes(x = Education, y = Fertility)) + 
  geom_point()

Add a layer with a curve fitted to the data points.

ggplot(data = swiss, mapping = aes(x = Education, y = Fertility)) + 
  geom_point() + 
  geom_smooth()
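With no arguments, geom_smooth picks a smoother automatically (loess for a dataset as small as swiss). If a straight regression line is wanted instead, method = "lm" can be given explicitly. The sketch below assumes ggplot2 is installed; the object names p_lm and fit are made up, and the call to lm is added only to show the numbers behind the line.

```r
library(ggplot2)

# Scatter plot with a linear fit and its confidence band
p_lm <- ggplot(data = swiss, mapping = aes(x = Education, y = Fertility)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
p_lm

# The same line estimated directly: intercept and slope
fit <- lm(Fertility ~ Education, data = swiss)
coef(fit)
```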

4 Example of Titanic data

titanic <- read.csv("https://raw.githubusercontent.com/kurodaecon/bs/main/data/titanic3_csv.csv")

4.1 `filter`

titanic %>% filter(age > 75)

##   pclass survived    sex age sibsp parch  fare embarked
## 1      1        1   male  80     0     0 30.00        S
## 2      1        1 female  76     1     0 78.85        S

4.2 `group_by` and `summarise`

Survival rate by sex.

titanic %>% 
  group_by(sex) %>%
  summarise(survival_rate = mean(survived), 
            sample_size = n())

## # A tibble: 2 × 3
##   sex    survival_rate sample_size
##   <chr>          <dbl>       <int>
## 1 female         0.727         466
## 2 male           0.191         843

Survival rate by sex and passenger class.

titanic %>% 
  group_by(sex, pclass) %>%
  summarise(survival_rate = mean(survived), 
            sample_size = n())

## # A tibble: 6 × 4
## # Groups:   sex [2]
##   sex    pclass survival_rate sample_size
##   <chr>   <int>         <dbl>       <int>
## 1 female      1         0.965         144
## 2 female      2         0.887         106
## 3 female      3         0.491         216
## 4 male        1         0.341         179
## 5 male        2         0.146         171
## 6 male        3         0.152         493
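The line "# Groups: sex [2]" in the output appears because, with several grouping variables, summarise removes only the last grouping level and the result stays grouped by sex. Passing .groups = "drop" returns an ungrouped result. The sketch below uses a tiny made-up data frame (passengers) rather than the Titanic file so that it runs without a download, and assumes dplyr is installed.

```r
library(dplyr)

# Made-up data for illustration only
passengers <- data.frame(
  sex      = c("female", "female", "female", "male", "male", "male"),
  pclass   = c(1, 1, 3, 1, 3, 3),
  survived = c(1, 1, 0, 0, 0, 1)
)

by_group <- passengers %>%
  group_by(sex, pclass) %>%
  summarise(survival_rate = mean(survived),
            sample_size = n(),
            .groups = "drop")  # return an ungrouped tibble

by_group
```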

Survival rate by sex and age group.

titanic %>% 
  mutate(age_group = cut(age, breaks = c(0, 20, 60, 100), right = FALSE)) %>% 
  group_by(sex, age_group) %>%
  summarise(survival_rate = mean(survived), 
            sample_size = n())

## # A tibble: 8 × 4
## # Groups:   sex [2]
##   sex    age_group survival_rate sample_size
##   <chr>  <fct>             <dbl>       <int>
## 1 female [0,20)            0.699         103
## 2 female [20,60)           0.770         274
## 3 female [60,100)          0.818          11
## 4 female <NA>              0.603          78
## 5 male   [0,20)            0.279         122
## 6 male   [20,60)           0.193         507
## 7 male   [60,100)          0.103          29
## 8 male   <NA>              0.141         185
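The right = FALSE argument of cut decides which end of each interval is closed. A short sketch on a made-up vector of ages shows the difference at the boundaries:

```r
# Made-up ages, including values exactly on the break points
ages <- c(0, 19, 20, 59, 60, 99, NA)

# right = FALSE: intervals are [a, b), so 20 belongs to [20,60)
left_closed <- cut(ages, breaks = c(0, 20, 60, 100), right = FALSE)

# Default (right = TRUE): intervals are (a, b], so 20 belongs to (0,20]
# and the lowest break, 0, falls outside every interval and becomes NA
right_closed <- cut(ages, breaks = c(0, 20, 60, 100))

data.frame(ages, left_closed, right_closed)
```

With right = FALSE a newborn recorded as age 0 is kept in the first group, whereas the default would turn it into NA.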

Survival rate by sex, passenger class, and age group.

titanic %>% 
  mutate(age_group = cut(age, breaks = c(0, 20, 60, 100), right = FALSE)) %>% 
  group_by(sex, pclass, age_group) %>%
  summarise(survival_rate = mean(survived), 
            sample_size = n())

## # A tibble: 24 × 5
## # Groups:   sex, pclass [6]
##    sex    pclass age_group survival_rate sample_size
##    <chr>   <int> <fct>             <dbl>       <int>
##  1 female      1 [0,20)            0.938          16
##  2 female      1 [20,60)           0.972         108
##  3 female      1 [60,100)          0.889           9
##  4 female      1 <NA>              1              11
##  5 female      2 [0,20)            0.958          24
##  6 female      2 [20,60)           0.885          78
##  7 female      2 [60,100)          0               1
##  8 female      2 <NA>              0.667           3
##  9 female      3 [0,20)            0.540          63
## 10 female      3 [20,60)           0.420          88
## # ℹ 14 more rows

Not all rows are shown because there are many categories. The number of rows displayed can be adjusted as follows.

titanic %>% 
  mutate(age_group = cut(age, breaks = c(0, 20, 60, 100), right = FALSE)) %>% 
  group_by(sex, pclass, age_group) %>%
  summarise(survival_rate = mean(survived), 
            sample_size = n()) %>% 
  print(n = 30)

## # A tibble: 24 × 5
## # Groups:   sex, pclass [6]
##    sex    pclass age_group survival_rate sample_size
##    <chr>   <int> <fct>             <dbl>       <int>
##  1 female      1 [0,20)           0.938           16
##  2 female      1 [20,60)          0.972          108
##  3 female      1 [60,100)         0.889            9
##  4 female      1 <NA>             1               11
##  5 female      2 [0,20)           0.958           24
##  6 female      2 [20,60)          0.885           78
##  7 female      2 [60,100)         0                1
##  8 female      2 <NA>             0.667            3
##  9 female      3 [0,20)           0.540           63
## 10 female      3 [20,60)          0.420           88
## 11 female      3 [60,100)         1                1
## 12 female      3 <NA>             0.531           64
## 13 male        1 [0,20)           0.6             10
## 14 male        1 [20,60)          0.363          124
## 15 male        1 [60,100)         0.118           17
## 16 male        1 <NA>             0.286           28
## 17 male        2 [0,20)           0.444           27
## 18 male        2 [20,60)          0.0806         124
## 19 male        2 [60,100)         0.143            7
## 20 male        2 <NA>             0.154           13
## 21 male        3 [0,20)           0.188           85
## 22 male        3 [20,60)          0.166          259
## 23 male        3 [60,100)         0                5
## 24 male        3 <NA>             0.111          144

Computing the mean age returns NA.

titanic %>% 
  summarise(age_mean = mean(age))

##   age_mean
## 1       NA

This is because the age variable contains NA. Add the na.rm = TRUE argument.

titanic %>% 
  summarise(age_mean = mean(age, na.rm = TRUE))

##   age_mean
## 1 29.88114
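The same behavior can be seen on a short made-up vector, together with a quick way to count how many values are missing:

```r
# Made-up ages with two missing values
age <- c(22, NA, 35, 41, NA)

mean(age)                 # NA, because NA propagates through the sum
mean(age, na.rm = TRUE)   # mean of the three observed values

sum(is.na(age))           # number of missing values
sum(!is.na(age))          # number of observed values
```

Reporting the number of observed values alongside the mean makes it clear how many observations the statistic is based on.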

4.3 Correlation matrix

Correlation matrix.

titanic %>% 
  select(age, fare, parch) %>% 
  cor(use = "complete.obs")

##              age      fare      parch
## age    1.0000000 0.1787399 -0.1502409
## fare   0.1787399 1.0000000  0.2167232
## parch -0.1502409 0.2167232  1.0000000
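The argument use = "complete.obs" drops every row that has an NA in any of the selected columns before computing the correlations. An alternative, "pairwise.complete.obs", drops rows separately for each pair of variables. The sketch below compares the two on a small made-up data frame (toy) and is meant only to show that the choice can change the result.

```r
# Made-up data: x is missing in row 5, z is missing in row 2
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(1, NA, 3, 2, 5)
)

# Uses only rows 1, 3, and 4 for every pair
cor_complete <- cor(toy, use = "complete.obs")

# Uses rows 1, 3, 4, and 5 for the (y, z) pair
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_complete["y", "z"]
cor_pairwise["y", "z"]
```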

4.4 Histogram using `ggplot2`

Age.

ggplot(data = titanic, mapping = aes(x = age)) + 
  geom_histogram()

Split by sex.

ggplot(data = titanic, mapping = aes(x = age, fill = sex)) + 
  geom_histogram(position = "dodge") + 
  scale_fill_grey() +  # grey scale 
  theme_classic()

4.5 Bar plot using `ggplot2`

Number of passengers by port of embarkation.

titanic %>% 
  filter(embarked != "") %>% 
  group_by(embarked) %>% 
  summarise(person = n()) %>% 
  ggplot(mapping = aes(x = embarked, y = person)) + 
  geom_bar(stat = "identity") + 
  theme_classic()
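The same chart can be written a little more compactly: dplyr::count combines group_by and summarise(n()), and geom_col is the ggplot2 shorthand for geom_bar(stat = "identity"). The sketch below uses a small made-up data frame (passengers) instead of the Titanic file so that it runs without a download, and assumes dplyr and ggplot2 are installed.

```r
library(dplyr)
library(ggplot2)

# Made-up data; "" stands for a missing port, as in the original file
passengers <- data.frame(embarked = c("S", "S", "C", "Q", "S", "C", ""))

# Long form, as in the text above
counts_a <- passengers %>%
  filter(embarked != "") %>%
  group_by(embarked) %>%
  summarise(person = n())

# Short form with count()
counts_b <- passengers %>%
  filter(embarked != "") %>%
  count(embarked, name = "person")

# geom_col() plots the values in `person` as bar heights
p_bar <- ggplot(counts_b, mapping = aes(x = embarked, y = person)) +
  geom_col() +
  theme_classic()
p_bar
```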

Split by sex. The X-axis labels are also fixed while we are at it.

titanic %>% 
  filter(embarked != "") %>% 
  group_by(embarked, sex) %>% 
  summarise(person = n()) %>% 
  ggplot(mapping = aes(x = embarked, y = person, fill = sex)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  scale_x_discrete(labels = c("Cherbourg", "Queenstown", "Southampton")) + 
  scale_fill_grey() + 
  theme_classic()

4.6 Pie chart using `ggplot2`

Number of passengers by port of embarkation.

titanic %>% 
  filter(embarked != "") %>% 
  group_by(embarked) %>% 
  summarise(person = n()) %>% 
  ggplot(mapping = aes(x = "x", y = person, fill = embarked)) + 
  geom_bar(stat = "identity", position = "stack") + 
  coord_polar(theta = "y") + 
  scale_fill_brewer(labels = c("Cherbourg", "Queenstown", "Southampton")) + 
  theme_classic()

4.7 Boxplot using `ggplot2`

Age by sex.

ggplot(data = titanic, mapping = aes(x = sex, y = age)) + 
  geom_boxplot()

Age by sex and passenger class.

ggplot(data = titanic, mapping = aes(x = sex, y = age, fill = factor(pclass))) + 
  geom_boxplot() + 
  scale_fill_grey(start = 0.4, end = 0.9) + 
  theme_classic()

When the distribution of the data is complex (e.g., multimodal), a violin plot is an option.

ggplot(data = titanic, mapping = aes(x = sex, y = age)) + 
  geom_violin()

4.8 Scatter plot using `ggplot2`

Draw a scatter plot as for the Beatles and swiss data, and add a smoothing curve.

Because the points overlap, the geom_jitter function is used to spread them along the Y axis.

ggplot(data = titanic, mapping = aes(x = age, y = survived)) + 
  geom_jitter(height = .05, width = 0) +  # jittered points only; also adding geom_point() would draw the unjittered points on top
  geom_smooth()

Since survived on the Y axis is a 0/1 binary variable, jitter feels unnatural. A binned plot is a good alternative.

ggplot(data = titanic, mapping = aes(x = age, y = survived)) + 
  stat_summary_bin()
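The idea behind a binned plot is to cut the X variable into intervals and plot a summary of Y (by default the mean with its standard error) within each interval. The binned means can be reproduced by hand with cut, group_by, and summarise. The sketch below uses a small made-up data frame (passengers) and a bin width of 20 years chosen only for illustration; it assumes dplyr is installed.

```r
library(dplyr)

# Made-up data for illustration only
passengers <- data.frame(
  age      = c(5, 8, 15, 22, 25, 28, 34, 37, 45, 62),
  survived = c(1, 1, 0, 0, 1, 0, 1, 0, 0, 0)
)

# Share of survivors within each 20-year age bin
binned <- passengers %>%
  mutate(age_bin = cut(age, breaks = seq(0, 80, by = 20), right = FALSE)) %>%
  group_by(age_bin) %>%
  summarise(survival_rate = mean(survived),
            sample_size = n())

binned
```

Keeping sample_size next to each rate shows which bins rest on very few observations.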

Show the difference in survival rate by sex.

ggplot(data = titanic, mapping = aes(x = age, y = survived, color = sex, shape = sex, linetype = sex)) + 
  geom_jitter(height = .05, width = 0, size = 0.5) +  # jittered points only, drawn small
  geom_smooth(se = FALSE) + 
  scale_color_brewer(palette = "Dark2") + # colorblind-friendly palette 
  theme_classic()

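Note that the sample size drops sharply from around age 50 onward, so the fitted curve is not reliable in that range.

Show the difference in survival rate by passenger class.

pclass is stored as a numeric (integer) variable, so it is converted to a factor with the factor function before use.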

ggplot(data = titanic, mapping = aes(x = age, y = survived, color = factor(pclass), 
                                     shape = factor(pclass), linetype = factor(pclass))) + 
  geom_jitter(height = .05, width = 0) +  # jittered points only
  geom_smooth(se = FALSE) + 
  scale_color_brewer(palette = "Dark2") +
  theme_classic()

It is also possible to create a separate scatter plot for each sex and arrange the panels side by side.

Use facet_wrap(~ grouping_variable)

ggplot(data = titanic, mapping = aes(x = age, y = survived, color = factor(pclass), 
                                     shape = factor(pclass), linetype = factor(pclass))) + 
  geom_jitter(height = .05, width = 0) +  # jittered points only
  geom_smooth(se = FALSE) + 
  facet_wrap(~ sex) + 
  scale_color_brewer(palette = "Dark2") +
  theme_classic()

Data Analysis Using Statistical Packages: Organizing and Visualizing Data

Sho Kuroda / 黒田翔

Last updated: May 2025
