1 Simple regression analysis

Using the built-in swiss dataset, we estimate the following regression model with a single explanatory variable (a simple linear regression). In the equation, Exam abbreviates the Examination variable.

\[ Fertility = \alpha + \beta Exam + u \]

1.1 Estimation and interpretation

Estimate the model with the lm (linear model) function.

# str(swiss)  # built-in dataset
lm(formula = Fertility ~ Examination, data = swiss)
## 
## Call:
## lm(formula = Fertility ~ Examination, data = swiss)
## 
## Coefficients:
## (Intercept)  Examination  
##      86.819       -1.011

The results above can be interpreted as follows: when the Examination variable is one unit higher, the Fertility variable tends to be 1.011 units lower on average.

They cannot be interpreted as: "A one-unit increase in Examination reduces Fertility by 1.011 units."

The word "reduces" normally implies a causal relationship, whereas the analysis above measures only correlation, not causation.
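As a minimal sketch of what the slope means, the code below stores the fitted model and compares the predictions for two hypothetical observations whose Examination values differ by one unit. The names `fit`, `new_obs`, `pred`, and `gap` are introduced here for illustration only.

```r
# Fit the simple regression and keep the result for later inspection.
fit <- lm(Fertility ~ Examination, data = swiss)

# coef() returns a named numeric vector of the estimates.
b <- coef(fit)
b["Examination"]

# Two hypothetical observations that differ by one unit of Examination.
new_obs <- data.frame(Examination = c(10, 11))
pred <- predict(fit, newdata = new_obs)

# The gap between the two predictions equals the slope estimate.
gap <- unname(pred[2] - pred[1])
all.equal(gap, unname(b["Examination"]))
```

The comparison describes how the fitted averages differ between the two observations; it says nothing about what would happen if Examination were changed in a given region.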

1.2 summary function

Passing the regression object returned by lm to the summary function also displays the R-squared (coefficient of determination) and other statistics.

summary(lm(formula = Fertility ~ Examination, data = swiss))
## 
## Call:
## lm(formula = Fertility ~ Examination, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.9375  -6.0044  -0.3393   7.9239  19.7399 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  86.8185     3.2576  26.651  < 2e-16 ***
## Examination  -1.0113     0.1782  -5.675 9.45e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.642 on 45 degrees of freedom
## Multiple R-squared:  0.4172, Adjusted R-squared:  0.4042 
## F-statistic: 32.21 on 1 and 45 DF,  p-value: 9.45e-07
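The printed summary is convenient to read, but the same numbers can also be pulled out of the summary object by name. The sketch below assumes the object names `fit` and `s`, which are not part of the original notes.

```r
fit <- lm(Fertility ~ Examination, data = swiss)
s <- summary(fit)

# Individual statistics stored in the summary object.
s$r.squared       # Multiple R-squared
s$adj.r.squared   # Adjusted R-squared
s$sigma           # Residual standard error

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|).
s$coefficients
s$coefficients["Examination", "Std. Error"]

# In a simple regression with an intercept, R-squared equals the
# squared correlation between the two variables.
r2_from_cor <- cor(swiss$Fertility, swiss$Examination)^2
all.equal(s$r.squared, r2_from_cor)
```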

2 Multiple regression analysis

Consider the following regression model with two explanatory variables (a multiple regression model). Here Educ abbreviates the Education variable.

\[ Fertility = \alpha + \beta_1 Exam + \beta_2 Educ + u \]

summary(lm(Fertility ~ Examination + Education, data = swiss))
## 
## Call:
## lm(formula = Fertility ~ Examination + Education, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9935  -6.8894  -0.3621   7.1640  19.2634 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  85.2533     3.0855  27.630   <2e-16 ***
## Examination  -0.5572     0.2319  -2.402   0.0206 *  
## Education    -0.5395     0.1924  -2.803   0.0075 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.982 on 44 degrees of freedom
## Multiple R-squared:  0.5055, Adjusted R-squared:  0.483 
## F-statistic: 22.49 on 2 and 44 DF,  p-value: 1.87e-07

The results above can be interpreted as follows: holding Education fixed at a given level, when Examination is one unit higher, Fertility tends to be 0.5572 units lower on average.
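One way to see what "holding Education fixed" means numerically is to partial Education out of Examination first. The sketch below is an illustration and not part of the original notes: it regresses Examination on Education, keeps the residuals, and then regresses Fertility on those residuals. By the Frisch-Waugh-Lovell result, the slope should match the Examination coefficient from the multiple regression. The names `full`, `exam_resid`, and `partial` are assumptions made for this example.

```r
# Multiple regression from the text.
full <- lm(Fertility ~ Examination + Education, data = swiss)

# Part of Examination that is not linearly explained by Education.
exam_resid <- resid(lm(Examination ~ Education, data = swiss))

# Regress Fertility on that leftover variation only.
fertility <- swiss$Fertility
partial <- lm(fertility ~ exam_resid)

# The two slope estimates agree up to numerical tolerance.
all.equal(unname(coef(partial)["exam_resid"]),
          unname(coef(full)["Examination"]))
```

Only the point estimate is reproduced this way; the standard errors of the two fits differ because the degrees of freedom are not the same.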

3 Advanced

3.1 Adding quadratic or interaction terms

When specifying a regression model, wrap arithmetic such as squaring in the I function so that it is evaluated as ordinary arithmetic rather than as a formula operator.

Interaction terms use the formula operators directly: var1 : var2 represents var1 × var2, while var1 * var2 represents var1 + var2 + (var1 × var2).
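The sketch below illustrates why I is needed for a squared term. Inside a formula, `^` is a formula operator, so `Education^2` without I is not read as an arithmetic square and the model ends up with a plain Education term instead. The object names `with_I` and `without_I` are chosen here for illustration.

```r
# With I(): Education is squared arithmetically before entering the model.
with_I <- lm(Fertility ~ Examination + I(Education^2), data = swiss)

# Without I(): `^` is treated as a formula operator, not as arithmetic.
without_I <- lm(Fertility ~ Examination + Education^2, data = swiss)

# Compare which terms each model actually contains.
names(coef(with_I))
names(coef(without_I))
```

Checking `names(coef(...))` after fitting is a quick way to confirm that a formula produced the terms that were intended.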

\[ Fertility = \alpha + \beta_1 Exam + \beta_2 Educ^2 + u \]

lm(Fertility ~ Examination + I(Education^2), data = swiss)
## 
## Call:
## lm(formula = Fertility ~ Examination + I(Education^2), data = swiss)
## 
## Coefficients:
##    (Intercept)     Examination  I(Education^2)  
##       83.22072        -0.66069        -0.01035

\[ Fertility = \alpha + \beta_1 Exam \times Educ + u \]

lm(Fertility ~ Examination:Education, data = swiss)

\[ Fertility = \alpha + \beta_1 Exam + \beta_2 Educ + \beta_3 Exam \times Educ + u \]

lm(Fertility ~ Examination*Education, data = swiss)
## 
## Call:
## lm(formula = Fertility ~ Examination * Education, data = swiss)
## 
## Coefficients:
##           (Intercept)            Examination              Education  
##             87.178104              -0.625731              -0.807552  
## Examination:Education  
##              0.009201
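As a quick check of the `*` shorthand, the sketch below fits the same model with the terms written out and compares the coefficients. It also computes the fitted Examination slope at one Education level, since with an interaction term that slope depends on Education. The value `educ_level` is hypothetical and chosen only for illustration.

```r
# Shorthand and fully expanded versions of the same model.
star <- lm(Fertility ~ Examination * Education, data = swiss)
expanded <- lm(Fertility ~ Examination + Education + Examination:Education,
               data = swiss)

# Both formulas produce the same coefficient vector.
all.equal(coef(star), coef(expanded))

# With an interaction, the Examination slope is beta_1 + beta_3 * Educ.
b <- coef(star)
educ_level <- 10  # hypothetical Education value, for illustration only
slope_at_level <- b[["Examination"]] + b[["Examination:Education"]] * educ_level
slope_at_level
```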

When taking the logarithm of an explanatory variable or of the dependent variable, log can be used directly in the formula without I.

\[ \log(Fertility) = \alpha + \beta_1 Exam + u \]

lm(log(Fertility) ~ Examination, data = swiss)
## 
## Call:
## lm(formula = log(Fertility) ~ Examination, data = swiss)
## 
## Coefficients:
## (Intercept)  Examination  
##     4.49251     -0.01574
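With a logged dependent variable, the slope is usually read in percentage terms. The sketch below, which is an added illustration and not part of the original notes, converts the estimate into an approximate and an exact percent difference in Fertility per one-unit difference in Examination. The names `fit_log` and `b1` are assumptions.

```r
fit_log <- lm(log(Fertility) ~ Examination, data = swiss)
b1 <- coef(fit_log)[["Examination"]]

# Approximation: 100 * b1 percent per one-unit difference in Examination.
approx_pct <- 100 * b1

# Exact percent difference implied by the log-linear form.
exact_pct <- 100 * (exp(b1) - 1)

c(approx = approx_pct, exact = exact_pct)
```

For a coefficient this close to zero the two numbers are nearly the same, at roughly 1.6 percent lower Fertility per unit of Examination. As before, this is a statement about association, not causation.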

3.2 Dummy variables

Define a dummy variable, exam_dummy, that takes the value 1 when the Examination variable exceeds 16, which is roughly its average, and 0 otherwise.

The function ifelse(logical expression, value if TRUE, value if FALSE) returns the first value where the logical expression is TRUE and the second where it is FALSE, so it can be used to create a variable that equals 1 exactly when the condition holds.

swiss$exam_dummy <- ifelse(swiss$Examination > 16, 1, 0)
head(swiss[, c("Examination", "exam_dummy")])
##              Examination exam_dummy
## Courtelary            15          0
## Delemont               6          0
## Franches-Mnt           5          0
## Moutier               12          0
## Neuveville            17          1
## Porrentruy             9          0
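The cutoff of 16 is hard-coded above. A possible variant, sketched here under the assumption that the sample mean is the intended threshold, computes the cutoff from the data and builds the dummy on a copy `d` so that swiss itself is left unchanged. The names `d`, `cutoff`, and `same_as_fixed` are introduced for this example.

```r
d <- swiss  # work on a copy of the built-in dataset

# Data-driven cutoff: the sample mean of Examination.
cutoff <- mean(d$Examination)
cutoff

# A logical comparison converted to 0/1 is an alternative to ifelse().
d$exam_dummy <- as.integer(d$Examination > cutoff)
table(d$exam_dummy)

# Check, rather than assume, that this matches the fixed cutoff of 16.
same_as_fixed <- identical(d$exam_dummy, as.integer(d$Examination > 16))
same_as_fixed
```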

\[ Fertility = \alpha + \beta_1 1(Exam > 16) + u \]

summary(lm(Fertility ~ exam_dummy, data = swiss))
## 
## Call:
## lm(formula = Fertility ~ exam_dummy, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.359  -5.557   1.945   6.793  16.441 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   76.059      2.020  37.657  < 2e-16 ***
## exam_dummy   -13.904      3.096  -4.491 4.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.5 on 45 degrees of freedom
## Multiple R-squared:  0.3095, Adjusted R-squared:  0.2941 
## F-statistic: 20.17 on 1 and 45 DF,  p-value: 4.907e-05

In regions where Examination exceeds 16, Fertility tends to be 13.904 units lower on average than in regions where it does not.
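With a single dummy regressor, the estimates are just group means in disguise: the intercept is the average Fertility where the dummy is 0, and the intercept plus the slope is the average where it is 1. The sketch below checks this on a copy of the data; `d`, `fit_d`, and `group_means` are names assumed for illustration.

```r
d <- swiss  # work on a copy of the built-in dataset
d$exam_dummy <- ifelse(d$Examination > 16, 1, 0)

fit_d <- lm(Fertility ~ exam_dummy, data = d)
b <- coef(fit_d)

# Average Fertility within each group, named "0" and "1".
group_means <- tapply(d$Fertility, d$exam_dummy, mean)
group_means

# Intercept = mean of group 0; intercept + slope = mean of group 1.
all.equal(unname(b["(Intercept)"]), unname(group_means["0"]))
all.equal(unname(b["(Intercept)"] + b["exam_dummy"]), unname(group_means["1"]))
```

The slope is therefore the difference between the two group averages, which is why it reads as a comparison between regions and not as the effect of crossing the threshold.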