We estimate the following regression model with a single explanatory variable (a simple linear regression) using the swiss dataset.
\[ Fertility = \alpha + \beta Exam + u \]
We estimate it with the lm (linear model) function.
# str(swiss) # built-in dataset
lm(formula = Fertility ~ Examination, data = swiss)
##
## Call:
## lm(formula = Fertility ~ Examination, data = swiss)
##
## Coefficients:
## (Intercept) Examination
## 86.819 -1.011
The results above can be interpreted as follows: when the Examination variable is one unit higher, the Fertility variable tends to be 1.011 units lower on average.
They cannot be interpreted as: a one-unit increase in Examination reduces Fertility by 1.011 units. The word ‘reduces’ usually implies a causal relationship, whereas the analysis above only measures correlation, not causation.
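As a small illustration of the descriptive reading, the sketch below stores the fitted model and compares predicted Fertility at two Examination values one unit apart. The object names (fit, pred, slope_gap) and the values 10 and 11 are our own arbitrary choices, not part of the original analysis.

```r
# Fit the simple regression and keep the result (the name `fit` is illustrative)
fit <- lm(Fertility ~ Examination, data = swiss)

# Estimated intercept and slope as a named numeric vector
coef(fit)

# Predicted Fertility at two Examination values one unit apart (10 and 11 are arbitrary)
pred <- predict(fit, newdata = data.frame(Examination = c(10, 11)))

# The gap between the two predictions equals the slope estimate
slope_gap <- unname(pred[2] - pred[1])
all.equal(slope_gap, unname(coef(fit)["Examination"]))
```

The comparison describes how fitted averages differ across regions; it says nothing about what would happen if Examination were changed in a given region.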
Passing the regression object computed by lm to the summary function also displays the R-squared and other statistics.
summary(lm(formula = Fertility ~ Examination, data = swiss))
##
## Call:
## lm(formula = Fertility ~ Examination, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.9375 -6.0044 -0.3393 7.9239 19.7399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.8185 3.2576 26.651 < 2e-16 ***
## Examination -1.0113 0.1782 -5.675 9.45e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.642 on 45 degrees of freedom
## Multiple R-squared: 0.4172, Adjusted R-squared: 0.4042
## F-statistic: 32.21 on 1 and 45 DF, p-value: 9.45e-07
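The printed summary is also available as an object, so individual statistics can be pulled out by name. The following is a minimal sketch; the variable names fit and fit_summary are illustrative.

```r
# Fit the model and store its summary (names are illustrative)
fit <- lm(Fertility ~ Examination, data = swiss)
fit_summary <- summary(fit)

fit_summary$r.squared      # multiple R-squared
fit_summary$adj.r.squared  # adjusted R-squared
fit_summary$sigma          # residual standard error
fit_summary$coefficients   # matrix with Estimate, Std. Error, t value, Pr(>|t|)
```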
Next, consider the following regression model with two explanatory variables (a multiple regression model).
\[ Fertility = \alpha + \beta_1 Exam + \beta_2 Educ + u \]
summary(lm(Fertility ~ Examination + Education, data = swiss))
##
## Call:
## lm(formula = Fertility ~ Examination + Education, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9935 -6.8894 -0.3621 7.1640 19.2634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.2533 3.0855 27.630 <2e-16 ***
## Examination -0.5572 0.2319 -2.402 0.0206 *
## Education -0.5395 0.1924 -2.803 0.0075 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.982 on 44 degrees of freedom
## Multiple R-squared: 0.5055, Adjusted R-squared: 0.483
## F-statistic: 22.49 on 2 and 44 DF, p-value: 1.87e-07
The results above can be interpreted as follows: holding Education fixed at a given level, when Examination is one unit higher, Fertility tends to be 0.5572 units lower on average.
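One way to see what "holding Education fixed" means numerically is to partial Education out of both Fertility and Examination and then regress the leftover parts on each other. This is a sketch of that standard partialling-out argument, not a step from the original text; all object names are illustrative.

```r
# Full model from the text
full <- lm(Fertility ~ Examination + Education, data = swiss)

# Remove the part of each variable that is linearly explained by Education
fert_resid <- resid(lm(Fertility ~ Education, data = swiss))
exam_resid <- resid(lm(Examination ~ Education, data = swiss))

# Regress the leftover Fertility variation on the leftover Examination variation
partial <- lm(fert_resid ~ exam_resid)

# The two slope estimates agree
c(full    = unname(coef(full)["Examination"]),
  partial = unname(coef(partial)["exam_resid"]))
```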
When specifying a regression model, we can use the I function to represent squared terms, and the : and * operators to represent interaction terms. Inside a formula, ^ is treated as a formula operator rather than as arithmetic, so a square must be wrapped in I. The notation var1 : var2 represents var1 × var2, while var1 * var2 represents var1 + var2 + (var1 × var2).
\[ Fertility = \alpha + \beta_1 Exam + \beta_2 Educ^2 + u \]
lm(Fertility ~ Examination + I(Education^2), data = swiss)
##
## Call:
## lm(formula = Fertility ~ Examination + I(Education^2), data = swiss)
##
## Coefficients:
## (Intercept) Examination I(Education^2)
## 83.22072 -0.66069 -0.01035
\[ Fertility = \alpha + \beta_1 Exam \times Educ + u \]
lm(Fertility ~ Examination:Education, data = swiss)
\[ Fertility = \alpha + \beta_1 Exam + \beta_2 Educ + \beta_3 Exam \times Educ + u \]
lm(Fertility ~ Examination*Education, data = swiss)
##
## Call:
## lm(formula = Fertility ~ Examination * Education, data = swiss)
##
## Coefficients:
## (Intercept) Examination Education
## 87.178104 -0.625731 -0.807552
## Examination:Education
## 0.009201
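The formula notation can be checked directly. The sketch below (object names are illustrative) confirms that the * shorthand matches the written-out form, and shows which terms are fitted when the square is written with and without I.

```r
# `*` expands to both main effects plus their interaction
fit_star  <- lm(Fertility ~ Examination * Education, data = swiss)
fit_colon <- lm(Fertility ~ Examination + Education + Examination:Education,
                data = swiss)
all.equal(coef(fit_star), coef(fit_colon))

# Without I(), `^` is a formula operator, so Education^2 collapses to Education
names_plain <- names(coef(lm(Fertility ~ Education^2, data = swiss)))
# With I(), the square is computed arithmetically and enters as its own term
names_I <- names(coef(lm(Fertility ~ I(Education^2), data = swiss)))

names_plain
names_I
```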
When taking the logarithm of an explanatory variable or of the dependent variable, we can use log directly, without I.
\[ \log(Fertility) = \alpha + \beta_1 Exam + u \]
lm(log(Fertility) ~ Examination, data = swiss)
##
## Call:
## lm(formula = log(Fertility) ~ Examination, data = swiss)
##
## Coefficients:
## (Intercept) Examination
## 4.49251 -0.01574
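The original does not interpret this coefficient. Under the usual reading of a model with a logged dependent variable, the slope is approximately a proportional difference, so regions with Examination one unit higher tend to have Fertility roughly 1.6 percent lower on average. A minimal sketch of that calculation, with illustrative names log_fit and b:

```r
# Log-level model from the text
log_fit <- lm(log(Fertility) ~ Examination, data = swiss)
b <- unname(coef(log_fit)["Examination"])

# Approximate percent difference in Fertility per one-unit difference in Examination
100 * b

# Percent difference obtained by exponentiating the coefficient
100 * (exp(b) - 1)
```

As with the level model, this is a description of association, not a causal effect.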
Let’s define a dummy variable, exam_dummy, that takes the value 1 when the Examination variable is greater than its average (approximately 16) and 0 otherwise. The function ifelse(logical expression, value if TRUE, value if FALSE) creates such a variable by returning 1 where the logical expression is TRUE and 0 where it is FALSE.
swiss$exam_dummy <- ifelse(swiss$Examination > 16, 1, 0)
head(swiss[, c("Examination", "exam_dummy")])
## Examination exam_dummy
## Courtelary 15 0
## Delemont 6 0
## Franches-Mnt 5 0
## Moutier 12 0
## Neuveville 17 1
## Porrentruy 9 0
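As a variation, the cutoff can be computed from the data rather than typed in, and a logical comparison can be converted to 0/1 directly. This is a sketch under our own naming (exam_mean, exam_dummy_alt, exam_dummy_16 do not appear in the original); the cross-table shows whether the computed mean and the rounded cutoff 16 classify the regions the same way.

```r
# Threshold computed from the data instead of typed by hand
exam_mean <- mean(swiss$Examination)

# A logical comparison converted to 0/1 (illustrative variable name)
exam_dummy_alt <- as.integer(swiss$Examination > exam_mean)

# The ifelse() version with the rounded cutoff 16, for comparison
exam_dummy_16 <- ifelse(swiss$Examination > 16, 1, 0)

# Cross-tabulate the two definitions
table(mean_cutoff = exam_dummy_alt, fixed_cutoff = exam_dummy_16)
```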
\[ Fertility = \alpha + \beta_1 \mathbf{1}(Exam > 16) + u \]
summary(lm(Fertility ~ exam_dummy, data = swiss))
##
## Call:
## lm(formula = Fertility ~ exam_dummy, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.359 -5.557 1.945 6.793 16.441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.059 2.020 37.657 < 2e-16 ***
## exam_dummy -13.904 3.096 -4.491 4.91e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.5 on 45 degrees of freedom
## Multiple R-squared: 0.3095, Adjusted R-squared: 0.2941
## F-statistic: 20.17 on 1 and 45 DF, p-value: 4.907e-05
In regions where Examination exceeds 16, Fertility tends to be 13.904 units lower on average than in regions where Examination does not exceed 16.
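With a single dummy regressor, the intercept is the mean of Fertility in the group where the dummy is 0, and the slope is the difference between the two group means. The sketch below checks this; the names dummy, group_means and dummy_fit are illustrative.

```r
# Rebuild the dummy so the block runs on its own
dummy <- ifelse(swiss$Examination > 16, 1, 0)

# Mean Fertility within each group (names "0" and "1")
group_means <- tapply(swiss$Fertility, dummy, mean)
group_means

# Regression of Fertility on the dummy
dummy_fit <- lm(swiss$Fertility ~ dummy)

# Intercept equals the mean of group 0
all.equal(unname(coef(dummy_fit)["(Intercept)"]), unname(group_means["0"]))

# Slope equals the difference in group means
mean_gap <- unname(group_means["1"] - group_means["0"])
all.equal(unname(coef(dummy_fit)["dummy"]), mean_gap)
```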