library(tidyverse)

1 Panel data analysis / パネルデータ分析

Heiss (2020) Using R for Introductory Econometrics, Section 13.5 に依拠して，1981年から1987年までの郡レベルの年次パネルデータ crime4 を利用する．

1.1 Data

このデータは Wooldridge の教科書で使用されており，wooldridge パッケージに収められている．

主な変数

crmrte: crime rate (crimes committed per person), 犯罪率
lprbarr: log of probability of arrest, 逮捕確率
lprbconv: log of probability of conviction, 有罪判決の確率
lpolpc: log of police per capita, 警察官の人数

# install.packages("wooldridge")
data(crime4, package = "wooldridge")  # load data 
nrow(crime4)  # sample size

## [1] 630

table(crime4$year)  # year

## 
## 81 82 83 84 85 86 87 
## 90 90 90 90 90 90 90

length(unique(crime4$county))  # number of county

## [1] 90

head(table(crime4$county, crime4$year), 3)  # cross table of county x year

##    
##     81 82 83 84 85 86 87
##   1  1  1  1  1  1  1  1
##   3  1  1  1  1  1  1  1
##   5  1  1  1  1  1  1  1

変数の作成や推定には fixest パッケージを用いる．使い方はパッケージの公式Vignette：Fast Fixed-Effects Estimation: Short Introduction を参照．

# install.packages("fixest")
library(fixest)

注：以前は plm パッケージが主流だった．

1.2 Without fixed effects （参考）

固定効果を含めない場合は？

summary(lm(crmrte ~ lprbarr + lprbconv + lpolpc, data = crime4))

## 
## Call:
## lm(formula = crmrte ~ lprbarr + lprbconv + lpolpc, data = crime4)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.045389 -0.006514 -0.000872  0.006269  0.095020 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1004765  0.0062616   16.05   <2e-16 ***
## lprbarr     -0.0215479  0.0011893  -18.12   <2e-16 ***
## lprbconv    -0.0157643  0.0008423  -18.72   <2e-16 ***
## lpolpc       0.0165269  0.0009717   17.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0123 on 626 degrees of freedom
## Multiple R-squared:  0.5412, Adjusted R-squared:  0.539 
## F-statistic: 246.1 on 3 and 626 DF,  p-value: < 2.2e-16

なお，実際には変数間に複雑・動学的な相互作用が生じていると思われるが，この演習ではその可能性について考えないこととする．

1.3 Oneway: Individual-fixed effect

Individual-level の固定効果を導入する．

\[ y_{it} = \beta_0 + \beta_1 x_{it} + \alpha_i + \epsilon_{it} \]

$i$ は county を表す index で，$\alpha_i$ は county-level の固定効果

1.3.1 Within (fixed effect) estimation

$\dot{y}$ を demean (over time) した変数 ($\dot{y}_{it} = y_{it} - \bar{y}_i$) とすると，モデルは以下のようになる．

\[ \dot{y}_{it} = \beta_1 \dot{x}_{it} + \dot{\epsilon}_{it} \]

1.3.1.1 Demean by hand

County ごとに平均値を計算して差し引けばよい．

crime4_ow_demean <- crime4 %>% 
  group_by(county) %>% 
  mutate(crmrte_demean = crmrte - mean(crmrte), 
         lprbarr_demean = lprbarr - mean(lprbarr), 
         lprbconv_demean =  lprbconv- mean(lprbconv),
         lpolpc_demean =  lpolpc- mean(lpolpc)) %>% 
  ungroup()  # グループ化の解除

実際に $\sum_t \dot{y}_{it} = 0 \ (\forall i)$ になっていることを確認する．

crime4_ow_demean %>% 
  group_by(county) %>% 
  summarise(crmrte_sum = sum(crmrte), 
            crmrte_demean_sum = sum(crmrte_demean)) %>% 
  slice(1:3)

## # A tibble: 3 × 3
##   county crmrte_sum crmrte_demean_sum
##    <int>      <dbl>             <dbl>
## 1      1     0.250           1.39e-17
## 2      3     0.105           3.47e-18
## 3      5     0.0880          1.73e-18

Demean された変数を用いて OLS で推定する．

summary(lm(crmrte_demean ~ lprbarr_demean + lprbconv_demean + lpolpc_demean, 
           data = crime4_ow_demean))

## 
## Call:
## lm(formula = crmrte_demean ~ lprbarr_demean + lprbconv_demean + 
##     lpolpc_demean, data = crime4_ow_demean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.026634 -0.002622 -0.000318  0.002283  0.072563 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.927e-19  2.217e-04    0.00        1    
## lprbarr_demean  -9.843e-03  1.223e-03   -8.05 4.18e-15 ***
## lprbconv_demean -8.455e-03  8.002e-04  -10.57  < 2e-16 ***
## lpolpc_demean    1.527e-02  1.033e-03   14.78  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005566 on 626 degrees of freedom
## Multiple R-squared:  0.2744, Adjusted R-squared:  0.2709 
## F-statistic:  78.9 on 3 and 626 DF,  p-value: < 2.2e-16

1.3.1.2 Demean by `demean` function

fixest::demean 関数を使うことで容易に demean 変数を作成できる．

crime4_ow_demean2 <- demean(X = crime4[, c("crmrte", "lprbarr", "lprbconv", "lpolpc")], 
                            f = crime4[, c("county")])
str(crime4_ow_demean2)

## 'data.frame':    630 obs. of  4 variables:
##  $ crmrte  : num  0.004144 0.002604 -0.005437 -0.001015 0.000832 ...
##  $ lprbarr : num  -0.11058 0.04396 0.02104 0.11368 0.00563 ...
##  $ lprbconv: num  -0.2327 -0.1585 0.0355 0.1755 0.1315 ...
##  $ lpolpc  : num  -0.03221 -0.04358 -0.00516 0.02177 0.04196 ...

Demean された変数を用いて OLS で推定する．

summary(lm(crmrte ~ lprbarr + lprbconv + lpolpc, data = crime4_ow_demean2))

## 
## Call:
## lm(formula = crmrte ~ lprbarr + lprbconv + lpolpc, data = crime4_ow_demean2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.026634 -0.002622 -0.000318  0.002283  0.072563 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.927e-19  2.217e-04    0.00        1    
## lprbarr     -9.843e-03  1.223e-03   -8.05 4.18e-15 ***
## lprbconv    -8.455e-03  8.002e-04  -10.57  < 2e-16 ***
## lpolpc       1.527e-02  1.033e-03   14.78  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005566 on 626 degrees of freedom
## Multiple R-squared:  0.2744, Adjusted R-squared:  0.2709 
## F-statistic:  78.9 on 3 and 626 DF,  p-value: < 2.2e-16

1.3.1.3 `fixest::feols` 関数で推定

固定効果は アウトカム ~ 説明変数1 + 説明変数2 | 固定効果 のように | (vertical bar) の後ろで指定する．

feols(crmrte ~ lprbarr + lprbconv + lpolpc | factor(county), data = crime4)

## OLS estimation, Dep. Var.: crmrte
## Observations: 630 
## Fixed-effects: factor(county): 90
## Standard-errors: Clustered (factor(county)) 
##           Estimate Std. Error  t value   Pr(>|t|)    
## lprbarr  -0.009843   0.003255 -3.02426 0.00325675 ** 
## lprbconv -0.008455   0.002280 -3.70806 0.00036275 ***
## lpolpc    0.015268   0.006681  2.28518 0.02467883 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.005548     Adj. R2: 0.890035
##                  Within R2: 0.274359

回帰係数は明示的に demean した変数を用いた場合の推定量と同じ値が推定されていることが確認できる．

一方で，標準誤差については county レベルのクラスターロバスト標準誤差が自動で計算されるため，値が異なる点に注意．

1.3.2 LSDV (least square dummy variable) estimation

郡の固定効果をモデルに含める．

\[ y_{it} = \beta_0 + \beta_1 x_{it} + \alpha_i + \epsilon_{it} \]

county 変数は数値型のため，そのまま回帰モデルに含めると連続変数として扱われてしまう．

lm(crmrte ~ county + lprbarr + lprbconv + lpolpc, data = crime4)

## 
## Call:
## lm(formula = crmrte ~ county + lprbarr + lprbconv + lpolpc, data = crime4)
## 
## Coefficients:
## (Intercept)       county      lprbarr     lprbconv       lpolpc  
##   9.875e-02    9.937e-06   -2.146e-02   -1.589e-02    1.641e-02

そこで，factor 関数で因子型に変換することでダミー変数として使用する．

panel_lsdv_ow_lm <- lm(crmrte ~ factor(county) + lprbarr + lprbconv + lpolpc, data = crime4)
# panel_lsdv_lm
library(modelsummary)
modelsummary(list(panel_lsdv_ow_lm), coef_omit = 1:90, fmt = 5, 
             gof_omit = "AIC|BIC|Log.Lik.|F|RMSE", stars = TRUE)

	(1)
lprbarr	-0.00984***
	(0.00132)
lprbconv	-0.00846***
	(0.00086)
lpolpc	0.01527***
	(0.00112)
Num.Obs.	630
R2	0.906
R2 Adj.	0.890
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

1.3.3 First difference estimation

郡の固定効果 $\alpha_i$ は差分を取ることで消える．

\[ \Delta y_{it} = \beta_1 \Delta x_{it} + \epsilon_{it} \]

差分を取るには d(変数名) とする．

feols(d(crmrte) ~ d(lprbarr) + d(lprbconv) + d(lpolpc), data = crime4, panel.id = ~ county + year)

## OLS estimation, Dep. Var.: d(crmrte, 1)
## Observations: 540 
## Standard-errors: Clustered (county) 
##                 Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)    -0.000142   0.000210 -0.673964 0.502081    
## d(lprbarr, 1)  -0.009403   0.003988 -2.357772 0.020576 *  
## d(lprbconv, 1) -0.007275   0.002834 -2.567251 0.011920 *  
## d(lpolpc, 1)    0.017817   0.009124  1.952776 0.053988 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.007548   Adj. R2: 0.287912

1.4 Twoway: Individual- and Time- fixed effect

Individual-level と time-level の2つの固定効果を同時に導入する場合．

\[ y_{it} = \beta_0 + \beta_1 x_{it} + \alpha_i + \lambda_t + \epsilon_{it} \]

$i$ は county を表す index で，$\alpha_i$ は county-level の固定効果
$t$ は year を表す index で，$\lambda_t$ は year-level の固定効果

1.4.1 Within estimation

$\dot{y}$ を demean した変数 ($\dot{y}_{it} = y_{it} - \bar{y}_i - \bar{y}_t + \bar{\bar{y}}$) とすると，モデルは以下のようになる．

\[ \dot{y}_{it} = \beta_1 \dot{x}_{it} + \dot{\epsilon}_{it} \]

1.4.1.1 Demean by hand

以下のように demean 関数を使わずに計算することもできるが，見ての通りコードが煩雑になる．

crime4 %>% 
  group_by(county) %>% 
  mutate(crmrte_mean_i = mean(crmrte)) %>% 
  ungroup() %>% 
  group_by(year) %>% 
  mutate(crmrte_mean_y = mean(crmrte)) %>% 
  ungroup() %>% 
  mutate(crmrte_mean = mean(crmrte),
         crmrte_demean = crmrte - crmrte_mean_i - crmrte_mean_y + crmrte_mean) %>% 
  select(county, year, starts_with("crmrte"))

## # A tibble: 630 × 7
##    county  year crmrte crmrte_mean_i crmrte_mean_y crmrte_mean crmrte_demean
##     <int> <int>  <dbl>         <dbl>         <dbl>       <dbl>         <dbl>
##  1      1    81 0.0399        0.0357        0.0328      0.0316      0.00298 
##  2      1    82 0.0383        0.0357        0.0326      0.0316      0.00154 
##  3      1    83 0.0303        0.0357        0.0307      0.0316     -0.00456 
##  4      1    84 0.0347        0.0357        0.0295      0.0316      0.00112 
##  5      1    85 0.0366        0.0357        0.0298      0.0316      0.00266 
##  6      1    86 0.0348        0.0357        0.0323      0.0316     -0.00169 
##  7      1    87 0.0356        0.0357        0.0335      0.0316     -0.00206 
##  8      3    81 0.0164        0.0149        0.0328      0.0316      0.000293
##  9      3    82 0.0191        0.0149        0.0326      0.0316      0.00307 
## 10      3    83 0.0151        0.0149        0.0307      0.0316      0.00109 
## # ℹ 620 more rows

1.4.1.2 Demean by `demean` function

Two-way fixed effect の場合は demean 関数のありがたみをより感じることができる．

crime4_demean <- demean(X = crime4[, c("crmrte", "lprbarr", "lprbconv", "lpolpc")], 
                        f = crime4[, c("county", "year")])
str(crime4_demean)

## 'data.frame':    630 obs. of  4 variables:
##  $ crmrte  : num  0.00298 0.00154 -0.00456 0.00112 0.00266 ...
##  $ lprbarr : num  -0.0926 0.024061 0.005557 0.07988 -0.000838 ...
##  $ lprbconv: num  -0.1858 -0.0973 0.0213 0.1606 0.033 ...
##  $ lpolpc  : num  0.01318 0.02494 -0.00746 -0.00512 0.00155 ...

実際に $\sum_i \dot{y}_{it} = 0 \ (\forall t)$ になっていることを確認する．

data.frame(year = crime4$year, crime4_demean) %>% 
  group_by(year) %>% 
  summarise(crmrte_sum = sum(crmrte)) %>% 
  slice(1:3)

## # A tibble: 3 × 2
##    year crmrte_sum
##   <int>      <dbl>
## 1    81  -4.16e-17
## 2    82   4.16e-17
## 3    83  -9.37e-17

$\sum_t \dot{y}_{it} = 0 \ (\forall i)$ も同様に確認．

data.frame(county = crime4$county, crime4_demean) %>% 
  group_by(county) %>% 
  summarise(crmrte_sum = sum(crmrte)) %>% 
  slice(1:3)

## # A tibble: 3 × 2
##   county crmrte_sum
##    <int>      <dbl>
## 1      1  -6.94e-18
## 2      3  -2.08e-17
## 3      5   1.39e-17

Demean された変数を用いて OLS で推定する．

summary(lm(crmrte ~ lprbarr + lprbconv + lpolpc, data = crime4_demean))

## 
## Call:
## lm(formula = crmrte ~ lprbarr + lprbconv + lpolpc, data = crime4_demean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.026536 -0.002072 -0.000194  0.002198  0.071561 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.032e-17  2.149e-04   0.000        1    
## lprbarr     -9.008e-03  1.204e-03  -7.479 2.54e-13 ***
## lprbconv    -7.881e-03  7.910e-04  -9.963  < 2e-16 ***
## lpolpc       1.554e-02  1.013e-03  15.332  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005393 on 626 degrees of freedom
## Multiple R-squared:  0.2814, Adjusted R-squared:  0.278 
## F-statistic: 81.72 on 3 and 626 DF,  p-value: < 2.2e-16

1.4.1.3 `fixest::feols` 関数で推定

feols(crmrte ~ lprbarr + lprbconv + lpolpc | factor(county) + factor(year), data = crime4)

（出力は省略）

この演習では詳しく扱わないが，標準誤差を変更することができる．たとえば，Newey and West (ECMA 1987) が提案した標準誤差を適用するには，パネルデータの構造を panel.id 引数で指定したうえで以下のようにする．

crime4_within_twoway <- feols(crmrte ~ lprbarr + lprbconv + lpolpc | factor(county) + factor(year), 
                              data = crime4, panel.id = ~ county + year)
summary(crime4_within_twoway, "newey_west")

## OLS estimation, Dep. Var.: crmrte
## Observations: 630 
## Fixed-effects: factor(county): 90,  factor(year): 7
## Standard-errors: Newey-West (L=1) 
##           Estimate Std. Error  t value  Pr(>|t|)    
## lprbarr  -0.009008   0.003171 -2.84106 0.0295259 *  
## lprbconv -0.007881   0.001848 -4.26474 0.0052936 ** 
## lpolpc    0.015539   0.004952  3.13763 0.0201285 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.005376     Adj. R2: 0.895563
##                  Within R2: 0.281417

1.4.2 LSDV estimation

年と郡それぞれの固定効果をモデルに含める．

\[ y_{it} = \beta_0 + \beta_1 x_{it} + \alpha_i + \lambda_t + \epsilon_{it} \]

panel_lsdv_lm <- lm(crmrte ~ factor(county) + factor(year) + lprbarr + lprbconv + lpolpc, data = crime4)
# panel_lsdv_lm
library(modelsummary)
modelsummary(list(panel_lsdv_lm), coef_omit = 1:96, fmt = 5,
             gof_omit = "AIC|BIC|Log.Lik.|F|RMSE", stars = TRUE)

（出力は省略）

1.4.3 First difference estimation

郡の固定効果 $\alpha_i$ は差分を取ることで消える．

\[ \Delta y_{it} = \beta_0 + \beta_1 \Delta x_{it} + \lambda_t + \epsilon_{it} \]

feols(d(crmrte) ~ d(lprbarr) + d(lprbconv) + d(lpolpc) | factor(year), 
      data = crime4, panel.id = ~ county + year)

（出力は省略）

2 Convert data format

パネルデータは wide 形式と long 形式という2つの形式をとりうる．

ここまで使ってきたデータは long 形式と呼ばれるもので，ユニークな individual-time が1行に対応する．

（参考：ユニークな individual-time-variable または individual-time を1行に対応させたデータを long 形式と呼ぶ場合もある．）

Long 形式：1行＝1観測（すなわち，1行には1時点のデータしか含まれない）

County	Year	Crime rate	Prob. arrest
1	1981	0.0399	0.2897
1	1982	0.0383	0.3381
3	1981	0.0164	0.2029
3	1982	0.0191	0.1622
…	…	…	…

もう一つの形式は wide．ユニークな Individual を1行に対応させる形式で，縦に長い long 形式と比べて横に長く（文字通りワイドに）なる．

Wide 形式：1行＝1郡（すなわち，1行に複数時点のデータが含まれる）

County	Crime rate 1981	Crime rate 1982	Prob. arrest 1981	Prob. arrest 1982
1	0.0399	0.0383	0.2897	0.3381
3	0.0164	0.0191	0.2029	0.1622
…	…	…	…	…

一般的には tidy データ（1つの行に1つの observation が対応する）をベースにハンドリング・分析することが望ましく，パネルデータの文脈では long 形式が tidy である．

以下のように long ⇔ wide の変換が可能だが，関数の使い方が複雑なので ChatGPT に教えてもらうのがベター．

2.1 Long to Wide

注：通常は long 形式のまま分析を行うのでこのセクションを理解しなくても差し支えない．

Long から wide に変換するには tidyr::pivot_wider 関数を用いる．

注：tidyr パッケージは tidyverse を構成するパッケージの一つなので，tidyverse を（インストール＆）ロードすれば使うことができる．

crime4_wide <- crime4 %>% 
  select(county, year, crmrte, lprbarr, lprbconv, lpolpc) %>% 
  pivot_wider(names_from = "year", values_from = c(crmrte, lprbarr, lprbconv, lpolpc))
crime4_wide %>% slice(1:3)

## # A tibble: 3 × 29
##   county crmrte_81 crmrte_82 crmrte_83 crmrte_84 crmrte_85 crmrte_86 crmrte_87
##    <int>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1      1   0.0399     0.0383    0.0303    0.0347    0.0366    0.0348    0.0356
## 2      3   0.0164     0.0191    0.0151    0.0137    0.0120    0.0130    0.0153
## 3      5   0.00934    0.0123    0.0128    0.0138    0.0125    0.0142    0.0130
## # ℹ 21 more variables: lprbarr_81 <dbl>, lprbarr_82 <dbl>, lprbarr_83 <dbl>,
## #   lprbarr_84 <dbl>, lprbarr_85 <dbl>, lprbarr_86 <dbl>, lprbarr_87 <dbl>,
## #   lprbconv_81 <dbl>, lprbconv_82 <dbl>, lprbconv_83 <dbl>, lprbconv_84 <dbl>,
## #   lprbconv_85 <dbl>, lprbconv_86 <dbl>, lprbconv_87 <dbl>, lpolpc_81 <dbl>,
## #   lpolpc_82 <dbl>, lpolpc_83 <dbl>, lpolpc_84 <dbl>, lpolpc_85 <dbl>,
## #   lpolpc_86 <dbl>, lpolpc_87 <dbl>

2期間しかない場合は wide 形式のデータを用いて簡単に first difference 推定が可能．

crime4_wide_diff <- crime4_wide %>% 
  mutate(crmrte = crmrte_87 - crmrte_81,  # first difference 
         lprbarr = lprbarr_87 - lprbarr_81,
         lprbconv = lprbconv_87 - lprbconv_81,
         lpolpc = lpolpc_87 - lpolpc_81)
lm(crmrte ~ lprbarr + lprbconv + lpolpc, data = crime4_wide_diff)

## 
## Call:
## lm(formula = crmrte ~ lprbarr + lprbconv + lpolpc, data = crime4_wide_diff)
## 
## Coefficients:
## (Intercept)      lprbarr     lprbconv       lpolpc  
##   0.0002265   -0.0073395   -0.0067913    0.0043511

Long 型のデータのまま推定する場合（答え合わせ）．

crime4_8187 <- crime4 %>% 
  filter(year %in% c(81, 87))  # same as [year == 81 | year == 87]
feols(d(crmrte) ~ d(lprbarr) + d(lprbconv) + d(lpolpc), data = crime4_8187, panel.id = ~ county + year)

## OLS estimation, Dep. Var.: d(crmrte, 1)
## Observations: 90 
## Standard-errors: Clustered (county) 
##                 Estimate Std. Error   t value  Pr(>|t|)    
## (Intercept)     0.000227   0.000727  0.311697 0.7560001    
## d(lprbarr, 1)  -0.007339   0.002760 -2.659303 0.0092854 ** 
## d(lprbconv, 1) -0.006791   0.003294 -2.062045 0.0421193 *  
## d(lpolpc, 1)    0.004351   0.003399  1.280283 0.2037725    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.007976   Adj. R2: 0.13243

2.2 Wide to Long

Wide から long に変換するには tidyr::pivot_longer 関数を用いる．

(.*) などは正規表現と呼ばれるもので，このテクニックを文字列のパターン認識と変数名の作成に利用している．正規表現は文字データの処理をする上で非常に有用なテクニックだが，統計分析とは直接関係がないのでこの授業では理解しなくてよい．

crime4_long <- crime4_wide %>% 
  pivot_longer(cols = -county,
               names_to = c(".value", "year"), 
               names_pattern = "(.*)_(\\d+)")
crime4_long %>% slice(1:3)

## # A tibble: 3 × 6
##   county year  crmrte lprbarr lprbconv lpolpc
##    <int> <chr>  <dbl>   <dbl>    <dbl>  <dbl>
## 1      1 81    0.0399   -1.24   -0.911  -6.33
## 2      1 82    0.0383   -1.08   -0.837  -6.34
## 3      1 83    0.0303   -1.11   -0.643  -6.30

元のデータと比較．

crime4 %>% 
  select(county, year, crmrte, lprbarr, lprbconv, lpolpc) %>% 
  slice(1:3)

##   county year    crmrte   lprbarr   lprbconv    lpolpc
## 1      1   81 0.0398849 -1.238923 -0.9111490 -6.327340
## 2      1   82 0.0383449 -1.084381 -0.8370060 -6.338704
## 3      1   83 0.0303048 -1.107303 -0.6430188 -6.300291

3 Difference-in-differences / 差分の差分法

Heiss (2020) Using R for Introductory Econometrics, Section 13.2 に依拠して，Kiel and McClain (JEEM 1995) のデータを用いた DID 分析を行う．ただし，Kiel and McClain (1995) 自体は DID を行っているわけではない．

3.1 Data

ごみ焼却施設 (garbage incinerator) の建設が近隣の住宅価格に与える影響を調べるために，焼却炉建設の噂が出回る前の1978年と建設が始まった1981年の2時点のデータを用いて DID 分析を行う．

このデータは Wooldridge の教科書で使用されており，wooldridge パッケージに kielmc という名前でデータセットが収められている．

主な変数

rprice: 取引価格
nearinc: 焼却施設に近いところ（処置群）で 1 をとるダミー変数
y81: 1981年（処置後）に 1 をとるダミー変数

data(kielmc, package = "wooldridge")
head(kielmc, 2)

##   year age agesq nbh  cbd intst lintst price rooms area land baths  dist
## 1 1978  48  2304   4 3000  1000 6.9078 60000     7 1660 4578     1 10700
## 2 1978  83  6889   4 4000  1000 6.9078 40000     6 2612 8370     2 11000
##      ldist wind   lprice y81    larea    lland y81ldist lintstsq nearinc
## 1 9.277999    3 11.00210   0 7.414573 8.429017        0  47.7177       1
## 2 9.305651    3 10.59663   0 7.867871 9.032409        0  47.7177       1
##   y81nrinc rprice  lrprice
## 1        0  60000 11.00210
## 2        0  40000 10.59663

kielmc %>% select(rprice, nearinc, y81) %>% summary

##      rprice          nearinc            y81        
##  Min.   : 26000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 59000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 82000   Median :0.0000   Median :0.0000  
##  Mean   : 83721   Mean   :0.2991   Mean   :0.4424  
##  3rd Qu.:100230   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :300000   Max.   :1.0000   Max.   :1.0000

3.2 DID table

以下のように difference をとる．

kielmc %>% 
  group_by(nearinc, y81) %>% 
  summarise(rprice = mean(rprice)) %>% 
  print.data.frame(digits = 6)

##   nearinc y81   rprice
## 1       0   0  82517.2
## 2       0   1 101307.5
## 3       1   0  63692.9
## 4       1   1  70619.2

101307.5 - 82517.2  # difference in control group [nearinc == 0]

## [1] 18790.3

70619.2 - 63692.9  # difference in treatment group [nearinc == 1]

## [1] 6926.3

(70619.2 - 63692.9) - (101307.5 - 82517.2)  # DID

## [1] -11864

Variable	Control (`nearinc == 0`)	Treatment (`nearinc == 1`)	Difference: Near-Far
Price: Before (1978)	82,517.2	63,692.9	-18,824.3
Price: After (1981)	101,307.5	70,619.2	-30,688.3
Change in Price: After-Before	18,790.3	6,926.3	-11,864

すなわち，焼却施設の建設によって $11,864 だけ価格が下落する．

中級者向け：以下のように計算してもよい．

kielmc %>% 
  group_by(nearinc, y81) %>% 
  summarise(rprice = mean(rprice)) %>% 
  group_by(nearinc) %>% 
  summarise(diff = diff(rprice)) %>% 
  mutate(diff_in_diff = diff(diff)) %>% 
  slice(2)  # extract 2nd row

## # A tibble: 1 × 3
##   nearinc  diff diff_in_diff
##     <int> <dbl>        <dbl>
## 1       1 6926.      -11864.

3.3 Regression

\[ y = \beta_0 + \beta_1 \mbox{Treated} + \beta_2 \mbox{After} + \beta_3 \mbox{Treated} \times \mbox{After} + \epsilon \]

nearinc:y81 は nearinc (Treat) と y81 (After) の交差項．この係数が DID 推定量で，上のセクションで計算した値と同じになる．

lm(formula = rprice ~ nearinc + y81 + nearinc:y81, data = kielmc)

## 
## Call:
## lm(formula = rprice ~ nearinc + y81 + nearinc:y81, data = kielmc)
## 
## Coefficients:
## (Intercept)      nearinc          y81  nearinc:y81  
##       82517       -18824        18790       -11864

treat * after は treat + after + treat:after と同じ．

did_reg_i <- lm(formula = rprice ~ nearinc*y81, data = kielmc)
did_reg_i

## 
## Call:
## lm(formula = rprice ~ nearinc * y81, data = kielmc)
## 
## Coefficients:
## (Intercept)      nearinc          y81  nearinc:y81  
##       82517       -18824        18790       -11864

説明変数として，家屋の築年数 age，州間高速道路出入口までの距離 intst，土地面積 land，家屋面積 area，部屋数 rooms，bathroom の数 baths を加える．

did_reg_ii <- lm(formula = rprice ~ nearinc * y81 + age + I(age^2 / 100) + 
                   log(intst) + log(land) + log(area) + rooms + baths, data = kielmc)
did_reg_iii <- lm(formula = I(rprice/area) ~ nearinc * y81 + age + I(age^2 / 100) + 
                   log(intst) + log(land) + log(area) + rooms + baths, data = kielmc)
did_reg_iv <- lm(formula = log(rprice) ~ nearinc * y81 + age + I(age^2 / 100) + 
                   log(intst) + log(land) + log(area) + rooms + baths, data = kielmc)

まとめて出力．

library(modelsummary)
modelsummary(list(Price = did_reg_i, Price = did_reg_ii, 
                  `Unit Price` = did_reg_iii, `ln(Price)` = did_reg_iv), 
             gof_omit = "AIC|BIC|Log.Lik.|F|RMSE", stars = TRUE)

	Price	Price	Unit Price	ln(Price)
(Intercept)	82517.228***	-238591.134***	185.247***	7.652***
	(2726.910)	(41020.561)	(15.657)	(0.416)
nearinc	-18824.370***	10941.232*	1.419	0.032
	(4875.322)	(4683.912)	(1.788)	(0.047)
y81	18790.286***	16270.665***	6.365***	0.162***
	(4050.065)	(2811.080)	(1.073)	(0.028)
nearinc × y81	-11863.903	-17767.932***	-5.536**	-0.132*
	(7456.646)	(5126.169)	(1.957)	(0.052)
age		-704.170***	-0.312***	-0.008***
		(139.188)	(0.053)	(0.001)
I(age^2/100)		349.628***	0.147***	0.004***
		(85.501)	(0.033)	(0.001)
log(intst)		-7024.937*	-3.354**	-0.061+
		(3107.732)	(1.186)	(0.032)
log(land)		10359.432***	4.548***	0.100***
		(2415.652)	(0.922)	(0.024)
log(area)		32442.580***	-23.031***	0.351***
		(5078.362)	(1.938)	(0.051)
rooms		2842.505+	1.316*	0.047**
		(1709.086)	(0.652)	(0.017)
baths		7092.305**	3.249**	0.094***
		(2734.712)	(1.044)	(0.028)
Num.Obs.	321	321	321	321
R2	0.174	0.643	0.492	0.733
R2 Adj.	0.166	0.632	0.476	0.724
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

物件属性をコントロール変数としてモデルに追加した場合，DID 推定量は -$17,768 となり，統計的にも有意になる．アウトカムに自然対数を取った場合に係数は -0.132 と推定され，これは焼却施設の建設が価格を 13.2% 低下させることを意味する．

4 Event study

fixest パッケージに含まれているサンプルデータ base_did を利用する．

主な変数

y: outcome
x1: explanatory variable
period: from 1 to 10
post: = 1 if period >= 6
- つまり，処置群の個体はすべて同じタイミングで処置を受けるため，staggered の設定ではない
treat: = 1 if treatment group

nrow(base_did)

## [1] 1080

head(base_did)

##             y         x1 id period post treat
## 1  2.87530627  0.5365377  1      1    0     1
## 2  1.86065272 -3.0431894  1      2    0     1
## 3  0.09416524  5.5768439  1      3    0     1
## 4  3.78147485 -2.8300587  1      4    0     1
## 5 -2.55819959 -5.0443544  1      5    0     1
## 6  1.72873240 -0.6363849  1      6    1     1

table(base_did$period)

## 
##   1   2   3   4   5   6   7   8   9  10 
## 108 108 108 108 108 108 108 108 108 108

4.1 Twoway fixed effect model

feols(y ~ x1 + post:treat | id + period, data = base_did)

## OLS estimation, Dep. Var.: y
## Observations: 1,080 
## Fixed-effects: id: 108,  period: 10
## Standard-errors: Clustered (id) 
##            Estimate Std. Error t value  Pr(>|t|)    
## x1         0.972166   0.045972 21.1468 < 2.2e-16 ***
## post:treat 4.850645   0.455222 10.6556 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 3.90173     Adj. R2: 0.474684
##                 Within R2: 0.368706

4.2 Event study: Estimation by hand

Period = 6 で処置が始まるので，その直前の Period = 5 を reference point として推定する．

\[ y_{it} = \beta_0 + \beta_1 x_{it1} + \sum_{\tau \in \{1, \ldots, 4,6,\ldots,10 \}} \gamma_{\tau} (\mbox{Treat})_i \times 1(\mbox{Period}_t = \tau) + \alpha_i + \lambda_t + \epsilon_{it} \]

base_did_mod <- base_did %>% 
  mutate(
    treat_p1 = treat * (period == 1), 
    treat_p2 = treat * (period == 2), 
    treat_p3 = treat * (period == 3), 
    treat_p4 = treat * (period == 4), 
    treat_p6 = treat * (period == 6), 
    treat_p7 = treat * (period == 7), 
    treat_p8 = treat * (period == 8), 
    treat_p9 = treat * (period == 9), 
    treat_p10 = treat * (period == 10)
    )
feols_es_hand <- feols(y ~ x1 + treat_p1 + treat_p2 + treat_p3 + treat_p4 + treat_p6 + treat_p7 + 
                         treat_p8 + treat_p9 + treat_p10 | id + period, data = base_did_mod) 
feols_es_hand

## OLS estimation, Dep. Var.: y
## Observations: 1,080 
## Fixed-effects: id: 108,  period: 10
## Standard-errors: Clustered (id) 
##            Estimate Std. Error   t value   Pr(>|t|)    
## x1         0.973490   0.045678 21.311868  < 2.2e-16 ***
## treat_p1  -1.403045   1.110267 -1.263701 2.0908e-01    
## treat_p2  -1.247511   1.093145 -1.141213 2.5633e-01    
## treat_p3  -0.273206   1.106935 -0.246813 8.0553e-01    
## treat_p4  -1.795721   1.087974 -1.650518 1.0177e-01    
## treat_p6   0.784452   1.028388  0.762798 4.4726e-01    
## treat_p7   3.598897   1.101563  3.267081 1.4609e-03 ** 
## treat_p8   3.811766   1.247502  3.055519 2.8366e-03 ** 
## treat_p9   4.731426   1.097113  4.312617 3.6041e-05 ***
## treat_p10  6.606229   1.120494  5.895817 4.4031e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 3.83653     Adj. R2: 0.48783 
##                 Within R2: 0.389628

95%信頼区間（Estimate ± 1.96×SE）とあわせて描画する．

feols_es_coef <- data.frame(period = c(1:4, 6:10), feols_es_hand$coeftable[-1, ])
head(feols_es_coef, 2)

##          period  Estimate Std..Error   t.value  Pr...t..
## treat_p1      1 -1.403045   1.112119 -1.261597 0.2074027
## treat_p2      2 -1.247511   1.111927 -1.121935 0.2621725

ggplot(feols_es_coef, aes(x = period, y = Estimate)) +
  geom_point() + 
  geom_errorbar(aes(ymin = Estimate - 1.96*Std..Error, ymax = Estimate + 1.96*Std..Error), width = .2) + 
  geom_hline(yintercept = 0, color = "gray")

4.3 Event study: Using `feols`+`i` and `iplot`

fixest::feols 関数の中で interaction 変数を作成する fixest::i 関数を使用すれば，fixest::iplot と以下のように組み合わせることで容易に event study graph が作成できる．

i 関数は i(factor_var = time, var = treat, ref = ref_time) の形式で使う
- 1つ目の引数は factor 型として扱われる変数（時間），2つ目が treatment dummy を表す変数
- Period = 6 で処置が始まるため，その直前の Period = 5 を ref 引数として指定

feols_es <- feols(y ~ x1 + i(period, treat, ref = 5) | id + period, data = base_did) 
feols_es

## OLS estimation, Dep. Var.: y
## Observations: 1,080 
## Fixed-effects: id: 108,  period: 10
## Standard-errors: Clustered (id) 
##                   Estimate Std. Error   t value   Pr(>|t|)    
## x1                0.973490   0.045678 21.311868  < 2.2e-16 ***
## period::1:treat  -1.403045   1.110267 -1.263701 2.0908e-01    
## period::2:treat  -1.247511   1.093145 -1.141213 2.5633e-01    
## period::3:treat  -0.273206   1.106935 -0.246813 8.0553e-01    
## period::4:treat  -1.795721   1.087974 -1.650518 1.0177e-01    
## period::6:treat   0.784452   1.028388  0.762798 4.4726e-01    
## period::7:treat   3.598897   1.101563  3.267081 1.4609e-03 ** 
## period::8:treat   3.811766   1.247502  3.055519 2.8366e-03 ** 
## period::9:treat   4.731426   1.097113  4.312617 3.6041e-05 ***
## period::10:treat  6.606229   1.120494  5.895817 4.4031e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 3.83653     Adj. R2: 0.48783 
##                 Within R2: 0.389628

iplot(feols_es)

Data Analysis Using Statistical Packages: Panel Data Analysis and DID

Sho Kuroda / 黒田翔

2024年3月 (Last update: 2024年4月)

1 Panel data analysis / パネルデータ分析

1.1 Data

1.2 Without fixed effects （参考）

1.3 Oneway: Individual-fixed effect

1.3.1 Within (fixed effect) estimation

1.3.1.1 Demean by hand

1.3.1.2 Demean by `demean` function

1.3.1.3 `fixest::feols` 関数で推定

1.3.2 LSDV (least square dummy variable) estimation

1.3.3 First difference estimation

1.4 Twoway: Individual- and Time- fixed effect

1.4.1 Within estimation

1.4.1.1 Demean by hand

1.4.1.2 Demean by `demean` function

1.4.1.3 `fixest::feols` 関数で推定

1.4.2 LSDV estimation

1.4.3 First difference estimation

2 Convert data format

2.1 Long to Wide

2.2 Wide to Long

3 Difference-in-differences / 差分の差分法

3.1 Data

3.2 DID table

3.3 Regression

4 Event study

4.1 Twoway fixed effect model

4.2 Event study: Estimation by hand

4.3 Event study: Using `feols`+`i` and `iplot`

Data Analysis Using Statistical Packages: Panel Data Analysis and DID

Sho Kuroda / 黒田翔

2024年3月 (Last update: 2024年4月)

1 Panel data analysis / パネルデータ分析

1.1 Data

1.2 Without fixed effects （参考）

1.3 Oneway: Individual-fixed effect

1.3.1 Within (fixed effect) estimation

1.3.1.1 Demean by hand

1.3.1.2 Demean by demean function

1.3.1.3 fixest::feols 関数で推定

1.3.2 LSDV (least square dummy variable) estimation

1.3.3 First difference estimation

1.4 Twoway: Individual- and Time- fixed effect

1.4.1 Within estimation

1.4.1.1 Demean by hand

1.4.1.2 Demean by demean function

1.4.1.3 fixest::feols 関数で推定

1.4.2 LSDV estimation

1.4.3 First difference estimation

2 Convert data format

2.1 Long to Wide

2.2 Wide to Long

3 Difference-in-differences / 差分の差分法

3.1 Data

3.2 DID table

3.3 Regression

4 Event study

4.1 Twoway fixed effect model

4.2 Event study: Estimation by hand

4.3 Event study: Using feols+i and iplot

1.3.1.2 Demean by `demean` function

1.3.1.3 `fixest::feols` 関数で推定

1.4.1.2 Demean by `demean` function

1.4.1.3 `fixest::feols` 関数で推定

4.3 Event study: Using `feols`+`i` and `iplot`