import rpy2.rinterface
import rpy2.robjects as robjects
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.factorplots import interaction_plot
# https://www.statsmodels.org/stable/examples/notebooks/generated/formulas.html
import seaborn as sns
%load_ext rpy2.ipython
sns.set_theme(color_codes=True)
sns.set_context("poster")
%%R -o Germany
load('LM3.RData')
Specify two models to predict the amount of alcohol consumed on weekends and weekdays by the respondents’education, gender, income,and age.
Please, use the log of the amount of alcohol.
Interpret the results(coefficients, R^2).
modelDays = smf.ols(formula='np.log(alcwkdyn) ~ eduyrsn + gndr + incdec + age',
data=Germany).fit()
modelDays.summary()
Dep. Variable: | np.log(alcwkdyn) | R-squared: | 0.048 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.046 |
Method: | Least Squares | F-statistic: | 25.29 |
Date: | Сб, 16 янв 2021 | Prob (F-statistic): | 1.85e-20 |
Time: | 18:18:30 | Log-Likelihood: | -3372.1 |
No. Observations: | 2000 | AIC: | 6754. |
Df Residuals: | 1995 | BIC: | 6782. |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 2.3322 | 0.200 | 11.673 | 0.000 | 1.940 | 2.724 |
gndr[T.Female] | -0.5492 | 0.059 | -9.291 | 0.000 | -0.665 | -0.433 |
eduyrsn | 0.0116 | 0.009 | 1.233 | 0.218 | -0.007 | 0.030 |
incdec | 0.0202 | 0.012 | 1.752 | 0.080 | -0.002 | 0.043 |
age | 0.0030 | 0.002 | 1.344 | 0.179 | -0.001 | 0.007 |
Omnibus: | 205.007 | Durbin-Watson: | 1.973 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 272.041 |
Skew: | -0.901 | Prob(JB): | 8.45e-60 |
Kurtosis: | 2.882 | Cond. No. | 328. |
# Translating from log to percents
print((np.exp(modelDays.params[modelDays.pvalues < 0.05][1:])-1)*100)
gndr[T.Female] -42.257076 dtype: float64
In the first model for alcohol consumption on weekdays
only effect of gender
is significant:
Gender
: by 42% less
consumed alcohol for females
Overall prediction power equals to 4.6%
of explained variance which depicts model poorness.
modelEnds = smf.ols(formula='np.log(alcwkndn) ~ eduyrsn + gndr + incdec + age',
data=Germany).fit()
modelEnds.summary()
Dep. Variable: | np.log(alcwkndn) | R-squared: | 0.092 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.090 |
Method: | Least Squares | F-statistic: | 50.33 |
Date: | Сб, 16 янв 2021 | Prob (F-statistic): | 2.08e-40 |
Time: | 18:18:30 | Log-Likelihood: | -2904.6 |
No. Observations: | 1999 | AIC: | 5819. |
Df Residuals: | 1994 | BIC: | 5847. |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 3.4394 | 0.158 | 21.738 | 0.000 | 3.129 | 3.750 |
gndr[T.Female] | -0.5459 | 0.047 | -11.657 | 0.000 | -0.638 | -0.454 |
eduyrsn | 0.0177 | 0.007 | 2.369 | 0.018 | 0.003 | 0.032 |
incdec | 0.0099 | 0.009 | 1.081 | 0.280 | -0.008 | 0.028 |
age | -0.0096 | 0.002 | -5.467 | 0.000 | -0.013 | -0.006 |
Omnibus: | 438.473 | Durbin-Watson: | 1.996 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 940.560 |
Skew: | -1.250 | Prob(JB): | 5.75e-205 |
Kurtosis: | 5.246 | Cond. No. | 327. |
# Translating from log to percents
print((np.exp(modelEnds.params[modelEnds.pvalues < 0.05][1:])-1)*100)
gndr[T.Female] -42.068734 eduyrsn 1.785707 age -0.952473 dtype: float64
In the second model for alcohol consumption on weekends
only effect of Income Decile
is unsignificant:
Years of Education
: on 1.7% more
consumed alcohol per year
Gender
: same 42% less
consumed alcohol for females
Age
: by ~1% less
consumed alcohol per year
Overall prediction power equals to 9%
of explained variance which almost twice as higher than for weekdays model.
Estimate the interaction terms between gender and education and between income and age in both models.
Interpret the results.
modelDaysI = smf.ols(formula='np.log(alcwkdyn) ~ eduyrsn * gndr + incdec * age',
data=Germany).fit()
modelDaysI.summary()
Dep. Variable: | np.log(alcwkdyn) | R-squared: | 0.050 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.047 |
Method: | Least Squares | F-statistic: | 17.32 |
Date: | Сб, 16 янв 2021 | Prob (F-statistic): | 1.26e-19 |
Time: | 18:18:30 | Log-Likelihood: | -3370.7 |
No. Observations: | 2000 | AIC: | 6755. |
Df Residuals: | 1993 | BIC: | 6795. |
Df Model: | 6 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 2.0834 | 0.311 | 6.706 | 0.000 | 1.474 | 2.693 |
gndr[T.Female] | -0.7451 | 0.277 | -2.688 | 0.007 | -1.289 | -0.202 |
eduyrsn | 0.0069 | 0.013 | 0.550 | 0.582 | -0.018 | 0.031 |
eduyrsn:gndr[T.Female] | 0.0124 | 0.018 | 0.708 | 0.479 | -0.022 | 0.047 |
incdec | 0.0734 | 0.036 | 2.031 | 0.042 | 0.003 | 0.144 |
age | 0.0103 | 0.005 | 1.996 | 0.046 | 0.000 | 0.020 |
incdec:age | -0.0012 | 0.001 | -1.543 | 0.123 | -0.003 | 0.000 |
Omnibus: | 203.116 | Durbin-Watson: | 1.976 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 268.868 |
Skew: | -0.896 | Prob(JB): | 4.13e-59 |
Kurtosis: | 2.885 | Cond. No. | 3.41e+03 |
# Translating from log to percents
print((np.exp(modelDaysI.params[modelDaysI.pvalues < 0.05][1:])-1)*100)
gndr[T.Female] -52.532990 incdec 7.620540 age 1.037384 dtype: float64
Significant effects on alcohol consumption on weekdays only by Gender
, Income Decile
and Age
:
Gender
: 52% less
consumed alcohol by females
Income Decile
: 7% more
per levelAge
: 1% more
per yearThe same prediction power as for model without interactions.
modelEndsI = smf.ols(formula='np.log(alcwkndn) ~ eduyrsn * gndr + incdec * age',
data=Germany).fit()
modelEndsI.summary()
Dep. Variable: | np.log(alcwkndn) | R-squared: | 0.097 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.094 |
Method: | Least Squares | F-statistic: | 35.54 |
Date: | Сб, 16 янв 2021 | Prob (F-statistic): | 4.79e-41 |
Time: | 18:18:30 | Log-Likelihood: | -2899.1 |
No. Observations: | 1999 | AIC: | 5812. |
Df Residuals: | 1992 | BIC: | 5851. |
Df Model: | 6 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 4.0471 | 0.246 | 16.443 | 0.000 | 3.564 | 4.530 |
gndr[T.Female] | -0.9995 | 0.219 | -4.554 | 0.000 | -1.430 | -0.569 |
eduyrsn | 0.0023 | 0.010 | 0.235 | 0.815 | -0.017 | 0.022 |
eduyrsn:gndr[T.Female] | 0.0298 | 0.014 | 2.141 | 0.032 | 0.003 | 0.057 |
incdec | -0.0544 | 0.029 | -1.902 | 0.057 | -0.111 | 0.002 |
age | -0.0181 | 0.004 | -4.406 | 0.000 | -0.026 | -0.010 |
incdec:age | 0.0015 | 0.001 | 2.411 | 0.016 | 0.000 | 0.003 |
Omnibus: | 438.934 | Durbin-Watson: | 1.993 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 949.826 |
Skew: | -1.247 | Prob(JB): | 5.60e-207 |
Kurtosis: | 5.278 | Cond. No. | 3.42e+03 |
# Translating from log to percents
print((np.exp(modelEndsI.params[modelEndsI.pvalues < 0.05][1:])-1)*100)
gndr[T.Female] -63.192093 eduyrsn:gndr[T.Female] 3.020500 age -1.793094 incdec:age 0.154312 dtype: float64
Significant effects on alcohol consumption on weekends by Gender
, Age
and by both interactions:
Gender
: 63% less
consumed alcohol by females
Age
: 1% less
per yearSlightly higher prediction power in comparison with model without interactions.
Visualize the estimated interaction terms and comment on the results.
sns.lmplot(x="eduyrsn", y="alcwkdyn", hue="gndr", data=Germany,
truncate=0, scatter=0, line_kws={'linewidth':10},
height=9, aspect=12/8).set(yscale="log");
From sjPlot.plot_model
documentation on rule for choosing moderator values in non-binary cases (which they based on this source):
[plot_model] uses the
mean value
of the moderator as well asone standard deviation below and above mean value
to plot the effect of the moderator on the independent variable (following the convention suggested by Cohen and Cohen and popularized by Aiken and West (1991), i.e. using the mean, the value one standard deviation above, and the value one standard deviation below the mean as values of the moderator)
print('Mean - SD: {:.2f}'.format(np.mean(Germany['incdec']) - np.std(Germany['incdec'])))
print(' Mean: {:.2f}'.format(np.mean(Germany['incdec'])))
print('Mean + SD: {:.2f}'.format(np.mean(Germany['incdec']) + np.std(Germany['incdec'])))
Mean - SD: 3.19 Mean: 5.97 Mean + SD: 8.75
Germany['incdec_mod'] = 'mean'
for i, val in enumerate(Germany['incdec']):
if val < (np.mean(Germany['incdec']) - np.std(Germany['incdec'])):
Germany['incdec_mod'].iat[i] = 'mean-sd'
if val > (np.mean(Germany['incdec']) + np.std(Germany['incdec'])):
Germany['incdec_mod'].iat[i] = 'mean+sd'
sns.lmplot(x="age", y="alcwkdyn", hue="incdec_mod", data=Germany,
hue_order=['mean-sd', 'mean', 'mean+sd'],
truncate=0, scatter=0, line_kws={'linewidth':10}, palette="Set1",
height=9, aspect=12/8).set(yscale="log");
sns.lmplot(x="eduyrsn", y="alcwkndn", hue="gndr", data=Germany,
truncate=0, scatter=0, line_kws={'linewidth':10},
height=9, aspect=12/8).set(yscale="log");
Years of Education
have facinating effect on females
- more educated women tend to consume more
alcohol comparing with less educated ones.
At the same time Years of Education
have opposite effect on males
(although less steep) - more educated men tend to consume less
alcohol comparing with less educated ones).
sns.lmplot(x="age", y="alcwkndn", hue="incdec_mod", data=Germany,
hue_order=['mean-sd', 'mean', 'mean+sd'],
truncate=0, scatter=0, line_kws={'linewidth':10}, palette="Set1",
height=9, aspect=12/8).set(yscale="log");
Income Decile
moderates effect of Age
- for anyone alcohol consumption on weekends will lower
with higher Age
, but:
Age
is more steep for person with lower
level of Income Decile
'lower income'
)mean
level of Income Decile
, Age
is having very little effect on person with higher
level of Income Decile
.