import rpy2.rinterface
import rpy2.robjects as robjects
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.factorplots import interaction_plot
# https://www.statsmodels.org/stable/examples/notebooks/generated/formulas.html
import seaborn as sns
%load_ext rpy2.ipython
sns.set_theme(color_codes=True)
sns.set_context("poster")
%%R -o Germany
load('LM3.RData')
Specify two models to predict the amount of alcohol consumed on weekends and weekdays by the respondents’education, gender, income,and age.
Please, use the log of the amount of alcohol.
Interpret the results(coefficients, R^2).
modelDays = smf.ols(formula='np.log(alcwkdyn) ~ eduyrsn + gndr + incdec + age',
data=Germany).fit()
modelDays.summary()
| Dep. Variable: | np.log(alcwkdyn) | R-squared: | 0.048 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.046 |
| Method: | Least Squares | F-statistic: | 25.29 |
| Date: | Сб, 16 янв 2021 | Prob (F-statistic): | 1.85e-20 |
| Time: | 18:18:30 | Log-Likelihood: | -3372.1 |
| No. Observations: | 2000 | AIC: | 6754. |
| Df Residuals: | 1995 | BIC: | 6782. |
| Df Model: | 4 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 2.3322 | 0.200 | 11.673 | 0.000 | 1.940 | 2.724 |
| gndr[T.Female] | -0.5492 | 0.059 | -9.291 | 0.000 | -0.665 | -0.433 |
| eduyrsn | 0.0116 | 0.009 | 1.233 | 0.218 | -0.007 | 0.030 |
| incdec | 0.0202 | 0.012 | 1.752 | 0.080 | -0.002 | 0.043 |
| age | 0.0030 | 0.002 | 1.344 | 0.179 | -0.001 | 0.007 |
| Omnibus: | 205.007 | Durbin-Watson: | 1.973 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 272.041 |
| Skew: | -0.901 | Prob(JB): | 8.45e-60 |
| Kurtosis: | 2.882 | Cond. No. | 328. |
# Translating from log to percents
print((np.exp(modelDays.params[modelDays.pvalues < 0.05][1:])-1)*100)
gndr[T.Female] -42.257076 dtype: float64
In the first model for alcohol consumption on weekdays only effect of gender is significant:
Gender: by 42% less consumed alcohol for femalesOverall prediction power equals to 4.6% of explained variance which depicts model poorness.
modelEnds = smf.ols(formula='np.log(alcwkndn) ~ eduyrsn + gndr + incdec + age',
data=Germany).fit()
modelEnds.summary()
| Dep. Variable: | np.log(alcwkndn) | R-squared: | 0.092 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.090 |
| Method: | Least Squares | F-statistic: | 50.33 |
| Date: | Сб, 16 янв 2021 | Prob (F-statistic): | 2.08e-40 |
| Time: | 18:18:30 | Log-Likelihood: | -2904.6 |
| No. Observations: | 1999 | AIC: | 5819. |
| Df Residuals: | 1994 | BIC: | 5847. |
| Df Model: | 4 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 3.4394 | 0.158 | 21.738 | 0.000 | 3.129 | 3.750 |
| gndr[T.Female] | -0.5459 | 0.047 | -11.657 | 0.000 | -0.638 | -0.454 |
| eduyrsn | 0.0177 | 0.007 | 2.369 | 0.018 | 0.003 | 0.032 |
| incdec | 0.0099 | 0.009 | 1.081 | 0.280 | -0.008 | 0.028 |
| age | -0.0096 | 0.002 | -5.467 | 0.000 | -0.013 | -0.006 |
| Omnibus: | 438.473 | Durbin-Watson: | 1.996 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 940.560 |
| Skew: | -1.250 | Prob(JB): | 5.75e-205 |
| Kurtosis: | 5.246 | Cond. No. | 327. |
# Translating from log to percents
print((np.exp(modelEnds.params[modelEnds.pvalues < 0.05][1:])-1)*100)
gndr[T.Female] -42.068734 eduyrsn 1.785707 age -0.952473 dtype: float64
In the second model for alcohol consumption on weekends only effect of Income Decile is unsignificant:
Years of Education: on 1.7% more consumed alcohol per yearGender: same 42% less consumed alcohol for femalesAge: by ~1% less consumed alcohol per yearOverall prediction power equals to 9% of explained variance which almost twice as higher than for weekdays model.
Estimate the interaction terms between gender and education and between income and age in both models.
Interpret the results.
modelDaysI = smf.ols(formula='np.log(alcwkdyn) ~ eduyrsn * gndr + incdec * age',
data=Germany).fit()
modelDaysI.summary()
| Dep. Variable: | np.log(alcwkdyn) | R-squared: | 0.050 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.047 |
| Method: | Least Squares | F-statistic: | 17.32 |
| Date: | Сб, 16 янв 2021 | Prob (F-statistic): | 1.26e-19 |
| Time: | 18:18:30 | Log-Likelihood: | -3370.7 |
| No. Observations: | 2000 | AIC: | 6755. |
| Df Residuals: | 1993 | BIC: | 6795. |
| Df Model: | 6 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 2.0834 | 0.311 | 6.706 | 0.000 | 1.474 | 2.693 |
| gndr[T.Female] | -0.7451 | 0.277 | -2.688 | 0.007 | -1.289 | -0.202 |
| eduyrsn | 0.0069 | 0.013 | 0.550 | 0.582 | -0.018 | 0.031 |
| eduyrsn:gndr[T.Female] | 0.0124 | 0.018 | 0.708 | 0.479 | -0.022 | 0.047 |
| incdec | 0.0734 | 0.036 | 2.031 | 0.042 | 0.003 | 0.144 |
| age | 0.0103 | 0.005 | 1.996 | 0.046 | 0.000 | 0.020 |
| incdec:age | -0.0012 | 0.001 | -1.543 | 0.123 | -0.003 | 0.000 |
| Omnibus: | 203.116 | Durbin-Watson: | 1.976 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 268.868 |
| Skew: | -0.896 | Prob(JB): | 4.13e-59 |
| Kurtosis: | 2.885 | Cond. No. | 3.41e+03 |
# Translating from log to percents
print((np.exp(modelDaysI.params[modelDaysI.pvalues < 0.05][1:])-1)*100)
gndr[T.Female] -52.532990 incdec 7.620540 age 1.037384 dtype: float64
Significant effects on alcohol consumption on weekdays only by Gender, Income Decile and Age:
Gender: 52% less consumed alcohol by femalesIncome Decile: 7% more per levelAge: 1% more per yearThe same prediction power as for model without interactions.
modelEndsI = smf.ols(formula='np.log(alcwkndn) ~ eduyrsn * gndr + incdec * age',
data=Germany).fit()
modelEndsI.summary()
| Dep. Variable: | np.log(alcwkndn) | R-squared: | 0.097 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.094 |
| Method: | Least Squares | F-statistic: | 35.54 |
| Date: | Сб, 16 янв 2021 | Prob (F-statistic): | 4.79e-41 |
| Time: | 18:18:30 | Log-Likelihood: | -2899.1 |
| No. Observations: | 1999 | AIC: | 5812. |
| Df Residuals: | 1992 | BIC: | 5851. |
| Df Model: | 6 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 4.0471 | 0.246 | 16.443 | 0.000 | 3.564 | 4.530 |
| gndr[T.Female] | -0.9995 | 0.219 | -4.554 | 0.000 | -1.430 | -0.569 |
| eduyrsn | 0.0023 | 0.010 | 0.235 | 0.815 | -0.017 | 0.022 |
| eduyrsn:gndr[T.Female] | 0.0298 | 0.014 | 2.141 | 0.032 | 0.003 | 0.057 |
| incdec | -0.0544 | 0.029 | -1.902 | 0.057 | -0.111 | 0.002 |
| age | -0.0181 | 0.004 | -4.406 | 0.000 | -0.026 | -0.010 |
| incdec:age | 0.0015 | 0.001 | 2.411 | 0.016 | 0.000 | 0.003 |
| Omnibus: | 438.934 | Durbin-Watson: | 1.993 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 949.826 |
| Skew: | -1.247 | Prob(JB): | 5.60e-207 |
| Kurtosis: | 5.278 | Cond. No. | 3.42e+03 |
# Translating from log to percents
print((np.exp(modelEndsI.params[modelEndsI.pvalues < 0.05][1:])-1)*100)
gndr[T.Female] -63.192093 eduyrsn:gndr[T.Female] 3.020500 age -1.793094 incdec:age 0.154312 dtype: float64
Significant effects on alcohol consumption on weekends by Gender, Age and by both interactions:
Gender: 63% less consumed alcohol by femalesAge: 1% less per yearSlightly higher prediction power in comparison with model without interactions.
Visualize the estimated interaction terms and comment on the results.
sns.lmplot(x="eduyrsn", y="alcwkdyn", hue="gndr", data=Germany,
truncate=0, scatter=0, line_kws={'linewidth':10},
height=9, aspect=12/8).set(yscale="log");
From sjPlot.plot_model documentation on rule for choosing moderator values in non-binary cases (which they based on this source):
[plot_model] uses the
mean valueof the moderator as well asone standard deviation below and above mean valueto plot the effect of the moderator on the independent variable (following the convention suggested by Cohen and Cohen and popularized by Aiken and West (1991), i.e. using the mean, the value one standard deviation above, and the value one standard deviation below the mean as values of the moderator)
print('Mean - SD: {:.2f}'.format(np.mean(Germany['incdec']) - np.std(Germany['incdec'])))
print(' Mean: {:.2f}'.format(np.mean(Germany['incdec'])))
print('Mean + SD: {:.2f}'.format(np.mean(Germany['incdec']) + np.std(Germany['incdec'])))
Mean - SD: 3.19
Mean: 5.97
Mean + SD: 8.75
Germany['incdec_mod'] = 'mean'
for i, val in enumerate(Germany['incdec']):
if val < (np.mean(Germany['incdec']) - np.std(Germany['incdec'])):
Germany['incdec_mod'].iat[i] = 'mean-sd'
if val > (np.mean(Germany['incdec']) + np.std(Germany['incdec'])):
Germany['incdec_mod'].iat[i] = 'mean+sd'
sns.lmplot(x="age", y="alcwkdyn", hue="incdec_mod", data=Germany,
hue_order=['mean-sd', 'mean', 'mean+sd'],
truncate=0, scatter=0, line_kws={'linewidth':10}, palette="Set1",
height=9, aspect=12/8).set(yscale="log");
sns.lmplot(x="eduyrsn", y="alcwkndn", hue="gndr", data=Germany,
truncate=0, scatter=0, line_kws={'linewidth':10},
height=9, aspect=12/8).set(yscale="log");
Years of Education have facinating effect on females - more educated women tend to consume more alcohol comparing with less educated ones.
At the same time Years of Education have opposite effect on males (although less steep) - more educated men tend to consume less alcohol comparing with less educated ones).
sns.lmplot(x="age", y="alcwkndn", hue="incdec_mod", data=Germany,
hue_order=['mean-sd', 'mean', 'mean+sd'],
truncate=0, scatter=0, line_kws={'linewidth':10}, palette="Set1",
height=9, aspect=12/8).set(yscale="log");
Income Decile moderates effect of Age - for anyone alcohol consumption on weekends will lower with higher Age, but:
Age is more steep for person with lower level of Income Decile'lower income')mean level of Income Decile, Age is having very little effect on person with higher level of Income Decile.