In [1]:

import rpy2.rinterface
import rpy2.robjects as robjects

import numpy as np
import pandas as pd

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.factorplots import interaction_plot
# https://www.statsmodels.org/stable/examples/notebooks/generated/formulas.html
import seaborn as sns

Task 1¶

Specify two models to predict the amount of alcohol consumed on weekends and weekdays by the respondents’education, gender, income,and age.
Please, use the log of the amount of alcohol.
Interpret the results(coefficients, R^2).

Model for weekdays:¶

In [4]:

modelDays = smf.ols(formula='np.log(alcwkdyn) ~ eduyrsn + gndr + incdec + age', 
                    data=Germany).fit()
modelDays.summary()

Out[4]:

OLS Regression Results
Dep. Variable:	np.log(alcwkdyn)	R-squared:	0.048
Model:	OLS	Adj. R-squared:	0.046
Method:	Least Squares	F-statistic:	25.29
Date:	Сб, 16 янв 2021	Prob (F-statistic):	1.85e-20
Time:	18:18:30	Log-Likelihood:	-3372.1
No. Observations:	2000	AIC:	6754.
Df Residuals:	1995	BIC:	6782.
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	2.3322	0.200	11.673	0.000	1.940	2.724
gndr[T.Female]	-0.5492	0.059	-9.291	0.000	-0.665	-0.433
eduyrsn	0.0116	0.009	1.233	0.218	-0.007	0.030
incdec	0.0202	0.012	1.752	0.080	-0.002	0.043
age	0.0030	0.002	1.344	0.179	-0.001	0.007

Omnibus:	205.007	Durbin-Watson:	1.973
Prob(Omnibus):	0.000	Jarque-Bera (JB):	272.041
Skew:	-0.901	Prob(JB):	8.45e-60
Kurtosis:	2.882	Cond. No.	328.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [5]:

# Translating from log to percents
print((np.exp(modelDays.params[modelDays.pvalues < 0.05][1:])-1)*100)

gndr[T.Female]   -42.257076
dtype: float64

In the first model for alcohol consumption on weekdays only effect of gender is significant:

for Gender: by 42% less consumed alcohol for females

Overall prediction power equals to 4.6% of explained variance which depicts model poorness.

Model for weekends:¶

In [6]:

modelEnds = smf.ols(formula='np.log(alcwkndn) ~ eduyrsn + gndr + incdec + age', 
                    data=Germany).fit()
modelEnds.summary()

Out[6]:

OLS Regression Results
Dep. Variable:	np.log(alcwkndn)	R-squared:	0.092
Model:	OLS	Adj. R-squared:	0.090
Method:	Least Squares	F-statistic:	50.33
Date:	Сб, 16 янв 2021	Prob (F-statistic):	2.08e-40
Time:	18:18:30	Log-Likelihood:	-2904.6
No. Observations:	1999	AIC:	5819.
Df Residuals:	1994	BIC:	5847.
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	3.4394	0.158	21.738	0.000	3.129	3.750
gndr[T.Female]	-0.5459	0.047	-11.657	0.000	-0.638	-0.454
eduyrsn	0.0177	0.007	2.369	0.018	0.003	0.032
incdec	0.0099	0.009	1.081	0.280	-0.008	0.028
age	-0.0096	0.002	-5.467	0.000	-0.013	-0.006

Omnibus:	438.473	Durbin-Watson:	1.996
Prob(Omnibus):	0.000	Jarque-Bera (JB):	940.560
Skew:	-1.250	Prob(JB):	5.75e-205
Kurtosis:	5.246	Cond. No.	327.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [7]:

# Translating from log to percents
print((np.exp(modelEnds.params[modelEnds.pvalues < 0.05][1:])-1)*100)

gndr[T.Female]   -42.068734
eduyrsn            1.785707
age               -0.952473
dtype: float64

In the second model for alcohol consumption on weekends only effect of Income Decile is unsignificant:

for Years of Education: on 1.7% more consumed alcohol per year
for Gender: same 42% less consumed alcohol for females
for Age: by ~1% less consumed alcohol per year

Overall prediction power equals to 9% of explained variance which almost twice as higher than for weekdays model.

Task 2¶

Estimate the interaction terms between gender and education and between income and age in both models.
Interpret the results.

Model with moderations for weekdays:¶

In [8]:

modelDaysI = smf.ols(formula='np.log(alcwkdyn) ~ eduyrsn * gndr + incdec * age', 
                     data=Germany).fit()
modelDaysI.summary()

Out[8]:

OLS Regression Results
Dep. Variable:	np.log(alcwkdyn)	R-squared:	0.050
Model:	OLS	Adj. R-squared:	0.047
Method:	Least Squares	F-statistic:	17.32
Date:	Сб, 16 янв 2021	Prob (F-statistic):	1.26e-19
Time:	18:18:30	Log-Likelihood:	-3370.7
No. Observations:	2000	AIC:	6755.
Df Residuals:	1993	BIC:	6795.
Df Model:	6
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	2.0834	0.311	6.706	0.000	1.474	2.693
gndr[T.Female]	-0.7451	0.277	-2.688	0.007	-1.289	-0.202
eduyrsn	0.0069	0.013	0.550	0.582	-0.018	0.031
eduyrsn:gndr[T.Female]	0.0124	0.018	0.708	0.479	-0.022	0.047
incdec	0.0734	0.036	2.031	0.042	0.003	0.144
age	0.0103	0.005	1.996	0.046	0.000	0.020
incdec:age	-0.0012	0.001	-1.543	0.123	-0.003	0.000

Omnibus:	203.116	Durbin-Watson:	1.976
Prob(Omnibus):	0.000	Jarque-Bera (JB):	268.868
Skew:	-0.896	Prob(JB):	4.13e-59
Kurtosis:	2.885	Cond. No.	3.41e+03

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.41e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [9]:

# Translating from log to percents
print((np.exp(modelDaysI.params[modelDaysI.pvalues < 0.05][1:])-1)*100)

gndr[T.Female]   -52.532990
incdec             7.620540
age                1.037384
dtype: float64

Significant effects on alcohol consumption on weekdays only by Gender, Income Decile and Age:

for Gender: 52% less consumed alcohol by females
for Income Decile: 7% more per level
for Age: 1% more per year

The same prediction power as for model without interactions.

Model with moderations for weekends:¶

In [10]:

modelEndsI = smf.ols(formula='np.log(alcwkndn) ~ eduyrsn * gndr + incdec * age', 
                     data=Germany).fit()
modelEndsI.summary()

Out[10]:

OLS Regression Results
Dep. Variable:	np.log(alcwkndn)	R-squared:	0.097
Model:	OLS	Adj. R-squared:	0.094
Method:	Least Squares	F-statistic:	35.54
Date:	Сб, 16 янв 2021	Prob (F-statistic):	4.79e-41
Time:	18:18:30	Log-Likelihood:	-2899.1
No. Observations:	1999	AIC:	5812.
Df Residuals:	1992	BIC:	5851.
Df Model:	6
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	4.0471	0.246	16.443	0.000	3.564	4.530
gndr[T.Female]	-0.9995	0.219	-4.554	0.000	-1.430	-0.569
eduyrsn	0.0023	0.010	0.235	0.815	-0.017	0.022
eduyrsn:gndr[T.Female]	0.0298	0.014	2.141	0.032	0.003	0.057
incdec	-0.0544	0.029	-1.902	0.057	-0.111	0.002
age	-0.0181	0.004	-4.406	0.000	-0.026	-0.010
incdec:age	0.0015	0.001	2.411	0.016	0.000	0.003

Omnibus:	438.934	Durbin-Watson:	1.993
Prob(Omnibus):	0.000	Jarque-Bera (JB):	949.826
Skew:	-1.247	Prob(JB):	5.60e-207
Kurtosis:	5.278	Cond. No.	3.42e+03

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.42e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [11]:

# Translating from log to percents
print((np.exp(modelEndsI.params[modelEndsI.pvalues < 0.05][1:])-1)*100)

gndr[T.Female]           -63.192093
eduyrsn:gndr[T.Female]     3.020500
age                       -1.793094
incdec:age                 0.154312
dtype: float64

Significant effects on alcohol consumption on weekends by Gender, Age and by both interactions:

for Gender: 63% less consumed alcohol by females
for Age: 1% less per year

Slightly higher prediction power in comparison with model without interactions.

Task 3¶

Visualize the estimated interaction terms and comment on the results.

Results unsignificant so there is nothing to interpret.¶

From sjPlot.plot_model documentation on rule for choosing moderator values in non-binary cases (which they based on this source):

[plot_model] uses the mean value of the moderator as well as one standard deviation below and above mean value to plot the effect of the moderator on the independent variable (following the convention suggested by Cohen and Cohen and popularized by Aiken and West (1991), i.e. using the mean, the value one standard deviation above, and the value one standard deviation below the mean as values of the moderator)

In [13]:

print('Mean - SD: {:.2f}'.format(np.mean(Germany['incdec']) - np.std(Germany['incdec'])))
print('     Mean: {:.2f}'.format(np.mean(Germany['incdec'])))
print('Mean + SD: {:.2f}'.format(np.mean(Germany['incdec']) + np.std(Germany['incdec'])))

Mean - SD: 3.19
     Mean: 5.97
Mean + SD: 8.75

In [14]:

Germany['incdec_mod'] = 'mean'
for i, val in enumerate(Germany['incdec']):
    if val < (np.mean(Germany['incdec']) - np.std(Germany['incdec'])):
        Germany['incdec_mod'].iat[i] = 'mean-sd'
    if val > (np.mean(Germany['incdec']) + np.std(Germany['incdec'])):
        Germany['incdec_mod'].iat[i] = 'mean+sd'

Results unsignificant so there is nothing to interpret.¶

Years of Education have facinating effect on females - more educated women tend to consume more alcohol comparing with less educated ones.
At the same time Years of Education have opposite effect on males (although less steep) - more educated men tend to consume less alcohol comparing with less educated ones).

Income Decile moderates effect of Age - for anyone alcohol consumption on weekends will lower with higher Age, but:

that negative effect of Age is more steep for person with lower level of Income Decile
(i.e. 'lower income')
than for person with mean level of Income Decile,
and Age is having very little effect on person with higher level of Income Decile.

Task 1¶

Model for weekdays:¶

Model for weekends:¶

Task 2¶

Model with moderations for weekdays:¶

Model with moderations for weekends:¶

Task 3¶

For weekdays:¶

Results unsignificant so there is nothing to interpret.¶

Results unsignificant so there is nothing to interpret.¶

For weekends:¶