Task 1
par(mfrow = c(1, 3), pin = c(2.2, 2.5))
barplot(table(df$happy), main = "happy")
barplot(table(df$stflife), main = "subjective wellbeing")
barplot(table(df$health), main = "subjective health")
The distributions look close to normal, with a slight right skew for happy and subjective wellbeing and a slight left skew for subjective health.
Note also that the happiness and subjective wellbeing scales start from the lower/more negative values, while for subjective health the situation is the opposite - it is better to reverse the health scale for further analysis.
df$health <- factor(df$health, levels = rev(levels(df$health)))
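To verify the recoding, the level order can be printed after the reversal (a minimal check; the exact labels depend on the ESS coding):
levels(df$health)  # should now start from the most negative category, matching the other two scales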
Testing the normality assumption
shapiro.test(as.numeric(df$happy))
##
## Shapiro-Wilk normality test
##
## data: as.numeric(df$happy)
## W = 0.96401, p-value < 2.2e-16
shapiro.test(as.numeric(df$stflife))
##
## Shapiro-Wilk normality test
##
## data: as.numeric(df$stflife)
## W = 0.9682, p-value < 2.2e-16
shapiro.test(as.numeric(df$health))
##
## Shapiro-Wilk normality test
##
## data: as.numeric(df$health)
## W = 0.86442, p-value < 2.2e-16
For each variable the Shapiro-Wilk test returns p < 2.2e-16, so the null hypothesis of normality is formally rejected. With samples this large, however, the test picks up even tiny deviations from normality.
Since visual inspection shows roughly bell-shaped distributions, I treat them as approximately normal and use the Pearson parametric correlation, keeping this caveat in mind.
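As a robustness check against the rejected normality assumption (a small sketch, not part of the assignment), a rank-based Spearman correlation can be computed on the same variables and compared with the Pearson values below:
# Spearman rank correlation is robust to non-normality
cor(data.frame(happy = as.numeric(df$happy),
               stflife = as.numeric(df$stflife),
               health = as.numeric(df$health)),
    method = "spearman", use = "pairwise.complete.obs")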
Formal test
tab_corr(data.frame('happy' = as.numeric(df$happy),
'stflife' = as.numeric(df$stflife),
'health' = as.numeric(df$health)), corr.method = "pearson")
        | happy    | stflife  | health
happy   |          | 0.583*** | 0.271***
stflife | 0.583*** |          | 0.253***
health  | 0.271*** | 0.253*** |
Computed correlation used pearson-method with listwise-deletion.
All correlations are significant at p < 0.05.
Interpretation
The Pearson correlation shows the following:
- happiness and subjective wellbeing have a moderate positive correlation
- happiness and subjective health have a weak positive correlation
- subjective wellbeing and subjective health have a weak positive correlation
The moderate positive correlation between happiness and subjective wellbeing seems obvious and expected, while the weak positive correlations between subjective health and both happiness and subjective wellbeing are quite interesting and may indicate that, for Russians, the concept of subjective health does not have a strong connection with happiness and subjective wellbeing.
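To quantify the uncertainty around a single estimate (a small sketch; cor.test is base R and drops incomplete pairs), a confidence interval can be computed, e.g. for happiness vs subjective wellbeing:
# 95% confidence interval for the happy-stflife Pearson correlation
cor.test(as.numeric(df$happy), as.numeric(df$stflife), method = "pearson")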
Visual results
tmp <- df %>%
mutate(happy = as.numeric(happy),
stflife = as.numeric(stflife),
health = as.numeric(health)) %>%
cor_mat(happy, stflife, health)
par(mfrow = c(1, 1), pin = c(3, 2.5))
corrplot(as.matrix(tmp[1:3, 2:4]), is.corr = TRUE, method = "square", diag = FALSE)
Task 2
It seems obvious to me that a chi-square test should be used here, but the requirement to use both parametric and non-parametric tests made me question that understanding - so I decided to first use a chi-square test treating both variables as factors, and then use ANOVA and its non-parametric analogues.
Chi-square test
mytable <- table(df$rfgfrpc, df$stfgov)
plot_xtab(df$stfgov, df$rfgfrpc, margin = "row", bar.pos = "stack", show.total = FALSE)
It looks like people who take either polar position (extremely dissatisfied / extremely satisfied) on the government satisfaction question tend to be more definite on the refugee question as well - both polar groups have a smaller share of neutral answers.
It seems that uncertain people tend to be uncertain on any question :)
# Degrees of freedom
(nrow(mytable)-1) * (ncol(mytable)-1)
## [1] 40
# Critical value with p-value = 0.05
qchisq(p = 0.95, df = 40)
## [1] 55.75848
With 40 degrees of freedom, the chi-square statistic must exceed the critical value of 55.758 to reject H0 at the 0.05 level.
(mychisq <- chisq.test(mytable))
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 89.794, df = 40, p-value = 1.088e-05
Interpretation
χ²(40, N = 2430) = 89.794, p = 1.088e-05
The chi-square test is significant, so its residuals can be interpreted further.
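Since chisq.test warned that the approximation may be incorrect (likely because of sparse cells), a simulation-based p-value is a simple robustness check (a sketch; B = 10000 is an arbitrary number of replicates):
# Monte Carlo p-value avoids the asymptotic chi-square approximation
chisq.test(mytable, simulate.p.value = TRUE, B = 10000)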
Visual results
corrplot(mychisq$residuals, is.corr = FALSE, method = "square")
Interpretation of residuals
Visual inspection of the test's standardized residuals supports the hypothesis stated above: people in the polar groups on either the government or the refugee question tend not to take a neutral position on the other question. For example, the group extremely dissatisfied with the government is observed more often than expected in both the 'agree strongly' and 'disagree strongly' cells of the refugee question.
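To back the visual reading with numbers (a quick sketch using the stdres component returned by chisq.test), the cells with standardized residuals beyond |2| can be listed:
# cells deviating most from independence (|standardized residual| > 2)
which(abs(mychisq$stdres) > 2, arr.ind = TRUE)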
ANOVA (parametric)
boxplot(as.numeric(df$stfgov) ~ df$rfgfrpc, las = 2)
Treating satisfaction with the government as a numeric variable, the boxplot supports the previously stated hypothesis: the polar groups on the refugee question show more variability in satisfaction with the government - people in a polar group on one question tend to be in a polar group on the other.
Some kind of extreme people, those Russians.
Assumptions
Variance equality
# H0: variances are equal (leveneTest comes from the car package)
leveneTest(as.numeric(df$stfgov) ~ df$rfgfrpc)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 6.8039 1.942e-05 ***
## 2052
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variances are unequal (p < 0.001), so a non-parametric test is preferable, but I will run the classic ANOVA first anyway.
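Given the unequal variances, Welch's ANOVA is the standard parametric remedy (a sketch; oneway.test is base R and does not assume homogeneous variances):
# Welch's one-way test is robust to heteroscedasticity, unlike the classic ANOVA below
oneway.test(as.numeric(stfgov) ~ rfgfrpc, data = df, var.equal = FALSE)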
aov.out <- aov(as.numeric(df$stfgov) ~ df$rfgfrpc)
summary(aov.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## df$rfgfrpc 4 243 60.78 10.64 1.55e-08 ***
## Residuals 2052 11720 5.71
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 373 observations deleted due to missingness
Interpretation
F(4, 2052) = 10.64, p = 1.55e-08
The ANOVA is significant, so let's look at the effect size:
omega_sq(aov(as.numeric(df$stfgov) ~ df$rfgfrpc))
## term omegasq
## 1 df$rfgfrpc 0.018
eta_sq(aov(as.numeric(df$stfgov) ~ df$rfgfrpc))
## term etasq
## 1 df$rfgfrpc 0.02
df %>% mutate(stfgov_num = as.numeric(stfgov)) %>% kruskal_effsize(stfgov_num ~ rfgfrpc)
## # A tibble: 1 x 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 df$rfgfrpc 2430 0.000842 eta2[H] small
The effect size is quite small: neither the tests nor the visual inspection reveal anything interesting beyond the polar groups.
Post-hoc tests and non-parametric analogues
First a parametric post-hoc (pairwise t-tests with pooled SD), then the non-parametric Kruskal-Wallis test with Dunn's post-hoc.
# pairwise t-tests; the adjustment argument is p.adjust.method, Holm (the default) is used
pairwise.t.test(as.numeric(df$stfgov), df$rfgfrpc, p.adjust.method = "holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: as.numeric(df$stfgov) and df$rfgfrpc
##
## Agree strongly Agree Neither agree nor disagree
## Agree 0.00166 - -
## Neither agree nor disagree 2.1e-05 0.43432 -
## Disagree 0.00075 0.98251 0.98251
## Disagree strongly 0.98251 0.00035 6.3e-06
## Disagree
## Agree -
## Neither agree nor disagree -
## Disagree -
## Disagree strongly 0.00015
##
## P value adjustment method: holm
kruskal.test(as.numeric(df$stfgov) ~ df$rfgfrpc)
##
## Kruskal-Wallis rank sum test
##
## data: as.numeric(df$stfgov) by df$rfgfrpc
## Kruskal-Wallis chi-squared = 38.304, df = 4, p-value = 9.697e-08
library(FSA)  # provides dunnTest()
dunnTest(as.numeric(df$stfgov), df$rfgfrpc, method = "bonferroni")
## Comparison Z P.unadj
## 1 Agree - Agree strongly 3.4344140 5.938365e-04
## 2 Agree - Disagree -0.8806108 3.785285e-01
## 3 Agree strongly - Disagree -3.7706265 1.628382e-04
## 4 Agree - Disagree strongly 3.5032439 4.596284e-04
## 5 Agree strongly - Disagree strongly 0.6051745 5.450631e-01
## 6 Disagree - Disagree strongly 3.8312363 1.275010e-04
## 7 Agree - Neither agree nor disagree -1.8361007 6.634276e-02
## 8 Agree strongly - Neither agree nor disagree -4.7372830 2.166026e-06
## 9 Disagree - Neither agree nor disagree -0.6322535 5.272212e-01
## 10 Disagree strongly - Neither agree nor disagree -4.5663148 4.963733e-06
## P.adj
## 1 5.938365e-03
## 2 1.000000e+00
## 3 1.628382e-03
## 4 4.596284e-03
## 5 1.000000e+00
## 6 1.275010e-03
## 7 6.634276e-01
## 8 2.166026e-05
## 9 1.000000e+00
## 10 4.963733e-05
Task 3
The grouping conditions do not seem entirely clear to me - are they strict, so that other groups (such as those who did not even finish middle school) must be removed, or should such groups be merged into the broader group ('not finished middle school' as part of the 'finished middle school' group)?
Anyway, I decided to treat the conditions as strict and to do exactly as written.
df <- df %>% mutate(edu_group = factor(
case_when(
edlvdru == levels(df$edlvdru)[4] ~ "school",
edlvdru == levels(df$edlvdru)[7] ~ "special",
edlvdru %in% levels(df$edlvdru)[8:11] ~ "higher",
), levels = c("school", "special", "higher"), ordered = T))
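A quick check of the resulting group sizes (a minimal sketch; everyone who matched none of the conditions becomes NA):
# group sizes after recoding
table(df$edu_group, useNA = "ifany")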
Assumptions
# Shapiro-Wilk:
shapiro.test(as.numeric(df$rlgdgr)[df$edu_group == "school"])
##
## Shapiro-Wilk normality test
##
## data: as.numeric(df$rlgdgr)[df$edu_group == "school"]
## W = 0.9557, p-value = 7.467e-08
shapiro.test(as.numeric(df$rlgdgr)[df$edu_group == "special"])
##
## Shapiro-Wilk normality test
##
## data: as.numeric(df$rlgdgr)[df$edu_group == "special"]
## W = 0.95489, p-value = 5.261e-15
shapiro.test(as.numeric(df$rlgdgr)[df$edu_group == "higher"])
##
## Shapiro-Wilk normality test
##
## data: as.numeric(df$rlgdgr)[df$edu_group == "higher"]
## W = 0.95837, p-value = 5.194e-14
In all three groups the Shapiro-Wilk test rejects the null hypothesis of normality (all p-values are well below 0.05), though with groups this large the test is sensitive even to small deviations.
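A visual check complements the formal test (a small sketch drawing one normal QQ plot per education group):
# points close to the line indicate approximate normality
par(mfrow = c(1, 3))
for (g in levels(df$edu_group)) {
  x <- as.numeric(df$rlgdgr)[df$edu_group == g]
  qqnorm(x, main = g); qqline(x)
}
par(mfrow = c(1, 1))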
bartlett.test(as.numeric(df$rlgdgr) ~ df$edu_group)
##
## Bartlett test of homogeneity of variances
##
## data: as.numeric(df$rlgdgr) by df$edu_group
## Bartlett's K-squared = 1.2813, df = 2, p-value = 0.5269
Variances can be considered equal (p = 0.527), so the homogeneity-of-variance assumption holds - although normality already failed above.
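Since Bartlett's test itself assumes normality, which was just rejected, Levene's test (from the car package, already used in Task 2) is a more robust check of the same assumption:
# Levene's test is less sensitive to non-normality than Bartlett's test
leveneTest(as.numeric(df$rlgdgr) ~ df$edu_group)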
Chi-square test
mytable <- table(df$edu_group, df$rlgdgr)
(mychisq <- chisq.test(mytable))
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 25.34, df = 20, p-value = 0.1887
corrplot(mychisq$residuals, is.corr = FALSE, method = "square")
Interesting - the chi-square test is not significant, so no post-hoc analysis can be interpreted (but the residual plot above is still too nice not to show). For a genuine mosaic plot of the same table, see the sketch below.
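A base-R mosaic plot (a sketch; cells are shaded by their Pearson residuals):
# mosaic plot of education group vs religiosity, shaded by residuals
mosaicplot(mytable, shade = TRUE, las = 2, main = "Education group vs religiosity")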
Let's try ANOVA:
aov.out <- aov(as.numeric(df$rlgdgr) ~ df$edu_group)
summary(aov.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## df$edu_group 2 34 16.782 2.502 0.0822 .
## Residuals 1874 12570 6.708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 553 observations deleted due to missingness
And even the ANOVA is not significant!
This is starting to look interesting.
# pairwise t-tests; the adjustment argument is p.adjust.method, Holm (the default) is used
pairwise.t.test(as.numeric(df$rlgdgr), df$edu_group, p.adjust.method = "holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: as.numeric(df$rlgdgr) and df$edu_group
##
## school special
## special 0.296 -
## higher 0.079 0.296
##
## P value adjustment method: holm
kruskal.test(as.numeric(df$rlgdgr) ~ df$edu_group)
##
## Kruskal-Wallis rank sum test
##
## data: as.numeric(df$rlgdgr) by df$edu_group
## Kruskal-Wallis chi-squared = 6.0819, df = 2, p-value = 0.04779
dunnTest(as.numeric(df$rlgdgr), df$edu_group, method = "bonferroni")
## Comparison Z P.unadj P.adj
## 1 higher - school 2.455082 0.01408523 0.04225569
## 2 higher - special 1.140480 0.25408653 0.76225960
## 3 school - special -1.621632 0.10488221 0.31464663
Still, no firm conclusions can be drawn even from the non-parametric tests - only the 'higher vs school' comparison is significant (adjusted p = 0.042), and that's all.
Thank you for your attention!