ht4

Dmitriy Tsimokha

23.11.20

Task 1

par(mfrow = c(1, 3), pin = c(2.2, 2.5))
barplot(table(df$happy), main = "happy")
barplot(table(df$stflife), main = "subjective wellbeing")
barplot(table(df$health), main = "subjective health")

The distributions look close to normal, with slight right skewness for happiness and subjective wellbeing and slight left skewness for subjective health.

Also, the happiness and subjective wellbeing scales both start from the lower/more negative values, while for subjective health the situation is the opposite - so it is better to reverse the health scale for further analysis.

df$health <- factor(df$health, levels = rev(levels(df$health)))  # align the scale direction with the other two variables
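
A quick sanity check that the reversal worked (a minimal sketch):

levels(df$health)  # should now list the levels in the reverse of the original order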

Testing the normality assumption

shapiro.test(as.numeric(df$happy))
## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(df$happy)
## W = 0.96401, p-value < 2.2e-16
shapiro.test(as.numeric(df$stflife))
## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(df$stflife)
## W = 0.9682, p-value < 2.2e-16
shapiro.test(as.numeric(df$health))
## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(df$health)
## W = 0.86442, p-value < 2.2e-16

For each variable the Shapiro-Wilk test returns p < 2.2e-16, so the null hypothesis of normality is rejected: formally, none of the distributions is normal.

However, with samples this large the Shapiro-Wilk test flags even trivial deviations, and by visual inspection the distributions look close enough to normal, so the parametric Pearson correlation is still used here (with a rank-based robustness check below).
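
Since normality is formally rejected, Spearman’s rank correlation makes a useful robustness check; a minimal sketch for one pair of variables:

# Spearman's rho does not assume normality; with many ties an exact
# p-value is impossible, so exact = FALSE avoids the warning
cor.test(as.numeric(df$happy), as.numeric(df$stflife),
         method = "spearman", exact = FALSE)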

Formal test

# tab_corr() from the sjPlot package
tab_corr(data.frame('happy' = as.numeric(df$happy), 
                    'stflife' = as.numeric(df$stflife), 
                    'health' = as.numeric(df$health)), corr.method = "pearson")
          happy      stflife    health
happy                0.583***   0.271***
stflife   0.583***              0.253***
health    0.271***   0.253***
Computed correlation used pearson-method with listwise-deletion.

All correlations are significant (*** corresponds to p < 0.001).

Interpretation

The Pearson correlations show the following:
- happiness and subjective wellbeing have a moderate positive correlation (r = 0.58)
- happiness and subjective health have a weak positive correlation (r = 0.27)
- subjective wellbeing and subjective health have a weak positive correlation (r = 0.25)

The moderate positive correlation between happiness and subjective wellbeing seems obvious and expected, while the weak positive correlations between subjective health and both happiness and subjective wellbeing are quite interesting and may indicate that, for Russians, the concept of subjective health does not have a strong connection with happiness and subjective wellbeing.

Visual results

tmp <- df %>%
  mutate(happy = as.numeric(happy), 
         stflife = as.numeric(stflife), 
         health = as.numeric(health)) %>%  # no df$ needed inside mutate()
  cor_mat(happy, stflife, health)          # cor_mat() from the rstatix package
par(mfrow = c(1, 1), pin = c(3, 2.5))
corrplot(as.matrix(tmp[1:3, 2:4]), is.corr = TRUE, method = "square", diag = FALSE)

Task 2

To me it seems obvious that a chi-square test should be used here, but the requirement to use both parametric and non-parametric tests made me question that understanding - so I decided to first run a chi-square test treating both variables as factors, and then use ANOVA and its non-parametric analogues.

Chi-square test

mytable <- table(df$rfgfrpc, df$stfgov)
plot_xtab(df$stfgov, df$rfgfrpc, margin = "row", bar.pos = "stack", show.total = FALSE)

It looks like people who take a polar position (extremely dissatisfied / extremely satisfied) on the government satisfaction question tend to be more decisive on the refugee question as well - both polar groups have a lower percentage of neutral answers.

Seems that uncertain people tend to be uncertain about any question :)

# Degrees of freedom
(nrow(mytable)-1) * (ncol(mytable)-1)
## [1] 40
# Critical value with p-value = 0.05
qchisq(p = 0.95, df = 40)
## [1] 55.75848

With 40 degrees of freedom, the chi-square statistic must exceed the critical value of 55.758 to reject H0 at the 0.05 level.

(mychisq <- chisq.test(mytable))
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 89.794, df = 40, p-value = 1.088e-05
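
The warning above usually means some expected cell counts are below 5; a quick check plus a Monte-Carlo fallback (a sketch, reusing mytable and mychisq from above):

# count the cells whose expected frequency is below 5
sum(mychisq$expected < 5)
# a simulation-based p-value sidesteps the approximation problem
chisq.test(mytable, simulate.p.value = TRUE, B = 10000)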

Interpretation

χ²(40, N = 2430) = 89.794, p = 1.088e-05

The chi-square test is significant, so the results can be interpreted further.

Visual results

corrplot(mychisq$residuals, is.corr = FALSE, method = "square")

Interpretation of residuals

Visual inspection of the test’s standardized residuals confirms the stated hypothesis that people in the polar groups on either the government or the refugee question tend not to hold a neutral position on the other one: respondents extremely dissatisfied with the government are observed more often than expected in both the ‘agree strongly’ and ‘disagree strongly’ categories of the refugee question.

ANOVA (parametric)

boxplot(as.numeric(df$stfgov) ~ df$rfgfrpc, las = 2)

Treating satisfaction with the government as a numeric variable, this boxplot supports the previously stated hypothesis: people in the polar groups on the refugee question show more variability in government satisfaction - being in a polar group on one question goes together with being in a polar group on the other.

Some kind of extreme people those Russians.

Assumptions

Variance equality

# H0: variances are equal
leveneTest(as.numeric(df$stfgov) ~ df$rfgfrpc)  # leveneTest() from the car package
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    4  6.8039 1.942e-05 ***
##       2052                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variances are unequal (Levene’s test rejects H0), so a non-parametric test would be more appropriate - but first I will run the ANOVA anyway.
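
With heterogeneous variances, Welch’s ANOVA is the safer parametric option, since it does not assume equal variances; a minimal sketch:

# oneway.test() with var.equal = FALSE is Welch's heteroscedasticity-robust ANOVA
oneway.test(as.numeric(df$stfgov) ~ df$rfgfrpc, var.equal = FALSE)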

aov.out <- aov(as.numeric(df$stfgov) ~ df$rfgfrpc)
summary(aov.out)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## df$rfgfrpc     4    243   60.78   10.64 1.55e-08 ***
## Residuals   2052  11720    5.71                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 373 observations deleted due to missingness

Interpretation

F(4, 2052) = 10.64, p = 1.55e-08

The ANOVA is significant, so let’s look at the effect size:

omega_sq(aov(as.numeric(df$stfgov) ~ df$rfgfrpc))  # omega_sq() and eta_sq() from sjstats
##         term omegasq
## 1 df$rfgfrpc   0.018
eta_sq(aov(as.numeric(df$stfgov) ~ df$rfgfrpc))
##         term etasq
## 1 df$rfgfrpc  0.02
df %>% kruskal_effsize(df$rfgfrpc ~ as.numeric(df$stfgov))  # kruskal_effsize() from rstatix; note it expects outcome ~ group, so this formula is reversed
## # A tibble: 1 x 5
##   .y.            n  effsize method  magnitude
## * <chr>      <int>    <dbl> <chr>   <ord>    
## 1 df$rfgfrpc  2430 0.000842 eta2[H] small

The effect size is quite small: neither the tests nor visual inspection reveal anything interesting beyond the polar-respondents pattern.

Post-hoc tests and non-parametric analogues

pairwise.t.test(as.numeric(df$stfgov), df$rfgfrpc, adjust = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  as.numeric(df$stfgov) and df$rfgfrpc 
## 
##                            Agree strongly Agree   Neither agree nor disagree
## Agree                      0.00166        -       -                         
## Neither agree nor disagree 2.1e-05        0.43432 -                         
## Disagree                   0.00075        0.98251 0.98251                   
## Disagree strongly          0.98251        0.00035 6.3e-06                   
##                            Disagree
## Agree                      -       
## Neither agree nor disagree -       
## Disagree                   -       
## Disagree strongly          0.00015 
## 
## P value adjustment method: holm
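
Note that pairwise.t.test() has no adjust argument (it is effectively ignored), which is why the output above reports the default Holm correction; the intended Bonferroni version would be:

# p.adjust.method is the correct argument name for the correction
pairwise.t.test(as.numeric(df$stfgov), df$rfgfrpc, p.adjust.method = "bonferroni")
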
kruskal.test(as.numeric(df$stfgov) ~ df$rfgfrpc)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  as.numeric(df$stfgov) by df$rfgfrpc
## Kruskal-Wallis chi-squared = 38.304, df = 4, p-value = 9.697e-08
dunnTest(as.numeric(df$stfgov), df$rfgfrpc, method = "bonferroni")  # dunnTest() from the FSA package
##                                        Comparison          Z      P.unadj
## 1                          Agree - Agree strongly  3.4344140 5.938365e-04
## 2                                Agree - Disagree -0.8806108 3.785285e-01
## 3                       Agree strongly - Disagree -3.7706265 1.628382e-04
## 4                       Agree - Disagree strongly  3.5032439 4.596284e-04
## 5              Agree strongly - Disagree strongly  0.6051745 5.450631e-01
## 6                    Disagree - Disagree strongly  3.8312363 1.275010e-04
## 7              Agree - Neither agree nor disagree -1.8361007 6.634276e-02
## 8     Agree strongly - Neither agree nor disagree -4.7372830 2.166026e-06
## 9           Disagree - Neither agree nor disagree -0.6322535 5.272212e-01
## 10 Disagree strongly - Neither agree nor disagree -4.5663148 4.963733e-06
##           P.adj
## 1  5.938365e-03
## 2  1.000000e+00
## 3  1.628382e-03
## 4  4.596284e-03
## 5  1.000000e+00
## 6  1.275010e-03
## 7  6.634276e-01
## 8  2.166026e-05
## 9  1.000000e+00
## 10 4.963733e-05

Task 3

The grouping conditions do not seem entirely clear to me: are they strict, so that other groups (like those who did not finish even middle school) must be removed, or should such groups be included in the level above (‘not finished middle school’ counted as part of the ‘finished middle school’ group)?

Anyway, I decided to treat the conditions as strict and do exactly as written.

df <- df %>% mutate(edu_group = factor(
  case_when(
    edlvdru == levels(df$edlvdru)[4] ~ "school",
    edlvdru == levels(df$edlvdru)[7] ~ "special",
    edlvdru %in% levels(df$edlvdru)[8:11] ~ "higher"
  ),  # unmatched cases become NA by default
  levels = c("school", "special", "higher"), ordered = TRUE))
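
A quick check of the resulting group sizes (a sketch; respondents matching none of the conditions get NA and drop out of the analysis):

table(df$edu_group, useNA = "ifany")  # group counts, including the NA remainder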

Assumptions

# Shapiro-Wilk:
shapiro.test(as.numeric(df$rlgdgr)[df$edu_group == "school"])
## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(df$rlgdgr)[df$edu_group == "school"]
## W = 0.9557, p-value = 7.467e-08
shapiro.test(as.numeric(df$rlgdgr)[df$edu_group == "special"])
## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(df$rlgdgr)[df$edu_group == "special"]
## W = 0.95489, p-value = 5.261e-15
shapiro.test(as.numeric(df$rlgdgr)[df$edu_group == "higher"])
## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(df$rlgdgr)[df$edu_group == "higher"]
## W = 0.95837, p-value = 5.194e-14

In all three groups the Shapiro-Wilk test rejects normality (all p-values are far below 0.05), so none of the distributions can be considered normal.
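
With samples this large Shapiro-Wilk is very sensitive, so QQ-plots are worth a look before writing off parametric tests; a minimal sketch:

par(mfrow = c(1, 3))
for (g in levels(df$edu_group)) {
  x <- as.numeric(df$rlgdgr)[df$edu_group == g]
  qqnorm(x, main = g)  # sample quantiles vs. theoretical normal quantiles
  qqline(x)            # reference line through the first and third quartiles
}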

bartlett.test(as.numeric(df$rlgdgr) ~ df$edu_group)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  as.numeric(df$rlgdgr) by df$edu_group
## Bartlett's K-squared = 1.2813, df = 2, p-value = 0.5269

Variances, on the other hand, are homogeneous: Bartlett’s test does not reject H0 (p = 0.53). The problem for parametric tests here is the non-normality, not the variances.
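
Since Bartlett’s test is itself sensitive to non-normality, the more robust Levene’s test (already used in Task 2) is a reasonable cross-check; a sketch:

# leveneTest() from the car package centers on group medians by default, which is robust
leveneTest(as.numeric(df$rlgdgr) ~ df$edu_group)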

Chi-square test

mytable <- table(df$edu_group, df$rlgdgr)
(mychisq <- chisq.test(mytable))
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 25.34, df = 20, p-value = 0.1887
corrplot(mychisq$residuals, is.cor=FALSE, method="square")

Interesting - the chi-square test is not significant, so no post-hoc analysis can be done (but the residual plot is still too nice not to draw it).
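
For an actual mosaic plot (the figure above is a corrplot of residuals), base R’s mosaicplot() shades cells by Pearson residuals; a minimal sketch reusing mytable from above:

# shade = TRUE colors cells by the sign and magnitude of the Pearson residuals
mosaicplot(mytable, shade = TRUE, las = 2)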

Let’s try ANOVA:

aov.out <- aov(as.numeric(df$rlgdgr) ~ df$edu_group)
summary(aov.out)
##                Df Sum Sq Mean Sq F value Pr(>F)  
## df$edu_group    2     34  16.782   2.502 0.0822 .
## Residuals    1874  12570   6.708                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 553 observations deleted due to missingness

And even the ANOVA is not significant!

This is starting to look interesting.

pairwise.t.test(as.numeric(df$rlgdgr), df$edu_group, adjust = "bonferroni")  # as in Task 2, 'adjust' is ignored and the default Holm correction is applied
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  as.numeric(df$rlgdgr) and df$edu_group 
## 
##         school special
## special 0.296  -      
## higher  0.079  0.296  
## 
## P value adjustment method: holm
kruskal.test(as.numeric(df$rlgdgr) ~ df$edu_group)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  as.numeric(df$rlgdgr) by df$edu_group
## Kruskal-Wallis chi-squared = 6.0819, df = 2, p-value = 0.04779
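
Mirroring Task 2, an eta-squared effect size for the Kruskal-Wallis result can be computed directly from the statistic, assuming the usual formula eta2[H] = (H - k + 1) / (n - k):

# H = 6.0819, k = 3 groups, n = 1877 complete cases (from the ANOVA above)
(6.0819 - 3 + 1) / (1877 - 3)  # ~0.0022, a very small effect
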
library(FSA)
dunnTest(as.numeric(df$rlgdgr), df$edu_group, method = "bonferroni")
##         Comparison         Z    P.unadj      P.adj
## 1  higher - school  2.455082 0.01408523 0.04225569
## 2 higher - special  1.140480 0.25408653 0.76225960
## 3 school - special -1.621632 0.10488221 0.31464663

Still, no solid conclusions can be drawn even with the non-parametric tests - only the ‘higher vs school’ comparison is significant (adjusted p = 0.042), and that is all.

Thank you for your attention!