Research question

The survey question “Will People Born Today Have a Better Life Than Their Parents?” allows only the binary answers “Yes” and “No”, and it is coded in the dataset the same way.

I prefer not to build a model from random good-looking predictors, but rather to carry out a very small study on a specific topic: how a respondent’s level of expertise (measured by professional coding experience and level of influence at work) and the structuredness of their work can influence answers to the outcome question.

I think that parameters such as expertise and structuredness can reflect a respondent’s level of confidence, and therefore I assume they will also influence the evaluation of “how hard life is for me and the people around me compared with our parents”. The logic is simple: if you “fit” well into life, feel confident, and your work is structured, you may think your life is better than your parents’ life.

Let’s test that hypothesis!

Predictors

I will use three predictors in my main model: YearsCodePro, WorkPlan, PurchaseWhat.

Continuous:

  • YearsCodePro: How many years have you coded professionally (as a part of your work)?

Categorical:

  • WorkPlan: How structured or planned is your work?
  • PurchaseWhat: What level of influence do you, personally, have over new technology purchases at your organization?

Now I will load the data and relevel the factors, then recode them with shorter names, which will make them easier to read and interpret.

In preprocessing (a code sketch follows this list):

  • the dataset was filtered by age to keep only values from 18 to 60, to avoid outliers
  • the dataset was also filtered on the predictor YearsCodePro to exclude the values “Less than 1 year” and “More than 50 years”, and the variable was then recoded to numeric
  • the numeric predictor YearsCodePro was then centered and scaled
  • all NAs in the predictors PurchaseWhat and WorkPlan were recoded to “Non-responce” values
  • lastly, all variables except YearsCodePro (including the outcome) were recoded to factors:
    • the reference level of WorkPlan is “There’s no schedule or spec; I work on what seems most important or urgent”
    • the reference level of PurchaseWhat is “I have little or no influence”
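
Here is a minimal sketch of those steps in R, assuming the raw survey is loaded as a data frame `raw` (the exact code of the original analysis may differ in details):

```r
library(dplyr)

df <- raw %>%
  # keep ages 18-60 to avoid outliers
  filter(Age >= 18, Age <= 60) %>%
  # drop the extreme experience categories (and remaining NAs in this column)
  filter(!YearsCodePro %in% c("Less than 1 year", "More than 50 years"),
         !is.na(YearsCodePro)) %>%
  mutate(
    # recode to numeric, then center and scale
    YearsCodePro = as.numeric(scale(as.numeric(YearsCodePro))),
    # NAs in the categorical predictors become an explicit level
    WorkPlan     = coalesce(WorkPlan, "Non-responce"),
    PurchaseWhat = coalesce(PurchaseWhat, "Non-responce"),
    BetterLife   = factor(BetterLife)
  ) %>%
  mutate(
    # set the stated reference levels; the long survey answers are then
    # shortened to labels like "-lowStrct" / "-lowInfl" (recoding not shown)
    WorkPlan = relevel(factor(WorkPlan),
      ref = "There's no schedule or spec; I work on what seems most important or urgent"),
    PurchaseWhat = relevel(factor(PurchaseWhat),
      ref = "I have little or no influence")
  )
```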

Model equation

This is what my model’s equation looks like. I’m not interested in any interactions, so I use and evaluate only direct effects on the outcome:

\[\log\left[\frac{P(BetterLife)}{1 - P(BetterLife)}\right] = \beta_{0} + \beta_{1} \cdot YearsCodePro + \beta_{2} \cdot WorkPlan + \beta_{3} \cdot PurchaseWhat\]
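
A minimal sketch of the corresponding glm() call (here on the full prepared data `df`; the final model reported later is fit on a training subset):

```r
# logistic regression: direct effects only, no interactions
glm(BetterLife ~ YearsCodePro + WorkPlan + PurchaseWhat,
    data = df, family = binomial(link = "logit"))
```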

Data description

So here is a quick description of the chosen variables:
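
The grouped table below matches the output format of psych::describeBy; a sketch of the call (the exact variable selection is my assumption):

```r
library(psych)
# descriptive statistics for the model variables, split by the outcome;
# starred rows are factors coerced to numeric codes
describeBy(df[, c("BetterLife", "YearsCodePro", "WorkPlan", "PurchaseWhat")],
           group = df$BetterLife)
```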

## 
##  Descriptive statistics by group 
## group: No
##               vars     n mean   sd median trimmed  mad   min   max range skew
## BetterLife*      1 23242 1.00 0.00   1.00    1.00 0.00  1.00  1.00     0  NaN
## YearsCodePro     2 23242 0.67 7.51  -2.31   -0.49 5.93 -7.31 41.69    49 1.39
## WorkPlan*        3 23242 2.00 0.85   2.00    1.92 1.48  1.00  4.00     3 0.65
## PurchaseWhat*    4 23242 2.08 1.05   2.00    1.97 1.48  1.00  4.00     3 0.62
##               kurtosis   se
## BetterLife*        NaN 0.00
## YearsCodePro      1.65 0.05
## WorkPlan*        -0.08 0.01
## PurchaseWhat*    -0.83 0.01
## ------------------------------------------------------------ 
## group: Yes
##               vars     n  mean   sd median trimmed  mad   min   max range skew
## BetterLife*      1 38735  2.00 0.00   2.00    2.00 0.00  2.00  2.00     0  NaN
## YearsCodePro     2 38735 -0.41 6.67  -2.31   -1.51 4.45 -7.31 37.69    45 1.56
## WorkPlan*        3 38735  2.03 0.83   2.00    1.97 1.48  1.00  4.00     3 0.53
## PurchaseWhat*    4 38735  2.11 1.04   2.00    2.01 1.48  1.00  4.00     3 0.56
##               kurtosis   se
## BetterLife*        NaN 0.00
## YearsCodePro      2.53 0.03
## WorkPlan*        -0.21 0.00
## PurchaseWhat*    -0.88 0.01

From the survey’s report we already know that 63.7% chose “Yes” and 36.3% chose “No” on our outcome question, so the simplest possible model would just say we have a probability of about 0.637 of getting “Yes” as an answer.

From interesting points on that data description:

  • looking at the centrality measures of the (centered!) YearsCodePro predictor, we can see that its mean is 0.67 in the “No” group and -0.41 in the “Yes” group. From that I can assume that respondents in the “No” group generally have more professional coding experience than respondents in the “Yes” group. Such data goes against my hypothesis about years of coding, and I will need to check it in the logistic regression model later.

For the factor variables such a table is not very informative, so I will use plots and cross-tables to examine them instead.

Checking for missing values

I have only about 1 percent of NAs in my data, and only in the outcome variable; all the predictors’ NAs were recoded to “Non-responce” values.

Still, I will omit that 1 percent of NAs so that the model gets really clean data.

Variable scales and distributions

YearsCodePro: How many years have you coded professionally (as a part of your work)?

As can be seen, the bulk of the distribution sits to the left of zero with a long tail to the right, so the distribution is highly right-skewed. Both answer groups show this skew.

WorkPlan: How structured or planned is your work?

Here the proportions of “Yes” and “No” respondents are similar within each work-planning group, except for the “Non-responce” group. It can also be seen that the “Yes”-to-“No” ratio rises slightly in favor of “Yes” as the level of work structuredness increases, which supports my hypothesis a little.

PurchaseWhat: What level of influence do you, personally, have over new technology purchases at your organization?

Here the proportions are similar as well, and the shift in favor of “Yes” is even slighter, so this does not really support my hypothesis, but I will see how it behaves in the model.

Checking distributions per category

I can also look at xtabs for a better view of the distribution across the groups:
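
A sketch of the two calls (output follows):

```r
# outcome in rows, predictor in columns
xtabs(~ BetterLife + WorkPlan, data = df)
xtabs(~ BetterLife + PurchaseWhat, data = df)
```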

##           WorkPlan
## BetterLife -lowStrct -medStrct -highStrct Non-responce
##        No       6865     11086       3758         1533
##        Yes     10540     18567       7449         2179
##           PurchaseWhat
## BetterLife -lowInfl -medInfl -highInfl Non-responce
##        No      8513     7881      3385         3463
##        Yes    13587    13113      6405         5630

We have more respondents with lower levels of structuredness and influence at work; I assume this is simply because the sample contains more junior, younger specialists than senior ones.

We can also notice that the higher the level of structuredness and influence at work, the larger the “Yes”-to-“No” ratio, reaching almost 2:1 at the “high” levels. So that part of my hypothesis gets some support.

Choosing the best model

Before creating a model I will split the dataset into train/test subsets, so I can estimate accuracy after modelling:
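
A sketch of the split; the sizes reported later (46482 training and 15495 test rows) imply roughly a 75/25 split, and the seed here is my placeholder:

```r
set.seed(42)                                                # placeholder seed
idx   <- sample(seq_len(nrow(df)), size = round(0.75 * nrow(df)))
train <- df[idx, ]                                          # ~75% for fitting
test  <- df[-idx, ]                                         # ~25% for accuracy checks
```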

And now let’s build models with different combinations of predictors (and an empty model too) and choose the best one:

First, I will look at the log-likelihood and prefer the model with the value closest to zero:
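
A sketch of the candidate models and the comparison call; the output below matches lmtest::lrtest, and the object names m1–m8 are mine:

```r
library(lmtest)

m1 <- glm(BetterLife ~ YearsCodePro + WorkPlan,                data = train, family = binomial)
m2 <- glm(BetterLife ~ YearsCodePro + PurchaseWhat,            data = train, family = binomial)
m3 <- glm(BetterLife ~ WorkPlan + PurchaseWhat,                data = train, family = binomial)
m4 <- glm(BetterLife ~ YearsCodePro,                           data = train, family = binomial)
m5 <- glm(BetterLife ~ WorkPlan,                               data = train, family = binomial)
m6 <- glm(BetterLife ~ PurchaseWhat,                           data = train, family = binomial)
m7 <- glm(BetterLife ~ YearsCodePro + WorkPlan + PurchaseWhat, data = train, family = binomial)
m8 <- glm(BetterLife ~ 1,                                      data = train, family = binomial)

lrtest(m1, m2, m3, m4, m5, m6, m7, m8)   # likelihood ratio comparison
```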

## Likelihood ratio test
## 
## Model 1: BetterLife ~ YearsCodePro + WorkPlan
## Model 2: BetterLife ~ YearsCodePro + PurchaseWhat
## Model 3: BetterLife ~ WorkPlan + PurchaseWhat
## Model 4: BetterLife ~ YearsCodePro
## Model 5: BetterLife ~ WorkPlan
## Model 6: BetterLife ~ PurchaseWhat
## Model 7: BetterLife ~ YearsCodePro + WorkPlan + PurchaseWhat
## Model 8: BetterLife ~ 1
##   #Df LogLik Df    Chisq Pr(>Chisq)    
## 1   5 -30602                           
## 2   5 -30606  0   8.8352  < 2.2e-16 ***
## 3   7 -30687  2 162.4324  < 2.2e-16 ***
## 4   2 -30646 -5  83.3000  < 2.2e-16 ***
## 5   4 -30713  2 134.5126  < 2.2e-16 ***
## 6   4 -30744  0  61.7700  < 2.2e-16 ***
## 7   8 -30555  4 377.1831  < 2.2e-16 ***
## 8   1 -30766 -7 422.2071  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By log-likelihood I see that the most complex model (the seventh in the list) has the value closest to zero, and all my models are significant.
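
The $Models / $Fit.criteria listing below matches the output of rcompanion::compareGLM; a sketch:

```r
library(rcompanion)
# AIC/BIC and three pseudo-R-squared measures for all candidate models
compareGLM(m1, m2, m3, m4, m5, m6, m7, m8)
```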

## $Models
##   Formula                                              
## 1 "BetterLife ~ YearsCodePro + WorkPlan"               
## 2 "BetterLife ~ YearsCodePro + PurchaseWhat"           
## 3 "BetterLife ~ WorkPlan + PurchaseWhat"               
## 4 "BetterLife ~ YearsCodePro"                          
## 5 "BetterLife ~ WorkPlan"                              
## 6 "BetterLife ~ PurchaseWhat"                          
## 7 "BetterLife ~ YearsCodePro + WorkPlan + PurchaseWhat"
## 8 "BetterLife ~ 1"                                     
## 
## $Fit.criteria
##   Rank Df.res   AIC  AICc   BIC  McFadden Cox.and.Snell Nagelkerke   p.value
## 1    5  46480 61220 61220 61270 0.0053510     0.0070590   0.009619 2.597e-70
## 2    5  46480 61220 61220 61280 0.0052080     0.0068700   0.009361 2.095e-68
## 3    7  46480 61390 61390 61460 0.0025680     0.0033940   0.004624 7.630e-32
## 4    2  46480 61300 61300 61320 0.0039220     0.0051780   0.007056 1.025e-54
## 5    4  46480 61440 61440 61480 0.0017360     0.0022950   0.003127 2.662e-23
## 6    4  46480 61500 61500 61540 0.0007317     0.0009682   0.001319 4.475e-10
## 7    8  46470 61130 61130 61210 0.0068620     0.0090420   0.012320 2.030e-87
## 8    1  46480 61540 61540 61550 0.0000000     0.0000000   0.000000       Inf

By all three pseudo-R² measures the seventh, most complex model is again the best solution, even though every model, including the best one, has very poor pseudo-R² values (around 0.01 or below). So all of my models fit poorly and explain almost nothing, but I can still compare them and pick the best of a bad bunch.

Deciding by log-likelihood and the pseudo-R² coefficients, I choose the seventh, most complex model and continue the analysis with it.

Model summary and interpretation of coefficients
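
The summary below matches the output format of jtools::summ with exponentiated coefficients and VIFs; a sketch, with the chosen model saved as `model`:

```r
library(jtools)
model <- m7                           # the chosen, most complex model
summ(model, exp = TRUE, vifs = TRUE)  # odds ratios with 95% CIs and VIFs
```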

Observations              46482
Dependent variable        BetterLife
Type                      Generalized linear model
Family                    binomial
Link                      logit
𝛘²(7)                     422.21
Pseudo-R² (Cragg-Uhler)   0.01
Pseudo-R² (McFadden)      0.01
AIC                       61126.18
BIC                       61196.15

                           exp(Est.)   2.5%   97.5%   z val.      p    VIF
(Intercept)                     1.40   1.34    1.46    15.09   0.00     NA
YearsCodePro                    0.98   0.98    0.98   -16.28   0.00   1.04
WorkPlan-medStrct               1.10   1.05    1.15     4.15   0.00   1.24
WorkPlan-highStrct              1.29   1.22    1.37     8.76   0.00   1.24
WorkPlanNon-responce            0.88   0.80    0.96    -2.77   0.01   1.24
PurchaseWhat-medInfl            1.07   1.03    1.12     3.09   0.00   1.28
PurchaseWhat-highInfl           1.31   1.24    1.39     9.17   0.00   1.28
PurchaseWhatNon-responce        1.18   1.11    1.26     5.05   0.00   1.28

Standard errors: MLE

Odds ratios

All the odds ratios below are for the odds of answering “Yes”:

  • When YearsCodePro increases by one unit, the odds of BetterLife are multiplied by 0.98, i.e. they decrease by about 2 percent [95% CI = 0.97; 0.98]
  • When WorkPlan changes from “low” to:
    • “med”: the odds increase by 10% (factor of 1.10)
    • “high”: the odds increase by 29% (factor of 1.29)
    • “Non-responce”: the odds decrease by 12% (factor of 0.88)
  • When PurchaseWhat changes from “low” to:
    • “med”: the odds increase by 7% (factor of 1.07)
    • “high”: the odds increase by 31% (factor of 1.31)
    • “Non-responce”: the odds increase by 18% (factor of 1.18)

Confidence intervals

No predictor’s confidence interval crosses 1!

Predicted probabilities

For different combinations of the factor predictors:
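
A sketch of how such a grid can be produced, holding the centered YearsCodePro at its training-set mean:

```r
# every WorkPlan x PurchaseWhat combination at the mean of YearsCodePro
newdat <- expand.grid(
  YearsCodePro = mean(train$YearsCodePro),
  WorkPlan     = levels(train$WorkPlan),
  PurchaseWhat = levels(train$PurchaseWhat)
)
newdat$pred <- predict(model, newdata = newdat, type = "response")
newdat
```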

##    YearsCodePro     WorkPlan PurchaseWhat      pred
## 1  -0.007079771    -lowStrct     -medInfl 0.6005853
## 2  -0.007079771    -medStrct     -medInfl 0.6230403
## 3  -0.007079771   -highStrct     -medInfl 0.6605831
## 4  -0.007079771 Non-responce     -medInfl 0.5696743
## 5  -0.007079771    -lowStrct     -lowInfl 0.5833346
## 6  -0.007079771    -medStrct     -lowInfl 0.6061234
## 7  -0.007079771   -highStrct     -lowInfl 0.6443894
## 8  -0.007079771 Non-responce     -lowInfl 0.5520843
## 9  -0.007079771    -lowStrct Non-responce 0.6229198
## 10 -0.007079771    -medStrct Non-responce 0.6448616
## 11 -0.007079771   -highStrct Non-responce 0.6813427
## 12 -0.007079771 Non-responce Non-responce 0.5925647
## 13 -0.007079771    -lowStrct    -highInfl 0.6476639
## 14 -0.007079771    -medStrct    -highInfl 0.6689312
## 15 -0.007079771   -highStrct    -highInfl 0.7040744
## 16 -0.007079771 Non-responce    -highInfl 0.6180795

Here I see that, at the mean of the centered YearsCodePro, the predicted probabilities for all combinations of the factor groups are above 0.5. That is not a good sign: at the usual 0.5 cutoff the model will classify almost every respondent as “Yes”.

For different YearsCodePro values:

With WorkPlan groups

The groups do not differ much from each other, but the high-structure group does sit above the others, with medium structure just below it and low structure at the bottom (setting the “Non-responce” group aside). Those curves support my hypothesis: the more structured a respondent’s work, the more likely that respondent is to think their life is better than their parents’ life.

But with years of professional coding my hypothesis breaks: the probability decreases as the years increase.

With PurchaseWhat groups

Here the situation is the same: the purchasing-influence curves support my hypothesis, but the probability still decreases with more years of professional coding.

Average marginal effects
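
The table below matches the summary format of the margins package; a sketch:

```r
library(margins)
# average marginal effects: change in P(Yes) per predictor, averaged over the data
summary(margins(model))
```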

##                    factor     AME     SE        z      p   lower   upper
##     PurchaseWhat-highInfl  0.0625 0.0067   9.3303 0.0000  0.0494  0.0756
##      PurchaseWhat-medInfl  0.0168 0.0054   3.0921 0.0020  0.0062  0.0275
##  PurchaseWhatNon-responce  0.0385 0.0075   5.1044 0.0000  0.0237  0.0533
##        WorkPlan-highStrct  0.0593 0.0067   8.8503 0.0000  0.0462  0.0725
##         WorkPlan-medStrct  0.0222 0.0054   4.1365 0.0000  0.0117  0.0327
##      WorkPlanNon-responce -0.0306 0.0111  -2.7465 0.0060 -0.0524 -0.0088
##              YearsCodePro -0.0052 0.0003 -16.4590 0.0000 -0.0058 -0.0046

  • if the centered YearsCodePro increases by one unit, the probability of “Yes” decreases by about 0.005 (half a percentage point)
  • if PurchaseWhat changes from “low” (controlling for YearsCodePro):
    • to “medium”: the probability is higher by about 0.017
    • to “high”: the probability is higher by about 0.063
    • to “Non-responce”: the probability is higher by about 0.039
  • if WorkPlan changes from “low” (controlling for YearsCodePro):
    • to “medium”: the probability is higher by about 0.022
    • to “high”: the probability is higher by about 0.059
    • to “Non-responce”: the probability is lower by about 0.031

And here I finally see that my hypotheses about structuredness and influence were right, but the one about years of professional coding was not!

Accuracy, ROC and Osius and Rojek’s test
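
A sketch of the evaluation, assuming a 0.5 cutoff and caret::confusionMatrix:

```r
library(caret)
probs <- predict(model, newdata = test, type = "response")  # P(Yes) on the test set
preds <- factor(ifelse(probs > 0.5, "Yes", "No"), levels = c("No", "Yes"))
confusionMatrix(data = preds, reference = test$BetterLife)
```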

## Confusion Matrix and Statistics
## 
##      
## data    No  Yes
##   No   128  106
##   Yes 5653 9608
##                                           
##                Accuracy : 0.6283          
##                  95% CI : (0.6207, 0.6359)
##     No Information Rate : 0.6269          
##     P-Value [Acc > NIR] : 0.3607          
##                                           
##                   Kappa : 0.0139          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.022141        
##             Specificity : 0.989088        
##          Pos Pred Value : 0.547009        
##          Neg Pred Value : 0.629579        
##              Prevalence : 0.373088        
##          Detection Rate : 0.008261        
##    Detection Prevalence : 0.015102        
##       Balanced Accuracy : 0.505615        
##                                           
##        'Positive' Class : No              
## 

With such an accuracy (0.6283) against the no-information rate (0.6269, p = 0.36), I must conclude that my model is not statistically significantly better than no model at all.

Looking at ROC curve:
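
A sketch with pROC (the original output below also prints an interpretation scale, which looks like a course helper rather than a standard package function):

```r
library(pROC)
roc_obj <- roc(response = test$BetterLife, predictor = probs)
auc(roc_obj)      # area under the ROC curve
ci.auc(roc_obj)   # 95% CI for the AUC
```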

An AUC of 55.3% shows that my model is practically “useless”, according to the interpretation scale printed with the output:

##          auc lower 95% CI upper 95% CI 
##     55.33457     54.79547     55.87368 
## attr(,"interpret")
## [1] "auc = 0.5       --> useless"   "0.7 < auc < 0.8 --> good"     
## [3] "0.8 < auc < 0.9 --> excellent"

Checking for a significant difference between the model and the observed data:
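
The battery of tests below (HL, mHL, OsRo, Sst*, Sll*) matches the output of LogisticDx::gof; a sketch:

```r
library(LogisticDx)
gof(model, plotROC = FALSE)   # Hosmer-Lemeshow, Osius-Rojek and related tests
```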

##          test  stat        val df       pVal
## 1:         HL chiSq 8.82096889  8 0.35762333
## 2:        mHL     F 0.55712121  9 0.83235596
## 3:       OsRo     Z 2.01124177 NA 0.04429993
## 4: SstPgeq0.5     Z 0.22523135 NA 0.82179931
## 5:   SstPl0.5     Z 0.52881251 NA 0.59693552
## 6:    SstBoth chiSq 0.33037183  2 0.84773608
## 7: SllPgeq0.5 chiSq 0.05072991  1 0.82179802
## 8:   SllPl0.5 chiSq 0.28142922  1 0.59576598
## 9:    SllBoth chiSq 0.57259517  2 0.75103909

Our goodness-of-fit tests show that:

  • the Hosmer-Lemeshow test gives p = 0.36
  • the Osius-Rojek test gives p = 0.04

It is better to trust the Osius-Rojek test here: its p-value of 0.04 (supported by the other diagnostics) indicates that my model fits the observed data poorly.

Model diagnostics

Linearity
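
The curvature test below matches car::residualPlots output (only the numeric predictor gets a test statistic); a sketch:

```r
library(car)
residualPlots(model)   # component+residual plots with curvature tests
```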

##              Test stat Pr(>|Test stat|)
## YearsCodePro    0.0578           0.8099
## WorkPlan                               
## PurchaseWhat

The residual plot looks strange; maybe the problem lies here?

The same here.

Multicollinearity
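
A sketch of the call (car::vif reports generalized VIFs for factor terms):

```r
library(car)
vif(model)   # GVIF^(1/(2*Df)) is the value comparable across terms
```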

##                  GVIF Df GVIF^(1/(2*Df))
## YearsCodePro 1.037050  1        1.018356
## WorkPlan     1.244897  3        1.037183
## PurchaseWhat 1.280756  3        1.042104

I don’t have any VIF above 10, which is good!

Outliers
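
A sketch of the call (car::outlierTest, Bonferroni-adjusted studentized residuals):

```r
library(car)
outlierTest(model)   # largest studentized residual with Bonferroni p
```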

## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##       rstudent unadjusted p-value Bonferroni p
## 7009 -1.634313            0.10219           NA

Some points could be outliers, but I don’t think that changes my conclusions: I already have a poorly fitting model :(

Cook’s distance
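
A sketch of the call (the StudRes/Hat/CookD columns match car::influencePlot):

```r
library(car)
influencePlot(model)   # flags high-leverage / high-influence observations
```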

##          StudRes          Hat        CookD
## 19488 -1.0022709 0.0009808869 8.009260e-05
## 49428  1.3146992 0.0008639994 1.484042e-04
## 7009  -1.6343127 0.0002356230 8.252462e-05
## 61839  1.3462571 0.0008809057 1.625163e-04
## 56802 -1.6343127 0.0002356230 8.252462e-05
## 8906  -0.9847613 0.0010307213 8.049172e-05

All the Cook’s distances here are tiny, so I don’t have influential outliers. That’s great!

Conclusion

This was a really interesting topic; I like logistic regression and will use it in the future (simply because it is interesting).

In the end I partly confirmed my hypotheses, those about structuredness and influence, but disproved the one about years of professional coding. Still, with such effect sizes, pseudo-R² values, and accuracy I can’t claim that I’ve built a good model. The data was interesting too, but strange.

Thanks for the topic!