link to the code on repo

link to the GitHub Page

Preparations

First, I transform the chosen variables into more usable forms and omit NA values so that the clustering runs on complete cases:

(I also scale the YearsCode variable, but without centering it, to avoid getting NaNs when I take its logarithm later.)
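A minimal sketch of why scaling without centering keeps the logarithm safe. The values below are made up for illustration and are not the survey data; the variable names are my own.

```r
# Toy stand-in for the survey's YearsCode column (values are made up).
years_code <- c(1, 5, 10, 23, 50)

# With center = FALSE, scale() divides by the root mean square
# sqrt(sum(x^2) / (n - 1)) instead of the standard deviation.
years_scaled <- as.numeric(scale(years_code, center = FALSE, scale = TRUE))

# Every scaled value stays strictly positive, so log() is defined.
log_years <- log(years_scaled)

# Centering first would produce negative values, and log() of those is NaN.
years_centered <- as.numeric(scale(years_code))
log_centered <- suppressWarnings(log(years_centered))
```

The scaled values keep the same ratios as the raw years, which is why the minimum and maximum in a summary of the scaled variable are still positive.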

Choosing variables

To cluster R users meaningfully, I take the following variables:

  • MainBranch: how close the respondent is to development and programming (but what about data analysis? That is not development, yet it too depends heavily on writing code, especially ML-related jobs)
  • Employment: the job format, recoded into two groups
  • EdLevel: the respondent’s level of education, also recoded, but into three groups
  • YearsCode: a scaled continuous variable for how many years the respondent has written code professionally

With these variables I am trying to cluster respondents by their level of coding experience, which is why I chose the small set described above.
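The recoding into fewer groups can be sketched as follows. The raw answer labels here are hypothetical (the real survey labels may differ); only the two target group names come from this report.

```r
# Hypothetical raw answers; the real survey labels may differ.
employment_raw <- factor(c("Employed full-time", "Employed part-time",
                           "Not employed", "Employed full-time"))

employment <- employment_raw
# Assigning a named list to levels() merges old levels into new ones.
levels(employment) <- list(
  "Employed" = "Employed full-time",
  "Partly or not employed" = c("Employed part-time", "Not employed")
)

table(employment)
```

The same pattern would collapse education answers into three groups, with one list entry per target group.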

Exploring data

After preparing and recoding the data and choosing the variables, I need to describe the structure of the dataset:

## $MainBranch
##        Developer          Student Partly-developer         Hobbyist 
##             2650              628             1271              110 
##     Ex-developer 
##               99 
## 
## $Employment
##               Employed Partly or not employed 
##                   3439                   1319 
## 
## $EdLevel
##         Bachelor degree Master of higher degree              Pre-degree 
##                    1672                    2319                     767 
## 
## $YearsCode
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03174 0.25391 0.73000 0.82737 1.42825 1.58695

  • MainBranch: the biggest group is developers (2650, more than half of the whole sample). Respondents who only partly use code in their work form a group about half that size (1271), then come about 630 students, and finally hobbyists and ex-developers with about a hundred respondents each.
  • Employment: nearly three quarters of the sample (3439 of 4758) are employed, and the rest are partly or not employed.
  • EdLevel: about half of the sample hold a master’s degree or higher (PhD and so on), roughly a third hold a bachelor’s degree (1672), and the smallest group (767) has no degree yet.
  • YearsCode: this variable is scaled (but not centered), so its values are no longer in years and there is little to describe at this point.

Calculating distance

Before clustering anything I need to calculate distances. I choose the daisy() function with the Gower distance metric because three variables are factors and only one is continuous, so the algorithm and metric have to work with mixed data:

(I also drop the respondent IDs for this step because such a variable is meaningless for clustering. I still keep it in my dataset in case I need it later.)

## 11316903 dissimilarities, summarized :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2918  0.5040  0.4792  0.6561  1.0000 
## Metric :  mixed ;  Types = N, N, N, I 
## Number of objects : 4758

The distances are calculated, and the mean is about 0.48. From that I expect observations to be concentrated in the middle, with a lot of variation and overlapping clusters. That is not ideal, but I will still test my set of variables.
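A minimal sketch of the distance step on toy data, assuming the cluster package. The data frame and its values are invented for illustration; only the function daisy() and the Gower metric are taken from the text.

```r
library(cluster)  # provides daisy()

# Toy mixed-type data, not the survey itself.
toy <- data.frame(
  MainBranch = factor(c("Developer", "Student", "Developer", "Hobbyist")),
  Employment = factor(c("Employed", "Partly or not employed",
                        "Employed", "Employed")),
  YearsCode  = c(0.2, 0.1, 1.5, 0.8)
)

# Gower handles factors (match / mismatch) and numerics (range-scaled
# absolute difference) in one dissimilarity between 0 and 1.
gower_dist <- daisy(toy, metric = "gower")
summary(gower_dist)

# n objects give n * (n - 1) / 2 pairwise dissimilarities.
n_pairs <- function(n) n * (n - 1) / 2
n_pairs(4758)  # 11316903, the count reported for the 4758 respondents
```

The pair count is a quick sanity check that no rows were lost between preparing the data and computing the distances.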

Most similar cases

##    Respondent       MainBranch Employment                 EdLevel YearsCode
## 1:       1337 Partly-developer   Employed Master of higher degree  1.586949
## 2:         20 Partly-developer   Employed Master of higher degree  1.555210

They look like near-identical twins, nice! The two respondents match on every factor and differ only slightly in YearsCode.

Most dissimilar cases

##    Respondent       MainBranch             Employment                 EdLevel
## 1:       1261          Student Partly or not employed         Bachelor degree
## 2:         20 Partly-developer               Employed Master of higher degree
##     YearsCode
## 1: 0.03173898
## 2: 1.55520985

So this is a bachelor student who is partly or not employed versus an employed partly-developer with a master’s or higher degree, with a large difference in years of coding professionally. That works nicely too, although I expected a developer with a master’s degree versus a partly employed respondent without a degree.
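One way to pull out the closest and the farthest pair from a dissimilarity object is sketched below. It is an assumption about the approach, not the exact code behind the output above. The toy data are invented, and the diagonal is masked so that a row is never matched with itself.

```r
library(cluster)

# Toy mixed-type data, not the survey itself.
toy <- data.frame(
  MainBranch = factor(c("Developer", "Student", "Developer", "Hobbyist")),
  Employment = factor(c("Employed", "Partly or not employed",
                        "Employed", "Employed")),
  YearsCode  = c(0.2, 0.1, 1.5, 0.8)
)

gower_mat <- as.matrix(daisy(toy, metric = "gower"))

# Mask the zero diagonal (each row compared with itself).
diag(gower_mat) <- NA

# which(..., arr.ind = TRUE) returns row / column positions; the matrix is
# symmetric, so the first hit is enough.
closest  <- which(gower_mat == min(gower_mat, na.rm = TRUE), arr.ind = TRUE)[1, ]
farthest <- which(gower_mat == max(gower_mat, na.rm = TRUE), arr.ind = TRUE)[1, ]

toy[closest, ]   # most similar pair of rows
toy[farthest, ]  # most dissimilar pair of rows
```

If exact duplicates exist in the data, this sketch returns one of them as the closest pair; excluding zero distances first would instead return the closest non-identical pair.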

Dendrogram

This looks like too many clusters, more than 10, which is not great. I think it depends heavily on the data, which has some issues, so there is not much I can do about it.
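A dendrogram over Gower distances can be sketched like this. The linkage method (complete) and the toy data are my assumptions, since the report does not say which settings were used.

```r
library(cluster)

set.seed(1)
# Toy mixed-type data, not the survey itself.
toy <- data.frame(
  MainBranch = factor(sample(c("Developer", "Student", "Partly-developer"),
                             40, replace = TRUE)),
  Employment = factor(sample(c("Employed", "Partly or not employed"),
                             40, replace = TRUE)),
  YearsCode  = runif(40, min = 0.03, max = 1.6)
)

gower_dist <- daisy(toy, metric = "gower")

# hclust() expects a "dist" object; as.dist() makes the conversion explicit.
tree <- hclust(as.dist(gower_dist), method = "complete")
plot(tree, labels = FALSE, hang = -1,
     main = "Complete linkage on Gower distance")

# Cutting the tree at a chosen number of clusters gives group sizes to inspect.
groups <- cutree(tree, k = 3)
table(groups)
```

Trying a few values of k in cutree() is a cheap way to see whether a coarser cut of a busy dendrogram still gives groups of usable size.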

PAM

First, I will cluster the data with PAM and describe my clusters:

I choose 5 clusters as the optimal solution: more clusters can increase the value, but not by much, and that many clusters would be hard to interpret.
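The report does not name the criterion used to compare cluster counts. A common choice with PAM is the average silhouette width, so the sketch below assumes that. The toy data and the range of k are invented for illustration.

```r
library(cluster)

set.seed(42)
# Toy mixed-type data, not the survey itself.
toy <- data.frame(
  MainBranch = factor(sample(c("Developer", "Student", "Partly-developer"),
                             60, replace = TRUE)),
  Employment = factor(sample(c("Employed", "Partly or not employed"),
                             60, replace = TRUE)),
  YearsCode  = runif(60, min = 0.03, max = 1.6)
)
gower_dist <- daisy(toy, metric = "gower")

# Average silhouette width for each candidate k; higher means tighter,
# better separated clusters.
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  pam(gower_dist, k = k, diss = TRUE)$silinfo$avg.width
})
plot(ks, avg_sil, type = "b",
     xlab = "Number of clusters", ylab = "Average silhouette width")

# Fit the chosen solution and look at its medoids (the most central rows).
fit <- pam(gower_dist, k = 5, diss = TRUE)
toy[fit$id.med, ]
table(fit$clustering)
```

When the curve flattens, picking the smaller k trades a small loss in the criterion for clusters that are easier to describe.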

With the medoids and a plot I can describe the clusters and their dissimilarities, but first I will visualize my solution:

Visualizing solution

It looks awful, but I have three factor variables and only one numeric one, so this is acceptable for such a set of variables and the given data. Again, a plot like this says nothing about the clusters themselves, so it is better to describe them with a summary:

Describing clusters

Cluster 1
Respondent     MainBranch             Employment                   EdLevel                       YearsCode        cluster
Min.   :    6  Developer       :   0  Employed              : 327  Bachelor degree        : 380  Min.   :0.03174  Min.   :1
1st Qu.:23198  Student         :  19  Partly or not employed:  71  Master of higher degree:   0  1st Qu.:0.38087  1st Qu.:1
Median :44624  Partly-developer: 335  NA                           Pre-degree             :  18  Median :0.76174  Median :1
Mean   :45111  Hobbyist        :  19  NA                           NA                            Mean   :0.85552  Mean   :1
3rd Qu.:67644  Ex-developer    :  25  NA                           NA                            3rd Qu.:1.42825  3rd Qu.:1
Max.   :88375  NA                     NA                           NA                            Max.   :1.58695  Max.   :1

The first cluster is mostly employed partly-developers with a bachelor degree; their mean scaled YearsCode is about 0.86.

Cluster 2
Respondent     MainBranch             Employment                   EdLevel                       YearsCode        cluster
Min.   :   10  Developer       :1361  Employed              :1233  Bachelor degree        :   0  Min.   :0.03174  Min.   :2
1st Qu.:22652  Student         :   6  Partly or not employed: 184  Master of higher degree:1325  1st Qu.:0.19043  1st Qu.:2
Median :44800  Partly-developer:   0  NA                           Pre-degree             :  92  Median :0.50782  Median :2
Mean   :44724  Hobbyist        :  15  NA                           NA                            Mean   :0.70386  Mean   :2
3rd Qu.:67209  Ex-developer    :  35  NA                           NA                            3rd Qu.:1.42825  3rd Qu.:2
Max.   :88830  NA                     NA                           NA                            Max.   :1.58695  Max.   :2

The second cluster is primarily employed developers with a master’s or higher degree and the lowest mean scaled YearsCode, about 0.70.

Cluster 3
Respondent     MainBranch             Employment                   EdLevel                       YearsCode        cluster
Min.   :   12  Developer       : 114  Employed              :  25  Bachelor degree        : 168  Min.   :0.03174  Min.   :3
1st Qu.:22976  Student         : 562  Partly or not employed: 743  Master of higher degree:  93  1st Qu.:0.41261  1st Qu.:3
Median :47002  Partly-developer:  38  NA                           Pre-degree             : 507  Median :1.07913  Median :3
Mean   :46281  Hobbyist        :  45  NA                           NA                            Mean   :0.96267  Mean   :3
3rd Qu.:70384  Ex-developer    :   9  NA                           NA                            3rd Qu.:1.44412  3rd Qu.:3
Max.   :88873  NA                     NA                           NA                            Max.   :1.58695  Max.   :3

The third cluster is partly or not employed students, most of them still without a degree, with the highest mean scaled YearsCode, about 0.96.

Cluster 4
Respondent     MainBranch             Employment                   EdLevel                       YearsCode        cluster
Min.   :   18  Developer       :   0  Employed              : 821  Bachelor degree        :   0  Min.   :0.03174  Min.   :4
1st Qu.:21322  Student         :  18  Partly or not employed: 121  Master of higher degree: 901  1st Qu.:0.22217  1st Qu.:4
Median :40468  Partly-developer: 898  NA                           Pre-degree             :  41  Median :0.73000  Median :4
Mean   :42634  Hobbyist        :  12  NA                           NA                            Mean   :0.76642  Mean   :4
3rd Qu.:64889  Ex-developer    :  14  NA                           NA                            3rd Qu.:1.42825  3rd Qu.:4
Max.   :88645  NA                     NA                           NA                            Max.   :1.58695  Max.   :4

The fourth cluster is employed partly-developers with a master’s or higher degree; their mean scaled YearsCode (about 0.77) is close to that of the second cluster.

Cluster 5
Respondent     MainBranch             Employment                   EdLevel                       YearsCode        cluster
Min.   :   95  Developer       :1175  Employed              :1033  Bachelor degree        :1124  Min.   :0.03174  Min.   :5
1st Qu.:22790  Student         :  23  Partly or not employed: 200  Master of higher degree:   0  1st Qu.:0.31739  1st Qu.:5
Median :45158  Partly-developer:   0  NA                           Pre-degree             : 109  Median :1.07913  Median :5
Mean   :45125  Hobbyist        :  19  NA                           NA                            Mean   :0.92252  Mean   :5
3rd Qu.:67952  Ex-developer    :  16  NA                           NA                            3rd Qu.:1.49173  3rd Qu.:5
Max.   :88876  NA                     NA                           NA                            Max.   :1.58695  Max.   :5

The fifth and last cluster consists of employed developers with a bachelor degree and a higher mean scaled YearsCode, about 0.92.

## 
## --------Summary descriptives table by 'cluster'---------
## 
## ___________________________________________________________________________________________________________ 
##                                   1             2             3             4             5       p.overall 
##                                 N=398        N=1417         N=768         N=942        N=1233               
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Respondent                  45111 (26040) 44724 (25656) 46281 (26355) 42634 (25548) 45125 (25702)   0.053   
## MainBranch:                                                                                         0.000   
##     Developer                 0 (0.00%)   1361 (96.0%)   114 (14.8%)    0 (0.00%)   1175 (95.3%)            
##     Student                  19 (4.77%)     6 (0.42%)    562 (73.2%)   18 (1.91%)    23 (1.87%)             
##     Partly-developer         335 (84.2%)    0 (0.00%)    38 (4.95%)    898 (95.3%)    0 (0.00%)             
##     Hobbyist                 19 (4.77%)    15 (1.06%)    45 (5.86%)    12 (1.27%)    19 (1.54%)             
##     Ex-developer             25 (6.28%)    35 (2.47%)     9 (1.17%)    14 (1.49%)    16 (1.30%)             
## Employment:                                                                                         0.000   
##     Employed                 327 (82.2%)  1233 (87.0%)   25 (3.26%)    821 (87.2%)  1033 (83.8%)            
##     Partly or not employed   71 (17.8%)    184 (13.0%)   743 (96.7%)   121 (12.8%)   200 (16.2%)            
## EdLevel:                                                                                            0.000   
##     Bachelor degree          380 (95.5%)    0 (0.00%)    168 (21.9%)    0 (0.00%)   1124 (91.2%)            
##     Master of higher degree   0 (0.00%)   1325 (93.5%)   93 (12.1%)    901 (95.6%)    0 (0.00%)             
##     Pre-degree               18 (4.52%)    92 (6.49%)    507 (66.0%)   41 (4.35%)    109 (8.84%)            
## YearsCode                    0.86 (0.56)   0.70 (0.56)   0.96 (0.52)   0.77 (0.55)   0.92 (0.57)   <0.001   
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

The comparison table above may be more interesting to look at, since it puts all five clusters side by side.

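So the clusters are clearly separated from each other on the three categorical variables, while YearsCode differs less (means from 0.70 to 0.96 with standard deviations around 0.55). I think that is a fine solution.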
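A comparison like this can also be checked by hand with base R. The sketch below uses a tiny invented data frame with a cluster column; it is not the code that produced the table above.

```r
# Tiny invented example: cluster labels plus one factor and one numeric column.
clustered <- data.frame(
  cluster    = c(1, 1, 1, 2, 2, 2),
  MainBranch = factor(c("Developer", "Developer", "Student",
                        "Student", "Student", "Developer")),
  YearsCode  = c(0.2, 0.4, 0.6, 1.0, 1.2, 1.4)
)

# Row-wise shares: what fraction of each cluster falls into each category.
shares <- prop.table(table(clustered$cluster, clustered$MainBranch), margin = 1)
round(shares, 2)

# Mean of the numeric variable per cluster.
means <- aggregate(YearsCode ~ cluster, data = clustered, FUN = mean)
means
```

Row-wise shares show which category dominates each cluster, which is the same reading as the percentages in a by-cluster descriptive table.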

Other clustering method

Source: https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/


Looks nice, and that’s all for now!

Later I will explore this topic further and learn more about the different techniques. Thank you for the course!