Preparations

First, I will transform the chosen variables into more usable forms and omit NA values so that the clustering works better:

(I will also scale the YearsCode variable without centering it: the variable is log-transformed later, and centering would produce negative values and therefore NaNs)

set.seed(2020)
library(data.table); library(dplyr)
df <- fread("dataforproject2.csv", stringsAsFactors = TRUE)
df <- df %>% select(Respondent, MainBranch, Employment, EdLevel, YearsCode)
# changing YearsCode to numeric: drop the two text answers first
df <- subset(df, YearsCode != "Less than 1 year" & YearsCode != "More than 50 years")
# go through as.character(), because as.numeric() on a factor returns level codes
# scale without centering so the values stay positive for the log ratio in daisy()
df$YearsCode <- scale(as.numeric(as.character(df$YearsCode)), center = FALSE, scale = TRUE)
df <- na.omit(df)
# creating fewer groups to keep the later comparisons small
# for MainBranch: new labels follow the order of the existing (alphabetical) factor levels
levels(df$MainBranch) <- c("Developer", "Student", "Partly-developer", "Hobbyist", "Ex-developer")
# for Employment: full-time employed vs. everyone else
df$Employment <- as.factor(fifelse(df$Employment == "Employed full-time",
                                   "Employed", "Partly or not employed"))
# for Education: three groups
pre_degree <- c("I never completed any formal education",
                "Primary/elementary school",
                "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)",
                "Professional degree (JD, MD, etc.)",
                "Some college/university study without earning a degree")
df$EdLevel <- as.factor(
  fifelse(df$EdLevel %in% pre_degree, "Pre-degree",
          fifelse(df$EdLevel == "Bachelor’s degree (BA, BS, B.Eng., etc.)",
                  "Bachelor degree",
                  "Master of higher degree")))
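
When a column of numbers is read as a factor, converting it to numeric needs some care: as.numeric() applied to a factor returns the internal level codes rather than the numbers shown. A minimal sketch on made-up answers (the vector `years` is hypothetical and not taken from the survey file):

```r
# hypothetical answers, stored as a factor the way stringsAsFactors = TRUE would store them
years <- factor(c("2", "10", "35", "2"))

levels(years)                                # levels are sorted as text, so "10" comes before "2"
codes  <- as.numeric(years)                  # level codes, not years
values <- as.numeric(as.character(years))    # the numbers that were actually written

codes
values
```

Comparing `codes` with `values` shows how easily the two can be confused, since both are plain numeric vectors of the same length.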

Choosing variables

To cluster R users meaningfully I take the following variables:

MainBranch: how close the respondent is to development and programming (but what about data analysis? It is not development, yet it depends heavily on writing code, especially in ML-related jobs)
Employment: the respondent's employment status, recoded into two groups
EdLevel: the respondent's level of education, also recoded, into three groups
YearsCode: scaled continuous variable for how many years the respondent has been writing code professionally

With these variables I try to cluster respondents by their level of coding experience, which is why I chose the small set described above.

Exploring data

After preparing and recoding the data (and choosing the variables), the structure of the dataset needs to be described:

sapply(df[,c("MainBranch", "Employment", "EdLevel", "YearsCode")], summary)

## $MainBranch
##        Developer          Student Partly-developer         Hobbyist 
##             2650              628             1271              110 
##     Ex-developer 
##               99 
## 
## $Employment
##               Employed Partly or not employed 
##                   3439                   1319 
## 
## $EdLevel
##         Bachelor degree Master of higher degree              Pre-degree 
##                    1672                    2319                     767 
## 
## $YearsCode
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03174 0.25391 0.73000 0.82737 1.42825 1.58695

MainBranch: the biggest group is developers (2650 respondents, more than half of the whole sample); respondents who only partly use code in their work form a group about half that size (1271); about 630 respondents are students; and the two smallest groups, hobbyists and ex-developers, have about a hundred respondents each
Employment: nearly three quarters of the sample (3439 of 4758) are employed full-time, and the rest are partly employed or not employed
EdLevel: about half of the whole sample have a master's degree or higher (PhD and so on), about a third have a bachelor's degree, and a small group (767 respondents) falls into the pre-degree category
YearsCode: this variable is scaled (but not centered), so its values are no longer in years and cannot be interpreted directly at this point.
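
To make the scaled YearsCode values easier to read: with center = FALSE, scale() divides a column by its root mean square, sqrt(sum(x^2) / (n - 1)), so every value stays positive and ratios between values are preserved. A small sketch on hypothetical numbers (the vector `x` is invented for illustration):

```r
x <- c(1, 5, 10, 20, 50)                     # hypothetical years of coding

scaled <- as.numeric(scale(x, center = FALSE, scale = TRUE))
rms    <- sqrt(sum(x^2) / (length(x) - 1))   # root mean square as used by scale()

all.equal(scaled, x / rms)                   # scaling is a plain division by rms
max(scaled) / min(scaled)                    # the ratio 50 / 1 is unchanged
all(scaled > 0)                              # log() is defined for every value

centered <- as.numeric(scale(x))             # centered and scaled: some values are negative
suppressWarnings(log(centered))              # log() of the negative entries gives NaN
```

This is consistent with the summary above, where the maximum of YearsCode is roughly 50 times the minimum.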

Calculating distance

Before clustering anything I need to calculate distances. I choose the daisy() function with the Gower distance metric, because three variables are factors and only one is continuous, so the algorithm and metric must be applicable to mixed data:

(I also drop the respondent IDs, because such a variable is meaningless for clustering; it stays in the dataset in case I need it later)

library(ISLR); library(cluster)
gower_dist <- daisy(df[,-"Respondent"], metric = "gower", type = list(logratio = 4))
summary(gower_dist); gower_mat <- as.matrix(gower_dist)

## 11316903 dissimilarities, summarized :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2918  0.5040  0.4792  0.6561  1.0000 
## Metric :  mixed ;  Types = N, N, N, I 
## Number of objects : 4758

The distances are calculated and the mean is about 0.5 (0.48). From that I would expect observations concentrated in the middle, large variation and overlapping clusters, which is not good at all, but I will still test my set of variables.
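
To see what a single Gower dissimilarity is made of, the value can be rebuilt by hand on a tiny example: each factor contributes 0 (same level) or 1 (different level), the log-ratio variable contributes the absolute difference of logs divided by the range of logs, and the contributions are averaged. The sketch below assumes this reading of daisy() and uses a hypothetical three-row data frame `toy`:

```r
library(cluster)

# hypothetical respondents: two factors and one positive numeric variable
toy <- data.frame(
  branch   = factor(c("Developer", "Student", "Developer")),
  employed = factor(c("Employed", "Partly or not employed", "Employed")),
  years    = c(1, 10, 2)
)
d <- as.matrix(daisy(toy, metric = "gower", type = list(logratio = "years")))

log_years <- log(toy$years)
log_range <- diff(range(log_years))

# rows 1 and 2: both factors differ, years are at opposite ends of the range
manual_12 <- (1 + 1 + abs(log_years[1] - log_years[2]) / log_range) / 3
# rows 1 and 3: both factors match, only years differ
manual_13 <- (0 + 0 + abs(log_years[1] - log_years[3]) / log_range) / 3

c(daisy = d[1, 2], manual = manual_12)
c(daisy = d[1, 3], manual = manual_13)
```

With three factors and one numeric variable, as in the real data, a pair that differs on every factor already has a distance of at least 0.75, which helps explain why so many distances are large.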

Most similar cases

df[which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), arr.ind = TRUE)[1, ], ]

##    Respondent       MainBranch Employment                 EdLevel YearsCode
## 1:       1337 Partly-developer   Employed Master of higher degree  1.586949
## 2:         20 Partly-developer   Employed Master of higher degree  1.555210

They look like near-identical twins: the same on every factor and almost the same YearsCode, nice!

Most dissimilar cases

# excluding max(gower_mat) itself means this picks the largest distance below the overall maximum
df[which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), arr.ind = TRUE)[1, ], ]

##    Respondent       MainBranch             Employment                 EdLevel
## 1:       1261          Student Partly or not employed         Bachelor degree
## 2:         20 Partly-developer               Employed Master of higher degree
##     YearsCode
## 1: 0.03173898
## 2: 1.55520985

So here a student with a bachelor's degree who is partly or not employed is set against an employed partly-developer with a master's or higher degree, with a large difference in years of coding. This works nicely too, although I expected a developer with a master's degree against a partly employed respondent without a degree.
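
The lookup pattern used for both pairs can be tried on a small matrix first. The sketch below uses a hypothetical 4 x 4 distance matrix `m` in which observations 1 and 4 are exact duplicates; it takes each pair once through upper.tri(), skips zero distances for the closest pair, and keeps the true maximum for the farthest pair:

```r
# toy symmetric distance matrix for four hypothetical observations
m <- matrix(c(0.0, 0.2, 0.9, 0.0,
              0.2, 0.0, 0.5, 0.2,
              0.9, 0.5, 0.0, 0.9,
              0.0, 0.2, 0.9, 0.0), nrow = 4, byrow = TRUE)

off_diag <- m[upper.tri(m)]                  # each pair once, diagonal excluded

closest_nonzero <- min(off_diag[off_diag > 0])   # ignores exact duplicates
farthest        <- max(off_diag)

# which() with arr.ind = TRUE returns row/column indices; the first row is one matching pair
closest_pair  <- which(m == closest_nonzero, arr.ind = TRUE)[1, ]
farthest_pair <- which(m == farthest, arr.ind = TRUE)[1, ]
closest_pair
farthest_pair
```

Several pairs can share the same distance, so taking the first row only returns one of them.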

Dendrogram

heatmap(gower_mat, symm = T,
        distfun = function(x) as.dist(x))

This looks like too many clusters, more than 10, which is not great, but I think it depends heavily on the data, which have some issues, so there is nothing I can do about it.

PAM

First, I will cluster the data with PAM and describe the clusters:

sil_width <- c(NA)

for(i in 2:10){
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

plot(1:10, sil_width,
     xlab = "Number of clusters", xaxt='n',
     ylab = "Silhouette Width",
     ylim = c(0,1))
axis(1, at = seq(2, 10, by = 1), las=2)
lines(1:10, sil_width)

I will choose 5 clusters as the optimal solution: more clusters could increase the silhouette width, but not by much, and that many clusters would be hard to interpret.
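
The same selection can also be read off numerically instead of from the plot. A minimal sketch on simulated data, assuming three well-separated groups (the data frame `toy` and the range of k are made up for illustration):

```r
library(cluster)

set.seed(1)
# hypothetical toy data: one numeric variable and one factor, three clear groups
toy <- data.frame(
  x = c(rnorm(20, 0, 0.3), rnorm(20, 5, 0.3), rnorm(20, 10, 0.3)),
  g = factor(rep(c("a", "b", "c"), each = 20))
)
toy_dist <- daisy(toy, metric = "gower")

ks <- 2:6
# average silhouette width of the PAM solution for each k
avg_sil <- sapply(ks, function(k) pam(toy_dist, k = k, diss = TRUE)$silinfo$avg.width)

data.frame(k = ks, avg_sil = round(avg_sil, 3))
best_k <- ks[which.max(avg_sil)]
best_k
```

On real data the maximum is often less clear-cut, so trading a slightly lower width for fewer, more interpretable clusters, as done here, is a judgment call.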

pam_fit <- pam(gower_dist, diss = TRUE, k = 5)

pam_results <- df %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
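
Besides the per-cluster summaries, a fitted pam object records which observations serve as medoids: the `id.med` component holds their row positions, so the representative rows can be pulled straight from the data. A small sketch on a hypothetical six-row data frame:

```r
library(cluster)

# hypothetical toy data
toy <- data.frame(
  role  = factor(c("Developer", "Developer", "Student", "Student", "Developer", "Student")),
  years = c(10, 12, 1, 2, 11, 1.5)
)
fit <- pam(daisy(toy, metric = "gower"), k = 2, diss = TRUE)

medoid_rows <- toy[fit$id.med, ]             # one representative row per cluster
medoid_rows
table(fit$clustering)                        # cluster sizes
```

Each medoid is an actual respondent, which makes it a convenient one-line description of its cluster.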

With the medoids and a plot I can describe the clusters and their dissimilarities, but first I will visualize my solution:

Visualizing solution

library(Rtsne); library(ggplot2)
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = df$Respondent)

ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

Looks awful, but I have three factor variables and only one numeric one, so it is acceptable for this set of variables and these data. Again, such a plot says nothing about the clusters themselves, so it is better to describe them with summaries:

Describing clusters

Cluster 1
Respondent	MainBranch	Employment	EdLevel	YearsCode	cluster
Min. : 6	Developer : 0	Employed :327	Bachelor degree :380	Min. :0.03174	Min. :1
1st Qu.:23198	Student : 19	Partly or not employed: 71	Master of higher degree: 0	1st Qu.:0.38087	1st Qu.:1
Median :44624	Partly-developer:335	NA	Pre-degree : 18	Median :0.76174	Median :1
Mean :45111	Hobbyist : 19	NA	NA	Mean :0.85552	Mean :1
3rd Qu.:67644	Ex-developer : 25	NA	NA	3rd Qu.:1.42825	3rd Qu.:1
Max. :88375	NA	NA	NA	Max. :1.58695	Max. :1

The first cluster is mostly employed partly-developers with a bachelor's degree; the mean of scaled YearsCode is 0.86

Cluster 2
Respondent	MainBranch	Employment	EdLevel	YearsCode	cluster
Min. : 10	Developer :1361	Employed :1233	Bachelor degree : 0	Min. :0.03174	Min. :2
1st Qu.:22652	Student : 6	Partly or not employed: 184	Master of higher degree:1325	1st Qu.:0.19043	1st Qu.:2
Median :44800	Partly-developer: 0	NA	Pre-degree : 92	Median :0.50782	Median :2
Mean :44724	Hobbyist : 15	NA	NA	Mean :0.70386	Mean :2
3rd Qu.:67209	Ex-developer : 35	NA	NA	3rd Qu.:1.42825	3rd Qu.:2
Max. :88830	NA	NA	NA	Max. :1.58695	Max. :2

The second cluster is primarily employed developers with a master's or higher degree and a lower mean of scaled YearsCode, 0.70

Cluster 3
Respondent	MainBranch	Employment	EdLevel	YearsCode	cluster
Min. : 12	Developer :114	Employed : 25	Bachelor degree :168	Min. :0.03174	Min. :3
1st Qu.:22976	Student :562	Partly or not employed:743	Master of higher degree: 93	1st Qu.:0.41261	1st Qu.:3
Median :47002	Partly-developer: 38	NA	Pre-degree :507	Median :1.07913	Median :3
Mean :46281	Hobbyist : 45	NA	NA	Mean :0.96267	Mean :3
3rd Qu.:70384	Ex-developer : 9	NA	NA	3rd Qu.:1.44412	3rd Qu.:3
Max. :88873	NA	NA	NA	Max. :1.58695	Max. :3

The third cluster is mostly partly or not employed students who are still studying, yet it has the highest mean of scaled YearsCode, 0.96

Cluster 4
Respondent	MainBranch	Employment	EdLevel	YearsCode	cluster
Min. : 18	Developer : 0	Employed :821	Bachelor degree : 0	Min. :0.03174	Min. :4
1st Qu.:21322	Student : 18	Partly or not employed:121	Master of higher degree:901	1st Qu.:0.22217	1st Qu.:4
Median :40468	Partly-developer:898	NA	Pre-degree : 41	Median :0.73000	Median :4
Mean :42634	Hobbyist : 12	NA	NA	Mean :0.76642	Mean :4
3rd Qu.:64889	Ex-developer : 14	NA	NA	3rd Qu.:1.42825	3rd Qu.:4
Max. :88645	NA	NA	NA	Max. :1.58695	Max. :4

The fourth cluster is employed partly-developers with a master's or higher degree; its mean of scaled YearsCode (0.77) is similar to that of the second cluster

Cluster 5
Respondent	MainBranch	Employment	EdLevel	YearsCode	cluster
Min. : 95	Developer :1175	Employed :1033	Bachelor degree :1124	Min. :0.03174	Min. :5
1st Qu.:22790	Student : 23	Partly or not employed: 200	Master of higher degree: 0	1st Qu.:0.31739	1st Qu.:5
Median :45158	Partly-developer: 0	NA	Pre-degree : 109	Median :1.07913	Median :5
Mean :45125	Hobbyist : 19	NA	NA	Mean :0.92252	Mean :5
3rd Qu.:67952	Ex-developer : 16	NA	NA	3rd Qu.:1.49173	3rd Qu.:5
Max. :88876	NA	NA	NA	Max. :1.58695	Max. :5

The fifth and last cluster consists of employed developers with a bachelor's degree and a higher mean of scaled YearsCode, 0.92

## 
## --------Summary descriptives table by 'cluster'---------
## 
## ___________________________________________________________________________________________________________ 
##                                   1             2             3             4             5       p.overall 
##                                 N=398        N=1417         N=768         N=942        N=1233               
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Respondent                  45111 (26040) 44724 (25656) 46281 (26355) 42634 (25548) 45125 (25702)   0.053   
## MainBranch:                                                                                         0.000   
##     Developer                 0 (0.00%)   1361 (96.0%)   114 (14.8%)    0 (0.00%)   1175 (95.3%)            
##     Student                  19 (4.77%)     6 (0.42%)    562 (73.2%)   18 (1.91%)    23 (1.87%)             
##     Partly-developer         335 (84.2%)    0 (0.00%)    38 (4.95%)    898 (95.3%)    0 (0.00%)             
##     Hobbyist                 19 (4.77%)    15 (1.06%)    45 (5.86%)    12 (1.27%)    19 (1.54%)             
##     Ex-developer             25 (6.28%)    35 (2.47%)     9 (1.17%)    14 (1.49%)    16 (1.30%)             
## Employment:                                                                                         0.000   
##     Employed                 327 (82.2%)  1233 (87.0%)   25 (3.26%)    821 (87.2%)  1033 (83.8%)            
##     Partly or not employed   71 (17.8%)    184 (13.0%)   743 (96.7%)   121 (12.8%)   200 (16.2%)            
## EdLevel:                                                                                            0.000   
##     Bachelor degree          380 (95.5%)    0 (0.00%)    168 (21.9%)    0 (0.00%)   1124 (91.2%)            
##     Master of higher degree   0 (0.00%)   1325 (93.5%)   93 (12.1%)    901 (95.6%)    0 (0.00%)             
##     Pre-degree               18 (4.52%)    92 (6.49%)    507 (66.0%)   41 (4.35%)    109 (8.84%)            
## YearsCode                    0.86 (0.56)   0.70 (0.56)   0.96 (0.52)   0.77 (0.55)   0.92 (0.57)   <0.001   
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

It may be more interesting to look at the clusters side by side in the comparison table above.

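So the clusters are nicely separated: each one has its own dominant combination of MainBranch, Employment and EdLevel, while the YearsCode means differ much less (0.70 to 0.96, with standard deviations around 0.55). I think that is a fine solution.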
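
The percentages in a comparison table of this kind are shares within each cluster, and they can be reproduced with base R alone. A minimal sketch with hypothetical cluster labels and one recoded factor (both vectors are invented for illustration):

```r
# hypothetical cluster labels and employment status for ten respondents
cluster    <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
employment <- factor(c("Employed", "Employed", "Partly or not employed",
                       "Employed", "Employed", "Employed", "Employed",
                       "Partly or not employed", "Partly or not employed", "Employed"))

counts <- table(cluster, employment)
# margin = 1 turns counts into row percentages, i.e. shares within each cluster
shares <- round(100 * prop.table(counts, margin = 1), 1)
shares
```

A cluster dominated by a single level shows up as one share close to 100, which is the pattern visible for most factor levels in the table above.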

Other clustering method

Source: https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/

library(factoextra)

res.hc <- hclust(gower_dist, method = "ward.D2")

fviz_dend(res.hc, k = 5,
          cex = 0.5,
          color_labels_by_k = TRUE,
          rect = TRUE
          )
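
One way to check how far a hierarchical solution agrees with PAM is to cut the tree at the same number of clusters and cross-tabulate the two label vectors. A sketch on simulated data, assuming three clear groups (the data frame `toy` is hypothetical):

```r
library(cluster)

set.seed(1)
# hypothetical toy data with three well-separated groups
toy <- data.frame(
  x = c(rnorm(15, 0, 0.3), rnorm(15, 5, 0.3), rnorm(15, 10, 0.3)),
  g = factor(rep(c("a", "b", "c"), each = 15))
)
toy_dist <- daisy(toy, metric = "gower")

pam_labels <- pam(toy_dist, k = 3, diss = TRUE)$clustering
hc_labels  <- cutree(hclust(toy_dist, method = "ward.D2"), k = 3)

# rows: PAM clusters, columns: hierarchical clusters
agreement <- table(pam = pam_labels, hclust = hc_labels)
agreement
```

Cluster numbers are arbitrary in both methods, so agreement shows up as one non-zero cell per row rather than as a filled diagonal.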

Looks nice and that’s all for now!

Later I will explore this topic further to understand the different techniques better. Thank you for the course!