Preparations
First, I will transform the chosen variables into more usable forms and drop rows with NA values so that the clustering works cleanly.
(I will also scale the YearsCode variable without centering it: centering would produce negative values, and taking the logarithm of those later would give NaNs.)
set.seed(2020)
require(data.table); require(dplyr)
df <- fread("dataforproject2.csv", stringsAsFactors = T)
df <- df %>% select(Respondent, MainBranch, Employment, EdLevel, YearsCode)
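# changing YearsCode to numeric: first drop the two text categories
df <- subset(df, YearsCode != "Less than 1 year" & YearsCode != "More than 50 years")
# YearsCode was read as a factor, so convert through as.character();
# as.numeric() on a factor alone would return level codes instead of years
# not centering, to avoid NaNs later in daisy() when this variable is log-transformed
df$YearsCode <- scale(as.numeric(as.character(df$YearsCode)), center = FALSE, scale = TRUE)
df <- na.omit(df)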
# merging categories into fewer groups to get a smaller matrix later
# for MainBranch: the new labels follow the alphabetical order of the original levels
levels(df$MainBranch) <- c("Developer", "Student", "Partly-developer", "Hobbyist", "Ex-developer")
# for Employment: full-time versus everything else
df$Employment <- factor(
  fifelse(df$Employment == "Employed full-time", "Employed", "Partly or not employed"))
# for Education: three groups
pre_degree <- c("I never completed any formal education",
                "Primary/elementary school",
                "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)",
                "Professional degree (JD, MD, etc.)",
                "Some college/university study without earning a degree")
df$EdLevel <- factor(
  fifelse(df$EdLevel %in% pre_degree,
          "Pre-degree",
          fifelse(df$EdLevel == "Bachelor’s degree (BA, BS, B.Eng., etc.)",
                  "Bachelor degree",
                  "Master of higher degree")))
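One step above deserves a closer look: YearsCode is read in as a factor, and R converts a factor to numbers through its internal level codes, not its labels. Here is a minimal sketch with made-up values (the `years` vector is hypothetical and not taken from the dataset) that shows the difference:

```r
# Hypothetical survey answers stored as a factor, as fread(stringsAsFactors = TRUE) would do
years <- factor(c("1", "10", "2", "5"))

# as.numeric() on a factor returns the position of each label among the sorted levels
codes <- as.numeric(years)

# going through as.character() recovers the numbers written in the labels
values <- as.numeric(as.character(years))

print(codes)
print(values)
```

In this toy case the two results disagree for every answer except "1", which is why the conversion through `as.character()` matters before scaling.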
Choosing variables
To cluster R users meaningfully I take the following variables:
- MainBranch: how close the respondent is to developing and programming (but what about data analysis? That is not development, yet it depends heavily on writing code, especially in ML-linked jobs)
- Employment: the job format, which I recoded into two groups
- EdLevel: the respondent’s level of education, also recoded, but into 3 groups
- YearsCode: a scaled continuous variable for how many years the respondent has been writing code professionally
With these variables I am trying to cluster respondents by their level of coding experience, and for that purpose I chose the small but useful set of variables described above.
Exploring data
After preparing and recoding the data (and choosing the variables), I need to describe the structure of the dataset:
## $MainBranch
## Developer Student Partly-developer Hobbyist
## 2650 628 1271 110
## Ex-developer
## 99
##
## $Employment
## Employed Partly or not employed
## 3439 1319
##
## $EdLevel
## Bachelor degree Master of higher degree Pre-degree
## 1672 2319 767
##
## $YearsCode
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03174 0.25391 0.73000 0.82737 1.42825 1.58695
- MainBranch: the biggest group is developers (2650 respondents, more than half of the whole sample). Respondents who only partly use code in their work form a group about half that size, about 630 respondents are students, and the two smallest groups, hobbyists and ex-developers, have about a hundred respondents each
- Employment: nearly three quarters of the sample (3439 of 4758) are employed full-time, and the rest are partly or not employed
- EdLevel: about half of the sample hold a master's degree or higher (PhD and so on), about a third hold a bachelor degree, and a smaller group (767 respondents) has no degree
- YearsCode: this variable has been scaled (but not centered), so its values are no longer in years and the summary says little on its own.
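To make the scaled YearsCode values less mysterious, here is a small sketch of what `scale()` does when `center = FALSE`: it divides by the root mean square of the column, so positive values stay positive and a later logarithm stays finite. The vector `x` is made up for illustration.

```r
# Hypothetical years of coding
x <- c(1, 5, 10, 50)

# Scaling without centering divides by the root mean square, sqrt(sum(x^2) / (n - 1))
scaled <- scale(x, center = FALSE, scale = TRUE)
rms <- sqrt(sum(x^2) / (length(x) - 1))
same_as_rms <- isTRUE(all.equal(as.numeric(scaled), x / rms))

# All scaled values are positive, so the logarithm is finite
log_ok <- all(is.finite(log(scaled)))

# Centering first produces negative values, and their logarithm is NaN
centered <- scale(x, center = TRUE, scale = TRUE)
log_breaks <- any(is.nan(suppressWarnings(log(centered))))

# The divisor is stored as an attribute, so the original units can be recovered
years_back <- as.numeric(scaled) * attr(scaled, "scaled:scale")
```

The last line is one way to translate a scaled value back into years when interpreting the clusters.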
Calculating distance
First, I need a distance matrix before clustering anything. I chose the daisy() function with the Gower distance metric because three variables are factors and only one is continuous, so the algorithm and the metric have to work for mixed data:
(I also drop the respondent IDs from the distance calculation because such a variable is meaningless for clustering. They stay in the dataset in case I need them later.)
library(ISLR); library(cluster)
gower_dist <- daisy(df[,-"Respondent"], metric = "gower", type = list(logratio = 4))
summary(gower_dist); gower_mat <- as.matrix(gower_dist)
## 11316903 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2918 0.5040 0.4792 0.6561 1.0000
## Metric : mixed ; Types = N, N, N, I
## Number of objects : 4758
The distances are calculated, and their mean is about 0.48 (median 0.50). From that I expect the observations to be concentrated in the middle, with a lot of variation and overlapping clusters. That is not good at all, but I will still test my set of variables.
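To see what the Gower metric does with mixed data, a toy example can be checked by hand. The sketch below assumes a made-up two-column data frame (`toy`, `branch`, `years` are illustrative names): a factor contributes 0 when two rows match and 1 when they differ, a log-ratio variable contributes the absolute difference of the logs divided by their range, and the distance is the average of these contributions.

```r
library(cluster)

# Hypothetical data: one factor and one positive numeric variable
toy <- data.frame(
  branch = factor(c("Developer", "Student", "Developer")),
  years  = c(1, 10, 100)
)

# Column 2 is treated as a log-ratio variable, as in the call above
d <- as.matrix(daisy(toy, metric = "gower", type = list(logratio = 2)))

# Manual calculation for rows 1 and 2
log_years <- log(toy$years)
branch_part <- as.numeric(toy$branch[1] != toy$branch[2])                  # 1: the levels differ
years_part  <- abs(log_years[1] - log_years[2]) / diff(range(log_years))   # 0.5: halfway on the log scale
d12_manual  <- mean(c(branch_part, years_part))

print(d[1, 2])
print(d12_manual)
```

With three factors and one numeric variable, as in my data, most of each distance comes from how many categories two respondents share.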
Most similar cases
## Respondent MainBranch Employment EdLevel YearsCode
## 1: 1337 Partly-developer Employed Master of higher degree 1.586949
## 2: 20 Partly-developer Employed Master of higher degree 1.555210
Looks like identical twins, nice!
Most dissimilar cases
## Respondent MainBranch Employment EdLevel
## 1: 1261 Student Partly or not employed Bachelor degree
## 2: 20 Partly-developer Employed Master of higher degree
## YearsCode
## 1: 0.03173898
## 2: 1.55520985
So this is a partly or not employed student with a bachelor degree versus an employed partly-developer with a master's or higher degree, with a large difference in years of coding professionally. That works nicely too, although I expected a developer with a master's degree versus a partly employed respondent without a degree.
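The code behind these two lookups is not shown here, so the following is only a sketch of one way to find the closest and the farthest pair in a distance matrix. It uses a made-up data frame (`toy` and its columns are hypothetical) and masks the diagonal so that a row is never matched with itself.

```r
library(cluster)

# Hypothetical respondents
toy <- data.frame(
  id     = 1:4,
  branch = factor(c("Developer", "Developer", "Student", "Student")),
  years  = c(2, 3, 1, 40)
)

# Gower distances without the id column
mat <- as.matrix(daisy(toy[, -1], metric = "gower"))

# Ignore the zero distances of each row to itself
off_diag <- mat
diag(off_diag) <- NA

# Row indices of the most similar and the most dissimilar pair
closest  <- which(off_diag == min(off_diag, na.rm = TRUE), arr.ind = TRUE)[1, ]
farthest <- which(off_diag == max(off_diag, na.rm = TRUE), arr.ind = TRUE)[1, ]

print(toy[closest, ])
print(toy[farthest, ])
```

Masking the diagonal also keeps the lookup working when two different respondents have a distance of exactly zero.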
Dendrogram
This looks like too many clusters, more than 10, which is not great. I think it depends heavily on the data, which has some issues, so there is not much I can do about it.
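Instead of counting branches by eye, a dendrogram can be cut at a chosen number of groups with `cutree()`. The sketch below is an assumption about how that could look. It uses simulated one-dimensional data and complete linkage, neither of which comes from my analysis. A `daisy()` result can be passed to `hclust()` in the same way as the `dist()` object here.

```r
set.seed(2020)

# Three simulated, well-separated groups of 10 points each
x <- c(rnorm(10, mean = 0, sd = 0.5),
       rnorm(10, mean = 10, sd = 0.5),
       rnorm(10, mean = 20, sd = 0.5))

# Hierarchical clustering on the distance object (complete linkage is an assumption)
hc <- hclust(dist(x), method = "complete")

# Cut the tree into 3 groups and count the members of each
groups <- cutree(hc, k = 3)
sizes <- table(groups)
print(sizes)

# plot(hc, labels = FALSE)  # draws the dendrogram itself
```

Very small groups in such a table would be a hint that the chosen cut is too fine.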
PAM
First, I will cluster the data with PAM and describe the resulting clusters:
sil_width <- c(NA)  # no silhouette width for a single cluster
for (i in 2:10) {
  pam_fit <- pam(gower_dist, diss = TRUE, k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}
plot(1:10, sil_width,
     xlab = "Number of clusters", xaxt = "n",
     ylab = "Silhouette Width",
     ylim = c(0, 1))
axis(1, at = seq(2, 10, by = 1), las = 2)
lines(1:10, sil_width)
I will choose 5 clusters as the optimal solution: more clusters would increase the silhouette width, but not by much, and that many clusters would be hard to interpret.
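For comparison, this is how the same silhouette check behaves when the answer is obvious. The sketch uses simulated data with two tight groups (not my survey data): the average silhouette width is close to 1 for k = 2 and drops as soon as a real group gets split.

```r
library(cluster)
set.seed(1)

# Two simulated, clearly separated groups
x <- c(rnorm(20, mean = 0, sd = 0.1), rnorm(20, mean = 5, sd = 0.1))
d <- dist(x)

# Average silhouette width for k = 2, 3, 4
widths <- sapply(2:4, function(k) pam(d, k = k, diss = TRUE)$silinfo$avg.width)
names(widths) <- 2:4

# The k with the largest average width
best_k <- as.integer(names(which.max(widths)))

print(round(widths, 2))
print(best_k)
```

A curve that stays low and flat across all values of k, unlike this toy case, suggests that the clusters overlap a lot.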
pam_fit <- pam(gower_dist, diss = TRUE, k = 5)
pam_results <- df %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
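PAM uses real observations as cluster centres (medoids), so each cluster has one "typical" respondent that can be printed directly. This is a minimal sketch on made-up data (`toy` is hypothetical). It relies on the `id.med` component of the `pam()` result, which holds the row numbers of the medoids.

```r
library(cluster)

# Hypothetical respondents: three developers and three students
toy <- data.frame(
  branch = factor(c("Developer", "Developer", "Developer",
                    "Student", "Student", "Student")),
  years  = c(9, 10, 11, 1, 2, 3)
)

# PAM on Gower distances with two clusters
fit <- pam(daisy(toy, metric = "gower"), k = 2, diss = TRUE)

# The rows that serve as medoids, one per cluster
medoid_rows <- toy[fit$id.med, ]
print(medoid_rows)
```

In the toy data the medoids are the middle developer and the middle student, which is what one would expect from a "most central" observation.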
With the medoids and a plot I can describe the clusters and their dissimilarities, but first I will visualize my solution:
Visualizing solution
library(Rtsne); library(ggplot2)
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = df$Respondent)
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))
It looks awful, but I have three factor variables and only one numeric variable, so this is acceptable for such a set of variables and the given data. Again, nothing can be said about the clusters themselves from such a plot, so it is better to describe them with summaries:
Describing clusters
Respondent | MainBranch | Employment | EdLevel | YearsCode | cluster
---|---|---|---|---|---
Min. : 6 | Developer : 0 | Employed :327 | Bachelor degree :380 | Min. :0.03174 | Min. :1
1st Qu.:23198 | Student : 19 | Partly or not employed: 71 | Master of higher degree: 0 | 1st Qu.:0.38087 | 1st Qu.:1
Median :44624 | Partly-developer:335 | NA | Pre-degree : 18 | Median :0.76174 | Median :1
Mean :45111 | Hobbyist : 19 | NA | NA | Mean :0.85552 | Mean :1
3rd Qu.:67644 | Ex-developer : 25 | NA | NA | 3rd Qu.:1.42825 | 3rd Qu.:1
Max. :88375 | NA | NA | NA | Max. :1.58695 | Max. :1
The first cluster is mostly employed partly-developers with a bachelor degree; its mean scaled YearsCode is 0.86.
Respondent | MainBranch | Employment | EdLevel | YearsCode | cluster
---|---|---|---|---|---
Min. : 10 | Developer :1361 | Employed :1233 | Bachelor degree : 0 | Min. :0.03174 | Min. :2
1st Qu.:22652 | Student : 6 | Partly or not employed: 184 | Master of higher degree:1325 | 1st Qu.:0.19043 | 1st Qu.:2
Median :44800 | Partly-developer: 0 | NA | Pre-degree : 92 | Median :0.50782 | Median :2
Mean :44724 | Hobbyist : 15 | NA | NA | Mean :0.70386 | Mean :2
3rd Qu.:67209 | Ex-developer : 35 | NA | NA | 3rd Qu.:1.42825 | 3rd Qu.:2
Max. :88830 | NA | NA | NA | Max. :1.58695 | Max. :2
The second cluster is primarily employed developers with a master's or higher degree and the lowest mean scaled YearsCode, 0.70.
Respondent | MainBranch | Employment | EdLevel | YearsCode | cluster
---|---|---|---|---|---
Min. : 12 | Developer :114 | Employed : 25 | Bachelor degree :168 | Min. :0.03174 | Min. :3
1st Qu.:22976 | Student :562 | Partly or not employed:743 | Master of higher degree: 93 | 1st Qu.:0.41261 | 1st Qu.:3
Median :47002 | Partly-developer: 38 | NA | Pre-degree :507 | Median :1.07913 | Median :3
Mean :46281 | Hobbyist : 45 | NA | NA | Mean :0.96267 | Mean :3
3rd Qu.:70384 | Ex-developer : 9 | NA | NA | 3rd Qu.:1.44412 | 3rd Qu.:3
Max. :88873 | NA | NA | NA | Max. :1.58695 | Max. :3
The third cluster is mostly partly or not employed students who are still studying (most have no degree yet), but it has the highest mean scaled YearsCode, 0.96.
Respondent | MainBranch | Employment | EdLevel | YearsCode | cluster
---|---|---|---|---|---
Min. : 18 | Developer : 0 | Employed :821 | Bachelor degree : 0 | Min. :0.03174 | Min. :4
1st Qu.:21322 | Student : 18 | Partly or not employed:121 | Master of higher degree:901 | 1st Qu.:0.22217 | 1st Qu.:4
Median :40468 | Partly-developer:898 | NA | Pre-degree : 41 | Median :0.73000 | Median :4
Mean :42634 | Hobbyist : 12 | NA | NA | Mean :0.76642 | Mean :4
3rd Qu.:64889 | Ex-developer : 14 | NA | NA | 3rd Qu.:1.42825 | 3rd Qu.:4
Max. :88645 | NA | NA | NA | Max. :1.58695 | Max. :4
The fourth cluster is employed partly-developers with a master's or higher degree; its mean scaled YearsCode (0.77) is close to that of the second cluster.
Respondent | MainBranch | Employment | EdLevel | YearsCode | cluster
---|---|---|---|---|---
Min. : 95 | Developer :1175 | Employed :1033 | Bachelor degree :1124 | Min. :0.03174 | Min. :5
1st Qu.:22790 | Student : 23 | Partly or not employed: 200 | Master of higher degree: 0 | 1st Qu.:0.31739 | 1st Qu.:5
Median :45158 | Partly-developer: 0 | NA | Pre-degree : 109 | Median :1.07913 | Median :5
Mean :45125 | Hobbyist : 19 | NA | NA | Mean :0.92252 | Mean :5
3rd Qu.:67952 | Ex-developer : 16 | NA | NA | 3rd Qu.:1.49173 | 3rd Qu.:5
Max. :88876 | NA | NA | NA | Max. :1.58695 | Max. :5
The fifth and last cluster consists of employed developers with a bachelor degree and a higher mean scaled YearsCode, 0.92.
##
## --------Summary descriptives table by 'cluster'---------
##
## ___________________________________________________________________________________________________________
## 1 2 3 4 5 p.overall
## N=398 N=1417 N=768 N=942 N=1233
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## Respondent 45111 (26040) 44724 (25656) 46281 (26355) 42634 (25548) 45125 (25702) 0.053
## MainBranch: 0.000
## Developer 0 (0.00%) 1361 (96.0%) 114 (14.8%) 0 (0.00%) 1175 (95.3%)
## Student 19 (4.77%) 6 (0.42%) 562 (73.2%) 18 (1.91%) 23 (1.87%)
## Partly-developer 335 (84.2%) 0 (0.00%) 38 (4.95%) 898 (95.3%) 0 (0.00%)
## Hobbyist 19 (4.77%) 15 (1.06%) 45 (5.86%) 12 (1.27%) 19 (1.54%)
## Ex-developer 25 (6.28%) 35 (2.47%) 9 (1.17%) 14 (1.49%) 16 (1.30%)
## Employment: 0.000
## Employed 327 (82.2%) 1233 (87.0%) 25 (3.26%) 821 (87.2%) 1033 (83.8%)
## Partly or not employed 71 (17.8%) 184 (13.0%) 743 (96.7%) 121 (12.8%) 200 (16.2%)
## EdLevel: 0.000
## Bachelor degree 380 (95.5%) 0 (0.00%) 168 (21.9%) 0 (0.00%) 1124 (91.2%)
## Master of higher degree 0 (0.00%) 1325 (93.5%) 93 (12.1%) 901 (95.6%) 0 (0.00%)
## Pre-degree 18 (4.52%) 92 (6.49%) 507 (66.0%) 41 (4.35%) 109 (8.84%)
## YearsCode 0.86 (0.56) 0.70 (0.56) 0.96 (0.52) 0.77 (0.55) 0.92 (0.57) <0.001
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
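The comparison table above is probably more interesting to look at, because it puts all five clusters side by side.
So the clusters differ clearly on MainBranch, Employment and EdLevel, while their YearsCode means differ less (0.70 to 0.96, with standard deviations around 0.55). I think that is a fine solution.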
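The counts and column percentages in such a comparison table can also be reproduced with base R. The sketch below is an illustration on made-up data (`toy` is hypothetical) and is not the call that produced the table above; it cross-tabulates one variable against the cluster labels and converts the counts to percentages within each cluster.

```r
# Hypothetical cluster assignments and one categorical variable
toy <- data.frame(
  cluster = c(1, 1, 1, 2, 2, 2),
  branch  = c("Developer", "Developer", "Student", "Student", "Student", "Student")
)

# Counts of each category per cluster
counts <- table(toy$branch, toy$cluster)

# Percentages within each cluster (columns sum to 100)
shares <- round(100 * prop.table(counts, margin = 2), 1)

print(counts)
print(shares)
```

Reading the percentages column by column shows which category dominates each cluster, which is how I described the five clusters above.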
Other clustering method
[Source](https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/)
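library(factoextra)  # provides fviz_dend()
# hierarchical clustering on the same Gower distances, cut into 5 clusters
res.hc <- hclust(gower_dist, method = "ward.D2")
fviz_dend(res.hc, k = 5,
          cex = 0.5,
          color_labels_by_k = TRUE,
          rect = TRUE
)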
Looks nice and that’s all for now!
Later I will try to explore this topic further and understand more about the different techniques. Thank you for the course!