In class Exercise 3.1 - Spatially constrained clustering - Clustgeo method

Author

Allan Chong

Overview

In this hands-on exercise, we are interested to delineate Shan State, Myanmar into homogeneous regions by using multiple Information and Communication technology (ICT) measures, namely: Radio, Television, Land line phone, Mobile phone, Computer, and Internet at home.

This is a continuation from Hands On Exercise 4:

Skip to Spatially constrained clustering clustgeo method


Getting Started

Firstly, we load the required packages in R

  • Spatial data handling

    • sf, rgdal and spdep
  • Attribute data handling

    • knitr, tidyverse, funModeling especially readr, ggplot2 and dplyr
  • Choropleth mapping

    • tmap
  • Multivariate data visualization and analysis

    • coorplot, ggpubr, GGally and heatmaply
  • Cluster analysis

    • cluster

    • ClustGeo

pacman::p_load(knitr, rgdal, spdep, tmap, sf, 
               ggpubr, GGally, cluster, funModeling,
               factoextra, NbClust, ClustGeo,
               heatmaply, corrplot, psych, tidyverse)

Importing & preparing the data

Geospatial Data

In this section, we will import Myanmar Township Boundary GIS data and its associated attrbiute table into the R environment.

The Myanmar Township Boundary GIS data is in ESRI shapefile format. It will be imported into R environment by using the st_read() function of sf.

As we are only interested in Shan State, we will filter only values that represents the Shan State.

shan_sf = st_read(dsn="data/geospatial", layer="myanmar_township_boundaries") %>%
  filter(ST %in% c("Shan (East)", "Shan (North)", "Shan (South)"))
Reading layer `myanmar_township_boundaries' from data source 
  `D:\Allanckw\ISSS624\In-Class_Ex3\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 330 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 92.17275 ymin: 9.671252 xmax: 101.1699 ymax: 28.54554
Geodetic CRS:  WGS 84
shan_sf
Simple feature collection with 55 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 96.15107 ymin: 19.29932 xmax: 101.1699 ymax: 24.15907
Geodetic CRS:  WGS 84
First 10 features:
   OBJECTID           ST ST_PCODE       DT   DT_PCODE        TS  TS_PCODE
1       163 Shan (North)   MMR015  Mongmit MMR015D008   Mongmit MMR015017
2       203 Shan (South)   MMR014 Taunggyi MMR014D001   Pindaya MMR014006
3       240 Shan (South)   MMR014 Taunggyi MMR014D001   Ywangan MMR014007
4       106 Shan (South)   MMR014 Taunggyi MMR014D001  Pinlaung MMR014009
5        72 Shan (North)   MMR015  Mongmit MMR015D008    Mabein MMR015018
6        40 Shan (South)   MMR014 Taunggyi MMR014D001     Kalaw MMR014005
7       194 Shan (South)   MMR014 Taunggyi MMR014D001     Pekon MMR014010
8       159 Shan (South)   MMR014 Taunggyi MMR014D001  Lawksawk MMR014008
9        61 Shan (North)   MMR015  Kyaukme MMR015D003 Nawnghkio MMR015013
10      124 Shan (North)   MMR015  Kyaukme MMR015D003   Kyaukme MMR015012
                 ST_2            LABEL2 SELF_ADMIN ST_RG T_NAME_WIN T_NAME_M3
1  Shan State (North)    Mongmit\n61072       <NA> State   rdk;rdwf      မိုးမိတ်
2  Shan State (South)    Pindaya\n77769       Danu State     yif;w,     ပင်းတယ
3  Shan State (South)    Ywangan\n76933       Danu State      &GmiH       ရွာငံ
4  Shan State (South)  Pinlaung\n162537       Pa-O State  yifavmif;   ပင်လောင်း
5  Shan State (North)     Mabein\n35718       <NA> State     rbdrf;      မဘိမ်း
6  Shan State (South)     Kalaw\n163138       <NA> State       uavm      ကလော
7  Shan State (South)      Pekon\n94226       <NA> State     z,fcHk       ဖယ်ခုံ
8  Shan State (South)          Lawksawk       <NA> State   &yfapmuf    ရပ်စောက်
9  Shan State (North) Nawnghkio\n128357       <NA> State  aemifcsdK    နောင်ချို
10 Shan State (North)   Kyaukme\n172874       <NA> State   ausmufrJ    ကျောက်မဲ
       AREA                       geometry
1  2703.611 MULTIPOLYGON (((96.96001 23...
2   629.025 MULTIPOLYGON (((96.7731 21....
3  2984.377 MULTIPOLYGON (((96.78483 21...
4  3396.963 MULTIPOLYGON (((96.49518 20...
5  5034.413 MULTIPOLYGON (((96.66306 24...
6  1456.624 MULTIPOLYGON (((96.49518 20...
7  2073.513 MULTIPOLYGON (((97.14738 19...
8  5145.659 MULTIPOLYGON (((96.94981 22...
9  3271.537 MULTIPOLYGON (((96.75648 22...
10 3920.869 MULTIPOLYGON (((96.95498 22...

We can then use glimpse() to verify each field’s data type & available values.

glimpse(shan_sf)
Rows: 55
Columns: 15
$ OBJECTID   <dbl> 163, 203, 240, 106, 72, 40, 194, 159, 61, 124, 71, 155, 101…
$ ST         <chr> "Shan (North)", "Shan (South)", "Shan (South)", "Shan (Sout…
$ ST_PCODE   <chr> "MMR015", "MMR014", "MMR014", "MMR014", "MMR015", "MMR014",…
$ DT         <chr> "Mongmit", "Taunggyi", "Taunggyi", "Taunggyi", "Mongmit", "…
$ DT_PCODE   <chr> "MMR015D008", "MMR014D001", "MMR014D001", "MMR014D001", "MM…
$ TS         <chr> "Mongmit", "Pindaya", "Ywangan", "Pinlaung", "Mabein", "Kal…
$ TS_PCODE   <chr> "MMR015017", "MMR014006", "MMR014007", "MMR014009", "MMR015…
$ ST_2       <chr> "Shan State (North)", "Shan State (South)", "Shan State (So…
$ LABEL2     <chr> "Mongmit\n61072", "Pindaya\n77769", "Ywangan\n76933", "Pinl…
$ SELF_ADMIN <chr> NA, "Danu", "Danu", "Pa-O", NA, NA, NA, NA, NA, NA, NA, NA,…
$ ST_RG      <chr> "State", "State", "State", "State", "State", "State", "Stat…
$ T_NAME_WIN <chr> "rdk;rdwf", "yif;w,", "&GmiH", "yifavmif;", "rbdrf;", "uavm…
$ T_NAME_M3  <chr> "မိုးမိတ်", "ပင်းတယ", "ရွာငံ", "ပင်လောင်း", "မဘိမ်း", "ကလော", "ဖယ်ခုံ", "…
$ AREA       <dbl> 2703.611, 629.025, 2984.377, 3396.963, 5034.413, 1456.624, …
$ geometry   <MULTIPOLYGON [°]> MULTIPOLYGON (((96.96001 23..., MULTIPOLYGON (…

Aspatial Data

Loading the Data

To load the raw data file, we use the read_csv function The imported InfoComm variables are extracted from The 2014 Myanmar Population and Housing Census Myanmar. The attribute data set is called ict. It is saved in R’s * tibble data.frame* format.

We can view the summary statistics with summary()

ict = read_csv("data/aspatial/Shan-ICT.csv") 
summary(ict)
 District Pcode     District Name      Township Pcode     Township Name     
 Length:55          Length:55          Length:55          Length:55         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 Total households     Radio         Television    Land line phone 
 Min.   : 3318    Min.   :  115   Min.   :  728   Min.   :  20.0  
 1st Qu.: 8711    1st Qu.: 1260   1st Qu.: 3744   1st Qu.: 266.5  
 Median :13685    Median : 2497   Median : 6117   Median : 695.0  
 Mean   :18369    Mean   : 4487   Mean   :10183   Mean   : 929.9  
 3rd Qu.:23471    3rd Qu.: 6192   3rd Qu.:13906   3rd Qu.:1082.5  
 Max.   :82604    Max.   :30176   Max.   :62388   Max.   :6736.0  
  Mobile phone      Computer      Internet at home
 Min.   :  150   Min.   :  20.0   Min.   :   8.0  
 1st Qu.: 2037   1st Qu.: 121.0   1st Qu.:  88.0  
 Median : 3559   Median : 244.0   Median : 316.0  
 Mean   : 6470   Mean   : 575.5   Mean   : 760.2  
 3rd Qu.: 7177   3rd Qu.: 507.0   3rd Qu.: 630.5  
 Max.   :48461   Max.   :6705.0   Max.   :9746.0  

There are a total of 11 fields and 55 observation in the tibble data.frame.

Derive new variables with dplyr package

The number of households is used as the measurement unit for the values. The underlying total number of households will influence the results when these statistics are used directly. Typically, the townships with a larger proportion of total households will also have a larger proportion of homes with radio, TV, etc.

We shall calculate the penetration rate of each ICT variable to address this issue by dividing it by the total number of households and multiply by 1000 and adding it to the data frame by using mutate() of dplyr package and renaming the column using rename_with()

new_col_names = c('DT_PCODE', 'DT', 'TS_PCODE', 'TS', 'TT_HOUSEHOLDS', 'RADIO', 'TV', 'LLPHONE', 'MPHONE', 'COMPUTER',  'INTERNET')

old_col_names = c('District Pcode', 'District Name', 'Township Pcode', 'Township Name', 'Total households', 'Radio', 'Television', 'Land line phone', 'Mobile phone', 'Computer', 'Internet at home')

ict_derived = ict %>%
  mutate(`RADIO_PR` = `Radio`/`Total households`*1000) %>% #per thousand household
  mutate(`TV_PR` = `Television`/`Total households`*1000) %>%
  mutate(`LLPHONE_PR` = `Land line phone`/`Total households`*1000) %>%
  mutate(`MPHONE_PR` = `Mobile phone`/`Total households`*1000) %>%
  mutate(`COMPUTER_PR` = `Computer`/`Total households`*1000) %>%
  mutate(`INTERNET_PR` = `Internet at home`/`Total households`*1000) %>%
  rename_with(~ new_col_names, all_of(old_col_names)) 

Reviewing the summary statistics of the newly derived penetration rates

summary(ict_derived)
   DT_PCODE              DT              TS_PCODE              TS           
 Length:55          Length:55          Length:55          Length:55         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 TT_HOUSEHOLDS       RADIO             TV           LLPHONE      
 Min.   : 3318   Min.   :  115   Min.   :  728   Min.   :  20.0  
 1st Qu.: 8711   1st Qu.: 1260   1st Qu.: 3744   1st Qu.: 266.5  
 Median :13685   Median : 2497   Median : 6117   Median : 695.0  
 Mean   :18369   Mean   : 4487   Mean   :10183   Mean   : 929.9  
 3rd Qu.:23471   3rd Qu.: 6192   3rd Qu.:13906   3rd Qu.:1082.5  
 Max.   :82604   Max.   :30176   Max.   :62388   Max.   :6736.0  
     MPHONE         COMPUTER         INTERNET         RADIO_PR     
 Min.   :  150   Min.   :  20.0   Min.   :   8.0   Min.   : 21.05  
 1st Qu.: 2037   1st Qu.: 121.0   1st Qu.:  88.0   1st Qu.:138.95  
 Median : 3559   Median : 244.0   Median : 316.0   Median :210.95  
 Mean   : 6470   Mean   : 575.5   Mean   : 760.2   Mean   :215.68  
 3rd Qu.: 7177   3rd Qu.: 507.0   3rd Qu.: 630.5   3rd Qu.:268.07  
 Max.   :48461   Max.   :6705.0   Max.   :9746.0   Max.   :484.52  
     TV_PR         LLPHONE_PR       MPHONE_PR       COMPUTER_PR    
 Min.   :116.0   Min.   :  2.78   Min.   : 36.42   Min.   : 3.278  
 1st Qu.:450.2   1st Qu.: 22.84   1st Qu.:190.14   1st Qu.:11.832  
 Median :517.2   Median : 37.59   Median :305.27   Median :18.970  
 Mean   :509.5   Mean   : 51.09   Mean   :314.05   Mean   :24.393  
 3rd Qu.:606.4   3rd Qu.: 69.72   3rd Qu.:428.43   3rd Qu.:29.897  
 Max.   :842.5   Max.   :181.49   Max.   :735.43   Max.   :92.402  
  INTERNET_PR     
 Min.   :  1.041  
 1st Qu.:  8.617  
 Median : 22.829  
 Mean   : 30.644  
 3rd Qu.: 41.281  
 Max.   :117.985  

Exploratory Data Analysis (EDA)

EDA using statistical graphics

We can plot the distribution of the variables (i.e. Number of households with radio) by using appropriate Exploratory Data Analysis (EDA) methods by using functions in ggplot2. We will also place the mean and median lines with geom_vline

A Histogram is useful to identify the overall distribution of the data values (i.e. left skew, right skew or normal distribution)

#{r, fig.width=4, fig.height=4
ggplot(data = ict_derived, aes(x=`RADIO`)) + 
      geom_histogram(bins=30, color="black", fill="light blue") +
      
  labs(x = "Radio", y = "Frequency") +

  geom_vline(aes(xintercept = mean(ict_derived$RADIO)),   
               color="red", linetype="dashed", linewidth=1) +
  
  geom_vline(aes(xintercept=median(ict_derived$RADIO)),   
               color="brown", linetype="dashed", linewidth=1)

We can also use box plot to detect outliers

ggplot(data=ict_derived, 
       aes(x=`RADIO`)) +
  geom_boxplot(color="black", 
               fill="light blue")

From the boxplot, we can infer that there are 3 outliers, we can find the outliers and display them using kable() below from the code below

ict_derived_outliers_radio = ict_derived %>%
  filter(RADIO > 12000)
ict_derived_outliers_radio %>% select ('DT_PCODE', 'DT', 'TS_PCODE', 'TS', 'TT_HOUSEHOLDS', 'RADIO') %>%
  kable()
DT_PCODE DT TS_PCODE TS TT_HOUSEHOLDS RADIO
MMR014D001 Taunggyi MMR014001 Taunggyi 82604 30176
MMR014D001 Taunggyi MMR014002 Nyaungshwe 42634 13801
MMR015D001 Lashio MMR015001 Lashio 64932 15307

Next, we will plot the histogram of the newly derived variables (i.e. Radio penetration rate) by using the code below. We will also place the mean and median lines with geom_vline

ggplot(data = ict_derived, aes(x=`RADIO_PR`)) + 
      geom_histogram(bins=30, color="black", fill="light blue") +
      
  labs(x = "Radio", y = "Frequency") +

  geom_vline(aes(xintercept = mean(ict_derived$RADIO_PR)),   
               color="red", linetype="dashed", linewidth=1) +
  
  geom_vline(aes(xintercept=median(ict_derived$RADIO_PR)),   
               color="brown", linetype="dashed", linewidth=1)

From the histogram, we can tell it is positively skewed, with an outliers after the 450 mark.

We can also use boxplot to detect outliers

ggplot(data=ict_derived, 
       aes(x=`RADIO_PR`)) +
  geom_boxplot(color="black", 
               fill="light blue")

From the box plot, we can infer that there are 1 outlier, we can find the outlier and display it using kable() below from the code below

ict_derived_outliers_radio = ict_derived %>%
  filter(RADIO_PR > 450)
ict_derived_outliers_radio %>% select ('DT_PCODE', 'DT', 'TS_PCODE', 'TS', 'TT_HOUSEHOLDS', 'RADIO_PR') %>%
  kable()
DT_PCODE DT TS_PCODE TS TT_HOUSEHOLDS RADIO_PR
MMR014D001 Taunggyi MMR014007 Ywangan 18348 484.5215

In the figure below, multiple histograms are plotted to reveal the distribution of the selected variables in the ict_derived data.frame. First, We do this by creating all the histograms assigned to individual variables.

radio = ggplot(data=ict_derived, 
             aes(x= `RADIO_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "Radio", y = "Frequency") +
  geom_vline(aes(xintercept = mean(ict_derived$RADIO_PR)),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(ict_derived$RADIO_PR)),   
               color="brown", linetype="dashed", linewidth=1)


tv = ggplot(data=ict_derived, 
             aes(x= `TV_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "TV", y = "Frequency") +
  geom_vline(aes(xintercept = mean(ict_derived$TV_PR)),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(ict_derived$TV_PR)),   
               color="brown", linetype="dashed", linewidth=1)

llphone = ggplot(data=ict_derived, 
             aes(x= `LLPHONE_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  labs(x = "Land line phone", y = "Frequency") +
  geom_vline(aes(xintercept = mean(ict_derived$LLPHONE_PR)),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(ict_derived$LLPHONE_PR)),   
               color="brown", linetype="dashed", linewidth=1)

mphone = ggplot(data=ict_derived, 
             aes(x= `MPHONE_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")  +
  labs(x = "Mobile phone", y = "Frequency") +
  geom_vline(aes(xintercept = mean(ict_derived$MPHONE_PR)),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(ict_derived$MPHONE_PR)),   
               color="brown", linetype="dashed", linewidth=1)

computer = ggplot(data=ict_derived, 
             aes(x= `COMPUTER_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")  +
  labs(x = "Computer", y = "Frequency") +
  geom_vline(aes(xintercept = mean(ict_derived$COMPUTER_PR)),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(ict_derived$COMPUTER_PR)),   
               color="brown", linetype="dashed", linewidth=1)

internet = ggplot(data=ict_derived, 
             aes(x= `INTERNET_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")  +
  labs(x = "internet", y = "Frequency") +
  geom_vline(aes(xintercept = mean(ict_derived$INTERNET_PR)),   
               color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median(ict_derived$INTERNET_PR)),   
               color="brown", linetype="dashed", linewidth=1)

Next, ggarange() of ggpubr package is used to group these histograms together.

ggarrange(radio, tv, llphone, mphone, computer, internet, 
          ncol = 2, 
          nrow = 3)

From the chart, we can tell

  • Radio penetration rate is positively skewed

  • TV penetration rate is negatively skewed

  • Land line phone is penetration rate positively skewed

  • Mobile phone penetration rate is positively skewed

  • Computer penetration rate is positively skewed with a really long tail

  • Similarly, Internet penetration rate is positively skewed with a really long tail, the pattern of computer and internet follows the same pattern. It may be the case that people with computers will likely also have internet

EDA using choropleth map

Joining geospatial data with aspatial data

We must first integrate the geographical data object (shan_sf) and aspatial data (ict_derived) before we can create the choropleth map. object into a single frame.

To do this, the dplyr package’s left_join function will be used. We will use TS_PCode as the common variable to join the 2 tables

shan_sf = left_join(shan_sf, ict_derived, #geospatial file first
                     by=c("TS_PCODE"="TS_PCODE"))

A choropleth map will be created so we can quickly see how the radio penetration rate is distributed across Shan State at the township level.

The choropleth is prepared by utilizing the functions of the tmap package

ttm()
tm_shape(shan_sf) +
          tm_fill(col = "RADIO_PR", 
          style = "pretty",
          palette="PuRd",
          title = "RADIO_PR") +
  tm_borders(alpha = 0.5)

By creating two choropleth maps—one for the total number of households (i.e. TT HOUSEHOLDS.map) and one for the total number of households with radios—we can show that the distribution depicted in the choropleth map above is biased to the underlying total number of households at the townships (RADIO.map) with functions of the tmap package. The jenks style is used as it locates clusters of related values and emphasizes the distinctions between categories.

TT_HOUSEHOLDS.map = tm_shape(shan_sf) + 
  tm_fill(col = "TT_HOUSEHOLDS",
          n = 5,
          style = "jenks", 
          title = "Total households") + 
  tm_borders(alpha = 0.5) 

RADIO.map = tm_shape(shan_sf) + 
  tm_fill(col = "RADIO",
          n = 5,
          style = "jenks",
          title = "Number Radio ") + 
  tm_borders(alpha = 0.5) 

tmap_arrange(TT_HOUSEHOLDS.map, RADIO.map,
             asp=NA, ncol=2)

From the result, we can see from the choropleth maps above that townships with a higher proportion of households also have a higher proportion of radio owners, the summary statistics below shows that it the number is in fact in the 75th percentile

summary(ict_derived$RADIO)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    115    1260    2497    4487    6192   30176 

We will now plot the choropleth maps illustrating the distribution of the total number of households and the radio penetration rate.

RADIO_PR.map = tm_shape(shan_sf) + 
  tm_fill(col = "RADIO_PR",
          n = 5,
          style = "jenks",
          title = "Number Radio PR") + 
  tm_borders(alpha = 0.5) 

tmap_arrange(TT_HOUSEHOLDS.map, RADIO_PR.map,
             asp=NA, ncol=2)
summary(ict_derived$RADIO_PR)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.05  138.95  210.95  215.68  268.07  484.52 

The penetration rate is 235.7 radios per 1000 which is only between the 50th and 75th percentile of the sample.

Correlation Analysis

It is crucial that we ensure the cluster variables are not highly correlated before we conduct cluster analysis.

We will discover how to see and analyze the correlation of the input variables using the corrplot.mixed() (ref) function of the corrplot package. However we need to find the correlation matrix first with cor() and only use the variables we are interested in, which are in column 12 to 17.

cluster_vars.cor = cor(ict_derived[,12:17]) #convert to correlation matrix [,cols]

corrplot.mixed(cluster_vars.cor, 
               lower = "ellipse", 
               upper = "number", 
               tl.pos = "lt", 
               diag="l", 
               tl.col="black")

The correlation graphic above demonstrates the strong correlation between COMPUTER_PR and INTERNET_PR. This suggests that only one of them, rather than both, should be included in the cluster analysis.

Hierarchy Cluster Analysis

There are 4 steps to hierarchical cluster analysis

  1. Using a specific distance metric, determine the proximity matrix.
  2. Each data point has a cluster allocated to it.
  3. Combine the clusters based on a metric for cluster similarity.
  4. Update the distance matrix

Using a specific distance metric, determine the proximity matrix.

Extracting clustering variables

First we need to extract the clustering variables from the shan_sf simple feature object into data.frame. We do not include the variable INTERNET_PR as it has a strong correlation with the variable COMPUTER_PR

cluster_vars = shan_sf %>%
          st_set_geometry(NULL) %>% #drop geometric column as we it is not one of our clustering variables
            select("TS.x", "RADIO_PR", "TV_PR", "LLPHONE_PR", "MPHONE_PR", "COMPUTER_PR")
head(cluster_vars, 10)
        TS.x RADIO_PR    TV_PR LLPHONE_PR MPHONE_PR COMPUTER_PR
1    Mongmit 286.1852 554.1313   35.30618  260.6944    12.15939
2    Pindaya 417.4647 505.1300   19.83584  162.3917    12.88190
3    Ywangan 484.5215 260.5734   11.93591  120.2856     4.41465
4   Pinlaung 231.6499 541.7189   28.54454  249.4903    13.76255
5     Mabein 449.4903 708.6423   72.75255  392.6089    16.45042
6      Kalaw 280.7624 611.6204   42.06478  408.7951    29.63160
7      Pekon 318.6118 535.8494   39.83270  214.8476    18.97032
8   Lawksawk 387.1017 630.0035   31.51366  320.5686    21.76677
9  Nawnghkio 349.3359 547.9456   38.44960  323.0201    15.76465
10   Kyaukme 210.9548 601.1773   39.58267  372.4930    30.94709

The following step is to replace row number with township name in the rows and delete the TS.x field by selecting only the required columns (2 to 6) by using rows.names

The columns names must only be our clustering variables

row.names(cluster_vars) = cluster_vars$"TS.x"
head(cluster_vars,10)
               TS.x RADIO_PR    TV_PR LLPHONE_PR MPHONE_PR COMPUTER_PR
Mongmit     Mongmit 286.1852 554.1313   35.30618  260.6944    12.15939
Pindaya     Pindaya 417.4647 505.1300   19.83584  162.3917    12.88190
Ywangan     Ywangan 484.5215 260.5734   11.93591  120.2856     4.41465
Pinlaung   Pinlaung 231.6499 541.7189   28.54454  249.4903    13.76255
Mabein       Mabein 449.4903 708.6423   72.75255  392.6089    16.45042
Kalaw         Kalaw 280.7624 611.6204   42.06478  408.7951    29.63160
Pekon         Pekon 318.6118 535.8494   39.83270  214.8476    18.97032
Lawksawk   Lawksawk 387.1017 630.0035   31.51366  320.5686    21.76677
Nawnghkio Nawnghkio 349.3359 547.9456   38.44960  323.0201    15.76465
Kyaukme     Kyaukme 210.9548 601.1773   39.58267  372.4930    30.94709
shan_ict = select(cluster_vars, c(2:6))
head(shan_ict, 10)
          RADIO_PR    TV_PR LLPHONE_PR MPHONE_PR COMPUTER_PR
Mongmit   286.1852 554.1313   35.30618  260.6944    12.15939
Pindaya   417.4647 505.1300   19.83584  162.3917    12.88190
Ywangan   484.5215 260.5734   11.93591  120.2856     4.41465
Pinlaung  231.6499 541.7189   28.54454  249.4903    13.76255
Mabein    449.4903 708.6423   72.75255  392.6089    16.45042
Kalaw     280.7624 611.6204   42.06478  408.7951    29.63160
Pekon     318.6118 535.8494   39.83270  214.8476    18.97032
Lawksawk  387.1017 630.0035   31.51366  320.5686    21.76677
Nawnghkio 349.3359 547.9456   38.44960  323.0201    15.76465
Kyaukme   210.9548 601.1773   39.58267  372.4930    30.94709

Data Standardization

In most cases, cluster analysis will make use of many variables. Their differing value ranges are not uncommon. It is helpful to standardize the input variables before performing cluster analysis in order to prevent the cluster analysis result from being based on clustering variables with bias values.

Min-Max standardization

The code below uses the heatmaply package’s normalize() function to standardize the clustering variables using the Min-Max approach. he summary() function is used to show the summary statistics for the standardized clustering variables.

shan_ict.std_minmax = normalize(shan_ict)
summary(shan_ict.std_minmax)
    RADIO_PR          TV_PR          LLPHONE_PR       MPHONE_PR     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.2544   1st Qu.:0.4600   1st Qu.:0.1123   1st Qu.:0.2199  
 Median :0.4097   Median :0.5523   Median :0.1948   Median :0.3846  
 Mean   :0.4199   Mean   :0.5416   Mean   :0.2703   Mean   :0.3972  
 3rd Qu.:0.5330   3rd Qu.:0.6750   3rd Qu.:0.3746   3rd Qu.:0.5608  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
  COMPUTER_PR     
 Min.   :0.00000  
 1st Qu.:0.09598  
 Median :0.17607  
 Mean   :0.23692  
 3rd Qu.:0.29868  
 Max.   :1.00000  

The values range of the Min-max standardized clustering variables are between 0 and 1 now.

Z-score standardization

The Base R function scale() (ref) makes standardizing Z-scores simple. The Z-score approach will be used to standardize the clustering variables below. We use the describe() function of the psych package here because we want to look at the standard deviation of the variable

shan_ict.std_z = scale(shan_ict)
describe(shan_ict.std_minmax)
            vars  n mean   sd median trimmed  mad min max range  skew kurtosis
RADIO_PR       1 55 0.42 0.23   0.41    0.41 0.21   0   1     1  0.48    -0.27
TV_PR          2 55 0.54 0.22   0.55    0.55 0.17   0   1     1 -0.38    -0.23
LLPHONE_PR     3 55 0.27 0.23   0.19    0.24 0.15   0   1     1  1.37     1.49
MPHONE_PR      4 55 0.40 0.25   0.38    0.38 0.25   0   1     1  0.48    -0.34
COMPUTER_PR    5 55 0.24 0.23   0.18    0.20 0.15   0   1     1  1.80     2.96
              se
RADIO_PR    0.03
TV_PR       0.03
LLPHONE_PR  0.03
MPHONE_PR   0.03
COMPUTER_PR 0.03

Note: Z-score standardization method should only be used if we would assume all variables come from some normal distribution.

Visualising the standardize clustering variables

It is a good idea to visualize the distribution graphical of the standardized clustering variables in addition to evaluating the summary statistics of those variables.

r = ggplot(data=ict_derived, aes(x=`RADIO_PR`)) + 
    geom_histogram(bins=20, color="black", fill="light blue") +
      
  labs(x = "Radio", y = "Frequency") +

  geom_vline(aes(xintercept = mean(ict_derived$RADIO_PR)),   
               color="red", linetype="dashed", linewidth=1) +
  
  geom_vline(aes(xintercept=median(ict_derived$RADIO_PR)),   
               color="brown", linetype="dashed", linewidth=1)

shan_ict_s_df = as.data.frame(shan_ict.std_minmax)

s = ggplot(data=shan_ict_s_df, aes(x=`RADIO_PR`)) + 
    geom_histogram(bins=20, color="black", fill="light blue") +
      
  labs(x = "Radio", y = "Frequency") +

  geom_vline(aes(xintercept = mean(shan_ict_s_df$RADIO_PR)),   
               color="red", linetype="dashed", linewidth=1) +
  
  geom_vline(aes(xintercept=median(shan_ict_s_df$RADIO_PR)),   
               color="brown", linetype="dashed", linewidth=1) + ggtitle("Min-Max Standardization")


shan_ict_z_df = as.data.frame(shan_ict.std_z)

z = ggplot(data=shan_ict_z_df, aes(x=`RADIO_PR`)) + 
    geom_histogram(bins=20, color="black", fill="light blue") +
      
  labs(x = "Radio", y = "Frequency") +

  geom_vline(aes(xintercept = mean(shan_ict_z_df$RADIO_PR)),   
               color="red", linetype="dashed", linewidth=1) +
  
  geom_vline(aes(xintercept=median(shan_ict_z_df$RADIO_PR)),   
               color="brown", linetype="dashed", linewidth=1) + ggtitle("Z-score Standardization")

ggarrange(r, s, z,
          ncol = 2,
          nrow = 2)

Keep in mind that following data standardization, the clustering variables’ general distribution will change. Therefore, it is advised against performing data standardization if the clustering variables’ range of values is not particularly wide.

Determine the proximity matrix.

Numerous packages in R offer routines to compute distance matrices. With R’s dist() function, we shall compute the proximity matrix.

The six distance proximity calculations that are supported by dist() are the euclidean, maximum, manhattan, canberra, binary, and minkowski methods. Euclidean proximity matrix is the default.

proxmat = dist(shan_ict, method="euclidean")
proxmat 
             Mongmit   Pindaya   Ywangan  Pinlaung    Mabein     Kalaw
Pindaya    171.86828                                                  
Ywangan    381.88259 257.31610                                        
Pinlaung    57.46286 208.63519 400.05492                              
Mabein     263.37099 313.45776 529.14689 312.66966                    
Kalaw      160.05997 302.51785 499.53297 181.96406 198.14085          
Pekon       59.61977 117.91580 336.50410  94.61225 282.26877 211.91531
Lawksawk   140.11550 204.32952 432.16535 192.57320 130.36525 140.01101
Nawnghkio   89.07103 180.64047 377.87702 139.27495 204.63154 127.74787
Kyaukme    144.02475 311.01487 505.89191 139.67966 264.88283  79.42225
Muse       563.01629 704.11252 899.44137 571.58335 453.27410 412.46033
Laihka     141.87227 298.61288 491.83321 101.10150 345.00222 197.34633
Mongnai    115.86190 258.49346 422.71934  64.52387 358.86053 200.34668
Mawkmai    434.92968 437.99577 397.03752 398.11227 693.24602 562.59200
Kutkai      97.61092 212.81775 360.11861  78.07733 340.55064 204.93018
Mongton    192.67961 283.35574 361.23257 163.42143 425.16902 267.87522
Mongyai    256.72744 287.41816 333.12853 220.56339 516.40426 386.74701
Mongkaing  503.61965 481.71125 364.98429 476.29056 747.17454 625.24500
Lashio     251.29457 398.98167 602.17475 262.51735 231.28227 106.69059
Mongpan    193.32063 335.72896 483.68125 192.78316 301.52942 114.69105
Matman     401.25041 354.39039 255.22031 382.40610 637.53975 537.63884
Tachileik  529.63213 635.51774 807.44220 555.01039 365.32538 373.64459
Narphan    406.15714 474.50209 452.95769 371.26895 630.34312 463.53759
Mongkhet   349.45980 391.74783 408.97731 305.86058 610.30557 465.52013
Hsipaw     118.18050 245.98884 388.63147  76.55260 366.42787 212.36711
Monghsat   214.20854 314.71506 432.98028 160.44703 470.48135 317.96188
Mongmao    242.54541 402.21719 542.85957 217.58854 384.91867 195.18913
Nansang    104.91839 275.44246 472.77637  85.49572 287.92364 124.30500
Laukkaing  568.27732 726.85355 908.82520 563.81750 520.67373 427.77791
Pangsang   272.67383 428.24958 556.82263 244.47146 418.54016 224.03998
Namtu      179.62251 225.40822 444.66868 170.04533 366.16094 307.27427
Monghpyak  177.76325 221.30579 367.44835 222.20020 212.69450 167.08436
Konkyan    403.39082 500.86933 528.12533 365.44693 613.51206 444.75859
Mongping   265.12574 310.64850 337.94020 229.75261 518.16310 375.64739
Hopong     136.93111 223.06050 352.85844  98.14855 398.00917 264.16294
Nyaungshwe  99.38590 216.52463 407.11649 138.12050 210.21337  95.66782
Hsihseng   131.49728 172.00796 342.91035 111.61846 381.20187 287.11074
Mongla     384.30076 549.42389 728.16301 372.59678 406.09124 260.26411
Hseni      189.37188 337.98982 534.44679 204.47572 213.61240  38.52842
Kunlong    224.12169 355.47066 531.63089 194.76257 396.61508 273.01375
Hopang     281.05362 443.26362 596.19312 265.96924 368.55167 185.14704
Namhkan    386.02794 543.81859 714.43173 382.78835 379.56035 246.39577
Kengtung   246.45691 385.68322 573.23173 263.48638 219.47071  88.29335
Langkho    164.26299 323.28133 507.78892 168.44228 253.84371  67.19580
Monghsu    109.15790 198.35391 340.42789  80.86834 367.19820 237.34578
Taunggyi   399.84278 503.75471 697.98323 429.54386 226.24011 252.26066
Pangwaun   381.51246 512.13162 580.13146 356.37963 523.44632 338.35194
Kyethi     202.92551 175.54012 287.29358 189.47065 442.07679 360.17247
Loilen     145.48666 293.61143 469.51621  91.56527 375.06406 217.19877
Manton     430.64070 402.42888 306.16379 405.83081 674.01120 560.16577
Mongyang   309.51302 475.93982 630.71590 286.03834 411.88352 233.56349
Kunhing    173.50424 318.23811 449.67218 141.58836 375.82140 197.63683
Mongyawng  214.21738 332.92193 570.56521 235.55497 193.49994 173.43078
Tangyan    195.92520 208.43740 324.77002 169.50567 448.59948 348.06617
Namhsan    237.78494 228.41073 286.16305 214.33352 488.33873 385.88676
               Pekon  Lawksawk Nawnghkio   Kyaukme      Muse    Laihka
Pindaya                                                               
Ywangan                                                               
Pinlaung                                                              
Mabein                                                                
Kalaw                                                                 
Pekon                                                                 
Lawksawk   157.51129                                                  
Nawnghkio  113.15370  90.82891                                        
Kyaukme    202.12206 186.29066 157.04230                              
Muse       614.56144 510.13288 533.68806 434.75768                    
Laihka     182.23667 246.74469 211.88187 128.24979 526.65211          
Mongnai    151.60031 241.71260 182.21245 142.45669 571.97975 100.53457
Mawkmai    416.00669 567.52693 495.15047 512.02846 926.93007 429.96554
Kutkai     114.98048 224.64646 147.44053 170.93318 592.90743 144.67198
Mongton    208.14888 311.07742 225.81118 229.28509 634.71074 212.07320
Mongyai    242.52301 391.26989 319.57938 339.27780 763.91399 264.13364
Mongkaing  480.23965 625.18712 546.69447 586.05094 995.66496 522.96309
Lashio     303.80011 220.75270 230.55346 129.95255 313.15288 238.64533
Mongpan    243.30037 228.54223 172.84425 110.37831 447.49969 210.76951
Matman     368.25761 515.39711 444.05061 505.52285 929.11283 443.25453
Tachileik  573.39528 441.82621 470.45533 429.15493 221.19950 549.08985
Narphan    416.84901 523.69580 435.59661 420.30003 770.40234 392.32592
Mongkhet   342.08722 487.41102 414.10280 409.03553 816.44931 324.97428
Hsipaw     145.37542 249.35081 176.09570 163.95741 591.03355 128.42987
Monghsat   225.64279 352.31496 289.83220 253.25370 663.76026 158.93517
Mongmao    293.70625 314.64777 257.76465 146.09228 451.82530 185.99082
Nansang    160.37607 188.78869 151.13185  60.32773 489.35308  78.78999
Laukkaing  624.82399 548.83928 552.65554 428.74978 149.26996 507.39700
Pangsang   321.81214 345.91486 287.10769 175.35273 460.24292 214.19291
Namtu      165.02707 260.95300 257.52713 270.87277 659.16927 185.86794
Monghpyak  190.93173 142.31691  93.03711 217.64419 539.43485 293.22640
Konkyan    421.48797 520.31264 439.34272 393.79911 704.86973 351.75354
Mongping   259.68288 396.47081 316.14719 330.28984 744.44948 272.82761
Hopong     138.86577 274.91604 204.88286 218.84211 648.68011 157.48857
Nyaungshwe 139.31874 104.17830  43.26545 126.50414 505.88581 201.71653
Hsihseng   105.30573 257.11202 209.88026 250.27059 677.66886 175.89761
Mongla     441.20998 393.18472 381.40808 241.58966 256.80556 315.93218
Hseni      243.98001 171.50398 164.05304  81.20593 381.30567 204.49010
Kunlong    249.36301 318.30406 285.04608 215.63037 547.24297 122.68682
Hopang     336.38582 321.16462 279.84188 154.91633 377.44407 230.78652
Namhkan    442.77120 379.41126 367.33575 247.81990 238.67060 342.43665
Kengtung   297.67761 209.38215 208.29647 136.23356 330.08211 258.23950
Langkho    219.21623 190.30257 156.51662  51.67279 413.64173 160.94435
Monghsu    113.84636 242.04063 170.09168 200.77712 633.21624 163.28926
Taunggyi   440.66133 304.96838 344.79200 312.60547 250.81471 425.36916
Pangwaun   423.81347 453.02765 381.67478 308.31407 541.97887 351.78203
Kyethi     162.43575 317.74604 267.21607 328.14177 757.16745 255.83275
Loilen     181.94596 265.29318 219.26405 146.92675 560.43400  59.69478
Manton     403.82131 551.13000 475.77296 522.86003 941.49778 458.30232
Mongyang   363.58788 363.37684 323.32123 188.59489 389.59919 229.71502
Kunhing    213.46379 278.68953 206.15773 145.00266 533.00162 142.03682
Mongyawng  248.43910 179.07229 220.61209 181.55295 422.37358 211.99976
Tangyan    167.79937 323.14701 269.07880 306.78359 736.93741 224.29176
Namhsan    207.16559 362.84062 299.74967 347.85944 778.52971 273.79672
             Mongnai   Mawkmai    Kutkai   Mongton   Mongyai Mongkaing
Pindaya                                                               
Ywangan                                                               
Pinlaung                                                              
Mabein                                                                
Kalaw                                                                 
Pekon                                                                 
Lawksawk                                                              
Nawnghkio                                                             
Kyaukme                                                               
Muse                                                                  
Laihka                                                                
Mongnai                                                               
Mawkmai    374.50873                                                  
Kutkai      91.15307 364.95519                                        
Mongton    131.67061 313.35220 107.06341                              
Mongyai    203.23607 178.70499 188.94166 159.79790                    
Mongkaing  456.00842 133.29995 428.96133 365.50032 262.84016          
Lashio     270.86983 638.60773 289.82513 347.11584 466.36472 708.65819
Mongpan    178.09554 509.99632 185.18173 200.31803 346.39710 563.56780
Matman     376.33870 147.83545 340.86349 303.04574 186.95158 135.51424
Tachileik  563.95232 919.38755 568.99109 608.76740 750.29555 967.14087
Narphan    329.31700 273.75350 314.27683 215.97925 248.82845 285.65085
Mongkhet   275.76855 115.58388 273.91673 223.22828 104.98924 222.60577
Hsipaw      52.68195 351.34601  51.46282  90.69766 177.33790 423.77868
Monghsat   125.25968 275.09705 154.32012 150.98053 127.35225 375.60376
Mongmao    188.29603 485.52853 204.69232 206.57001 335.61300 552.31959
Nansang     92.79567 462.41938 130.04549 199.58124 288.55962 542.16609
Laukkaing  551.56800 882.51110 580.38112 604.66190 732.68347 954.11795
Pangsang   204.25746 484.14757 228.33583 210.77938 343.30638 548.40662
Namtu      209.35473 427.95451 225.28268 308.71751 278.02761 525.04057
Monghpyak  253.26470 536.71695 206.61627 258.04282 370.01575 568.21089
Konkyan    328.82831 339.01411 310.60810 248.25265 287.87384 380.92091
Mongping   202.99615 194.31049 182.75266 119.86993  65.38727 257.18572
Hopong      91.53795 302.84362  73.45899 106.21031 124.62791 379.37916
Nyaungshwe 169.63695 502.99026 152.15482 219.72196 327.13541 557.32112
Hsihseng   142.36728 329.29477 128.21054 194.64317 162.27126 411.59788
Mongla     354.10985 686.88950 388.40984 411.06668 535.28615 761.48327
Hseni      216.81639 582.53670 229.37894 286.75945 408.23212 648.04408
Kunlong    202.92529 446.53763 204.54010 270.02165 299.36066 539.91284
Hopang     243.00945 561.24281 263.31986 273.50305 408.73288 626.17673
Namhkan    370.05669 706.47792 392.48568 414.53594 550.62819 771.39688
Kengtung   272.28711 632.54638 279.19573 329.38387 460.39706 692.74693
Langkho    174.67678 531.08019 180.51419 236.70878 358.95672 597.42714
Monghsu     84.11238 332.07962  62.60859 107.04894 154.86049 400.71816
Taunggyi   448.55282 810.74692 450.33382 508.40925 635.94105 866.21117
Pangwaun   312.13429 500.68857 321.80465 257.50434 394.07696 536.95736
Kyethi     210.50453 278.85535 184.23422 222.52947 137.79420 352.06533
Loilen      58.41263 388.73386 131.56529 176.16001 224.79239 482.18190
Manton     391.54062 109.08779 361.82684 310.20581 195.59882  81.75337
Mongyang   260.39387 558.83162 285.33223 295.60023 414.31237 631.91325
Kunhing    110.55197 398.43973 108.84990 114.03609 238.99570 465.03971
Mongyawng  275.77546 620.04321 281.03383 375.22688 445.78964 700.98284
Tangyan    180.37471 262.66006 166.61820 198.88460 109.08506 348.56123
Namhsan    218.10003 215.19289 191.32762 196.76188  77.35900 288.66231
              Lashio   Mongpan    Matman Tachileik   Narphan  Mongkhet
Pindaya                                                               
Ywangan                                                               
Pinlaung                                                              
Mabein                                                                
Kalaw                                                                 
Pekon                                                                 
Lawksawk                                                              
Nawnghkio                                                             
Kyaukme                                                               
Muse                                                                  
Laihka                                                                
Mongnai                                                               
Mawkmai                                                               
Kutkai                                                                
Mongton                                                               
Mongyai                                                               
Mongkaing                                                             
Lashio                                                                
Mongpan    172.33279                                                  
Matman     628.11049 494.81014                                        
Tachileik  311.95286 411.03849 890.12935                              
Narphan    525.63854 371.13393 312.05193 760.29566                    
Mongkhet   534.44463 412.17123 203.02855 820.50164 217.28718          
Hsipaw     290.86435 179.52054 344.45451 576.18780 295.40170 253.80950
Monghsat   377.86793 283.30992 313.59911 677.09508 278.21548 167.98445
Mongmao    214.23677 131.59966 501.59903 472.95568 331.42618 375.35820
Nansang    184.47950 144.77393 458.06573 486.77266 398.13308 360.99219
Laukkaing  334.65738 435.58047 903.72094 325.06329 708.82887 769.06406
Pangsang   236.72516 140.23910 506.29940 481.31907 316.30314 375.58139
Namtu      365.88437 352.91394 416.65397 659.56458 494.36143 355.99713
Monghpyak  262.09281 187.85699 470.46845 444.04411 448.40651 462.63265
Konkyan    485.51312 365.87588 392.40306 730.92980 158.82353 254.24424
Mongping   454.52548 318.47482 201.65224 727.08969 188.64567 113.80917
Hopong     345.31042 239.43845 291.84351 632.45718 294.40441 212.99485
Nyaungshwe 201.58191 137.29734 460.91883 445.81335 427.94086 417.08639
Hsihseng   369.00833 295.87811 304.02806 658.87060 377.52977 256.70338
Mongla     179.95877 253.20001 708.17595 347.33155 531.46949 574.40292
Hseni       79.41836 120.66550 564.64051 354.90063 474.12297 481.88406
Kunlong    295.23103 288.03320 468.27436 595.70536 413.07823 341.68641
Hopang     170.63913 135.62913 573.55355 403.82035 397.85908 451.51070
Namhkan    173.27153 240.34131 715.42102 295.91660 536.85519 596.19944
Kengtung    59.85893 142.21554 613.01033 295.90429 505.40025 531.35998
Langkho    115.18145  94.98486 518.86151 402.33622 420.65204 428.08061
Monghsu    325.71557 216.25326 308.13805 605.02113 311.92379 247.73318
Taunggyi   195.14541 319.81385 778.45810 150.84117 684.20905 712.80752
Pangwaun   362.45608 232.52209 523.43600 540.60474 264.64997 407.02947
Kyethi     447.10266 358.89620 233.83079 728.87329 374.90376 233.25039
Loilen     268.92310 207.25000 406.56282 573.75476 354.79137 284.76895
Manton     646.66493 507.96808  59.52318 910.23039 280.26395 181.33894
Mongyang   209.33700 194.93467 585.61776 448.79027 401.39475 445.40621
Kunhing    255.10832 137.85278 403.66587 532.26397 281.62645 292.49814
Mongyawng  172.70139 275.15989 601.80824 432.10118 572.76394 522.91815
Tangyan    429.84475 340.39128 242.78233 719.84066 348.84991 201.49393
Namhsan    472.04024 364.77086 180.09747 754.03913 316.54695 170.90848
              Hsipaw  Monghsat   Mongmao   Nansang Laukkaing  Pangsang
Pindaya                                                               
Ywangan                                                               
Pinlaung                                                              
Mabein                                                                
Kalaw                                                                 
Pekon                                                                 
Lawksawk                                                              
Nawnghkio                                                             
Kyaukme                                                               
Muse                                                                  
Laihka                                                                
Mongnai                                                               
Mawkmai                                                               
Kutkai                                                                
Mongton                                                               
Mongyai                                                               
Mongkaing                                                             
Lashio                                                                
Mongpan                                                               
Matman                                                                
Tachileik                                                             
Narphan                                                               
Mongkhet                                                              
Hsipaw                                                                
Monghsat   121.78922                                                  
Mongmao    185.99483 247.17708                                        
Nansang    120.24428 201.92690 164.99494                              
Laukkaing  569.06099 626.44910 404.00848 480.60074                    
Pangsang   205.04337 256.37933  57.60801 193.36162 408.04016          
Namtu      229.44658 231.78673 365.03882 217.61884 664.06286 392.97391
Monghpyak  237.67919 356.84917 291.88846 227.52638 565.84279 315.11651
Konkyan    296.74316 268.25060 281.87425 374.70456 635.92043 274.81900
Mongping   168.92101 140.95392 305.57166 287.36626 708.13447 308.33123
Hopong      62.86179 100.45714 244.16253 167.66291 628.48557 261.51075
Nyaungshwe 169.92664 286.37238 230.45003 131.18943 520.24345 257.77823
Hsihseng   136.54610 153.49551 311.98001 193.53779 670.74564 335.52974
Mongla     373.47509 429.00536 216.24705 289.45119 202.55831 217.88123
Hseni      231.48538 331.22632 184.67099 136.45492 391.74585 214.66375
Kunlong    205.10051 202.31862 224.43391 183.01388 521.88657 258.49342
Hopang     248.72536 317.64824  78.29342 196.47091 331.67199  92.57672
Namhkan    382.79302 455.10875 223.32205 302.89487 196.46063 231.38484
Kengtung   284.08582 383.72138 207.58055 193.67980 351.48520 229.85484
Langkho    183.05109 279.52329 134.50170  99.39859 410.41270 167.65920
Monghsu     58.55724 137.24737 242.43599 153.59962 619.01766 260.52971
Taunggyi   462.31183 562.88102 387.33906 365.04897 345.98041 405.59730
Pangwaun   298.12447 343.53898 187.40057 326.12960 470.63605 157.48757
Kyethi     195.17677 190.50609 377.89657 273.02385 749.99415 396.89963
Loilen      98.04789 118.65144 190.26490  94.23028 535.57527 207.94433
Manton     359.60008 317.15603 503.79786 476.55544 907.38406 504.75214
Mongyang   267.10497 312.64797  91.06281 218.49285 326.19219 108.37735
Kunhing     90.77517 165.38834 103.91040 128.20940 500.41640 123.18870
Mongyawng  294.70967 364.40429 296.40789 191.11990 454.80044 336.16703
Tangyan    167.69794 144.59626 347.14183 249.70235 722.40954 364.76893
Namhsan    194.47928 169.56962 371.71448 294.16284 760.45960 385.65526
               Namtu Monghpyak   Konkyan  Mongping    Hopong Nyaungshwe
Pindaya                                                                
Ywangan                                                                
Pinlaung                                                               
Mabein                                                                 
Kalaw                                                                  
Pekon                                                                  
Lawksawk                                                               
Nawnghkio                                                              
Kyaukme                                                                
Muse                                                                   
Laihka                                                                 
Mongnai                                                                
Mawkmai                                                                
Kutkai                                                                 
Mongton                                                                
Mongyai                                                                
Mongkaing                                                              
Lashio                                                                 
Mongpan                                                                
Matman                                                                 
Tachileik                                                              
Narphan                                                                
Mongkhet                                                               
Hsipaw                                                                 
Monghsat                                                               
Mongmao                                                                
Nansang                                                                
Laukkaing                                                              
Pangsang                                                               
Namtu                                                                  
Monghpyak  346.57799                                                   
Konkyan    478.37690 463.39594                                         
Mongping   321.66441 354.76537 242.02901                               
Hopong     206.82668 267.95563 304.49287 134.00139                     
Nyaungshwe 271.41464 103.97300 432.35040 319.32583 209.32532           
Hsihseng   131.89940 285.37627 383.49700 199.64389  91.65458  225.80242
Mongla     483.49434 408.03397 468.09747 512.61580 432.31105  347.60273
Hseni      327.41448 200.26876 448.84563 395.58453 286.41193  130.86310
Kunlong    233.60474 357.44661 329.11433 309.05385 219.06817  285.13095
Hopang     408.24516 304.26577 348.18522 379.27212 309.77356  247.19891
Namhkan    506.32466 379.50202 481.59596 523.74815 444.13246  333.32428
Kengtung   385.33554 221.47613 474.82621 442.80821 340.47382  177.75714
Langkho    305.03473 200.27496 386.95022 343.96455 239.63685  128.26577
Monghsu    209.64684 232.17823 331.72187 158.90478  43.40665  173.82799
Taunggyi   518.72748 334.17439 650.56905 621.53039 513.76415  325.09619
Pangwaun   517.03554 381.95144 263.97576 340.37881 346.00673  352.92324
Kyethi     186.90932 328.16234 400.10989 187.43974 136.49038  288.06872
Loilen     194.24075 296.99681 334.19820 231.99959 124.74445  206.40432
Manton     448.58230 502.20840 366.66876 200.48082 310.58885  488.79874
Mongyang   413.26052 358.17599 329.39338 387.80686 323.35704  294.29500
Kunhing    296.43996 250.74435 253.74202 212.59619 145.15617  189.97131
Mongyawng  262.24331 285.56475 522.38580 455.59190 326.59925  218.12104
Tangyan    178.69483 335.26416 367.46064 161.67411 106.82328  284.14692
Namhsan    240.95555 352.70492 352.20115 130.23777 132.70541  315.91750
            Hsihseng    Mongla     Hseni   Kunlong    Hopang   Namhkan
Pindaya                                                               
Ywangan                                                               
Pinlaung                                                              
Mabein                                                                
Kalaw                                                                 
Pekon                                                                 
Lawksawk                                                              
Nawnghkio                                                             
Kyaukme                                                               
Muse                                                                  
Laihka                                                                
Mongnai                                                               
Mawkmai                                                               
Kutkai                                                                
Mongton                                                               
Mongyai                                                               
Mongkaing                                                             
Lashio                                                                
Mongpan                                                               
Matman                                                                
Tachileik                                                             
Narphan                                                               
Mongkhet                                                              
Hsipaw                                                                
Monghsat                                                              
Mongmao                                                               
Nansang                                                               
Laukkaing                                                             
Pangsang                                                              
Namtu                                                                 
Monghpyak                                                             
Konkyan                                                               
Mongping                                                              
Hopong                                                                
Nyaungshwe                                                            
Hsihseng                                                              
Mongla     478.66210                                                  
Hseni      312.74375 226.82048                                        
Kunlong    231.85967 346.46200 276.19175                              
Hopang     370.01334 147.02444 162.80878 271.34451                    
Namhkan    492.09476  77.21355 212.11323 375.73885 146.18632          
Kengtung   370.72441 202.45004  66.12817 317.14187 164.29921 175.63015
Langkho    276.27441 229.01675  66.66133 224.52741 134.24847 224.40029
Monghsu     97.82470 424.51868 262.28462 239.89665 301.84458 431.32637
Taunggyi   528.14240 297.09863 238.19389 471.29032 329.95252 257.29147
Pangwaun   433.06326 319.18643 330.70182 392.45403 206.98364 310.44067
Kyethi      84.04049 556.02500 388.33498 298.55859 440.48114 567.86202
Loilen     158.84853 338.67408 227.10984 166.53599 242.89326 364.90647
Manton     334.87758 712.51416 584.63341 479.76855 577.52046 721.86149
Mongyang   382.59743 146.66661 210.19929 247.22785  69.25859 167.72448
Kunhing    220.15490 306.47566 206.47448 193.77551 172.96164 314.92119
Mongyawng  309.51462 315.57550 173.86004 240.39800 290.51360 321.21112
Tangyan     70.27241 526.80849 373.07575 268.07983 412.22167 542.64078
Namhsan    125.74240 564.02740 411.96125 310.40560 440.51555 576.42717
            Kengtung   Langkho   Monghsu  Taunggyi  Pangwaun    Kyethi
Pindaya                                                               
Ywangan                                                               
Pinlaung                                                              
Mabein                                                                
Kalaw                                                                 
Pekon                                                                 
Lawksawk                                                              
Nawnghkio                                                             
Kyaukme                                                               
Muse                                                                  
Laihka                                                                
Mongnai                                                               
Mawkmai                                                               
Kutkai                                                                
Mongton                                                               
Mongyai                                                               
Mongkaing                                                             
Lashio                                                                
Mongpan                                                               
Matman                                                                
Tachileik                                                             
Narphan                                                               
Mongkhet                                                              
Hsipaw                                                                
Monghsat                                                              
Mongmao                                                               
Nansang                                                               
Laukkaing                                                             
Pangsang                                                              
Namtu                                                                 
Monghpyak                                                             
Konkyan                                                               
Mongping                                                              
Hopong                                                                
Nyaungshwe                                                            
Hsihseng                                                              
Mongla                                                                
Hseni                                                                 
Kunlong                                                               
Hopang                                                                
Namhkan                                                               
Kengtung                                                              
Langkho    107.16213                                                  
Monghsu    316.91914 221.84918                                        
Taunggyi   186.28225 288.27478 486.91951                              
Pangwaun   337.48335 295.38434 343.38498 497.61245                    
Kyethi     444.26274 350.91512 146.61572 599.57407 476.62610          
Loilen     282.22935 184.10672 131.55208 455.91617 331.69981 232.32965
Manton     631.99123 535.95620 330.76503 803.08034 510.79265 272.03299
Mongyang   217.08047 175.35413 323.95988 374.58247 225.25026 453.86726
Kunhing    245.95083 146.38284 146.78891 429.98509 229.09986 278.95182
Mongyawng  203.87199 186.11584 312.85089 287.73864 475.33116 387.71518
Tangyan    429.95076 332.02048 127.42203 592.65262 447.05580  47.79331
Namhsan    466.20497 368.20978 153.22576 631.49232 448.58030  68.67929
              Loilen    Manton  Mongyang   Kunhing Mongyawng   Tangyan
Pindaya                                                               
Ywangan                                                               
Pinlaung                                                              
Mabein                                                                
Kalaw                                                                 
Pekon                                                                 
Lawksawk                                                              
Nawnghkio                                                             
Kyaukme                                                               
Muse                                                                  
Laihka                                                                
Mongnai                                                               
Mawkmai                                                               
Kutkai                                                                
Mongton                                                               
Mongyai                                                               
Mongkaing                                                             
Lashio                                                                
Mongpan                                                               
Matman                                                                
Tachileik                                                             
Narphan                                                               
Mongkhet                                                              
Hsipaw                                                                
Monghsat                                                              
Mongmao                                                               
Nansang                                                               
Laukkaing                                                             
Pangsang                                                              
Namtu                                                                 
Monghpyak                                                             
Konkyan                                                               
Mongping                                                              
Hopong                                                                
Nyaungshwe                                                            
Hsihseng                                                              
Mongla                                                                
Hseni                                                                 
Kunlong                                                               
Hopang                                                                
Namhkan                                                               
Kengtung                                                              
Langkho                                                               
Monghsu                                                               
Taunggyi                                                              
Pangwaun                                                              
Kyethi                                                                
Loilen                                                                
Manton     419.06087                                                  
Mongyang   246.76592 585.70558                                        
Kunhing    130.39336 410.49230 188.89405                              
Mongyawng  261.75211 629.43339 304.21734 295.35984                    
Tangyan    196.60826 271.82672 421.06366 249.74161 377.52279          
Namhsan    242.15271 210.48485 450.97869 270.79121 430.02019  63.67613

Computing hierarchical clustering

Numerous R packages include the hierarchical clustering function. The R stats function hclust() will be used in this practical exercise.

The cluster was computed using the agglomeration approach by hclust(). There are 8 clustering methods that can be used: ward.D, ward. D2, single, complete, mcquitty (WPGMA), centroid (WPGMC), and average (UPGMA) (UPGMC).

The code below uses the ward.D method to do a hierarchical cluster analysis. An object of class hclust, which describes the tree generated by the clustering process, is where the hierarchical clustering output is stored. We can then plot the tree using plot() of R graphics

hclust_ward_d = hclust(proxmat, method="ward.D")
plot(hclust_ward_d, cex=0.6) #scale down plot to 0.6x in order to see township name

Selecting the optimal clustering algorithm

Finding stronger clustering structures is a challenge when performing hierarchical clustering. Using the agnes() function of the cluster package will address the issue.

It performs similar operations to hclus(), but agnes() also provides the agglomerative coefficient, which gauges the degree of clustering structure present

values closer to 1 suggest strong clustering structure

All hierarchical clustering algorithms’ agglomerative coefficients will be calculated using the code below.

m = c("average", "single", "complete", "ward")
names(m) = c("average", "single", "complete", "ward")

ac = function(y) {
  agnes(shan_ict, method=y)$ac
}

map_dbl(m,ac)
  average    single  complete      ward 
0.8131144 0.6628705 0.8950702 0.9427730 

According to the results shown above, Ward’s approach offers the greatest clustering structure out of the four examined methods. Consequently, only Ward’s technique will be applied in the analysis that follows.

Determining Optimal Clusters

The choice of the best clusters to keep is a technical problem for data analysts when undertaking clustering analysis.

To identify the ideal clusters, there are 3 widely utilized techniques:

The gap statistic contrasts the overall intra-cluster variation for various values of k with the values that would be predicted under a null reference distribution for the data. The value that maximizes the gap statistic will be used to estimate the best clusters (i.e., that yields the largest gap statistic). In other words, the clustering structure is very different from a randomly distributed, uniform distribution of points.

To compute the gap statistic, clusGap() of cluster package will be used

set.seed(12345)
gap_stat = clusGap(shan_ict, FUN=hcut, nstart=25, K.max = 10, B = 50)
# Print the result
print(gap_stat, method = "firstmax")
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = shan_ict, FUNcluster = hcut, K.max = 10, B = 50, nstart = 25)
B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
 --> Number of clusters (method 'firstmax'): 1
          logW   E.logW       gap     SE.sim
 [1,] 8.407129 8.680794 0.2736651 0.04460994
 [2,] 8.130029 8.350712 0.2206824 0.03880130
 [3,] 7.992265 8.202550 0.2102844 0.03362652
 [4,] 7.862224 8.080655 0.2184311 0.03784781
 [5,] 7.756461 7.978022 0.2215615 0.03897071
 [6,] 7.665594 7.887777 0.2221833 0.03973087
 [7,] 7.590919 7.806333 0.2154145 0.04054939
 [8,] 7.526680 7.731619 0.2049390 0.04198644
 [9,] 7.458024 7.660795 0.2027705 0.04421874
[10,] 7.377412 7.593858 0.2164465 0.04540947

Also note that the hcut function used is from factoextra package.

Next, we can visualize the plot by using fviz_gap_stat() of factoextra package.

fviz_gap_stat(gap_stat)

According to the gap statistic graph above, keeping 1 cluster is the optimal quantity. However, keeping only one cluster is illogical. The 6-cluster, which is the largest gap statistic according to the gap statistic graph, should be the next-best cluster to choose.

In addition to these widely-used methods, the NbClust package, published by Charrad et al. in 2014, offers 30 indices for figuring out the appropriate number of clusters and suggests to users the best clustering scheme based on the various outcomes obtained by varying different combinations of the number of clusters, distance measures, and clustering methods.

Interpreting the dendrograms

Each leaf on the dendrogram shown above represents a single observation. As we climb the tree, comparable observations join together to form branches, which are then fused at a higher level.

The vertical axis’s display of the height of the fusion shows how similar or unlike two observations are.

Less similarity exists between the observations as the height of the fusion increases. Be aware that only the height at which the branches comprising the two observations are initially fused can be utilized to determine how close two observations are to one another.

Two observations cannot be compared for resemblance based on how close they are to one another along the horizontal axis.

Using R stats’ rect.hclust() function, the dendrogram can alternatively be shown with a border around the chosen clusters. The rectangles’ borders can be colored using the option border.

plot(hclust_ward_d, cex=0.6)
rect.hclust(hclust_ward_d, k = 6, border = 2:5)

Visually-driven hierarchical clustering analysis

In this section, we will learn how to perform visually-driven hiearchical clustering analysis by using heatmaply package. With heatmaply, we are able to build both highly interactive cluster heatmap or static cluster heatmap.

Transforming the data frame into a matrix

Although the data was imported into a data frame, a data matrix is required to create a heatmap. The shan_ict data frame will be converted into a data matrix using the code below.

shan_ict_mat = data.matrix(shan_ict)

Plotting interactive cluster heatmap using heatmaply()

heatmaply(normalize(shan_ict_mat), 
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors=Blues,
          k_row = 6,
          margins= c(NA, 50, 50, NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main = "Segmentation of Shan State by ICT indicators",
          xlab = "ICT Indicators",
          ylab = "Township of Shan State"
          
          )

Mapping the clusters formed

Following a thorough analysis of the dendragram shown above, we chose to keep six groups. The code below will use R Base’s cutree() function to create a 6-cluster model.

groups = as.factor(cutree(hclust_ward_d, k=6))

Groups are the output. It is a list object.

The groups object needs to be added to the shan_sf simple feature object in order to visualize the clusters.

The following code snippet forms the join in 3 steps:

  1. The object representing the groups list will be transformed into a matrix;

  2. shan_sf is appended with the groups matrix using cbind() to create the simple feature object shan_sf cluster;

  3. The as.matrix.groups column is renamed to CLUSTER using the dplyr package’s rename() function.

shan_sf_cluster = cbind(shan_sf, as.matrix(groups)) %>%
  rename(`CLUSTER` = `as.matrix.groups.`)

Next we use the tmap functions to plot the cloropleth map showing the clusters

 tm_shape(shan_sf_cluster) + 
  tm_polygons("CLUSTER") +
  tm_borders(alpha = 0.5) 

The clusters are quite fractured, as shown by the choropleth map above. When non-spatial clustering algorithms like the hierarchical cluster analysis method are used, this is one of the main limitations.

Spatially Constrained Clustering - SKATER approach

We will discover how to use the skater() method of the spdep package to derive a geographically limited cluster in this section.

Converting into SpatialPolygonsDataFrame

We must first transform shan_sf into a spatial polygons data frame. Because only SP objects (SpatialPolygonDataFrame) are supported by the SKATER function, this is.

The code below turns shan_sf into a SpatialPolygonDataFrame named shan_sf by using the as_Spatial() function of the sf package.

shan_sp = as_Spatial(shan_sf)

Computing Neighbour List

The neighbours list from the polygon list will then be computed using the poly2nb() function of the spdep package.

shan.nb = poly2nb(shan_sp)
summary(shan.nb)
Neighbour list object:
Number of regions: 55 
Number of nonzero links: 264 
Percentage nonzero weights: 8.727273 
Average number of links: 4.8 
Link number distribution:

 2  3  4  5  6  7  8  9 
 5  9  7 21  4  3  5  1 
5 least connected regions:
3 5 7 9 47 with 2 links
1 most connected region:
8 with 9 links

With the help of the code below, we can plot the neighbors list on shan_sp

We plot this graph on top of the map now that we can also plot the community area boundaries. The bounds are given in the first plot command. 

The plot of the neighbor list object is then displayed, using coordinates to extract the polygon centroids from the original SpatialPolygonDataFrame (Shan state township boundaries). 

These serve as the nodes in the representation of the graph. In order to plot the network on top of the limits, we additionally specify add=TRUE and set the color to blue.

plot(shan_sp, border=grey(0.6))

plot(shan.nb, coordinates(shan_sp), col="red", add=TRUE)

Be aware that some of the areas will be trimmed if you we plot the network first and then the borders. This is so because the first plot’s attributes determine the plotting area. In this instance, we plot the border map first because it is larger than the graph.

Computing minimum spanning tree

Calculating edge costs

The cost of each edge is determined using nbcosts() from the spdep package. Its nodes are separated by this distance. This function uses a data.frame with observations vectors in each node to calculate the distance.

lcosts = nbcosts(shan.nb, shan_ict)

This calculates the pairwise dissimilarity between each observation’s values for the five variables and those for its neighboring observation (from the neighbour list). In essence, this is the idea of a generalized weight for a matrix of spatial weights.

Next, in a manner similar to how we calculated the inverse of distance weights, we will include these costs into a weights object. In other words, we specify the recently computed lcosts as the weights in order to transform the neighbour list into a list weights object.

The code below demonstrates how to accomplish this using the nb2listw() function of the spdep package. To ensure that the cost values are not row-standardized, note that we have specified the style as B to use binary weights.

shan.w = nb2listw(shan.nb, lcosts, style="B")
summary(shan.w)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 55 
Number of nonzero links: 264 
Percentage nonzero weights: 8.727273 
Average number of links: 4.8 
Link number distribution:

 2  3  4  5  6  7  8  9 
 5  9  7 21  4  3  5  1 
5 least connected regions:
3 5 7 9 47 with 2 links
1 most connected region:
8 with 9 links

Weights style: B 
Weights constants summary:
   n   nn       S0       S1        S2
B 55 3025 76267.65 58260785 522016004

Computing minimum spanning tree

The minimum spanning tree is computed by using mstree() of spdep package as shown in the code below. We can check its class and dimensions by using class() and dim()

shan.mst = mstree(shan.w)
class(shan.mst)
[1] "mst"    "matrix"
dim(shan.mst)
[1] 54  3

Note that the dimension is 54 and not 55. This is because the minimum spanning tree consists on n-1 edges (links) in order to traverse all nodes.

We can display the content of shan.mst by using head()

head(shan.mst)
     [,1] [,2]      [,3]
[1,]   31   25 229.44658
[2,]   25   10 163.95741
[3,]   10    1 144.02475
[4,]   10    9 157.04230
[5,]    9    8  90.82891
[6,]    8    6 140.01101

The MST plot method includes a mechanism to display the nodes’ observation numbers in addition to the edge. We once again plot these along with the township lines. We can see how the initial neighbor list is condensed to a single edge that passes through every node while linking each one.

plot(shan_sp, border=gray(0.6))
plot.mst(shan.mst, coordinates(shan_sp), col="blue", 
         cex.lab=0.7, cex.circles=0.05, add=TRUE)

Computing spatially constrained clusters using SKATER method

We can compute the spatially constrained cluster using skater() of the spdep package.

clust6 = skater(edge=shan.mst[,1:2], #1st 2 col of MST
                data = shan_ict, #data matrix
                method = "euclidean",
                ncuts = 5 #number of cuts
                )

Required inputs for the skater() function. 

  • Data matrix (to update the costs while units are being grouped),

  • the number of cuts

  • the first two columns of the MST matrix

Note: It is configured to be one less than the total number of clusters.

As a result, the value supplied is actually one less than the number of clusters, or the number of cuts in the graph

We can display the content of the result using str()

str(clust6)
List of 8
 $ groups      : num [1:55] 3 3 6 3 3 3 3 3 3 3 ...
 $ edges.groups:List of 6
  ..$ :List of 3
  .. ..$ node: num [1:22] 13 48 54 55 45 37 34 16 25 31 ...
  .. ..$ edge: num [1:21, 1:3] 48 55 54 37 34 16 45 31 13 13 ...
  .. ..$ ssw : num 3423
  ..$ :List of 3
  .. ..$ node: num [1:18] 47 27 53 38 42 15 41 51 43 32 ...
  .. ..$ edge: num [1:17, 1:3] 53 15 42 38 41 51 15 27 15 43 ...
  .. ..$ ssw : num 3759
  ..$ :List of 3
  .. ..$ node: num [1:11] 2 6 8 1 36 4 10 9 46 5 ...
  .. ..$ edge: num [1:10, 1:3] 6 1 8 36 4 6 8 10 10 9 ...
  .. ..$ ssw : num 1458
  ..$ :List of 3
  .. ..$ node: num [1:2] 44 20
  .. ..$ edge: num [1, 1:3] 44 20 95
  .. ..$ ssw : num 95
  ..$ :List of 3
  .. ..$ node: num 23
  .. ..$ edge: num[0 , 1:3] 
  .. ..$ ssw : num 0
  ..$ :List of 3
  .. ..$ node: num 3
  .. ..$ edge: num[0 , 1:3] 
  .. ..$ ssw : num 0
 $ not.prune   : NULL
 $ candidates  : int [1:6] 1 2 3 4 5 6
 $ ssto        : num 12613
 $ ssw         : num [1:6] 12613 10977 9962 9540 9123 ...
 $ crit        : num [1:2] 1 Inf
 $ vec.crit    : num [1:55] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "class")= chr "skater"

The groups vector, which contains the labels of the cluster to which each observation belongs, is the most interesting part of this list structure (as before, the label itself is arbitrary). 

The summary for each of the clusters in the edges.groups list is then provided. To show the impact of each cut on the overall criterion, sum of squares measurements are given as ssto for the total and ssw for each cut individually.

We can check the cluster assignment by using the code chunk below.

ccs6 = clust6$groups
ccs6
 [1] 3 3 6 3 3 3 3 3 3 3 2 1 1 1 2 1 1 1 2 4 1 2 5 1 1 1 2 1 2 2 1 2 2 1 1 3 1 2
[39] 2 2 2 2 2 4 1 3 2 1 1 1 2 1 2 1 1

Using the table command, we can determine how many observations are contained in each cluster. Additionally, we can observe that each vector in the lists found in edges.groups has this dimension. For instance, the first list has a node with a dimension of 22, which corresponds to the first cluster’s observation count.

table(ccs6)
ccs6
 1  2  3  4  5  6 
22 18 11  2  1  1 

Finally, we can also plot the pruned tree that shows the five clusters on top of the townshop area.

plot(shan_sp, border=gray(.5))
plot(clust6,
     coordinates(shan_sp),
     cex.lab=0.5,
     groups.colors=c("red","green","blue", "brown", "orange"),
     cex.circles=0.005, 
     add=TRUE
     )

Visualizing the clusters in choropleth map

The code below is used to plot the newly derived clusters by using the SKATER method

groups_mat = as.matrix(clust6$groups)
shan_sf_spatialcluster = cbind(shan_sf_cluster, as.factor(groups_mat)) %>% 
  rename(`SP_CLUSTER` = `as.factor.groups_mat.`)

tm_shape(shan_sf_spatialcluster) + 
  tm_fill("SP_CLUSTER") +
  tm_borders(alpha = 0.5) 

For easy comparison, it will be better to place both the hierarchical clustering and spatially constrained hierarchical clustering maps next to each other.

hclust.map.df = shan_sf_spatialcluster
shclust.map.df = shan_sf_spatialcluster

hclust.map = tm_shape(hclust.map.df) + 
  tm_fill("CLUSTER", palette = "Pastel1") +
  tm_borders(alpha = 0.5) 
  
shclust.map = tm_shape(shclust.map.df) + 
  tm_fill("SP_CLUSTER", palette = "Pastel1") +
  tm_borders(alpha = 0.5) 
  
tmap_arrange(hclust.map, shclust.map,
             asp=NA, ncol=2)

Spatially Constrained Clustering - ClustGeo Method

We will discover how to do both spatially constrained and non-spatially constrained hierarchical cluster analyses using functions available by the ClustGeo package

5.9.1 Ward-like hierarchical clustering: ClustGeo

Similar to the hclust() method discussed in the previous section, the ClustGeo package contains a function named hclustgeo() to carry out a typical Ward-like hierarchical clustering.

The sole input requirement for non-spatially constrained hierarchical clustering is a dissimilarity matrix, as can be seen in the code snippet below.

Note that the dissimilarity matrix must be an object of class dist, i.e. an object obtained with the function dist(), refer to section of Determine the proximity matrix.

nongeo_cluster = hclustgeo(proxmat)
plot(nongeo_cluster, cex=0.5)
rect.hclust(nongeo_cluster, k=6, border = 2:5)

Mapping the clusters formed

Similarly, by applying the techniques we discovered in Mapping the clusters formed, we may plot the clusters on a categorical area shaded map.

groups = as.factor(cutree(nongeo_cluster, k=6))

shan_sf_ngeo_cluster = cbind(shan_sf, as.matrix(groups)) %>%
  rename(`CLUSTER` = `as.matrix.groups.`)

default_ngeo_clustered_map = tm_shape(shan_sf_ngeo_cluster) + 
  tm_polygons("CLUSTER") +
  tm_borders(alpha = 0.5) 

default_ngeo_clustered_map

Spatially Constrained Hierarchical Clustering

Prior to starting the spatially constrained hierarchical clustering process, a spatial distance matrix will need to be derived by using st_distance() of sf package.

dist = st_distance(shan_sf, shan_sf)
distmat = as.dist(dist)

as.dist() is used to convert the data frame into a matrix.

Our goal is to as much as possible retain the attribute homogeneity as much as possible but also introduce spatial homogeneity. In order to balance this, we can alpha using the choicealpha() function.

cr = choicealpha(proxmat, distmat, range.alpha = seq(0,1, 0.05), K=6, graph=T)

As the data is highly skewed as we have explored in our EDA, we will use the QNorm graph based on normalized values.

We can observe that when alpha is raised from 0.1 to 0.2, the homogeneity of d1 (which is our spatial relationship (contiguity matrix), gain significantly from about 30% to 65%. And for d0 (which is our attribute), the homogeneity loss is slightly less than 20%.

At alpha = 0.3, the homogeneity of d1 continues to raise to about 75%, while d0 homogeneity loss is about 20%.

We will use alpha as 0.3 to build our cluster

clustG = hclustgeo(proxmat, distmat, alpha = 0.3)

We then use cutree() to derive the cluster object, and join it back to the shan_sf polygon feature data frame by using cbind()

groups = as.factor(cutree(clustG, k=6))
shan_sf_Gcluster  = cbind(shan_sf, as.matrix(groups)) %>%
  rename(`CLUSTER` = `as.matrix.groups.`)

We can now plot the map of the newly delineated spatially constraints clusters using functions of tmap

spatially_constraint_clustered_map = 
  tm_shape(shan_sf_Gcluster) + 
  tm_polygons("CLUSTER") +
  tm_borders(alpha = 0.5) 

spatially_constraint_clustered_map

We can now compare the maps by placing it side by side

tmap_arrange(default_ngeo_clustered_map,spatially_constraint_clustered_map)

Analysis

We can see that without considering spatial relationships, the clusters are not linked spatially. Take for example cluster 2 of default_ngeo_clustered_map, without spatial constraints, the regions forming the clusters are all over the place; One of the region is in the far west, another at the south and so on. This is sub optimal as there may be real world consideration that that necessitates the regionalization of geographies into distinct regions, such as urban planning.

In contrast, the clusters of the spatially_constraint_clustered_map, are mostly contiguous, except for a few regions that may be outliers causing such patterns.

Visual Interpretation of Clusters

We can reveal the distribution of a clustering variable (i.e RADIO_PR) by cluster with a box plot using the ggplot2 package

ggplot(data = shan_sf_ngeo_cluster, 
        aes(x=CLUSTER, y=RADIO_PR)) + 
        geom_boxplot()

The boxplot discloses The highest mean Radio Ownership Per Thousand Households is shown by Cluster 3. Then come Clusters 2, 1, 4, 6, and 5.

See dark line in the middle of box plot

5.10.2 Multivariate Visualization

Studies in the past have demonstrated how parallel coordinate plots can be used to reveal clustering variables effectively.

We can employ the ggparcoord() function of the GGally package to achieve this.

ggparcoord(data = shan_sf_ngeo_cluster,
           columns=c(25:30),
           scale = "globalminmax",
           alphaLines = 0.2,
           boxplot = T,
           title = "Multiple Parallel Coordinates Plots of ICT Variables by Cluster") +
          facet_grid(~ CLUSTER) + 
          theme(axis.text.x = element_text(angle = 30, size = 8 ))

The parallel coordinate plot above demonstrates that the largest percentage of TV and mobile phone ownership is seen among homes in townships in Cluster 4. In contrast, households in Cluster 5 typically own the lowest of all the five ICT.

It should be noted that the ggparcoor() function offers a number of ways to scale the clustering variables. As follows:

  • std: univariately, subtract mean and divide by standard deviation.

  • robust: univariately, subtract median and divide by median absolute deviation.

  • uniminmax: univariately, scale so the minimum of the variable is zero, and the maximum is one.

  • globalminmax: no scaling is done; the range of the graphs is defined by the global minimum and the global maximum.

  • center: use uniminmax to standardize vertical height, then center each variable at a value specified by the scaleSummary param.

  • centerObs: use uniminmax to standardize vertical height, then center each variable at the value of the observation specified by the centerObsID param

There isn’t a single best scaling technique. We should investigate them and decide which one best suits the analysis requirements.

Last but not least, in order to supplement the visual interpretation, we can likewise generate summary statistics like mean, median, sd, etc. The group_by() and summarise() dplyr functions are used in the code snippet below to calculate the mean values of the clustering variables.

We will use kable() of knitr package to list the values

shan_sf_ngeo_cluster %>%
  st_set_geometry(NULL)  %>% #remove spatial features
  group_by(CLUSTER) %>%
  summarise(mean_RADIO_PR = mean(RADIO_PR),
            median_RADIO_PR = median(RADIO_PR),
            sd_RADIO_PR = sd(RADIO_PR),
            
            mean_TV_PR = mean(TV_PR),
            median_TV_PR = median(TV_PR),
            sd_TV_PR = sd(TV_PR),
            
            
            mean_LLPHONE_PR = mean(LLPHONE_PR),
            median_LLPHONE_PR = median(LLPHONE_PR),
            sd_LLPHONE_PR = sd(LLPHONE_PR),
            
            
            mean_MPHONE_PR = mean(MPHONE_PR),
            median_MPHONE_PR = median(MPHONE_PR),
            sd_MPHONE_PR = sd(MPHONE_PR),
            
            mean_COMPUTER_PR = mean(COMPUTER_PR),
            median_COMPUTER_PR = median(COMPUTER_PR),
            sd_COMPUTER_PR = sd(COMPUTER_PR),
            
            )  %>%
            kable()
CLUSTER mean_RADIO_PR median_RADIO_PR sd_RADIO_PR mean_TV_PR median_TV_PR sd_TV_PR mean_LLPHONE_PR median_LLPHONE_PR sd_LLPHONE_PR mean_MPHONE_PR median_MPHONE_PR sd_MPHONE_PR mean_COMPUTER_PR median_COMPUTER_PR sd_COMPUTER_PR
1 220.7496 204.77685 73.35035 520.8871 525.8778 63.40016 44.21883 33.768563 41.059883 245.7741 256.8494 55.91283 20.474406 20.357389 6.802811
2 236.5797 227.34993 113.15974 401.5410 418.8207 80.52002 23.90341 22.842511 6.928281 134.1802 127.7073 39.13840 11.498389 10.757804 5.259506
3 300.0958 278.08501 76.42307 611.4077 611.6204 70.99142 52.19443 42.497972 27.661573 392.2518 392.6089 53.98977 28.997656 25.141930 14.404024
4 195.8290 155.43315 137.20670 743.7454 757.3437 69.94159 98.99691 89.388902 31.419089 650.5613 654.7088 79.48668 65.530210 71.883205 27.525848
5 124.0485 97.44231 77.58482 224.1195 219.4093 64.83833 37.98043 8.438819 59.701102 131.9753 54.7963 124.95266 6.682603 6.853583 2.657472
6 98.6192 97.44765 25.12016 499.1921 512.6855 80.69958 74.54971 81.746586 31.394566 468.1360 469.0187 37.05115 21.021828 12.847119 17.773606