**K-Means**

Based the on the data we have after imputation, I did the normalization in two different methods. We need to normalize data because different sample points are generated in slightly different environments, which may cause the bias in the value of the gene expressions. For instance, if two cells have a different amount of chemical indicator on them, the values we get from the chemical experiment can be different. This bias can be fixed by normalizing the gene expressions inside one sample/patient/observation.

Using standard normalization method. Among 8560 gene expressions of one data point, we can get the max and min. We set max(scaled) as 1 and max(scaled) as -1. Then we normalize all other data according to:

Then we have:

Distance measure we use in this part includes: dot product, Euclidean metric, cosine similarity. For each distance measure, we calculate and plot for K-Means in which k goes from 2 to 100:

- The average in-cluster similarity of the k clusters – call it S;
- The average between-cluster similarity across any pair of the k clusters – called it D;
- the ratio of S/D;

Dot Products:

Using Matlab, run k-means algorithm using k from 2 to 100 with dot product as the similarity measure. Compute S, D and S/D.

We can see that when k increases, S increases and D decreases. Dot product is a choice of similarity measure. It depends on the angle between the two vectors as well as the norms of them. Although we have already normalized data one sample by one sample, norms of genes are still different from each other, which may influence the comparison of dot `products. As we can see from the plot, S majorly goes up, and the several peaks between 2 and 100 may be caused by bigger norms. This also generates the peaks on the plot of S/D. S/D also increases while D decreases when k goes from 2 to 100. When we extract the norm of the two vectors from it, we have cosine similarity measure.

Cosine similarity:

Cosine similarity is also widely used in the k-means algorithm. This only measures the cosine value of the angle between the two vectors, so it is not biased by the norms of the two vectors. Using Matlab, run k-means algorithm using k from 2 to 100 with cosine similarity as the similarity measure. Compute S, D and S/D.

As can be seen from the plots, we don’t have extremely large peaks like we had in dot product plots anymore. We can see that when k increases from 2 to 100, S increases and D decreases mostly. S/D also increases progressively, although there are some peaks between k = 10 and k = 25.

Euclidean metric:

In this metric, we can see that bigger Euclidean distance indicates smaller similarity. When two vectors are the same, the Euclidean distance is 0, and the Euclidean similarity metric reaches the maximum value 1. Adding 1 to distance is to avoid “divided by 0” situation. If we plot S, D, and S/D using Euclidean similarity metric, we have the following plot.

When we increase k from 2 to 100, S goes up, and D goes down. S/D increases like it did when we use dot product and cosine similarity.

Discussion:

S is the average in-cluster similarity of the k clusters. It is computed as follows:

D is the average between-cluster similarity of the k clusters. It is computed as follows:

S would mostly go up when we increase k because there are fewer points (considering average number) in one cluster. Fewer points mean less variety in the collection. Although we are taking average pairwise similarity, smaller cluster area often indicates higher average pairwise similarity. Meanwhile, D goes down mostly when we increase k, because the more clusters we are going to build, the more average difference there will be among those centers. As can be seen from the plots, no matter which similarity measure we use, S/D value increases when k goes from 2 to 100. So I believe the separation quality of different clusters is enhanced when we change k from 2 to 100. However, this S/D can be increased when k keeps increases, and the “optimal” k will be found later.

*How to choose the best similarity measure for implementing k-means?** *

S/D is a valid performance measure, and there is also another performance metric: a within-cluster sum of squared error (SSE). The SSE is also calculated in the Euclidean distance “style”. This error can be minimized by using Euclidean distance/Euclidean metric as the similarity measure to implement k-means clustering algorithm, because of each iteration of k-means with such implementation, we label the cluster membership by measuring their Euclidean distance from the centroids from the last iteration. Compared to other two distance measures used above, dot product and cosine similarity, Euclidean distance/Euclidean metric are more unbiased. The dot product is biased by 2-norms of the vectors involved, so unless all data points have the same norms, we cannot use dot product as the similarity measure in implementing k-means algorithm. What is more, the cosine similarity extracts the bias of different norms, but it only measures the angle between two vectors. Then, so can see that (1,0) and (10000000,0) are more similar than (1,0) and (1,1). Also, data points (10000000,0), (100000,0), and (10,0) have the same similarity with (1,0), which is not effective. As a result, Euclidean distance, I believe, is the best choice for implementing k-means algorithm among these 3 similarity measures.

*How to choose the k for implementing k-means: Number of classes? The number of objects?*

Choosing a good k is essential to the algorithm. If we know how many classes there are, we can easily set k as the __number of classes__. But in this case, we don’t know how many classes are there. We can also pick k based on the number of points. We can use the square root of a half of the number of points, which in this case is 65. This is a rough method to make clustering results look good visually, and a more precise method is indicated as following—-Elbow method:

As can be seen from the plot for all similarity measure above, the S/D kept rising when k increases from 2 to 100. Hence, it is rational to believe that S/D will keep increasing by rising k beyond 100. In order to find an “optimal” k for k-means in this case. I ran another test in which k goes from 2 to 2000, and recorded the S/D values and SSE values. By doing this I gained a more extensive view of how S/D and SSE change when we raise k.

As can be seen from the plot, we can see that the SSE quickly drops exponentially when we increase k from 2 to 80, but then it decreases slowly and almost linearly, which is also called by “elbow effect”. The elbow usually represents where we start to have diminishing returns by increasing k. That almost horizontal line in the closer look plot is the same trending line presented in the first SSE plot. 30 is a crucial point from which the decreasing rate changes significantly. This indicates that we may want to choose k from 30 to 60. We want k to be small but we want higher performance. Now we use S/D plots to further validate the idea.

According to the plot of S/D, we can see k=35 is also approaching the point when S/D stops increasing quickly. The trending line helps to locate the turning point. This is a good value for k using __“elbow method”__. When performance stops raising quickly, our goal is to choose a small value of k that still has a low SSE/high S/D separation.

**PCA(Using genes as features and patients as instances)**

First, use Matlab to run PCA on AD data, and then plot the sorted eigenvalues; plot the accumulative information up to k eigenvalues as a function of the number of k.

As can be seen from the plot, we can use k=12 so that we retain 80.48% of the accumulative information. Top eigenvalues are significantly larger than those small eigenvalues, which means that they occupy most information weights. Plot the AD cases (labeled in red) and normal controls (labeled in green) in a 3D figure where the 3 coordinates correspond to the first 3 largest eigenvectors.

The separation is not clear, but we can still see that in the plot, AD data points are roughly to the “left”, and the ctrl data points are closer to the “right”. __It is not reliable if we only judge by observation__. Run a fast classification with 10-cross-validation SVM on these data points. Different kernels were tried, and here only presents the best classification result with a WEKA report:

=== Summary ===

Correctly Classified Instances 183 48.6702 %

Incorrectly Classified Instances 193 51.3298 %

Kappa statistic -0.0266

Mean absolute error 0.5133

Root mean squared error 0.7164

Relative absolute error 102.6564 %

Root relative squared error 143.2849 %

Total Number of Instances 376

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class

0.436 0.463 0.485 0.436 0.459 -0.027 0.487 0.494 N

0.537 0.564 0.488 0.537 0.511 -0.027 0.487 0.494 P

Weighted Avg. 0.487 0.513 0.487 0.487 0.485 -0.027 0.487 0.494

=== Confusion Matrix ===

a b <– classified as

82 106 | a = N

87 101 | b = P

The classification above offers a quantified view of how these data points are separated. Since the classification accuracy is nearly 50%, we can see that the separation between two classes is not clear enough. The most obvious cause of this is that the accumulative information retained by these 3 eigenvectors is 0.6688.Then use top 2 eigenvectors, data points from two data sets are plotted as follows:

The separation is not clear but we can still see that AD data points are mostly to the “left” while control data points are closer to the “right”. It is not reliable if we only judge by observation. Run a fast classification with 10-cross-validation SVM on these data points. Different kernels were tried, and here only presents the best classification result with a WEKA report:

=== Summary ===

Correctly Classified Instances 153 43.4659 %

Incorrectly Classified Instances 199 56.5341 %

Kappa statistic -0.1307

Mean absolute error 0.5653

Root mean squared error 0.7519

Relative absolute error 113.0601 %

Root relative squared error 150.367 %

Total Number of Instances 352

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class

0.335 0.466 0.418 0.335 0.372 -0.133 0.435 0.473 P

0.534 0.665 0.445 0.534 0.486 -0.133 0.435 0.471 N

Weighted Avg. 0.435 0.565 0.432 0.435 0.429 -0.133 0.435 0.472

=== Confusion Matrix ===

a b <– classified as

59 117 | a = P

82 94 | b = N

After running SVM classification for the 2D data set, we can see that the best accuracy we can get is 43.4659 %. The classification accuracy is even worse than 3D condition. This can be explained by the fact that he accumulative information retained by these two eigenvectors is 0.6246, which is less than that in 3D condition.

We didn’t have to visualize the classification to pick out some good examples of data points labeled as TP, TN, FP, or FN. Using 2D plot as an example, those black points in the right part are mostly TN, and those red dots surrounded by black dots are mostly labeled as FN. Similarly, those red dots gathering in the left part of the plot are mostly TP, while those black dots surrounded by them are mostly FP.

__However, there are some shared causes for low classification accuracy/unsatisfying separation for both 2D and 3D models/plots.__ PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The eigenvalues (weights for principal components) reflect the information weight. So when we just used the top 3 eigenvectors, we used 66.88% of the information. For top 2 eigenvectors, we have 62.64% of the information. We cannot draw 4D plot so we had to stop by 3D. __The information retained is not enough__ to derive clear separation out of it, although we can see different centroids locations for different datasets. Aside from ratio of information retained, we still need to consider the number of the instances and the number of features. There are 8560 features and 364 instances. This is a typical case in real problems, which data miners are often haunted by the __curse of dimension__. When the number of instances is less than 1/10 of the number of features, it is really hard to see the real patterns, not to mention there are still __unknown features__ in most cases. When more data points can be added to the plot, the real patterns may be dug out to a higher level. We can see roughly from the plots that data points from two data sets have different centroids, but these centroids may be changed if we add more instances to the plot, and so does the separation.

**PROJECT END**

## Leave a Reply