In this project, I will use multiple data mining & machine learning techniques and models to analyze the patterns embedded in the comparison of two datasets: AD data and control data. It is a real dataset of more than 8000 genes of 176 patients of Alzheimer’s disease (in text file case.gex) and 188 age-matched normal people, or controls (in text file ctrl.gex), which can be found from my Github. More information can be found from the original paper of the disease study.
There are substantial amount of data missing in both data sets, so the first step of the project is the imputation. 3 different imputation schemes came to my mind: mean, median, and K-NN. In order to evaluable the accuracy of these three methods, I randomly select 5% of the data entries and mark them as if they were missing. We then apply all imputation methods to impute the “missing data”. We then compute the average difference between the imputed values and the actual values, and use this average difference as the performance of the method. I tested my methods only on the control dataset. The reason is that these two datasets need to be handled separately as patients (cases) and controls, and they should have some (unknown) genetic differences.
I preprocess the data by reading line by line to get an 8562*189, then cut the irrelevant lines to get an 8560*189. After that I rotate the dataset to put 8560 features as columns. As for mean imputation, I calculate the mean for each feature with known data (not ‘NA’), and then calculate average difference with missing data points. As for median imputation, I find the median for each feature by put all values of one feature in a list, and then sort and identify the median. Then I calculate the average difference as I did in mean imputation.
As for K-NN, I build a classification for each missing point. For example, in instance, feature 1 to feature 428 are missing, then I use feature 429 to feature 8560 to build the model. Each of these 428 missing data points is considered as a label, and feature 429-8560 are considered as ‘features’ in K-NN model. I use all other 187 instances and 429-8560 features to find K-NN for each missing point, then use the majority vote of these K neighbors as the prediction. After finding the prediction generated by K-NN for each missing point, average difference can be calculated.
Wait for roughly half an hour till all three methods are done, and the average differences of three methods will be displayed as following:
Mean imputation: 0.316922184436
Median imputation: 0.322468263654
KNN imputation: 0.0222101059161 with K =22
It loops through k = 5 to k = 50 to find the optimal K for imputation. The best K in after my test is 22, I think it is appropriately large to produce small average difference.
As the outcome indicates after hours, K-NN imputation has the smallest average difference. Mean imputation and Median imputation has similar outcomes, which may be because Mean is close to Median in this dataset.
After completing this part, my understanding of testing the performance of a data mining algorithm has been improved. There are multiple ways to build a model, to impute missing data, or to handle unknown information. What we can do is to try out several different ways to do that, can compare the performance among these chosen methods. Although sometimes the comparison can be hard to perform, because we don’t know that right answer for some real missing data value. However, we can do tests based on known data to gain a close estimation of the performance of each method. I believe the model chosen for the case or the algorithm used for the problem can be continuously improved when more training data is offered. Although the model is made to fit in the known patterns, more training can normally lead to higher accuracy of predicting unknown instances. ‘I have seen too much, so I can predict unknown better as an elder.’ A generalized summary for testing the performance is to use known data as problems that we’ve already known the answer to test our algorithm, and to see how accurate their predictions are. Then use future data as training data to refine those methods. Code for imputation written in python can be found on my Github.
Associate Rules Mining
We are going to use the AR method to find possible gene regulation in Alzheimer’s disease using the AD data. First, in order to convert gene expressions to buy or no buy that can be used in Apriori Algorithm. Considering that 8560 genes are too many to generating frequent item sets, I use 1800 genes to perform the associate rules mining.
We first need to generate basket (transaction) datasets from the gene expression data we have. Since we have datasets for AD cases and normal controls, we can have two sets of basket datasets, respectively, one for cases (in case.gex) and the other for controls (in ctrl.gex). In one dataset (for AD or controls), each gene can be viewed as an item and each person can be viewed as a transaction. We now consider or determine whether a transaction (person) has a particular item (gene or gene expressed) as follows. First, a person having a particular gene means that she has that gene expressed to have abundance above a certain level. Since the genes that are not highly expressed may not be interesting, we will discard them from our baskets. Therefore, for each feature (gene), we will only consider the x% people whose expressions of that gene are among the top x% most abundant. We use x=10%. This means that we turn the gene expression matrix of the given data into a new 1/0 matrix, where (i, j) = 1 if the i-th person has the expression of the j-th gene (item) among the top x% of all the people, or 0 otherwise. Take 10% and the 187 individuals in the control dataset as an example. In this case, we should expect to see 18 entries (or people) have 1 for a particular gene or 0 for the rest. To be precise – you sort each gene across all people and take the x% as 1 and the rest 0. The preprocessing of the data can be found on my Github.
Here starts the Apriori Algorithm part: using python to do Apriori algorithm (for 3 days) on the case and ctrl dataset after preprocessing. Using the outcomes for 5 * 4 = 20 different combinations of min support and min confidence to plot the below figures for the two datasets. We can find out that the number of item sets and AR found by the Apriori algorithm increases exponentially, which can be seen from the mostly linear lines from the log10 plots.
Each association rule consists of two parts, and antecedent and a consequent. Based on the association rules we have, we can further study the genes involved in these association rules. Association rules in different datasets can also have different meanings. On one hand, in the ctrl.gex dataset, we have the association rules about the genes of people who doesn’t have AD. Similarly, the association rules we discovered in the case.gex dataset tell us about the genes of people who has AD. Using the association rule [R] (‘GI_14249467-I’, ‘GI_10864030-S’, ‘GI_13376528-S’, ‘GI_19924114-I’) => (‘GI_20336255-I’,) as an example, we know that among people who has AD, we know that if their gene ‘GI_14249467-I’, ‘GI_10864030-S’, ‘GI_13376528-S’, ‘GI_19924114-I’ are positive, they are very likely (higher than min confidence) to have gene ‘GI_20336255-I’ as positive. With such rules, we can have better understanding of the gene regulations of people who has or does not have AD. Now we make comparison of these 50 association rules with highest confidence from the two datasets. By comparison we can see that among 50 most confident rules in these two datasets, there are no rules in common. This indicates a big difference in gene regulations between people who have or don’t have AD. This may help researchers to study about the AD for questions like: Who are more likely to have AD? How to figure out if one has AD with genetic method? How to cure AD with genetic method? By further study, we may use these association rules to help find better genetically medical solution for AD.
Feature Selection with Random Forest Model
Using the two sets of AD and control data as the positive and negative training data (no test data) to build the following RFs, each of which has 500 trees:
- Fix the number of features (genes) at 70% of the total number of features, vary the number of instances from 50% to 90%, with an increment of 5%.
- Fix the number of instances at 70% of the total number of instances, vary the number of features (genes) from 50% to 90%, with an increment of 5%.
I calculated the frequencies of 8560 genes in different RFs with different configurations. They I plot them together to see the frequency distribution of different genes.
As the plot shows, the frequency of genes in RF differs from each other. Some genes appear in RF frequently, while some of them may hardly be used to as the point to split from. Since in each tree inside the RF, the best feature to split is chosen by ID3, which means the split feature is chosen if it can decrease the entropy the most. It also means that this gene is more decisive in judging whether this instance should belong to positive class or negative class. By ‘decisive’ I mean the gene we chose to split from is the most ‘important’ or ‘fundamental’ one among the features that we still have in the tree. The more decisive the feature is, the more information gain we can get if we split from this feature. The core idea of building decision tree is to split from the point from which we can get the most information gain, so when we build decision trees we are actually using the feature from the most important one to the least important one in the feature set.
Since in different trees we used in RF features are chosen randomly, different genes can have same ‘chance’ to appear in RF comparing to each other. As a result, higher frequencies imply that these genes are more ‘decisive’. Some conditions of these features can help us make better prediction for the label of an instance.If we want to compare two genes based on their frequencies, we can tell which one is the more ‘important’ one by their frequencies in RF. If one gene is often used to split from and label instances, and the other is seldom used, we can tell the one with higher frequency is more important.
For instance, if we want to decide if an animal is a dog. We may want to split by features like ‘if it is a mammal’ or ‘if it has 4 legs’ before we consider ‘the color of their fur’ or ‘if it has tails’. A dog must have 4 legs, and it must be a mammal, so we consider these features as important features. However, a dog can have different color of fur, and it can have a tail or not. These features cannot be used to separate dog and not dog instances, so they will not appear in RF as frequent as the two important features. Some information about an instance is just more useful than other for us to label them, which can be seen from different frequencies they appear in RF.
By ranking the genes based on their frequency of appearing in random forests, I got the top 100 genes which can be considered as the most informative ones:
Based the on the data we have after imputation, I did the normalization in two different methods. We need to normalize data because different sample points are generated in slightly different environments, which may cause the bias in the value of the gene expressions. For instance, if two cells have different amount of chemical indicator on them, the values we get from the chemical experiment can be different. This bias can be fixed by normalizing the gene expressions inside one sample/patient/observation.
Using standard normalization method. Among 8560 gene expressions of one data point, we can get the max and min. We set max(scaled) as 1 and max(scaled) as -1. Then we normalize all other data according to:
Then we have:
Distance measure we use in this part include: dot product, Euclidean metric, cosine similarity. For each distance measure, we calculate and plot for K-Means in which k goes from 2 to 100:
- The average in-cluster similarity of the k clusters – call it S;
- The average between-cluster similarity across any pair of the k clusters – called it D;
- the ratio of S/D;
Using Matlab, run k-means algorithm using k from 2 to 100 with dot product as the similarity measure. Compute S, D and S/D.
We can see that when k increases, S increases and D decreases. Dot product is a choice of similarity measure. It depends on the angle between the two vectors as well as the norms of them. Although we have already normalized data one sample by one sample, norms of genes are still different from each other, which may influence the comparison of dot `products. As we can see from the plot, S majorly goes up, and the several peaks between 2 and 100 may be caused by bigger norms. This also generates the peaks on the plot of S/D. S/D also increases while D decreases when k goes from 2 to 100. When we extract the norm of the two vectors from it, we have cosine similarity measure.
Cosine similarity is also widely used in k-means algorithm. This only measures the cosine value of the angle between the two vectors, so it is not biased by the norms of the two vectors. Using Matlab, run k-means algorithm using k from 2 to 100 with cosine similarity as the similarity measure. Compute S, D and S/D.
As can be seen from the plots, we don’t have extremely large peaks like we had in dot product plots any more. We can see that when k increases from 2 to 100, S increases and D decreases mostly. S/D also increases progressively, although there are some peaks between k = 10 and k = 25.
In this metric, we can see that bigger Euclidean distance indicates smaller similarity. When two vectors are the same, the Euclidean distance is 0, and the Euclidean similarity metric reaches the maximum value 1. Adding 1 to distance is to avoid “divided by 0” situation. If we plot S, D, and S/D using Euclidean similarity metric, we have the following plot.
When we increase k from 2 to 100, S goes up, and D goes down. S/D increases like it did when we use dot product and cosine similarity.
S is the average in-cluster similarity of the k clusters. It is computed as following:
D is the average between-cluster similarity of the k clusters. It is computed as following:
S would mostly go up when we increase k, because there are less points (considering average number) in one cluster. Less points means less variety in the collection. Although we are taking average pairwise similarity, smaller cluster area often indicates higher average pairwise similarity. Meanwhile, D goes down mostly when we increase k, because the more clusters we are going to build, the more average difference there will be among those centers. As can be seen from the plots, no matter which similarity measure we use, S/D value increases when k goes from 2 to 100. So I believe the separation quality of different clusters is enhanced when we change k from 2 to 100. However, this S/D can be increasing when k keeps increases, and the “optimal” k will be found later.
How to choose the best similarity measure for implementing k-means?
S/D is a valid performance measure, and there is also another performance metric: within cluster sum of squared error (SSE). The SSE is also calculated in the Euclidean distance “style”. This error can be minimized by using Euclidean distance/Euclidean metric as similarity measure to implement k-means clustering algorithm, because each iteration of k-means with such implementation, we label the cluster membership by measuring their Euclidean distance from the centroids from the last iteration. Compared to other two distance measures used above, dot product and cosine similarity, Euclidean distance/Euclidean metric is more unbiased. The dot product is biased by 2-norms of the vectors involved, so unless all data points have the same norms, we cannot use dot product as the similarity measure in implementing k-means algorithm. What is more, the cosine similarity extracts the bias of different norms, but it only measures the angle between two vectors. Then, so can see than (1,0) and (10000000,0) are more similar than (1,0) and (1,1). Also, data points (10000000,0), (100000,0), and (10,0) have same similarity with (1,0), which is not effective. As a result, Euclidean distance, I believe, is the best choice of implementing k-means algorithm among these 3 similarity measures.
How to choose the k for implementing k-means: Number of classes? Number of objects?
Choosing a good k is essential to the algorithm. If we know how many classes there are, we can easily set k as the number of classes. But in this case we don’t know how many classes are there. We can also pick k based on the number of points. We can use the square root of a half of the number of points, which in this case is 65. This is a rough method to make clustering results look good visually, and a more precise method is indicated as following—-Elbow method:
As can be seen from the plot for all similarity measure above, the S/D kept rising when k increases from 2 to 100. Hence, it is rational to believe that S/D will keep increasing by rising k beyond 100. In order to find an “optimal” k for k-means in this case. I ran another test in which k goes from 2 to 2000, and recorded the S/D values and SSE values. By doing this I gained a more extensive view of how S/D and SSE change when we raise k.
As can be seen from plot, we can see that the SSE quickly drops exponentially when we increase k from 2 to 80, but then it decreases slowly and almost linearly, which is also called by “elbow effect”. The elbow usually represents where we start to have diminishing returns by increasing k. That almost horizontal line in the closer look plot is the same trending line presented in the first SSE plot. 30 is a crucial point from which the decreasing rate changes significantly. This indicates that we may want to choose k from 30 to 60. We want k to be small but we want higher performance. Now we use S/D plots to further validate the idea.
According to the plot of S/D, we can see k=35 is also approaching to the point when S/D stops increasing quickly. The trending line helps to locate the turning point. This is a good value for k using “elbow method”. When performance stops raising quickly, our goal is to choose a small value of k that still has a low SSE/high S/D separation.
PCA(Using genes as features and patients as instances)
First, use Matlab to run PCA on AD data, and then plot the sorted eigenvalues; plot the accumulative information up to k eigenvalues as a function of number of k.
As can be seen from the plot, we can use k=12 so that we retain 80.48% of accumulative information. Top eigenvalues are significantly larger than those small eigenvalues, which means that they occupy most information weights. Plot the AD cases (labeled in red) and normal controls (labeled in green) in a 3D figure where the 3 coordinates correspond to the first 3 largest eigenvectors.
The separation is not clear, but we can still see that in the plot, AD data points are roughly to the “left”, and the ctrl data points are closer to the “right”. It is not reliable if we only judge by observation. Run a fast classification with 10-cross validation SVM on these data points. Different kernels were tried, and here only presents the best classification result with a WEKA report:
=== Summary ===
Correctly Classified Instances 183 48.6702 %
Incorrectly Classified Instances 193 51.3298 %
Kappa statistic -0.0266
Mean absolute error 0.5133
Root mean squared error 0.7164
Relative absolute error 102.6564 %
Root relative squared error 143.2849 %
Total Number of Instances 376
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.436 0.463 0.485 0.436 0.459 -0.027 0.487 0.494 N
0.537 0.564 0.488 0.537 0.511 -0.027 0.487 0.494 P
Weighted Avg. 0.487 0.513 0.487 0.487 0.485 -0.027 0.487 0.494
=== Confusion Matrix ===
a b <– classified as
82 106 | a = N
87 101 | b = P
The classification above offers a quantified view of how these data points are separated. Since the classification accuracy is nearly 50%, we can see that the separation between two classes is not clear enough. The most obvious cause of this is that the accumulative information retained by these 3 eigenvectors is 0.6688.Then use top 2 eigenvectors, data points from two data sets are plotted as following:
The separation is not clear but we can still see that AD data points are mostly to the “left” while control data points are closer to the “right”. It is not reliable if we only judge by observation. Run a fast classification with 10-cross validation SVM on these data points. Different kernels were tried, and here only presents the best classification result with a WEKA report:
=== Summary ===
Correctly Classified Instances 153 43.4659 %
Incorrectly Classified Instances 199 56.5341 %
Kappa statistic -0.1307
Mean absolute error 0.5653
Root mean squared error 0.7519
Relative absolute error 113.0601 %
Root relative squared error 150.367 %
Total Number of Instances 352
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.335 0.466 0.418 0.335 0.372 -0.133 0.435 0.473 P
0.534 0.665 0.445 0.534 0.486 -0.133 0.435 0.471 N
Weighted Avg. 0.435 0.565 0.432 0.435 0.429 -0.133 0.435 0.472
=== Confusion Matrix ===
a b <– classified as
59 117 | a = P
82 94 | b = N
After running SVM classification for the 2D data set, we can see that the best accuracy we can get is 43.4659 %. The classification accuracy is even worse than 3D condition. This can be explained by the fact that he accumulative information retained by these two eigenvectors is 0.6246, which is less than that in 3D condition.
We didn’t have to visualize the classification to pick out some good examples of data points labeled as TP, TN, FP, or FN. Using 2D plot as an example, those black points in the right part are mostly TN, and those red dots surrounded by black dots are mostly labeled as FN. Similarly, those red dots gathering in the left part of the plot are mostly TP, while those black dots surrounded in them are mostly FP.
However, there are some shared causes for low classification accuracy/unsatisfying separation for both 2D and 3D models/plots. PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The eigenvalues (weights for principal components) reflect the information weight. So when we just used the top 3 eigenvectors, we used 66.88% of the information. For top 2 eigenvectors we have 62.64% of information. We cannot draw 4D plot so we had to stop by 3D. The information retained is not enough to derive clear separation out of it, although we can see different centroids locations for different data sets. Aside from ratio of information retained, we still need to consider the number of the instances and the number of features. There are 8560 features and 364 instances. This is a typical case in real problems, which data miners are often haunted by the curse of dimension. When number of instances is less than 1/10 of number of features, it is real hard to see the real patterns, not to mention there are still unknown features in most cases. When more data points can be added to the plot, the real patterns may be dig out to higher level. We can see roughly from the plots that data points from two data sets have different centroids, but these centroids may be changed if we add more instances to the plot, and so does the separation.
(Featured Image: https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0ahUKEwjit-yC5PjUAhWIMyYKHWrtC0gQjRwIBw&url=http%3A%2F%2Fwww.foxnews.com%2Fhealth%2F2016%2F02%2F17%2Fscientists-identify-ground-zero-for-alzheimers-disease.html&psig=AFQjCNG789RXEVF0bLOJJZ27kaa45VGy6Q&ust=1499572162972404)