In this project, I will use multiple data mining & machine learning techniques and models to analyze the patterns embedded in the comparison of two datasets: AD data and control data. It is a real dataset of more than 8000 genes of 176 patients with Alzheimer’s disease (in text file case.gex) and 188 age-matched normal people, or controls (in text file ctrl.gex), which can be found on my Github. More information can be found from the original paper of the disease study.
There are a substantial amount of data missing in both data sets, so the first step of the project is the imputation. 3 different imputation schemes came to my mind: mean, median, and K-NN. In order to evaluable the accuracy of these three methods, I randomly select 5% of the data entries and mark them as if they were missing. We then apply all imputation methods to impute the “missing data”. We then compute the average difference between the imputed values and the actual values and use this average difference as the performance of the method. I tested my methods only on the control dataset. The reason is that these two datasets need to be handled separately as patients (cases) and controls, and they should have some (unknown) genetic differences.
I preprocess the data by reading line by line to get an 8562*189, then cut the irrelevant lines to get an 8560*189. After that, I rotate the dataset to put 8560 features as columns. As for mean imputation, I calculate the mean for each feature with known data (not ‘NA’) and then calculate the average difference with missing data points. As for median imputation, I find the median for each feature by put all values of one feature in a list, and then sort and identify the median. Then I calculate the average difference as I did in mean imputation.
As for K-NN, I build a classification for each missing point. For example, in an instance, feature 1 to feature 428 are missing, then I use feature 429 to feature 8560 to build the model. Each of these 428 missing data points is considered as a label, and feature 429-8560 are considered as ‘features’ in K-NN model. I use all other 187 instances and 429-8560 features to find K-NN for each missing point, then use the majority vote of these K neighbors as the prediction. After finding the prediction generated by K-NN for each missing point, the average difference can be calculated.
Wait for roughly half an hour till all three methods are done, and the average differences of three methods will be displayed as follows:
Mean imputation: 0.316922184436
Median imputation: 0.322468263654
KNN imputation: 0.0222101059161 with K =22
It loops through k = 5 to k = 50 to find the optimal K for imputation. The best K in after my test is 22, I think it is appropriately large to produce a small average difference.
As the outcome indicates after hours, K-NN imputation has the smallest average difference. Mean imputation and Median imputation has similar outcomes, which may be because Mean is close to Median in this dataset.
After completing this part, my understanding of testing the performance of a data mining algorithm has been improved. There are multiple ways to build a model, to impute missing data, or to handle unknown information. What we can do is to try out several different ways to do that, can compare the performance of these chosen methods. Although sometimes the comparison can be hard to perform because we don’t know that right answer for some real missing data value. However, we can do tests based on known data to gain a close estimation of the performance of each method. I believe the model chosen for the case or the algorithm used for the problem can be continuously improved when more training data is offered. Although the model is made to fit in the known patterns, more training can normally lead to higher accuracy in predicting unknown instances. ‘I have seen too much, so I can predict unknown better as an elder.’ A generalized summary of testing the performance is to use known data as problems that we’ve already known the answer to test our algorithm and to see how accurate their predictions are. Then use future data as training data to refine those methods. Code for imputation written in python can be found on my Github.
Associate Rules Mining
We are going to use the AR method to find possible gene regulation in Alzheimer’s disease using the AD data. First, in order to convert gene expressions to buy or no buy that can be used in Apriori Algorithm. Considering that 8560 genes are too many to generating frequent itemsets, I use 1800 genes to perform the associated rules mining.
We first need to generate basket (transaction) datasets from the gene expression data we have. Since we have datasets for AD cases and normal controls, we can have two sets of basket datasets, respectively, one for cases (in case.gex) and the other for controls (in ctrl.gex). In one dataset (for AD or controls), each gene can be viewed as an item and each person can be viewed as a transaction. We now consider or determine whether a transaction (person) has a particular item (gene or gene expressed) as follows. First, a person having a particular gene means that she has that gene expressed to have abundance above a certain level. Since the genes that are not highly expressed may not be interesting, we will discard them from our baskets. Therefore, for each feature (gene), we will only consider the x% people whose expressions of that gene are among the top x% most abundant. We use x=10%. This means that we turn the gene expression matrix of the given data into a new 1/0 matrix, where (i, j) = 1 if the i-th person has the expression of the j-th gene (item) among the top x% of all the people, or 0 otherwise. Take 10% and the 187 individuals in the control dataset as an example. In this case, we should expect to see 18 entries (or people) have 1 for a particular gene or 0 for the rest. To be precise – you sort each gene of all people and take the x% as 1 and the rest 0. The preprocessing of the data can be found on my Github.
Here starts the Apriori Algorithm part: using python to do Apriori algorithm (for 3 days) on the case and ctrl dataset after preprocessing. Using the outcomes for 5 * 4 = 20 different combinations of min support and min confidence to plot the below figures for the two datasets. We can find out that the number of itemsets and AR found by the Apriori algorithm increases exponentially, which can be seen from the mostly linear lines from the log10 plots.
Each association rule consists of two parts, and antecedent and a consequent. Based on the association rules we have, we can further study the genes involved in these association rules. Association rules in different datasets can also have different meanings. On one hand, in the ctrl.gex dataset, we have the association rules about the genes of people who don’t have AD. Similarly, the association rules we discovered in the case.gex dataset tell us about the genes of people who have AD. Using the association rule [R] (‘GI_14249467-I’, ‘GI_10864030-S’, ‘GI_13376528-S’, ‘GI_19924114-I’) => (‘GI_20336255-I’,) as an example, we know that among people who have AD, we know that if their gene ‘GI_14249467-I’, ‘GI_10864030-S’, ‘GI_13376528-S’, ‘GI_19924114-I’ are positive, they are very likely (higher than min confidence) to have gene ‘GI_20336255-I’ as positive. With such rules, we can have a better understanding of the gene regulations of people who has or does not have AD. Now we make a comparison of these 50 association rules with the highest confidence from the two datasets. By comparison, we can see that among 50 most confident rules in these two datasets, there are no rules in common. This indicates a big difference in gene regulations between people who have or don’t have AD. This may help researchers to study about the AD for questions like: Who are more likely to have AD? How to figure out if one has AD with a genetic method? How to cure AD with a genetic method? By further study, we may use these association rules to help find the better genetically medical solution for AD.
Feature Selection with Random Forest Model
Using the two sets of AD and control data as the positive and negative training data (no test data) to build the following RFs, each of which has 500 trees:
- Fix the number of features (genes) at 70% of the total number of features, vary the number of instances from 50% to 90%, with an increment of 5%.
- Fix the number of instances at 70% of the total number of instances, vary the number of features (genes) from 50% to 90%, with an increment of 5%.
I calculated the frequencies of 8560 genes in different RFs with different configurations. They I plot them together to see the frequency distribution of different genes.
As the plot shows, the frequency of genes in RF differs from each other. Some genes appear in RF frequently, while some of them may hardly be used to as the point to split from. Since in each tree inside the RF, the best feature to split is chosen by ID3, which means the split feature is chosen if it can decrease the entropy the most. It also means that this gene is more decisive in judging whether this instance should belong to positive class or negative class. By ‘decisive’ I mean the gene we chose to split from is the most ‘important’ or ‘fundamental’ one among the features that we still have in the tree. The more decisive the feature is, the more information gain we can get if we split from this feature. The core idea of building decision tree is to split from the point from which we can get the most information gain, so when we build decision trees we are actually using the feature from the most important one to the least important one in the feature set.
Since in different trees we used in RF features are chosen randomly, different genes can have same ‘chance’ to appear in RF compared to each other. As a result, higher frequencies imply that these genes are more ‘decisive’. Some conditions of these features can help us make a better prediction for the label of an instance.If we want to compare two genes based on their frequencies, we can tell which one is the more ‘important’ one by their frequencies in RF. If one gene is often used to split from and label instances and the other is seldom used, we can tell the one with a higher frequency is more important.
For instance, if we want to decide if an animal is a dog. We may want to split by features like ‘if it is a mammal’ or ‘if it has 4 legs’ before we consider ‘the color of their fur’ or ‘if it has tails’. A dog must have 4 legs, and it must be a mammal, so we consider these features as important features. However, a dog can have a different color of fur, and it can have a tail or not. These features cannot be used to separate dog and not dog instances so they will not appear in RF as frequent as the two important features. Some information about an instance is just more useful than others for us to label them, which can be seen from different frequencies they appear in RF.
By ranking the genes based on their frequency of appearing in random forests, I got the top 100 genes which can be considered as the most informative ones:
A current master student in WUSTL, department of Electrical and System Engineering.