# Why we should not over-trust data

Terms like “Big Data” and “Data Mining” are very popular these days. Business people use them for market analysis and investment planning, and researchers use them to push the boundaries of human knowledge. Even in ‘Sherlock Holmes’, the main character expressed one of the most intriguing ideas about data mining: *The world is woven from billions of lives, every strand crossing every other. What we call premonition is just movement of the web. If you could attenuate to every strand of quivering data, the future would be entirely calculable, as inevitable as mathematics.*

Undeniably, big data is a gold mine: people can extract real value from it with the proper “mining” tools. Machine learning methods help a lot in data mining. In fact, many people believe the two are the same thing: in their view, data mining gathers training data for the machine to learn from, while machine learning supplies more tools for mining data, such as random forests, naive Bayes, clustering and so on. However, data is not as reliable as we think it is, and data mining is not as effective as we want it to be.

Bonferroni’s Principle points out that “if your method of finding significant items returns significantly more items than you would expect in the actual population, you can assume most of the items you find with it are bogus”. This principle helps us set an upper bound on how much we can trust a data mining method. When the results are so unexpected that they defy common sense, we should re-examine the method we used. For instance, take the case of trying to catch evil-doers by their hotel visits, as in Section 1.2.3 of the book ‘*Mining of Massive Datasets*’. Suppose that one billion people are monitored for 1000 days. Each person has a 1% probability of visiting a hotel on any given day, and hotels hold 100 people each, so there are 100,000 hotels (1% of one billion people is 10 million guests per day, at 100 guests per hotel). We consider a group of p people suspected evil-doers if they all stayed at the same hotel on d different days. Assuming d and p are sufficiently small, we can derive the expected number of false accusations, that is, the expected number of sets of p people that will be suspected of evil-doing even though everyone behaves at random:

The probability that any given p people all decide to visit a hotel on a given day is 10^{-2p}. For them all to be in the same hotel, this must be divided by 10^{5(p-1)}, which gives 10^{-7p+5}. The same coincidence must happen on each of the other (d-1) days, so we multiply by 10^{-(d-1)(7p-5)}, which gives 10^{-7pd+5d}. There are about 10^{3d}/d! ways to choose d days out of 1000, and about 10^{9p}/p! ways to choose a group of p people out of 10^{9}. Multiplying these together, the expected number of suspicious groups is about 10^{9p+8d-7pd}/(p! d!).

We assumed that nobody is an evil-doer, yet the method still flags suspects in theory, and a lot of them. With p=2 and d=2, we get 250,000 suspicious pairs of people, far more than the true number, which is zero. This case shows how statistics can fool us.
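The arithmetic above can be checked with a short script. This is only a sketch under the assumptions stated in the text (one billion people, 1000 days, 100,000 hotels, a 1% daily visit probability); the function name `expected_false_positives` and its `exact` flag are made up for illustration and do not come from the book.

```python
from math import comb, factorial


def expected_false_positives(p, d, people=10**9, days=1000,
                             hotels=10**5, visit_prob=0.01, exact=False):
    """Expected number of groups of p people seen in the same hotel on d days."""
    # Chance that p given people all visit a hotel on one day and that it is
    # the same hotel: visit_prob^p / hotels^(p-1), i.e. 10^(-7p+5) here.
    same_hotel_one_day = visit_prob**p / hotels**(p - 1)
    # The coincidence must happen on each of the d chosen days.
    same_hotel_all_days = same_hotel_one_day**d
    if exact:
        groups = comb(people, p)
        day_sets = comb(days, d)
    else:
        # Large-n approximation used in the text: C(n, k) ~ n^k / k!
        groups = people**p / factorial(p)
        day_sets = days**d / factorial(d)
    return groups * day_sets * same_hotel_all_days


print(round(expected_false_positives(2, 2)))              # approximation
print(round(expected_false_positives(2, 2, exact=True)))  # exact binomials
```

With p=2 and d=2 the approximation reproduces the figure of roughly 250,000 suspicious pairs, and the exact binomial coefficients give a value only slightly below it, so the approximation does not change the conclusion.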

What is more, the data we get from the world can be labeled or unlabeled. Labeled data is easy to understand and process, but unlabeled data is a different story. Labeled data has been given meaning by a human: it is labeled, tagged, or ‘explained’ for later analysis. However, labels and tags can be subjective, which means that different people may label the same data differently. As the article “The Unreasonable Effectiveness of Data” puts it:

*“Instead, a corpus for these tasks requires skilled human annotation. Such annotation is not only slow and expensive to acquire but also difficult for experts to agree on, being bedeviled by many of the difficulties we discuss later in relation to the Semantic Web.”*

For instance, an X-ray image is just unlabeled data until a doctor examines it and labels it with ‘tumor’, ‘cancer’, etc. Without these labels, the X-ray itself is hardly meaningful to a human. In the case of the article, manually labeling such large-scale data can be rather expensive and controversial, but unannotated data can still be useful with a statistical approach.

On the other hand, unlabeled data is raw and unprocessed. It has not been given any meaning that humans are familiar with: it is not ‘explained’, and in that sense it is objective. Such data flows into our lives but is not captured or explained well enough for users to take advantage of it.

The article also describes a data-based approach. It requires a huge database containing probabilities of ‘short sequences of consecutive words’, known as n-grams. The occurrences of each n-gram in a corpus of billions or trillions of words are counted to build the database. Researchers then use the frequencies of observed n-grams to estimate the probabilities of new n-grams. In machine translation, these statistical models consist of large memorized phrase tables that map specific source-language phrases to target-language phrases.
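The counting step can be sketched in a few lines. This is a minimal illustration, not the system the article describes: the nine-word corpus is a hypothetical toy example, the helper names `ngrams` and `next_word_probability` are invented here, and the estimate is a plain relative frequency with no smoothing, whereas real systems work on billions of words and handle unseen n-grams more carefully.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def next_word_probability(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    n = len(context) + 1
    grams = ngrams(tokens, n)
    full_counts = Counter(grams)
    # Count the context only where a following word exists.
    context_counts = Counter(gram[:-1] for gram in grams)
    seen = context_counts[tuple(context)]
    if seen == 0:
        return 0.0  # context never observed; a real system would smooth or back off
    return full_counts[tuple(context) + (word,)] / seen


corpus = "the cat sat on the mat the cat ate".split()  # hypothetical toy corpus
print(next_word_probability(corpus, ("the",), "cat"))
print(next_word_probability(corpus, ("the",), "dog"))
```

In this toy corpus “the” is followed by “cat” in two of its three occurrences, so the estimate for “cat” is two thirds, while the unseen pair “the dog” gets probability zero. That zero is exactly the weakness a larger corpus is meant to reduce.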

The approach is an effective and practical way for researchers to extract value from unlabeled data, but its limits are obvious:

- __Ontology writing:__ Encoding knowledge from a specific field and reasoning with it is expensive. Different subject fields have largely different vocabularies and their own ontologies, which means that this statistical model is highly limited in dealing with such writings.
- __Implementation Difficulty:__ Building a database-backed web service requires much more effort than building a webpage written in natural language.
- __Competition:__ Competition in certain domains causes resistance to such ontologies, because they could leak the confidential information that keeps companies ahead of their competitors.
- __Inaccuracy and Deception:__ Such a model relies on honest and self-correcting users, which means that human error or intentional tampering can lead to significant errors in the model. Also, the models are optimized with the best available methods, yet they are not guaranteed to be optimal.

To sum up, data can be useful and valuable if we handle it with suitable and effective tools, but it is impossible to find a “perfect” tool for all kinds of data. Also, even if we found a good tool for each type of data, big data would still fool us sometimes, given its volume, velocity, and variety.

*(Featured Image Source: https://goo.gl/7pcL4G)*

## Arthur Zhang

A current master's student at WUSTL, Department of Electrical and Systems Engineering.