Can we trust data anymore?

In the world of the internet, data is a ubiquitous resource, and many consumers place a great deal of trust in it. Oftentimes the word “data” is treated as synonymous with “fact”, but there is an intermediate step in that conversion: facts are created by looking at data and inferring causality. There is room for error in this inference, and problems occur when the resulting “facts” propagate to the general public. There has recently been a surge of awareness of faulty facts derived from data, infamously dubbed “fake news”. Over time, the trust we place in data has changed. We realize now that the general population is not as good as we thought at thinking critically about the information it is presented with.

A visualization of a social network. Licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license. (source)

The revelation I have come to is that representing data correctly is not enough. An observer’s opinion can be swayed even by data that is accurately recorded and displayed, because certain facts about the data are not presented to the consumer. Take the classic “9/10 dentists recommend this toothpaste” statistic. Even if the research is accurate and has not been tampered with, what if the company only surveyed dentists in one particular country? What if the only toothpaste imported into that country is the advertised company’s? The data is still correct, but there is a tremendous bias in the company’s favour. The consumer lacks the context needed to fully understand the data being presented to them: in this case, that a sampling bias is present in the data that was collected. Context for data is extremely important.
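To make the sampling-bias point concrete, here is a minimal sketch. Every number in it is invented for illustration, as are the country names and the `recommendation_rate` helper. It compares the recommendation rate from one favourable country against the rate across all surveyed countries.

```python
# Hypothetical survey results, invented purely for illustration.
# In "country_a" the advertised brand is the only toothpaste imported.
surveys = {
    "country_a": {"recommend": 90, "total": 100},
    "country_b": {"recommend": 30, "total": 100},
    "country_c": {"recommend": 45, "total": 100},
}


def recommendation_rate(countries):
    """Share of surveyed dentists who recommend the brand in the given countries."""
    recommend = sum(surveys[c]["recommend"] for c in countries)
    total = sum(surveys[c]["total"] for c in countries)
    return recommend / total


biased = recommendation_rate(["country_a"])  # only the favourable sample
full = recommendation_rate(list(surveys))    # every surveyed country

print(f"Country A only: {biased:.0%}")
print(f"All countries:  {full:.0%}")
```

Both figures are computed honestly from the same made-up data. The "9/10" headline only looks so strong because of which dentists were asked.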

Additionally, a consumer who lacks the background to make sense of certain data can also be inadvertently swayed. This applies to metrics that appear to indicate the value of a product or service but are each only one of many, and are not a true testament to quality or performance. Some examples are video resolution, CPU clock frequency and kill count in Counter-Strike (one of the most popular eSports by viewership). Each of these metrics is misleading on its own. For video, resolution is only part of the equation. The actual bitrate of the video matters far more than its resolution. A 4K video can still have terrible quality, regardless of its high resolution. In fact, at low bitrates a lower resolution can actually have higher quality!

The curve of bitrate to quality for various resolutions. (source)
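One rough way to see why resolution alone says little is to look at how many bits the encoder has available for each pixel. The sketch below assumes a fixed 2 Mbps bitrate and 30 frames per second, both values chosen only for illustration. Bits per pixel is a crude proxy rather than a real quality measure, since actual quality also depends on the codec and the content.

```python
def bits_per_pixel(bitrate_bps, width, height, fps):
    """Average number of encoded bits available per pixel per frame."""
    return bitrate_bps / (width * height * fps)


# Assumed values for illustration: the same bitrate and frame rate for both videos.
BITRATE_BPS = 2_000_000
FPS = 30

uhd = bits_per_pixel(BITRATE_BPS, 3840, 2160, FPS)  # 4K
hd = bits_per_pixel(BITRATE_BPS, 1280, 720, FPS)    # 720p

print(f"4K:   {uhd:.4f} bits per pixel")
print(f"720p: {hd:.4f} bits per pixel")
print(f"720p gets {hd / uhd:.0f}x the bits per pixel")
```

At the same bitrate, the 720p stream has nine times as many bits to spend on each pixel, which is one intuition for why a lower resolution can look better when bandwidth is scarce.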

For processors, clock frequency is one of many factors that contribute to the overall performance of the chip. Architecture, core count and cache speed are only some of the others that determine the true performance of a CPU. Despite this, to someone who just wants a fast laptop, a high clock speed is an easy number to look at and assume it will bring performance. Finally, consider kill count in Counter-Strike. A common metric that players latch on to when judging performance is the number of kills a player has. It turns out that ADR (Average Damage per Round) is a much better representation of a player’s impact on their team. This is because the player credited with a kill may have dealt 1% of the damage, or 100% of it. All that matters for a kill to be registered is that the other player has been eliminated. For the same number of kills, the 1% player has contributed far less than the 100% player, but your average non-player wouldn’t know that. Although having every person go out and learn about Counter-Strike is not my motive, I would like more people to understand that they may not truly understand the information they are viewing.
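The kills-versus-ADR comparison can be sketched with two hypothetical players. The names and statistics below are made up, and `PlayerStats` is an illustrative class, not part of any real stats API. Both players have the same kill count, but one mostly finishes off opponents who were already damaged.

```python
from dataclasses import dataclass


@dataclass
class PlayerStats:
    name: str
    kills: int
    total_damage: int  # damage dealt across the whole match
    rounds: int        # rounds played

    @property
    def adr(self):
        """Average Damage per Round."""
        return self.total_damage / self.rounds


# Hypothetical stat lines with identical kill counts.
finisher = PlayerStats("finisher", kills=20, total_damage=600, rounds=25)
opener = PlayerStats("opener", kills=20, total_damage=2400, rounds=25)

for player in (finisher, opener):
    print(f"{player.name}: {player.kills} kills, ADR {player.adr:.1f}")
```

Judged on kills alone the two look identical, while ADR (24.0 against 96.0 in this made-up example) shows that one of them dealt four times as much damage.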

Data is only half the problem. The other half is a lack of mathematically rigorous thinking in the average observer. Working in tech, surrounded by peers who all have similar majors from similar universities, has placed me in a bubble of people who think the same way. Considering all the biases and variables at play, or the incentives of whoever is showing you the data, is not something everyone does. I am writing this article to say “think twice before you form an opinion”, which is something many people say they do but few actually do. Be more critical of yourself. You will find that is when you get smarter.

Maturing students must be educated on common misconceptions and influencing tactics in data, beyond the basic “graph with an incorrect scale”. This way, when they begin making their own decisions, they can navigate the chaotic world of fake news. Educating the public on data is a high and immediate priority. I have never seen so many misunderstandings of statistics at once as during the coronavirus pandemic. The only way to improve the state of fake news and poor statistics is to have a public that is educated about them. At the end of the day, what’s more dangerous than bad data is an uneducated individual consuming it.