Data Science is great. The idea of analyzing data for decision making has been around for many years, but the popularity of data science has exploded along with the FAANG companies’ growth in recent years. No matter your job title, experience level, or industry, I am confident that you will encounter solutions or products that are highly ‘data-driven’ or powered by Artificial Intelligenceᵗᵐ. Here are the Top 4 methods used by data scientists to fool others. As a Machine-Learning researcher and practitioner, I have made these ‘mistakes’ myself in the past, sometimes even unknowingly!
“Our model achieves an accuracy of 98.9%”
I am sure all of us have come across a statement like the one above. In the world of data science, accuracy alone is simply not enough to indicate performance or value. First of all, accuracy can be interpreted in many ways! Depending on the task at hand, it could mean that the model’s single best guess is correct, or that the correct label appears anywhere among its five best guesses (Top-1 vs Top-5 accuracy on ImageNet). By definition, the top-1 accuracy can never be higher than the top-5 accuracy, so a headline number means little until you know which one it is.
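To make the top-1 vs top-5 distinction concrete, here is a minimal sketch in plain Python. The function name `top_k_accuracy` and the score table are made up for illustration; they are not taken from ImageNet or from any particular library.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Hypothetical model scores: 4 samples, 6 classes each.
scores = [
    [0.50, 0.20, 0.10, 0.10, 0.05, 0.05],  # true class 0 is the best guess
    [0.10, 0.40, 0.30, 0.10, 0.05, 0.05],  # true class 2 is only the second guess
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # true class 5 is ranked last
    [0.05, 0.05, 0.10, 0.60, 0.10, 0.10],  # true class 3 is the best guess
]
labels = [0, 2, 5, 3]

top1 = top_k_accuracy(scores, labels, k=1)
top5 = top_k_accuracy(scores, labels, k=5)
print(f"top-1: {top1:.2f}, top-5: {top5:.2f}")  # top-1: 0.50, top-5: 0.75
```

The same model on the same data scores 50% or 75% depending on which definition is quoted.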
Also, accuracy can be a highly misleading metric. For example, let us imagine a task where we need to detect cancerous tumors that are found in only 1% of the population. A model that simply predicts ‘no tumor’ all the time would be reported to have 99% accuracy! This is clearly a crappy model, as it would give the all-clear to every patient who actually has cancer. Though this is an extreme example of how accuracy fails as a metric, the same thing can happen to varying degrees in any project where one class is much rarer than the other. Here is a great comprehensive article all about metrics if you wish to find out more!
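The tumor example can be checked with a few lines of Python. This is only a sketch: the population of 1,000 patients is an assumption chosen to match the 1% rate, and recall (the share of real tumors that the model catches) is just one of several metrics that expose the problem.

```python
# Hypothetical population: 1,000 patients, 10 of whom (1%) have a tumor.
labels = [1] * 10 + [0] * 990          # 1 = tumor, 0 = no tumor
predictions = [0] * len(labels)        # the "model" always says 'no tumor'

correct = sum(p == y for p, y in zip(predictions, labels))
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
false_negatives = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))

accuracy = correct / len(labels)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy: {accuracy:.1%}")  # accuracy: 99.0%
print(f"recall:   {recall:.1%}")    # recall:   0.0%
```

An accuracy of 99% sits next to a recall of 0%: every one of the ten sick patients is missed.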
When presented with a highly optimistic accuracy or metric, ask about other common metrics used for the specific task!
“Garbage In, Garbage Out.” — George Fuechsel
The dangers of using poor-quality data can be summed up in the quote above. GIGO is one of the guiding principles for all data scientists. Bad data typically means training data that is unrepresentative of real-world scenarios, or data that carries some sort of unreasonable bias. Models trained on lousy data will simply have lousy performance when applied to practical situations. Even the newest, state-of-the-art models suffer from this problem! Combined with using improper metrics, GIGO can produce models that promise the moon, yet have the performance of a dumpster.
When evaluating a data scientist’s proposal, ask for a look at their dataset!
The concept of a train-test split is common knowledge among data scientists and machine learning practitioners. Having a proper test set allows data scientists to properly evaluate the performance of their model! A test set ‘simulates’ data obtained from real-life scenarios, since it is data that the model has not seen during training. From the outside, there is no definite way to check that a proper train-test split has been performed, so it is important to find out the exact procedures that were followed!
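For reference, a proper split can be as simple as the sketch below. The helper `train_test_split` here is a hand-written illustration using only the standard library (it is not the scikit-learn function of the same name), and the 80/20 ratio and integer sample IDs are assumptions.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed, then cut into disjoint train and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(100))  # hypothetical sample IDs
train, test = train_test_split(samples)

# The property that matters: no sample appears in both sets.
overlap = set(train) & set(test)
print(len(train), len(test), len(overlap))  # 80 20 0
```

Whatever tooling was actually used, the thing to ask for is the same: evidence that no test sample, and nothing derived from one, was available to the model during training.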
Find out how the train-test split was conducted, and how the data was handled and the model trained on each of the resulting datasets.
Nearly every product and solution in the technology space claims to involve Artificial Intelligence, Deep Learning, or Machine Learning. These are the hottest keywords in the data science space today, and people are hopping on the train, whether they truly have a ticket or not. AI/ML can be used in things ranging from self-driving cars to smart-lighting systems, and the complexity level can vary immensely. One common bluff is that AI models are a ‘black box’ that is too hard to explain. However, any data scientist worth their salt will be able to explain their model in simple terms!
Always ask about the exact model used and a simple explanation of its workings or architecture!
Data science can be a complex mess for those who are not well versed in it. However, we should work toward wading through the BS so that we can all be well informed! Keep these tricks in mind and always be curious about what is going on behind the scenes and how things work. You can count on these principles to guide your journey in data science.