Hunting data outliers

Introduction

Outliers can often be the mischievous troublemakers of any dataset, throwing off your analysis and skewing your results. But what exactly is an outlier? Simply put, it’s a data point that stands out like a sore thumb—something that doesn’t quite fit with the rest of the data. Identifying and removing these anomalies is a critical first step in ensuring your analysis is accurate and reliable.

In this article, we’ll dive into one of the most widely used techniques for filtering out outliers: the Interquartile Range (IQR) method. The IQR measures the spread of the middle 50% of your data by calculating the difference between the 75th percentile (Q3) and the 25th percentile (Q1), so IQR = Q3 - Q1. It’s simple yet powerful, a true game-changer for improving the quality of your analysis. Let’s break it down and learn how to tame those unruly outliers! Any data point that meets either of the following criteria is flagged as an outlier:

Less than Q1 - 1.5*IQR
Greater than Q3 + 1.5*IQR
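
To make the rule concrete, here is a minimal sketch of the two fences as a small helper. The function name `iqr_bounds`, the parameter `k`, and the toy values are made up for illustration; they are not part of the dataset or the original notebook.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    s = pd.Series(values, dtype="float64")
    q1 = s.quantile(0.25)  # 25th percentile
    q3 = s.quantile(0.75)  # 75th percentile
    iqr = q3 - q1          # spread of the middle 50%
    return q1 - k * iqr, q3 + k * iqr

# Toy values invented for illustration (not from the Kaggle dataset).
toy = [6, 7, 7, 8, 8, 8, 9, 20]
lower, upper = iqr_bounds(toy)
outliers = [v for v in toy if v < lower or v > upper]
print(lower, upper, outliers)
```

With pandas’ default linear interpolation, this toy list gives Q1 = 7 and Q3 = 8.25, so the fences land at 5.125 and 10.125 and only the value 20 is flagged.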

About the Dataset

For this demonstration, we’ll be using the Sleep Time Prediction dataset by Govindaram Sriram from Kaggle. This dataset is designed for machine learning models to predict sleep duration based on daily lifestyle parameters. The data includes features like workout time, reading time, phone usage time, work hours, caffeine intake, and relaxation time, with sleep time as the target variable. It includes outliers to make models robust to noisy real-world data.

Exploratory Data Analysis

N.B.: You can find all the code here

We’ll start by exploring the data. I’ve already imported the necessary libraries and loaded the dataset into a pandas dataframe. Let’s take a look at how the data looks:
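
A rough sketch of that first look is below. The real CSV path is not given here, so the snippet reads a tiny made-up sample from an in-memory string instead; the column names follow the article, and the values are invented.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for the real CSV; the values are invented.
# With the real file you would pass its path to pd.read_csv instead.
csv_text = """WorkoutTime,ReadingTime,PhoneTime,WorkHours,CaffeineIntake,RelaxationTime,SleepTime
1.2,0.5,3.0,8.0,100.0,1.0,7.5
2.0,1.0,2.5,7.5,50.0,1.5,8.0
0.5,0.2,4.0,9.0,200.0,0.5,6.0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows (up to five)
print(df.shape)    # (rows, columns)
```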

The data has 7 columns (WorkoutTime, ReadingTime, PhoneTime, WorkHours, CaffeineIntake, RelaxationTime and SleepTime) and 2000 rows/observations. Next, we’ll look at the mean and median of every column; I’ll only include WorkoutTime and SleepTime here.
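
One way to put the mean and median side by side is `DataFrame.agg`. The sketch below uses synthetic stand-in data (invented, not the Kaggle values): a symmetric WorkoutTime column and a SleepTime column with a handful of extreme values mixed in.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (invented): WorkoutTime is symmetric, while
# SleepTime has five extreme values of 20 mixed in.
df = pd.DataFrame({
    "WorkoutTime": np.linspace(0, 3, 100),
    "SleepTime": np.concatenate([np.linspace(6, 9, 95), np.full(5, 20.0)]),
})

summary = df.agg(["mean", "median"])  # one row per statistic
gap = (summary.loc["mean"] - summary.loc["median"]).abs()
print(summary)
print(gap)  # a large gap hints at skew or outliers
```

The mean is pulled toward extreme values while the median barely moves, which is why a visible gap between the two is a useful early warning sign.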

A clear observation is that every column has a mean very close to its median, except SleepTime. Looks interesting, doesn’t it? Let’s try to visualize this with the help of a histogram.
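
As a plotting-free sketch of the same idea, the snippet below bins a synthetic stand-in for SleepTime (invented values) with `np.histogram` and prints a rough text histogram. The bin count and range are arbitrary choices for this toy data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (invented): mostly 6 to 9, plus a few values at 20.
sleep = pd.Series(
    np.concatenate([np.linspace(6, 9, 95), np.full(5, 20.0)]),
    name="SleepTime",
)

# Seven equal-width bins between 6 and 20.
counts, edges = np.histogram(sleep, bins=7, range=(6, 20))
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * int(n)}")
```

If matplotlib is installed, `sleep.plot.hist()` draws the same kind of picture as a chart. Either way, a small isolated bar far from the main bulk is the pattern to look for.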

We see the column has a comparatively large spread, with sleep times stretching as high as 20. We can now guess with some confidence that this column has outliers. Time to hunt them down!

Calculating the Range

Time for the fun part. We first compute Q1 and Q3 for SleepTime and take their difference to get the IQR. Then we calculate the lower and upper limits using the formulas from the introduction:

lower = q1 - 1.5 * IQR
upper = q3 + 1.5 * IQR
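
Put together, the calculation might look like the sketch below. It runs on a synthetic stand-in for the SleepTime column (invented values), so the printed numbers will not match the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the SleepTime column (invented values).
df = pd.DataFrame({
    "SleepTime": np.concatenate([np.linspace(6, 9, 95), np.full(5, 20.0)]),
})

q1 = df["SleepTime"].quantile(0.25)  # 25th percentile
q3 = df["SleepTime"].quantile(0.75)  # 75th percentile
IQR = q3 - q1

lower = q1 - 1.5 * IQR
upper = q3 + 1.5 * IQR
print(f"Q1={q1:.2f}, Q3={q3:.2f}, IQR={IQR:.2f}, limits=({lower:.2f}, {upper:.2f})")
```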

Almost there, we’re in the endgame now! Let’s take a look at who those criminals are!

Filtering the Outliers

We can see that we have 64 outliers, which is about 3.2% of the 2000 rows. Filtering them out will, more often than not, lead to a better analysis.
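
A minimal sketch of the filtering step, again on synthetic stand-in data (invented values, so the count differs from the 64 found in the real dataset): build a boolean mask from the two limits, then split the rows with it.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the SleepTime column (invented values).
df = pd.DataFrame({
    "SleepTime": np.concatenate([np.linspace(6, 9, 95), np.full(5, 20.0)]),
})

q1, q3 = df["SleepTime"].quantile([0.25, 0.75])
IQR = q3 - q1
lower, upper = q1 - 1.5 * IQR, q3 + 1.5 * IQR

# True for rows that fall outside either limit.
is_outlier = (df["SleepTime"] < lower) | (df["SleepTime"] > upper)
outliers = df[is_outlier]
cleaned = df[~is_outlier]

print(f"{len(outliers)} outliers ({len(outliers) / len(df):.1%} of rows)")
```

Keeping the mask separate from the filtered frame makes it easy to inspect the flagged rows before deciding whether to drop them.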

Conclusion

Spotting and filtering outliers is a cornerstone of effective data analysis. While the Interquartile Range (IQR) is one of the most popular tools for the job, it’s far from the only method in a data analyst’s toolkit. In the future, we’ll dive into other techniques to expand your arsenal. Until then, may your outliers be few and your insights many. Happy hunting!
