**Towards Data Science blog**.


Exploratory data analysis (EDA) is an important step in any data science project. We usually get a first look at our data by computing descriptive statistics of the dataset. If you are like me, the first function you call might be Pandas **dataframe.describe()** to obtain descriptive statistics. While such analysis is important, *we often underestimate the importance of choosing the correct sample statistics/metrics/estimates*.

In this post, we will go over several metrics that you can use in your data science projects. In particular, we are going to cover several estimates of location and variability and their robustness (sensitivity to outliers).

The following common metrics/estimates are covered in this article:

- *Estimates of location* (first moment of the distribution)
  - mean, trimmed/truncated mean, weighted mean
  - median, weighted median
- *Estimates of variability* (second moment of the distribution)
  - range
  - variance and standard deviation
  - mean absolute deviation, median absolute deviation
  - percentiles (quantiles)

For each metric, we will cover:

- The definition and mathematical formulation along with some insights.
- Whether the metric is robust (i.e., not sensitive to extreme cases)
- Python implementation and an example

*The focus of this article is on the metrics and estimates used in the univariate analysis of numeric data*.

A note before we start: data scientists and business analysts usually refer to values calculated from the data as a *metric*, whereas statisticians use the term *estimates* for such values[1].

# Estimates of Location

Estimates of location are measures of the central tendency of the data (where most of the data is located). In statistics, this is usually referred to as the first moment of a distribution.

## Mean

The *arithmetic mean*, or simply *mean* or *average*, is probably the most popular estimate of location. There are different variants of the mean, such as the *weighted mean* and the *trimmed/truncated mean*. You can see how they can be computed below.
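For reference, the standard textbook definitions of the three variants can be written as follows, for $n$ observations $x_1, \ldots, x_n$ with weights $w_1, \ldots, w_n$. The equation numbers follow the way the text refers to them (1.1 for the mean, 1.2 for the weighted mean, 1.3 for the trimmed mean). The symbols $x_{(i)}$ for the $i$-th smallest value and $p$ for the number of values dropped from each end are notational assumptions.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{1.1}
$$

$$
\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \tag{1.2}
$$

$$
\bar{x}_{\text{trimmed}} = \frac{1}{n - 2p}\sum_{i=p+1}^{n-p} x_{(i)} \tag{1.3}
$$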

*$n$ denotes the total number of observations (rows).*

The weighted mean (equation 1.2) is a variant of the mean that can be used when the sample data does not represent the different groups in a dataset equally. By assigning a larger weight to groups that are under-represented, the computed weighted mean will more accurately represent all groups in our dataset.

Outliers can distort both the **mean** and the **weighted mean**, since neither one is a robust metric!

**Robust estimate**: A metric that is not sensitive to extreme values (outliers).

Another variant of the mean is the *trimmed mean* (eq. 1.3), which is a robust estimate. This metric is used to calculate the final score in many sports where each member of a panel of judges gives a score. The lowest and the highest scores are dropped, and the mean of the remaining scores is computed as a part of the final score[2]. One such example is the international diving score system.

$\bar{x}$ refers to the **sample** mean, whereas $\mu$ refers to the **population** mean.

### A Use Case for the Weighted Mean

If you want to buy a smartphone, a smartwatch, or any other gadget with many options on the market, you can use the following method to choose among them.

Let’s assume you want to buy a smartphone, and the following features are important to you: 1) battery life, 2) camera quality, 3) price, and 4) the phone design. Then, you give the following weights to each one:

Let’s say you have two options: an iPhone and Google’s Pixel. You can give each feature a score between 1 and 10 (1 being the worst and 10 being the best). After going over some reviews, you may give the following scores to the features of each phone.

So, which phone is better for you?

And based on your feature preferences, the Google Pixel might be the better option for you!
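The scoring method can be sketched in a few lines of plain Python. Every number below is a hypothetical placeholder chosen only to illustrate the arithmetic, and the two phones are deliberately left unnamed; the helper `weighted_mean` is our own small function implementing equation 1.2.

```python
def weighted_mean(scores, weights):
    """Weighted mean as in equation 1.2: sum(w * x) / sum(w)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * x for w, x in zip(weights, scores)) / total_weight


# Hypothetical importance of each feature (the weights need not add up to 1)
features = ["battery life", "camera quality", "price", "design"]
weights = [2, 4, 3, 1]

# Hypothetical 1-10 scores for two unnamed phones, in the same feature order
phone_a = [7, 9, 4, 8]
phone_b = [8, 8, 7, 6]

score_a = weighted_mean(phone_a, weights)  # (14 + 36 + 12 + 8) / 10 = 7.0
score_b = weighted_mean(phone_b, weights)  # (16 + 32 + 21 + 6) / 10 = 7.5

print("Phone A:", score_a)
print("Phone B:", score_b)
```

With these made-up numbers, phone B wins even though phone A has the better camera, because the price carries a relatively large weight. Changing the weights changes the ranking, which is the whole point of the method.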

## Median

The median is the middle value of a sorted list, and it’s a robust estimate. For an ordered sequence $x_1, x_2, \ldots, x_n$, the median is computed as follows:
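In its usual textbook form, with the values sorted in ascending order and indexed $x_1 \le x_2 \le \dots \le x_n$ (the notation is an assumption here), the definition reads:

$$
\text{median} =
\begin{cases}
x_{(n+1)/2} & \text{if } n \text{ is odd} \\[4pt]
\dfrac{1}{2}\left(x_{n/2} + x_{n/2+1}\right) & \text{if } n \text{ is even}
\end{cases}
$$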

Analogous to the weighted mean, we can also have the *weighted median*, which can be computed as follows for an ordered sequence $x_1, x_2, \ldots, x_n$ with weights $w_1, w_2, \ldots, w_n$, where $w_i > 0$.
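One common way to define the weighted median is as the smallest value whose cumulative weight reaches at least half of the total weight. The plain-Python sketch below follows that convention (sometimes called the lower weighted median); other conventions average two neighbouring values when the cumulative weight lands exactly on one half, so treat this as an illustration and not as the definition used by any particular library. The function name `weighted_median` is our own.

```python
def weighted_median(values, weights):
    """Lower weighted median: the smallest value whose cumulative weight
    reaches at least half of the total weight."""
    if len(values) != len(weights) or len(values) == 0:
        raise ValueError("values and weights must be non-empty and equally long")
    pairs = sorted(zip(values, weights))  # sort values, keeping weights attached
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]  # only reachable through floating-point rounding


# Small example dataset with one extreme value (20) that gets a low weight
data = [2, 1, 2, 3, 2, 2, 3, 20]
weights = [1, 0.5, 1, 1, 1, 1, 1, 0.5]

result = weighted_median(data, weights)
print("Weighted median:", result)
```

With equal weights, this reduces to the lower of the two middle values for an even number of observations, and to the ordinary median for an odd number.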

## Mode

The mode is the value that appears most often in the data and is typically used for categorical data, and less for numeric data[1].

## Python Implementation

Let’s first import all necessary Python libraries and generate our dataset.

```
import pandas as pd
import numpy as np
from scipy import stats
import robustats
df = pd.DataFrame({
    "data": [2, 1, 2, 3, 2, 2, 3, 20],
    "weights": [1, 0.5, 1, 1, 1, 1, 1, 0.5],  # Weights do not need to add up to 1
})
data, weights = df["data"], df["weights"]
```

You can use NumPy’s `average()` function to calculate the mean and weighted mean (equations 1.1 & 1.2). For computing the truncated mean, you can use `trim_mean()` from the SciPy stats module. A common choice for truncating the top and bottom of the data is 10%[1].

You can use NumPy’s `median()` function to calculate the median. For computing the weighted median, you can use `weighted_median()` from the robustats Python library (you can install it using `pip install robustats`). Robustats is a high-performance Python library to compute robust statistical estimators implemented in C.

For computing the mode, you can use the `mode()` function either from the robustats library, which is particularly useful on large datasets, or from the `scipy.stats` module.

```
mean = np.average(data) # You can also use Pandas dataframe.mean()
weighted_mean = np.average(data, weights=weights)
truncated_mean = stats.trim_mean(data, proportiontocut=0.1) # With 8 values, int(0.1 * 8) = 0 values are cut from each end
median = np.median(data) # You can also use Pandas dataframe.median()
weighted_median = robustats.weighted_median(x=data, weights=weights)
mode = stats.mode(data) # You can also use robustats.mode() on larger datasets
print("Mean: ", mean.round(3))
print("Weighted Mean: ", weighted_mean.round(3))
print("Truncated Mean: ", truncated_mean.round(3))
print("Median: ", median)
print("Weighted Median: ", weighted_median)
print("Mode: ", mode)
```

```
>>> Mean: 4.375
>>> Weighted Mean: 3.5
>>> Truncated Mean: 4.375
>>> Median: 2.0
>>> Weighted Median: 2.0
>>> Mode: ModeResult(mode=array([2]), count=array([4]))
```
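One detail in the output above deserves a comment: the truncated mean equals the plain mean (4.375). The likely reason is that SciPy's `trim_mean()` rounds the number of trimmed observations down, and 10% of 8 observations is 0.8, so nothing is cut from either end. The plain-Python sketch below mimics that rounding rule to make the effect visible; the helper here is our own stand-in, not SciPy's implementation.

```python
def trim_mean(values, proportiontocut):
    """Mean after dropping int(proportiontocut * n) values from each end."""
    ordered = sorted(values)
    n = len(ordered)
    cut = int(proportiontocut * n)  # rounded down, so small samples may lose nothing
    if 2 * cut >= n:
        raise ValueError("proportiontocut is too large for this sample size")
    kept = ordered[cut:n - cut]
    return sum(kept) / len(kept)


data = [2, 1, 2, 3, 2, 2, 3, 20]

trimmed_10 = trim_mean(data, 0.10)    # int(0.8) = 0 values cut per end
trimmed_125 = trim_mean(data, 0.125)  # int(1.0) = 1 value cut per end (drops 1 and 20)

print("10% trimmed mean:  ", round(trimmed_10, 3))
print("12.5% trimmed mean:", round(trimmed_125, 3))
```

On a sample this small, the proportion has to be at least 1/8 before the extreme value 20 is actually removed, at which point the trimmed mean drops to roughly 2.33.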

Now, let’s see how removing 20 from our data impacts the mean.

```
mean = np.average(data[:-1]) # Remove the last data point (20)
print("Mean: ", mean.round(3))
# >>> Mean: 2.143
```

You can see how much the last data point (20) impacted the mean (4.375 vs. 2.143). In many situations, we end up with outliers that should be cleaned from our datasets, such as faulty measurements that are orders of magnitude away from the other data points.

# Estimates of Variability

The second dimension (or moment) addresses how the data is spread out (variability or dispersion of the data). For this, we have to measure the difference (aka residual) between an estimate of location and an observed value[1].

## Mean Absolute Deviation

One way to get this estimate is to calculate the difference between the largest and the smallest value to get the *range*. However, the range is, by definition, very sensitive to the two extreme values. Another option is the mean absolute deviation, which is *the average of the absolute deviations from the mean*, as can be seen in the formula below:
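Written out in the usual textbook notation (with $\bar{x}$ the sample mean and $n$ the number of observations, both assumed here), the mean absolute deviation is:

$$
\text{mean absolute deviation} = \frac{1}{n}\sum_{i=1}^{n} \left| x_i - \bar{x} \right|
$$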

One reason why the mean absolute deviation receives less attention is that, mathematically, it’s preferable not to work with absolute values when other desirable options, such as squared values, are available. For instance, $x^2$ is differentiable everywhere, while the derivative of $|x|$ is not defined at $x = 0$.

## Variance & Standard Deviation

The variance and standard deviation are much more popular statistics than the mean absolute deviation to estimate the data dispersion.
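In the common textbook convention, the sample variance divides the sum of squared deviations by $n - 1$, and the standard deviation is its square root. The $n - 1$ denominator is an assumption about the convention intended here; the population versions divide by $n$ and use $\mu$ and $\sigma$ in place of $\bar{x}$ and $s$.

$$
s^2 = \frac{1}{n - 1}\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2,
\qquad
s = \sqrt{s^2}
$$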

$s$ refers to the **sample** *standard deviation*, whereas $\sigma$ refers to the **population** *standard deviation*.

As can be noted from the formula, the standard deviation is on the same scale as the original data, making it an easier metric to interpret than the variance. Analogous to the trimmed mean, we can also compute the *trimmed/truncated standard deviation*, which is less sensitive to outliers.

A good way of remembering some of the above estimates of variability is to link them to other metrics or distances that share a similar formulation[1]. For instance,

## Median Absolute Deviation (MAD)

Like the arithmetic mean, none of the estimates of variability (variance, standard deviation, mean absolute deviation) is robust to outliers. Instead, we can use the *median absolute deviation from the median* to check how our data is spread out in the presence of outliers. The median absolute deviation is a robust estimator, just like the median.
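The definition translates almost word for word into code: take the median, compute each absolute deviation from it, and take the median of those deviations. The sketch below uses only the standard library. The scaling step is a widely used convention (dividing by the 75th percentile of the standard normal, about 0.6745) that makes the result comparable to a standard deviation for normally distributed data; the function name is our own.

```python
from statistics import NormalDist, median


def median_abs_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    values = list(values)
    center = median(values)
    return median(abs(x - center) for x in values)


data = [2, 1, 2, 3, 2, 2, 3, 20]

raw_mad = median_abs_deviation(data)

# Scale factor: 75th percentile of the standard normal, roughly 0.6745
normal_scale = NormalDist().inv_cdf(0.75)
scaled_mad = raw_mad / normal_scale

print("Raw MAD:   ", raw_mad)
print("Scaled MAD:", round(scaled_mad, 3))
```

The extreme value 20 contributes one large deviation (18), but since the median only cares about the middle of the sorted deviations, it barely moves the result.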

## Percentiles

Percentiles (or quantiles) are another measure of data dispersion, based on **order statistics** (statistics based on sorted data). The $P$-th percentile is a value such that at least $P$ percent of the values are lower than or equal to it.

The 25th (Q1) and 75th (Q3) percentiles are particularly interesting since their difference (Q3 – Q1) spans the middle 50% of the data. This difference is known as the **interquartile range (IQR)** (IQR = Q3 – Q1). Percentiles are also used to visualize the data distribution using boxplots.
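To make the idea of an order statistic concrete, here is a small plain-Python sketch that computes a quantile by linear interpolation between the two nearest sorted values. There are several competing quantile definitions; linear interpolation is the one assumed here (it is also, as far as we know, the default used by NumPy's `quantile()`), and the helper name is our own.

```python
import math


def quantile(values, q):
    """q-th quantile (0 <= q <= 1) via linear interpolation between order statistics."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = q * (len(ordered) - 1)  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


data = [2, 1, 2, 3, 2, 2, 3, 20]

q1 = quantile(data, 0.25)
q3 = quantile(data, 0.75)
iqr = q3 - q1

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

Because Q1 and Q3 sit well inside the sorted data, the extreme value 20 has no influence on the IQR here.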

A nice article about boxplots is available on Towards Data Science blog.

## Python Implementation

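You can use NumPy’s `var()` and `std()` functions to calculate the variance and standard deviation, respectively. To calculate the mean absolute deviation, older versions of Pandas provide a `mad()` function; it was deprecated in pandas 1.5 and removed in 2.0, in which case you can compute it directly as `(data - data.mean()).abs().mean()`. For computing the trimmed standard deviation, you can use SciPy’s `tstd()` from the stats module (pass `limits=(lower, upper)` to actually exclude values; without limits it returns the ordinary sample standard deviation). You can use Pandas `boxplot()` to quickly visualize a boxplot of the data.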

```
import pandas as pd
import numpy as np
from scipy import stats
variance = np.var(data) # ddof=0 by default (population variance); pass ddof=1 for the sample variance
standard_deviation = np.std(data) # Also ddof=0, whereas df["data"].std() uses ddof=1
# Series.mad() was removed in pandas 2.0, so compute the mean absolute deviation directly
mean_absolute_deviation = (data - data.mean()).abs().mean()
# Without limits, tstd() trims nothing and returns the sample standard deviation (ddof=1)
trimmed_standard_deviation = stats.tstd(data)
median_absolute_deviation = stats.median_abs_deviation(data, scale="normal") # stats.median_absolute_deviation() is deprecated
# Percentile
Q1 = np.quantile(data, q=0.25) # Can also use dataframe.quantile(0.25)
Q3 = np.quantile(data, q=0.75) # Can also use dataframe.quantile(0.75)
IQR = Q3 - Q1
print("Variance: ", variance.round(3))
print("Standard Deviation: ", standard_deviation.round(3))
print("Mean Absolute Deviation: ", mean_absolute_deviation.round(3))
print("Trimmed Standard Deviation: ", trimmed_standard_deviation.round(3))
print("Median Absolute Deviation: ", median_absolute_deviation.round(3))
print("Interquartile Range (IQR): ", IQR)
```

```
>>> Variance: 35.234
>>> Standard Deviation: 5.936
>>> Mean Absolute Deviation: 3.906
>>> Trimmed Standard Deviation: 6.346
>>> Median Absolute Deviation: 0.741
>>> Interquartile Range (IQR): 1.0
```
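It may look odd that the "trimmed" standard deviation (6.346) in the output above is larger than the plain one (5.936). A plausible explanation is that the two numbers use different denominators and that nothing was actually trimmed: NumPy's `std()` divides by $n$ by default, while `tstd()` called without limits divides by $n - 1$ and keeps every value. The standard-library sketch below reproduces both numbers under that assumption.

```python
import math
from statistics import pvariance, variance

data = [2, 1, 2, 3, 2, 2, 3, 20]

population_variance = pvariance(data)  # divides by n
sample_variance = variance(data)       # divides by n - 1

population_std = math.sqrt(population_variance)
sample_std = math.sqrt(sample_variance)

print("Population variance:", round(population_variance, 3))
print("Population std:     ", round(population_std, 3))
print("Sample std:         ", round(sample_std, 3))
```

The population figures match the variance and standard deviation printed above, and the sample standard deviation matches the value reported as "trimmed", which suggests that no observation was excluded in that calculation.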

# Conclusion

In this post, I talked about various estimates of location and variability. In particular, I covered more than 10 different sample statistics and whether they are robust metrics or not. All the metrics, along with their corresponding Python and R functions, are summarized in Table 3. We also saw how the presence of an outlier may impact non-robust metrics like the mean. In such cases, we may want to use a robust estimate. However, in some problems, such as anomaly detection, the extreme cases and outliers are exactly what we are interested in studying.

**Thanks for reading 🙏**


# References

[1] P. Bruce & A. Bruce (2017), *Practical Statistics for Data Scientists*, First Edition, O’Reilly

[2] Wikipedia, Truncated mean
