Learn Statistics with Python


Mean, Median, and Mode

Median of a Dataset

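The median of a dataset is the value that falls in the middle when the dataset is ordered from smallest to largest. If the dataset has an even number of values, there is no single middle value; the median is then usually reported as the average of the two middle values, which is what NumPy's median() function returns.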


  import numpy as np
  data = [24, 16, 30, 10, 12, 28, 38, 2, 4, 36]
  # sorting is shown for illustration; np.median() also accepts unsorted data
  data_sorted = np.sort(data)
  median = np.median(data_sorted)
  print(median)  # Output: 20.0 (the average of the two middle values, 16 and 24)

Mean of a Dataset

Say we have a dataset with the following ten numbers:


2, 4, 10, 12, 16, 24, 28, 30, 36, 38

The sum of these numbers is 200, so the mean is 200 / 10 = 20.
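As a quick check, the mean of the ten numbers above can be computed in plain Python. This is a minimal sketch; the variable names are illustrative and not part of the original cheatsheet.

```python
# the ten numbers listed above
data = [2, 4, 10, 12, 16, 24, 28, 30, 36, 38]

# mean = sum of all values divided by the number of values
mean = sum(data) / len(data)
print(mean)  # 20.0
```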

Variance and Standard Deviation

Standard Deviation

The standard deviation is a measure of a dataset’s spread. It is calculated by taking the square root of the variance of the dataset. The resulting value has the same units as the original data.


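import numpy as np
values = np.array([1,3,4,2,6,3,4,5])
# calculate variance of values
variance = np.var(values)
# the standard deviation is the square root of the variance
std_dev = np.sqrt(variance)
print(std_dev)  # Output: 1.5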

Calculating Variance in Python

In Python, we can calculate the variance of an array using the NumPy var() function.


import numpy as np
values = np.array([1,3,4,2,6,3,4,5])
# calculate variance of values
variance = np.var(values)
print(variance)  # Output: 2.25

Standard Deviation Units

Because standard deviation is in the same units as the original dataset, it is often used to provide context for the mean of the dataset. For example, if the dataset is [3, 5, 10, 14], the mean is 8.0 units and the standard deviation is approximately 4.301 units. The data point 14 lies 6 units above the mean, so we can easily see that it is more than one standard deviation away from the mean.
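The numbers in this example can be reproduced with NumPy. The sketch below assumes the population standard deviation (NumPy's default) and uses illustrative variable names.

```python
import numpy as np

data = np.array([3, 5, 10, 14])

mean = np.mean(data)
std = np.std(data)  # population standard deviation

# distance of the data point 14 from the mean, measured in standard deviations
distance = (14 - mean) / std

print(mean)           # 8.0
print(round(std, 3))  # 4.301
print(distance > 1)   # True
```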

Calculating Standard Deviation in Python

We can calculate standard deviation in Python using the NumPy std() function.
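A minimal sketch of std() on the same array used in the variance example. By default, np.std() computes the population standard deviation; passing ddof=1 gives the sample standard deviation instead.

```python
import numpy as np

values = np.array([1, 3, 4, 2, 6, 3, 4, 5])

# calculate standard deviation of values (population form, ddof=0)
std_dev = np.std(values)
print(std_dev)  # 1.5

# sample standard deviation divides by n - 1 instead of n
sample_std_dev = np.std(values, ddof=1)
```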

Interpretation of Variance

A larger variance means the data is more spread out and values tend to be far away from the mean. A variance of 0 means all values in the dataset are the same.
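To make this concrete, the sketch below compares three small made-up datasets that share the same mean of 5 but differ in spread. The dataset names are illustrative.

```python
import numpy as np

identical = [5, 5, 5, 5]     # all values the same
clustered = [4, 5, 5, 6]     # values close to the mean
spread_out = [0, 2, 8, 10]   # values far from the mean

print(np.var(identical))   # 0.0
print(np.var(clustered))   # 0.5
print(np.var(spread_out))  # 17.0
```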

Variance

Variance is a measure of spread. It is calculated by finding the average of the squared differences between every observation and the mean. The resulting value is in units squared.
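The definition can be followed step by step in plain Python. This sketch uses the same values as the NumPy examples and computes the population variance; the variable names are illustrative.

```python
values = [1, 3, 4, 2, 6, 3, 4, 5]

# step 1: the mean of the observations
mean = sum(values) / len(values)

# step 2: squared difference between every observation and the mean
squared_differences = [(x - mean) ** 2 for x in values]

# step 3: the average of the squared differences
variance = sum(squared_differences) / len(values)
print(variance)  # 2.25
```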

Histograms

Matplotlib Function To Create Histogram

In Python, the pyplot.hist() function in the Matplotlib pyplot library can be used to plot a histogram. The function accepts a NumPy array, the range of the dataset, and the number of bins as input.


import numpy as np
from matplotlib import pyplot as plt
# numpy array
data_array = np.array([1,1,1,1,1,2,3,3,3,4,4,5,5,6,7])
# plot histogram
plt.hist(data_array, range = (1,7), bins = 7)
plt.show()

Mean of a Dataset

The mean, or average, of a dataset is calculated by adding all the values in the dataset and then dividing by the number of values in the set.

For example, for the dataset [1,2,3], the mean is (1+2+3) / 3 = 2.

Histogram Bins

In a histogram, the range of the data is divided into sub-ranges represented by bins. The width of each bin is calculated by dividing the range of the dataset by the number of bins, giving each bin in a histogram the same width.

What is a Histogram?

A histogram is a plot that displays the spread, or distribution, of a dataset. In a histogram, the data is split into intervals, called bins. Each bin shows the number of data points that are contained within that bin.

Histogram Bin Count

In a histogram, the bin count is the number of data points that fall within the bin’s range.

Histogram’s X and Y Axis

In a histogram, the x-axis shows the bin ranges and the y-axis shows the bin counts.
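Bin widths and bin counts can also be inspected numerically without drawing a plot. The sketch below uses np.histogram() on the same array, range, and number of bins as the pyplot.hist() example; the variable names are illustrative.

```python
import numpy as np

data_array = np.array([1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7])

# counts per bin and the edges that delimit the bins
counts, bin_edges = np.histogram(data_array, range=(1, 7), bins=7)

# bin width = range of the dataset divided by the number of bins
bin_width = (7 - 1) / 7

print(counts)  # [5 1 3 2 2 1 1]
print(np.allclose(np.diff(bin_edges), bin_width))  # True: all bins share one width
```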

Describe a Histogram

Unimodal Distribution

Modality describes the number of peaks in a dataset. A unimodal distribution in a histogram means there is one distinct peak indicating the most frequent value in a histogram.

Left-Skewed Dataset

A left-skewed dataset has a long left tail with one prominent peak to the right. The median of this dataset is greater than the mean of this dataset.
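A small made-up dataset illustrates the relationship between the median and the mean. Most values cluster on the right, and a single low value forms the left tail.

```python
import numpy as np

# made-up left-skewed data: a long tail toward small values
data = np.array([1, 7, 8, 8, 9, 9, 10])

mean = np.mean(data)
median = np.median(data)

# the low value pulls the mean down more than the median
print(median > mean)  # True
```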

Multimodal Dataset

If a histogram has more than two peaks, then the dataset is referred to as multimodal.

Bimodal Dataset

A bimodal dataset has two distinct peaks. This typically happens when the dataset contains two different populations.
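As a rough illustration, the sketch below combines two made-up groups of values and counts them per bin with np.histogram(). The two peaks show up as two separate runs of high counts.

```python
import numpy as np

# made-up data: one group of values near 2 and another near 8
data = np.array([1, 2, 2, 2, 3, 7, 8, 8, 8, 9])

# one bin per integer value from 1 to 9
counts, edges = np.histogram(data, bins=9, range=(0.5, 9.5))
print(counts)  # [1 3 1 0 0 0 1 3 1]
```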

Uniform Dataset

A uniform dataset does not have any distinct peaks. Uniform datasets have approximately the same number of values in each group represented by a bar; there is no obvious clustering.

Right-skewed Dataset

In a histogram, if the prominent peak lies to the left with the tail extending to the right, then it is called a right-skewed dataset. In this case, the median is less than the mean of the dataset.

Symmetric Distribution in Histogram

In a histogram, the distribution of the data is symmetric if it has one prominent peak and equal tails to the left and the right. The median and the mean of a symmetric dataset are similar.

Dataset Outliers

An outlier is a data point that differs significantly from the rest of the values in a dataset.

For example, in the dataset [1, 2, 3, 4, 100] the value 100 is an outlier because it lies a large distance from the rest of the data.

Spread of a Dataset

The spread of a dataset describes how far its values are dispersed around the center. In a histogram, it can be read from the distance between the smallest and largest values.

Peak of Unimodal Distribution

In a unimodal distribution, the peak marks where the values are most concentrated and is commonly used to describe the center of the dataset.

Quartiles, Quantiles, and Interquartile Range

Quantiles

Quantiles are the values that divide a dataset into groups of equal size. For example, nine values that split a dataset into ten groups of equal size are quantiles.


import numpy as np
# The value 5 is both the median and the 2-quantile
data = [1, 3, 5, 9, 20]
second_quantile = np.quantile(data, 0.5)
print(second_quantile)  # Output: 5.0

Quartiles

The three dividing points (or quantiles) that split data into four equally sized groups are called quartiles. They are denoted Q1, Q2, and Q3.


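import numpy as np
# Even though d_2 has an outlier, the IQR (Q3 - Q1) is identical for the two datasets
d_1 = [1,2,3,4,5,6,7,8,9]
d_2 = [-100,2,3,4,5,6,7,8,9]
iqr_1 = np.quantile(d_1, 0.75) - np.quantile(d_1, 0.25)  # 4.0
iqr_2 = np.quantile(d_2, 0.75) - np.quantile(d_2, 0.25)  # 4.0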

Numpy’s Quantile() Function

In Python, the numpy.quantile() function takes an array and a number q between 0 and 1. It returns the value at the qth quantile. For example, numpy.quantile(data, 0.25) returns the value at the first quartile of the dataset data.
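A short sketch of numpy.quantile() on a small made-up dataset. The function also accepts a list of q values and then returns one result per value.

```python
import numpy as np

data = [1, 3, 5, 9, 20]

# a single quantile: the first quartile
first_quartile = np.quantile(data, 0.25)
print(first_quartile)  # 3.0

# several quantiles at once: Q1, Q2 (the median), and Q3
quartiles = np.quantile(data, [0.25, 0.5, 0.75])
print(quartiles)  # [3. 5. 9.]
```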

Quantiles and Groups

If the number of quantiles is n, then the number of equally sized groups in a dataset is n+1.
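The n + 1 relationship can be checked on a small made-up dataset. In this sketch, three quantiles are used as dividing points, and np.digitize() with np.bincount() counts how many values fall into each resulting group.

```python
import numpy as np

data = np.arange(1, 13)  # the values 1 through 12

# three quantiles (here, the quartiles) act as dividing points
cut_points = np.quantile(data, [0.25, 0.5, 0.75])

# np.digitize assigns each value the index of the group it falls into
group_index = np.digitize(data, cut_points)
group_sizes = np.bincount(group_index)

print(len(cut_points))  # 3
print(group_sizes)      # [3 3 3 3], i.e. 3 + 1 = 4 groups of equal size
```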

Median in Quantiles

The median is the divider between the upper and lower halves of a dataset. It is the 50% (0.5) quantile, also known as the 2-quantile.

Interquartile Range Definition

The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1). It can be mathematically represented as IQR = Q3 - Q1.

Interquartile Range and Outliers

The interquartile range is considered to be a robust statistic because, unlike the average (or mean), it is not distorted by outliers.
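The sketch below contrasts the mean and the IQR on two datasets that differ only in one outlier. The iqr() helper is a hypothetical convenience function defined here for illustration, not a NumPy function.

```python
import numpy as np

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    return np.quantile(values, 0.75) - np.quantile(values, 0.25)

d_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
d_2 = [-100, 2, 3, 4, 5, 6, 7, 8, 9]  # same data with one outlier

print(np.mean(d_1), np.mean(d_2))  # the mean shifts from 5.0 to about -6.2
print(iqr(d_1), iqr(d_2))          # the IQR stays at 4.0 for both
```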

Boxplots

Box Plot Values

The box in the box plot displays the dataset’s median, first and third quartiles, and the interquartile range. The line in the center of the box shows the median, the edges show the first and third quartiles, and the interquartile range is visualized by the width of the box.


from matplotlib import pyplot
# a list of two datasets is drawn as two box plots in one figure
mul_datasets = [[3, 5, 7, 2], [2, 4, 10, 43]]
pyplot.boxplot(mul_datasets)

Usage of Side-by-side Box plots

Two datasets can be compared visually by placing their box plots side by side. This allows easy comparison of the median, the first and third quartiles, and the IQR of the datasets.


from matplotlib import pyplot
dataset = [3, 5, 7, 2]  # example values; any list of numbers works
pyplot.boxplot(dataset)
pyplot.show()

Side-by-side Boxplots

In Python’s Matplotlib library, if multiple datasets are passed to the pyplot.boxplot() function, those datasets are visualized as side-by-side box plots.

Box Plot Outliers

In a box plot, the data points that fall beyond the whiskers are called outliers. They are usually labeled with a dot or an asterisk.

Box Plot Whiskers

A box plot’s whiskers are the lines that extend from the first and third quartiles toward the points farthest from the median. The upper whisker ends at the largest data value that is no more than 1.5 × IQR above the third quartile, and the lower whisker ends at the smallest data value that is no more than 1.5 × IQR below the first quartile.
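The whisker ends and the outliers can be worked out by hand with NumPy. This is a sketch on made-up data that follows the 1.5 × IQR rule described above; the variable names are illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])  # made-up values

q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1

# limits beyond which data points are treated as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# whiskers end at the most extreme data values inside the limits
inside = data[(data >= lower_limit) & (data <= upper_limit)]
lower_whisker, upper_whisker = inside.min(), inside.max()

outliers = data[(data < lower_limit) | (data > upper_limit)]

print(lower_whisker, upper_whisker)  # 1 8
print(outliers)                      # [30]
```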

Boxplot in Matplotlib

In Python’s Matplotlib library, the pyplot.boxplot() function takes a dataset as input and draws a box plot.

