The median of a dataset is the value that falls in the middle when the dataset is ordered from smallest to largest. If the dataset has an even number of values, the median is the average of the two middle values.
import numpy as np

data = [24, 16, 30, 10, 12, 28, 38, 2, 4, 36]
data_sorted = np.sort(data)
median = np.median(data_sorted)
print(median)
# Output: 20.0 (the dataset has an even number of values, so the median is the average of the two middle values, 16 and 24)
Say we have a dataset with the following ten numbers:
2, 4, 10, 12, 16, 24, 28, 30, 36, 38
The two middle values are 16 and 24, so the median is (16 + 24) / 2 = 20.
The standard deviation is a measure of a dataset’s spread. It is calculated by taking the square root of the variance of a data set. The resulting value has the same units as the original data.
import numpy as np

values = np.array([1, 3, 4, 2, 6, 3, 4, 5])
# calculate variance of values
variance = np.var(values)
In Python, we can calculate the variance of an array using the NumPy var() function.
import numpy as np

values = np.array([1, 3, 4, 2, 6, 3, 4, 5])
# calculate standard deviation of values
standard_deviation = np.std(values)
Because standard deviation is in the same units as the original data set, it is often used to provide context for the mean of the dataset. For example, if the data set is [3, 5, 10, 14], the standard deviation is 4.301 units, and the mean is 8.0 units. By using the standard deviation, we can fairly easily see that the data point 14 is more than one standard deviation away from the mean.
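The numbers in this example can be checked with a short sketch. It assumes the population standard deviation (NumPy's default, `ddof=0`), which is what reproduces 4.301; the names `z_scores` and `beyond_one_std` are illustrative only.

```python
import numpy as np

data = np.array([3, 5, 10, 14])

mean = np.mean(data)  # 8.0
std = np.std(data)    # population standard deviation, roughly 4.301

# Distance of each point from the mean, measured in standard deviations
z_scores = (data - mean) / std

# True where a point lies more than one standard deviation from the mean
beyond_one_std = np.abs(z_scores) > 1
print(mean, round(std, 3), beyond_one_std)
```

Under these assumptions both 14 and 3 lie more than one standard deviation from the mean, while 5 and 10 do not.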
We can calculate standard deviation in Python using the NumPy std() function.
A larger variance means the data is more spread out and values tend to be far away from the mean. A variance of 0 means all values in the dataset are the same.
Variance is a measure of spread. It is calculated by finding the average of the squared differences between every observation and the mean. The resulting value is in units squared.
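As a rough illustration of that definition, the sketch below computes the variance step by step and compares it with `np.var()`. It assumes the population variance (dividing by the number of observations), which is NumPy's default; the variable names are illustrative.

```python
import numpy as np

values = np.array([1, 3, 4, 2, 6, 3, 4, 5])

# Step 1: mean of the observations
mean = values.mean()

# Step 2: squared difference between every observation and the mean
squared_diffs = (values - mean) ** 2

# Step 3: average of the squared differences
variance_manual = squared_diffs.mean()

# NumPy's built-in calculation should agree with the manual one
variance_numpy = np.var(values)
print(variance_manual, variance_numpy)
```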
In Python, the pyplot.hist() function in the Matplotlib pyplot library can be used to plot a histogram. The function accepts a NumPy array, the range of the dataset, and the number of bins as input.
import numpy as np
from matplotlib import pyplot as plt

# numpy array
data_array = np.array([1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7])
# plot histogram
plt.hist(data_array, range=(1, 7), bins=7)
plt.show()
The mean, or average, of a dataset is calculated by adding all the values in the dataset and then dividing by the number of values in the set.
For example, for the dataset [1, 2, 3], the mean is (1 + 2 + 3) / 3 = 2.
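A minimal sketch of the same calculation in Python, done once by hand and once with NumPy's `mean()` function (the variable names are illustrative):

```python
import numpy as np

data = [1, 2, 3]

# Add all the values, then divide by the number of values
mean_manual = sum(data) / len(data)

# NumPy performs the same calculation
mean_numpy = np.mean(data)
print(mean_manual, mean_numpy)
```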
In a histogram, the range of the data is divided into sub-ranges represented by bins. The width of the bin is calculated by dividing the range of the dataset by the number of bins, giving each bin in a histogram the same width.
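The bin-width rule can be sketched as follows. The choice of three bins over the range (1, 7) is an assumption made for illustration; `np.histogram_bin_edges()` is used only to show where the equally wide bins start and end.

```python
import numpy as np

data_array = np.array([1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7])
data_range = (1, 7)  # assumed range for this sketch
num_bins = 3         # assumed number of bins for this sketch

# Width of each bin = range of the dataset / number of bins
bin_width = (data_range[1] - data_range[0]) / num_bins

# Edges of the equally wide bins over the same range
bin_edges = np.histogram_bin_edges(data_array, bins=num_bins, range=data_range)
print(bin_width, bin_edges)
```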
A histogram is a plot that displays the spread, or distribution, of a dataset. In a histogram, the data is split into intervals called bins. Each bin shows the number of data points that fall within that interval.
In a histogram, the bin count is the number of data points that fall within the bin’s range.
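Bin counts can be inspected without drawing anything by using `np.histogram()`, which returns the counts together with the bin edges. The sketch below assumes three bins over the range (1, 7); note that NumPy treats every bin as half-open except the last one, which also includes its right edge.

```python
import numpy as np

data_array = np.array([1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7])

# counts[i] is the number of data points that fall inside bin i
counts, edges = np.histogram(data_array, bins=3, range=(1, 7))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"bin from {left} to {right}: {count} values")
```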
Modality describes the number of peaks in a dataset. A unimodal distribution has one distinct peak in its histogram, which indicates the most frequent value.
A left-skewed dataset has a long left tail with one prominent peak to the right. The median of this dataset is greater than the mean of this dataset.
If a histogram has more than two peaks, then the dataset is referred to as multimodal.
A bimodal dataset has two distinct peaks. This typically happens when the dataset contains two different populations.
A uniform dataset does not have any distinct peaks.
As seen in the histogram below, uniform datasets have approximately the same number of values in each group represented by a bar - there is no obvious clustering.
In a histogram, if the prominent peak lies to the left with the tail extending to the right, then it is called a right-skewed dataset. In this case, the median is less than the mean of the dataset.
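The relationship between the mean and the median in skewed data can be checked on small made-up datasets. Both arrays below are hypothetical and chosen only for illustration; the left-skewed one is simply a mirror image of the right-skewed one.

```python
import numpy as np

# Hypothetical right-skewed data: most values are small, with a long right tail
right_skewed = np.array([1, 2, 2, 3, 3, 3, 4, 10, 20])

# Mirror the data to get a long left tail instead
left_skewed = 21 - right_skewed

# Right-skewed: the tail pulls the mean above the median
print(np.median(right_skewed), np.mean(right_skewed))

# Left-skewed: the tail pulls the mean below the median
print(np.median(left_skewed), np.mean(left_skewed))
```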
In a histogram, the distribution of the data is symmetric if it has one prominent peak and equal tails to the left and the right. The median and the mean of a symmetric dataset are similar.
An outlier is a data point that differs significantly from the rest of the values in a dataset.
For example, in the dataset [1, 2, 3, 4, 100] the value 100 is an outlier because it lies a large distance from the rest of the data.
Quantiles are the values, or points, that divide a dataset into groups of equal size. For example, in the figure, nine values split the dataset. Those nine values are quantiles.
# The value 5 is both the median and the 2-quantile
data = [1, 3, 5, 9, 20]
second_quantile = 5
The three dividing points (or quantiles) that split data into four equally sized groups are called quartiles. For example, in the figure, the three dividing points Q1, Q2, Q3 are quartiles.
# Even though d_2 has an outlier, the IQR is identical for the two datasets
d_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
d_2 = [-100, 2, 3, 4, 5, 6, 7, 8, 9]
In Python, the numpy.quantile() function takes an array and a number q between 0 and 1, and returns the value at the qth quantile. For example, numpy.quantile(data, 0.25) returns the value at the first quartile of the dataset data.
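A short sketch of `numpy.quantile()` on a small example dataset. The data values are illustrative, and the results rely on NumPy's default linear interpolation between data points.

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

q1 = np.quantile(data, 0.25)  # first quartile
q2 = np.quantile(data, 0.5)   # second quartile, which is the median
q3 = np.quantile(data, 0.75)  # third quartile
print(q1, q2, q3)
```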
If the number of quantiles is n, then the number of equally sized groups in a dataset is n+1.
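One way to see this rule in practice, as a sketch: compute n evenly spaced quantiles and count how many values land between consecutive cut points. The dataset of the integers 1 to 100 and the names `cut_points` and `group_sizes` are assumptions made for illustration.

```python
import numpy as np

data = np.arange(1, 101)  # hypothetical dataset: the integers 1 to 100
n = 3                     # number of quantiles (here, the three quartiles)

# Evenly spaced probabilities: 0.25, 0.5, 0.75
probabilities = np.arange(1, n + 1) / (n + 1)
cut_points = np.quantile(data, probabilities)

# Assign each value to the group it falls into, then count each group
group_ids = np.searchsorted(cut_points, data)
group_sizes = np.bincount(group_ids)
print(cut_points, group_sizes)
```

With n = 3 cut points, the values fall into n + 1 = 4 groups of equal size.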
The median is the divider between the upper and lower halves of a dataset. It is the 50th percentile, or 0.5 quantile, also known as the 2-quantile.
The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1). It can be mathematically represented as IQR = Q3 - Q1.
The interquartile range is considered a robust statistic because, unlike the average (or mean), it is not distorted by outliers.
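The robustness of the IQR can be illustrated with two small datasets that differ only in one extreme value. The helper `iqr()` below is a hypothetical function written for this sketch, built on `np.quantile()` with its default interpolation.

```python
import numpy as np

d_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
d_2 = [-100, 2, 3, 4, 5, 6, 7, 8, 9]  # same data, but with one outlier


def iqr(data):
    """Return Q3 - Q1 for a list of numbers (hypothetical helper)."""
    q1, q3 = np.quantile(data, [0.25, 0.75])
    return q3 - q1


# The IQR is unchanged by the outlier, while the mean shifts noticeably
print(iqr(d_1), iqr(d_2))
print(np.mean(d_1), np.mean(d_2))
```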
The box in the box plot displays the dataset’s median, first and third quartiles, and the interquartile range. The line in the center of the box shows the median, the edges show the first and third quartiles, and the interquartile range is visualized by the width of the box.
from matplotlib import pyplot

mul_datasets = [[3, 5, 7, 2], [2, 4, 10, 43]]
pyplot.boxplot(mul_datasets)
pyplot.show()
The two datasets can be analyzed visually by placing two box plots side by side. This allows easy comparison of median, first and third quartiles and the IQR of the datasets.
from matplotlib import pyplot

# dataset = list of numbers (example values)
dataset = [3, 5, 7, 2]
pyplot.boxplot(dataset)
In Python’s Matplotlib library, if multiple datasets are specified in function pyplot.boxplot(), then those datasets will be visualized as side by side box plots.
In a box plot, the data points that fall beyond the whiskers are called outliers. They are usually labeled with a dot or an asterisk.
A box plot’s whiskers are the lines that extend from the first and third quartiles toward the points farthest from the median. The upper whisker ends at the largest value in the dataset that is no more than 1.5 × IQR above the third quartile, and the lower whisker ends at the smallest value that is no more than 1.5 × IQR below the first quartile.
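The whisker rule can be worked through numerically. This is a sketch on a hypothetical dataset, using `np.quantile()` with default interpolation; the names `lower_fence` and `upper_fence` are illustrative labels for the 1.5 × IQR limits.

```python
import numpy as np

data = np.array([2, 3, 4, 5, 6, 7, 8, 9, 30])  # hypothetical dataset

q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1

# Limits located 1.5 * IQR beyond the first and third quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the limits
lower_whisker = data[data >= lower_fence].min()
upper_whisker = data[data <= upper_fence].max()

# Points beyond the whiskers are treated as outliers
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(lower_whisker, upper_whisker, outliers)
```

In this dataset the whiskers end at 2 and 9, and the value 30 falls beyond the upper limit.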
In Python’s Matplotlib library, the pyplot.boxplot() function takes a dataset as input and draws a box plot.