Machine Learning: A Concise Tutorial
Machine Learning - Mean, Median, Mode
Mean, median, and mode are statistical measures that describe the central tendency of a dataset. In machine learning, they are used to summarize how the data is distributed and to help spot outliers, for example when the mean and median of a feature differ sharply. This chapter covers each measure and how to compute it in Python.
Mean
The mean is the average value of a dataset. It is calculated by adding up all the values in the dataset and dividing by the number of observations. The mean uses every observation, which makes it a useful summary, but it is also sensitive to outliers: a single extreme value can shift it significantly.
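As a minimal sketch of the definition above, the mean can be computed with plain Python. The helper name `mean` and the sample values are hypothetical, chosen only to show how one extreme value moves the result.

```python
# Hypothetical sample values, used only for illustration.
values = [10, 12, 11, 13, 14]


def mean(data):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(data) / len(data)


mean_without_outlier = mean(values)        # 60 / 5 = 12.0
mean_with_outlier = mean(values + [200])   # 260 / 6, roughly 43.33

print('Mean without outlier:', mean_without_outlier)
print('Mean with outlier:', mean_with_outlier)
```

Adding the single value 200 more than triples the mean, even though the other five values are unchanged.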
In Python, we can calculate the mean with the NumPy library, which provides the mean() function.
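A short sketch of np.mean() on a small array follows. The array `scores` is made-up data; the `axis` argument selects whether the mean is taken over the whole array, per column, or per row.

```python
import numpy as np

# Hypothetical data: two rows (students) by three columns (tests).
scores = np.array([[80, 90, 100],
                   [60, 70, 80]])

overall_mean = np.mean(scores)           # mean of all six values
column_means = np.mean(scores, axis=0)   # one mean per column
row_means = np.mean(scores, axis=1)      # one mean per row

print('Overall mean:', overall_mean)
print('Column means:', column_means)
print('Row means:', row_means)
```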
Median
The median is the middle value in a dataset. It is found by sorting the values and taking the one in the middle. If the dataset has an even number of values, the median is the average of the two middle values.
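The procedure just described can be sketched in plain Python. The helper name `median` and the inputs are hypothetical; the two calls cover the odd and even cases.

```python
def median(data):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle element exists.
        return ordered[mid]
    # Even count: average the two elements around the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([7, 1, 5])       # sorted: [1, 5, 7]
even_median = median([7, 1, 5, 3])   # sorted: [1, 3, 5, 7]

print('Median (odd count):', odd_median)
print('Median (even count):', even_median)
```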
The median is a useful measure of central tendency because it is robust to outliers: extreme values have little effect on it.
In Python, we can calculate the median with the NumPy library, which provides the median() function.
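The sketch below uses np.median() on made-up values and compares it with np.mean() before and after appending one extreme value, to illustrate the robustness described above.

```python
import numpy as np

# Hypothetical sample values, used only for illustration.
values = np.array([10, 12, 11, 13, 14])
with_outlier = np.append(values, 200)   # add one extreme value

median_before = np.median(values)
median_after = np.median(with_outlier)
mean_before = np.mean(values)
mean_after = np.mean(with_outlier)

print('Median before/after:', median_before, median_after)
print('Mean before/after:', mean_before, mean_after)
```

Here the median only moves from 12.0 to 12.5, while the mean jumps from 12.0 to about 43.3.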
Mode
The mode is the most common value in a dataset, that is, the value that occurs most frequently. If two, three, or more values are tied for the highest frequency, the dataset is said to be bimodal, trimodal, or multimodal, respectively.
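To make the tie case concrete, here is a minimal sketch that returns every value tied for the highest frequency. It uses collections.Counter from the standard library; the helper name `modes` and the inputs are hypothetical.

```python
from collections import Counter


def modes(data):
    """Return all values tied for the highest frequency, in ascending order."""
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


unimodal = modes([1, 2, 2, 3])      # only 2 appears twice
bimodal = modes([1, 1, 2, 2, 3])    # 1 and 2 both appear twice

print('Unimodal:', unimodal)
print('Bimodal:', bimodal)
```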
The mode is a useful measure of central tendency because it identifies the most common value in a dataset, and it is the only one of the three measures that also applies to categorical data. However, it is not a good measure for datasets with a wide range of values or with no repeated values, where no single value stands out.
In Python, we can calculate the mode with the SciPy library, whose scipy.stats module provides the mode() function.
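A brief sketch of scipy.stats.mode() on made-up values follows. The shape of the returned fields has differed between SciPy releases (arrays in older versions, scalars in newer ones), so this example wraps them in np.atleast_1d() as a defensive assumption rather than relying on one version's behavior.

```python
import numpy as np
from scipy import stats

# Hypothetical sample values, used only for illustration.
values = np.array([3, 1, 2, 2, 3, 2])

result = stats.mode(values)

# np.atleast_1d handles both scalar and array results across SciPy versions.
mode_value = np.atleast_1d(result.mode)[0]
mode_count = np.atleast_1d(result.count)[0]

print('Mode:', mode_value)
print('Occurrences:', mode_count)
```

Note that when several values are tied, scipy.stats.mode() reports only the smallest of them.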
Python Implementation
Let's see an example of calculating the mean, median, and mode for a salary table in Python using NumPy and Pandas:
import numpy as np
import pandas as pd

# create a sample salary table
salary = pd.DataFrame({
    'employee_id': ['001', '002', '003', '004', '005', '006', '007',
                    '008', '009', '010'],
    'salary': [50000, 65000, 55000, 45000, 70000, 60000, 55000, 45000,
               80000, 70000]
})

# calculate mean: the sum of all salaries divided by their count
mean_salary = np.mean(salary['salary'])
print('Mean salary:', mean_salary)

# calculate median: the middle value of the sorted salaries
median_salary = np.median(salary['salary'])
print('Median salary:', median_salary)

# calculate mode: Series.mode() returns all tied modes in ascending order,
# so [0] selects the smallest of them
mode_salary = salary['salary'].mode()[0]
print('Mode salary:', mode_salary)
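In the salary table above, three salaries each occur twice, so taking only the first element of mode() hides the ties. As a hedged follow-up sketch, the snippet below rebuilds the same salary values as a Series (the variable names are illustrative) and lists every mode along with the frequency counts.

```python
import pandas as pd

# Same salary values as in the example table.
salaries = pd.Series([50000, 65000, 55000, 45000, 70000,
                      60000, 55000, 45000, 80000, 70000])

# Series.mode() returns every value tied for the highest frequency.
all_modes = salaries.mode().tolist()

# value_counts() shows how often each salary occurs.
frequencies = salaries.value_counts()

print('All modes:', all_modes)
print('Highest frequency:', frequencies.max())
print('Mean:', salaries.mean())
print('Median:', salaries.median())
```

For this data the modes are 45000, 55000, and 70000, which makes the salary column trimodal; the mean is 59500.0 and the median is 57500.0.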