Machine Learning: A Concise Tutorial

Machine Learning - Data Distribution

In machine learning, data distribution refers to how the data points in a dataset are spread across the range of possible values. Understanding this distribution matters because it can significantly affect the performance of machine learning algorithms.

A data distribution can be summarized by several statistical measures, including the mean, median, mode, standard deviation, and variance. The first three describe the central tendency of the data, the last two describe its spread, and comparing them (for example, the mean against the median) gives a first hint about its shape.
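As a minimal sketch of these measures, the snippet below computes each of them for a small made-up dataset. The values in `data` are hypothetical and chosen only for illustration; NumPy's `var` and `std` are used with their default `ddof=0`, which gives the population (not sample) variance and standard deviation.

```python
import numpy as np
from statistics import mode

# Hypothetical toy dataset, chosen only for illustration
data = np.array([2, 3, 3, 4, 5, 5, 5, 6, 9])

mean = data.mean()                 # arithmetic average
median = np.median(data)           # middle value of the sorted data
most_common = mode(data.tolist())  # most frequently occurring value
variance = data.var()              # population variance (ddof=0)
std_dev = data.std()               # population standard deviation (ddof=0)

print(f"mean={mean:.3f}, median={median}, mode={most_common}")
print(f"variance={variance:.3f}, std={std_dev:.3f}")
```

Passing `ddof=1` to `var` and `std` would give the sample estimates instead, which is usually what is wanted when the data is a sample from a larger population.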

Some common types of data distribution encountered in machine learning are described below.

Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is widely used in machine learning and statistics. Its density is a bell-shaped curve that is symmetric about the mean. The normal distribution has two parameters: the mean (μ) and the standard deviation (σ).

In machine learning, the normal distribution is often used to model the error terms in linear regression and other statistical models. It is also the basis for many hypothesis tests and confidence intervals.

One important property of the normal distribution is the empirical rule, also known as the 68-95-99.7 rule. It states that approximately 68% of observations fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
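The empirical rule can be checked numerically. The sketch below, which assumes a standard normal distribution (mean 0, standard deviation 1), uses the cumulative distribution function from scipy.stats to compute the probability mass within one, two, and three standard deviations of the mean. The dictionary name `within` is just an illustrative choice.

```python
from scipy.stats import norm

# Probability that a standard normal value lies within k standard deviations of the mean
within = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within {k} sigma: {prob:.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```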

Python provides several libraries for working with normal distributions. One of them is scipy.stats, whose norm object provides the probability density function (PDF), the cumulative distribution function (CDF), the percent point function (PPF, the inverse of the CDF), and random variate generation for the normal distribution.
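A brief sketch of these four functions follows. The parameter values and variable names are assumptions made for illustration; `loc` and `scale` are the scipy.stats names for the mean and standard deviation.

```python
from scipy.stats import norm

mu, sigma = 0.0, 1.0  # illustrative parameters (standard normal)

density_at_mean = norm.pdf(mu, loc=mu, scale=sigma)  # height of the curve at the mean
prob_below_mean = norm.cdf(mu, loc=mu, scale=sigma)  # P(X <= mu)
q975 = norm.ppf(0.975, loc=mu, scale=sigma)          # value below which 97.5% of the mass lies
draws = norm.rvs(loc=mu, scale=sigma, size=5, random_state=0)  # five random variates

print(f"pdf(mu)   = {density_at_mean:.4f}")
print(f"cdf(mu)   = {prob_below_mean:.4f}")
print(f"ppf(0.975) = {q975:.4f}")
print("draws:", draws)
```

Because the PPF inverts the CDF, `norm.cdf(norm.ppf(p))` returns `p` again (up to floating-point error) for any probability `p` strictly between 0 and 1.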

Example

Here is an example that uses NumPy to draw a normally distributed sample and scipy.stats to compute the matching density, then plots both:

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Generate a random sample of 1000 values from a normal distribution
mu = 0 # Mean
sigma = 1 # Standard deviation
sample = np.random.normal(mu, sigma, 1000)

# Calculate the PDF for the normal distribution
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
pdf = norm.pdf(x, mu, sigma)
# Plot the histogram of the random sample and the PDF of the normal distribution
plt.figure(figsize=(7.5, 3.5))
plt.hist(sample, bins=30, density=True, alpha=0.5)
plt.plot(x, pdf)
plt.show()

In this example, we first use np.random.normal to draw a random sample of 1000 values from a normal distribution with mean 0 and standard deviation 1. We then use np.linspace to generate an array of 100 evenly spaced values between μ - 3σ and μ + 3σ, and norm.pdf to evaluate the PDF of the normal distribution at those values.

Finally, we plot the histogram of the random sample using plt.hist and overlay the PDF of the normal distribution using plt.plot. Passing density=True to plt.hist scales the histogram to unit area so that it is directly comparable with the PDF.

The resulting plot shows the bell-shaped curve of the normal distribution together with the histogram of the random sample, which approximates that curve.

Skewed Distribution

A skewed distribution is one that is not symmetric about its mean. In a skewed distribution, most of the data points cluster towards one end of the range, with a smaller number of points stretching out towards the other end.

There are two types of skewed distribution: left-skewed and right-skewed. A left-skewed distribution, also called negatively skewed, has a long tail on the left, with most data points on the right. A right-skewed distribution, also called positively skewed, has a long tail on the right, with most data points on the left.

Skewed distributions occur in many kinds of datasets, such as financial data, social media metrics, and healthcare records. In machine learning, it is important to identify and handle skewed distributions appropriately, because they can affect the performance of certain algorithms and models. For example, skewed data can lead to biased predictions and inaccurate results in some cases, and may call for preprocessing techniques such as normalization or a data transformation (for instance, taking logarithms) to improve the performance of the model.
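The sketch below shows one way to quantify skewness and to reduce it with a transformation. It assumes, purely for illustration, that the data is log-normally distributed, in which case a log transform makes it approximately normal; real data may need a different transformation. The sample skewness is computed with scipy.stats.skew, where positive values indicate a right skew and values near zero indicate symmetry.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Hypothetical right-skewed data: log-normal values (an assumption for illustration)
data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

skew_before = skew(data)

# Log transform; valid here because every value is strictly positive
transformed = np.log(data)
skew_after = skew(transformed)

print(f"skewness before transform: {skew_before:.2f}")
print(f"skewness after transform:  {skew_after:.2f}")
```

For data that contains zeros, `np.log1p` is a common alternative, since the plain logarithm is undefined at zero.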

Example

Here is an example of generating and plotting a skewed distribution using Python's NumPy and Matplotlib libraries. It draws 1000 values from a gamma distribution with shape 2 and scale 1, which is right-skewed:

import numpy as np
import matplotlib.pyplot as plt

# Generate a skewed distribution using NumPy's random function
data = np.random.gamma(2, 1, 1000)

# Plot a histogram of the data to visualize the distribution
plt.figure(figsize=(7.5, 3.5))
plt.hist(data, bins=30)

# Add labels and title to the plot
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Skewed Distribution')

# Show the plot
plt.show()

Running this code displays a histogram in which most values are concentrated at the low end, with a long tail extending to the right; that is, a right-skewed distribution.

Uniform Distribution

A uniform distribution is a probability distribution in which all possible outcomes are equally likely. In other words, every value in the range has the same probability of being observed, and data points do not cluster around any particular value.

The uniform distribution is often used as a baseline for comparison with other distributions, since it favors no value over any other. It is also useful in applications such as generating random numbers or selecting items from a set without bias.

In probability theory, the probability density function of a continuous uniform distribution on the interval [a, b] is defined as:

$$f\left ( x \right )=\begin{cases} \dfrac{1}{b-a} & \text{for } a\leq x\leq b \\ 0 & \text{otherwise} \end{cases}$$

where a and b are the minimum and maximum values of the distribution, respectively. The mean of a uniform distribution is $\frac{a+b}{2}$ and the variance is $\frac{\left ( b-a \right )^{2}}{12}$.
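The mean and variance formulas can be checked against a simulated sample. The following is a minimal sketch; the bounds `a = 2` and `b = 5` and the sample size are arbitrary values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

a, b = 2.0, 5.0  # hypothetical bounds, chosen for illustration
sample = rng.uniform(low=a, high=b, size=100_000)

theoretical_mean = (a + b) / 2       # (a + b) / 2
theoretical_var = (b - a) ** 2 / 12  # (b - a)^2 / 12

print(f"mean:     sample {sample.mean():.4f} vs theory {theoretical_mean:.4f}")
print(f"variance: sample {sample.var():.4f} vs theory {theoretical_var:.4f}")
```

With a sample this large, the empirical values should agree with the formulas to about two decimal places; the remaining gap is ordinary sampling noise.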

Example

In Python, the NumPy library provides numpy.random.uniform() for generating random numbers from a uniform distribution. It takes the lower and upper bounds of the distribution as its low and high arguments and draws values from the half-open interval [low, high), so it can be used to generate uniformly distributed datasets.

Here is an example of generating a uniformly distributed sample using Python's NumPy library and plotting it with Matplotlib:

import numpy as np
import matplotlib.pyplot as plt

# Generate 10,000 random numbers from a uniform distribution between 0 and 1
uniform_data = np.random.uniform(low=0, high=1, size=10000)

# Plot the histogram of the uniform data
plt.figure(figsize=(7.5, 3.5))
plt.hist(uniform_data, bins=50, density=True)

# Add labels and title to the plot
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Uniform Distribution')

# Show the plot
plt.show()

The resulting histogram is roughly flat: every bar has a height close to 1, which is the density of a uniform distribution on the interval from 0 to 1.

Bimodal Distribution

A bimodal distribution is a probability distribution with two distinct modes, or peaks. In other words, the distribution has two regions where data values are most likely to occur, separated by a valley where they are less likely to occur.

Bimodal distributions can arise in various types of data, such as biometric measurements, economic indicators, or social media metrics. The two peaks may reflect different subpopulations within the dataset, or different patterns of behavior or trends over time.

Bimodal distributions can be identified and analyzed using various statistical methods, such as histograms, kernel density estimation, or hypothesis testing. In some cases, bimodal data can be fitted with a mixture model, such as a Gaussian mixture model, which models each underlying subpopulation with its own component.
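As a rough sketch of the kernel density approach, the snippet below estimates the density of a synthetic bimodal sample with scipy.stats.gaussian_kde and locates its peaks with scipy.signal.find_peaks. The sample (two normal components centered at -2 and 2), the grid size, and the prominence threshold of 0.05 are all assumptions chosen for illustration; the threshold in particular would need tuning for real data.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Synthetic bimodal sample: two normal components (an assumption for illustration)
data = np.concatenate([rng.normal(-2, 1, 5000), rng.normal(2, 1, 5000)])

# Estimate the density on an evenly spaced grid covering the data
kde = gaussian_kde(data)
grid = np.linspace(data.min(), data.max(), 500)
density = kde(grid)

# Keep only pronounced peaks; the prominence threshold filters out small bumps in the tails
peaks, _ = find_peaks(density, prominence=0.05)
modes = grid[peaks]

print("number of modes found:", len(modes))
print("approximate mode locations:", np.round(modes, 2))
```

Finding two well-separated peaks in the estimated density suggests a bimodal shape, though the result depends on the bandwidth of the kernel density estimate, so it is better treated as a diagnostic than as a formal test.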

Example

In Python, libraries such as NumPy, SciPy, and Matplotlib can be combined to generate and visualize bimodal distributions.

For example, the following code builds a bimodal sample by concatenating draws from two normal distributions centered at -2 and 2, then plots its histogram:

import numpy as np
import matplotlib.pyplot as plt

# Generate 10,000 random numbers from a bimodal distribution
bimodal_data = np.concatenate((np.random.normal(loc=-2, scale=1, size=5000),
   np.random.normal(loc=2, scale=1, size=5000)))

# Plot the histogram of the bimodal data
plt.figure(figsize=(7.5, 3.5))
plt.hist(bimodal_data, bins=50, density=True)

# Add labels and title to the plot
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Bimodal Distribution')

# Show the plot
plt.show()

Running this code displays a histogram with two peaks, one near -2 and one near 2, separated by a valley around 0.