SciPy Concise Tutorial

SciPy - Stats

All of the statistics functions are located in the sub-package scipy.stats, and a fairly complete listing of them can be obtained with the info(stats) function. A list of the available random variables can also be obtained from the docstring of the stats sub-package (for example, help(stats) prints that docstring). The module contains a large number of probability distributions as well as a growing library of statistical functions.

Each univariate distribution is built on one of the base classes described in the following table:

| Sr. No. | Class | Description |
|---|---|---|
| 1 | rv_continuous | A generic continuous random variable class meant for subclassing |
| 2 | rv_discrete | A generic discrete random variable class meant for subclassing |
| 3 | rv_histogram | Generates a distribution given by a histogram |
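As a rough illustration of the third entry, the sketch below builds an rv_histogram object from a NumPy histogram. The data, the seed and the number of bins are assumptions chosen for this example and do not come from the text above.

```python
import numpy as np
from scipy.stats import rv_histogram

# Illustrative data: 1000 draws from a standard normal, seeded for repeatability.
rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# np.histogram returns (counts, bin_edges), which rv_histogram accepts directly.
hist = np.histogram(data, bins=20)
hist_dist = rv_histogram(hist)

# The resulting object offers the usual distribution methods.
lower_edge, upper_edge = hist[1][0], hist[1][-1]
print(hist_dist.cdf(lower_edge))   # CDF at the left-most bin edge
print(hist_dist.cdf(upper_edge))   # CDF at the right-most bin edge
print(hist_dist.mean())            # mean of the histogram-based distribution
```

The object behaves like any other continuous distribution, so methods such as pdf, ppf and rvs are available as well.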

Normal Continuous Random Variable

A continuous random variable is a random variable X that can take any value within a continuous range. For the normal distribution, the location (loc) keyword specifies the mean and the scale (scale) keyword specifies the standard deviation.

As an instance of a subclass of rv_continuous, the norm object inherits a collection of generic methods from it and supplements them with details specific to the normal distribution.
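To make the role of loc and scale concrete, here is a minimal sketch. The values of mu and sigma are arbitrary and chosen only for illustration.

```python
from scipy.stats import norm

# Illustrative parameters (not taken from the tutorial text).
mu, sigma = 10.0, 2.0

print(norm.mean(loc=mu, scale=sigma))     # the mean equals loc
print(norm.std(loc=mu, scale=sigma))      # the standard deviation equals scale
print(norm.pdf(mu, loc=mu, scale=sigma))  # density evaluated at the mean

# "Freezing" the distribution fixes loc and scale so they need not be repeated.
frozen = norm(loc=mu, scale=sigma)
print(frozen.cdf(mu))                     # half of the probability mass lies below the mean
```

A frozen distribution is convenient when the same parameters are reused across several method calls.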

To compute the CDF at a number of points, we can pass a list or a NumPy array. Let us consider the following example.

from scipy.stats import norm
import numpy as np
# Evaluate the standard normal CDF at several points in one call
print(norm.cdf(np.array([1, -1., 0, 1, 3, 4, -2, 6])))

The above program will generate the following output.

array([ 0.84134475, 0.15865525, 0.5 , 0.84134475, 0.9986501 ,
0.99996833, 0.02275013, 1. ])

To find the median of a distribution, we can use the Percent Point Function (PPF), which is the inverse of the CDF. The median is the point at which the CDF equals 0.5. Let us understand this with the following example.

from scipy.stats import norm
# The median is the 0.5 quantile
print(norm.ppf(0.5))

The above program will generate the following output.

0.0
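The inverse relationship between PPF and CDF can be checked with a short round trip. This is only a sketch, and the probabilities in q are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

q = np.array([0.025, 0.5, 0.975])   # illustrative probabilities
x = norm.ppf(q)                      # corresponding quantiles of the standard normal
print(x)
print(norm.cdf(x))                   # applying the CDF recovers q, up to rounding
```

The outer quantiles are symmetric around zero, which reflects the symmetry of the standard normal distribution.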

To generate a sequence of random variates, we use the size keyword argument, as shown in the following example.

from scipy.stats import norm
# Draw five variates from the standard normal distribution
print(norm.rvs(size = 5))

The above program will generate the following output.

array([ 0.20929928, -1.91049255, 0.41264672, -0.7135557 , -0.03833048])

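The above output is not reproducible, because each call draws new random numbers. To generate the same random numbers on every run, fix the seed, for example by passing the random_state keyword argument to rvs or by calling np.random.seed beforehand.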
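One way to obtain repeatable draws is the random_state keyword argument of rvs. The sketch below uses arbitrary seed values picked for illustration.

```python
import numpy as np
from scipy.stats import norm

# The same seed yields the same variates.
a = norm.rvs(size=5, random_state=12345)
b = norm.rvs(size=5, random_state=12345)
print(np.array_equal(a, b))   # True

# A different seed yields a different sequence.
c = norm.rvs(size=5, random_state=54321)
print(np.array_equal(a, c))
```

Passing the seed per call keeps the example independent of any global random state set elsewhere in a program.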

Uniform Distribution

The uniform object represents a uniform distribution on the interval [loc, loc + scale]. Let us consider the following example.

from scipy.stats import uniform
# CDF of the uniform distribution on [1, 5] (loc = 1, scale = 4)
print(uniform.cdf([0, 1, 2, 3, 4, 5], loc = 1, scale = 4))

The above program will generate the following output.

array([ 0. , 0. , 0.25, 0.5 , 0.75, 1. ])
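The values above follow from the fact that loc = 1 and scale = 4 place the distribution on the interval from 1 to 5. A short sketch, reusing those same parameters, makes this visible.

```python
from scipy.stats import uniform

# Same parameters as above: loc = 1, scale = 4.
dist = uniform(loc=1, scale=4)

print(dist.ppf([0, 1]))   # end points of the interval: loc and loc + scale
print(dist.mean())        # midpoint of the interval
print(dist.pdf(3))        # constant density 1 / scale inside the interval
print(dist.pdf(6))        # zero density outside the interval
```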

Build Discrete Distribution

Let us generate a random sample and compare the observed frequencies with the probabilities.
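A minimal sketch of this comparison is given below. It assumes a custom distribution built with rv_discrete from a table of values and probabilities. The support xk, the probabilities pk, the sample size and the seed are all illustrative choices.

```python
import numpy as np
from scipy.stats import rv_discrete

# Hypothetical distribution over the values 0..5; pk must sum to 1.
xk = np.arange(6)
pk = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])
custom = rv_discrete(name="custom", values=(xk, pk))

# Draw a seeded sample and count how often each value occurs.
sample = custom.rvs(size=10000, random_state=0).astype(int)
observed = np.bincount(sample, minlength=len(xk)) / sample.size

# Print value, observed relative frequency and theoretical probability.
for value, freq, prob in zip(xk, observed, pk):
    print(value, round(float(freq), 3), prob)
```

With a sample this large, the observed frequencies should lie close to the specified probabilities, although they will not match exactly.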

Binomial Distribution

As an instance of a subclass of rv_discrete, the binom object inherits a collection of generic methods from it and supplements them with details specific to the binomial distribution. Let us consider the following example.

from scipy.stats import binom
# CDF of a binomial distribution; n = 5 trials and p = 0.5 are illustrative values
print(binom.cdf([0, 1, 2, 3, 4, 5], 5, 0.5))

The above program will generate the following output.

[0.03125 0.1875  0.5     0.8125  0.96875 1.     ]
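For the binomial distribution itself, a short sketch of a few of the generic methods provided by binom might look as follows. The parameters n and p are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy.stats import binom

n, p = 10, 0.3            # illustrative number of trials and success probability
k = np.arange(n + 1)      # all possible numbers of successes

pmf = binom.pmf(k, n, p)
print(pmf.sum())              # probabilities over all outcomes sum to 1, up to rounding
print(binom.mean(n, p))       # equals n * p
print(binom.var(n, p))        # equals n * p * (1 - p)
print(binom.cdf(3, n, p))     # P(X <= 3), the sum of the first four pmf values
```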

Descriptive Statistics

Basic statistics such as the minimum, maximum, mean and variance take a NumPy array as input and return the respective results. A few basic statistical functions available in the scipy.stats package are described in the following table.

| Sr. No. | Function | Description |
|---|---|---|
| 1 | describe() | Computes several descriptive statistics of the passed array |
| 2 | gmean() | Computes the geometric mean along the specified axis |
| 3 | hmean() | Calculates the harmonic mean along the specified axis |
| 4 | kurtosis() | Computes the kurtosis |
| 5 | mode() | Returns the modal value |
| 6 | skew() | Computes the sample skewness of the data |
| 7 | f_oneway() | Performs a one-way ANOVA |
| 8 | iqr() | Computes the interquartile range of the data along the specified axis |
| 9 | zscore() | Calculates the z-score of each value in the sample, relative to the sample mean and standard deviation |
| 10 | sem() | Calculates the standard error of the mean (or standard error of measurement) of the values in the input array |

Several of these functions have a similar version in scipy.stats.mstats that works on masked arrays. The example below computes the most basic statistics directly with NumPy array methods.

import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
# Maximum, minimum, mean and (population) variance via NumPy array methods
print(x.max(), x.min(), x.mean(), x.var())

The above program will generate the following output.

9 1 5.0 6.666666666666667
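The same quantities can also be obtained from scipy.stats. The sketch below uses describe() on the array from the example and, as an assumed scenario, a masked array in which a hypothetical placeholder value of -999 marks a missing reading.

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# describe() returns several statistics at once.
summary = stats.describe(x)
print(summary.nobs, summary.minmax, summary.mean)
print(summary.variance)   # uses ddof=1 by default, so it differs from x.var()

# Masked array: the hypothetical placeholder -999 is excluded from the statistics.
y = np.ma.masked_values([1.0, 2.0, 4.0, -999.0], -999.0)
masked_summary = mstats.describe(y)
print(masked_summary.nobs, masked_summary.mean)
```

Note the different variance conventions: describe() divides by n - 1 by default, whereas the NumPy var() method divides by n.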

T-test

Let us look at how the T-test is used in SciPy.

ttest_1samp

Calculates the T-test for the mean of one group of scores. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations a is equal to the given population mean, popmean. Let us consider the following example.

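from scipy import stats
# Two columns of 50 normal variates each (mean 5, standard deviation 10)
rvs = stats.norm.rvs(loc = 5, scale = 10, size = (50, 2))
# One-sample T-test of each column against popmean = 5.0
print(stats.ttest_1samp(rvs, 5.0))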

The above program will generate the following output.

Ttest_1sampResult(statistic = array([-1.40184894, 2.70158009]),
pvalue = array([ 0.16726344, 0.00945234]))
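To see what ttest_1samp computes, the following sketch reproduces the statistic and the two-sided p-value by hand for a single seeded sample. The seed and the choice of popmean = 15.0 (deliberately far from the true mean of 5) are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Seeded sample so that the numbers are repeatable.
sample = stats.norm.rvs(loc=5, scale=10, size=50, random_state=42)
popmean = 15.0   # deliberately far from the true mean of 5

res = stats.ttest_1samp(sample, popmean)

# t = (sample mean - popmean) / (sample standard deviation with ddof=1 / sqrt(n))
t_manual = (sample.mean() - popmean) / (sample.std(ddof=1) / np.sqrt(sample.size))
# Two-sided p-value from the Student t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=sample.size - 1)

print(res.statistic, t_manual)
print(res.pvalue, p_manual)
```

A small p-value, as expected here, indicates that the data are unlikely under the null hypothesis that the population mean equals popmean.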

Comparing two samples

In the following examples, there are two samples, which can come either from the same distribution or from different distributions, and we want to test whether these samples have the same statistical properties.

ttest_ind: Calculates the T-test for the means of two independent samples of scores. This is a two-sided test for the null hypothesis that two independent samples have identical average (expected) values. By default, this test assumes that the populations have identical variances.

We can use this test if we observe two independent samples from the same population or from different populations. Let us consider the following example.

from scipy import stats
# Two independent samples drawn from the same normal distribution
rvs1 = stats.norm.rvs(loc = 5, scale = 10, size = 500)
rvs2 = stats.norm.rvs(loc = 5, scale = 10, size = 500)
print(stats.ttest_ind(rvs1, rvs2))

The above program will generate the following output.

Ttest_indResult(statistic = -0.67406312233650278, pvalue = 0.50042727502272966)

You can repeat the test with a new array of the same length but a different mean: pass a different value for loc and run the test again.
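A sketch of that experiment is shown below. The shifted means, the larger scale and the seeds are illustrative choices. The second call passes equal_var=False, which switches ttest_ind to Welch's test and is one possible option when the equal-variance assumption is doubtful.

```python
from scipy import stats

# Seeded samples so that the results are repeatable.
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=1)
rvs2 = stats.norm.rvs(loc=10, scale=10, size=500, random_state=2)   # shifted mean
rvs3 = stats.norm.rvs(loc=12, scale=20, size=500, random_state=3)   # shifted mean, larger variance

# Default test, which assumes equal population variances.
res_pooled = stats.ttest_ind(rvs1, rvs2)
# Welch's test, which does not assume equal variances.
res_welch = stats.ttest_ind(rvs1, rvs3, equal_var=False)

print(res_pooled.statistic, res_pooled.pvalue)
print(res_welch.statistic, res_welch.pvalue)
```

With means this far apart, both p-values should be small, which leads to rejecting the null hypothesis of equal means.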