Python Pandas: A Concise Tutorial
Python Pandas - Statistical Functions
Statistical methods help in understanding and analyzing the behavior of data. This section covers a few statistical functions that can be applied to Pandas objects.
Percent Change
Series and DataFrame objects (as well as Panel, which has been removed from recent pandas versions) all provide the function pct_change(). This function compares every element with the element before it and computes the fractional change between them, so a result of 0.5 means an increase of 50%.
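import pandas as pd
import numpy as np
# Percent change between consecutive elements of a Series
s = pd.Series([1, 2, 3, 4, 5, 4])
print(s.pct_change())
# Percent change down each column of a DataFrame of random values
df = pd.DataFrame(np.random.randn(5, 2))
print(df.pct_change())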
Its output is as follows (the DataFrame values are random, so they will differ from run to run):
0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64
           0         1
0        NaN       NaN
1 -15.151902  0.174730
2  -0.746374 -1.449088
3  -3.582229 -3.165836
4  15.601150 -1.860434
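To make the computation concrete, here is a minimal sketch that reproduces the Series result by hand. It assumes the usual definition of percent change, (current - previous) / previous. The variable name `manual` is introduced here purely for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 4])

# Percent change by hand: divide each element by the previous one and subtract 1.
# shift(1) moves every value down one position, so the first result is NaN.
manual = s / s.shift(1) - 1

print(manual)
print(s.pct_change())
```

Both print statements should show the same values. The first element is NaN because it has no prior element to compare against.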
By default, pct_change() operates down each column; to apply it across each row instead, pass the argument axis=1.
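The difference between the two directions is easier to see on fixed numbers than on random ones. The sketch below uses a small hypothetical frame (the column names `q1`, `q2`, `q3` and their values are made up for illustration) and assumes that pct_change() accepts `axis` as a keyword argument, as the text above describes.

```python
import pandas as pd

# A small fixed frame (hypothetical values) so the result is easy to check by eye.
df = pd.DataFrame({
    "q1": [100.0, 200.0],
    "q2": [110.0, 150.0],
    "q3": [121.0, 300.0],
})

# Default (axis=0): each value is compared with the value in the row above it.
by_column = df.pct_change()

# axis=1: each value is compared with the value in the column to its left.
by_row = df.pct_change(axis=1)

print(by_column)
print(by_row)
```

With axis=1 the first column is all NaN, because it has no column to its left, just as the first row is all NaN in the default case.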
Covariance
Covariance is computed on series data. The Series object has a method, cov, that computes the covariance between two Series objects. Missing values (NA) are excluded automatically.
Cov Series
import pandas as pd
import numpy as np
# Two Series of 10 random values each
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
# Covariance between the two Series
print(s1.cov(s2))
Its output is as follows (the value differs from run to run because the data is random):
-0.12978405324
When applied to a DataFrame, the cov method computes the pairwise covariance between all columns.
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
# Covariance between two individual columns
print(frame['a'].cov(frame['b']))
# Covariance matrix for all pairs of columns
print(frame.cov())
Its output is as follows (the values differ from run to run because the data is random):
-0.58312921152741437
          a         b         c         d         e
a  1.780628 -0.583129 -0.185575  0.003679 -0.136558
b -0.583129  1.297011  0.136530 -0.523719  0.251064
c -0.185575  0.136530  0.915227 -0.053881 -0.058926
d  0.003679 -0.523719 -0.053881  1.521426 -0.487694
e -0.136558  0.251064 -0.058926 -0.487694  0.960761
Note: The covariance between columns a and b printed by the first statement is the same value that cov returns for that pair of columns in the DataFrame result.
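Both points above, the automatic exclusion of NA values and the agreement between the Series and DataFrame results, can be checked on small fixed data. This is a minimal sketch with made-up values; the names `cov_with_na`, `cov_manual`, and `complete` are illustrative only.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
s2 = pd.Series([2.0, 1.0, 3.0, np.nan, 6.0])

# cov() uses only the positions where both Series have a value (here 0, 1 and 4).
cov_with_na = s1.cov(s2)

# Dropping the incomplete pairs by hand gives the same number.
# Unnamed Series become columns 0 and 1 after concat.
complete = pd.concat([s1, s2], axis=1).dropna()
cov_manual = complete[0].cov(complete[1])
print(cov_with_na, cov_manual)

# The matrix entry for (a, b) equals the covariance of the two columns taken alone.
frame = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 9.0]})
print(frame["a"].cov(frame["b"]))
print(frame.cov().loc["a", "b"])
```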
Correlation
Correlation measures the strength of the relationship between any two arrays of values (Series). Several methods are available for computing it: pearson (the default, which measures the linear relationship), and spearman and kendall (both of which are based on ranks).
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
# Correlation between two individual columns
print(frame['a'].corr(frame['b']))
# Correlation matrix for all pairs of columns
print(frame.corr())
Its output is as follows (the values differ from run to run because the data is random):
-0.383712785514
          a         b         c         d         e
a  1.000000 -0.383713 -0.145368  0.002235 -0.104405
b -0.383713  1.000000  0.125311 -0.372821  0.224908
c -0.145368  0.125311  1.000000 -0.045661 -0.062840
d  0.002235 -0.372821 -0.045661  1.000000 -0.403380
e -0.104405  0.224908 -0.062840 -0.403380  1.000000
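In older pandas versions, any non-numeric column present in the DataFrame is excluded automatically. Since pandas 2.0, numeric_only defaults to False, so non-numeric columns must be dropped beforehand or excluded explicitly with numeric_only=True.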
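The following sketch compares the three methods on data where the relationship is perfectly monotonic but not linear, and shows a non-numeric column being excluded. The data and column names are hypothetical. The example assumes a pandas version in which DataFrame.corr accepts `numeric_only` (1.5 or later), and it goes through DataFrame.corr rather than Series.corr for all three methods.

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly (y = x ** 3).
frame = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [1.0, 8.0, 27.0, 64.0, 125.0],
    "label": list("abcde"),  # non-numeric column
})

# numeric_only=True leaves the "label" column out of the calculation.
pearson = frame.corr(method="pearson", numeric_only=True)
spearman = frame.corr(method="spearman", numeric_only=True)
kendall = frame.corr(method="kendall", numeric_only=True)

# Pearson measures linear association, so it is below 1 for this curved data.
print(pearson.loc["x", "y"])
# Spearman and Kendall compare ranks, so a perfectly monotonic relationship scores 1.
print(spearman.loc["x", "y"])
print(kendall.loc["x", "y"])
```

Choosing a rank-based method is one way to detect a relationship that is consistent in direction but not a straight line.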
Data Ranking
Data ranking assigns a rank to each element in an array. By default, tied elements are each assigned the mean of the ranks they would otherwise occupy.
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b'] # so there's a tie
# Rank the values; the tied elements 'b' and 'd' share the mean rank
print(s.rank())
Its output is as follows (the ranks differ from run to run because the data is random, but b and d always share a rank):
a    1.0
b    3.5
c    2.0
d    3.5
e    5.0
dtype: float64
rank() optionally takes a parameter, ascending, which is True by default; when it is False, the data is ranked in reverse order, with larger values assigned smaller ranks.
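A short sketch of the ascending parameter on fixed values follows. The data is made up for illustration, and the variable names are not part of the pandas API.

```python
import pandas as pd

s = pd.Series([10, 30, 20, 30, 50], index=list("abcde"))

# Default (ascending=True): the smallest value gets rank 1.
ascending_ranks = s.rank()

# ascending=False: the largest value gets rank 1.
descending_ranks = s.rank(ascending=False)

print(ascending_ranks)
print(descending_ranks)
```

In both directions the tied values at b and d share the mean of the two ranks they occupy: 3.5 when ascending and 2.5 when descending.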
rank() supports different tie-breaking methods, specified with the method parameter:
- average: average rank of the tied group (the default)
- min: lowest rank in the group
- max: highest rank in the group
- first: ranks assigned in the order the values appear in the array
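The four tie-breaking methods listed above can be compared side by side. This is a minimal sketch on made-up data with one tie (b and d); the `ranks` frame and its `value` column are illustrative names.

```python
import pandas as pd

s = pd.Series([10, 30, 20, 30, 50], index=list("abcde"))

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame({
    method: s.rank(method=method)
    for method in ["average", "min", "max", "first"]
})

# Put the original values first so each row reads: value, then its ranks.
ranks.insert(0, "value", s)
print(ranks)
```

Only the rows for b and d differ between the columns: they receive 3.5 and 3.5 with average, 3 and 3 with min, 4 and 4 with max, and 3 and 4 with first.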