Python Pandas Concise Tutorial
Python Pandas - Descriptive Statistics
A large number of methods compute descriptive statistics and related operations on a DataFrame. Most of these are aggregations such as sum() and mean(), but some, such as cumsum(), produce an object of the same size as the input. Generally, these methods take an axis argument, just like ndarray.{sum, std, …}, but the axis can be specified by name or by integer:
- DataFrame: “index” (axis=0, default), “columns” (axis=1)
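As a minimal sketch of the two ways to name an axis, the snippet below sums a small, purely numeric frame along each axis. The `scores` frame and its column names are hypothetical and are not part of this chapter's data set.

```python
import pandas as pd

# Hypothetical numeric data, used only to illustrate the axis argument.
scores = pd.DataFrame({'Math': [80, 90, 70], 'Art': [60, 75, 85]})

# Column-wise totals: axis=0 and axis='index' are equivalent.
by_index_int = scores.sum(axis=0)
by_index_name = scores.sum(axis='index')

# Row-wise totals: axis=1 and axis='columns' are equivalent.
by_columns_int = scores.sum(axis=1)
by_columns_name = scores.sum(axis='columns')

print(by_index_name)
print(by_columns_name)
```

Using the axis name tends to read more clearly than the integer, since the name states which axis the aggregation runs along.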
Let us create a DataFrame and use it for all the operations in this chapter.
Example
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print(df)
Its output is as follows:
Age Name Rating
0 25 Tom 4.23
1 26 James 3.24
2 25 Ricky 3.98
3 23 Vin 2.56
4 30 Steve 3.20
5 29 Smith 4.60
6 23 Jack 3.80
7 34 Lee 3.78
8 40 David 2.98
9 30 Gasper 4.80
10 51 Betina 4.10
11 46 Andres 3.65
sum()
Returns the sum of the values along the requested axis. By default, the axis is the index (axis=0).
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print(df.sum())
Its output is as follows:
Age 382
Name TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Rating 44.92
dtype: object
Each column is summed individually (strings are concatenated).
axis=1
With axis=1, the numeric values are summed across the columns of each row, which gives the output shown below.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
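# numeric_only=True skips the string column 'Name'; recent pandas versions may raise an error without it
print(df.sum(axis=1, numeric_only=True))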
Its output is as follows:
0 29.23
1 29.24
2 28.98
3 25.56
4 33.20
5 33.60
6 26.80
7 37.78
8 42.98
9 34.80
10 55.10
11 49.65
dtype: float64
mean()
Returns the mean (average) of the values along the requested axis.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
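# numeric_only=True skips the string column 'Name'; recent pandas versions may raise an error without it
print(df.mean(numeric_only=True))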
Its output is as follows:
Age 31.833333
Rating 3.743333
dtype: float64
std()
Returns the sample standard deviation of the numeric columns, using Bessel's correction (normalized by N-1).
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
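# numeric_only=True skips the string column 'Name'; recent pandas versions may raise an error without it
print(df.std(numeric_only=True))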
Its output is as follows:
Age 9.232682
Rating 0.661628
dtype: float64
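To make Bessel's correction concrete, here is a small sketch that compares the pandas default (ddof=1) with the population standard deviation (ddof=0), which is what NumPy's np.std computes by default. It reuses the Age values from this chapter; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46])

sample_std = ages.std()               # pandas default: ddof=1, divide by N - 1
population_std = ages.std(ddof=0)     # divide by N instead
numpy_std = np.std(ages.to_numpy())   # NumPy default: ddof=0

# The same sample value computed by hand, to show what ddof=1 means.
manual_std = np.sqrt(((ages - ages.mean()) ** 2).sum() / (len(ages) - 1))

print(round(sample_std, 6))
print(round(population_std, 6))
```

The sample value matches the Age entry printed by df.std() above, while the ddof=0 value is slightly smaller because it divides by N rather than N - 1.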
Functions & Description
Let us now look at the functions available for descriptive statistics in Python Pandas. The following table lists the important functions:
Sr.No. | Function | Description
1 | count() | Number of non-null observations
2 | sum() | Sum of values
3 | mean() | Mean of values
4 | median() | Median of values
5 | mode() | Mode of values
6 | std() | Standard deviation of the values
7 | min() | Minimum value
8 | max() | Maximum value
9 | abs() | Absolute value
10 | prod() | Product of values
11 | cumsum() | Cumulative sum
12 | cumprod() | Cumulative product
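Several of the functions in the table are not demonstrated elsewhere in this chapter. The following is a minimal sketch that applies a few of them to the numeric columns of the chapter's data; the name `numeric_df` is chosen here for illustration.

```python
import pandas as pd

# Numeric columns only, taken from the chapter's data set.
numeric_df = pd.DataFrame({
    'Age': [25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46],
    'Rating': [4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65],
})

counts = numeric_df.count()            # non-null observations per column
medians = numeric_df.median()          # middle value per column
minimums = numeric_df.min()
maximums = numeric_df.max()
age_modes = numeric_df['Age'].mode()   # may hold several values when there are ties
age_running_total = numeric_df['Age'].cumsum()  # same length as the input

print(counts)
print(medians)
print(age_modes)
print(age_running_total)
```

Note that count(), median(), min() and max() reduce each column to one value, whereas cumsum() returns one value per row and mode() returns every tied value.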
Note: Since a DataFrame is a heterogeneous data structure, generic operations do not work with all functions.
- Functions like sum() and cumsum() work with both numeric and character (string) data without raising an error. Although character aggregations are rarely used in practice, these functions do not throw an exception.
- Functions like abs() and cumprod() throw an exception when the DataFrame contains character or string data, because such operations cannot be performed.
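The behaviour described in the note can be sketched as follows. The example assumes an object-dtype string Series and uses select_dtypes() to keep only the numeric columns before calling abs(); the negative rating is a hypothetical value added so that abs() has a visible effect.

```python
import pandas as pd

names = pd.Series(['Tom', 'James', 'Ricky'], dtype=object)
ratings = pd.Series([4.23, -3.24, 3.98])  # -3.24 is hypothetical, for illustration

# abs() has no meaning for strings, so it raises a TypeError.
try:
    names.abs()
    abs_failed = False
except TypeError as exc:
    abs_failed = True
    print('abs() on strings failed:', exc)

# One way around this: restrict the frame to numeric columns first.
mixed = pd.DataFrame({'Name': names, 'Rating': ratings})
numeric_part = mixed.select_dtypes(include='number')
absolute_ratings = numeric_part.abs()
print(absolute_ratings)
```

Selecting the numeric columns up front keeps the call explicit about which data it operates on, rather than relying on the function to skip or reject string columns.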
Summarizing Data
The describe() function computes a summary of statistics for the DataFrame columns.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())
Its output is as follows:
Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000
This function reports the count, mean, std, min, max and the quartiles (25%, 50%, 75%), from which the IQR can be derived. By default, it excludes the character columns and summarizes only the numeric columns. 'include' is the argument used to specify which columns should be considered for the summary. It takes a list of values; by default, only numeric columns are included.
- object: Summarizes string columns
- number: Summarizes numeric columns
- all: Summarizes all columns together (pass it as the string 'all', not as a list value)
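describe() prints the quartiles rather than the IQR itself. As a minimal sketch, the IQR can be obtained by subtracting the 25% row from the 75% row of the summary; the names `summary` and `iqr` are chosen here for illustration.

```python
import pandas as pd

# Numeric columns from the chapter's data set.
numeric_df = pd.DataFrame({
    'Age': [25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46],
    'Rating': [4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65],
})

summary = numeric_df.describe()

# Interquartile range: third quartile minus first quartile, per column.
iqr = summary.loc['75%'] - summary.loc['25%']
print(iqr)
```

With the quartiles shown in the output above, this gives 35.5 - 25.0 for Age and 4.1325 - 3.23 for Rating.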
Now, use include=['object'] in the program and check the output:
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include=['object']))
Its output is as follows:
Name
count 12
unique 12
top Ricky
freq 1
Now, use include='all' and check the output:
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe(include='all'))
Its output is as follows:
Age Name Rating
count 12.000000 12 12.000000
unique NaN 12 NaN
top NaN Ricky NaN
freq NaN 1 NaN
mean 31.833333 NaN 3.743333
std 9.232682 NaN 0.661628
min 23.000000 NaN 2.560000
25% 25.000000 NaN 3.230000
50% 29.500000 NaN 3.790000
75% 35.500000 NaN 4.132500
max 51.000000 NaN 4.800000