Python Data Science: A Concise Tutorial
Python - Measuring Central Tendency
In statistics, a measure of central tendency identifies the center, or typical value, of the values in a data set. It gives an idea of the average value of the data and, when the different measures are compared with one another, a rough indication of how the values are distributed around that center. That in turn helps in judging how well a new observation fits the existing data set.
There are three main measures of central tendency, each of which can be calculated with methods in the pandas library:
- Mean - The average of the data: the sum of the values divided by the number of values.
- Median - The middle value of the distribution when the values are arranged in ascending or descending order. With an even number of values, it is the average of the two middle values.
- Mode - The most frequently occurring value in the distribution.
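The three definitions above can be sketched in plain Python before turning to pandas. This is a minimal illustration, not part of the pandas API: the function names mean, median and modes are hypothetical helpers written for this example, and the Age values are taken from the tutorial's first data set.

```python
from collections import Counter

# Age values from the tutorial's first example
ages = [25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]

def mean(values):
    # Sum of the values divided by the number of values
    return sum(values) / len(values)

def median(values):
    # Middle value of the sorted data; average the two middle
    # values when the number of values is even
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    # Every value that ties for the highest frequency, in ascending order
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(mean(ages))
print(median(ages))
print(modes(ages))
```

Returning a list from modes is a deliberate choice here: in this data set three ages tie for the highest frequency, so a single "most common value" would hide the tie.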
Calculating Mean and Median
The pandas methods mean() and median() can be called directly on a DataFrame to calculate these values.
import pandas as pd

# Create a dictionary of Series
d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                        'Lee', 'Chanchal', 'Gasper', 'Naviya', 'Andres']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

# Create a DataFrame
df = pd.DataFrame(d)

# numeric_only=True skips the text column 'Name'
print("Mean Values in the Distribution")
print(df.mean(numeric_only=True))
print("*******************************")
print("Median Values in the Distribution")
print(df.median(numeric_only=True))
Its output is as follows:
Mean Values in the Distribution
Age       31.833333
Rating     3.743333
dtype: float64
*******************************
Median Values in the Distribution
Age       29.50
Rating     3.79
dtype: float64
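The same values can also be requested for a single column, or for several statistics at once. The sketch below assumes a reasonably recent pandas release and rebuilds only the numeric columns of the example; the variable names age_mean, age_median and summary are illustrative.

```python
import pandas as pd

# Numeric columns from the example above
df = pd.DataFrame({
    'Age': [25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46],
    'Rating': [4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65],
})

# Mean and median of one column return plain scalars
age_mean = df['Age'].mean()
age_median = df['Age'].median()
print(age_mean, age_median)

# agg() with a list of names returns one row per statistic
summary = df.agg(['mean', 'median'])
print(summary)
```

Selecting a column first is often the simpler route when only one statistic is needed, while agg() keeps related statistics together in one table.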
Calculating Mode
A distribution may or may not have a meaningful mode. Continuous data often contains no repeated values, and several values can tie for the maximum frequency. We take a simple distribution below to find the mode. Here the Age column has one value that occurs more often than any other.
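import pandas as pd

# Create a dictionary of Series
# ('Age' is listed first so the column order matches the output shown)
d = {'Age': pd.Series([25, 26, 25, 23, 30, 25, 23, 34, 40, 30, 25, 46]),
     'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                        'Lee', 'Chanchal', 'Gasper', 'Naviya', 'Andres'])}

# Create a DataFrame
df = pd.DataFrame(d)

# mode() returns every value that ties for the highest frequency in each column
print(df.mode())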
Its output is as follows:
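     Age      Name
0   25.0    Andres
1    NaN  Chanchal
2    NaN    Gasper
3    NaN      Jack
4    NaN     James
5    NaN       Lee
6    NaN    Naviya
7    NaN     Ricky
8    NaN     Smith
9    NaN     Steve
10   NaN       Tom
11   NaN       Vin
Age has a single mode, 25. Every name occurs exactly once, so all twelve names tie as modes of the Name column, and pandas pads the shorter Age column with NaN.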
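When only one column is of interest, calling mode() on that column avoids the NaN padding. The following sketch uses the two Age lists from this tutorial; the variable names are illustrative, and it assumes that Series.mode() returns its result sorted in ascending order, as current pandas releases do.

```python
import pandas as pd

# Age values from the mode example: 25 occurs four times
ages = pd.Series([25, 26, 25, 23, 30, 25, 23, 34, 40, 30, 25, 46], name='Age')

# mode() on a Series returns a Series, because several values may tie
age_modes = ages.mode()
print(age_modes.tolist())

# value_counts() lists each distinct value with its frequency,
# most frequent first, which shows how dominant the mode is
counts = ages.value_counts()
top_count = counts.iloc[0]
print(top_count)

# Age values from the mean/median example: three values tie
tied = pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46])
tied_modes = tied.mode()
print(tied_modes.tolist())
```

Checking value_counts() alongside mode() is a cheap way to tell a genuine mode from a tie among values that each occur only a couple of times.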