Python Pandas: A Concise Tutorial
Python Pandas - Categorical Data
Real-world data often includes text columns whose values repeat. Features such as gender, country, and codes typically repeat across many rows. These are examples of categorical data.
Categorical variables can take only a limited, and usually fixed, number of possible values. Besides having a fixed set of values, categorical data may have an order, but numerical operations cannot be performed on it. Categorical is a pandas data type.
The categorical data type is useful in the following cases −
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
- The lexical order of a variable is not the same as the logical order ("one", "two", "three"). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
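The first two cases can be illustrated with a minimal sketch. The column contents and the variable names (sizes, size_type, and so on) are made up for illustration; the exact byte counts depend on the pandas version, so only their relative size matters.

```python
import pandas as pd

# Hypothetical data: a long column holding only three distinct strings.
sizes = pd.Series(["small", "large", "medium"] * 1000)
sizes_cat = sizes.astype("category")

# deep=True includes the memory taken by the string values themselves.
string_bytes = sizes.memory_usage(deep=True)
category_bytes = sizes_cat.memory_usage(deep=True)
print(string_bytes, category_bytes)

# Declare the logical order so that sorting and min/max follow it
# instead of the alphabetical order.
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
ordered_sizes = pd.Series(["large", "small", "medium"]).astype(size_type)
print(ordered_sizes.sort_values().tolist())
print(ordered_sizes.min(), ordered_sizes.max())
```

The categorical column stores one small integer code per row plus a single copy of each label, which is why it tends to be smaller when there are few distinct values.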
Object Creation
A categorical object can be created in several ways, which are described below −
category
Specify the dtype as "category" when creating a pandas object.
import pandas as pd
s = pd.Series(["a","b","c","a"], dtype="category")
print(s)
Its output is as follows −
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
Four elements were passed to the Series, but there are only three categories, as the Categories line of the output shows.
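An existing column can be converted in the same spirit with astype("category"). The following is a small sketch; the DataFrame and its column name "letter" are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical DataFrame with one repetitive text column.
df = pd.DataFrame({"letter": ["a", "b", "c", "a"]})
df["letter"] = df["letter"].astype("category")

# The distinct labels, stored once.
print(df["letter"].cat.categories.tolist())
# The integer code stored for each of the four rows.
print(df["letter"].cat.codes.tolist())
```

Each row holds only the position of its label in the categories, so four elements map onto three categories.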
pd.Categorical
A categorical object can also be created with the standard pandas Categorical constructor.
pandas.Categorical(values, categories, ordered)
Let's take an example −
import pandas as pd
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
print(cat)
Its output is as follows −
[a, b, c, a, b, c]
Categories (3, object): [a, b, c]
Let's look at another example −
import pandas as pd
cat = pd.Categorical(['a','b','c','a','b','c','d'], ['c', 'b', 'a'])
print(cat)
Its output is as follows −
[a, b, c, a, b, c, NaN]
Categories (3, object): [c, b, a]
Here, the second argument specifies the categories. Any value that is not present in the categories is treated as NaN, which is why 'd' appears as NaN in the output.
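The effect of a value that is missing from the categories can be inspected through the integer codes. This is a minimal sketch with made-up values; it assumes the behaviour described above, where such a value becomes NaN.

```python
import pandas as pd

# "d" is not listed in the categories, so it becomes a missing value.
cat = pd.Categorical(["a", "b", "d"], categories=["a", "b"])

# isna() flags the missing entry.
print(cat.isna().tolist())
# Missing entries are stored with the code -1.
print(cat.codes.tolist())
```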
Now, take a look at the following example −
import pandas as pd
cat = pd.Categorical(['a','b','c','a','b','c','d'], ['c', 'b', 'a'], ordered=True)
print(cat)
Its output is as follows −
[a, b, c, a, b, c, NaN]
Categories (3, object): [c < b < a]
Logically, this order means that a is greater than b, and b is greater than c.
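The practical effect of this ordering can be sketched as follows. The example wraps an ordered Categorical in a Series (the values are chosen for illustration and contain no missing entries) and then sorts and compares it.

```python
import pandas as pd

# Same category order as above: c < b < a.
cat = pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"], ordered=True)
s = pd.Series(cat)

# Sorting follows the declared order, not the alphabetical one.
print(s.sort_values().tolist())
# min and max also respect the declared order.
print(s.min(), s.max())
# Comparison with a scalar that is one of the categories.
print((s > "b").tolist())
```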
Description
Calling .describe() on categorical data produces output similar to that of a Series or DataFrame of string type.
import pandas as pd
import numpy as np
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})
print(df.describe())
print(df["cat"].describe())
Its output is as follows −
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object
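A related summary is value_counts(). As a small sketch using the same data as above, the counts of a categorical Series include categories that never occur, while missing values are left out by default.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])

# The unused category "b" is reported with a count of 0.
counts = pd.Series(cat).value_counts()
print(counts)
```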
Get the Properties of the Category
The categories attribute returns the categories of the object. On a Categorical it is accessed directly as obj.categories, as in the example below; on a Series it is reached through the accessor, as obj.cat.categories.
import pandas as pd
import numpy as np
s = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
print(s.categories)
Its output is as follows −
Index(['b', 'a', 'c'], dtype='object')
The obj.ordered attribute reports whether the categories of the object are ordered.
import pandas as pd
import numpy as np
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
print(cat.ordered)
Its output is as follows −
False
The attribute is False because no order was specified.
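For a Series, the same properties are available through the .cat accessor. The following sketch uses a made-up Series and also shows as_ordered(), which returns an ordered copy that uses the existing category order.

```python
import pandas as pd

s = pd.Series(["a", "c", "c"], dtype="category")

# Properties are reached through the .cat accessor on a Series.
print(s.cat.categories.tolist())
print(s.cat.ordered)

# as_ordered() returns a new Series; the original stays unordered.
s_ordered = s.cat.as_ordered()
print(s_ordered.cat.ordered)
```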
Renaming Categories
Categories are renamed with the series.cat.rename_categories() method. Older pandas versions also allowed assigning new values directly to the series.cat.categories property, but recent versions no longer support that assignment, so the method form is used here.
import pandas as pd
s = pd.Series(["a","b","c","a"], dtype="category")
s = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])
print(s.cat.categories)
Its output is as follows −
Index(['Group a', 'Group b', 'Group c'], dtype='object')
The initial categories [a, b, c] are replaced by the new names.
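Renaming can also be driven by a dict passed to rename_categories(), which maps old labels to new ones and leaves unlisted labels unchanged. This is a minimal sketch; the new labels are chosen for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Only "a" and "c" are renamed; "b" is not in the mapping and is kept as-is.
renamed = s.cat.rename_categories({"a": "Group a", "c": "Group c"})
print(renamed.tolist())
print(renamed.cat.categories.tolist())
```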
Appending New Categories
New categories can be appended with the Categorical.add_categories() method.
import pandas as pd
s = pd.Series(["a","b","c","a"], dtype="category")
s = s.cat.add_categories([4])
print(s.cat.categories)
Its output is as follows −
Index(['a', 'b', 'c', 4], dtype='object')
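One reason to append a category is that a value can only be assigned to a categorical Series if it is already one of the categories. The sketch below uses a made-up label "d"; the exception type raised for an unknown label has differed between pandas versions, so both TypeError and ValueError are caught here as an assumption.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

rejected = False
try:
    s.iloc[0] = "d"  # "d" is not a category yet, so this is refused
except (TypeError, ValueError) as err:
    rejected = True
    print("rejected:", type(err).__name__)

# After the category has been added, the same assignment is accepted.
s = s.cat.add_categories(["d"])
s.iloc[0] = "d"
print(s.tolist())
```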
Removing Categories
Unwanted categories can be removed with the Categorical.remove_categories() method. Values that belonged to a removed category become NaN.
import pandas as pd
s = pd.Series(["a","b","c","a"], dtype="category")
print("Original object:")
print(s)
print("After removal:")
print(s.cat.remove_categories("a"))
Its output is as follows −
Original object:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
After removal:
0 NaN
1 b
2 c
3 NaN
dtype: category
Categories (2, object): [b, c]
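When the goal is only to drop categories that no longer occur in the data, remove_unused_categories() avoids naming them explicitly. This is a minimal sketch in which the unused category is produced by filtering a made-up Series.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Filtering rows does not shrink the set of categories.
subset = s[s != "c"]
print(subset.cat.categories.tolist())

# Drop the categories that have no remaining values.
trimmed = subset.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())
```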
Comparison of Categorical Data
Comparing categorical data with other objects is possible in three cases −
- comparing equality (== and !=) to a list-like object (list, Series, array, …) of the same length as the categorical data.
- all comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical Series, when ordered=True and the categories are the same.
- all comparisons of categorical data to a scalar.
Take a look at the following example −
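import pandas as pd
# Recent pandas versions take categories and order via a CategoricalDtype,
# not as extra keyword arguments to astype().
cat_type = pd.CategoricalDtype(categories=[1,2,3], ordered=True)
cat = pd.Series([1,2,3]).astype(cat_type)
cat1 = pd.Series([2,2,2]).astype(cat_type)
print(cat > cat1)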
Its output is as follows −
0 False
1 False
2 True
dtype: bool
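The other two cases, and what happens when the categories differ, can be sketched as follows. The names cat_type, other, and raised are chosen for illustration.

```python
import pandas as pd

cat_type = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
cat = pd.Series([1, 2, 3]).astype(cat_type)

# Equality against a list-like object of the same length.
print((cat == [1, 5, 3]).tolist())
# Ordered comparison against a scalar that is one of the categories.
print((cat >= 2).tolist())

# Ordered comparison between categoricals with different categories is refused.
other = pd.Series([1, 2, 3]).astype(
    pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=True)
)
raised = False
try:
    cat > other
except TypeError as err:
    raised = True
    print("TypeError:", err)
```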