Python Pandas: A Concise Tutorial
Python Pandas - Missing Data
Missing data is a constant problem in real-world scenarios. In areas such as machine learning and data mining, missing values lower data quality and can seriously reduce the accuracy of model predictions. Handling missing values is therefore a major point of focus in these areas when making models more accurate and valid.
When and Why Is Data Missing?
Consider an online survey for a product. People often do not share all the information related to them. Some share their experience but not how long they have been using the product; others share how long they have been using it and their experience, but not their contact information. In one way or another, part of the data is always missing, and this is very common in real-world data.
Let us now see how to handle missing values (such as NA or NaN) using Pandas.
# import the pandas and numpy libraries
import pandas as pd
import numpy as np

# build a 5x3 frame of random values, then reindex to add rows 'b', 'd' and 'g'
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
Its output is as follows (the values are random, so they will differ from run to run):
one two three
a 0.077988 0.476149 0.965836
b NaN NaN NaN
c -0.390208 -0.551605 -2.301950
d NaN NaN NaN
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g NaN NaN NaN
h 0.085100 0.532791 0.887415
Using reindexing, we have created a DataFrame with missing values: the labels 'b', 'd' and 'g' did not exist in the original index, so their rows are filled with NaN. In the output, NaN means Not a Number.
Check for Missing Values
To make detecting missing values easier (and consistent across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects:
Example 1
import pandas as pd
import numpy as np
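# build a 5x3 frame, then reindex to introduce all-NaN rows 'b', 'd' and 'g'
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# True marks the labels where column 'one' is missing
print(df['one'].isnull())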
Its output is as follows:
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
Example 2
import pandas as pd
import numpy as np
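# build a 5x3 frame, then reindex to introduce all-NaN rows 'b', 'd' and 'g'
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# True marks the labels where column 'one' has a value
print(df['one'].notnull())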
Its output is as follows:
a True
b False
c True
d False
e True
f True
g False
h True
Name: one, dtype: bool
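The boolean results above are often used in two ways: summed to count missing values, and used as a mask to select rows. The following is a minimal sketch of both; the small fixed DataFrame and the variable names are illustrative assumptions, chosen so the result does not depend on random data.

```python
import numpy as np
import pandas as pd

# Small fixed frame (hypothetical data) so the result is reproducible.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

# True counts as 1 when summed, so this gives the number of NaN per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# A boolean mask from notnull() keeps only the rows where 'one' has a value.
rows_with_one = df[df["one"].notnull()]
print(rows_with_one)
```

Here column 'one' has one missing value and column 'two' has two, and the mask keeps rows 'a' and 'c'.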
Calculations with Missing Data
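- When summing data, NA values are treated as zero (they are skipped).
- If the data are all NA, older Pandas versions return NA for the sum; current versions return 0 by default, and return NA only when min_count=1 is passed.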
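The two rules above can be checked with a short sketch. It assumes a reasonably recent Pandas release, where the sum of an all-NA Series is 0 by default and the min_count parameter requests NaN instead; older releases returned NaN in the all-NA case. The Series values and variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
all_na = pd.Series([np.nan, np.nan])

# NaN is skipped when summing, so only 1.0 and 2.0 contribute.
total = s.sum()

# With every value missing, recent pandas returns 0.0 by default.
all_na_default = all_na.sum()

# min_count=1 requires at least one valid value, otherwise the result is NaN.
all_na_strict = all_na.sum(min_count=1)

print(total, all_na_default, all_na_strict)
```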
Cleaning / Filling Missing Data
Pandas provides various methods for cleaning missing values. The fillna method can "fill in" NA values with non-null data in several ways, which are illustrated in the following sections.
Replace NaN with a Scalar Value
The following program shows how to replace NaN with 0.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
# 'b' is not in the original index, so reindexing adds it as an all-NaN row
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
print(df.fillna(0))
Its output is as follows:
one two three
a -0.576991 -0.741695 0.553172
b NaN NaN NaN
c 0.744328 -1.735166 1.749580
NaN replaced with '0':
one two three
a -0.576991 -0.741695 0.553172
b 0.000000 0.000000 0.000000
c 0.744328 -1.735166 1.749580
Here we fill with zero, but any other value can be used instead.
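As an illustration of filling with values other than zero, the sketch below uses a different fill value per column and then each column's own mean. The data, the chosen fill values, and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [10.0, 20.0, np.nan]})

# A dict maps each column name to its own fill value.
filled_by_column = df.fillna({"one": 0.0, "two": -1.0})

# df.mean() is a Series indexed by column name, so each column's NaN
# is filled with that column's mean (NaN is skipped when computing it).
filled_with_mean = df.fillna(df.mean())

print(filled_by_column)
print(filled_with_mean)
```

Whether a constant, a mean, or some other value is appropriate depends on the data and on what the downstream analysis assumes.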
Fill NA Forward and Backward
Using the filling concepts discussed in the ReIndexing chapter, we can fill missing values from neighboring rows.
Sr.No | Method | Action
1 | pad / ffill | Fill values forward
2 | bfill / backfill | Fill values backward
Example 1
import pandas as pd
import numpy as np
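df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# forward fill: each NaN takes the last valid value above it
# (older code spells this fillna(method='pad'), which recent Pandas deprecates)
print(df.ffill())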
Its output is as follows:
one two three
a 0.077988 0.476149 0.965836
b 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
d -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g -0.930230 -0.670473 1.146615
h 0.085100 0.532791 0.887415
Example 2
import pandas as pd
import numpy as np
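df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# backward fill: each NaN takes the next valid value below it
# (older code spells this fillna(method='backfill'), which recent Pandas deprecates)
print(df.bfill())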
Its output is as follows:
one two three
a 0.077988 0.476149 0.965836
b -0.390208 -0.551605 -2.301950
c -0.390208 -0.551605 -2.301950
d -2.000303 -0.788201 1.510072
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g 0.085100 0.532791 0.887415
h 0.085100 0.532791 0.887415
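Two details of forward and backward filling are easy to miss in the outputs above: a leading NaN has nothing before it to copy, and a run of consecutive NaN values is filled completely unless a limit is set. The sketch below shows both on a small Series; the data and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Forward fill: the leading NaN stays, because no earlier value exists.
forward = s.ffill()

# Backward fill: the leading NaN is filled from the value that follows it.
backward = s.bfill()

# limit=1 fills at most one consecutive NaN after each valid value.
limited = s.ffill(limit=1)

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```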
Drop Missing Values
If you simply want to exclude missing values, use the dropna method along with the axis argument. By default, axis=0 (along rows), which means that if any value within a row is NA, the whole row is excluded.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# drop every row that contains at least one NaN
print(df.dropna())
Its output is as follows:
one two three
a 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
h 0.085100 0.532791 0.887415
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# drop every column that contains at least one NaN; here that is all of them
print(df.dropna(axis=1))
Its output is as follows. Every column contains at least one NaN, so all columns are dropped:
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
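Dropping on "any NaN" can be too aggressive, as the empty result above shows. The sketch below illustrates three other dropna parameters (how, thresh, and subset) that narrow what gets dropped; the data and variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan], "two": [4.0, np.nan, np.nan, 8.0]},
    index=["a", "b", "c", "d"],
)

# how='all' drops a row only when every value in it is NaN (row 'b' here).
drop_all_na = df.dropna(how="all")

# thresh=2 keeps rows that have at least two non-NaN values (row 'a' here).
keep_two_valid = df.dropna(thresh=2)

# subset limits the check to the listed columns: rows missing 'one' are dropped.
need_one = df.dropna(subset=["one"])

print(drop_all_na)
print(keep_two_valid)
print(need_one)
```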
Replace Missing or Generic Values
We often need to replace a generic value with some specific value. This can be done with the replace method.
Replacing NA with a scalar value using replace() behaves the same as the fillna() method.
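A minimal sketch of both uses of replace follows. The placeholder values 1000 and 2000, their replacements, and the variable names are illustrative assumptions, for instance sentinel codes that a data source might use for "unknown".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [10, 20, 1000], "two": [1000, 0, 2000]})

# A dict maps each value to look for to the value that should replace it.
replaced = df.replace({1000: 10, 2000: 60})
print(replaced)

# Replacing NaN with a scalar gives the same result as fillna with that scalar.
s = pd.Series([1.0, np.nan])
same_as_fillna = s.replace(np.nan, 0).equals(s.fillna(0))
print(same_as_fillna)
```

For missing values alone, fillna is usually the clearer choice; replace is more useful when specific non-missing values also need to be swapped.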