Scikit Learn Concise Tutorial

Scikit Learn - Data Representation

As we know, machine learning is about creating models from data. For this purpose, the computer must first understand the data. Next, we discuss the ways data can be represented so that the computer can work with it.

Data as table

The best way to represent data in Scikit-learn is in the form of a table. A table is a 2-D grid of data in which the rows represent the individual elements of the dataset and the columns represent the quantities related to each of those elements.

Example

In the example below, we load the iris dataset as a Pandas DataFrame with the help of the Python seaborn library.

import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()

Output

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

From the output above, we can see that each row of the data represents a single observed flower, and the number of rows is the total number of flowers in the dataset. Generally, we refer to the rows of the matrix as samples.

Each column of the data, on the other hand, holds one piece of information that describes every sample. Generally, we refer to the columns of the matrix as features.
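The row and column roles can be checked directly on a small table. The following is a minimal sketch that rebuilds the five rows printed above by hand, so it does not need seaborn or a download; the variable names (table, first_sample, one_column) are chosen here for illustration only.

```python
import pandas as pd

# The five rows printed above, typed in by hand so nothing is downloaded.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (4.6, 3.1, 1.5, 0.2, "setosa"),
    (5.0, 3.6, 1.4, 0.2, "setosa"),
]
table = pd.DataFrame(rows, columns=columns)

# shape is (number of rows, number of columns)
n_rows, n_columns = table.shape

# One row: a single observed flower (a sample)
first_sample = table.iloc[0]

# One column: the same quantity measured for every flower
one_column = table["sepal_length"]

print(n_rows, n_columns)
print(first_sample["species"])
print(one_column.tolist())
```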

Data as Feature Matrix

The features matrix is the tabular layout in which the information can be thought of as a 2-D matrix. It is stored in a variable named X and is assumed to be two-dimensional with shape [n_samples, n_features]. It is usually held in a NumPy array or a Pandas DataFrame. As stated earlier, the samples always represent the individual objects described by the dataset, and the features represent the distinct observations that describe each sample in a quantitative manner.
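The [n_samples, n_features] shape can be illustrated with a small NumPy array. This is a sketch only, using the four measurements of the five flowers shown earlier; the names petal_length and X_single are introduced here for illustration. It also shows one common way to turn a single 1-D feature into the 2-D layout, since Scikit-learn estimators expect X to be two-dimensional.

```python
import numpy as np

# Four measurements for each of five flowers (values from the table above).
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])

# Rows are samples, columns are features.
n_samples, n_features = X.shape
print(X.ndim, n_samples, n_features)

# A single feature sliced out of X is 1-D, with shape (n_samples,).
petal_length = X[:, 2]

# reshape(-1, 1) turns it into a 2-D column of shape (n_samples, 1).
X_single = petal_length.reshape(-1, 1)
print(petal_length.shape, X_single.shape)
```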

Data as Target array

Along with the features matrix, denoted by X, we also have a target array. It is also called the label, and it is denoted by y. The label or target array is usually one-dimensional with length n_samples. It is generally held in a NumPy array or a Pandas Series. The target array may contain either continuous numerical values or discrete values (classes).
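The two kinds of target can be sketched as follows. The values are hypothetical and made up for illustration; they are not taken from the iris dataset. The names y_class and y_value are likewise assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical discrete target: one class label per sample.
y_class = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica", "versicolor"],
    name="species",
)

# Hypothetical continuous target: one numerical value per sample.
y_value = np.array([0.2, 0.2, 1.3, 2.1, 1.5])

# Both are one-dimensional with length n_samples (5 here).
print(y_class.shape, y_value.shape)

# The distinct classes present in the discrete target.
print(sorted(y_class.unique()))
```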

How does the target array differ from the feature columns?

The two can be distinguished by one point: the target array is usually the quantity we want to predict from the data. In statistical terms, it is the dependent variable.

Example

In the example below, we use the iris dataset to predict the species of a flower from the other measurements. In this case, the species column is considered the target array.

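# In a Jupyter notebook, run `%matplotlib inline` first to show plots inline.
import seaborn as sns

sns.set()  # apply seaborn's default plot styling
iris = sns.load_dataset('iris')
# Plot every pair of features against each other, colored by species
sns.pairplot(iris, hue='species', height=3)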

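Output

[Figure: pair plot of the four iris measurements, with points colored by species]

The features matrix and the target array can then be extracted from the DataFrame as follows.

# Features matrix: every column except the target
X_iris = iris.drop('species', axis=1)
print(X_iris.shape)
# Target array: the species column
y_iris = iris['species']
print(y_iris.shape)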

Output

(150, 4)
(150,)
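The X and y layout shown above is what Scikit-learn estimators take as input. The following is a minimal sketch under a few assumptions that are not in the original text: it uses the copy of iris bundled with Scikit-learn (load_iris) instead of the seaborn download, so the species in y are encoded as the integers 0, 1 and 2 rather than as strings, and it picks KNeighborsClassifier purely as an example estimator.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# load_iris ships with scikit-learn; return_X_y=True returns (features, target).
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)

# The number of rows in X must match the length of y.
assert X.shape[0] == y.shape[0]

# Estimators receive the features matrix and the target array in fit().
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# predict() takes a 2-D features matrix and returns one label per sample.
predictions = model.predict(X[:5])
print(predictions.shape)
```

Any other classifier could be substituted here; the point of the sketch is only that fit() receives a 2-D X and a 1-D y of matching length.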