Machine Learning Concise Tutorial
Machine Learning - Python Libraries
Python libraries are collections of code and functions that a program can use for a specific task. They are generally used to simplify programming when tasks are repetitive or complex.
Machine learning is an interdisciplinary field in which each algorithm combines programming and mathematics. Instead of manually coding a complete algorithm from its mathematical and statistical formulas, you can use a library that already implements it.
Python is one of the most popular programming languages for implementing machine learning because of its simplicity, ease of use, and vast collection of libraries.
Some popular Python machine learning libraries, each discussed in detail below, are NumPy, Pandas, SciPy, Scikit-learn, PyTorch, TensorFlow, Keras, Matplotlib, Seaborn, OpenCV, NLTK, and spaCy.
NumPy
NumPy is a general-purpose array and matrix processing package used for scientific computing and for mathematical operations such as linear algebra and Fourier transforms. It provides a high-performance multi-dimensional array object and tools for manipulating it. It is a critical component of the Python machine learning ecosystem, as it provides the underlying data structures and numerical operations required by many machine learning algorithms.
By using NumPy, we can perform the following important operations −
- Mathematical and logical operations on arrays.
- Fourier transformation.
- Operations associated with linear algebra.
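The three kinds of operations listed above can be sketched in a few lines. This is only an illustrative example: the array values and the names `a`, `b`, `M`, and `v` are made up for the demonstration.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([4, 3, 2, 1])

# Mathematical and logical operations work element by element.
total = a + b   # element-wise sum
mask = a > b    # element-wise comparison, gives a boolean array

# Fourier transformation of a constant signal:
# all of the energy ends up in the first (zero-frequency) bin.
spectrum = np.fft.fft(np.ones(4))

# Linear algebra: solve the linear system M @ x = v.
M = np.array([[2.0, 0.0], [0.0, 4.0]])
v = np.array([2.0, 8.0])
x = np.linalg.solve(M, v)

print(total)
print(mask)
print(spectrum.real)
print(x)
```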
NumPy is often seen as an open-source alternative to MATLAB, because it is mostly used together with SciPy (Scientific Python) and Matplotlib (a plotting library).
Installation and Execution
If you are using the Anaconda distribution, there is no need to install NumPy separately, as it is already included. You just need to import the package into your Python script as follows −
import numpy as np
On the other hand, if you are using the standard Python distribution, NumPy can be installed with pip, the popular Python package installer.
pip install numpy
Example
The following simple example creates a one-dimensional array using NumPy −
import numpy as np
data = np.array([1,2,3,4,5])
print(data)
print(len(data))
print(type(data))
print(data.shape)
The above Python example code will produce the following result −
[1 2 3 4 5]
5
<class 'numpy.ndarray'>
(5,)
Pandas
Pandas is a powerful library for data manipulation and analysis. It is not used inside machine learning algorithms themselves but in the preceding step, data preparation. It is built around two data structures: Series (one-dimensional) and DataFrame (two-dimensional). These let it handle a wide range of typical use cases in sectors such as finance, business, and health.
With the help of Pandas, we can accomplish the following five steps of data processing −
- Load
- Prepare
- Manipulate
- Model
- Analyze
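As a rough sketch of how some of these steps might look in practice, the example below loads a small table, prepares it by filling a missing value, manipulates it by adding a column, and analyzes it with a summary statistic. The CSV content and the column names are hypothetical, and the modeling step is left out because it is normally handled by another library.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,age,score
Aarav,15,82
Harshit,14,
Kanika,16,91
"""

# Load: read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Prepare: fill the missing score with the mean of the known scores.
df["score"] = df["score"].fillna(df["score"].mean())

# Manipulate: derive a new column from an existing one.
df["passed"] = df["score"] >= 85

# Analyze: summarize the prepared data.
print(df)
print(df["score"].mean())
```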
Data Representation in Pandas
Data in Pandas has traditionally been represented with the help of the following three data structures −
Series − It is a one-dimensional ndarray with axis labels, which means it behaves like a simple array of homogeneous data. For example, the following series is a collection of integers 1, 5, 10, 15, 24, 25…
1 | 5 | 10 | 15 | 24 | 25 | 28 | 36 | 40 | 89
DataFrame − It is the most useful data structure and is used for almost all kinds of data representation and manipulation in Pandas. It is a two-dimensional data structure that can contain heterogeneous data. Tabular data is generally represented with DataFrames. For example, the following table shows student data: name, roll number, age, and gender −
Name | Roll number | Age | Gender
Aarav | 1 | 15 | Male
Harshit | 2 | 14 | Male
Kanika | 3 | 16 | Female
Mayank | 4 | 15 | Male
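The student table above could be built as a DataFrame roughly as follows. This is a minimal sketch: the variable names `students` and `older` are arbitrary, and the filter at the end is only one example of the kind of manipulation a DataFrame supports.

```python
import pandas as pd

# Each dictionary key becomes a column; the columns hold different data types.
students = pd.DataFrame({
    "Name": ["Aarav", "Harshit", "Kanika", "Mayank"],
    "Roll number": [1, 2, 3, 4],
    "Age": [15, 14, 16, 15],
    "Gender": ["Male", "Male", "Female", "Male"],
})

print(students.shape)          # (rows, columns)
print(students["Age"].mean())  # average of a numeric column

# Boolean indexing keeps only the rows that satisfy a condition.
older = students[students["Age"] >= 15]
print(older["Name"].tolist())
```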
Panel − It was a 3-dimensional data structure containing heterogeneous data. A panel is very difficult to represent graphically, but it can be pictured as a container of DataFrames. Note that Panel was deprecated and then removed in pandas 0.25, so current versions of Pandas provide only Series and DataFrame; three-dimensional data is usually held in a DataFrame with a MultiIndex instead.
The following table gives the dimension and description of the above-mentioned Pandas data structures −
Data Structure | Dimension | Description
Series | 1-D | Size immutable, 1-D homogeneous data
DataFrames | 2-D | Size mutable, heterogeneous data in tabular form
Panel | 3-D | Size-mutable array, container of DataFrame
In other words, a higher-dimensional data structure is a container of lower-dimensional data structures.
Installation and Execution
If you are using the Anaconda distribution, there is no need to install Pandas separately, as it is already included. You just need to import the package into your Python script as follows −
import pandas as pd
On the other hand, if you are using the standard Python distribution, Pandas can be installed with pip, the popular Python package installer.
pip install pandas
After installing Pandas, you can import it into your Python script as shown above.
Example
The following example creates a Series from an ndarray using Pandas −
import pandas as pd
import numpy as np
data = np.array(['g','a','u','r','a','v'])
s = pd.Series(data)
print(s)
The above example code will produce the following result −
0 g
1 a
2 u
3 r
4 a
5 v
dtype: object
SciPy
SciPy is an open-source library for scientific computing on large datasets. It is easy to use and executes numerical and data manipulation tasks quickly. It consists of modules for optimization and for operations such as integration, linear algebra, and signal processing. SciPy is built on NumPy and extends its functionality with more complex numerical algorithms and algebraic functions.
Installation and Execution
If you are using the Anaconda distribution, there is no need to install SciPy separately, as it is already included. You just need to import the package into your Python script. For example, the following line imports the linalg submodule from scipy −
from scipy import linalg
On the other hand, if you are using the standard Python distribution and already have NumPy, SciPy can be installed with pip, the popular Python package installer.
pip install scipy
Example
The following example creates a two-dimensional array (matrix) and finds its inverse −
import numpy as np
from scipy import linalg
A = np.array([[1, 2], [3, 4]])
print(linalg.inv(A))
The above Python example code will produce the following result −
[[-2.   1. ]
 [ 1.5 -0.5]]
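Beyond linear algebra, the integration and optimization modules mentioned earlier can be tried in the same way. The sketch below uses `scipy.integrate.quad` and `scipy.optimize.minimize_scalar`; the integrand and the quadratic objective are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import integrate, optimize

# Integration: the integral of sin(x) from 0 to pi is exactly 2.
# quad returns the estimate together with an estimate of the absolute error.
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Optimization: find the x that minimizes (x - 3)^2, whose minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)

print(round(area, 6))
print(round(result.x, 6))
```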
Scikit-learn
Scikit-learn is a popular open-source library, built on NumPy and SciPy, for implementing machine learning models and statistical modeling. It supports both supervised and unsupervised learning, and it provides tools for data pre-processing, feature selection, model selection, model evaluation, and many other tasks.
The following features make Scikit-learn so useful −
- It is built on NumPy, SciPy, and Matplotlib.
- It is open source and can be reused under the BSD license.
- It is accessible to everybody and can be reused in various contexts.
- It implements a wide range of machine learning algorithms covering the major areas of ML, such as classification, clustering, regression, dimensionality reduction, and model selection.
Installation and Execution
If you are using the Anaconda distribution, there is no need to install Scikit-learn separately, as it is already included. You just need to import the package into your Python script. For example, the following line imports the loader for a breast cancer dataset from Scikit-learn −
from sklearn.datasets import load_breast_cancer
On the other hand, if you are using the standard Python distribution and already have NumPy and SciPy, Scikit-learn can be installed with pip, the popular Python package installer.
pip install scikit-learn
After installing Scikit-learn, you can use it in your Python script as shown above.
Example
The following example loads the breast cancer dataset −
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
print(data.target[[10, 50, 85]])
print(list(data.target_names))
The above Python example code will produce the following result −
[0 1 0]
['malignant', 'benign']
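To illustrate the pre-processing, model fitting, and evaluation tools mentioned earlier, the sketch below trains a classifier on the same dataset. The choice of logistic regression, the 25% test split, and the `random_state` value are assumptions made for this example, not recommendations from the tutorial.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the features (X) and the class labels (y) as NumPy arrays.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pre-processing and the model are chained so that scaling is
# learned from the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Model evaluation: fraction of correctly classified test samples.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```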
For a more detailed study of Scikit-learn, see www.tutorialspoint.com/scikit_learn/index.htm.
PyTorch
PyTorch is an open-source Python library based on the Torch library and generally used for developing deep neural networks. Its API follows ordinary Python idioms, and it defines computational graphs dynamically, as the code runs. PyTorch is particularly useful for researchers and developers who need a flexible and powerful deep learning framework.
Installation and Execution
For Python 3.8 or later on a CPU platform on the Windows operating system, you can use the following command to install PyTorch (torch, torchvision, and torchaudio) −
pip3 install torch torchvision torchaudio
Refer to the PyTorch installation guide for more installation options −
https://pytorch.org/get-started/locally/
To import PyTorch, use the following −
import torch
After installing PyTorch, you can import it into your Python script as shown above.
Example
The following example creates a NumPy array and converts it to a PyTorch tensor −
import numpy as np
import torch
x = np.ones([3,4])
y = torch.from_numpy(x)
print(y)
The above example code will produce the following result −
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]], dtype=torch.float64)
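The dynamically defined computational graph mentioned earlier can be seen in a small automatic differentiation example. This is only a sketch: the function y = x^2 + 2x and the input value 3.0 are arbitrary choices.

```python
import torch

# requires_grad=True asks PyTorch to record operations on x.
x = torch.tensor(3.0, requires_grad=True)

# The graph is built on the fly as these ordinary Python expressions run.
y = x ** 2 + 2 * x

# Walk the recorded graph backwards to compute dy/dx = 2x + 2.
y.backward()

print(y.item())       # value of y at x = 3
print(x.grad.item())  # gradient of y with respect to x at x = 3
```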
TensorFlow
TensorFlow is one of the best-known software libraries for machine learning and deep learning, developed by Google. It makes it easier to create computational graphs and execute them efficiently on various hardware platforms. It is widely used for tasks such as natural language processing, image recognition, and handwriting recognition.
Installation and Execution
For a CPU platform on the Windows operating system, you can use the following command to install TensorFlow using pip −
pip install tensorflow
Refer to the TensorFlow installation guide for more installation options −
https://www.tensorflow.org/install/pip
To import TensorFlow, use the following −
import tensorflow as tf
After installing TensorFlow, you can import it into your Python script as shown above.
Example
The following example creates a tensor object using TensorFlow −
import tensorflow as tf
data = tf.constant([[2,1],[4,6]])
print(data)
The above example code will produce the following result −
tf.Tensor(
[[2 1]
 [4 6]], shape=(2, 2), dtype=int32)
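Tensors like the one above can be combined with TensorFlow operations, and gradients can be computed with `tf.GradientTape`. The following is a minimal sketch assuming TensorFlow 2.x with eager execution; the second matrix and the function y = x * x are arbitrary examples.

```python
import tensorflow as tf

a = tf.constant([[2, 1], [4, 6]])
identity = tf.constant([[1, 0], [0, 1]])

# Matrix product with the identity matrix returns the original values.
product = tf.matmul(a, identity)

# Reduce all elements of the tensor to a single sum.
total = tf.reduce_sum(a)

# Automatic differentiation: record operations on a variable,
# then ask for dy/dx of y = x * x at x = 3.
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x * x
grad = tape.gradient(y, x)

print(product.numpy())
print(total.numpy())
print(grad.numpy())
```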
Keras
Keras is a high-level neural network library for creating deep learning models. Earlier versions ran on top of TensorFlow, CNTK, or Theano; Keras 3 supports TensorFlow, JAX, and PyTorch as backends. It provides a simple and intuitive API for building and training deep learning models, making it an excellent choice for beginners and researchers. Keras is popular because it allows easy and fast prototyping.
Installation and Execution
For a CPU platform on the Windows operating system, use the following command to install Keras using pip −
pip install keras
To import Keras, use the following −
import keras
After installing Keras, you can import it into your Python script as shown above.
Example
The example below loads the CIFAR-10 dataset from Keras and prints the shapes of the training and test data −
import keras
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
The above example code will produce the following result −
(50000, 32, 32, 3)
(10000, 32, 32, 3)
(50000, 1)
(10000, 1)
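To give a feel for the model-building API described earlier, the sketch below defines a small classifier whose input shape matches the CIFAR-10 images. The layer sizes are arbitrary, and a random batch stands in for real images so that the example runs without downloading the dataset; it assumes a working Keras backend such as TensorFlow is installed.

```python
import numpy as np
import keras
from keras import layers

# A small fully connected network for 32x32 RGB images and 10 classes.
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Random stand-in batch of 8 images (not real CIFAR-10 data).
x_fake = np.random.rand(8, 32, 32, 3).astype("float32")

# The untrained model already outputs one probability per class.
probs = model.predict(x_fake, verbose=0)
print(probs.shape)
```

Training on the real data would follow the same pattern, passing `x_train` and `y_train` from the example above to `model.fit`.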
Matplotlib
Matplotlib is a popular plotting library used for data visualization: graphs, plots, histograms, and bar charts. It provides tools and functions for data analysis, exploration, and presentation tasks.
Installation and Execution
We can use the following command to install Matplotlib using pip −
pip install matplotlib
Most of the Matplotlib utilities lie under the pyplot submodule. We can import pyplot from Matplotlib using the following line −
import matplotlib.pyplot as plt
After installing Matplotlib, you can import it into your Python script as shown above.
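A minimal plotting sketch might look like the following. The data values and the output file name `squares.png` are made up for illustration, and the non-interactive `Agg` backend is selected only so that the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render to files instead of opening a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# One figure containing a single set of axes.
fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Squares of the first five integers")
ax.legend()

# Hypothetical output file name.
fig.savefig("squares.png")
```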
Seaborn
Seaborn is an open-source Python library built on Matplotlib and integrated with Pandas. It is used for making presentable and informative statistical graphics, which makes it well suited to business and marketing analysis. The library helps you explore and understand your data.
Installation and Execution
We can use the following command to install Seaborn using pip −
pip install seaborn
We can import Seaborn into our Python script using the following line −
import seaborn as sns
After installing Seaborn, you can import it into your Python script as shown above.
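Because Seaborn integrates with Pandas, a plot can be drawn directly from DataFrame columns. The sketch below reuses the student data from the Pandas section; the output file name `students.png` is hypothetical, and the `Agg` backend is chosen only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render to files instead of opening a window
import pandas as pd
import seaborn as sns

students = pd.DataFrame({
    "Roll number": [1, 2, 3, 4],
    "Age": [15, 14, 16, 15],
    "Gender": ["Male", "Male", "Female", "Male"],
})

# Column names are passed as strings; hue colors the points by category.
ax = sns.scatterplot(data=students, x="Roll number", y="Age", hue="Gender")
ax.set_title("Student age by roll number")

# Seaborn returns a Matplotlib Axes, so the usual Matplotlib calls apply.
ax.figure.savefig("students.png")
```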
OpenCV
Open Source Computer Vision Library, OpenCV for short, is a library with Python bindings for computer vision and image processing tasks. It is used to identify patterns and features in images, and it integrates with NumPy: in Python, OpenCV represents images as NumPy arrays.
NLTK
Natural Language Toolkit, NLTK for short, is a Python programming environment generally used for natural language processing tasks. It comprises easy-to-use interfaces to resources such as WordNet, along with text processing libraries for classification, tokenization, parsing, and semantic reasoning.
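As a small illustration of tokenization, the sketch below uses `wordpunct_tokenize` and `PorterStemmer`, two NLTK components that work without downloading any additional data. The sample sentence is made up, and NLTK itself is assumed to be installed (for example with `pip install nltk`).

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "Machine learning libraries simplify repetitive tasks."

# Split the sentence into word and punctuation tokens with a regular expression.
tokens = wordpunct_tokenize(text)

# Reduce each token to its stem, e.g. "tasks" becomes "task".
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```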
spaCy
spaCy is a free, open-source Python library. It provides fast, production-oriented features for advanced tasks in natural language processing. Word tokenization and POS tagging are two tasks that the library performs effectively.
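Word tokenization can be tried with a blank English pipeline, which needs no model download. This is a minimal sketch with a made-up sentence; POS tagging is not shown because it requires a trained pipeline (such as `en_core_web_sm`) to be installed separately.

```python
import spacy

# A blank pipeline contains only the language-specific tokenizer.
nlp = spacy.blank("en")

doc = nlp("spaCy tokenizes text quickly.")
tokens = [token.text for token in doc]

print(tokens)
```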
Many other Python tools and frameworks are used for machine learning, such as XGBoost, LightGBM, and Gensim. Studying these libraries helps you understand the machine learning ecosystem and build, train, and deploy models.