Scikit-Learn Concise Tutorial

Scikit Learn - Introduction

In this chapter, we look at what Scikit-Learn (Sklearn) is, where it came from, and some related topics: the community and contributors responsible for its development and maintenance, its prerequisites, its installation, and its features.

What is Scikit-Learn (Sklearn)

Scikit-learn (Sklearn) is one of the most widely used and robust machine learning libraries for Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, through a consistent Python interface. The library is written largely in Python and builds on NumPy, SciPy and Matplotlib.
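The "consistent interface" mentioned above can be illustrated with a small sketch. The dataset (iris) and the two estimators are choices made here for illustration only, and the example assumes scikit-learn is already installed. Both estimators are trained and scored through the same fit and score methods, even though the underlying algorithms are unrelated.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Two unrelated algorithms that expose the same estimator methods.
models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
]

scores = {}
for model in models:
    model.fit(X, y)                                    # same training call
    scores[type(model).__name__] = model.score(X, y)   # same scoring call

# Accuracy on the training data itself, so these numbers are optimistic.
print(scores)
```

Because the method names are shared, swapping one algorithm for another usually means changing only the line that constructs the estimator.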

Origin of Scikit-Learn

It was originally called scikits.learn and was first developed by David Cournapeau as a Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel, from INRIA (French Institute for Research in Computer Science and Automation), took the project to another level and made the first public release (v0.1 beta) on 1 February 2010.

Let's have a look at its version history, newest first:

  1. May 2019: scikit-learn 0.21.0

  2. March 2019: scikit-learn 0.20.3

  3. December 2018: scikit-learn 0.20.2

  4. November 2018: scikit-learn 0.20.1

  5. September 2018: scikit-learn 0.20.0

  6. July 2018: scikit-learn 0.19.2

  7. July 2017: scikit-learn 0.19.0

  8. September 2016: scikit-learn 0.18.0

  9. November 2015: scikit-learn 0.17.0

  10. March 2015: scikit-learn 0.16.0

  11. July 2014: scikit-learn 0.15.0

  12. August 2013: scikit-learn 0.14

Community & contributors

Scikit-learn is a community effort and anyone can contribute to it. The project is hosted on https://github.com/scikit-learn/scikit-learn. At the time of writing, the following people were core contributors to Sklearn's development and maintenance:

  1. Joris Van den Bossche (Data Scientist)

  2. Thomas J Fan (Software Developer)

  3. Alexandre Gramfort (Machine Learning Researcher)

  4. Olivier Grisel (Machine Learning Expert)

  5. Nicolas Hug (Associate Research Scientist)

  6. Andreas Mueller (Machine Learning Scientist)

  7. Hanmin Qin (Software Engineer)

  8. Adrin Jalali (Open Source Developer)

  9. Nelle Varoquaux (Data Science Researcher)

  10. Roman Yurchak (Data Scientist)

Various organisations, including Booking.com, JP Morgan, Evernote, Inria, AWeber, Spotify and many more, use Sklearn.

Prerequisites

Before we start using the scikit-learn release that was current at the time of writing, we need the following:

  1. Python (>= 3.5)

  2. NumPy (>= 1.11.0)

  3. SciPy (>= 0.17.0)

  4. Joblib (>= 0.11)

  5. Matplotlib (>= 1.5.1), required for Sklearn's plotting capabilities.

  6. Pandas (>= 0.18.0), required for some of the scikit-learn examples that use its data structures and analysis tools.
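To see which of the packages listed above are present in the current environment, a small check like the following can help. It is only a sketch: the helper name installed_versions is hypothetical, and it relies on the standard-library importlib.metadata module, which assumes Python 3.8 or newer (newer than the minimum listed above).

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}.

    Hypothetical helper written for this tutorial, not part of scikit-learn.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["numpy", "scipy", "joblib", "matplotlib", "pandas"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

The printed versions can then be compared by eye against the minimums in the list.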

Installation

If you have already installed NumPy and SciPy, the following are the two easiest ways to install scikit-learn:

Using pip

The following command installs scikit-learn via pip:

pip install -U scikit-learn

Using conda

The following command installs scikit-learn via conda:

conda install scikit-learn

On the other hand, if NumPy and SciPy are not yet installed on your Python workstation, you can install them using either pip or conda.

Another option is to use a Python distribution such as Canopy or Anaconda, because both ship with a recent version of scikit-learn.
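Whichever installation route is used, a quick way to confirm that it worked is to import the package and print its version. The sketch below assumes scikit-learn 0.20 or newer, since sklearn.show_versions() was added in that release.

```python
import sklearn

# The import succeeding is already a sign that the installation worked.
print(sklearn.__version__)

# Prints Python, platform and dependency versions (available from 0.20 on).
sklearn.show_versions()
```

The same output is useful to attach when reporting a problem to the project.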

Features

Rather than focusing on loading, manipulating and summarising data, the Scikit-learn library focuses on modeling the data. Some of the most popular groups of models provided by Sklearn are as follows:

Supervised Learning algorithms: Almost all the popular supervised learning algorithms, such as Linear Regression, Support Vector Machine (SVM) and Decision Tree, are part of scikit-learn.
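As a minimal sketch of supervised learning, the following fits a linear regression on the diabetes dataset bundled with scikit-learn. The dataset, the 75/25 split and the random seed are assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 442 patients, 10 numeric features, one continuous target.
X, y = load_diabetes(return_X_y=True)

# Hold out a quarter of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

reg = LinearRegression()
reg.fit(X_train, y_train)          # learn one coefficient per feature
y_pred = reg.predict(X_test)       # predictions for the held-out rows
r2 = reg.score(X_test, y_test)     # coefficient of determination (R^2)

print(f"R^2 on held-out data: {r2:.2f}")
```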

Unsupervised Learning algorithms: It also has many of the popular unsupervised learning algorithms, ranging from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.

Clustering: This group of models is used for grouping unlabeled data.
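Grouping unlabeled data can be sketched with k-means, one of several clustering algorithms in scikit-learn. The synthetic data from make_blobs and the choice of three clusters are assumptions for this example; real data rarely announces how many groups it has.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points drawn around three centres; the true labels are discarded.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)     # cluster index assigned to each point

print(labels[:10])
print(kmeans.cluster_centers_)     # one centre per cluster
```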

Cross Validation: It is used to estimate how accurately supervised models perform on unseen data.
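A minimal sketch of cross validation, assuming a decision tree on the iris dataset and five folds (all three choices are illustrative): the data is split into five parts, and each part in turn is held out for scoring while the model is trained on the other four.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# cv=5 gives five train/test splits and therefore five accuracy values.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(scores)
print(f"mean accuracy: {scores.mean():.2f}")
```

Averaging over several folds gives a steadier estimate than a single train/test split, at the cost of training the model several times.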

Dimensionality Reduction: It is used for reducing the number of attributes in the data; the reduced data can then be used for summarisation, visualisation and feature selection.
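As an illustration of dimensionality reduction, the sketch below uses PCA to project the four iris features onto two components, which is convenient for plotting. The dataset and the number of components are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)    # shape (150, 4)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)          # shape (150, 2)

print(X_2d.shape)
# Fraction of the original variance captured by each component.
print(pca.explained_variance_ratio_)
```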

Ensemble methods: As the name suggests, they are used for combining the predictions of multiple supervised models.
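One way to combine the predictions of several supervised models is majority voting. The sketch below uses VotingClassifier with three base models; the particular models, the dataset and the split are choices made here for illustration, and other ensemble approaches (such as random forests) work differently.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# voting="hard": each base model casts one vote and the majority class wins.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
accuracy = ensemble.score(X_test, y_test)

print(f"ensemble accuracy: {accuracy:.2f}")
```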

Feature extraction: It is used to extract features from data, for example to define the attributes in image and text data.
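For text data, a simple form of feature extraction is counting words. The sketch below uses CountVectorizer on three made-up sentences (the sentences are invented for this example) and turns them into a matrix with one column per distinct word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents invented for this example.
docs = ["the cat sat", "the dog sat", "the cat ran"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse matrix: documents x words

# Column order follows the alphabetically sorted vocabulary.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)          # ['cat', 'dog', 'ran', 'sat', 'the']
print(counts.toarray())    # one row of word counts per document
```

The resulting numeric matrix can be passed to any estimator in the same way as ordinary tabular data.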

Feature selection: It is used to identify the attributes that are useful for creating supervised models.
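A minimal sketch of feature selection, assuming a univariate approach: SelectKBest scores each attribute against the target with an ANOVA F-test (f_classif) and keeps the k highest-scoring ones. The dataset and k=2 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)            # 4 candidate features

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)    # keeps only the 2 best columns

kept = selector.get_support(indices=True)    # column indices that survived
print(kept)
print(X_selected.shape)
```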

Open Source: It is an open source library and is also commercially usable under the BSD license.