Python Web Scraping Tutorial

Python Web Scraping - Dealing with Text

In the previous chapter, we saw how to deal with the videos and images that we obtain as part of web scraping. In this chapter, we are going to deal with text analysis by using a Python library and will learn about it in detail.

Introduction

You can perform text analysis by using the Python library called the Natural Language Toolkit (NLTK). Before proceeding into the concepts of NLTK, let us understand the relation between text analysis and web scraping.

Analyzing the words in a text tells us which words are important, which words are unusual, and how words are grouped. This analysis eases the task of web scraping.

Getting started with NLTK

The Natural Language Toolkit (NLTK) is a collection of Python libraries designed especially for identifying and tagging the parts of speech found in the text of a natural language like English.

Installing NLTK

You can use the following command to install NLTK in Python −

pip install nltk

If you are using Anaconda, then a conda package for NLTK can be installed by using the following command −

conda install -c anaconda nltk

Downloading NLTK’s Data

After installing NLTK, we have to download the preset text repositories. But before downloading the text repositories, we need to import NLTK with the help of the import command as follows −

import nltk

Now, the NLTK data can be downloaded with the help of the following command −

nltk.download()

Installing all the available NLTK packages will take some time, but it is always recommended to install all the packages.
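
If you prefer not to fetch everything through the interactive downloader, individual resources can also be downloaded by name. As a minimal sketch, the resources used by the examples later in this chapter (the punkt tokenizer models and the WordNet data) could be fetched as follows −

import nltk

# Download only selected resources instead of the full collection
nltk.download('punkt')     # tokenizer models used by sent_tokenize/word_tokenize
nltk.download('wordnet')   # lexical database used by WordNetLemmatizer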

Installing Other Necessary packages

We also need some other Python packages, such as gensim and pattern, for doing text analysis as well as for building natural language processing applications by using NLTK.

gensim − A robust semantic modeling library which is useful for many applications. It can be installed by the following command −

pip install gensim

pattern − Used to make the gensim package work properly. It can be installed by the following command −

pip install pattern

Tokenization

The process of breaking the given text into smaller units called tokens is known as tokenization. These tokens can be words, numbers, or punctuation marks. Tokenization is also called word segmentation.


The NLTK module provides different packages for tokenization. We can use these packages as per our requirement. Some of the packages are described here −

sent_tokenize package − This package will divide the input text into sentences. You can use the following command to import this package −

from nltk.tokenize import sent_tokenize

word_tokenize package − This package will divide the input text into words. You can use the following command to import this package −

from nltk.tokenize import word_tokenize

WordPunctTokenizer package − This package divides the input text into words, treating punctuation marks as separate tokens. You can use the following command to import this package −

from nltk.tokenize import WordPunctTokenizer
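
As a quick sketch of how these tokenizers behave, we can compare their output on the same text. The sample sentence below is our own, and the punkt tokenizer data must already be downloaded −

from nltk.tokenize import sent_tokenize, word_tokenize, WordPunctTokenizer

text = "Don't scrape too fast. Be polite to the server!"

print(sent_tokenize(text))                    # list of sentences
print(word_tokenize(text))                    # words; "Don't" becomes "Do" + "n't"
print(WordPunctTokenizer().tokenize(text))    # splits on punctuation: "Don", "'", "t", ...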

Stemming

In any language, words occur in different forms. A language includes many such variations for grammatical reasons. For example, consider the words democracy, democratic, and democratization. For machine learning as well as web scraping projects, it is important for machines to understand that these different words share the same base form. Hence it can be useful to extract the base forms of words while analyzing the text.

This can be achieved by stemming, which may be defined as the heuristic process of extracting the base forms of words by chopping off their endings.

The NLTK module provides different packages for stemming. We can use these packages as per our requirement. Some of these packages are described here −

PorterStemmer package − Porter’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

from nltk.stem.porter import PorterStemmer

For example, after giving the word ‘writing’ as the input to this stemmer, the output would be the word ‘write’ after stemming.

LancasterStemmer package − Lancaster’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

from nltk.stem.lancaster import LancasterStemmer

For example, after giving the word ‘writing’ as the input to this stemmer, the output would be the word ‘writ’ after stemming.

SnowballStemmer package − Snowball’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

from nltk.stem.snowball import SnowballStemmer

For example, after giving the word ‘writing’ as the input to this stemmer, the output would be the word ‘write’ after stemming.
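
The following minimal sketch compares the three stemmers on the word ‘writing’; note that SnowballStemmer also needs the language name −

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

word = 'writing'
print(PorterStemmer().stem(word))              # write
print(LancasterStemmer().stem(word))           # writ
print(SnowballStemmer('english').stem(word))   # write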

Lemmatization

Another way to extract the base form of words is lemmatization, which normally aims to remove inflectional endings by using a vocabulary and morphological analysis. The base form of any word after lemmatization is called its lemma.

The NLTK module provides the following package for lemmatization −

WordNetLemmatizer package − It will extract the base form of the word depending upon whether it is used as a noun or as a verb. You can use the following command to import this package −

from nltk.stem import WordNetLemmatizer
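
As a small sketch (the WordNet data downloaded earlier is required), the lemmatizer returns different lemmas depending on the part of speech passed in −

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('writing'))            # treated as a noun: writing
print(lemmatizer.lemmatize('writing', pos='v'))   # treated as a verb: write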

Chunking

Chunking, which means dividing the data into small chunks, is one of the important processes in natural language processing used to identify parts of speech and short phrases like noun phrases. Chunking labels groups of tokens. We can get the structure of a sentence with the help of the chunking process.

Example

In this example, we are going to implement noun-phrase chunking by using the NLTK Python module. Noun-phrase (NP) chunking is a category of chunking which finds the noun phrase chunks in a sentence.

Steps for implementing noun phrase chunking

We need to follow the steps given below for implementing noun-phrase chunking −

Step 1 − Chunk grammar definition

In the first step, we will define the grammar for chunking. It consists of the rules which we need to follow.

Step 2 − Chunk parser creation

Now, we will create a chunk parser. It will parse the sentence according to the grammar and produce the output.

Step 3 − The Output

In this last step, the output would be produced in a tree format.

First, we need to import the NLTK package as follows −

import nltk

Next, we need to define the sentence. Here DT is the determiner, VBP the verb, JJ the adjective, IN the preposition and NN the noun.

sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBP"),
   ("jumping", "VBP"), ("over", "IN"), ("the", "DT"), ("wall", "NN")]

Next, we give the grammar in the form of a regular expression: an NP (noun phrase) chunk consists of an optional determiner (DT), followed by any number of adjectives (JJ), followed by a noun (NN).

grammar = "NP:{<DT>?<JJ>*<NN>}"

Now, the next line of code will define a parser for parsing the grammar.

parser_chunking = nltk.RegexpParser(grammar)

Now, the parser will parse the sentence.

parser_chunking.parse(sentence)

Next, we store the output in a variable.

output = parser_chunking.parse(sentence)

With the help of the following code, we can draw our output in the form of a tree.

output.draw()
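
The draw() call opens a small GUI window, which may not be available in every environment (for example, on a headless server). As an alternative sketch, the same chunk tree can simply be printed to the console −

# Prints the tree; the grammar above groups "a clever fox" and "the wall" as NP chunks
print(output)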

Bag of Words (BoW) Model: Extracting and Converting Text into Numeric Form

Bag of Words (BoW) is a useful model in natural language processing which is basically used to extract features from text. After the features are extracted from the text, they can be used in modeling with machine learning algorithms, because raw text cannot be used directly in ML applications.

Working of BoW Model

Initially, the model extracts a vocabulary from all the words in the documents. Later, using a document-term matrix, it builds a model. In this way, the BoW model represents a document as a bag of words only, and the order or structure is discarded.

Example

Suppose we have the following two sentences −

Sentence1 − This is an example of Bag of Words model.

Sentence2 − We can extract features by using Bag of Words model.

Now, by considering these two sentences, we have the following 14 distinct words −

  1. This

  2. is

  3. an

  4. example

  5. bag

  6. of

  7. words

  8. model

  9. we

  10. can

  11. extract

  12. features

  13. by

  14. using

Building a Bag of Words Model in NLTK

Let us look at the following Python script, which will build a BoW model using the CountVectorizer class from scikit-learn.

First, import the following package −

from sklearn.feature_extraction.text import CountVectorizer

Next, define the set of sentences −

Sentences = ['This is an example of Bag of Words model.',
   'We can extract features by using Bag of Words model.']
vector_count = CountVectorizer()
features_text = vector_count.fit_transform(Sentences).todense()
print(vector_count.vocabulary_)

Output

It shows that we have 14 distinct words in the above two sentences −

{
   'this': 10, 'is': 7, 'an': 0, 'example': 4, 'of': 9,
   'bag': 1, 'words': 13, 'model': 8, 'we': 12, 'can': 3,
   'extract': 5, 'features': 6, 'by': 2, 'using':11
}
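
Besides the vocabulary, we can also inspect the numeric feature vectors themselves. The sketch below assumes a recent scikit-learn, where the vectorizer exposes get_feature_names_out() (older versions use get_feature_names()) −

print(vector_count.get_feature_names_out())   # the 14 words, in column order
print(features_text)                          # one row of word counts per sentence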

Topic Modeling: Identifying Patterns in Text Data

Generally, documents are grouped into topics, and topic modeling is a technique for identifying the patterns in a text that correspond to a particular topic. In other words, topic modeling is used to uncover abstract themes or hidden structure in a given set of documents.

You can use topic modeling in the following scenarios −

Text Classification

Classification can be improved by topic modeling because it groups similar words together rather than using each word separately as a feature.

Recommender Systems

We can build recommender systems by using similarity measures.

Topic Modeling Algorithms

We can implement topic modeling by using the following algorithms −

Latent Dirichlet Allocation (LDA) − It is one of the most popular algorithms; it uses probabilistic graphical models for implementing topic modeling.

Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI) − It is based on linear algebra and uses the concept of SVD (Singular Value Decomposition) on the document-term matrix.

Non-Negative Matrix Factorization (NMF) − Like LSA, it is also based on linear algebra.

The above-mentioned algorithms have the following elements −

  1. Number of topics: Parameter

  2. Document-Word Matrix: Input

  3. WTM (Word Topic Matrix) & TDM (Topic Document Matrix): Output
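
To illustrate these elements, the following minimal sketch builds an LDA model with the gensim package installed earlier in this chapter. The two sample documents and the choice of two topics are our own assumptions −

from gensim import corpora, models
from nltk.tokenize import word_tokenize

documents = [
   "This is an example of Bag of Words model.",
   "We can extract features by using Bag of Words model."
]

# Build the document-word input: tokenize and map each word to an integer id
texts = [word_tokenize(doc.lower()) for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# The number of topics is the parameter; the output is a word distribution per topic
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda_model.print_topics())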