Python Text Processing: A Concise Tutorial

Python - Process Word Document

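To read a Word document, we use the python-docx package, which is imported under the module name docx. We first install the package as shown below, then write a program that uses the module to read the whole file paragraph by paragraph.

The following command installs the package into the environment. The name to install from PyPI is python-docx; the separate PyPI package named docx is an older distribution with a different API and will not work with the code in this chapter.

 pip install python-docx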

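In the following example, we read the contents of a Word document by appending the text of each paragraph to a list, joining the list with newlines, and printing the result.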

import docx

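def readtxt(filename):
    # Open the .docx file and collect the text of every paragraph
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    # Join the paragraphs with newlines so each one starts on its own line
    return '\n'.join(fullText)

# Raw string so the backslash in the Windows-style path is kept literally
print(readtxt(r'path\Tutorialspoint.docx'))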

When we run the above program, we get the following output:

Tutorials Point originated from the idea that there exists a class of readers who respond
better to online content and prefer to learn new skills at their own pace from the comforts
of their drawing rooms.

The journey commenced with a single tutorial on HTML in 2006 and elated by the response it generated,
we worked our way to adding fresh tutorials to our repository which now proudly flaunts
a wealth of tutorials and allied articles on topics ranging from programming languages
to web designing to academics and much more.
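
For intuition about what doc.paragraphs iterates over, it can help to know that a .docx file is a ZIP archive whose main text lives in word/document.xml. The following is a minimal standard-library sketch, not how python-docx is implemented: the helper names read_paragraphs and make_sample_docx are hypothetical, the sample archive contains only the one XML part (so Word itself would not open it), and the reader only joins plain text runs, ignoring tabs, line breaks, and text inside tables.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_paragraphs(source):
    """Return the text of each top-level body paragraph in a .docx file.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find("w:body", NS)
    paragraphs = []
    for para in body.findall("w:p", NS):
        # A paragraph's text is spread over one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return paragraphs


def make_sample_docx(texts):
    """Build an in-memory archive holding only word/document.xml (demo only)."""
    paras = "".join(f"<w:p><w:r><w:t>{t}</w:t></w:r></w:p>" for t in texts)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{paras}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


sample = make_sample_docx(["First paragraph.", "", "Second block of text."])
print(read_paragraphs(sample))
# ['First paragraph.', '', 'Second block of text.']
```

The empty string in the middle of the result corresponds to an empty paragraph, which is how a blank line between two blocks of text typically appears in the paragraph list.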

Reading Individual Paragraphs

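We can use the paragraphs attribute to read a specific paragraph from the Word document. In the following example, we read only the second block of text. Indexes start at 0, so doc.paragraphs[2] is the third paragraph object; the blank line in the earlier output suggests that index 1 holds an empty paragraph in this document.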

import docx

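doc = docx.Document(r'path\Tutorialspoint.docx')
# Number of paragraph objects in the document, including empty ones
print(len(doc.paragraphs))

# Indexes start at 0, so index 2 is the third paragraph object
print(doc.paragraphs[2].text)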

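When we run the above program, we get the following output (the paragraph count printed by the first statement is not shown):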

The journey commenced with a single tutorial on HTML in 2006 and elated by the response
it generated, we worked our way to adding fresh tutorials to our repository
which now proudly flaunts a wealth of tutorials and allied articles on topics
ranging from programming languages to web designing to academics and much more.
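
Because empty paragraphs count toward the index, picking "the second paragraph" by position can be fragile. One possible approach, sketched here under assumptions, is to filter out empty paragraphs before indexing. The helper names non_empty_texts and nth_paragraph are hypothetical, and the sample uses stand-in objects with a .text attribute in place of real paragraph objects so that the snippet runs without a document on disk.

```python
from types import SimpleNamespace


def non_empty_texts(paragraphs):
    """Return the text of every paragraph that contains visible characters."""
    return [p.text for p in paragraphs if p.text.strip()]


def nth_paragraph(paragraphs, n):
    """Return the n-th non-empty paragraph (1-based), or None if there are fewer than n."""
    texts = non_empty_texts(paragraphs)
    return texts[n - 1] if 1 <= n <= len(texts) else None


# Stand-ins for doc.paragraphs: any objects with a .text attribute work here
sample = [
    SimpleNamespace(text="First block."),
    SimpleNamespace(text=""),
    SimpleNamespace(text="Second block."),
]

print(len(sample))               # 3 paragraph objects, one of them empty
print(nth_paragraph(sample, 2))  # Second block.
print(nth_paragraph(sample, 5))  # None
```

With a real file, the same helpers should accept doc.paragraphs directly, for example nth_paragraph(docx.Document(filename).paragraphs, 2), since each paragraph object exposes a text attribute as used in the examples above.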