Python Data Science: A Concise Tutorial

Python - Processing Unstructured Data

Data that is already in row-and-column format, or that can easily be converted to rows and columns so that it fits neatly into a database, is known as structured data. Examples include CSV, TXT and XLS files. These files have a delimiter and either fixed or variable width, and missing values are represented as blanks between the delimiters. Sometimes, however, we get data whose lines are not fixed width, or that arrives as HTML, image or PDF files. Such data is known as unstructured data. An HTML file can be handled by processing its HTML tags, but a feed from Twitter or a plain text document from a news feed has neither delimiters nor tags to work with. In such scenarios we use built-in functions from various Python libraries to process the file.
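As a rough illustration of handling an HTML file by processing its tags, the sketch below uses the standard library's html.parser module to keep only the text found between tags. The TextExtractor class name and the html_doc snippet are hypothetical, introduced only for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between HTML tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # The parser calls this for each run of text outside the tags.
        if data.strip():
            self.chunks.append(data.strip())


# Hypothetical HTML snippet standing in for a real page.
html_doc = "<html><body><h1>Python</h1><p>Python is an <b>interpreted</b> language.</p></body></html>"

extractor = TextExtractor()
extractor.feed(html_doc)
extractor.close()

# Join the collected pieces into one plain-text string.
text = " ".join(extractor.chunks)
print(text)
```

Once the tags are removed, the remaining text can be processed like any other plain text document.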

Reading Data

In the example below we take a text file and read it one line at a time, printing each line separately. The output can then be divided further into lines and words. The original file is a text file containing some paragraphs describing the Python language.

# Placeholder path; the raw string keeps the backslash literal
filename = r'path\input.txt'

with open(filename) as fn:
    # Read the first line
    ln = fn.readline()

    # Keep count of lines
    lncnt = 1
    while ln:
        print("Line {}: {}".format(lncnt, ln.strip()))
        ln = fn.readline()
        lncnt += 1

When we execute the above code, it produces the following result.

Line 1: Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.
Line 2: Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
Line 3: Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its variant implementations. CPython is managed by the non-profit Python Software Foundation.
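The text above mentions dividing the output further into lines and words. A minimal sketch of that step follows, assuming the file contents are already in memory as a string. The sample_text value and the split_into_lines_and_words function are hypothetical names used only for illustration.

```python
# Hypothetical sample text standing in for the contents of the input file.
sample_text = (
    "Python is an interpreted high-level programming language.\n"
    "Python features a dynamic type system.\n"
)


def split_into_lines_and_words(text):
    """Return one list of words for each non-empty line of text."""
    lines = text.splitlines()
    # str.split() with no arguments splits on any run of whitespace.
    return [line.split() for line in lines if line.strip()]


words_per_line = split_into_lines_and_words(sample_text)
for number, words in enumerate(words_per_line, start=1):
    print("Line {}: {} words, first word: {}".format(number, len(words), words[0]))
```

With a real file, the same function could be given the string returned by the file object's read() method.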

Counting Word Frequency

We can count the frequency of the words in the file using the Counter class from the collections module, as follows.

from collections import Counter

with open(r'pathinput2.txt') as f:
    # Split the whole file on whitespace and count each distinct word
    p = Counter(f.read().split())
    print(p)

When we execute the above code, it produces the following result.

Counter({'and': 3, 'Python': 3, 'that': 2, 'a': 2, 'programming': 2, 'code': 1, '1991,': 1, 'is': 1, 'programming.': 1, 'dynamic': 1, 'an': 1, 'design': 1, 'in': 1, 'high-level': 1, 'management.': 1, 'features': 1, 'readability,': 1, 'van': 1, 'both': 1, 'for': 1, 'Rossum': 1, 'system': 1, 'provides': 1, 'memory': 1, 'has': 1, 'type': 1, 'enable': 1, 'Created': 1, 'philosophy': 1, 'constructs': 1, 'emphasizes': 1, 'general-purpose': 1, 'notably': 1, 'released': 1, 'significant': 1, 'Guido': 1, 'using': 1, 'interpreted': 1, 'by': 1, 'on': 1, 'language': 1, 'whitespace.': 1, 'clear': 1, 'It': 1, 'large': 1, 'small': 1, 'automatic': 1, 'scales.': 1, 'first': 1})
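In the result above, tokens such as '1991,' and 'programming.' keep their punctuation, because split() separates only on whitespace, and 'It' would be counted separately from 'it'. One possible refinement, sketched here under the assumption that case and surrounding punctuation should be ignored, is to clean each token before counting. The count_words function and the sample string are hypothetical.

```python
import string
from collections import Counter


def count_words(text):
    """Count words after lowercasing and stripping surrounding punctuation."""
    words = []
    for token in text.split():
        # Strip punctuation only from the ends, so 'high-level' stays intact.
        cleaned = token.strip(string.punctuation).lower()
        if cleaned:
            words.append(cleaned)
    return Counter(words)


# Hypothetical sample text; with a real file this would come from f.read().
sample = "Python is an interpreted language. Created in 1991, Python emphasizes readability."
counts = count_words(sample)

# Show the three most frequent words.
print(counts.most_common(3))
```

Whether to normalize in this way depends on the analysis: for some tasks, case and punctuation carry meaning and should be kept.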