Scipy 简明教程

SciPy - CSGraph

CSGraph 表示 Compressed Sparse Graph ，专注于基于稀疏矩阵表示的快速图算法。

CSGraph stands for Compressed Sparse Graph, which focuses on Fast graph algorithms based on sparse matrix representations.

Graph Representations

首先，我们了解一下稀疏图是什么，以及它在图表示中如何提供帮助。

To begin with, let us understand what a sparse graph is and how it helps in graph representations.

What exactly is a sparse graph?

图就是节点的集合，在节点之间有链接。图可以表示几乎所有内容，例如社交网络连接，其中每个节点都是一个人，并与熟人相连；图像，其中每个节点都是一个像素并与相邻像素相连；高维分布中的点，其中每个节点都与它最近的邻居相连；以及你能想象到的实际上任何其他东西。

A graph is just a collection of nodes, which have links between them. Graphs can represent nearly anything − social network connections, where each node is a person and is connected to acquaintances; images, where each node is a pixel and is connected to neighboring pixels; points in a high-dimensional distribution, where each node is connected to its nearest neighbors; and practically anything else you can imagine.

用稀疏矩阵表示图数据是一种非常有效的方法：让我们称之为 G。矩阵 G 的大小为 N x N，G[i, j] 给出了节点“i”和节点“j”之间连接的值。稀疏图主要包含零 − 也就是说，大多数节点只有少数连接。此特性在大多数情况下都是成立的。

One very efficient way to represent graph data is in a sparse matrix: let us call it G. The matrix G is of size N x N, and G[i, j] gives the value of the connection between node ‘i' and node ‘j’. A sparse graph contains mostly zeros − that is, most nodes have only a few connections. This property turns out to be true in most cases of interest.

创建稀疏图子模块的动机来自 scikit-learn 中使用的几种算法，包括以下算法 −

The creation of the sparse graph submodule was motivated by several algorithms used in scikit-learn that included the following −

Isomap − A manifold learning algorithm, which requires finding the shortest paths in a graph.
Hierarchical clustering − A clustering algorithm based on a minimum spanning tree.
Spectral Decomposition − A projection algorithm based on sparse graph laplacians.

举个具体的例子，假设我们要表示以下无向图 −

As a concrete example, imagine that we would like to represent the following undirected graph −

此图有三个节点，其中节点 0 和 1 由权重为 2 的边连接，节点 0 和 2 由权重为 1 的边连接。我们可以构造稠密、掩码和稀疏表示，如以下示例所示，同时谨记无向图由对称矩阵表示。

This graph has three nodes, where node 0 and 1 are connected by an edge of weight 2, and nodes 0 and 2 are connected by an edge of weight 1. We can construct the dense, masked and sparse representations as shown in the following example, keeping in mind that an undirected graph is represented by a symmetric matrix.

G_dense = np.array([ [0, 2, 1],
                     [2, 0, 0],
                     [1, 0, 0] ])

G_masked = np.ma.masked_values(G_dense, 0)
from scipy.sparse import csr_matrix

G_sparse = csr_matrix(G_dense)
print G_sparse.data

上述程序将生成以下输出。

The above program will generate the following output.

array([2, 1, 2, 1])

这与之前的图相同，只是节点 0 和 2 由权重为 0 的边连接。在这种情况下，以上的稠密表示会导致歧义 − 如果 0 是一个有意义的值，那么如何表示非边。在这种情况下，必须使用掩码或稀疏表示来消除歧义。

This is identical to the previous graph, except nodes 0 and 2 are connected by an edge of zero weight. In this case, the dense representation above leads to ambiguities − how can non-edges be represented, if zero is a meaningful value. In this case, either a masked or a sparse representation must be used to eliminate the ambiguity.

让我们考虑以下示例。

Let us consider the following example.

from scipy.sparse.csgraph import csgraph_from_dense
G2_data = np.array
([
   [np.inf, 2, 0 ],
   [2, np.inf, np.inf],
   [0, np.inf, np.inf]
])
G2_sparse = csgraph_from_dense(G2_data, null_value=np.inf)
print G2_sparse.data

上述程序将生成以下输出。

The above program will generate the following output.

array([ 2., 0., 2., 0.])

Word ladders using sparse graphs

词梯是由刘易斯·卡罗尔发明的游戏，在游戏中，单词通过每一步更改一个字母而链接在一起。例如 −

Word ladders is a game invented by Lewis Carroll, in which words are linked by changing a single letter at each step. For example −

APE → APT → AIT → BIT → BIG → BAG → MAG → MAN

在这里，我们在七个步骤中从“APE”走到了“MAN”，每次更改一个字母。问题是 - 我们可以使用相同的规则在这些单词之间找到更短的路径吗？这个问题自然表示为稀疏图问题。节点将对应于各个单词，我们将在相差最多一个字母的单词之间创建连接。

Here, we have gone from "APE" to "MAN" in seven steps, changing one letter each time. The question is - Can we find a shorter path between these words using the same rule? This problem is naturally expressed as a sparse graph problem. The nodes will correspond to individual words, and we will create connections between words that differ by at the most – one letter.

Obtaining a List of Words

当然，首先，我们必须获得一个有效单词列表。我在使用 Mac，而 Mac 在以下代码块中给出的位置有一个单词词典。如果你的架构不同，你可能不得不搜索一番才能找到你的系统词典。

First, of course, we must obtain a list of valid words. I am running Mac, and Mac has a word dictionary at the location given in the following code block. If you are on a different architecture, you may have to search a bit to find your system dictionary.

wordlist = open('/usr/share/dict/words').read().split()
print len(wordlist)

上述程序将生成以下输出。

The above program will generate the following output.

我们现在想查看长度为 3 的单词，所以让我们只选择长度正确的那些单词。我们还将剔除以大写字母（专有名词）开头或包含非字母数字字符（例如撇号和连字符）的单词。最后，我们将确保一切都在小写中以备以后比较。

We now want to look at words of length 3, so let us select just those words of the correct length. We will also eliminate words, which start with upper case (proper nouns) or contain non-alpha-numeric characters such as apostrophes and hyphens. Finally, we will make sure everything is in lower case for a comparison later on.

word_list = [word for word in word_list if len(word) == 3]
word_list = [word for word in word_list if word[0].islower()]
word_list = [word for word in word_list if word.isalpha()]
word_list = map(str.lower, word_list)
print len(word_list)

上述程序将生成以下输出。

The above program will generate the following output.

现在，我们有一个 1135 个有效的三字母单词列表（确切数量可能会因所使用的特定列表而异）。这些单词中的每一个都将成为我们图中的一个节点，我们将创建连接节点的边，每个节点都与另一对单词相关联，这些单词仅相差一个字母。

Now, we have a list of 1135 valid three-letter words (the exact number may change depending on the particular list used). Each of these words will become a node in our graph, and we will create edges connecting the nodes associated with each pair of words, which differs by only one letter.

import numpy as np
word_list = np.asarray(word_list)

word_list.dtype
word_list.sort()

word_bytes = np.ndarray((word_list.size, word_list.itemsize),
   dtype = 'int8',
   buffer = word_list.data)
print word_bytes.shape

上述程序将生成以下输出。

The above program will generate the following output.

(1135, 3)

我们将使用各点之间的汉明距离来确定哪些词对是相连的。汉明距离测量两个向量之间哪些条目不同：汉明距离等于 1/N1/N的两个单词相连，其中 NN 是该单词中相连的字母的数量。

We will use the Hamming distance between each point to determine, which pairs of words are connected. The Hamming distance measures the fraction of entries between two vectors, which differ: any two words with a hamming distance equal to 1/N1/N, where NN is the number of letters, which are connected in the word ladder.

from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
hamming_dist = pdist(word_bytes, metric = 'hamming')
graph = csr_matrix(squareform(hamming_dist < 1.5 / word_list.itemsize))

在比较距离时，我们不使用相等，因为这对于浮点值是不稳定的。只要单词列表的没有两个条目相同，不等式产生所需的结果。现在，我们的图已经建立好，我们将使用最短路径搜索在图中任意两个单词之间找到路径。

When comparing the distances, we do not use equality because this can be unstable for floating point values. The inequality produces the desired result as long as no two entries of the word list are identical. Now, that our graph is set up, we will use the shortest path search to find the path between any two words in the graph.

i1 = word_list.searchsorted('ape')
i2 = word_list.searchsorted('man')
print word_list[i1],word_list[i2]

上述程序将生成以下输出。

The above program will generate the following output.

ape, man

我们需要检查它们是否匹配，因为如果单词没有在列表中，输出中将会有一个错误。现在，我们需要做的就是在这个图中这两个索引之间找到最短路径。我们将使用 dijkstra’s 算法，因为它允许我们仅为一个节点找到路径。

We need to check that these match, because if the words are not in the list there will be an error in the output. Now, all we need is to find the shortest path between these two indices in the graph. We will use dijkstra’s algorithm, because it allows us to find the path for just one node.

from scipy.sparse.csgraph import dijkstra
distances, predecessors = dijkstra(graph, indices = i1, return_predecessors = True)
print distances[i2]

上述程序将生成以下输出。

The above program will generate the following output.

5.0

因此，我们看到“ape”和“man”之间的最短路径仅包含五个步骤。我们可以使用该算法返回的前驱来重建该路径。

Thus, we see that the shortest path between ‘ape’ and ‘man’ contains only five steps. We can use the predecessors returned by the algorithm to reconstruct this path.

path = []
i = i2

while i != i1:
   path.append(word_list[i])
   i = predecessors[i]

path.append(word_list[i1])
print path[::-1]i2]

上述程序将生成以下输出。

The above program will generate the following output.

['ape', 'ope', 'opt', 'oat', 'mat', 'man']