Data Structures Algorithms 简明教程

Longest Common Subsequence Algorithm

最长公共子序列问题是找到两个给定字符串中都存在的最长序列。

The longest common subsequence problem is finding the longest sequence which exists in both the given strings.

但在我们理解这个问题之前，我们先来理解一下什么是子序列 -

But before we understand the problem, let us understand what the term subsequence is −

假设我们有一个序列 S = <s1, s2, s3, s4, …,sn>。并且另一个序列 Z = <z1, z2, z3, …,zm> 在 S 上被称为 S 的子序列，当且仅当它可以通过删除 S 中的某些元素而生成。简单来说，一个子序列是由构成序列中一个小部分的连续元素组成的。

Let us consider a sequence S = <s1, s2, s3, s4, …,sn>. And another sequence Z = <z1, z2, z3, …,zm> over S is called a subsequence of S, if and only if it can be derived from S deletion of some elements. In simple words, a subsequence consists of consecutive elements that make up a small part in a sequence.

Common Subsequence

假设 X 和 Y 是有限元素集上的两个序列。我们可以说 Z 是 X 和 Y 的一个公共子序列，如果 Z 同时是 X 和 Y 的子序列。

Suppose, X and Y are two sequences over a finite set of elements. We can say that Z is a common subsequence of X and Y, if Z is a subsequence of both X and Y.

Longest Common Subsequence

如果给定了一组序列，最长公共子序列问题就是找到所有序列的一个公共子序列，其长度最大。

If a set of sequences are given, the longest common subsequence problem is to find a common subsequence of all the sequences that is of maximal length.

Naive Method

令 X 是长度为 m 的序列， Y 是长度为 n 的序列。检查 X 的每一个子序列是否都是 Y 的子序列，并返回找到的最长公共子序列。

Let X be a sequence of length m and Y a sequence of length n. Check for every subsequence of X whether it is a subsequence of Y, and return the longest common subsequence found.

X 中有 2m 个子序列。测试序列是否为 Y 的子序列需要 O(n) 的时间。因此，朴素算法将需要 O(n2m) 的时间。

There are 2m subsequences of X. Testing sequences whether or not it is a subsequence of Y takes O(n) time. Thus, the naive algorithm would take O(n2m) time.

Longest Common Subsequence Algorithm

令 X=<x1,x2,x3….,xm> 和 Y=<y1,y2,y3….,ym> 为序列。要计算一个元素的长度，可以使用以下算法。

Let X=<x1,x2,x3….,xm> and Y=<y1,y2,y3….,ym> be the sequences. To compute the length of an element the following algorithm is used.

Step 1 - 构造一个大小为 n × m 的空邻接表，其中 n = 序列 X 的大小，m = 序列 Y 的大小。表中的行表示序列 X 中的元素，列表示序列 Y 中的元素。

Step 1 − Construct an empty adjacency table with the size, n × m, where n = size of sequence X and m = size of sequence Y. The rows in the table represent the elements in sequence X and columns represent the elements in sequence Y.

Step 2 - 第 0 行和列必须用零填充。剩下的值根据不同的情况填充，通过维护一个计数器值。

Step 2 − The zeroeth rows and columns must be filled with zeroes. And the remaining values are filled in based on different cases, by maintaining a counter value.

Case 1 − If the counter encounters common element in both X and Y sequences, increment the counter by 1.
Case 2 − If the counter does not encounter common elements in X and Y sequences at T[i, j], find the maximum value between T[i-1, j] and T[i, j-1] to fill it in T[i, j].

Step 3 - 填充表格后，从表格中的最后一个值回溯。此处的回溯通过跟踪计数器首先增量的位置来完成。

Step 3 − Once the table is filled, backtrack from the last value in the table. Backtracking here is done by tracing the path where the counter incremented first.

Step 4 - 最长的公共子序列是通过记录跟踪路径中的元素而获得的。

Step 4 − The longest common subseqence obtained by noting the elements in the traced path.

Pseudocode

在此过程中，表 C[m, n] 以行优先顺序计算，另一个表 B[m,n] 计算用于构造最优解。

In this procedure, table C[m, n] is computed in row major order and another table B[m,n] is computed to construct optimal solution.

Algorithm: LCS-Length-Table-Formulation (X, Y)
m := length(X)
n := length(Y)
for i = 1 to m do
   C[i, 0] := 0
for j = 1 to n do
   C[0, j] := 0
for i = 1 to m do
   for j = 1 to n do
      if xi = yj
         C[i, j] := C[i - 1, j - 1] + 1
         B[i, j] := ‘D’
      else
         if C[i -1, j] ≥ C[i, j -1]
            C[i, j] := C[i - 1, j] + 1
            B[i, j] := ‘U’
         else
            C[i, j] := C[i, j - 1] + 1
            B[i, j] := ‘L’
return C and B

Algorithm: Print-LCS (B, X, i, j)
if i=0 and j=0
   return
if B[i, j] = ‘D’
   Print-LCS(B, X, i-1, j-1)
   Print(xi)
else if B[i, j] = ‘U’
   Print-LCS(B, X, i-1, j)
else
   Print-LCS(B, X, i, j-1)

此算法将打印 X 和 Y 的最长公共子序列。

This algorithm will print the longest common subsequence of X and Y.

Analysis

要填充表格，外部 for 循环迭代 m 次，内部 for 循环迭代 n 次。因此，该算法的复杂度为 O(m,n) ，其中 m 和 n 是两个字符串的长度。

To populate the table, the outer for loop iterates m times and the inner for loop iterates n times. Hence, the complexity of the algorithm is O(m,n), where m and n are the length of two strings.

Example

在这个例子中，我们有两个字符串 X=BACDB 和 Y=BDCB ，需要找到它们的最长公共子序列。

In this example, we have two strings X=BACDB and Y=BDCB to find the longest common subsequence.

根据算法，我们需要计算两个表 1 和 2。

Following the algorithm, we need to calculate two tables 1 and 2.

给定 n = X 的长度，m = Y 的长度

Given n = length of X, m = length of Y

X = BDCB，Y = BACDB

X = BDCB, Y = BACDB

Constructing the LCS Tables

在下表中，零行和零列用零填充。通过增加和根据算法选择最大值来填充其余值。

In the table below, the zeroeth rows and columns are filled with zeroes. Remianing values are filled by incrementing and choosing the maximum values according to the algorithm.

填充完值后，从表中最后一个值 T[4, 5] 中跟踪回路径。

Once the values are filled, the path is traced back from the last value in the table at T[4, 5].

从跟踪路径中，通过选择首次增加计数器的值来找到最长公共子序列。

From the traced path, the longest common subsequence is found by choosing the values where the counter is first incremented.

在此示例中，最终计数为 3，因此计数器在 3 个地方增加，即 B、C、B。因此，序列 X 和 Y 的最长公共子序列为 BCB。

In this example, the final count is 3 so the counter is incremented at 3 places, i.e., B, C, B. Therefore, the longest common subsequence of sequences X and Y is BCB.

Implementation

以下是使用动态规划方法查找最长公共子序列的最终实现：

Following is the final implementation to find the Longest Common Subsequence using Dynamic Programming Approach −

Applications

最长公共子序列问题是一个经典的计算机科学问题，是 diff-utility 等数据比较程序的基础，并在生物信息学中得到了应用。它还被版本控制系统广泛使用，例如 SVN 和 Git，用于协调对受版本控制的文件集合进行的多次更改。

The longest common subsequence problem is a classic computer science problem, the basis of data comparison programs such as the diff-utility, and has applications in bioinformatics. It is also widely used by revision control systems, such as SVN and Git, for reconciling multiple changes made to a revision-controlled collection of files.