An Overview of Fuzzy String Matching Methods

Two text-processing problems come up regularly in my work: finding the substring of a long text that best matches a short query string, and aligning two strings while ignoring the parts where they differ. I plan to write a dedicated tool for both, so I first surveyed the available fuzzy matching and string alignment libraries.

References:
https://zhuanlan.zhihu.com/p/53135935
https://www.thinbug.com/q/17740833
https://github.com/eseraygun/python-alignment
https://pypi.org/project/StringDist/
https://pypi.org/project/edlib/
https://pypi.org/project/strsimpy/
https://github.com/gfairchild/pyxDamerauLevenshtein
https://github.com/mbreese/swalign/
Table printing: https://pypi.org/project/tabulate/
https://pypi.org/project/weighted-levenshtein/
https://pypi.org/project/nwalign/
https://pypi.org/project/pyhacrf-datamade/
Table printing: https://pypi.org/project/Frmt/

difflib

References:
https://blog.csdn.net/Lockey23/article/details/77913855
https://blog.csdn.net/gavin_john/article/details/78951698
https://docs.python.org/3.5/library/difflib.html

The difflib module provides classes and functions for comparing sequences. Among other things, it can compare files and emit the differences either as plain text or as an HTML comparison page.
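
As a quick illustration of that file-diffing use case, difflib's documented unified_diff() and HtmlDiff helpers can be used like this (the file names here are made up):

import difflib

a = ['one\n', 'two\n', 'three\n']
b = ['one\n', 'tree\n', 'three\n', 'four\n']

# Plain-text unified diff between two lists of lines
print(''.join(difflib.unified_diff(a, b, fromfile='a.txt', tofile='b.txt')))

# Self-contained HTML page showing the comparison side by side
with open('diff.html', 'w') as f:
    f.write(difflib.HtmlDiff().make_file(a, b, fromdesc='a.txt', todesc='b.txt'))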

SequenceMatcher

The SequenceMatcher class can compare two sequences of any type, as long as the elements are hashable. Its algorithm looks for the longest contiguous matching subsequences while ignoring "junk" elements that carry no useful information. The following code computes a fuzzy-match similarity score:

>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571
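
On top of this ratio, difflib also provides get_close_matches(), a convenience function that uses the same scoring to pick the best fuzzy matches for a word from a list of candidates; the example below is taken from the Python documentation:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']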

The basic algorithm predates, and is a little fancier than, the "gestalt pattern matching" algorithm published by Ratcliff and Obershelp in the late 1980s. The idea is to find the longest contiguous matching subsequence that contains no "junk" elements, junk being elements that are uninteresting in some sense, such as blank lines or whitespace (junk handling is an extension of the Ratcliff-Obershelp algorithm). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but it does tend to yield matches that "look right" to people.
In terms of time complexity, the basic Ratcliff-Obershelp algorithm is cubic in the worst case and quadratic in the expected case. SequenceMatcher is quadratic in the worst case, and its expected-case behavior depends in a complicated way on how many elements the sequences have in common; the best case is linear time.
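
When only a rough score is needed, SequenceMatcher also exposes two documented upper bounds on ratio() that are progressively cheaper to compute:

import difflib

m = difflib.SequenceMatcher(None, 'amazing', 'amaging')
print(m.ratio())             # exact similarity, most expensive
print(m.quick_ratio())       # upper bound on ratio(), faster
print(m.real_quick_ratio())  # even looser upper bound, fastest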

SequenceMatcher supports a heuristic that automatically treats certain sequence items as junk. The heuristic counts how many times each item appears in the sequence. If an item's duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, the item is marked as "popular" and treated as junk for the purposes of matching. The heuristic can be turned off by passing autojunk=False when constructing the SequenceMatcher.
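The effect of the heuristic is easy to observe on a long, repetitive pair of strings; a minimal sketch (the strings are my own, not from the documentation):

import difflib

# 'x' repeats far more than 1% of a 300+ item sequence, so with the
# default autojunk=True it is treated as junk and barely matched.
a = 'x' * 300 + 'hello'
b = 'x' * 300 + 'hallo'
print(difflib.SequenceMatcher(None, a, b).ratio())                  # low score
print(difflib.SequenceMatcher(None, a, b, autojunk=False).ratio())  # close to 1.0

The following code walks through all of the edit operations returned by get_opcodes():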

import difflib

s1 = [1, 2, 3, 5, 6, 4]
s2 = [2, 3, 5, 4, 6, 1]
# To ignore all whitespace, pass an isjunk function instead of None:
# SequenceMatcher(lambda x: x == ' ', A, B)
matcher = difflib.SequenceMatcher(None, s1, s2)
# Apply the opcodes in reverse so earlier indices stay valid while s1 mutates.
for tag, i1, i2, j1, j2 in reversed(matcher.get_opcodes()):
    if tag == 'delete':
        print('Remove {} from positions [{}:{}]'.format(s1[i1:i2], i1, i2))
        print('  before =', s1)
        del s1[i1:i2]
    elif tag == 'equal':
        print('s1[{}:{}] and s2[{}:{}] are the same'.format(i1, i2, j1, j2))
    elif tag == 'insert':
        print('Insert {} from s2[{}:{}] into s1 at {}'.format(s2[j1:j2], j1, j2, i1))
        print('  before =', s1)
        s1[i1:i2] = s2[j1:j2]
    elif tag == 'replace':
        print(('Replace {} from s1[{}:{}] '
               'with {} from s2[{}:{}]').format(s1[i1:i2], i1, i2, s2[j1:j2], j1, j2))
        print('  before =', s1)
        s1[i1:i2] = s2[j1:j2]
    print('   after =', s1, '\n')
print('s1 == s2:', s1 == s2)

Output:

Replace [4] from s1[5:6] with [1] from s2[5:6]
  before = [1, 2, 3, 5, 6, 4]
   after = [1, 2, 3, 5, 6, 1] 

s1[4:5] and s2[4:5] are the same
   after = [1, 2, 3, 5, 6, 1] 

Insert [4] from s2[3:4] into s1 at 4
  before = [1, 2, 3, 5, 6, 1]
   after = [1, 2, 3, 5, 4, 6, 1] 

s1[1:4] and s2[0:3] are the same
   after = [1, 2, 3, 5, 4, 6, 1] 

Remove [1] from positions [0:1]
  before = [1, 2, 3, 5, 4, 6, 1]
   after = [2, 3, 5, 4, 6, 1] 

s1 == s2: True

get_matching_blocks

Returns a list of triples describing the matching subsequences. Each triple has the form (i, j, n) and means a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j. The last triple is a dummy with the value (len(a), len(b), 0); it is the only triple with n == 0.

>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abxcd", "abcd")
>>> s.get_matching_blocks()
[Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]

To collect all of the matching substrings, you can write something like this:

import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

Result:

>>> print(list(matches(large_string, query_string, 0.8)))
['manhatan', 'manhattn']
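
The generator above works at word granularity and therefore depends on the long text containing spaces. For the first problem from the introduction (finding the substring of a long text closest to a short query), a character-level sliding window over SequenceMatcher is a simple alternative; this is a minimal sketch with a hypothetical helper name, not code from the referenced posts:

import difflib

def best_fuzzy_substring(haystack, needle):
    # Score every needle-sized window of haystack and keep the best one.
    best = (0.0, '')
    for start in range(max(len(haystack) - len(needle) + 1, 1)):
        window = haystack[start:start + len(needle)]
        ratio = difflib.SequenceMatcher(None, window, needle).ratio()
        best = max(best, (ratio, window))
    return best

print(best_fuzzy_substring('thelargemanhatanproject', 'manhattan'))

Fixed-size windows are a deliberate simplification; also trying window lengths of len(needle) ± 1 or 2 would catch matches containing insertions or deletions.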

fuzzywuzzy

References:
https://github.com/seatgeek/fuzzywuzzy
https://zhuanlan.zhihu.com/p/77166627
https://blog.csdn.net/laobai1015/article/details/80451371
https://stackoverflow.com/questions/48671270/use-sklearn-tfidfvectorizer-with-already-tokenized-inputs
https://github.com/ing-bank/sparse_dot_topn

FuzzyWuzzy is a simple, easy-to-use fuzzy string matching toolkit. It computes the differences between two sequences based on the Levenshtein distance.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Simple ratio
fuzz.ratio("this is a test", "this is a test!")

# Partial ratio: match against the best-matching substring
fuzz.partial_ratio("this is a test", "this is a test!")

# Token sort ratio: ignore word order
fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

# Token set ratio: also deduplicate tokens
fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")

# Return the closest fuzzy matches along with their scores
choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
process.extract("new york jets", choices, limit=2)
# Returns: [('New York Jets', 100), ('New York Giants', 78)]
process.extractOne("cowboys", choices)
# Returns: ("Dallas Cowboys", 90)
# Additional arguments can be passed to extractOne to select a specific scorer;
# a typical use case is matching file paths (songs below is a list of paths)
process.extractOne("System of a down - Hypnotize - Heroin", songs)
# Returns: ('/music/library/good/System of a Down/2005 - Hypnotize/01 - Attack.mp3', 86)
process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
# Returns: ("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)

The methods above can be used to find the answer closest to a query among a set of candidate answers, but they become slow on large data. One way to cut the running time is to first narrow the candidates down to a small subset, and only then apply fuzzywuzzy.
First, convert the candidate answers into tf-idf vectors:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
choices = [
    ["candle"],
    ["Don't", "trouble", "trouble", "until", "trouble", "troubles", "you."],
    ["A", "bad", "excuse", "is", "better", "than", "none", "at", "all."],
    ["Bad", "excuses", "are", "worse", "than", "none."],
    ["A", "bribe", "in", "hand", "betrays", "mischief", "at", "heart."],
    ["A", "candle", "lights", "others", "and", "consumes", "itself."],
    ["Don't", "teach", "your", "grandmother", "to", "suck", "eggs."],
    ["A", "teacher", "is", "just", "a", "candle", ",", "lights", "others", "and", "consumes", "itself."],
    ["A", "a", "candle", "lights", "others", "and", "consumes", "itself."],
]
# Tokenize by word; 'char' and 'char_wb' analyzers are also available:
# vectorizer = TfidfVectorizer(min_df=1, analyzer='word')
# A custom analyzer also works; here the input is already tokenized:
vectorizer = TfidfVectorizer(analyzer=lambda x: [w for w in x if w not in stop_words])
tf_idf_matrix_candidates = vectorizer.fit_transform(choices)
tf_idf_matrix_queries = tf_idf_matrix_candidates[-1]
tf_idf_matrix_candidates = tf_idf_matrix_candidates[:-1]
# vectorizer.get_feature_names() lists all tokens
# (use get_feature_names_out() on newer scikit-learn versions)

Next, use sparse_dot_topn to find the most similar strings:

import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct

def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # Force A and B into CSR format; if they are already CSR there is no overhead.
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
    idx_dtype = np.int32
    nnz_max = M * ntop
    indptr = np.zeros(M + 1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)
    # Keep only the ntop largest dot products per row, above lower_bound.
    ct.sparse_dot_topn(
        M, N,
        np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop, lower_bound,
        indptr, indices, data)
    return csr_matrix((data, indices, indptr), shape=(M, N))

matches = awesome_cossim_top(tf_idf_matrix_candidates, tf_idf_matrix_queries.transpose(), 1, 0.0).todense()
matches = np.squeeze(matches)
match_score_index = np.argsort(-matches)
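
The last step, re-scoring only the top tf-idf candidates with fuzzywuzzy, could then look like the following sketch (top_k is a hypothetical cutoff; the variables come from the two snippets above):

from fuzzywuzzy import fuzz

top_k = 3  # hypothetical cutoff; tune for your data
query_tokens = choices[-1]
candidate_ids = np.asarray(match_score_index).ravel()[:top_k]

# Only the pre-filtered candidates are scored with the slower fuzzywuzzy.
best_score, best_id = max(
    (fuzz.token_sort_ratio(' '.join(choices[i]), ' '.join(query_tokens)), i)
    for i in candidate_ids)
print(best_score, ' '.join(choices[best_id]))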

alignment

The alignment package is mainly used to align two sequences against each other. It is used as follows:

from alignment.sequence import Sequence
from alignment.vocabulary import Vocabulary
from alignment.sequencealigner import SimpleScoring, GlobalSequenceAligner

# Create sequences to be aligned.
a = Sequence('what a beautiful day'.split())
b = Sequence('what a disappointingly bad day'.split())

# Create a vocabulary and encode the sequences.
v = Vocabulary()
aEncoded = v.encodeSequence(a)
bEncoded = v.encodeSequence(b)

# Create a scoring and align the sequences using global aligner.
scoring = SimpleScoring(1, -1)
aligner = GlobalSequenceAligner(scoring, -1)
score, encodeds = aligner.align(aEncoded, bEncoded, backtrace=True)

# Iterate over optimal alignments and print them.
for encoded in encodeds:
    alignment = v.decodeSequenceAlignment(encoded)
    print(alignment)
    print('Alignment score:', alignment.score)
    print('Percent identity:', alignment.percentIdentity())
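
For the substring-location problem from the introduction, the same package also provides a LocalSequenceAligner (Smith-Waterman style), which scores only the best-matching region and ignores unrelated prefixes and suffixes. A minimal sketch reusing the setup above (the aligner class is listed in the project's README, but the call here is my own):

from alignment.sequencealigner import LocalSequenceAligner

localAligner = LocalSequenceAligner(scoring, -1)
score, encodeds = localAligner.align(aEncoded, bEncoded, backtrace=True)
for encoded in encodeds:
    print(v.decodeSequenceAlignment(encoded))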

strsimpy

This is a package that implements a variety of string distance and similarity measures. It is used as follows:

from strsimpy.levenshtein import Levenshtein
levenshtein = Levenshtein()
print(levenshtein.distance('My string', 'My $string'))

from strsimpy.normalized_levenshtein import NormalizedLevenshtein
normalized_levenshtein = NormalizedLevenshtein()
print(normalized_levenshtein.distance('My string', 'My $string'))
print(normalized_levenshtein.similarity('My string', 'My $string'))

# Weighted edit distance
from strsimpy.weighted_levenshtein import WeightedLevenshtein
from strsimpy.weighted_levenshtein import CharacterSubstitutionInterface

class CharacterSubstitution(CharacterSubstitutionInterface):
    def cost(self, c0, c1):
        if c0 == 't' and c1 == 'r':
            return 0.5  # substituting 'r' for 't' is cheaper than other edits
        return 1.0

weighted_levenshtein = WeightedLevenshtein(CharacterSubstitution())
print(weighted_levenshtein.distance('String1', 'String2'))

# Damerau-Levenshtein distance (a transposition counts as one edit)
from strsimpy.damerau import Damerau
damerau = Damerau()
print(damerau.distance('ABCDEF', 'ABDCEF'))

# Optimal string alignment distance
from strsimpy.optimal_string_alignment import OptimalStringAlignment
optimal_string_alignment = OptimalStringAlignment()
print(optimal_string_alignment.distance('CA', 'ABC'))

# Jaro-Winkler similarity
from strsimpy.jaro_winkler import JaroWinkler
jarowinkler = JaroWinkler()
print(jarowinkler.similarity('My string', 'My tsring'))

# Longest common subsequence
from strsimpy.longest_common_subsequence import LongestCommonSubsequence
lcs = LongestCommonSubsequence()
print(lcs.distance('AGCAT', 'GAC'))

# LCS-based normalized metric
from strsimpy.metric_lcs import MetricLCS
metric_lcs = MetricLCS()
s1 = 'ABCDEFG'
s2 = 'ABCDEFHJKL'
print(metric_lcs.distance(s1, s2))

# N-gram distance
from strsimpy.ngram import NGram
twogram = NGram(2)
print(twogram.distance('ABCD', 'ABTUIO'))
s1 = 'Adobe CreativeSuite 5 Master Collection from cheap 4zp'
s2 = 'Adobe CreativeSuite 5 Master Collection from cheap d1x'
fourgram = NGram(4)
print(fourgram.distance(s1, s2))

# Q-gram distance
from strsimpy.qgram import QGram
qgram = QGram(2)
print(qgram.distance('ABCD', 'ABCE'))

# Cosine similarity over character 2-gram profiles
from strsimpy.cosine import Cosine
cosine = Cosine(2)
s0 = 'My first string'
s1 = 'My other string...'
p0 = cosine.get_profile(s0)
p1 = cosine.get_profile(s1)
print(cosine.similarity_profiles(p0, p1))
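
The reference list at the top mentions tabulate for printing tables; as a closing sketch (my own, with the measure selection chosen for illustration), the measures above can be compared side by side for one pair of strings:

from tabulate import tabulate
from strsimpy.normalized_levenshtein import NormalizedLevenshtein
from strsimpy.jaro_winkler import JaroWinkler
from strsimpy.metric_lcs import MetricLCS
from strsimpy.ngram import NGram

a, b = 'My string', 'My $string'
rows = [
    ('normalized levenshtein (distance)', NormalizedLevenshtein().distance(a, b)),
    ('jaro-winkler (similarity)', JaroWinkler().similarity(a, b)),
    ('metric lcs (distance)', MetricLCS().distance(a, b)),
    ('2-gram (distance)', NGram(2).distance(a, b)),
]
# tabulate renders a list of rows as an aligned plain-text table
print(tabulate(rows, headers=['measure', 'value']))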