Taming Text with string2string: A Powerful Python Library for String-to-String Algorithms

Leverage string2string for Natural Language Processing Tasks

Data Science
Machine Learning
Natural Language Processing
Python Library
Similarity Analysis
Semantic Search
Lexical Search
Text Mining
Open-Source
Author
Published

May 11, 2023

Modified

June 7, 2023

Featured image: word cloud of the concepts covered in the string2string library.

Note

👉 This article is also published on the Towards Data Science blog.

Introduction

The string2string library is an open-source tool that provides a comprehensive set of efficient algorithms for string-to-string problems.¹ It covers string pairwise alignment, distance measurement, lexical and semantic search, and similarity analysis. The library also includes a variety of helpful visualization tools and metrics that make it easier to understand and evaluate the results of these methods.

The library includes well-known algorithms such as Smith-Waterman, Hirschberg, Wagner-Fischer, BARTScore, BERTScore, Knuth-Morris-Pratt, and Faiss search. It can be applied to a wide range of tasks in natural language processing, bioinformatics, and computational social science [1].

The library was developed by the Stanford NLP group (part of the Stanford AI Lab) and introduced in [1]. Its GitHub repository includes several tutorials that you may find useful.

Note

A string is a sequence of characters (letters, numbers, and symbols) that stands for a piece of data or text. From everyday phrases to DNA sequences, and even computer programs, strings may be used to represent just about everything [1].

Installation

You can install the library via pip by running pip install string2string.
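
For copy-paste convenience, the command is:

pip install string2string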

Pairwise Alignment

String pairwise alignment is a method used in NLP and other fields to compare two strings (sequences of characters) by highlighting their shared and unique parts. The two strings are aligned, and a similarity score is computed from the number of matching characters, gaps, and mismatches. This procedure is useful for finding similar regions between sequences and for quantifying the “distance” between two strings. It has many applications, including spell checking, text analysis, and sequence comparison in bioinformatics (e.g., DNA sequence alignment).

Currently, the string2string package provides the following alignment techniques:

  • Needleman-Wunsch for global alignment
  • Smith-Waterman for local alignment
  • Hirschberg’s algorithm for linear space global alignment
  • Longest common subsequence
  • Longest common substring
  • Dynamic time warping (DTW) for time series alignment

In this post, we’ll look at two examples: one for global alignment and one for time series alignment.

Needleman-Wunsch Algorithm for Global Alignment

The Needleman-Wunsch algorithm is a dynamic programming algorithm that is commonly used in bioinformatics to globally align two DNA or protein sequences.

from string2string.alignment import NeedlemanWunsch

# Initialize the Needleman-Wunsch aligner
nw = NeedlemanWunsch()

# Define two sequences
s1 = 'ACGTGGA'
s2 = 'AGCTCGC'

# Compute the global alignment (and the dynamic-programming score matrix)
aligned_s1, aligned_s2, score_matrix = nw.get_alignment(s1, s2, return_score_matrix=True)

print(f'The alignment between "{s1}" and "{s2}":')
nw.print_alignment(aligned_s1, aligned_s2)
The alignment between "ACGTGGA" and "AGCTCGC":
A | C | G | - | T | G | G | A
A | - | G | C | T | C | G | C

For a more informative comparison, we can use the plot_pairwise_alignment() function from the library.

from string2string.misc.plotting_functions import plot_pairwise_alignment

path, s1_pieces, s2_pieces = nw.get_alignment_strings_and_indices(aligned_s1, aligned_s2)

plot_pairwise_alignment(
    seq1_pieces=s1_pieces,
    seq2_pieces=s2_pieces,
    alignment=path,
    str2colordict={
        "A": "pink", 
        "G": "lightblue", 
        "C": "lightgreen", 
        "T": "yellow", 
        "-": "lightgray"
    },
    title="",
    seq1_name="Sequence 1",
    seq2_name="Sequence 2",
)
Figure 1: Global alignment between ACGTGGA and AGCTCGC

Dynamic Time Warping

DTW is a useful tool for comparing two time series that may differ in speed, duration, or both. It computes the “distance” between each pair of points in the two sequences and then finds the path through these distances that minimizes the total difference between the sequences.

Let’s go over an example using the alignment module in the string2string library.

from string2string.alignment import DTW

dtw = DTW()

x = [3, 1, 2, 2, 1]
y = [2, 0, 0, 3, 3, 1, 0]

dtw_path = dtw.get_alignment_path(x, y, distance="square_difference")
print(f"DTW path: {dtw_path}")
DTW path: [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6)]

Each tuple (i, j) in the returned path pairs x[i] with y[j]; for instance, x[1] = 1 is matched with both y[1] = 0 and y[2] = 0, which is how DTW stretches one series against the other. The example above is borrowed from my previous post, An Illustrative Introduction to Dynamic Time Warping [2], where I explain the core concepts of DTW in a visual and accessible way.

Search Problems

String search is the task of finding occurrences of a pattern (substring) within another string. The library offers two types of search algorithms: lexical search (exact pattern matching, e.g., Knuth-Morris-Pratt) and semantic search (meaning-based retrieval built on Faiss [3]).
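
Lexical search is not demonstrated elsewhere in this post, so here is a minimal, hedged sketch. The class name KMPSearch and the search(pattern=..., text=...) signature are assumptions based on the library’s documentation; double-check them against the official tutorials before use.

from string2string.search import KMPSearch  # assumed class name

# Initialize the Knuth-Morris-Pratt searcher
kmp = KMPSearch()

text = "The quick brown fox jumps over the lazy dog"
pattern = "brown fox"

# Assumed to return the index of the first occurrence of the pattern in the text
idx = kmp.search(pattern=pattern, text=text)
print(f"Pattern found at index {idx}")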

Distance

String distance is the task of quantifying the degree to which two supplied strings differ using a distance function. Currently, the string2string library offers the following distance functions:

  • Levenshtein edit distance
  • Damerau-Levenshtein edit distance
  • Hamming distance
  • Jaccard distance³

Levenshtein edit distance

Levenshtein edit distance, or simply the edit distance, is the minimal number of insertions, deletions, or substitutions needed to convert one string into another.

from string2string.distance import LevenshteinEditDistance

edit_dist = LevenshteinEditDistance()

# Create two strings
s1 = "The beautiful cherry blossoms bloom in the spring time."
s2 = "The beutiful cherry blosoms bloom in the spring time."

# Compute the edit distance between the two strings
edit_dist_score = edit_dist.compute(s1, s2, method="dynamic-programming")

print(f'The distance between the following two sentences is {edit_dist_score}:')
print(f'"{s1}"')
print(f'"{s2}"')
The distance between the following two sentences is 2.0:
"The beautiful cherry blossoms bloom in the spring time."
"The beutiful cherry blosoms bloom in the spring time."

Jaccard Index

The Jaccard index can be used to quantify the similarity between sets of words or tokens and is commonly used in tasks such as document similarity or topic modeling. For example, the Jaccard index can be used to measure the overlap between the sets of words in two different documents or to identify the most similar topics across a collection of documents.

from string2string.distance import JaccardIndex 

jaccard_dist = JaccardIndex()

# Suppose we have two documents
doc1 = ["red", "green", "blue", "yellow", "purple", "pink"]
doc2 = ["green", "orange", "cyan", "red"]

jaccard_dist_score = jaccard_dist.compute(doc1, doc2)

print(f"Jaccard distance between doc1 and doc2: {jaccard_dist_score:.2f}")
Jaccard distance between doc1 and doc2: 0.75
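
As a sanity check: the two documents share two items (red and green) out of eight unique items in total, so the Jaccard similarity is 2/8 = 0.25 and the Jaccard distance is 1 - 0.25 = 0.75, matching the output above.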

Similarity

To put it simply, string similarity measures the degree to which two strings of text (or sequences of characters) are related or similar to one another. Take, as an example, the following pair of sentences:

  • “The cat sat on the mat.”
  • “The cat was sitting on the rug.”

Although not identical, these sentences share vocabulary and convey related meanings. String similarity methods reveal and quantify the degree of similarity between such text pairs.

Duality

There is a duality between string similarity and distance measures: one can generally be converted into the other (for example, a normalized distance can be turned into a similarity by subtracting it from one); see [1].

The similarity module of the string2string library currently offers the following algorithms:

  • Cosine similarity
  • BERTScore
  • BARTScore
  • Jaro similarity
  • LCSubsequence similarity

Let’s go over an example of the BERTScore similarity algorithm with the following four sentences:

  1. The bakery sells a variety of delicious pastries and bread.
  2. The park features a playground, walking trails, and picnic areas.
  3. The festival showcases independent movies from around the world.
  4. A range of tasty bread and pastries are available at the bakery.

Sentences 1 and 4 are semantically similar, as both are about the bakery’s bread and pastries. Hence, we should expect a high similarity score between them.

Let’s implement the above example in the library.

from string2string.similarity import BERTScore
from string2string.misc import ModelEmbeddings

# BERTScore scorer backed by a BERT model
bert_score = BERTScore(model_name_or_path="bert-base-uncased")
# BART model used to compute sentence embeddings
bart_model = ModelEmbeddings(model_name_or_path="facebook/bart-large")

sentences = [
    "The bakery sells a variety of delicious pastries and bread.", 
    "The park features a playground, walking trails, and picnic areas.", 
    "The festival showcases independent movies from around the world.", 
    "A range of tasty bread and pastries are available at the bakery.", 
]

# Compute BART sentence embeddings via mean pooling
# (these are not required by BERTScore below, but can be used for
# embedding-based comparisons such as cosine similarity)
embeds = [
    bart_model.get_embeddings(
        sentence, embedding_type='mean_pooling'
    ) for sentence in sentences
]

# Define source and target sentences (to compute BERTScore for each pair)
source_sentences, target_sentences = [], []
for i in range(len(sentences)):
    for j in range(len(sentences)):
        source_sentences.append(sentences[i])
        target_sentences.append(sentences[j])

# You can rewrite above in a more concise way using itertools.product
# from itertools import product
# source_sentences, target_sentences = map(
#   list, zip(*product(sentences, repeat=2))
# )

bertscore_similarity_scores = bert_score.compute(
    source_sentences,
    target_sentences,
)
bertscore_precision_scores = bertscore_similarity_scores['precision'].reshape(
    len(sentences), len(sentences)
)
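
The BART embeddings computed above are not consumed by BERTScore itself, but they can be compared directly. Below is a minimal sketch using PyTorch’s cosine_similarity, assuming get_embeddings returns PyTorch tensors; the library’s similarity module also lists its own cosine similarity implementation (see the list above), which you may prefer.

import torch.nn.functional as F

# Cosine similarity between the BART embeddings of sentences 1 and 4
# (flatten in case the embeddings carry a batch dimension)
cos_sim = F.cosine_similarity(embeds[0].flatten(), embeds[3].flatten(), dim=0)
print(f"Cosine similarity between sentences 1 and 4: {cos_sim.item():.2f}")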

We can visualize the similarity between every pair of sentences using the plot_heatmap() function provided in the library.

from string2string.misc.plotting_functions import plot_heatmap

plot_ticks = [f"Sentence {i + 1}" for i in range(len(sentences))]

# We can also visualize the BERTScore similarity scores using a heatmap
plot_heatmap(
    bertscore_precision_scores,
    title="",
    x_ticks=plot_ticks,
    y_ticks=plot_ticks,
    x_label="",
    y_label="",
    valfmt="{x:.2f}",
    cmap="Blues",
)
Figure 2: Semantic similarity (BERTScore) between sentences

As can be seen above, sentences 1 and 4 are much more similar to each other (according to BERTScore), as expected.

Note

📓 You can find the Jupyter notebook for this blog post on GitHub.

Conclusion

The string2string Python library is an open-source tool that provides a comprehensive set of efficient algorithms for string-to-string problems. The library has four main modules that address the following tasks: (1) pairwise alignment, including both global and local alignment; (2) distance measurement; (3) lexical and semantic search; and (4) similarity analysis. It offers several algorithms in each category and provides helpful visualization tools.

Date | Sections | Changes
June 7, 2023 | - | Fixed a few spelling errors.
May 15, 2023 | 5 | Added commented code showing how to use the product() function from the itertools library.
May 15, 2023 | - | Added a link to the published version of this post on the Towards Data Science blog.

References

[1]
M. Suzgun, S. M. Shieber, and D. Jurafsky, “string2string: A modern python library for string-to-string algorithms,” 2023, Available: http://arxiv.org/abs/2304.14395
[2]
E. Alizadeh, “An Illustrative Introduction to Dynamic Time Warping,” 2020. https://ealizadeh.com/blog/introduction-to-dynamic-time-warping/
[3]
J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019.

Footnotes

  1. Visit the library’s GitHub Page.↩︎

  2. A corpus (plural: corpora) is a large and structured collection of texts used for linguistic research and for NLP and ML applications.↩︎

  3. Not to be confused with the Jaccard similarity coefficient: Jaccard distance = 1 - Jaccard coefficient.↩︎

Citation

BibTeX citation:
@online{alizadeh2023,
  author = {Alizadeh, Esmaeil},
  title = {Taming {Text} with String2string: {A} {Powerful} {Python}
    {Library} for {String-to-String} {Algorithms}},
  date = {2023-05-11},
  url = {https://ealizadeh.com/blog/tutorial-string2string/},
  langid = {en}
}
For attribution, please cite this work as:
E. Alizadeh, “Taming Text with string2string: A Powerful Python Library for String-to-String Algorithms,” May 11, 2023. https://ealizadeh.com/blog/tutorial-string2string/
