
This is Part 1 in a series covering NLP (Natural Language Processing) which progresses through a real-world problem that I recently encountered.

I thought this would be a good topic to write about, since it is relatable to many online businesses and demonstrates how leetcode/hackerrank-style problems can occur in your day-to-day work as a Software Engineer.

So what is NLP anyway? NLP is a branch of Computer Science that roughly dates back to the late 1940s and early 1950s, with early proposals to automate language translation. Research in this domain initially progressed through rule-based and symbolic systems like ELIZA, which relied on pattern matching and scripted responses. This was followed by statistical models like Hidden Markov Models (HMM) and n-grams, which try to predict the next word based on the preceding words. Fast forward to the 2020s, and the deep learning revolution has taken the world by storm. NLP is an especially hot topic now, given the widespread adoption and growth of Large Language Models (LLMs) such as ChatGPT.

This series will take you through time, showing how the fundamental approaches of each generation apply to essentially the same problem, with complexity increasing as we move forward.

Let us start with the following:

wordSynonyms = set([
    ("popularity", "rating"),
    ("ratings", "popularity"),
    ("game", "match")
])

sentencePairs = [
    ("Click to see Messi's ratings", "Click to see Messi's popularity"),
    ("Messi's popularity has increased", "Messi's rating has increased"),
    ("Messi's rating has grown after the last match", "Messi's popularity has grown after the last game"),
    ("Is Messi the best", "Is Messi the greatest"),
    ("Messi's assists vs Ronaldo", "Ronaldo's assists vs Messi's"),
    ("Cristiano Ronaldo vs Lionel Messi, head to head", "Lionel Messi v. Cristiano Ronaldo, toe to toe")
]

# implement a method that evaluates the above to [True, True, True, False, False, False]

Requirements:

  1. a sentence can have multiple synonyms
  2. sentences always have the same number of words
  3. ordering must be maintained

As you can see from the above input parameters and requirements, to determine equivalence between each sentence pair we can evaluate each word in order, since ordering is always the same and sentences always have the same number of words. This isolates the comparisons to the word pairs that share the same ordinal position across both sentences. For example, consider the first sentence pair:

(“Click to see Messi’s ratings“, “Click to see Messi’s popularity“),

Using our set of tuples in wordSynonyms, we can check for the existence of (ratings, popularity) or (popularity, ratings).

So for our first version, we’ll break apart each sentence into lists of words, then iterate through each word pair and compare them. If they are different, we’ll check if the pair is in our wordSynonyms, if it’s not, then return False. If we go through each word pair without returning False, then the sentences are equivalent, return True!

def compare(s1: list[str], s2: list[str], synonymSet: set[tuple[str, str]]) -> bool:
  if len(s1) != len(s2):
    return False

  for i in range(len(s1)):
    w1 = s1[i]
    w2 = s2[i]
    if w1 != w2:
      if (w1, w2) in synonymSet or (w2, w1) in synonymSet:
        continue
      else:
        return False
  return True

def similarityRank(sentencePairs: list[tuple[str, str]]) -> list[bool]:
  result = []
  for s1, s2 in sentencePairs:
    result.append(compare(s1.lower().split(" "), s2.lower().split(" "), wordSynonyms))
  return result
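As a quick sanity check, here is a condensed, python-ified restatement of the above (using zip and a list comprehension, with the synonym set passed in explicitly) run against the original inputs:

```python
wordSynonyms = {("popularity", "rating"), ("ratings", "popularity"), ("game", "match")}

sentencePairs = [
    ("Click to see Messi's ratings", "Click to see Messi's popularity"),
    ("Messi's popularity has increased", "Messi's rating has increased"),
    ("Messi's rating has grown after the last match", "Messi's popularity has grown after the last game"),
    ("Is Messi the best", "Is Messi the greatest"),
    ("Messi's assists vs Ronaldo", "Ronaldo's assists vs Messi's"),
    ("Cristiano Ronaldo vs Lionel Messi, head to head", "Lionel Messi v. Cristiano Ronaldo, toe to toe"),
]

def compare(s1: list[str], s2: list[str], synonymSet: set) -> bool:
    # Equal length required; every word pair must match exactly or be a known synonym pair.
    if len(s1) != len(s2):
        return False
    return all(w1 == w2 or (w1, w2) in synonymSet or (w2, w1) in synonymSet
               for w1, w2 in zip(s1, s2))

def similarityRank(pairs: list) -> list[bool]:
    return [compare(a.lower().split(" "), b.lower().split(" "), wordSynonyms) for a, b in pairs]

print(similarityRank(sentencePairs))  # [True, True, True, False, False, False]
```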

Not so bad, eh?! Quick note: we could python-ify the above code using list comprehensions and zip to reduce the total number of lines, but I intentionally avoided that in the main version to make it easier to understand and translate across other programming languages.

The .lower() call on each s1 and s2 handles cases where a sentence starts with one of our synonyms, which would affect the casing, e.g. Popularity.

Also, if you are wondering why we chose to iterate through the sentence pairs instead of the word synonyms, this is a quick lesson in runtime complexity. Simply consider what would happen if wordSynonyms were a dictionary spanning the entire language: you wouldn’t want to scan the full database each time you compare a few sentences.

Ready for our first twist? You might have already noticed that the sentences so far don’t have punctuation marks. What if we add those in?

(“Click to see Messi’s ratings.”, “Click to see Messi’s popularity!”),

Then our code fails! Since the tuple lookup requires an exact match on ratings and popularity without punctuation, and we split on spaces, the lookups become (“ratings.”, “popularity!”) and (“popularity!”, “ratings.”) – neither of which is in wordSynonyms.

This is also solvable with a few more simple rules. Remember earlier where we performed a .lower() on each sentence? Since there are a finite number of punctuation marks, we can replace that call with a simple sanitize method, which checks whether each character is a letter, number, or space. Most languages can implement this the same way; for example, Java has Character.isLetterOrDigit.

def sanitize(text: str) -> str:
    sb = []
    for c in text.lower():  # lowering the whole string once is fewer method calls than lowering each char
        if c.isalnum() or c.isspace():
            sb.append(c)
    return "".join(sb)
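One subtlety worth noting: isalnum is False for apostrophes, so “Messi's” collapses to “messis”. That is harmless here, since it happens identically on both sides of every pair. A condensed generator-expression form of the method above shows the effect:

```python
def sanitize(text: str) -> str:
    # Keep only letters, digits, and spaces; lowercase everything else away.
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace())

print(sanitize("Click to see Messi's ratings."))     # click to see messis ratings
print(sanitize("Click to see Messi's popularity!"))  # click to see messis popularity
```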

Cool, so we simply replace the original call to .lower() in the similarityRank method with a call to sanitize, which now handles lowercasing and filters out punctuation.

Quick side note: if you have malformed input, for example extra spaces, that could be handled with some more internal sanitization, assuming words themselves are not malformed and are always separated by spaces; we will get to this in later lessons. 🙂

What if we introduce some more variance, with sentence-structural words that are not exactly synonyms or synonym pairs?

(“Click here for Messi’s ratings.”, “Click to see Messi’s popularity!”),

As humans, we semantically understand these two sentences to be roughly equivalent. If we continue with the rule-based approaches we’ve been using so far, we do have some options here. For example, we can filter out any structural words that don’t carry direct meaning, such as “here”, “to”, “see” and “for”. These are referred to as “Stop Words”. There are existing libraries that ship with lists of stop words (e.g. NLTK and spaCy in Python), but for this example let’s simply define a small set.

stopWords = set([
    "here", "for", "to", "see", "on"
])

def sanitize(text: str) -> str:
    sb = []
    for c in text.lower():  # lowering the whole string once is fewer method calls than lowering each char
        if c.isalnum() or c.isspace():
            sb.append(c)

    sanitizedWords = "".join(sb).split(" ")
    filteredWords = [word for word in sanitizedWords if word not in stopWords]
    return " ".join(filteredWords)
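With the stop words filtered out, the varied pair above reduces to the same shape and passes the word-by-word check. A self-contained sketch (restating sanitize in condensed form and using a small stop list like the one above):

```python
stopWords = {"here", "for", "to", "see"}
wordSynonyms = {("popularity", "rating"), ("ratings", "popularity"), ("game", "match")}

def sanitize(text: str) -> str:
    # Strip punctuation, lowercase, then drop stop words.
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(w for w in kept.split(" ") if w not in stopWords)

s1 = sanitize("Click here for Messi's ratings.")   # "click messis ratings"
s2 = sanitize("Click to see Messi's popularity!")  # "click messis popularity"

same = all(a == b or (a, b) in wordSynonyms or (b, a) in wordSynonyms
           for a, b in zip(s1.split(" "), s2.split(" ")))
print(same)  # True
```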

This keeps the sentences equivalent, yet this approach only gets us so far. Time for the last twist: what if the sentences have varied lengths?

(“Click to see latest Messi’s ratings.”, “Click to see stats on Messi’s popularity!”),

Here is where our rule-based approach starts to break down. One step we can take without making significant changes to our algorithm is to consider approximate string matching – sometimes known by the colloquial (and slightly misleading) name of fuzzy matching. There are some creative things we can do here, such as assigning words different weights, for example based on their length or similarity to stop words, and/or handling other word transformations with filters or replacements (e.g. “seeing”).

Consider this simple change to the compare method, which begins to add some lenience on the sentence equivalence:

def compare(s1: list[str], s2: list[str], synonymSet: set[tuple[str, str]]) -> bool:
  variance = 0
  minLen = min(len(s1), len(s2))
  for i in range(minLen):
    w1 = s1[i]
    w2 = s2[i]
    if w1 != w2:
      if (w1, w2) in synonymSet or (w2, w1) in synonymSet:
        continue
      else:
        variance += 1
  return variance < 2
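To see the added lenience, the fourth pair from our original input (“Is Messi the best” / “Is Messi the greatest”) previously returned False but now passes, while pairs with two or more unexplained differences still fail. A condensed, self-contained restatement:

```python
wordSynonyms = {("popularity", "rating"), ("ratings", "popularity"), ("game", "match")}

def compare(s1: list[str], s2: list[str], synonymSet: set) -> bool:
    # Tolerate at most one non-synonym word difference across aligned positions.
    variance = 0
    for w1, w2 in zip(s1, s2):
        if w1 != w2 and (w1, w2) not in synonymSet and (w2, w1) not in synonymSet:
            variance += 1
    return variance < 2

print(compare("is messi the best".split(), "is messi the greatest".split(), wordSynonyms))  # True
print(compare("messis assists vs ronaldo".split(), "ronaldos assists vs messis".split(), wordSynonyms))  # False
```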

Now there are a lot of issues with the above approach. Aside from the potential degradation in quality, let’s first draw attention to something you might have noticed: our variance counter does not consider the length of the sentences. In this implementation, a long and a short sentence, or two short sentences, both count raw word variances rather than the percentage of difference between them. If you noticed this, great, as it marks an important mental model shift!

What we are calling variance here is actually a form of edit distance known as word-level edit distance: the number of edits (insertions, deletions and substitutions) needed to make the word sequences match. Normalizing that edit count by sentence length gives the Word Error Rate (WER). Currently we have a rule to substitute synonyms for words before we check for equivalence, but what if we introduce insertions and deletions?

Consider the following transformations for the previous sentence pair, after filtering out punctuation and stop words.

s1 -> “Click latest Messi’s ratings” -> transform to s2 -> “Click stats Messi’s popularity”

  1. s1 -> replace ratings with popularity. Do not count, since it meets our synonym rule.
  2. s1 -> delete latest. Increment edits.
  3. s1 -> insert stats. Increment edits.

s1 requires 2 edits, ignoring the synonym replacement per our rule.

s2 -> “Click stats Messi’s popularity” -> transform to s1 -> “Click latest Messi’s ratings”

  1. s2 -> delete stats. Increment edits.
  2. s2 -> insert latest. Increment edits.
  3. s2 -> replace popularity with ratings. Do not count, since it meets our synonym rule.

s2 requires 2 edits, ignoring the synonym replacement per our rule.

Our takeaway from this quick exercise: each sentence required 2 edits to become equivalent, or, in this context, each has 2 errors. Since there are 4 words in each of the reduced sentences, our overall Word Error Rate (WER) is (2+2)/(4+4) = 4/8 = 1/2, or 50% WER.
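For completeness, the minimum word-level edit distance can be computed with the classic Levenshtein dynamic program, extended so that a synonym substitution costs nothing. This is a sketch, and note that a true minimum-distance search treats latest → stats as a single substitution, so it reports a distance of 1 for the reduced pair above rather than the delete + insert pair we counted by hand:

```python
def word_edit_distance(a: list[str], b: list[str], synonyms: set) -> int:
    # dp[i][j] = edits needed to turn the first i words of a into the first j words of b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            free = (a[i-1] == b[j-1]
                    or (a[i-1], b[j-1]) in synonyms
                    or (b[j-1], a[i-1]) in synonyms)
            dp[i][j] = min(dp[i-1][j-1] + (0 if free else 1),  # match/synonym or substitute
                           dp[i-1][j] + 1,                     # delete from a
                           dp[i][j-1] + 1)                     # insert into a
    return dp[len(a)][len(b)]

wordSynonyms = {("popularity", "rating"), ("ratings", "popularity"), ("game", "match")}
s1 = "click latest messis ratings".split()
s2 = "click stats messis popularity".split()
d = word_edit_distance(s1, s2, wordSynonyms)
print(d, 2 * d / (len(s1) + len(s2)))  # 1 0.25
```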

You might be wondering: what if we kept the stop words in? Interestingly, our original sentences had lengths of 6 and 7 words, and each direction then needs one extra edit to account for “on” (an insertion going one way, a deletion going the other), which results in (3+3)/(6+7) ≈ 46.2% WER. This illustrates a tricky property of stop words: they can be helpful or harmful depending on context, and removing them can change results dramatically. For example, if “not” were a stop word, then this sentence pair would be considered equivalent!

(“Do not take with water”, “Do take with water”)
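A two-line demonstration of the hazard, assuming “not” were (wrongly) added to the stop list:

```python
badStopWords = {"not"}  # hypothetical: "not" treated as a stop word

def strip_stops(s: str) -> list[str]:
    return [w for w in s.lower().split(" ") if w not in badStopWords]

print(strip_stops("Do not take with water") == strip_stops("Do take with water"))  # True, dangerously "equivalent"
```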

Yet we could keep expanding the rule-based approach further if we wanted. The above could be handled with a new rule that considers words with opposite meanings or negation effects on sentences. By adding a complex series of conditional logic, we could take it pretty far with some hard work, sweat and tears. You can also speed some of this up with metaprogramming and templating, and eventually arrive at a solution similar to many early chatbots, including the predecessor to Amazon Alexa developed by Ivona Software. Except there are diminishing returns at scale to rule-based textual similarity comparison, especially considering debugging complexity. To improve further, a paradigm shift is needed in how we think about similarity measurement and ranking.

This brings us to Part 2: Jaccard similarity, Cosine similarity, and applications of edit distance (Levenshtein distance). Using these, we will transform our original algorithm into a new, more robust version and move forward into the next period of NLP history.

Key Terms

Rule-based system. https://en.wikipedia.org/wiki/Rule-based_system

Brief history of Natural Language Processing (NLP): ELIZA, Hidden Markov Models (HMM), n-grams, Large Language Models (LLMs). https://en.wikipedia.org/wiki/History_of_natural_language_processing

Stop Words; Natural Language Toolkit (NLTK), spaCy. https://en.wikipedia.org/wiki/Stop_word, https://www.nltk.org/, https://spacy.io/

approximate string matching and fuzzy matching. https://en.wikipedia.org/wiki/Approximate_string_matching

Word-Level Edit Distance or Word Error Rate, general edit distance (Levenshtein distance). https://en.wikipedia.org/wiki/Edit_distance

similarity measure. https://en.wikipedia.org/wiki/Similarity_measure

Jaccard similarity, Cosine similarity. https://en.wikipedia.org/wiki/Jaccard_index, https://en.wikipedia.org/wiki/Cosine_similarity

Minor Terms

https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions

https://docs.python.org/3/library/functions.html#zip

https://en.wikipedia.org/wiki/Metaprogramming

Published by Ronnie Diaz