In this shot, we will use regular expressions (regex) to remove consecutive duplicate words from text.
A regular expression, or regex, is a pattern used to search for something in textual data. A single regex can often replace a dozen lines of hand-written string-handling code. Regex syntax is dense and takes some practice to read, but it quickly pays off in text processing and other work with text data.
We are going to use the following regex:
regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
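Let’s break down the sections. The doubled backslashes above are only string escaping; the pattern itself is \b(\w+)(?:\W+\1\b)+
\b matches a word boundary, so the match starts at the beginning of a word.
(\w+) matches one or more word characters (letters, digits, or underscore) and captures them as group 1.
(?: ... ) is a non-capturing group, used only to group what follows.
\W+ matches one or more non-word characters, such as the spaces or punctuation between words.
\1 is a backreference that matches the same text that group 1 captured.
The second \b makes sure the repeated word ends at a word boundary, so "the theme" does not count as a duplicate.
The final + lets the group repeat, so runs of three or more identical words are matched as well.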
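To see the pieces of the pattern at work, here is a minimal sketch that searches one sample sentence and prints the whole match and the captured group. The variable names pattern and match are only illustrative.

```python
import re

# The same pattern, written as a Python raw string so the
# backslashes reach the regex engine unchanged.
pattern = re.compile(r'\b(\w+)(?:\W+\1\b)+')

match = pattern.search("How are are you")

# group(0) is the whole run of repeated words that matched.
print(match.group(0))
# group(1) is the single word captured by (\w+).
print(match.group(1))
```

The whole match is the run "are are", while group 1 holds just "are", which is why replacing the match with \1 leaves a single copy of the word.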
Now, let’s take a look at the code:
import re

def removeDuplicatesFromText(text):
    regex = r'\b(\w+)(?:\W+\1\b)+'
    return re.sub(regex, r'\1', text, flags=re.IGNORECASE)

str1 = "How are are you"
print(removeDuplicatesFromText(str1))

str2 = "Edpresso is the the best platform to learn"
print(removeDuplicatesFromText(str2))

str3 = "Programming is fun fun"
print(removeDuplicatesFromText(str3))
Explanation:
In line 1, we import the re module, which allows us to use regex.
In line 3, we define a function that returns the text after removing the duplicate words.
In line 4, we define our regex pattern as a raw string.
In line 5, we use the sub() function of the re module, which returns a copy of the text with every match of the pattern replaced. We pass it the regex pattern, the replacement \1 (the word captured by the first group, so each run of repeated words collapses to a single copy), the input text, and the re.IGNORECASE flag, which makes the matching case-insensitive.
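As a small illustration of what the flag changes, the sketch below runs the same substitution with and without re.IGNORECASE. The sample sentence and the names with_flag and without_flag are made up for this example.

```python
import re

regex = r'\b(\w+)(?:\W+\1\b)+'
text = "The the cat sat sat sat down"

# With the flag, "The the" counts as a repeated word.
with_flag = re.sub(regex, r'\1', text, flags=re.IGNORECASE)
# Without it, only the exact-case repeats "sat sat sat" collapse.
without_flag = re.sub(regex, r'\1', text)

print(with_flag)     # The cat sat down
print(without_flag)  # The the cat sat down
```

Because the replacement is \1, the first occurrence is the one that is kept, so "The the" becomes "The" rather than "the".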
In lines 7 to 14, we call the function on three strings that contain duplicate words. The output shows the duplicates removed: "How are you", "Edpresso is the best platform to learn", and "Programming is fun".
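It can help to check a few edge cases as well. The sketch below restates the function from this shot and tries three extra inputs, chosen here purely for illustration: duplicates separated by punctuation, the same word repeated non-consecutively, and a word that merely starts with the previous word.

```python
import re

def removeDuplicatesFromText(text):
    regex = r'\b(\w+)(?:\W+\1\b)+'
    return re.sub(regex, r'\1', text, flags=re.IGNORECASE)

# \W+ also matches punctuation, so the comma is removed with the duplicate.
print(removeDuplicatesFromText("Programming is fun, fun"))  # Programming is fun
# The two "fun"s are not adjacent, so nothing is removed.
print(removeDuplicatesFromText("fun is fun"))               # fun is fun
# The trailing \b stops "the" from matching the start of "theme".
print(removeDuplicatesFromText("the theme"))                # the theme
```

The pattern only collapses words that repeat back to back; removing duplicates that appear further apart would need a different approach.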
In this way, a single regex substitution takes care of this text preprocessing step.