How to normalize text using the cleantext package in Python

What is the cleantext package?

cleantext is a third-party package that pre-processes text data to obtain a normalized text representation.

The package is published on PyPI under the name clean-text and can be installed with pip:

pip install clean-text

Whitespace normalization is controlled by normalize_whitespace, a parameter of the clean function provided by the cleantext library. When it is enabled, clean performs the following operations:

  1. Replace each run of consecutive spaces with a single space.
  2. Replace each run of consecutive line breaks with a single newline.
  3. Strip leading and trailing whitespace.
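
The three operations listed above can be approximated with Python's standard library alone. The following is a minimal sketch for illustration, not the cleantext implementation; the function name normalize_whitespace_sketch and the sample string are hypothetical.

```python
import re


def normalize_whitespace_sketch(text):
    """Approximate the three whitespace operations using only the standard library."""
    # Strip leading/trailing whitespace from every individual line
    lines = [line.strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Operation 1: collapse runs of spaces or tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Operation 2: collapse runs of line breaks into a single newline
    text = re.sub(r"\n+", "\n", text)
    # Operation 3: strip leading/trailing whitespace from the whole string
    return text.strip()


sample = "  hello     educative\n\n\nhello     edpresso    "
# repr() makes the remaining whitespace visible in the output
print(repr(normalize_whitespace_sketch(sample)))
```

Printing with repr() shows the newline as an escape sequence, which makes it easier to confirm that only one line break and single spaces remain.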

Syntax

from cleantext import clean

clean(text, normalize_whitespace=True)

The cleantext package provides us with the clean function.

Parameters

  • text: This is the text data to normalize.
  • normalize_whitespace: This is a Boolean value indicating whether to normalize whitespace in the text. The default value is True.

Return value

The clean function returns the normalized text as a string.
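
Because clean accepts other options in addition to normalize_whitespace, it can help to switch those off when only the whitespace behavior is of interest. The sketch below assumes the installed clean-text release accepts the fix_unicode, to_ascii, and lower keyword arguments; the helper name normalize_only_whitespace is hypothetical.

```python
def normalize_only_whitespace(text):
    """Call clean() with other transformations switched off (assumed keyword names)."""
    # Imported inside the function so a script still loads if clean-text is missing
    from cleantext import clean

    return clean(
        text,
        fix_unicode=False,  # assumption: leave Unicode characters untouched
        to_ascii=False,     # assumption: skip ASCII transliteration
        lower=False,        # assumption: keep the original casing
        normalize_whitespace=True,
    )


try:
    print(repr(normalize_only_whitespace("  Hello     World  ")))
except ImportError:
    print("clean-text is not installed; run: pip install clean-text")
```

If the keyword names differ in the installed version, the library's own documentation is the authority on which options are available.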

Code

import cleantext
string = """hello     educative
hello     edpresso    """
normalized_string = cleantext.clean(string, normalize_whitespace=True)
print("Original String - '" + string + "'")
print("\n")
print("Normalized String - '" + normalized_string + "'")

Code explanation

  • Line 1: We import the cleantext package.
  • Lines 2-3: We define a string that contains a newline and multiple consecutive spaces.
  • Line 4: We call the clean function with normalize_whitespace set to True to obtain a normalized string in which the extra spaces are removed.
  • Lines 5-7: We print the original and the normalized strings.
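
When comparing the original and normalized strings, trailing spaces and line breaks are hard to see in plain printed output. One possible aid, sketched here with a hypothetical helper named show, is to format each string with repr() so that whitespace appears explicitly.

```python
def show(label, text):
    """Return a one-line description in which whitespace is visible."""
    # The !r conversion applies repr(), so newlines appear as \n and
    # trailing spaces stay inside the quotes
    return f"{label}: {text!r}"


original = "hello     educative\nhello     edpresso    "
print(show("Original String", original))
```

The same helper can be applied to the string returned by clean to check the result character by character.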
