What is the clean() method of the clean-text package in Python?

What is the clean-text package?

clean-text is a third-party package that preprocesses text data to obtain a normalized text representation.

The package can be installed with pip using the following command:

pip install clean-text

clean() method

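The clean() method normalizes the given text according to the flags passed to it. Depending on those flags, it can fix broken Unicode, convert the text to ASCII and lowercase, and replace URLs, emails, phone numbers, numbers, digits, and currency symbols with replacement strings.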

Method signature

clean(
    text,
    fix_unicode=True,
    to_ascii=True,
    lower=True,
    normalize_whitespace=True,
    no_line_breaks=False,
    strip_lines=True,
    keep_two_line_breaks=False,
    no_urls=False,
    no_emails=False,
    no_phone_numbers=False,
    no_numbers=False,
    no_digits=False,
    no_currency_symbols=False,
    no_punct=False,
    no_emoji=False,
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    replace_with_punct="",
    lang="en",
)

Parameters

  • text: This is the text to preprocess.
  • fix_unicode=True: This is a boolean value indicating whether to fix broken Unicode characters.
  • to_ascii=True: If this is True, it converts non-ASCII characters into their closest ASCII equivalents.
  • lower=True: If this is True, it converts the text to lowercase.
  • no_line_breaks=False: If this is True, it strips the line breaks from the text.
  • no_urls=False: This is a boolean value indicating whether to replace all the URLs in the text with a special URL token.
  • no_emails=False: This is a boolean value that indicates whether to replace all emails in the text with a special EMAIL token.
  • no_phone_numbers=False: This is a boolean value indicating whether to replace all the phone numbers in the text with a special PHONE token.
  • no_numbers=False: This is a boolean value indicating whether to replace all the numbers in the text with a special NUMBER token.
  • no_digits=False: This is a boolean value indicating whether to replace every digit in the text with the replacement digit (0 by default).
  • no_currency_symbols=False: This is a boolean value indicating whether to replace all the currency symbols in the text with a special CURRENCY token.
  • no_punct=False: This is a boolean value indicating whether to remove all punctuation from the text.
  • replace_with_url="<URL>": This is the special URL token. The default value is <URL>.
  • replace_with_email="<EMAIL>": This is the special EMAIL token. The default value is <EMAIL>.
  • replace_with_phone_number="<PHONE>": This is the special PHONE token. The default value is <PHONE>.
  • replace_with_number="<NUMBER>": This is the special NUMBER token. The default value is <NUMBER>.
  • replace_with_digit="0": This is the special DIGIT token. The default value is 0.
  • replace_with_currency_symbol="<CUR>": This is the special CURRENCY token. The default value is <CUR>.
  • replace_with_punct="": This is the string that replaces punctuation when no_punct is True. The default value is an empty string.
  • lang="en": This is the language of the text, which determines the language-specific preprocessing. The default value is English ('en'). Other than English, only German ('de') is supported.
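Each no_* flag above is paired with a replace_with_* token. To make that pairing concrete, here is a minimal sketch using only Python's standard library. It is not how clean-text is implemented: the function replace_tokens and both regular expressions are hypothetical and deliberately simple, and the package's own patterns are more thorough.

```python
import re

# Simplified patterns for illustration only (assumption: these are NOT the
# patterns that the clean-text package uses internally).
URL_PATTERN = re.compile(r"https?://\S+")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_tokens(text, no_urls=False, no_emails=False,
                   replace_with_url="<URL>", replace_with_email="<EMAIL>"):
    """Hypothetical helper that mirrors the flag/token pairing of clean()."""
    if no_urls:
        # Swap every URL-like substring for the URL token.
        text = URL_PATTERN.sub(replace_with_url, text)
    if no_emails:
        # Swap every email-like substring for the EMAIL token.
        text = EMAIL_PATTERN.sub(replace_with_email, text)
    return text


sample = "Write to hello@example.com or visit https://example.com/docs today"
print(replace_tokens(sample, no_urls=True, no_emails=True))
# Write to <EMAIL> or visit <URL> today
```

With both flags left at False, the sketch returns the text unchanged, which matches the idea that a replacement token only matters when its flag is switched on.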

Return value

The method returns the cleaned text depending on the different parameters passed.
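To illustrate how the default flags shape the returned text, the following standard-library sketch approximates three of them: to_ascii, lower, and normalize_whitespace. The function normalize is a hypothetical stand-in, not the package's code, and its ASCII conversion (Unicode decomposition followed by dropping non-ASCII characters) is only a rough approximation of a real transliteration.

```python
import re
import unicodedata


def normalize(text, to_ascii=True, lower=True, normalize_whitespace=True):
    """Hypothetical stand-in showing how three default flags affect the result."""
    if to_ascii:
        # Decompose accented characters, then drop anything outside ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        text = decomposed.encode("ascii", "ignore").decode("ascii")
    if lower:
        text = text.lower()
    if normalize_whitespace:
        # Collapse runs of whitespace into one space and trim both ends.
        # Note: this sketch also collapses line breaks.
        text = re.sub(r"\s+", " ", text).strip()
    return text


print(normalize("  Café   Crème  BRÛLÉE "))
# cafe creme brulee
```

Passing lower=False keeps the original casing while still applying the other two steps.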

Code example 1

import cleantext
txt = "Hello Educative!!! How are you?"
new_txt = cleantext.clean(txt, no_punct=True)
print("Original String - '" + txt + "'")
print("Modified String after removing punctuations - '" + new_txt + "'")

Code explanation

  • Line 1: We import the cleantext package.
  • Line 2: We define a string called txt that contains punctuation.
  • Line 3: We remove all the punctuation in txt by calling the clean method with no_punct set to True. The result is stored in new_txt.
  • Lines 4-5: We print the original and the modified strings.

Code example 2

import cleantext
txt = "Hello Educative!!! 123 How are you? 456"
new_txt = cleantext.clean(txt, no_numbers=True)
print("Original String - '" + txt + "'")
print("Modified String after replacing numbers - '" + new_txt + "'")

Code explanation

  • Line 1: We import the cleantext package.
  • Line 2: We define a string called txt with numbers in it.
  • Line 3: We replace all the numbers in txt with the special NUMBER token by calling the clean method with no_numbers set to True. The result is stored in new_txt.
  • Lines 4-5: We print the original and the modified strings.
