How to use the fetch_20newsgroups() function

Overview

The 20 newsgroups dataset is a collection of newsgroup posts commonly used for text classification problems. It has 20 classes and 18846 observations in total (11314 in the training subset and 7532 in the test subset), and its features are raw strings. The fetch_20newsgroups() function loads the filenames and text data from this dataset.

On first use, the function downloads the dataset from the original 20 newsgroups website and caches it locally, so later calls read from the cache.

Syntax


sklearn.datasets.fetch_20newsgroups(*,
data_home=None,
subset='train',
categories=None,
shuffle=True,
random_state=42,
remove=(),
download_if_missing=True,
return_X_y=False)

Parameters

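It takes the following keyword-only arguments:

  • data_home: This is the directory where the dataset is downloaded and cached. By default, it's '~/scikit_learn_data'.
  • subset: This selects which part of the dataset to load: 'train', 'test', or 'all'. By default, its value is 'train'.
  • categories: If None, it loads all the categories of the dataset. Otherwise, it requires a list of category names to load.
  • shuffle: Its default value is True. It indicates whether or not to shuffle the dataset when loading it into the program.
  • random_state: Its default value is 42. It controls the random number generation used for shuffling, so a fixed value gives a reproducible order.
  • remove: Its default value is an empty tuple. It may contain any of 'headers', 'footers', and 'quotes' to strip that kind of text from each post.
  • download_if_missing: Its default value is True. If set to False, the function raises an error instead of downloading the dataset when it's not available locally.
  • return_X_y: Its default value is False. If set to True, the function returns a (data, target) tuple instead of a bunch object.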
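
As a rough sketch of how these parameters combine, the snippet below collects a set of keyword arguments and passes them to fetch_20newsgroups(). The names FETCH_KWARGS and load_newsgroups are hypothetical helpers introduced only for this illustration, and the two categories are arbitrary picks from the dataset's 20 newsgroups.

```python
# Keyword arguments for fetch_20newsgroups(). FETCH_KWARGS is a hypothetical
# name used for this illustration; the values are example choices.
FETCH_KWARGS = {
    "subset": "test",                            # 'train', 'test', or 'all'
    "categories": ["rec.autos", "sci.space"],    # load only these two classes
    "remove": ("headers", "footers", "quotes"),  # strip metadata from each post
    "shuffle": True,                             # shuffle the loaded posts
    "random_state": 0,                           # make the shuffle reproducible
    "download_if_missing": True,                 # download if not cached yet
}


def load_newsgroups(**kwargs):
    """Call fetch_20newsgroups() with the given keyword arguments."""
    # Imported inside the function so that nothing is fetched until it is called.
    from sklearn.datasets import fetch_20newsgroups

    return fetch_20newsgroups(**kwargs)
```

Calling load_newsgroups(**FETCH_KWARGS) downloads the archive the first time and reads from the local cache afterwards. Because categories is set, only posts from the two listed newsgroups are returned.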

Return value

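It returns a dictionary-like bunch object. Its data attribute holds the posts as a list of strings, target holds the integer class label of each post, target_names holds the names of the classes, filenames holds the paths of the cached files, and DESCR holds a description of the dataset. If return_X_y is True, it returns a (data, target) tuple instead.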
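
To illustrate how the returned object can be inspected, here is a minimal sketch that counts the posts per class using the data, target, and target_names attributes. The helpers summarize and describe_newsgroups are hypothetical names for this example, not part of scikit-learn.

```python
from collections import Counter


def summarize(bunch):
    """Return (number of posts, number of posts per class name)."""
    # Each entry of bunch.target is an integer index into bunch.target_names.
    per_class = Counter(bunch.target_names[label] for label in bunch.target)
    return len(bunch.data), dict(per_class)


def describe_newsgroups(subset="train"):
    """Fetch one subset and summarize it (downloads on the first call)."""
    # Imported inside the function so that nothing is fetched until it is called.
    from sklearn.datasets import fetch_20newsgroups

    return summarize(fetch_20newsgroups(subset=subset))
```

Calling describe_newsgroups() returns the total number of posts in the chosen subset together with a dictionary that maps each newsgroup name to its post count.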

Example

from sklearn.datasets import fetch_20newsgroups
# fetch the training subset of the 20 newsgroups dataset
data = fetch_20newsgroups()
# print the dataset to the console
print(data)

Explanation

  • Line 1: We import the fetch_20newsgroups() function from the sklearn.datasets module.
  • Line 3: We invoke the fetch_20newsgroups() function with its default arguments to load the training subset of the 20 newsgroups dataset into the program.
  • Line 5: We print the loaded dataset to the console.
