The 20 newsgroups dataset is used in text classification problems. The fetch_20newsgroups() function loads the filenames and data from this dataset. The dataset has 20 classes and 18846 observations, and its features are strings (the raw text of each post).
On first use, the function downloads the dataset from the original 20 newsgroups website and caches it locally.
sklearn.datasets.fetch_20newsgroups(*, data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True, return_X_y=False)
It takes the following arguments:
data_home: The directory in which to download and cache the dataset. By default, it is '~/scikit_learn_data'.
subset: Selects which part of the dataset to load: 'train', 'test', or 'all'. By default, its value is 'train'.
categories: If None, all categories of the dataset are loaded. Otherwise, it takes a list of category names to load.
shuffle: Whether or not to shuffle the dataset when loading it into the program. Its default value is True.
download_if_missing: Its default value is True. If set to False, the function does not download the dataset when it is missing locally and raises an error instead.
It returns a dictionary-like Bunch object.
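As a rough sketch of how these arguments combine, the helper below loads only two categories from the test split and strips headers, footers, and quoted replies through the remove argument. The names load_subset and class_counts are illustrative and not part of scikit-learn, and the choice of categories is just an assumption for the example. The import is deferred into the function so that nothing is downloaded until the function is actually called.

```python
from collections import Counter


def load_subset(categories=("sci.space", "rec.autos"), subset="test"):
    """Load a slice of 20 newsgroups (downloads the data on first call)."""
    # Deferred import: defining this helper does not touch the network.
    from sklearn.datasets import fetch_20newsgroups

    return fetch_20newsgroups(
        subset=subset,                # 'train', 'test', or 'all'
        categories=list(categories),  # None would load all 20 classes
        shuffle=True,
        random_state=42,              # fixed seed for a repeatable order
        remove=("headers", "footers", "quotes"),  # strip metadata from posts
        download_if_missing=True,
    )


def class_counts(bunch):
    """Map each class name to its number of documents."""
    counts = Counter(int(label) for label in bunch.target)
    return {bunch.target_names[i]: n for i, n in sorted(counts.items())}
```

Calling class_counts(load_subset()) would then report how many posts each selected category contributes; the exact numbers depend on the subset and categories chosen.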
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# fetch 20 newsgroups dataset
data = fetch_20newsgroups()

# print dataset on console
print(data)
The code first imports the fetch_20newsgroups() function from the sklearn.datasets module. It then calls fetch_20newsgroups() to load the 20 newsgroups dataset into the program.
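Printing the whole object dumps every post to the console, so it is often more practical to look at individual fields. The sketch below assumes the loaded object exposes data, target, and target_names, as the Bunch returned by fetch_20newsgroups() does; preview is a hypothetical helper name, not a scikit-learn function.

```python
def preview(bunch, n=3, width=60):
    """Return (class name, start of text) pairs for the first n documents."""
    rows = []
    for text, label in zip(bunch.data[:n], bunch.target[:n]):
        # Collapse whitespace so multi-line posts fit on one line.
        snippet = " ".join(text.split())[:width]
        rows.append((bunch.target_names[int(label)], snippet))
    return rows
```

With the data object loaded earlier, looping over preview(data) and printing each pair gives a short, readable summary of the first few posts and their newsgroup labels.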