How do you create a category column while reading a file in pandas?

What are categorical values in pandas?

A Categorical is a pandas data type that corresponds to categorical variables in statistics. A categorical variable takes one of a limited, usually fixed, number of possible values. Examples include gender, social class, blood type, and country.
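To make the idea concrete, here is a minimal sketch that builds a small categorical Series by hand. The blood-type values are made-up sample data, not taken from any dataset in this article.

```python
import pandas as pd

# Made-up sample data: blood types for six people
blood = pd.Series(["A", "B", "O", "A", "AB", "O"], dtype="category")

print(blood.dtype)                  # category
print(list(blood.cat.categories))   # the distinct values: ['A', 'AB', 'B', 'O']
print(list(blood.cat.codes))        # integer code per row: [0, 2, 3, 0, 1, 3]
```

Each row stores a small integer code that points into the list of distinct categories, which is why the type suits columns with few repeated values.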

Old way to create a categorical column in pandas

The older approach is to read the file first and then convert a column to the category type in a separate step. The snippet below does this with the astype() function.

Let’s take a look at the code:

import pandas as pd
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
print("Datatype of each column:")
print(drinks.dtypes)
drinks['continent'] = drinks.continent.astype('category')
print("\nDatatype after creating category column:")
print(drinks.dtypes)

Explanation:

  • In line 1, we import the required package.
  • In line 2, we read a CSV file from the URL.
  • In lines 3 and 4, we print the data types of all the columns. You can see that continent is of type object, not category.
  • In line 5, we convert the continent column to a categorical column using the astype() function.
  • In lines 6 and 7, we print the data types of all the columns again and can see that the continent column is now of type category.
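If you want to try the same conversion without a network connection, the sketch below reads a tiny inline CSV instead of the URL. The rows are made-up placeholders that only borrow the continent column name from the example above.

```python
import io
import pandas as pd

# Made-up sample rows standing in for the real file (illustrative values only)
csv_text = """country,beer_servings,continent
Alpha,10,Asia
Beta,20,Europe
Gamma,30,Asia
"""

drinks = pd.read_csv(io.StringIO(csv_text))
before = drinks['continent'].dtype   # object or a string dtype, depending on the pandas version

# Convert after reading, using astype()
drinks['continent'] = drinks['continent'].astype('category')
after = drinks['continent'].dtype

print(before, '->', after)
print(list(drinks['continent'].cat.categories))   # ['Asia', 'Europe']
```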

New way to create a categorical column in pandas

The above approach works fine, but what if we could do this conversion while reading the file itself?

Take a look at the code to see how this works.

import pandas as pd
drinks = pd.read_csv('http://bit.ly/drinksbycountry',
                     dtype={'continent': 'category'})
print("Datatype of each column:")
print(drinks.dtypes)

Explanation:

  • In line 1, we import the required package.
  • In lines 2 and 3, we read the CSV file and, while reading it, pass the dtype parameter, where we set the data type of the continent column to category.
  • In lines 4 and 5, we print the data types of all the columns and can see that the continent column is now a categorical column.

Similarly, you can set the data type of multiple columns by adding more key-value pairs to the dtype dictionary.
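As a rough sketch of setting several column types in one read, the dtype dictionary below maps three columns at once. The inline CSV, the region column, and all values are hypothetical and exist only to keep the example self-contained.

```python
import io
import pandas as pd

# Hypothetical sample data; the region column is not part of the original file
csv_text = """country,region,continent,beer_servings
Alpha,North,Asia,10
Beta,South,Europe,20
Gamma,North,Asia,30
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    # One key-value pair per column whose type should be set while reading
    dtype={
        'continent': 'category',
        'region': 'category',
        'beer_servings': 'float64',
    },
)

print(df.dtypes)
```

Columns that are not listed in the dictionary, such as country here, keep whatever type pandas infers for them.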
