In this shot, we will learn about a fundamental structure in pandas: the DataFrame. While a Series is essentially a single column, a DataFrame is a two-dimensional table made up of a collection of Series that share an index. DataFrames allow us to store and manipulate tabular data where rows represent observations and columns represent variables.
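To make the relationship between the two structures concrete, here is a minimal sketch showing that selecting one column of a DataFrame gives back a Series. The variable names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: rows are observations, columns are variables.
fruits = pd.DataFrame(
    {"quantity": [12, 24], "price": [4.0, 4.5]},
    index=["apples", "bananas"],
)

# Selecting a single column returns a Series that keeps the row labels.
quantity_column = fruits["quantity"]

print(type(quantity_column).__name__)  # Series
print(fruits.shape)                    # (2, 2): two rows, two columns
```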
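There are several ways to create a DataFrame, most of them through pd.DataFrame(). For example, we can create one from a single Series, from a dictionary of lists, from a dictionary of Series objects, or by loading data from a file.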
Let’s look at each of these methods in detail.
To create a DataFrame from a single Series, we can pass the Series object as input to the DataFrame creation method, along with an optional input parameter, columns, which allows us to name the columns:
import pandas as pd

data_s1 = pd.Series([12, 24, 33, 15],
                    index=['apples', 'bananas', 'strawberries', 'oranges'])

# 'quantity' is the name for our column
dataframe1 = pd.DataFrame(data_s1, columns=['quantity'])
print(dataframe1)
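As an alternative that is not part of the example above, a Series can also be converted with its own to_frame() method. This is only a sketch of the same idea, reusing the data from the example:

```python
import pandas as pd

data_s1 = pd.Series([12, 24, 33, 15],
                    index=['apples', 'bananas', 'strawberries', 'oranges'])

# to_frame() turns the Series into a one-column DataFrame;
# the name argument sets the label of that column.
dataframe1 = data_s1.to_frame(name='quantity')

print(dataframe1.columns.tolist())            # ['quantity']
print(dataframe1.loc['bananas', 'quantity'])  # 24
```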
We can construct a DataFrame from a dictionary of lists, where each key becomes a column name and each list holds that column's values. Say we have a dictionary with countries, their capitals, and some other column:
import pandas as pd

# Named country_data rather than dict to avoid shadowing Python's built-in dict
country_data = {"country": ["Norway", "Sweden", "Spain", "France"],
                "capital": ["Oslo", "Stockholm", "Madrid", "Paris"],
                "SomeColumn": ["100", "200", "300", "400"]}

data = pd.DataFrame(country_data)
print(data)
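The code above passes a dictionary whose values are lists, one list per column. For comparison, pd.DataFrame() also accepts a list of dictionaries, where each dictionary describes one row. The records below are made up for illustration:

```python
import pandas as pd

# Each dictionary is one row; its keys become the column names.
records = [
    {"country": "Norway", "capital": "Oslo"},
    {"country": "Sweden", "capital": "Stockholm"},
]

data = pd.DataFrame(records)

print(data.shape)             # (2, 2)
print(data.columns.tolist())  # ['country', 'capital']
```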
We can also construct a DataFrame from a dictionary of Series objects. Say we have two different Series: one for the price of fruits and one for their quantity. We want to put all the fruits-related data together into a single table. We can do this like so:
import pandas as pd

quantity = pd.Series([12, 24, 33, 15],
                     index=['apples', 'bananas', 'strawberries', 'oranges'])
price = pd.Series([4, 4.5, 8, 7.5],
                  index=['apples', 'bananas', 'strawberries', 'oranges'])

df = pd.DataFrame({'quantity': quantity,
                   'price': price})
print(df)
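In the example above, both Series share exactly the same index. When the indexes only partly overlap, pandas aligns the values by label and fills the gaps with NaN. Here is a small sketch with made-up data:

```python
import pandas as pd

# Hypothetical Series whose indexes only partly overlap.
quantity = pd.Series([12, 24], index=['apples', 'bananas'])
price = pd.Series([4.5, 8.0], index=['bananas', 'strawberries'])

df = pd.DataFrame({'quantity': quantity, 'price': price})
print(df)

# The rows are the union of both indexes. A label that is missing
# from one Series gets NaN in that Series' column.
print(df.isna().sum())
```

Only bananas has a value in both columns. Because NaN is a floating-point value, the quantity column is stored as floats here.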
It’s quite simple to load data from different file formats, such as CSV and JSON, into a DataFrame.
We will import actual data to analyze the IMDB-movies dataset in the next lesson.
Here is what loading data from different file formats looks like in code:
import pandas as pd
# Given we have a file called data1.csv in our working directory:
df = pd.read_csv('data1.csv')
# Given we have a JSON file called data2.json in our working directory:
df = pd.read_json('data2.json')
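The snippet above needs data1.csv and data2.json to exist on disk. To try the same calls without any files, here is a sketch that reads from in-memory text instead. The contents are made up, and io.StringIO simply stands in for the files:

```python
import io
import pandas as pd

# In-memory stand-ins for data1.csv and data2.json (contents are made up).
csv_text = "fruit,quantity\napples,12\nbananas,24\n"
json_text = '[{"fruit": "apples", "quantity": 12}, {"fruit": "bananas", "quantity": 24}]'

# Both readers accept file-like objects as well as file paths.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Both calls should produce a table with a fruit column and a quantity column, one row per fruit.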