How to convert XML to a DataFrame using BeautifulSoup

Overview

XML (Extensible Markup Language) is a markup language that, like HTML, is used to represent information, but without predefined tags. It stores data as plain text, which makes the data independent of any particular platform.


In this Answer, we discuss how to create a DataFrame from an XML file using the BeautifulSoup library in Python. BeautifulSoup is used to pull data out of HTML and XML documents and is commonly used for web scraping (collecting data from websites with the help of bots).

We use the following commands to install the BeautifulSoup library and the other necessary libraries locally:

# install beautifulsoup
pip install beautifulsoup4
# install lxml (required by BeautifulSoup's 'xml' parser)
pip install lxml
# install pandas
pip install pandas

Step-by-step approach

Step 1: We include some necessary libraries in the program.

from bs4 import BeautifulSoup # including BeautifulSoup from bs4 module
import pandas as pd # including pandas as pd

In the code snippet above, we import the BeautifulSoup class from the bs4 module and the pandas library.

Step 2: We read the XML file.

with open("data.xml", 'r') as fd:
    data = fd.read()

In the code snippet above, we open the data.xml file in read mode, 'r'. The open() function returns a file object, fd, and the with statement closes the file automatically once the block ends. Then, we use the read() method to load the file content into data.
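The structure of data.xml is not shown in this Answer. The following is a minimal sketch that assumes a catalog of book records, each carrying the tags searched for in the later steps. The sample values, the catalog and book element names, and the temporary file location are all hypothetical and only serve to make the read step reproducible.

```python
import os
import tempfile

# Hypothetical sample content; the real data.xml may be structured differently.
SAMPLE_XML = """<?xml version="1.0"?>
<catalog>
  <book id="b1">
    <author>Author One</author>
    <title>First Title</title>
    <genre>Fiction</genre>
    <price>10.50</price>
    <publish_date>2020-01-15</publish_date>
    <description>First description.</description>
  </book>
  <book id="b2">
    <author>Author Two</author>
    <title>Second Title</title>
    <genre>Science</genre>
    <price>7.25</price>
    <publish_date>2021-06-30</publish_date>
    <description>Second description.</description>
  </book>
</catalog>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.xml")

    # Write the sample so that there is a file to read back.
    with open(path, "w", encoding="utf-8") as fd:
        fd.write(SAMPLE_XML)

    # Read the whole file into a string; the file is closed when the block ends.
    with open(path, "r", encoding="utf-8") as fd:
        data = fd.read()

print(data.count("<book"))  # number of opening <book> tags in the sample: 2
```

Passing an explicit encoding avoids depending on the platform default when the file contains non-ASCII characters.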

Step 3: We invoke the BeautifulSoup library.

soup = BeautifulSoup(data,'xml')

Here, we pass data and the parser name, xml, to the BeautifulSoup constructor. The xml parser relies on the lxml library, so lxml must be installed for this call to succeed.

Step 4: We search the data.

authors = soup.find_all('author')
titles = soup.find_all('title')
prices = soup.find_all('price')
pubdate = soup.find_all('publish_date')
genres = soup.find_all('genre')
des = soup.find_all('description')
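To make the search step concrete, here is a small sketch of what find_all() returns, run on an inline XML string instead of data.xml. The sample records are hypothetical. The fallback to html.parser is an assumption added only so that the sketch still runs when lxml is not installed; it is adequate for this tiny sample but is not a general substitute for a real XML parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical inline sample standing in for the contents of data.xml.
xml_text = """<catalog>
  <book><author>Author One</author><title>First Title</title></book>
  <book><author>Author Two</author><title>Second Title</title></book>
</catalog>"""

try:
    soup = BeautifulSoup(xml_text, "xml")          # requires lxml
except FeatureNotFound:
    soup = BeautifulSoup(xml_text, "html.parser")  # fallback for this sketch only

# find_all() returns a list-like collection of every matching element.
authors = soup.find_all("author")
print(len(authors))
print(authors[0])             # the whole element, including its tags
print(authors[0].get_text())  # only the text inside the element: Author One

# A tag name that does not occur yields an empty result, not an error.
missing = soup.find_all("isbn")
print(len(missing))
```

Each list holds the matches in document order, which is why the next step can pair the lists up by index.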

Step 5: We get the text data from XML.

data = []  # reuse the name data for the list of rows
for i in range(0, len(authors)):
    # get_text() returns the text inside each tag as a string
    rows = [authors[i].get_text(), titles[i].get_text(),
            genres[i].get_text(), prices[i].get_text(),
            pubdate[i].get_text(), des[i].get_text()]
    data.append(rows)
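Indexing the separate lists in parallel works only when every record contains every tag. If, for example, one record has no price, the prices list is shorter than authors and the loop raises an IndexError or pairs values from different records. A hedged alternative is to iterate record by record. The sketch below assumes each record is wrapped in a book element (a hypothetical name, since the file layout is not shown) and stores None for any missing field. As before, the html.parser fallback is only there so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: the second record has no <price> tag.
xml_text = """<catalog>
  <book><author>Author One</author><title>First Title</title><price>10.50</price></book>
  <book><author>Author Two</author><title>Second Title</title></book>
</catalog>"""

try:
    soup = BeautifulSoup(xml_text, "xml")          # requires lxml
except FeatureNotFound:
    soup = BeautifulSoup(xml_text, "html.parser")  # fallback for this sketch only

fields = ["author", "title", "price"]

rows = []
for book in soup.find_all("book"):
    row = []
    for name in fields:
        tag = book.find(name)  # search inside this record only
        row.append(tag.get_text(strip=True) if tag is not None else None)
    rows.append(row)

print(rows)
```

Because each row is built from a single record, a missing tag leaves a gap in that row instead of shifting the values of later records.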

Step 6: We create and print the DataFrame.

df = pd.DataFrame(data, columns = ['Author','Book Title',
                                   'Genre','Price','Publish Date',
                                   'Description'])
# get_text() returns strings, so convert the numeric column explicitly
df['Price'] = df['Price'].astype(float)
print(df)
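Every value extracted with get_text() is a string, so numeric and date columns generally need an explicit conversion before they can be used in calculations. The sketch below shows one possible approach with pd.to_numeric() and pd.to_datetime(). The rows are hypothetical, and errors="coerce" is a design choice that turns unparseable values into NaN or NaT instead of raising an exception.

```python
import pandas as pd

# Hypothetical rows, shaped like the output of the extraction loop (all strings).
rows = [
    ["Author One", "First Title", "10.50", "2020-01-15"],
    ["Author Two", "Second Title", "7.25", "2021-06-30"],
]

df = pd.DataFrame(rows, columns=["Author", "Book Title", "Price", "Publish Date"])

# Convert column by column; the text columns are left untouched.
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
df["Publish Date"] = pd.to_datetime(df["Publish Date"], errors="coerce")

print(df.dtypes)
print(df["Price"].sum())  # 17.75
```

After the conversion, Price supports arithmetic and Publish Date supports date operations such as sorting or filtering by year.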
