How to navigate the web with Beautiful Soup

Web scraping has revolutionized the way we gather, analyze, and utilize data from websites. Beautiful Soup is a Python library used for web scraping and parsing HTML and XML documents. Let's go through some basics:

Installing Beautiful Soup

To begin, we need to install Beautiful Soup. Here’s how we can install it using pip:

pip install beautifulsoup4

Navigating the HTML tree

Navigating the HTML tree is at the heart of web scraping. We will cover various methods and techniques to traverse the document’s hierarchical structure. But first, we need to import Beautiful Soup:

from bs4 import BeautifulSoup

Next, let's create a simple HTML snippet and parse it using Beautiful Soup:

# importing Beautiful Soup
from bs4 import BeautifulSoup
#Initializing html string
html = "<html><div><p class='intro'>Hello, world!</p></div></html>"
#Parsing html
soup = BeautifulSoup(html, "html.parser")
#printing parsed html
print(soup)

Accessing tags and their attributes

If we have parsed HTML, we can access a specific element, like a paragraph (<p>) tag:

# importing Beautiful Soup
from bs4 import BeautifulSoup
#Initializing html string
html = "<html><div><p class='intro'>Hello, world!</p></div></html>"
#Parsing html
soup = BeautifulSoup(html, "html.parser")
# Getting <p> tag
element = soup.p
print(element)

Additionally, we can retrieve the name and attributes of the tag:

# importing Beautiful Soup
from bs4 import BeautifulSoup
#Initializing html string
html = "<html><div><p class='intro'>Hello, world!</p></div></html>"
#Parsing html
soup = BeautifulSoup(html, "html.parser")
# Getting <p> tag
element = soup.p
# printing name and attributes of tag
print("Tag:", element.name)
print("Attributes:", element.attrs)

Navigating upwards

To move upwards in the HTML tree, we can make use of the parent attribute. This attribute allows us to access the direct parent element of the current element.

parent = element.parent
print("Parent:", parent)

In addition to the immediate parent, we can also retrieve all ancestors of a given element using the parents attribute. This will include all the hierarchical predecessors of the element.

parents = element.parents
print("Ancestors:")
for parent in parents:
    print(parent)
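
As a minimal, self-contained sketch of walking up the tree, we can collect the tag name of each ancestor instead of printing its full markup. It reuses the HTML snippet from earlier; the variable name ancestor_names is only illustrative.

```python
from bs4 import BeautifulSoup

html = "<html><div><p class='intro'>Hello, world!</p></div></html>"
soup = BeautifulSoup(html, "html.parser")
element = soup.p

# .parents is a generator that walks from the direct parent up to the root
ancestor_names = [ancestor.name for ancestor in element.parents]
print(ancestor_names)  # ['div', 'html', '[document]']
```

The last entry, [document], is the BeautifulSoup object itself, which sits at the root of the parsed tree.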

Navigating downwards

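To move downwards in the HTML tree, we can make use of the descendants attribute. It iterates over every node nested inside the current element at any depth (its children, their children, and so on), including text nodes. To get only the direct children, we can use the children attribute instead.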

descendants = element.descendants
print("Descendants: ")
for descendant in descendants:
    print(descendant)
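
To make the difference between direct children and all descendants concrete, here is a small sketch. The nested HTML string and the variable names are assumptions chosen for illustration, not part of the earlier snippet.

```python
from bs4 import BeautifulSoup, Tag

# Illustrative HTML with two levels of nesting inside <div>
html = "<div><p>Hello, <b>world</b>!</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .children yields only the nodes directly inside <div>
children = list(div.children)

# .descendants yields every nested node, including text nodes
descendants = list(div.descendants)

# Keep only the tags (skipping text nodes) to see the nesting order
tag_names = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children))     # 1
print(len(descendants))  # 5
print(tag_names)         # ['p', 'b']
```

Here <div> has a single child, the <p> tag, but five descendants: the <p> tag, the text "Hello, ", the <b> tag, the text "world", and the text "!".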

Navigating sideways

We can also navigate sideways to traverse elements that share the same parent. To move to the next sibling element, we can utilize the next_sibling attribute. This allows us to access the element that comes immediately after the current element within the same parent.

# In the snippet above, <p> is the only child of <div>, so this prints None
sibling_next = element.next_sibling
print("Next Sibling:", sibling_next)

We can also move to the previous sibling element using the previous_sibling attribute. This provides access to the element that precedes the current element within the same parent.

# Likewise, <p> has no preceding sibling in the snippet above, so this prints None
sibling_previous = element.previous_sibling
print("Previous Sibling:", sibling_previous)
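
The following sketch shows sibling navigation on an element that actually has siblings. The list markup and variable names are assumptions made for this example. It also illustrates a common surprise: when the HTML contains line breaks between tags, next_sibling may return a whitespace text node, and find_next_sibling() can be used to skip to the next tag.

```python
from bs4 import BeautifulSoup

# Compact markup: the <li> tags are direct neighbors
soup = BeautifulSoup("<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser")
second = soup.find_all("li")[1]
print(second.previous_sibling)  # <li>one</li>
print(second.next_sibling)      # <li>three</li>

# Markup with newlines: whitespace between tags becomes a text node
spaced = BeautifulSoup("<ul>\n<li>one</li>\n<li>two</li>\n</ul>", "html.parser")
first = spaced.li
whitespace_node = first.next_sibling          # the "\n" between the two tags
next_tag = first.find_next_sibling("li")      # skips text nodes, returns a tag
print(repr(whitespace_node))
print(next_tag)
```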

Finding elements

Locating specific elements is a core skill in web scraping. Using Beautiful Soup's searching and filtering capabilities, we can extract exactly the information we need. Its search methods include find(), which returns the first match, find_all(), which returns every match, and select(), which takes a CSS selector.
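
As a hedged sketch of searching the tree, the example below uses find(), find_all(), and select(), three commonly used Beautiful Soup search methods. The HTML string and variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from a real page
html = """
<div>
  <p class="intro">Hello, world!</p>
  <p>Second paragraph</p>
  <a href="https://example.com">Example</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")           # first matching tag, or None if nothing matches
all_p = soup.find_all("p")         # list-like collection of every matching tag
intro = soup.select("p.intro")     # list of tags matching a CSS selector
missing = soup.find("table")       # no <table> in this markup

print(first_p.get_text())
print(len(all_p))
print(len(intro))
print(missing)
```

Because find() returns None when nothing matches, it is worth checking the result before accessing attributes on it.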

Beautiful Soup also allows us to find elements by their attribute values, such as class and id.
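
A minimal sketch of attribute-based searching is shown below. The markup, the id value main, and the class names are assumptions for this example. Note the trailing underscore in class_, which is needed because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

html = "<div id='main'><p class='intro'>Hello, world!</p><p class='note'>Bye</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Search by class using the class_ keyword argument
intro = soup.find("p", class_="intro")

# Search by id using a keyword argument
main = soup.find(id="main")

# The attrs dictionary works for any attribute name
note = soup.find("p", attrs={"class": "note"})

# The same lookup expressed as a CSS selector
by_css = soup.select_one("#main p.intro")

print(intro.get_text())
print(main.name)
print(note.get_text())
```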

Prettifying the output

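Printing parsed HTML gives us a single unindented string, which is hard to read. To make the output more readable, we can use the prettify() method, which returns the markup as a string with each tag and text node on its own line, indented by nesting depth.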
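
As a short sketch, reusing the HTML snippet from earlier (the variable name pretty is illustrative), prettify() can be called on the whole document or on any single tag:

```python
from bs4 import BeautifulSoup

html = "<html><div><p class='intro'>Hello, world!</p></div></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a new string; it does not modify the parsed tree
pretty = soup.prettify()
print(pretty)

# It also works on an individual element
print(soup.p.prettify())
```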

Extracting text and attributes

Beautiful Soup also allows us to extract text content from HTML tags while stripping away the markup. We can use the text attribute of the element as follows:

text = element.text
print("Text:", text)

We can also use get_text() to retrieve the text of an element.
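
The sketch below shows how get_text() differs from a plain .text lookup when an element contains several nested tags. The markup and variable names are assumptions for this example; separator and strip are optional arguments of get_text().

```python
from bs4 import BeautifulSoup

html = "<div><p> Hello, </p><p> world! </p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# With no arguments, get_text() concatenates the text nodes as they are
raw = div.get_text()

# strip=True trims each piece of text; separator is placed between the pieces
clean = div.get_text(separator=" ", strip=True)

print(repr(raw))    # ' Hello,  world! '
print(repr(clean))  # 'Hello, world!'
```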

Beautiful Soup also enables us to extract attributes of the elements. We can extract attributes from HTML elements using dictionary-like access or the .get() method:

# class can hold several values, so Beautiful Soup returns it as a list
class_p = element['class']
print("Class:", class_p)

Other attributes, such as href on a link, can be read in the same way.
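
To illustrate both access styles, here is a small sketch using a link element. The markup, URL, and variable names are assumptions for this example. Dictionary-style access raises a KeyError when the attribute is missing, while .get() returns None or a default value that we supply.

```python
from bs4 import BeautifulSoup

html = "<a href='https://example.com' class='link external'>Example</a>"
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# Dictionary-style access: raises KeyError if the attribute does not exist
href = link["href"]

# .get() returns None (or the given default) for a missing attribute
title = link.get("title")
title_or_default = link.get("title", "no title")

# Multi-valued attributes such as class come back as a list
classes = link["class"]

print(href)
print(title)
print(title_or_default)
print(classes)
```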

Conclusion

Beautiful Soup makes it straightforward to extract data from web pages. We covered installing the library, navigating the HTML tree, accessing tags and attributes, moving upwards, downwards, and sideways within the tree, finding elements, prettifying the output, and extracting text and attributes, which is enough to get started with web scraping using this library.

Copyright ©2025 Educative, Inc. All rights reserved