Parsing HTML in Python gives you easy access to a page's tags and attributes. For example, you can find and fetch an element by its id or class attribute, or by its tag name.

One easy way to parse HTML in Python is the BeautifulSoup4 library.

Use BeautifulSoup4 to Parse HTML

Install the BeautifulSoup4 library with pip:

pip install beautifulsoup4

First, call BeautifulSoup(html, 'html.parser') to create a BeautifulSoup object from html. Here, html is a string containing the HTML code, and 'html.parser' selects the parser that ships with Python's standard library.

from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
            <html>
            <head>
            <title>Gift Page</title>
            </head>
            <body>
            <h1>My First Gift</h1>
            <p class='content'>The gift was a car Toy. I liked it.</p>
            <p class='content'>I played with the toy with my friend <a href='linkToRaju'>Raju</a>.</p>

            <p id='comment'>Thanks</p>
            </body>
          </html>"""
          
soup = BeautifulSoup(html, 'html.parser')

Use the soup object to navigate, search, and access the data in the HTML.

>>> soup.title
# <title>Gift Page</title>

>>> soup.a
# <a href="linkToRaju">Raju</a>

>>> soup.a['href']
# 'linkToRaju'

>>> soup.find_all('p')
# [<p class="content">The gift was a car Toy. I liked it.</p>, <p class="content">I played with the toy with my friend <a href="linkToRaju">Raju</a>.</p>, <p id="comment">Thanks</p>]

>>> soup.find(id='comment')
# <p id="comment">Thanks</p>

>>> soup.find_all('p', {'class': 'content'})
# [<p class="content">The gift was a car Toy. I liked it.</p>, <p class="content">I played with the toy with my friend <a href="linkToRaju">Raju</a>.</p>]

>>> soup.find_all('p')[0].text
# 'The gift was a car Toy. I liked it.'
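The lookups above can also be combined into a small script. The following is a minimal sketch, not the only way to do it: it assumes a shortened copy of the same sample page, and the variable names (content_texts, links, comment_text, missing) are made up for illustration. It collects the text of every paragraph with the content class, reads link attributes with get() so a missing attribute does not raise an error, uses a CSS selector as an alternative to find(), and shows that find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample page used above.
html = """<html>
<head><title>Gift Page</title></head>
<body>
<h1>My First Gift</h1>
<p class='content'>The gift was a car Toy. I liked it.</p>
<p class='content'>I played with the toy with my friend <a href='linkToRaju'>Raju</a>.</p>
<p id='comment'>Thanks</p>
</body>
</html>"""

soup = BeautifulSoup(html, 'html.parser')

# get_text() returns the text of a tag together with the text of its children.
# class_ (with a trailing underscore) is used because class is a Python keyword.
content_texts = [p.get_text() for p in soup.find_all('p', class_='content')]

# Tag.get() returns None instead of raising KeyError when an attribute is absent.
links = [(a.get_text(), a.get('href')) for a in soup.find_all('a')]

# select_one() takes a CSS selector and returns the first match, or None.
comment = soup.select_one('#comment')
comment_text = comment.get_text() if comment is not None else None

# find() returns None when nothing matches, so check before using the result.
missing = soup.find(id='no-such-id')

print(content_texts)
print(links)
print(comment_text)
print(missing)
```

Checking for None before touching the result is worth the extra line: calling .text or indexing on a failed lookup raises an exception, which is a common source of errors when the page layout differs from what the script expects.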

For more details about bs4, see the official Beautiful Soup documentation.

Adarsh Kumar

