How to Parse HTML in Python?


Parsing HTML in Python gives you easy access to a page's tags and attributes. For example, you can find and fetch an element by its id or class attribute, by its tag name, and so on.

One easy way to parse HTML in Python is with the BeautifulSoup4 library, as shown below.

Use BeautifulSoup4 to Parse HTML

Install the BeautifulSoup4 library with pip:

pip install beautifulsoup4

First, call BeautifulSoup(html, 'html.parser') to create a BeautifulSoup object from html. Here, html is a string containing the HTML code, and 'html.parser' selects the parser built into Python's standard library.

from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
            <html>
            <head>
            <title>Gift Page</title>
            </head>
            <body>
            <h1>My First Gift</h1>
            <p class='content'>The gift was a car Toy. I liked it.</p>
            <p class='content'>I played with the toy with my friend <a href='linkToRaju'>Raju</a>.</p>

            <p id='comment'>Thanks</p>
            </body>
          </html>"""
          
soup = BeautifulSoup(html, 'html.parser')

Use the soup object to navigate, search, and access the data in the HTML. In the examples below, the comment under each expression shows its result.

>>> soup.title
# <title>Gift Page</title>

>>> soup.a
# <a href="linkToRaju">Raju</a>

>>> soup.a['href']
# 'linkToRaju'

>>> soup.find_all('p')
# [<p class="content">The gift was a car Toy. I liked it.</p>, <p class="content">I played with the toy with my friend <a href="linkToRaju">Raju</a>.</p>, <p id="comment">Thanks</p>]

>>> soup.find(id='comment')
# <p id="comment">Thanks</p>

>>> soup.find_all('p', {'class': 'content'})
# [<p class="content">The gift was a car Toy. I liked it.</p>, <p class="content">I played with the toy with my friend <a href="linkToRaju">Raju</a>.</p>]

>>> soup.find_all('p')[0].text
# 'The gift was a car Toy. I liked it.'
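The individual lookups above can also be combined into one small script. The following is a minimal sketch, not a definitive implementation: summarize is a hypothetical helper name introduced here for illustration, and the HTML is a shortened copy of the gift page. It gathers the title, the text of the content paragraphs, every link target, and the comment into a dictionary.

```python
from bs4 import BeautifulSoup

# Shortened copy of the gift page used in this tutorial.
html = """<html>
<head><title>Gift Page</title></head>
<body>
<h1>My First Gift</h1>
<p class='content'>The gift was a car Toy. I liked it.</p>
<p class='content'>I played with the toy with my friend <a href='linkToRaju'>Raju</a>.</p>
<p id='comment'>Thanks</p>
</body>
</html>"""


def summarize(html):
    """Hypothetical helper: collect a few pieces of data from the page."""
    soup = BeautifulSoup(html, 'html.parser')

    # soup.title and soup.find() return None when nothing matches,
    # so guard before asking for the text.
    title = soup.title.get_text() if soup.title else None
    comment_tag = soup.find(id='comment')
    comment = comment_tag.get_text() if comment_tag else None

    # class_='content' is the keyword form of {'class': 'content'}.
    paragraphs = [p.get_text() for p in soup.find_all('p', class_='content')]

    # href=True keeps only the <a> tags that actually have an href attribute.
    links = [a['href'] for a in soup.find_all('a', href=True)]

    return {
        'title': title,
        'paragraphs': paragraphs,
        'links': links,
        'comment': comment,
    }


summary = summarize(html)
print(summary)
```

The None checks matter on real pages: when a tag is missing, soup.title and soup.find() return None, and calling get_text() on None raises an AttributeError.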

For more details about bs4, read the official Beautiful Soup documentation.
