Parsing HTML in Python gives you easy access to the tags and attributes of an HTML document. For example, you can find and fetch an element by its id or class attribute, by its tag name, and so on.
In Python, you can parse HTML in the following ways.
Use BeautifulSoup4 to Parse HTML
Install the BeautifulSoup4 library with pip:
pip install beautifulsoup4
First, call BeautifulSoup(html, 'html.parser') to create a BeautifulSoup object from html. Here, html is a string containing the HTML code.
from bs4 import BeautifulSoup
html = """<!DOCTYPE html>
<html>
<head>
<title>Gift Page</title>
</head>
<body>
<h1>My First Gift</h1>
<p class='content'>The gift was a car Toy. I liked it.</p>
<p class='content'>I played with the toy with my friend <a href='linkToRaju'>Raju</a>.</p>
<p id='comment'>Thanks</p>
</body>
</html>"""
soup = BeautifulSoup(html, 'html.parser')
Use the soup object to navigate, search, and access the data in the HTML.
>>> soup.title
# <title>Gift Page</title>
>>> soup.a
# <a href="linkToRaju">Raju</a>
>>> soup.a['href']
# 'linkToRaju'
>>> soup.find_all('p')
# [<p class="content">The gift was a car Toy. I liked it.</p>, <p class="content">I played with the toy with my friend <a href="linkToRaju">Raju</a>.</p>, <p id="comment">Thanks</p>]
>>> soup.find(id='comment')
# <p id="comment">Thanks</p>
>>> soup.find_all('p', {'class': 'content'})
# [<p class="content">The gift was a car Toy. I liked it.</p>, <p class="content">I played with the toy with my friend <a href="linkToRaju">Raju</a>.</p>]
>>> soup.find_all('p')[0].text
# 'The gift was a car Toy. I liked it.'
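The lookups above can also be combined in a small script. The following is a minimal sketch, not a definitive implementation: summarize_page is a hypothetical helper name, the HTML is a shortened copy of the sample page, and select_one() is used as an optional CSS-selector alternative to find(id=...), assuming a recent BeautifulSoup4 release where CSS selector support is installed alongside the library.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample page used above.
html = """<html>
<head><title>Gift Page</title></head>
<body>
<p class='content'>The gift was a car Toy. I liked it.</p>
<p class='content'>I played with the toy with my friend <a href='linkToRaju'>Raju</a>.</p>
<p id='comment'>Thanks</p>
</body>
</html>"""


def summarize_page(markup):
    """Hypothetical helper: collect a few pieces of data from an HTML string."""
    soup = BeautifulSoup(markup, 'html.parser')

    # select_one() takes a CSS selector; '#comment' matches id="comment".
    comment = soup.select_one('#comment')

    return {
        # get_text() returns a tag's text with any nested tags stripped out.
        'title': soup.title.get_text() if soup.title else None,
        # class_ (with a trailing underscore) filters by the class attribute.
        'content': [p.get_text() for p in soup.find_all('p', class_='content')],
        # href=True keeps only <a> tags that actually have an href attribute.
        'links': [a['href'] for a in soup.find_all('a', href=True)],
        'comment': comment.get_text() if comment else None,
    }


summary = summarize_page(html)
for key, value in summary.items():
    print(key, '->', value)
```

Returning None when a tag is missing avoids the AttributeError you would otherwise get when calling get_text() on a failed lookup.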
For more details about bs4, see the official Beautiful Soup documentation.