Web Scraping in Python (CodeCademy

September 19, 2020

BeautifulSoup is a Python library that makes it easy for us to traverse an HTML page and pull out the parts we’re interested in. We can import it by using the line:

"html.parser" is one option for parsers we could use. There are other options, like "lxml" and "html5lib" that have different advantages and disadvantages, but for our purposes we will be using "html.parser" throughout.

With the requests skills we just learned, we can use a website hosted online as that HTML:

import requests
from bs4 import BeautifulSoup

webpage_response = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

print(soup)

BeautifulSoup breaks the HTML page into several types of objects.

Tags

A Tag corresponds to an HTML Tag in the original document. These lines of code:

soup = BeautifulSoup('<div id="example">An example div</div><p>An example p tag</p>')
print(soup.div)

Would produce output that looks like:

<div id="example">An example div</div>

Accessing a tag from the BeautifulSoup object in this way will get the first tag of that type on the page.

You can get the name of the tag using .name and a dictionary representing the attributes of the tag using .attrs:

print(soup.div.name)
print(soup.div.attrs)

div
{'id': 'example'}

NavigableStrings

NavigableStrings are the pieces of text that are in the HTML tags on the page. You can get the string inside of the tag by calling .string:

print(soup.div.string)

An example div

Search This Blog

Python Data Science Coding Reference