Web Scraping in Python (Codecademy)
BeautifulSoup is a Python library that makes it easy to traverse an HTML page and pull out the parts we’re interested in. We can import it with the line:
from bs4 import BeautifulSoup
When we create a BeautifulSoup object, the second argument names the parser. "html.parser" is one option for parsers we could use. There are other options, like "lxml" and "html5lib", that have different advantages and disadvantages, but for our purposes we will be using "html.parser" throughout.
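As a minimal sketch of the parser argument, the snippet below builds a BeautifulSoup object from an HTML string using "html.parser". The HTML string and the variable names are made up for illustration and do not come from the original lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML document, held in a string for illustration.
html = "<html><body><h1>Colors</h1><p>Red, orange, yellow</p></body></html>"

# The second argument selects the parser; "html.parser" ships with Python.
soup = BeautifulSoup(html, "html.parser")

print(soup.h1)  # <h1>Colors</h1>
print(soup.p)   # <p>Red, orange, yellow</p>
```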
With the requests skills we just learned, we can also fetch a page hosted online and pass its content to BeautifulSoup as the HTML.
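One way this could look is sketched below. The helper name soup_from_url, the timeout value, and the example URL in the comment are assumptions made for illustration, not part of the original lesson.

```python
import requests
from bs4 import BeautifulSoup


def soup_from_url(url):
    """Download a page and parse it with html.parser (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    # response.content holds the raw bytes of the page.
    return BeautifulSoup(response.content, "html.parser")


# Example call (needs network access; the URL is a placeholder):
# soup = soup_from_url("https://example.com")
# print(soup.title)
```

Wrapping the download in a function keeps the network call separate from the parsing, so the same soup-handling code works whether the HTML comes from a string or from a website.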
BeautifulSoup breaks the HTML page into several types of objects.
Tags
A Tag object corresponds to an HTML tag in the original document. These lines of code:
soup = BeautifulSoup('<div id="example">An example div</div><p>An example p tag</p>', "html.parser")
print(soup.div)
Would produce output that looks like:
<div id="example">An example div</div>
Accessing a tag from the BeautifulSoup object in this way will get the first tag of that type on the page.
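To illustrate the "first tag only" behavior, here is a small sketch with two div tags. The HTML is invented for the example, and find_all() is a BeautifulSoup method that the text above does not cover; it is shown only for contrast.

```python
from bs4 import BeautifulSoup

html = '<div id="first">First div</div><div id="second">Second div</div>'
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns only the first matching tag.
print(soup.div)  # <div id="first">First div</div>

# find_all() returns every matching tag, in document order.
for div in soup.find_all("div"):
    print(div)
```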
You can get the name of the tag using .name and a dictionary representing the attributes of the tag using .attrs:
print(soup.div.name)
print(soup.div.attrs)
This prints:
div
{'id': 'example'}
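Individual attributes can also be read directly from a tag. The sketch below assumes a div with an extra class attribute, added here only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="example" class="box large">An example div</div>',
    "html.parser",
)
div = soup.div

# Index the tag like a dictionary to read one attribute.
print(div["id"])         # example

# .get() returns None instead of raising KeyError for a missing attribute.
print(div.get("title"))  # None

# Multi-valued attributes such as class come back as a list.
print(div["class"])      # ['box', 'large']
```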
NavigableStrings
NavigableStrings are the pieces of text inside the HTML tags on the page. You can get the string inside a tag by accessing its .string attribute:
print(soup.div.string)
This prints:
An example div
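One detail worth knowing: .string only returns text when the tag has a single child. The sketch below uses made-up HTML to show the difference, with get_text() (a method not covered above) as one possible way to collect all the text inside a tag.

```python
from bs4 import BeautifulSoup

html = '<div id="one">Just text</div><div id="two">Text and <b>a child tag</b></div>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("div")

# A tag with a single text child: .string is a NavigableString.
print(first.string)        # Just text
print(type(first.string))

# A tag with several children: .string is None.
print(second.string)       # None

# get_text() joins all the text found inside the tag.
print(second.get_text())   # Text and a child tag
```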