Web Scraping Navigation in Tree Data Structures

To navigate through a tree, we can access tags by name as attributes of the soup object. Imagine we have an HTML page that looks like this:


<h1>World's Best Chocolate Chip Cookies</h1>
<div class="banner">
  <h1>Ingredients</h1>
</div>
<ul>
  <li> 1 cup flour </li>
  <li> 1/2 cup sugar </li>
  <li> 2 tbsp oil </li>
  <li> 1/2 tsp baking soda </li>
  <li> 1/2 cup chocolate chips </li>
  <li> 1/2 tsp vanilla </li>
  <li> 2 tbsp milk </li>
</ul>

If we make a soup object out of this HTML page, we can get the first h1 element by calling:

print(soup.h1)
<h1>World's Best Chocolate Chip Cookies</h1>

We can get the children of a tag by accessing the .children attribute:

for child in soup.ul.children:
    print(child)

<li> 1 cup flour </li>
<li> 1/2 cup sugar </li>
<li> 2 tbsp oil </li>
<li> 1/2 tsp baking soda </li>
<li> 1/2 cup chocolate chips </li>
<li> 1/2 tsp vanilla </li>
<li> 2 tbsp milk </li>
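
Note that in real Beautiful Soup output, .children also yields the whitespace between tags as NavigableString objects. If we only want the tags themselves, a small sketch (using the same soup object) is to filter on the Tag type:

from bs4 import Tag

for child in soup.ul.children:
    # skip the whitespace-only NavigableStrings between tags
    if isinstance(child, Tag):
        print(child)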

We can also navigate up the tree of a tag by accessing the .parents attribute:

for parent in soup.li.parents:
    print(parent)

This loop will first print:

<ul>
  <li> 1 cup flour </li>
  <li> 1/2 cup sugar </li>
  <li> 2 tbsp oil </li>
  <li> 1/2 tsp baking soda </li>
  <li> 1/2 cup chocolate chips </li>
  <li> 1/2 tsp vanilla </li>
  <li> 2 tbsp milk </li>
</ul>

Then, it will print the tag that contains the ul (so, the body tag of the document). Then, it will print the tag that contains the body tag (so, the html tag of the document).
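
If we only want to see the chain of ancestors without printing each full subtree, a minimal sketch (again assuming the same soup object) is to print just the tag names:

for parent in soup.li.parents:
    print(parent.name)

This would print ul, then body, then html, and finally [document], which is the name Beautiful Soup gives to the soup object itself.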


Find All

If we want to find all of the occurrences of a tag, instead of just the first one, we can use .find_all().

This function can take in just the name of a tag and returns a list of all occurrences of that tag.

print(soup.find_all("h1"))
[<h1>World's Best Chocolate Chip Cookies</h1>, <h1>Ingredients</h1>]

.find_all() is far more flexible than just accessing elements directly through the soup object. With .find_all(), we can use regexes, attributes, or even functions to select HTML elements more intelligently.

Using Regex

What if we want every <ol> and every <ul> that the page contains? We can select both of these types of elements with a regex in our .find_all():

import re soup.find_all(re.compile("[ou]l"))

What if we want all of the h1 - h9 tags that the page contains? Regex to the rescue again!

import re soup.find_all(re.compile("h[1-9]"))
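
As a self-contained sketch of both regex calls (the HTML string here is made up for illustration):

import re
from bs4 import BeautifulSoup

html = "<ol><li>first</li></ol><ul><li>second</li></ul><h2>Heading</h2>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all(re.compile("[ou]l")))   # matches the ol and the ul
print(soup.find_all(re.compile("h[1-9]")))  # matches the h2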

Using Lists

We can also just specify all of the elements we want to find by supplying the function with a list of the tag names we are looking for:

soup.find_all(['h1', 'a', 'p'])

Using Attributes

We can also try to match the elements with relevant attributes. We can pass a dictionary to the attrs parameter of find_all with the desired attributes of the elements we’re looking for. If we want to find all of the elements with the "banner" class, for example, we could use the command:

soup.find_all(attrs={'class':'banner'})

Or, we can specify multiple different attributes! What if we wanted a tag with a "banner" class and the id "jumbotron"?

soup.find_all(attrs={'class':'banner', 'id':'jumbotron'})
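
Beautiful Soup also accepts attribute filters as keyword arguments on .find_all(); since class is a reserved word in Python, the keyword is spelled class_. This call is equivalent to the attrs version above:

soup.find_all(class_='banner', id='jumbotron')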

Using A Function

If our selection starts to get really complicated, we can separate out all of the logic that we’re using to choose a tag into its own function. Then, we can pass that function into .find_all()!

def has_banner_class_and_hello_world(tag):
    return tag.get('class') == ['banner'] and tag.string == "Hello world"

soup.find_all(has_banner_class_and_hello_world)

(Beautiful Soup returns the class attribute as a list, which is why we compare against ['banner'].)

This command would find an element that looks like this:

<div class="banner">Hello world</div>

but not an element that looks like this:

<div>Hello world</div>

Or this:

<div class="banner">What's up, world!</div>

Select for CSS Selectors

Another way to capture your desired elements with the soup object is to use CSS selectors. The .select() method will take in all of the CSS selectors you normally use in a .css file!

<h1 class='results'>Search Results for: <span class='searchTerm'>Funfetti</span></h1>
<div class='recipeLink'><a href="spaghetti.html">Funfetti Spaghetti</a></div>
<div class='recipeLink' id="selected"><a href="lasagna.html">Lasagna de Funfetti</a></div>
<div class='recipeLink'><a href="cupcakes.html">Funfetti Cupcakes</a></div>
<div class='recipeLink'><a href="pie.html">Pecan Funfetti Pie</a></div>

If we wanted to select all of the elements that have the class 'recipeLink', we could use the command:

soup.select(".recipeLink")

If we wanted to select the element that has the id 'selected', we could use the command:

soup.select("#selected")

Let’s say we wanted to loop through all of the links to these funfetti recipes that we found from our search.

for link in soup.select(".recipeLink > a"):
    webpage = requests.get(link["href"])
    new_soup = BeautifulSoup(webpage.content, "html.parser")

This loop will go through each link in each .recipeLink div and create a soup object out of the webpage it links to. So, it would first make soup out of the page at spaghetti.html, then the page at lasagna.html, and so on.
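
One caveat: the href values on this page (spaghetti.html, lasagna.html, and so on) are relative paths, so in practice they would need to be joined to the page's base URL before being requested. A sketch using urljoin, with a hypothetical base URL:

from urllib.parse import urljoin

base_url = "https://www.example.com/search/"  # hypothetical base URL of the results page

for link in soup.select(".recipeLink > a"):
    full_url = urljoin(base_url, link["href"])
    webpage = requests.get(full_url)
    new_soup = BeautifulSoup(webpage.content, "html.parser")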


Reading Text

When we use BeautifulSoup to select HTML elements, we often want to grab the text inside of the element, so that we can analyze it. We can use .get_text() to retrieve the text inside of whatever tag we want to call it on.

<h1 class="results">Search Results for: <span class='searchTerm'>Funfetti</span></h1>

If this is the HTML that has been used to create the soup object, we can make the call:

soup.get_text()

Which will return:

'Search Results for: Funfetti'

Notice that this combined the text inside of the outer h1 tag with the text contained in the span tag inside of it! Using get_text(), it looks like both of these strings are part of just one longer string. If we wanted to separate out the texts from different tags, we could specify a separator character. This command would use a | character to separate:

soup.get_text('|')

Now, the command returns:

'Search Results for: |Funfetti'
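
.get_text() also accepts a strip keyword argument that trims the whitespace around each piece of text, which pairs well with a separator:

soup.get_text('|', strip=True)

This would return 'Search Results for:|Funfetti'.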

Putting all of this together, here is a script that scrapes a page of turtle profiles, follows the link for each turtle, and collects its stats into a dictionary:

import requests
from bs4 import BeautifulSoup

prefix = "https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/"
webpage_response = requests.get(prefix + "shellter.html")

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

turtle_links = soup.find_all("a")
links = []
# go through all of the a tags and get the links associated with them
for a in turtle_links:
    links.append(prefix + a["href"])

# define turtle_data:
turtle_data = {}

# follow each link:
for link in links:
    webpage = requests.get(link)
    turtle = BeautifulSoup(webpage.content, "html.parser")
    turtle_name = turtle.select(".name")[0].get_text()

    stats = turtle.find("ul")
    stats_text = stats.get_text("|")
    turtle_data[turtle_name] = stats_text.split("|")

print(turtle_data)

Creating a Data Frame from Web Scraping

To analyze the scraped stats, we can load the turtle_data dictionary into a pandas DataFrame:

import pandas as pd

turtle_df = pd.DataFrame.from_dict(turtle_data, orient='index')
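
As a minimal sketch of what orient='index' does, with a couple of made-up entries standing in for the scraped dictionary, each key becomes a row label and each list becomes that row's values:

import pandas as pd

turtle_data = {"Aesop": ["AGE: 7 Years Old", "WEIGHT: 6 lbs"],
               "Caesar": ["AGE: 2 Years Old", "WEIGHT: 4 lbs"]}

turtle_df = pd.DataFrame.from_dict(turtle_data, orient='index')
print(turtle_df)  # one row per turtle, one column per stat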
    
