Web scraping using BeautifulSoup
- Assign the URL of interest to the variable url.
- Package the request to the URL, send the request and catch the response with a single function, requests.get(), assigning the response to the variable r.
- Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable html_doc.
- Create a BeautifulSoup object soup from the resulting HTML using the function BeautifulSoup().
- Use the method prettify() on soup and assign the result to pretty_soup.
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')
# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()
# Print the response
print(pretty_soup)
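The same steps can be tried without a network request. The sketch below parses a hard-coded HTML string in place of the live response text; the sample_html string is made up for illustration.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for the response text r.text
sample_html = '<html><head><title>Test Page</title></head><body><p>Hello</p></body></html>'

# Create a BeautifulSoup object, naming the parser explicitly
soup = BeautifulSoup(sample_html, 'html.parser')

# prettify() returns the parse tree as an indented, human-readable string
pretty_soup = soup.prettify()
print(pretty_soup)
```

Because prettify() works on any parse tree, this is a quick way to check that the HTML was parsed as expected before scraping a real page.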
- In the sample code, the HTML response object html_doc has already been created: your first task is to Soupify it using the function BeautifulSoup() and to assign the resulting soup to the variable soup.
- Extract the title from the HTML soup soup using the attribute title and assign the result to guido_title.
- Print the title of Guido's webpage to the shell using the print() function.
- Extract the text from the HTML soup soup using the method get_text() and assign to guido_text.
- Hit submit to print the text from Guido's webpage to the shell.
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the title of Guido's webpage: guido_title
guido_title = soup.title
# Print the title of Guido's webpage to the shell
print(guido_title)
# Get Guido's text: guido_text
guido_text = soup.get_text()
# Print Guido's text to the shell
print(guido_text)
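The difference between the title attribute and get_text() can be seen on a small inline document; the HTML string below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the fetched page
doc = '<html><head><title>My Page</title></head><body><p>Some text.</p></body></html>'
soup = BeautifulSoup(doc, 'html.parser')

# soup.title returns the whole <title> tag; .string gives just the text inside it
print(soup.title)
print(soup.title.string)

# get_text() strips all tags and returns only the document's text content
print(soup.get_text())
```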
- Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag <a> but passed to find_all() without angle brackets; store the result in the variable a_tags.
- The variable a_tags is a result set: your job now is to enumerate over it, using a for loop, and to print the actual URLs of the hyperlinks; to do this, for every element link in a_tags, you want to print(link.get('href')).
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')
# Print the title of Guido's webpage
print(soup.title)
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')
# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
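Note that href attributes are often relative to the page they appear on. A minimal sketch, using a made-up HTML string, of extracting the hrefs with find_all() and then resolving relative ones against the page URL with the standard-library urljoin():

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up HTML with one absolute and one relative hyperlink
doc = '<html><body><a href="https://example.com/a">A</a><a href="images/b.html">B</a></body></html>'
soup = BeautifulSoup(doc, 'html.parser')

base_url = 'https://www.python.org/~guido/'
a_tags = soup.find_all('a')

# Collect the raw href values, in document order
urls = [link.get('href') for link in a_tags]
print(urls)

# Resolve relative hrefs against the page URL; absolute ones pass through unchanged
absolute = [urljoin(base_url, u) for u in urls]
print(absolute)
```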