Web Scraping using BeautifulSoup
- Assign the URL of interest to the variable url.
- Package the request to the URL, send the request and catch the response with a single function, requests.get(), assigning the response to the variable r.
- Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable html_doc.
- Create a BeautifulSoup object soup from the resulting HTML using the function BeautifulSoup().
- Use the method prettify() on soup and assign the result to pretty_soup.
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)
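As an aside, not part of the exercise: a minimal sketch of the same request with basic error handling and an explicitly chosen parser. The timeout value and the 'html.parser' argument are assumptions here, not requirements of the code above.

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url (same page as above)
url = 'https://www.python.org/~guido/'

# Send the request; raise_for_status() turns HTTP error codes (4xx/5xx) into exceptions
r = requests.get(url, timeout=10)
r.raise_for_status()

# Soupify, naming the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(r.text, 'html.parser')

# Prettify and print, exactly as above
print(soup.prettify())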
- In the sample code, the HTML response object html_doc has already been created: your first task is to Soupify it using the function BeautifulSoup() and to assign the resulting soup to the variable soup.
- Extract the title from the HTML soup soup using the attribute title and assign the result to guido_title.
- Print the title of Guido's webpage to the shell using the print() function.
- Extract the text from the HTML soup soup using the method get_text() and assign to guido_text.
- Hit submit to print the text from Guido's webpage to the shell.
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)
# Get the title of Guido's webpage: guido_title
guido_title = soup.title
# Print the title of Guido's webpage to the shell
print(guido_title)
# Get Guido's text: guido_text
guido_text = soup.get_text()
# Print Guido's text to the shell
print(guido_text)
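A quick follow-up sketch (the parser choice is an assumption): soup.title returns the whole <title> tag, while its string attribute or get_text() method returns just the text inside it.

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.python.org/~guido/').text, 'html.parser')

print(soup.title)             # the tag itself, including <title> ... </title>
print(soup.title.string)      # only the text inside the tag
print(soup.title.get_text())  # same result, via the method used above on the whole document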
- Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag <a> but passed to find_all() without angle brackets; store the result in the variable a_tags.
- The variable a_tags is a results set: your job now is to enumerate over it, using a for loop, and to print the actual URLs of the hyperlinks; to do this, for every element link in a_tags, you want to print() link.get('href').
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Print the title of Guido's webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
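One caveat with the loop above: some href values on a page are relative paths rather than full URLs, and some <a> tags have no href at all, in which case link.get('href') returns None. A hedged sketch of how you might handle both using urllib.parse.urljoin; this is an extension, not part of the exercise.

# Import packages
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Fetch and Soupify the same page as above
url = 'https://www.python.org/~guido/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Skip <a> tags without an href, and resolve relative links against the page URL
for link in soup.find_all('a'):
    href = link.get('href')
    if href is not None:
        print(urljoin(url, href))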