Overview of Web Scraping with Python’s BeautifulSoup and requests libraries
Web scraping is the process of collecting structured data from websites. Suppose you want to monitor the price of a specific stock on a website and store it on your computer. Copying the price from the website every day would be tedious; this is where web scraping comes into play and makes collecting the data easier.
In web scraping we read the source code of a web page (mainly HTML, styled with CSS) and ‘scrape’ the data that we want to store or process.
Web pages are built with HTML and usually styled with CSS, so to scrape a website we need to know the basics of both.
To scrape data from a website using Python we first need the BeautifulSoup and requests libraries.
We can install BeautifulSoup using the following command:-
pip install beautifulsoup4
and requests using:-
pip install requests
In our actual code we need to import them using:-
from bs4 import BeautifulSoup
import requests
Using the requests library we send an HTTP request to a URL and fetch the source code of the web page.
To be precise, we use the requests.get() function, which takes the URL of a web page as a parameter. For example:-
url = requests.get('http://quotes.toscrape.com/')
requests.get() returns an object of type <class 'requests.models.Response'>. To get the source code we read its .text attribute.
url_text = url.text
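A request can fail or hang, so it may help to check the response before using its text. The following is a minimal sketch, not part of the original example: the helper name fetch_html and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML source of `url`, raising an error if the request fails."""
    # timeout keeps the call from waiting forever if the server never answers
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html('http://quotes.toscrape.com/') would return the same string as url.text above, but a 404 or 500 response would raise an exception instead of silently returning an error page.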
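Now we use the BeautifulSoup library to convert the source code into a BeautifulSoup object, which breaks the source code down into tags and makes its data easy to access. The second argument names the parser: 'lxml' must be installed separately with pip install lxml, while Python's built-in 'html.parser' needs no extra install. For example:-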
soup = BeautifulSoup(url_text, 'lxml')
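To get a feel for what the BeautifulSoup object offers, here is a small sketch that parses a hand-written HTML string instead of a live page. The HTML and the variable names are made up for illustration, and it uses the built-in 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the text returned by requests
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 id="main">Hello</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# 'html.parser' ships with Python, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                      # text inside <title>
heading = soup.find("h1").get_text()                # text of the first <h1>
intro = soup.find("p", class_="intro").get_text()   # first <p> with class "intro"
paragraph_count = len(soup.find_all("p"))           # number of <p> tags

print(page_title, heading, intro, paragraph_count)
```

find() returns the first matching tag, while find_all() returns a list of every match.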
Now, using the soup variable above, we can extract the data we want from the web page.
The web page we are using as an example is the homepage of quotes.toscrape.com, which was made specifically for practicing web scraping.
Now suppose we want all the quotes on the web page. To scrape them we first need to know the exact HTML tag that contains each quote. Instead of reading the entire source code to locate that tag, we can ‘inspect’ the specific part of the web page in the browser.
To locate the tag that contains a specific piece of data, right-click that part of the page and select the ‘Inspect’ option.
After inspecting a quote we find that it sits in a ‘span’ tag with the class ‘text’.
for quote in soup.find_all('span', class_='text'):
    print(quote.text)
In the above for loop we iterate over all the ‘span’ tags that have the class ‘text’ (written class_='text' because class is a reserved word in Python). The quote variable holds the tag itself rather than the quote; to print the actual quote we read the tag's .text attribute.
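The same pattern extends to pulling several related fields from each item. The sketch below runs on a hand-written HTML string; the markup (a ‘div’ with class ‘quote’ wrapping a ‘span’ with class ‘text’ and a ‘small’ tag with class ‘author’) is an assumption made for illustration, so confirm the real tag and class names with ‘Inspect’ before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for two quote blocks; the real page may differ
html = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <span>by <small class="author">Jane Doe</small></span>
</div>
<div class="quote">
  <span class="text">Another sample quote.</span>
  <span>by <small class="author">John Roe</small></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

quotes = []
for block in soup.find_all("div", class_="quote"):
    # Search inside each block so a quote stays paired with its own author
    text_tag = block.find("span", class_="text")
    author_tag = block.find("small", class_="author")
    # find() returns None when nothing matches, so guard before reading text
    if text_tag is None or author_tag is None:
        continue
    quotes.append((text_tag.get_text(strip=True), author_tag.get_text(strip=True)))

for text, author in quotes:
    print(f"{text} - {author}")
```

Searching within each block, rather than across the whole page, keeps the fields aligned even if one item is missing a tag.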
Thus the full code of our example will be:-
from bs4 import BeautifulSoup
import requests

url = requests.get('http://quotes.toscrape.com/')
url_text = url.text

soup = BeautifulSoup(url_text, 'lxml')

for quote in soup.find_all('span', class_='text'):
    print(quote.text)
Output:-
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
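Since the goal is often to store the scraped data rather than just print it, one option is to write the quotes to a CSV file with Python's built-in csv module. This is a minimal sketch: the function name write_quotes_csv and the sample data are assumptions, and an in-memory buffer stands in for a real file.

```python
import csv
import io


def write_quotes_csv(quotes, file_obj):
    """Write one quote per row to an open text file object."""
    writer = csv.writer(file_obj)
    writer.writerow(["quote"])      # header row
    for quote in quotes:
        writer.writerow([quote])    # csv adds quoting where a value needs it


# Hypothetical sample data standing in for the scraped quotes
sample_quotes = ["First sample quote.", "Second, with a comma."]

buffer = io.StringIO()              # in-memory stand-in for a real file
write_quotes_csv(sample_quotes, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open('quotes.csv', 'w', newline='', encoding='utf-8'); the file name here is only an example.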
Note that the above is just a simple example of web scraping. Many websites are far more complicated, and scraping data from them is much more difficult.