Web scraping using Python
What is web scraping?
Web scraping is the process of extracting data from websites using automated programs. It can also be done manually, but automated programs are almost always preferred because they can harvest large amounts of data much faster.
What is Beautiful Soup?
Beautiful Soup is a Python package that helps extract data from the HTML / XML behind a web page. This article focuses on how to retrieve web data using Beautiful Soup 4 in Python.
Required Python libraries:
pip install requests
pip install beautifulsoup4
Let’s get started by importing the packages.
import requests
from bs4 import BeautifulSoup
requests.get(URL) sends a GET request to the URL and returns a response object whose content is the HTML of the web page. That content is then passed as an argument to BeautifulSoup(), which parses it.
BeautifulSoup() takes two arguments here: the first is the content of the response and the second is the name of the parser to use, in this case Python's built-in 'html.parser'. Let's take a quick look at that part below:
data = requests.get("https://www.cricbuzz.com/cricket-schedule/upcoming-series/international")
data = BeautifulSoup(data.content, 'html.parser')
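To see the two arguments in isolation, here is a minimal sketch that parses a small HTML string instead of a live response. The HTML, the class name "match" and the variable names are made up for illustration; they are not taken from the Cricbuzz page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the content of a real response.
html = """
<html>
  <head><title>Upcoming matches</title></head>
  <body><a class="match" href="/m/1">India vs Australia</a></body>
</html>
"""

# First argument: the markup to parse. Second argument: the parser name.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string   # text inside the <title> tag
first_link = soup.find("a")      # first <a> element in the document

print(page_title)
print(first_link["href"], first_link.get_text())
```

The same calls work on data.content from requests.get(), since BeautifulSoup() accepts both strings and bytes as markup.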
The URL I used lists all the upcoming international cricket matches, which is the data we are trying to retrieve with the Beautiful Soup library.
Now the variable data holds the parsed HTML of the entire page. The findAll() method (an older spelling of find_all()) returns all the elements that match a given condition. Let’s use it to narrow the search by tag name and class:
cricket_schedule = data.findAll('a', {'class':'cb-col-33 cb-col cb-mtchs-dy text-bold'})
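One thing to keep in mind with a multi-class string like the one above: as far as I understand Beautiful Soup's matching rules, a string containing several classes is compared against the exact value of the class attribute, so the same classes in a different order would not match. The sketch below illustrates this on made-up HTML and shows select() with a CSS selector as an order-independent alternative. The class names are borrowed from the example above, but the markup itself is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes, different order, plus an unrelated link.
html = """
<a class="cb-col-33 cb-col text-bold" href="/a">Series A</a>
<a class="cb-col text-bold cb-col-33" href="/b">Series B</a>
<a class="cb-col" href="/c">Other link</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Full class string: matches only when the attribute value is exactly this string.
exact = soup.find_all("a", {"class": "cb-col-33 cb-col text-bold"})

# CSS selector: matches any <a> that carries all three classes, in any order.
by_selector = soup.select("a.cb-col-33.cb-col.text-bold")

exact_texts = [a.get_text() for a in exact]
selector_texts = [a.get_text() for a in by_selector]

print(exact_texts)
print(selector_texts)
```

If the site ever reorders or adds classes, the selector form is likely to keep working where the exact string would silently return nothing.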
cricket_schedule will hold all the HTML elements related to the international matches. Let us now loop over them and print the text of each element.
for matches in cricket_schedule:
    print(matches.text)
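Printing is fine for a quick look, but you will often want to keep the results. Here is a small sketch that collects the text and link of each element into a list of dictionaries. The HTML and the class name "series" are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with stray whitespace around the first link text.
html = """
<div>
  <a class="series" href="/series/1">  India tour of England  </a>
  <a class="series" href="/series/2">Asia Cup</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for link in soup.find_all("a", class_="series"):
    rows.append({
        "title": link.get_text(strip=True),  # text with surrounding whitespace removed
        "url": link.get("href"),             # None if the attribute is missing
    })

for row in rows:
    print(row["title"], row["url"])
```

get_text(strip=True) is used here instead of .text so that leading and trailing whitespace from the page layout does not end up in the data.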
Looks clean and good, let’s run it and see the magic.
Below is the entire piece of code we used:
import requests
from bs4 import BeautifulSoup
data = requests.get("https://www.cricbuzz.com/cricket-schedule/upcoming-series/international")
data = BeautifulSoup(data.content, 'html.parser')
cricket_schedule = data.findAll('a', {'class':'cb-col-33 cb-col cb-mtchs-dy text-bold'})
for matches in cricket_schedule:
    print(matches.text)
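The script above assumes the request always succeeds. As a rough sketch of a slightly more defensive version, the code below splits the work into two functions: one that downloads a page with a timeout and fails loudly on HTTP errors, and one that extracts link text from HTML. The function names fetch_html and extract_link_texts, the 10 second timeout and the sample markup are my own choices for illustration, not part of either library.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


def extract_link_texts(html, class_name):
    """Return the stripped text of every <a> that has the given class."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a", class_=class_name)]


# Offline check on made-up HTML, so no network access is needed here.
sample = '<a class="match">Final</a><a class="news">Report</a><a class="match"> Semi </a>'
print(extract_link_texts(sample, "match"))
```

Keeping the download and the parsing apart makes it easy to test the parsing on saved HTML, and to call fetch_html() on the real URL only when you actually need fresh data.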
Happy coding!