Web scraping using Python

Gokul Menon
2 min read · Dec 5, 2020


What is web scraping?

Web scraping is the process of extracting data from websites using automated programs. It can also be done manually, but automated programs are almost always preferred because they can harvest large amounts of data much faster.

What is Beautiful Soup?

Beautiful Soup is a Python package that helps extract data from the HTML or XML behind a web page. This article focuses on how to retrieve web data using Beautiful Soup 4 in Python.

Required Python libraries:

pip install requests
pip install beautifulsoup4

Let’s get started by importing the packages.

import requests
from bs4 import BeautifulSoup

requests.get(URL) sends an HTTP GET request to the URL and returns a response object that contains the HTML of the web page. We then pass the content of that response to BeautifulSoup() so it can be parsed.
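As a minimal sketch of that first step, the helper below wraps requests.get() with a timeout and a status check. The name fetch_html and the 10-second timeout are assumptions made for illustration and are not part of the code in this article.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text (hypothetical helper for illustration)."""
    # A timeout keeps the script from hanging forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

Calling fetch_html() with the schedule URL would return the page source as a string, which BeautifulSoup() accepts just like the raw response content.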

BeautifulSoup() takes two arguments here: the markup to parse (the response content) and the name of the parser to use, such as 'html.parser'. Let’s take a quick look at that part below:

data = requests.get("https://www.cricbuzz.com/cricket-schedule/upcoming-series/international")

data = BeautifulSoup(data.content, 'html.parser')
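To see what BeautifulSoup() returns without depending on a live site, here is a small sketch that parses a hard-coded HTML string. The markup is made up for illustration and is not taken from the Cricbuzz page.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; a real page would come from requests.get().
sample_html = """
<html>
  <head><title>Upcoming matches</title></head>
  <body><a class="match" href="/m/1">India vs Australia</a></body>
</html>
"""

# 'html.parser' is the parser bundled with the Python standard library.
soup = BeautifulSoup(sample_html, "html.parser")

print(soup.title.string)  # text inside the <title> tag
print(soup.a["href"])     # href attribute of the first <a> tag
```

The parsed object can be navigated by tag name, as shown, or searched with methods that return every matching element.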

The URL I used lists the upcoming international cricket matches, which is the data we are trying to retrieve with the Beautiful Soup library.

Now the variable data holds the parsed HTML document. The findAll() method returns all the elements that match a given condition (the preferred spelling in Beautiful Soup 4 is find_all(); findAll() is a legacy alias for it). Let’s use it to narrow the search by giving a tag name and a class:

cricket_schedule = data.findAll('a', {'class':'cb-col-33 cb-col cb-mtchs-dy text-bold'})

cricket_schedule holds all the HTML elements related to the international matches. Let’s now loop over them and print the text of each element.

for match in cricket_schedule:
    print(match.text)

Looks clean and good. Let’s run it and see the magic.
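The actual output depends on what the live page contains when you run the script. As an offline sketch of the same find-and-print step, the example below searches a made-up HTML snippet; the class names and link texts are hypothetical. It also illustrates one detail worth knowing: a multi-class string passed to find_all() is compared against the whole class attribute, so the order of the classes matters, while a CSS selector passed to select() matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the class names are hypothetical.
sample_html = """
<div>
  <a class="series text-bold" href="/s/1">India tour of Australia</a>
  <a class="text-bold series" href="/s/2">England tour of South Africa</a>
  <a class="news" href="/n/1">Latest news</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A multi-class string is compared against the whole class attribute,
# so only the first link (classes in exactly this order) matches.
exact = soup.find_all("a", {"class": "series text-bold"})

# A CSS selector matches both classes regardless of their order.
any_order = soup.select("a.series.text-bold")

for link in any_order:
    print(link.text)
```

If a site reorders or adds classes on its elements, the exact-string form stops matching, so the selector form may be the more robust choice.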

Below is the entire piece of code we used:

import requests
from bs4 import BeautifulSoup

data = requests.get("https://www.cricbuzz.com/cricket-schedule/upcoming-series/international")

data = BeautifulSoup(data.content, 'html.parser')
cricket_schedule = data.findAll('a', {'class':'cb-col-33 cb-col cb-mtchs-dy text-bold'})

for match in cricket_schedule:
    print(match.text)

Happy coding!


Written by Gokul Menon

Engineering at FreeNow Tech.
