
Practical Web Scraping with Python and the Requests Library: Your Gateway to Data Freedom

This is the first blog in a series on practically scraping data in real-world scenarios. Web scraping is a valuable skill for extracting data from websites. With the Python programming language and the Requests library, you can easily retrieve web content and use it for various applications, from data analysis to creating custom datasets. In this blog, we’ll walk you through the basics of web scraping using the requests library.

Prerequisites

Before you get started with web scraping, make sure you have the following:

  • Python: Make sure you have Python installed on your system.
  • Requests Library: Install the requests library using pip install requests.
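
If you want to confirm everything is in place, a quick sanity check is to import the library and print its version. A minimal sketch:

import requests

# Print the installed Requests version to confirm the library is available
print(requests.__version__)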

Web Scraping Basics

Web scraping involves sending HTTP requests to a website and extracting the DOM structure of the web page. The requests library in Python simplifies this process. Once you have the DOM, you can get the data with different selectors, such as XPath and CSS selectors, which we will cover in upcoming blogs. Now we will walk you through a step-by-step guide to extracting data from a website.

In a previous post, Investigate a Site before Scrape, I covered the points you should check before diving into web scraping; keep those points in mind and inspect the site before you start. If you haven’t read it yet, check it out first. For this first practical blog, I am taking a very simple, static webpage to crawl. Because I know the site is static, the requests library is enough to get started.
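
As one example of that pre-scrape inspection, you can fetch the site’s robots.txt to see which paths crawlers are asked to avoid. This is a minimal sketch, assuming the site serves a robots.txt at the conventional location:

import requests

# By convention, robots.txt sits at the root of the site
robots = requests.get("https://quotes.toscrape.com/robots.txt", timeout=10)

if robots.status_code == 200:
    print(robots.text)  # rules telling crawlers which paths to avoid
else:
    print("No robots.txt found at this site")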

Import the Requests Library

import requests

Send HTTP Request to the Site

url = "https://quotes.toscrape.com/"
response = requests.get(url=url)
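
In real-world scraping it is also wise to pass a timeout so the request cannot hang forever, and many sites expect a User-Agent header. Here is a hedged sketch; the header string and the 10-second timeout are illustrative values, not required for this simple site:

# Identify the client and avoid waiting indefinitely for a response
headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/1.0)"}  # illustrative value
response = requests.get(url, headers=headers, timeout=10)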

Check the response

The first step is to check the response of the request you made to the site. There are many status codes, such as 200 and 403, but 200 is the successful response. You can check it like this:

if response.status_code == 200:
    print('Request successful')
else:
    print('Request failed')

This is how to check the response of the GET request. If you are working in a Jupyter notebook, you can also simply display the response variable; it prints the status code, so you can see at a glance whether the request succeeded.
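
If you prefer an exception over an if/else check, the requests library also provides raise_for_status(), which raises an HTTPError for 4xx and 5xx responses:

# Raises requests.exceptions.HTTPError if the status code signals an error
response.raise_for_status()
print('Request successful')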

Parse the Web Page

Once you have a successful response, you can parse the web page content. You can access the page’s HTML content using response.text and store it in a variable.

html_content = response.text

Your variable now holds the full DOM of the page, and you can parse it with different methods; the ones I prefer and use most are XPath and CSS selectors. But first we will go with Beautiful Soup, and as we slowly move toward more advanced techniques, I will teach you those as well.

Use HTML Parsing Libraries

To extract specific data from the HTML, you can use HTML parsing libraries like Beautiful Soup or lxml. Install Beautiful Soup using pip:

pip install beautifulsoup4

Here’s a simple example using Beautiful Soup to extract all the quotes from the web page. This page lists quotes by different authors. Let’s have a look at the webpage and inspect it.

Now let me explain what is going on. The quotes available on the web page are marked with 1 in the image. When I inspect one, there is a div with the class “quote”, marked with 2 in the image, through which we can get the quotes; inside that div there is a span tag, marked with 3. So to get the quotes, we first select the elements with the quote class, then find the span tag inside each one and take its text.

See how we can achieve this with code

from bs4 import BeautifulSoup

# Parse the HTML, then select every element with the "quote" class
soup = BeautifulSoup(html_content, 'html.parser')
quotes = soup.find_all(class_="quote")

The quotes variable now holds the markup for every quote on the page; what we want is the text of each quote, so we iterate over them and extract it.

for quote in quotes:
    # The first span inside each quote div holds the quote text
    heading = quote.find("span").text
    print(heading)

Now we have all the data we wanted.
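
You can pull out more than the quote text with the same pattern. On this particular page, each quote div also contains a small tag with the class “author”; a short sketch, assuming that structure holds:

for quote in quotes:
    text = quote.find("span").text
    # The author name sits in a <small class="author"> tag inside the same quote div
    author = quote.find("small", class_="author").text
    print(f"{text} - {author}")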

Handling Data

Once you’ve extracted data, you can save it to a file, analyze it, or use it for your specific application.
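
For example, a common next step is to write the scraped quotes to a CSV file. This is a minimal sketch, reusing the quotes variable from above and assuming you only need the quote text:

import csv

# Write one quote per row, with a header row first
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["quote"])
    for quote in quotes:
        writer.writerow([quote.find("span").text])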

Conclusion

Web scraping with Python and the requests library is a valuable skill for data collection and analysis. With the right techniques and tools, you can gather data from the web and use it for various purposes. Just remember to scrape responsibly and respect the websites you interact with.

Happy scraping!

About the Author

Syed Hanad Muqadar

Senior Data Scientist

03405021360 / syedhanad786@gmail.com
