January 20, 2024 · 4 minutes
Web scraping, the process of extracting data from websites, can be a powerful tool for gathering information from the internet.
In this article, we’ll explore how to scrape Google search results using Python, BeautifulSoup, and other tools. We’ll break down a specific code example and discuss crafting effective selectors for web scraping.
Finally, we’ll address the limitations and challenges of such projects.
Before diving into the code, ensure you have Python installed on your system. Additionally, you’ll need to install a few libraries: requests, beautifulsoup4, and rich.
You can install these libraries using pip:
pip install requests beautifulsoup4 rich
The provided Python script is structured to extract and display Google search results. Let’s go through it step by step:
import requests
from bs4 import BeautifulSoup
from rich import print
from urllib.parse import urlparse
from urllib.parse import parse_qs
This section imports the necessary modules. requests is used for making HTTP requests, BeautifulSoup for parsing HTML, and rich for enhanced printing. urlparse and parse_qs from urllib.parse are used for URL parsing.
query = "Python programming"
url = f"https://www.google.com/search?q={query}"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Here, the script constructs a Google search URL for a given query, makes an HTTP request to that URL, and then parses the response using BeautifulSoup.
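The f-string above leaves any escaping of the query to requests. As a minimal sketch, the URL can also be built explicitly with the standard library so that spaces and reserved characters are encoded up front. The helper name build_search_url and the constant GOOGLE_SEARCH_URL are hypothetical and not part of the original script.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH_URL = "https://www.google.com/search"


def build_search_url(query):
    """Return a search URL with the query percent-encoded (hypothetical helper)."""
    # urlencode escapes spaces as '+' and reserved characters such as '&' or '+'
    return f"{GOOGLE_SEARCH_URL}?{urlencode({'q': query})}"


print(build_search_url("Python programming"))
# https://www.google.com/search?q=Python+programming
```

An alternative with the same effect is to pass params={'q': query} to requests.get and let requests assemble the query string.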
The next step involves accurately identifying the various sections of Google’s webpage. We are focusing on locating these specific parts:
Thus, for each section, we can extract the specific data we need. In my implementation, the extract_section function handles the parsing of these individual sections.
def extract_results(soup):
    main = soup.select_one("#main")
    # Guard against responses that do not contain #main
    if main is None:
        return []
    res = []
    for gdiv in main.select('.g, .fP1Qef'):
        res.append(extract_section(gdiv))
    return res
Then, we’re looking to extract data from each section, specifically title, link and description:
The key is to check whether each element you’re pulling data from is actually present. This avoids exceptions and prevents the program from crashing during extraction.
def extract_section(gdiv):
    # Getting our elements
    title = gdiv.select_one('h3')
    link = gdiv.select_one('a')
    # select_one() takes a CSS selector, whereas find() expects a tag name
    description = gdiv.select_one('.BNeawe')
    return {
        # Extract each value only if its element was found
        'title': title.text if title else None,
        'link': link['href'] if link else None,
        'description': description.text if description else None
    }
Let’s run it and boom:
[
    {
        'title': 'Welcome to Python.org',
        'link': '/url?q=https://www.python.org/&sa=U&ved=2ahUKEwj4wL76qPWDAxVEia8BHWCpAZEQFnoECAIQAg&usg=AOvVaw0ftcoYNT39iYF9FN9-DDSp',
        'description': None
    },
    {
        'title': 'Introduction to Python - W3Schools',
        'link': '/url?q=https://www.w3schools.com/python/python_intro.asp&sa=U&ved=2ahUKEwj4wL76qPWDAxVEia8BHWCpAZEQFnoECAYQAg&usg=AOvVaw1Y76DoERJLKhPAer6y6af0',
        'description': None
    },
    {
        'title': 'Learn Python Programming - Programiz',
        'link': '/url?q=https://www.programiz.com/python-programming&sa=U&ved=2ahUKEwj4wL76qPWDAxVEia8BHWCpAZEQFnoECAUQAg&usg=AOvVaw0fodl3yhXKlVH6jJnJge3j',
        'description': None
    },
    ...
]
Our links are still not quite right: the real destination is contained in the q parameter of Google’s redirect URL. Let’s implement extraction for that:
def extract_href(href):
    # The destination URL is stored in the q parameter of the href's query string
    url = urlparse(href)
    query = parse_qs(url.query)
    # Return None when there is no q parameter
    values = query.get('q')
    return values[0] if values else None
And add it to our extract_section:
def extract_section(gdiv):
    ...
    return {
        ...
        'link': extract_href(link['href']) if link else None,
        ...
    }
Re-run it and voila!
[
    {'title': 'Welcome to Python.org', 'link': 'https://www.python.org/', 'description': None},
    {
        'title': 'Introduction to Python - W3Schools',
        'link': 'https://www.w3schools.com/python/python_intro.asp',
        'description': None
    },
    {
        'title': 'Learn Python Programming - Programiz',
        'link': 'https://www.programiz.com/python-programming',
        'description': None
    },
    {'title': 'Python Programming Tutorials', 'link': 'https://pythonprogramming.net/', 'description': None},
    ...
]
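One caveat with reading the q parameter: any href that carries one is accepted, including Google’s own internal links such as /search?q=... A slightly stricter variant is sketched below. The name extract_target_url and the rule of accepting only /url paths with absolute http(s) targets are assumptions made for illustration, not part of the original script.

```python
from urllib.parse import parse_qs, urlparse


def extract_target_url(href):
    """Return the destination of a Google redirect link, or None (hypothetical helper)."""
    parsed = urlparse(href)
    # Assumption: result links look like /url?q=<target>&sa=...; skip anything else
    if parsed.path != "/url":
        return None
    values = parse_qs(parsed.query).get("q")
    if not values:
        return None
    target = values[0]
    # Keep only absolute http(s) destinations
    if urlparse(target).scheme not in ("http", "https"):
        return None
    return target


print(extract_target_url("/url?q=https://www.python.org/&sa=U"))  # https://www.python.org/
print(extract_target_url("/search?q=python+tutorial"))            # None
```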
Selectors are crucial in web scraping. They allow you to target specific elements in the HTML document. This script uses CSS selectors like #main, .g, and .fP1Qef to identify parts of the Google search results page.
These selectors are likely to break whenever Google updates its HTML structure.
The key to good selectors is to keep them simple: the more specific they are, the more likely they are to break on even a very slight change.
Write parent-dependent selectors like #main > .fP1Qef only when absolutely necessary.
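To illustrate why simpler selectors tend to survive markup changes, here is a small sketch run against two made-up HTML fragments. The markup, the wrapper class, and the helper name count_matches are invented for the example; only the two selectors come from the text above.

```python
from bs4 import BeautifulSoup

# Invented markup: the same result block before and after a hypothetical redesign
BEFORE = '<div id="main"><div class="fP1Qef"><h3>Title</h3></div></div>'
AFTER = (
    '<div id="main"><div class="wrapper">'
    '<div class="fP1Qef"><h3>Title</h3></div>'
    '</div></div>'
)


def count_matches(html, selector):
    """Count the elements in html matched by a CSS selector."""
    return len(BeautifulSoup(html, "html.parser").select(selector))


for label, html in (("before", BEFORE), ("after", AFTER)):
    simple = count_matches(html, ".fP1Qef")          # matches the block anywhere
    strict = count_matches(html, "#main > .fP1Qef")  # direct children of #main only
    print(label, simple, strict)
```

In this made-up case the simple selector still finds the block once the extra wrapper appears, while the child selector no longer matches anything.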
Google frequently updates its HTML structure. This means that a scraper can break easily and requires regular maintenance. The next step for you is to maintain those selectors when something breaks.
Frequent requests to a website from the same IP can lead to your IP being blocked. The next step is to add proxies to your code, which also lets you support multiple regions.
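As a rough sketch of proxy rotation, the snippet below hands out proxies in round-robin order. The proxy addresses are placeholders and the helper name next_proxies is hypothetical; the returned dictionary has the shape that requests accepts through its proxies argument.

```python
from itertools import cycle

# Placeholder addresses (assumption): replace with proxies you actually control
PROXIES = [
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
]

_rotation = cycle(PROXIES)


def next_proxies():
    """Return the next proxy in round-robin order as a requests-style mapping."""
    proxy = next(_rotation)
    return {"http": proxy, "https": proxy}


# Usage with requests (not executed here):
#   requests.get(url, proxies=next_proxies(), timeout=10)
print(next_proxies())
print(next_proxies())
```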
Scraped data might not always be reliable or accurate. You should always verify and validate the data obtained through web scraping by writing extensive tests for your code and running your implementation against numerous sample pages.
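A minimal validation sketch for the dictionaries returned by extract_section follows. The function name is_valid_result and the specific rules (a non-empty title and an absolute http(s) link) are assumptions, so adapt them to what your project needs.

```python
from urllib.parse import urlparse


def is_valid_result(result):
    """Check a scraped result dict for a usable title and link (hypothetical rules)."""
    title = result.get("title")
    link = result.get("link")
    if not title or not title.strip():
        return False
    if not link:
        return False
    parsed = urlparse(link)
    # Require an absolute http(s) URL with a host
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


results = [
    {"title": "Welcome to Python.org", "link": "https://www.python.org/", "description": None},
    {"title": None, "link": "/url?q=https://www.python.org/", "description": None},
]
# Keep only the entries that pass the checks
print([r for r in results if is_valid_result(r)])
```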
You are now able to integrate Google’s search results in your project!
If you want to learn more about web scraping and data extraction, or to discuss this article, join us on Discord!