If you need web scraping, Python is the go-to language. But knowing that it is the right language does not mean you should start coding from scratch. Instead, there are two very appealing options to choose from.
The first is Scrapy, a fully-featured Python framework for web scraping. The alternative is Beautiful Soup, a library for extracting data from HTML and XML documents.
Each has its own high and low points, and both provide a means of carrying out web scraping. The question becomes, “Which is best for your scenario?” As always, the right tool is dictated by the use case. A tool is only as good as its user and the job it’s doing.
What is Web Scraping and why is it important?
Before we go further down this road, we should give a brief intro to what “web scraping” is and why someone might be interested in it. The short answer: it is the process of extracting specified data from a web page. It goes by other names such as web harvesting, web data extraction, or simply data scraping.
A simple example would be a site or page that contains hundreds or even thousands of product listings. You can use a web scraping application to extract just the product names and price details, or to collect the contact listings of all vendors on a page.
Herein lies the use of such applications, which we can conveniently call “data mining”: they can sift through a page and present you with only the relevant data you seek.
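As a small sketch of what this looks like in practice, here is some made-up product markup (the class names and values are invented for illustration) sifted with Beautiful Soup, one of the two tools discussed below:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a product-listing page (hypothetical markup)
html = """
<div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
<div class="product"><span class="name">Gadget</span><span class="price">19.99</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.select("div.product"):
    # Pull out only the fields we care about: name and price
    name = product.select_one(".name").get_text()
    price = product.select_one(".price").get_text()
    print(name, price)
```

Run on a real listing page, the same pattern scales to hundreds or thousands of products.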
To understand which would be a better choice, we must first take a look at what each consists of and how they go about performing their tasks.
Scrapy - Python framework for web scraping
First, let’s take a look at Scrapy. It is, by definition, an open-source collaborative framework for extracting data from websites. It is extremely fast and is considered one of the most powerful scraping tools available.
Scrapy is built on top of Twisted, an asynchronous networking framework. This means it uses a non-blocking mechanism when sending requests: requests are issued as non-blocking I/O calls to the server, so Scrapy does not wait for one response before sending the next. But its advantages are not limited to this alone.
So what exactly can Scrapy do?
- To start off, it has built-in support for extracting data from HTML sources using XPath and CSS expressions.
- It is a portable library: written in Python, it runs on Linux, Windows, macOS, and BSD.
- It is easily extendable.
- It is often cited as being up to 20x faster than other libraries, though the actual speedup depends on the site and workload.
- It is both memory and CPU efficient.
- Plus, with a bit of creativity, you can build extensive and robust applications.
- There is also strong community support for developers, though the documentation is a bit light for beginners.
Scrapy Amazon reviews example:

import scrapy

class AmazonReviewsSpider(scrapy.Spider):
    name = 'amazon_reviews'
    allowed_domains = ['amazon.de']
    myBaseUrl = "https://www.amazon.de/Neues-Apple-MacBook-256GB-Speicherplatz/product-reviews/B07S58MJHK/?reviewerType=all_reviews&pageNumber="

    # Build the list of review pages to crawl (pages are numbered from 1)
    start_urls = []
    for n in range(1, 101):
        start_urls.append(myBaseUrl + str(n))

    def parse(self, response):
        data = response.css('#cm_cr-review_list')
        reviews = data.css('.review-rating')
        comments = data.css('.review-text')
        # Combining the results
        for n, review in enumerate(reviews):
            yield {
                'stars': ''.join(review.xpath('.//text()').extract()),
                'comment': ''.join(comments[n].xpath('.//text()').extract()),
            }
Based on: https://blog.datahut.co/scraping-amazon-reviews-python-scrapy/

Beautiful Soup - library for pulling data out of HTML and XML
Next, we have ”Beautiful Soup”. It is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
While the name sounds like something made by a hungry individual, it is, however, a very beautiful tool for web scrapers because of its core features. It helps the programmer quickly extract data from a given web page.
Using Beautiful Soup is not a one-stop solution; to get the most out of it, you will need to pair it with a few other libraries. In particular, a separate library is needed to make requests, because Beautiful Soup itself cannot talk to a server. This is usually handled by the popular Requests library (or urllib from the standard library), which makes the HTTP request on our behalf.
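As a sketch of this division of labor, Requests fetches the page and Beautiful Soup parses it (the helper names here are our own invention, and any real URL would be substituted in):

```python
import requests
from bs4 import BeautifulSoup

def parse_links(html):
    """Beautiful Soup's job: pull every link out of the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a")]

def fetch_links(url):
    """Requests' job: perform the HTTP call Beautiful Soup cannot do itself."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return parse_links(response.text)

# Example (hypothetical URL):
# fetch_links("https://example.com")
```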
After the HTML or XML data has been downloaded to our local machine, Beautiful Soup requires an external parser to parse it. The best-known parsers are lxml’s XML parser, lxml’s HTML parser, html5lib, and Python’s built-in html.parser.
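The parser is chosen via the second argument to the BeautifulSoup constructor. A minimal sketch, noting that html.parser ships with Python while lxml and html5lib must be installed separately:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p></body></html>"

# "html.parser" is the standard-library parser; "lxml" or "html5lib"
# could be passed here instead if those packages are installed.
soup = BeautifulSoup(html, "html.parser")
print(soup.p.get_text())
```

The parsers differ mainly in speed and in how leniently they repair broken markup, so the choice rarely changes code elsewhere.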
Some of the advantages of Beautiful Soup include:
It is easy for beginners to learn and master, even if you are migrating from another language. For example, if we want to extract all the links from a webpage, it can be done as simply as follows:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
for link in soup.find_all('a'):  # find all anchor tags
    print(link.get('href'))
In the above code, we are using html.parser to parse the content of html_doc. This simplicity is one of the strongest reasons for developers to use Beautiful Soup as a web scraping tool.
It has comprehensive documentation, which helps us learn things quickly.
It has good community support for figuring out issues that arise while working with the library.
Which is right for you?
By now, a few things should be clear. The first is that Scrapy is the more complete of the two tools if you are serious about web scraping or if you are dealing with a very large data set (a website with a large amount of data to be extracted).
But while it is a comprehensive tool, it is overkill if you just have a simple, bare-bones task to run. In that scenario, you need something simple and quick that requires minimal coding. Just because you have a hammer doesn’t mean everything becomes a nail.
The aim of any developer is to get the job done with minimal coding. Hence, if your case calls for a simple, quick solution, Beautiful Soup will serve you well. If, on the other hand, you are dealing with complex data sets and want to build a robust application around or featuring web scraping, then Scrapy is likely the better fit.