Web Scraping with Python: Collecting Data from the Modern Web
Introduction
This book seeks to put an end to many common questions and misconceptions
about web scraping, while providing a comprehensive guide to the most common
web-scraping tasks.
This book is designed to serve not only as an introduction to web scraping, but as a
comprehensive guide to scraping almost every type of data from the modern Web.
Although it uses the Python programming language, and covers many Python basics,
it should not be used as an introduction to the language.
What Is Web Scraping?
The automated gathering of data from the Internet is nearly as old as the Internet
itself. Although web scraping is not a new term, in years past the practice has been
more commonly known as screen scraping, data mining, web harvesting, or similar
variations. General consensus today seems to favor web scraping, so that is the term
I’ll use throughout the book, although I will occasionally refer to the web-scraping
programs themselves as bots.
In theory, web scraping is the practice of gathering data through any means other
than a program interacting with an API (or, obviously, through a human using a web
browser). This is most commonly accomplished by writing an automated program
that queries a web server, requests data (usually in the form of the HTML and other
files that comprise web pages), and then parses that data to extract needed
information.
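The query, request, and parse steps described above can be sketched with nothing but Python's standard library. This is a minimal illustration rather than the approach the book itself uses: the class name `TitleAndLinkParser`, the helpers `fetch_html` and `extract`, and the sample HTML are all assumptions made for this example. The parsing step runs on an inline string so the sketch works without network access.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleAndLinkParser(HTMLParser):
    """Hypothetical parser: collects the page title and every <a href>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Query a web server and return the response body as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract(html):
    """Parse an HTML string and return (title, list of link targets)."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


# Made-up sample page, standing in for a real server response.
SAMPLE_HTML = """<html><head><title>Example Page</title></head>
<body><a href="/about">About</a> <a href="https://example.com/">Home</a></body></html>"""

title, links = extract(SAMPLE_HTML)
print(title)
print(links)
```

To run the same extraction against a live page, pass the result of `fetch_html(url)` to `extract` in place of `SAMPLE_HTML`.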
In practice, web scraping encompasses a wide variety of programming techniques
and technologies, such as data analysis and information security. This book will cover
the basics of web scraping and crawling (Part I), and delve into some of the advanced
topics in Part II.
Why Web Scraping?
If the only way you access the Internet is through a browser, you’re missing out on a
huge range of possibilities. Although browsers are handy for executing JavaScript,
displaying images, and arranging objects in a more human-readable format (among
other things), web scrapers are excellent at gathering and processing large amounts of
data (among other things). Rather than viewing one page at a time through the
narrow window of a monitor, you can view databases spanning thousands or even
millions of pages at once.
In addition, web scrapers can go places that traditional search engines cannot. A
Google search for “cheapest flights to Boston” will result in a slew of advertisements
and popular flight search sites. Google only knows what these websites say on their
content pages, not the exact results of various queries entered into a flight search
application. However, a well-developed web scraper can chart the cost of a flight to
Boston over time, across a variety of websites, and tell you the best time to buy your
ticket.
You might be asking: “Isn’t data gathering what APIs are for?” (If you’re unfamiliar
with APIs, see Chapter 4.) Well, APIs can be fantastic, if you find one that suits your
purposes. They can provide a convenient stream of well-formatted data from one
server to another. You can find an API for many different types of data you might
want to use, such as Twitter posts or Wikipedia pages. In general, it is preferable to use
an API (if one exists), rather than build a bot to get the same data. However, there are
several reasons why an API might not exist:
• You are gathering data across a collection of sites that do not have a cohesive API.
• The data you want is a fairly small, finite set that the webmaster did not think
warranted an API.
• The source does not have the infrastructure or technical ability to create an API.
Even when an API does exist, request volume and rate limits, the types of data, or the
format of data that it provides might be insufficient for your purposes.
This is where web scraping steps in. With few exceptions, if you can view it in your
browser, you can access it via a Python script. If you can access it in a script, you can
store it in a database. And if you can store it in a database, you can do virtually
anything with that data.
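As a rough sketch of the "store it in a database" step, the snippet below writes a few scraped records into SQLite using Python's built-in `sqlite3` module. The table name `pages`, the helper `store_pages`, and the sample records are assumptions chosen for illustration, not anything prescribed by the text. An in-memory database keeps the example self-contained.

```python
import sqlite3


def store_pages(conn, pages):
    """Insert (url, title) pairs, replacing any row that has the same URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", pages
    )
    conn.commit()


# ":memory:" creates a throwaway database; use a file path to persist data.
conn = sqlite3.connect(":memory:")

# Made-up records standing in for the output of a scraper.
scraped = [
    ("https://example.com/", "Example Domain"),
    ("https://example.com/about", "About"),
]
store_pages(conn, scraped)

rows = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
print(rows)
```

Making the URL the primary key means that scraping the same page again updates its row instead of adding a duplicate.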
There are obviously many extremely practical applications of having access to nearly
unlimited data: market forecasting, machine translation, and even medical
diagnostics have benefited tremendously from the ability to retrieve and analyze data
from news sites, translated texts, and health forums, respectively.
Even in the art world, web scraping has opened up new frontiers for creation. The
2006 project “We Feel Fine” by Jonathan Harris and Sep Kamvar scraped a variety of
English-language blog sites for phrases starting with “I feel” or “I am feeling.” This led
to a popular data visualization describing how the world was feeling day by day and
minute by minute.
Regardless of your field, there is almost always a way web scraping can guide business
practices more effectively, improve productivity, or even branch off into a brand-new
field entirely.