Web Scraping With Any Headless Browser: A Puppeteer Tutorial

Extracting data online for research has evolved significantly, especially with the emergence of adaptive web scraping techniques that automate what used to be manual data gathering.

Web Scraping With Any Headless Browser

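You can handle many data scraping jobs with a plain Hypertext Transfer Protocol (HTTP) client, which downloads a page’s HTML without running any of its scripts. On a dynamic website, however, much of the content is generated by JavaScript after the page loads, so an HTTP client alone often comes back without the data you need. A headless browser solves this by loading and rendering the page the way a regular browser does, which makes it well suited to scraping dynamic web pages.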
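To make the limitation concrete, here is a minimal sketch that assumes a made-up page whose only visible text is written by a script after load. The `rawHtml` string and the `visibleText` helper are hypothetical, and the regex-based tag stripping is a crude stand-in for what a plain HTTP client sees, not a real HTML parser.

```javascript
// Hypothetical response body, as a plain HTTP client would receive it.
// The text "Price: 42" only appears once a browser actually runs the script.
const rawHtml = [
  '<html><body>',
  '<div id="app"></div>',
  '<script>document.getElementById("app").textContent = "Price: 42";</script>',
  '</body></html>',
].join('\n');

// Crude illustration only: drop <script> blocks, then strip the remaining tags.
function visibleText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, '')
    .trim();
}

// The static markup contains no visible text, so this logs an empty string.
console.log(JSON.stringify(visibleText(rawHtml)));
```

Because nothing executes the script, the helper returns an empty string for this markup. A headless browser would run the script first and expose the rendered text.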

In this article, you’ll learn how to retrieve data online using a compatible headless web browser and Puppeteer. In short, it serves as an introductory Puppeteer tutorial on headless data extraction. If you wish to go further, Oxylabs’ website has a more in-depth Puppeteer tutorial.

Technical Terms Explained

The following subsections explain a few technical terms you’ll need throughout this tutorial.

i. Web Scraping

Web scraping is a structured way of collecting web data usually executed in an automated fashion. It is otherwise known as web harvesting or web data extraction by amateurs and professionals alike.

Web scraping is widely used in market research and news monitoring, among other applications.

ii. Headless Web Browser

Mainstream browsers such as Chrome have a graphical user interface (GUI), known in this context as the “head,” which makes them user-friendly for people. For automated work such as web scraping, however, the GUI is unnecessary. That is where the headless web browser comes in.

A headless browser doesn’t have a GUI. Instead, you control it through a command-line interface (CLI) or over a network connection. Headless mode can run on servers without a dedicated display, and it still loads pages and executes their JavaScript as a regular browser would.

It also allows you to run large-scale web application tests or navigate from one web page to another with no human operation.

iii. Puppeteer

Puppeteer is a Node.js library with a high-level application programming interface (API) that controls browsers such as Chrome and Chromium over the DevTools Protocol, the same protocol the browser’s built-in developer tools use. It runs the browser in headless mode by default.

Aside from automated web app testing, professionals and hobbyists also use Puppeteer for web scraping because it can load a page, run its scripts, and read the rendered result from a single script.
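As a rough illustration of Puppeteer used for scraping, the sketch below collects the unique link targets found on a page. It assumes Puppeteer is already installed; the `scrapeLinks` name and the `links.js` file name are hypothetical, and the `a[href]` selector is just one possible choice.

```javascript
// Sketch: collect the unique link targets on a page with headless Chrome.
// Assumes `npm i puppeteer` has been run; `scrapeLinks` is a hypothetical helper.
async function scrapeLinks(url) {
  const puppeteer = require('puppeteer'); // loaded lazily, only when the function runs
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so script-rendered links are present.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // $$eval runs the callback inside the page against all matching elements.
    const hrefs = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.href));
    return [...new Set(hrefs)];
  } finally {
    await browser.close(); // release the browser even if scraping fails
  }
}

// Runs only when a URL is passed, e.g. `node links.js https://example.com`.
if (process.argv[2]) {
  scrapeLinks(process.argv[2])
    .then((links) => console.log(links.join('\n')))
    .catch((err) => {
      console.error(err.message);
      process.exitCode = 1;
    });
}
```

The `try`/`finally` block is a design choice rather than a requirement: it keeps a failed navigation from leaving a browser process running in the background.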

iv. Node.js

Node.js is an open-source JavaScript runtime environment that executes JavaScript code outside a web browser, which makes it suitable for back-end work.

It enables developers to use the JavaScript programming language to code command-line tools and start server-side scripts for dynamic web page content generation.

Benefits of Scraping With A Headless Browser Via Puppeteer

Scraping dynamic websites using a headless browser via Puppeteer offers several benefits, including the following:

i. Faster Data Scraping

A headless browser driven by Puppeteer typically scrapes web pages faster than a full (non-headless) browser, because it does not have to draw a window on screen. Puppeteer launches the browser in this non-GUI mode by default, so you get the benefit without extra configuration.
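Headless mode is controlled through Puppeteer’s launch options, so it can be switched off temporarily while debugging. The sketch below is one way to do that; `launchOptions` and `openBrowser` are hypothetical helper names, and the 250 ms delay is an arbitrary value.

```javascript
// Hypothetical helper: choose launch options from a debug flag.
function launchOptions(debug = false) {
  return {
    headless: !debug,        // no visible window unless debugging
    slowMo: debug ? 250 : 0, // delay each Puppeteer operation (ms) so actions are watchable
  };
}

// Launch the browser with the chosen options (assumes Puppeteer is installed).
async function openBrowser(debug = false) {
  const puppeteer = require('puppeteer'); // loaded lazily, only when called
  return puppeteer.launch(launchOptions(debug));
}
```

Calling `openBrowser(true)` would open a visible window and slow each action down, while `openBrowser()` keeps the faster headless default.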

ii. Accelerated Test Automation

The combination of a headless browser and the Puppeteer library also speeds up test automation. You can automate UI tests and script actions that a person would otherwise perform by hand, such as form submissions and keyboard input.
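Form submission and keyboard input can be scripted along the following lines. This is only a sketch: `submitSearch` is a hypothetical helper, and the `input[name="q"]` selector is an assumption that must be adapted to the page under test. The function expects a Puppeteer `page` object that has already navigated to the form.

```javascript
// Sketch: type a query into a form field and submit it with the Enter key.
// The default selector is a placeholder (assumption), not a universal one.
async function submitSearch(page, query, inputSelector = 'input[name="q"]') {
  await page.waitForSelector(inputSelector); // make sure the field exists
  await page.type(inputSelector, query);     // simulate typing into the field
  // Start waiting for navigation before pressing Enter so the event is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.keyboard.press('Enter'),
  ]);
  return page.url(); // address of the page reached after submitting
}
```

Passing the `page` in as a parameter keeps the helper reusable across tests that share one browser session.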

iii. Better Performance Diagnosis

A Puppeteer-powered headless browser lets you capture a timeline trace of your website. This trace helps you diagnose possible performance issues.
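A trace can be recorded with Puppeteer’s `page.tracing` API. The sketch below assumes Puppeteer is installed; `tracePageLoad` is a hypothetical helper name and `trace.json` is an arbitrary output path.

```javascript
// Sketch: record a timeline trace while a page loads.
// `traceFile` is a hypothetical output path.
async function tracePageLoad(url, traceFile = 'trace.json') {
  const puppeteer = require('puppeteer'); // loaded lazily, only when called
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.tracing.start({ path: traceFile }); // begin recording
    await page.goto(url);
    await page.tracing.stop(); // finish recording and write the trace file
  } finally {
    await browser.close();
  }
  return traceFile;
}
```

The resulting file can be loaded into the Performance panel of Chrome DevTools for inspection.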

Headless Chrome and Puppeteer Setup Guide

The next part of this Puppeteer tutorial covers installing and setting up Headless Chrome and Puppeteer. Node.js is a prerequisite, so if you haven’t installed it yet, visit the official Node.js website for the installation guide first.

Step 1 – Setting Up Headless Chrome and Puppeteer

  • Install Puppeteer with the “npm” command. The installation also downloads a Chrome build that is compatible with that Puppeteer version, so it may take a few minutes to complete.

npm i puppeteer --save

Step 2 – Setting Up Your Project

  • Navigate to your project directory, create a new file named screenshot.js, and open it with your preferred code editor.
  • Within your script, import Puppeteer and read the uniform resource locator (URL), or web address, from the first command-line argument.

// Load Puppeteer and read the target URL from the first command-line argument.
const puppeteer = require('puppeteer');

const url = process.argv[2];
if (!url) {
    throw new Error('Please provide a URL as the first argument');
}

  • Define an async function that launches the browser, opens the page, and saves a screenshot.

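async function run () {
    // Launch headless Chrome and open a new tab.
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Load the page and save a screenshot to the current directory.
    await page.goto(url);
    await page.screenshot({path: 'screenshot.png'});

    // Wait for the browser process to shut down before exiting.
    await browser.close();
}

run();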

  • Ensure that the final code looks identical to the one shown below.

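const puppeteer = require('puppeteer');

// Read the target URL from the first command-line argument.
const url = process.argv[2];
if (!url) {
    throw new Error('Please provide a URL as the first argument');
}

async function run () {
    // Launch headless Chrome and open a new tab.
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Load the page and save a screenshot to the current directory.
    await page.goto(url);
    await page.screenshot({path: 'screenshot.png'});

    // Wait for the browser process to shut down before exiting.
    await browser.close();
}

run();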

  • Finally, navigate to your project root directory and run the following command to take a test screenshot.

node screenshot.js https://github.com
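Once the basic script works, a slightly more defensive variant may be worth considering. The sketch below is not part of the original tutorial: the `screenshot-safe.js` file name and the `parseTargetUrl`, `takeScreenshot`, and `main` helpers are hypothetical, and the `networkidle2` and `fullPage` settings are assumptions about what a typical scraping run would want.

```javascript
// screenshot-safe.js (hypothetical file name)

// Return a normalized URL string, or throw a clear error for bad input.
function parseTargetUrl(arg) {
  if (!arg) {
    throw new Error('Please provide a URL as the first argument');
  }
  let parsed;
  try {
    parsed = new URL(arg); // throws on malformed or relative input
  } catch (err) {
    throw new Error(`Not a valid absolute URL: ${arg}`);
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return parsed.href;
}

// Take a full-page screenshot and always close the browser afterwards.
async function takeScreenshot(url, path = 'screenshot.png') {
  const puppeteer = require('puppeteer'); // loaded lazily, only when called
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait for network activity to settle so late-loading content is captured.
    await page.goto(url, { waitUntil: 'networkidle2' });
    await page.screenshot({ path, fullPage: true });
  } finally {
    await browser.close(); // runs even if navigation or the screenshot fails
  }
  return path;
}

async function main() {
  const url = parseTargetUrl(process.argv[2]);
  const file = await takeScreenshot(url);
  console.log(`Saved ${file}`);
}

if (process.argv[2]) {
  main().catch((err) => {
    console.error(err.message);
    process.exitCode = 1;
  });
} else {
  console.log('Usage: node screenshot-safe.js <url>');
}
```

It would be run the same way as the original script, for example with `node screenshot-safe.js https://github.com`. Validating the URL up front gives a readable error for input such as `github.com`, which lacks a protocol.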

Conclusion

Headless scraping via Puppeteer takes patience and practice, especially since there is no GUI and you interact with the tools mostly through the command line. Once you become accustomed to it, though, your web data gathering routine should become considerably more efficient.

Lucy Bennett

Lucy Bennett is a Contributing Editor at iLounge. She has been writing about Apple and technology for over six years. Prior to joining iLounge, Lucy worked as a writer for several online publications.