Puppeteer, created by Google, is a Node.js library that provides a high-level API for controlling browsers, in headless or headful mode, over the DevTools Protocol.
Retrieving a page's HTML is useful whenever we need to work with the raw markup, whether for web scraping, data extraction, or other tasks that involve manipulating or analyzing the page's structure.
To get the HTML content of the current page, we use Puppeteer's page.content() method. It returns a Promise that resolves to a string containing the full HTML of the page, including the doctype:
await page.content();
The await keyword pauses execution of the surrounding async function until the Promise that follows it settles, and then evaluates to the resolved value.
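To see what await does without launching a browser, here is a minimal sketch built around a hypothetical stand-in object, fakePage, whose content() method returns a Promise in the same way page.content() does. The object, its HTML string, and the 10 ms delay are assumptions made for illustration and are not part of Puppeteer.

```javascript
// Hypothetical stand-in for a Puppeteer page (not the real API object).
// Its content() returns a Promise, as page.content() does.
const fakePage = {
  content: () =>
    new Promise((resolve) => {
      setTimeout(() => resolve('<html><body>Hello</body></html>'), 10);
    }),
};

async function demo() {
  // Without await, we only hold the pending Promise.
  const pending = fakePage.content();
  // With await, demo() pauses here until the Promise resolves to the string.
  const html = await fakePage.content();
  return { isPromise: pending instanceof Promise, html };
}

demo().then((result) => console.log(result));
```

Only demo() is paused at the await; code outside the async function keeps running, which is why the result is read in a then() callback.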
The following script navigates to a page and logs its HTML content to the terminal:
const puppeteer = require('puppeteer');
(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch({
    args: ['--no-sandbox']
  });
  // Open a new page
  const page = await browser.newPage();
  // Navigate to the desired URL
  await page.goto('https://www.scrapethissite.com/login/');
  // Get the HTML content of the page
  const html = await page.content();
  // Log the extracted HTML content
  console.log(html);
  // Close the browser
  await browser.close();
})();
No browser window appears when the script runs. This is because Puppeteer launches the browser in headless mode (no visible GUI) by default.
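If a visible window is wanted, for example while debugging, the headless option of puppeteer.launch() can be set to false. The sketch below wraps that choice in a small helper. The names launchOptions, showWindow, and openVisibleBrowser are hypothetical and chosen for this example, and Puppeteer is loaded lazily so the helper can be read and tested without starting a browser.

```javascript
// Builds the options object passed to puppeteer.launch().
// `showWindow` is a hypothetical parameter name used only in this sketch.
function launchOptions(showWindow) {
  // headless: false asks Puppeteer to open a visible (headful) browser.
  return { headless: !showWindow };
}

// Launches a visible browser. Requires the puppeteer package to be installed;
// it is loaded inside the function so nothing starts until this is called.
async function openVisibleBrowser() {
  const puppeteer = require('puppeteer');
  return puppeteer.launch(launchOptions(true));
}
```

Calling openVisibleBrowser() returns the same kind of browser object as the script above, so the remaining steps stay unchanged.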
Line 1: We import the Puppeteer library using the require function in Node.js. This loads the Puppeteer module and makes its functionality accessible within the script under the variable name puppeteer.
Line 2: We define an asynchronous function using the async keyword and call it immediately. Inside this function:
On lines 4–6, we launch the browser with Puppeteer.
On line 8, we create a new page.
On line 10, we open the desired URL.
On line 12, we extract the HTML of the opened page.
On line 14, we log the HTML of the page.
On line 16, we close the browser.
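The steps above can also be packaged as a reusable function. The following is one possible sketch, not the article's original script: fetchHtml is a hypothetical name, and taking the launcher as a parameter is a design choice of this example so that the function works with the real puppeteer module or with a stub. The try/finally block is an addition that closes the browser even when navigation fails.

```javascript
// Fetches the HTML of `url` using any object shaped like the puppeteer module.
// `fetchHtml` and its `launcher` parameter are assumptions of this sketch.
async function fetchHtml(launcher, url) {
  // --no-sandbox mirrors the script above; it may be unnecessary locally.
  const browser = await launcher.launch({ args: ['--no-sandbox'] });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.content();
  } finally {
    // Runs whether or not a step above threw, so the browser is always closed.
    await browser.close();
  }
}

// Usage with real Puppeteer (requires the package to be installed):
//   fetchHtml(require('puppeteer'), 'https://www.scrapethissite.com/login/')
//     .then((html) => console.log(html));
```

Without the finally block, an error thrown by page.goto() would skip browser.close() and leave a browser process running.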
Note: We pass the --no-sandbox argument to the puppeteer.launch() function to disable sandboxing so that the browser can open on the Educative platform. If you're running the script on your local machine, this argument might be unnecessary.
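One way to keep a single script usable in both environments is to add the flag only when it is needed. The sketch below reads a hypothetical environment variable, NEEDS_NO_SANDBOX, which is an assumption of this example and is not something Puppeteer or the Educative platform defines.

```javascript
// Returns the extra browser arguments for the current environment.
// NEEDS_NO_SANDBOX is a hypothetical variable name used only in this sketch.
function sandboxArgs(env = process.env) {
  return env.NEEDS_NO_SANDBOX === '1' ? ['--no-sandbox'] : [];
}

// The result can be passed to Puppeteer as:
//   puppeteer.launch({ args: sandboxArgs() });
console.log(sandboxArgs({ NEEDS_NO_SANDBOX: '1' }));
```

Leaving the sandbox enabled wherever it works is the safer default, so the helper returns an empty list unless the variable is explicitly set.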