Deploy a Web Scraper using Puppeteer, Node.js and Docker on Koyeb
Introduction
Web scraping is the process of extracting meaningful data from websites. Although it can be done manually, nowadays there are several developer-friendly tools that can automate the process for you.
In this tutorial, we are going to create a web scraper using Puppeteer, a Node library developed by Google to perform several automated tasks using the Chromium engine. Web scraping is just one of the several applications that make Puppeteer shine. In fact, according to the official documentation on GitHub, Puppeteer can be used to:
- Generate screenshots and PDFs of web pages.
- Crawl a Single-Page Application and generate pre-rendered content.
- Automate form submission, UI testing, keyboard input, and interface interactions.
- Create an up-to-date, automated testing environment.
- Capture a timeline trace of your site to help diagnose performance issues.
- Test Chrome Extensions.
In this guide, we will first create scripts to showcase Puppeteer's capabilities, then we will create an API to scrape pages via a simple HTTP API call using Express.js, and deploy our application on Koyeb.
Requirements
To successfully follow and complete this guide, you need:
- Basic knowledge of JavaScript.
- A local development environment with Node.js installed.
- Basic knowledge of Express.js.
- Docker installed on your machine.
- A Koyeb account to deploy and run the application.
- A GitHub account to version and deploy your application code on Koyeb.
This tutorial does not require any prior knowledge of Puppeteer as we will go through every step of setting up and running a web scraper. However, make sure your version of Node.js is at least 10.18.1 as we are using Puppeteer v3+.
For more information, take a look at the official readme on the Puppeteer GitHub repository.
Steps
To deploy a web scraper using Puppeteer, Express.js, and Docker on Koyeb, you need to follow these steps:
- Initializing the project
- Your first Puppeteer application
- Puppeteer in action
- Scrape pages via a simple API call using Express
- Deploy the app on Koyeb
Initializing the project
Get started by creating a new directory that will hold the application code. In a location of your choice, create and navigate to a new directory by executing the following commands:
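```bash
# Create the project directory and move into it
mkdir puppeteer-on-koyeb
cd puppeteer-on-koyeb
```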
Inside the freshly created folder, we will create a Node.js application skeleton containing the Express.js dependencies that we will need to build our scraping API. In your terminal run:
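Assuming the Express application generator, which the rest of this guide relies on for the generated routes directory and app.js, a skeleton can be created with something like:

```bash
# Scaffold an Express application skeleton in the current directory
# (the exact generator invocation used in the original guide is an assumption)
npx express-generator .

# Install the generated dependencies
npm install
```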
You will be prompted with a set of questions to populate the initial package.json file, including the project name, version, and description. Once the command has completed, your package.json content should be similar to the following:
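With an Express generator skeleton, the file typically looks something like this (exact versions and fields will vary):

```json
{
  "name": "puppeteer-on-koyeb",
  "version": "0.0.0",
  "private": true,
  "scripts": {
    "start": "node ./bin/www"
  },
  "dependencies": {
    "cookie-parser": "~1.4.4",
    "debug": "~2.6.9",
    "express": "~4.16.1",
    "http-errors": "~1.6.3",
    "jade": "~1.11.0",
    "morgan": "~1.9.1"
  }
}
```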
Creating the skeleton of the application is going to come in handy to organize our files, especially later on when we create our API endpoints.
Next, add Puppeteer, the library we will use to perform the scraping, as a project dependency by running:
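```bash
npm install puppeteer
```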
Last, install and configure nodemon. While optional in this guide, nodemon will allow us to automatically restart our server when file changes are detected. This is a great tool to improve the experience when developing locally. To install nodemon, run the following in your terminal:
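```bash
# Add nodemon as a development dependency
npm install --save-dev nodemon
```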
Then, in your package.json, add the following section so we will be able to launch the application in development by running npm run dev (using nodemon) and in production by running npm start:
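For example, assuming the bin/www entry point created by the Express generator:

```json
"scripts": {
  "start": "node ./bin/www",
  "dev": "nodemon ./bin/www"
}
```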
Execute the following command to launch the application and ensure everything is working as expected:
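```bash
npm run dev
```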
Open your browser at http://localhost:3000 and you should see the Express welcome message.
Your first Puppeteer application
Before diving into more advanced Puppeteer web scraping capabilities, we will create a minimalist application to take a webpage screenshot and save the result in our current directory. As mentioned previously, Puppeteer provides several features to control Chrome/Chromium and the ability to take screenshots of a webpage comes very handy.
For this example, we will not dig into each parameter of the screenshot method as we mainly want to confirm our installation works properly.
Create a new JavaScript file named screenshot.js in your project directory puppeteer-on-koyeb by running:
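```bash
touch screenshot.js
```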
To take a screenshot of a webpage, our application will:
- Use Puppeteer, and create a new instance of Browser.
- Open a webpage.
- Take a screenshot.
- Close the page and browser.
Add the code below to the screenshot.js file:
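A minimal version matching the steps above looks like this (the output file name koyeb.png is an illustrative choice):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Create a new Browser instance and open a page
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Open the Koyeb homepage
  await page.goto('https://www.koyeb.com');

  // Take a screenshot and save it as a png file in the root folder of the project
  // (the file name koyeb.png is an illustrative choice)
  await page.screenshot({ path: 'koyeb.png' });

  // Close the page and the browser
  await page.close();
  await browser.close();
})();
```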
As you can see, we are taking a screenshot of the Koyeb homepage and saving the result as a png file in the root folder of the project.
You can now run the application by running:
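```bash
node screenshot.js
```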
Once the execution is completed, a screenshot is saved in the root folder of the application. You created your first automation using Puppeteer!
Puppeteer in action
Simple Scraper
In this section, we will create a more advanced scenario to scrape and retrieve information from a website's page. For this example, we will use the Stack Overflow questions page and instruct Puppeteer to extract each question and excerpt present in the webpage's HTML.
Before jumping into the code, in your browser, open the devTools to inspect the webpage source code. You should see a similar block for each question in the HTML:
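Simplified, and based on the selectors used in the rest of this section, each block roughly follows this structure:

```html
<div class="question-summary">
  <!-- ... -->
  <h3>
    <a href="/questions/..." class="question-hyperlink">
      Question title goes here
    </a>
  </h3>
  <div class="excerpt">
    A short excerpt of the question body goes here...
  </div>
  <!-- ... -->
</div>
```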
We will use the JavaScript methods querySelectorAll and querySelector to extract both the question and the excerpt, and return an array of objects as the result.
- querySelectorAll: will be used to collect each question element with document.querySelectorAll('.question-summary').
- querySelector: will be used to extract the question title by calling querySelector('.question-hyperlink').innerText and the excerpt using querySelector('.excerpt').innerText.
Back in your terminal, create a new folder lib containing a file called scraper.js:
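```bash
mkdir lib
touch lib/scraper.js
```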
Inside the file scraper.js, add the following code:
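A sketch of the singlePageScraper function matching the steps listed below could look like this (the https://stackoverflow.com/questions URL and the networkidle2 wait option are reasonable assumptions):

```javascript
const puppeteer = require('puppeteer');

const singlePageScraper = async () => {
  // Create an instance of Browser and open a page
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Go to the Stack Overflow questions page and wait for the content to load
  await page.goto('https://stackoverflow.com/questions', {
    waitUntil: 'networkidle2',
  });

  // Extract each question and its excerpt from the page
  const questions = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.question-summary')).map((el) => ({
      question: el.querySelector('.question-hyperlink').innerText,
      excerpt: el.querySelector('.excerpt').innerText,
    }));
  });

  // Close the browser and return the list of question-excerpt pairs
  await browser.close();
  return questions;
};

module.exports = { singlePageScraper };
```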
Although it looks way more complex than the screenshot.js script, it is actually performing the same actions except for the scraping one. Let's list them:
- Create an instance of Browser and open a page.
- Go to the URL (and wait for the website to load).
- Extract the information from the website and collect the questions into an array of objects.
- Close the browser and return the list of questions as question-excerpt pairs.
You might feel confused about the scraping syntax of:
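```javascript
// The extraction step from the scraper sketch above
const questions = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('.question-summary')).map((el) => ({
    question: el.querySelector('.question-hyperlink').innerText,
    excerpt: el.querySelector('.excerpt').innerText,
  }));
});
```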
We first call page.evaluate to interact with the page DOM and then we start extracting the question and the excerpt.
Moreover, in the code above, we transformed the result from document.querySelectorAll into a JavaScript array to be able to call map on it and return the pair { question, excerpt } for each question.
To run the function, we can import it into a new file singlePageScraper.js and run it. Create the file in the root directory of your application:
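```bash
touch singlePageScraper.js
```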
Then, copy the code below, which imports the singlePageScraper function and calls it:
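```javascript
const { singlePageScraper } = require('./lib/scraper');

// Run the scraper and print the questions and excerpts to the console
singlePageScraper().then((questions) => {
  console.log(questions);
});
```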
Run the script by executing the following command in your terminal:
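```bash
node singlePageScraper.js
```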
The following output appears in your terminal showing the questions and excerpts retrieved from the StackOverflow questions page:
Multi-page scraper
In the previous example, we learned how to scrape and retrieve information from a single page. We can now go even further and instruct Puppeteer to explore and extract information from multiple pages.
For this scenario, we will scrape and extract questions and excerpts from a pre-defined number of Stack Overflow questions pages.
Our script will:
- Receive the number of pages to scrape as a parameter.
- Extract questions and excerpts from a page.
- Programmatically click on the "next page" element.
- Repeat steps 2 and 3 until the number of pages to scrape is reached.
Based on the previous function singlePageScraper we created in lib/scraper.js, we will create a new function taking the number of pages to scrape as an argument.
Let's take a look at the HTML source code of the page to find the button element on which we will emulate a click to go to the next page:
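Simplified, the pagination block looks roughly like this, with the last link inside the .pager element pointing to the next page:

```html
<div class="pager">
  <a href="/questions?page=1">1</a>
  <a href="/questions?page=2">2</a>
  <!-- ... -->
  <a href="/questions?page=2">Next</a>
</div>
```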
The Puppeteer class Page provides a handy click method that accepts CSS selectors to simulate a click on an element. In our case, to go to the next page, we decide to use the .pager > a:last-child selector.
In the lib/scraper.js file, create a new function called multiPageScraper:
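A sketch of the function, following the steps above, could look like this (remember to also export it at the bottom of the file):

```javascript
const multiPageScraper = async (pages) => {
  // Create an instance of Browser and open a page
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Go to the first Stack Overflow questions page and wait for the content to load
  await page.goto('https://stackoverflow.com/questions', {
    waitUntil: 'networkidle2',
  });

  let questions = [];

  for (let currentPage = 1; currentPage <= pages; currentPage++) {
    // Extract the questions and excerpts of the current page
    const pageQuestions = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('.question-summary')).map((el) => ({
        question: el.querySelector('.question-hyperlink').innerText,
        excerpt: el.querySelector('.excerpt').innerText,
      }));
    });

    questions = questions.concat(pageQuestions);

    // Click on the "next page" element and wait for the navigation to finish,
    // unless we just scraped the last requested page
    if (currentPage < pages) {
      await Promise.all([
        page.waitForNavigation({ waitUntil: 'networkidle2' }),
        page.click('.pager > a:last-child'),
      ]);
    }
  }

  // Close the browser and return the questions collected across all pages
  await browser.close();
  return questions;
};

// Update the exports so both functions are available
module.exports = { singlePageScraper, multiPageScraper };
```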
Since we are collecting questions and related excerpts for multiple pages, we use a for loop to retrieve the list of questions for each page.
The questions retrieved for each page are then concatenated into an array questions, which is returned once the fetching is completed.
As we did for the single page scraper example, create a new file multiPageScraper.js in the root directory to import and call the multiPageScraper function:
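```bash
touch multiPageScraper.js
```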
Then, add the following code:
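```javascript
const { multiPageScraper } = require('./lib/scraper');

// Scrape the first two pages of questions and print the result
// (the number of pages is hardcoded to 2 for now and made dynamic by the API later)
multiPageScraper(2).then((questions) => {
  console.log(questions);
});
```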
For the purpose of the script, we are hardcoding the number of pages to fetch to 2. We will make this dynamic when we build the API.
In your terminal execute the following command to run the script:
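```bash
node multiPageScraper.js
```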
The following output appears in your terminal showing the questions and excerpts retrieved for each page scraped:
In the next section, we will write a simple API server containing one endpoint to scrape a user-defined number of pages and return the list of questions and excerpts scraped.
Scrape pages via a simple API call using Express
We are going to create a simple Express.js API server with an endpoint /questions that accepts a query parameter pages and returns the list of questions and excerpts from page 1 to the page sent as a parameter.
For instance, to retrieve the first three pages of questions and their excerpts from Stack Overflow, we will call:
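```bash
curl "http://localhost:3000/questions?pages=3"
```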
To create the questions endpoint, go to the routes directory and create a new file question.js:
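```bash
touch routes/question.js
```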
Then, add the code below to the question.js file:
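A sketch of the route could look like this (defaulting to one page when no pages parameter is provided is an assumption):

```javascript
const express = require('express');
const router = express.Router();

const { multiPageScraper } = require('../lib/scraper');

/* GET questions scraped from Stack Overflow. */
router.get('/', async (req, res) => {
  // Read the number of pages to scrape from the query string
  // (falling back to 1 page when the parameter is missing is an assumption)
  const pages = parseInt(req.query.pages, 10) || 1;

  // Scrape the requested number of pages and return the questions to the client
  const questions = await multiPageScraper(pages);
  res.json(questions);
});

module.exports = router;
```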
What the code does is:
- Get the query parameter pages.
- Call the newly created function multiPageScraper and pass the pages value.
- Return the array of questions back to the client.
We now need to define the questions route in the Express router. To do so, open app.js and add the following:
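For example, next to the existing router imports and app.use calls (the questionRouter variable name is illustrative):

```javascript
// Import the new router next to the existing ones...
const questionRouter = require('./routes/question');

// ...and register it so GET /questions is handled by routes/question.js
app.use('/questions', questionRouter);
```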
Note that thanks to the Express generator we used to initialize our project, we do not have to set up our server from scratch: a few middlewares are already set up for us.
Run the server again and try to call the /questions API endpoint using either the browser or cURL.
Here is the output you should get when running it from your terminal using cURL:
And we are done! We now have a web scraper that, with minimal changes, can scrape any website.
Deploy the app on Koyeb
Now that we have a working server, we can demonstrate how to deploy the application on Koyeb. Koyeb's simple user interface gives you two choices to deploy the app:
- Deploy native code using git-driven deployment
- Deploy pre-built Docker containers from any public or private registry.
Since we are using Puppeteer, we need some extra system packages installed so we will deploy on Koyeb using Docker.
In this guide, I won't go through the steps of creating a Dockerfile and pushing it to the Docker registry but if you are interested in learning more, I suggest you start with the official documentation.
Before we start working with Docker, we have to perform a change in the lib/scraper.js file:
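The usual change is to pass extra launch arguments to Puppeteer so that Chromium can run inside the container, in both the singlePageScraper and multiPageScraper functions:

```javascript
// Launch Chromium with the flags needed to run inside a Docker container
const browser = await puppeteer.launch({
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
```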
These extra parameters are required to properly run Puppeteer inside a Docker container.
Dockerize the application and push it to the Docker Hub
Get started by creating a Dockerfile containing the following:
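A sketch along these lines works (the Node base image tag and the exact Alpine package list are assumptions):

```dockerfile
FROM node:14-alpine

# Install Chromium and the system libraries Puppeteer needs
RUN apk add --no-cache chromium nss freetype harfbuzz ca-certificates ttf-freefont

# Use the Chromium installed above instead of downloading one with Puppeteer
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser

WORKDIR /app

# Install the application dependencies
COPY package*.json ./
RUN npm install

# Add the web scraper application code
COPY . .

EXPOSE 3000
CMD ["npm", "start"]
```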
This is a fairly simple Dockerfile. We inherit from the Node alpine base image, install the dependencies required by Puppeteer, add our web scraper application code and indicate how to run it.
We can now build the Docker image by running the following command:
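```bash
docker build -t <YOUR_DOCKER_USERNAME>/puppeteer-on-koyeb .
```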
Take care to replace <YOUR_DOCKER_USERNAME> with your Docker Hub username.
Once the build succeeds, we can push our image to Docker Hub by running:
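```bash
docker push <YOUR_DOCKER_USERNAME>/puppeteer-on-koyeb
```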
Deploy the app on Koyeb
Log in to the Koyeb control panel. On the Overview tab, click on the Create Web Service button to begin:
- Select Docker as the deployment method.
- Enter the Docker image you just pushed to the Docker Hub, <YOUR_DOCKER_USERNAME>/puppeteer-on-koyeb, as the Docker image.
- In the Exposed ports section, change the port value to 3000.
- Choose a name for your App and Service and click Deploy.
You will automatically be redirected to the Koyeb App page where you can follow the progress of your application deployment. Once your app is deployed, click on the Public URL ending with koyeb.app.
Then, ensure everything is working as expected by retrieving the first two pages of questions from Stack Overflow:
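```bash
# Replace <YOUR_APP_URL> with your app's public URL ending with koyeb.app
curl "https://<YOUR_APP_URL>/questions?pages=2"
```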
If everything is working fine, you should see the list of questions returned by the API.
Conclusion
First of all, congratulations on reaching this point! It was a long journey but we now have all the basic knowledge to successfully create and deploy a web scraper.
Starting from the beginning, we played with Puppeteer's screenshot capabilities and slowly built up a fairly robust scraper that can automatically change pages to retrieve questions from Stack Overflow.
Next, we moved from a simple script to a running Express API server which exposes a specific endpoint to call the script and scrape a dynamic number of pages based on the query parameter sent along with the API call.
Finally, the cherry on top is the deployment of our server on Koyeb: thanks to the simplicity of deploying pre-built Docker images, we can now perform our scraping in a production environment.
Questions or suggestions to improve this guide? Join us on the community platform to chat!