Building a Web Scraper Using Puppeteer and Next.js API Routes
A recent project has me revisiting Puppeteer. Since it's been a while since I last used it, I did a quick refresher and decided to document it.
To make the explanation easier, I created a new Next.js application. For those who don't have the patience to read the entire article, you can click here to go directly to the source code.
What is Puppeteer?
Here is the official definition of Puppeteer from the Google team:
Puppeteer is a JavaScript library which provides a high-level API to automate both Chrome and Firefox over the Chrome DevTools Protocol and WebDriver BiDi.
Use it to automate anything in the browser, from taking screenshots and generating PDFs to navigating through and testing complex UIs and analysing performance.
I will not go into detail about the use cases of web crawling. I believe that the information you can find online will be more detailed than what I can explain.
What are we going to build?
We will create a demonstration application combining Next.js API Routes with Puppeteer.
Why Next.js? Next.js is just one option; you can easily follow the steps outlined in this article and implement the same functionality in a plain Node.js or Express.js setup.
This demonstration application will be divided into three sections: web scraping, webpage screenshots, and scraping and automating specific components.
Getting Started: Prerequisites
Before you start, you need to install Puppeteer's related libraries. Execute the following command to install them:
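Assuming you use npm, that looks like:

```bash
npm install puppeteer-core @sparticuz/chromium
```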
In general, you only need to install `puppeteer-core` and `@sparticuz/chromium`. So why would you also need to install `puppeteer`? I will explain that later.
Next, we need to build an API route to initialize the Puppeteer environment.
Note that I am using the App Router; if you are using the Pages Router, you may need to make some adjustments.
One thing to note: Puppeteer is configured differently in the development and production environments, so each needs its own adjustments. Without further ado, let's look at the code:
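Here is a minimal sketch of what that route might look like. The `getBrowser` helper and the exact pack URL are illustrative; check the source code for the real thing:

```ts
// app/api/scraper/basic/route.ts
import { NextResponse } from "next/server";
import puppeteer from "puppeteer-core";
import chromium from "@sparticuz/chromium";

// Remote Chromium pack; its version must match your installed @sparticuz/chromium.
const CHROME_EXECUTABLE_PATH =
  "https://github.com/Sparticuz/chromium/releases/download/v131.0.1/chromium-v131.0.1-pack.tar";

async function getBrowser() {
  if (process.env.NODE_ENV === "development") {
    // Development: point puppeteer-core at the Chrome installed on your machine.
    return puppeteer.launch({
      executablePath: process.env.CHROME_EXECUTABLE_PATH,
      headless: true,
    });
  }
  // Production (e.g. serverless): download and extract the Chromium pack.
  return puppeteer.launch({
    args: chromium.args,
    executablePath: await chromium.executablePath(CHROME_EXECUTABLE_PATH),
    headless: true,
  });
}

export async function GET() {
  const browser = await getBrowser();
  const version = await browser.version();
  await browser.close();
  return NextResponse.json({ version });
}
```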
There are two things we need to configure here: the `process.env.CHROME_EXECUTABLE_PATH` environment variable, used in development, and the `CHROME_EXECUTABLE_PATH` constant, used in production.
First is the value of `process.env.CHROME_EXECUTABLE_PATH`: open `chrome://version` in your Chrome browser, copy the "Executable Path" value, and assign it to the `CHROME_EXECUTABLE_PATH` variable in your `.env` file.
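For example, on macOS the `.env` entry might look like this:

```bash
# .env — the value copied from chrome://version (macOS example)
CHROME_EXECUTABLE_PATH="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
```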
The value of the `CHROME_EXECUTABLE_PATH` constant depends on your version of `@sparticuz/chromium`. For example, if your `@sparticuz/chromium` version is `131.0.1`, go to this link and copy the URL of the matching release.
The above is the first method, and the one I recommend. The second method additionally requires installing the `puppeteer` library, which means it needs three libraries where the first method only needs two.
Below is the implementation of the second method:
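Here is a sketch of that variant, assuming the same `CHROME_EXECUTABLE_PATH` pack-URL constant as before; the full `puppeteer` package is imported dynamically so it only loads in development:

```ts
import puppeteerCore from "puppeteer-core";
import chromium from "@sparticuz/chromium";

// Same remote Chromium pack URL as in the first method.
const CHROME_EXECUTABLE_PATH =
  "https://github.com/Sparticuz/chromium/releases/download/v131.0.1/chromium-v131.0.1-pack.tar";

async function getBrowser() {
  if (process.env.NODE_ENV === "development") {
    // The full puppeteer package downloads its own Chromium,
    // so no executable path needs to be configured locally.
    const puppeteer = await import("puppeteer");
    return puppeteer.default.launch({ headless: true });
  }
  return puppeteerCore.launch({
    args: chromium.args,
    executablePath: await chromium.executablePath(CHROME_EXECUTABLE_PATH),
    headless: true,
  });
}
```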
If the second method has one advantage, it is probably that you don't need to set `process.env.CHROME_EXECUTABLE_PATH` in the development environment.
Use Cases
Basic: Web data crawling
Web scraping is a relatively basic use of Puppeteer, so let's start with that.
Let's go back to the route in `/api/scraper/basic/route.ts` and make some changes:
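A sketch of those changes, reusing the `getBrowser` helper defined earlier in the same file (the request shape with `siteUrl` is assumed):

```ts
// app/api/scraper/basic/route.ts — getBrowser is the helper defined above
import { NextResponse } from "next/server";

export async function POST(request: Request) {
  const { siteUrl } = await request.json();
  const browser = await getBrowser();
  try {
    const page = await browser.newPage();
    await page.goto(siteUrl, { waitUntil: "networkidle0" });
    // Read the document title of the target page.
    const title = await page.title();
    return NextResponse.json({ siteUrl, title });
  } finally {
    // Always release the browser, even if navigation fails.
    await browser.close();
  }
}
```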
By looking at the code above, you should quickly understand its purpose: it uses the value of `siteUrl` to navigate to the specified page and retrieve its title.
Of course, Puppeteer can capture much more data from web pages. If you are interested, you can find out more here.
Now that the basic route is complete, the last thing we need is a client-side interface that can trigger it.
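A minimal sketch of such a trigger (the component name and markup are illustrative):

```tsx
"use client";
import { useState } from "react";

export function ScraperForm() {
  const [siteUrl, setSiteUrl] = useState("");
  const [title, setTitle] = useState("");

  async function handleScrape() {
    // Call the route we just built and show the scraped title.
    const res = await fetch("/api/scraper/basic", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ siteUrl }),
    });
    const data = await res.json();
    setTitle(data.title);
  }

  return (
    <div>
      <input value={siteUrl} onChange={(e) => setSiteUrl(e.target.value)} />
      <button onClick={handleScrape}>Scrape</button>
      {title && <p>Title: {title}</p>}
    </div>
  );
}
```

The result is as follows: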
Capture Page Screenshot
Puppeteer provides a screenshot API that we can use without much effort. Here is the code:
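A sketch, again assuming the `getBrowser` helper from earlier (here imagined as extracted into a shared `lib/browser.ts`) and an illustrative route path:

```ts
// app/api/scraper/screenshot/route.ts
import { NextResponse } from "next/server";
import { getBrowser } from "@/lib/browser"; // the helper from earlier, extracted

export async function POST(request: Request) {
  const { siteUrl } = await request.json();
  const browser = await getBrowser();
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto(siteUrl, { waitUntil: "networkidle0" });
    // Capture the page as a PNG and return the raw bytes.
    const screenshot = await page.screenshot({ type: "png" });
    return new NextResponse(Buffer.from(screenshot), {
      headers: { "Content-Type": "image/png" },
    });
  } finally {
    await browser.close();
  }
}
```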
As you can see, after taking a screenshot of the specified web address, the endpoint responds with the binary data of the image. Using this data, we can download the screenshot image on the client side.
Here is a side-by-side comparison of the live page and the captured screenshot:
Scraping Individual Components
In some cases, you may want to capture data or a screenshot of only specific components of a web page.
This is especially useful for reusable components: you only need to pass the appropriate props to render the component with the desired values.
Therefore, we first need to create a component. Here I made a simple user info card, as shown below:
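Something along these lines (the props and markup are illustrative):

```tsx
// components/UserCard.tsx
type UserCardProps = {
  name: string;
  title: string;
  avatarUrl: string;
};

export function UserCard({ name, title, avatarUrl }: UserCardProps) {
  return (
    <div className="flex items-center gap-4 rounded-xl bg-white p-6 shadow">
      <img src={avatarUrl} alt={name} className="h-16 w-16 rounded-full" />
      <div>
        <p className="text-lg font-semibold">{name}</p>
        <p className="text-sm text-gray-500">{title}</p>
      </div>
    </div>
  );
}
```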
Then, under the `api/scraper/unit` route:
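A sketch of that route. Note the `.tsx` extension for JSX; depending on your Next.js version, you may need to adjust how `react-dom/server` is imported:

```tsx
// app/api/scraper/unit/route.tsx
import { NextResponse } from "next/server";
import { renderToStaticMarkup } from "react-dom/server";
import { getBrowser } from "@/lib/browser"; // the shared helper from earlier
import { UserCard } from "@/components/UserCard";

export async function POST(request: Request) {
  const props = await request.json();
  const browser = await getBrowser();
  try {
    const page = await browser.newPage();
    // Render the component to an HTML string and load it into a blank page.
    const html = renderToStaticMarkup(<UserCard {...props} />);
    await page.setContent(html, { waitUntil: "networkidle0" });
    const screenshot = await page.screenshot({ type: "png" });
    return new NextResponse(Buffer.from(screenshot), {
      headers: { "Content-Type": "image/png" },
    });
  } finally {
    await browser.close();
  }
}
```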
Then, provide the component's props as data when calling the endpoint.
Finally, you will get the following screenshot.
This doesn't look good, does it?🤔
If you review our routing code, you'll notice that our page instance isn't navigating to the target page to scrape data. Instead, it reads the component's structure and uses Puppeteer to create an instance where the component is rendered.
In simple terms, the component is rendered onto a blank page, which lacks any stylesheets or scripts from the local project.
Fixing this isn't too difficult, especially since our component's styles are generated by Tailwind CSS: we just need to inject Tailwind's stylesheet into the "blank" page.
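One way to do that is the Tailwind Play CDN (assumed here; any way of getting the compiled stylesheet into the page works), injected just before the screenshot call:

```ts
// Inside the unit route, between setContent and screenshot:
await page.addScriptTag({ src: "https://cdn.tailwindcss.com" });
// Give Tailwind's runtime a moment to generate the styles before capturing.
await new Promise((resolve) => setTimeout(resolve, 300));
const screenshot = await page.screenshot({ type: "png" });
```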
Calling the endpoint again, we get the following result:
Timeout Issue
If your server has a default timeout value for requests and the processing time exceeds that limit, the server will respond with a "Request Timeout (408)" error.
Puppeteer is a resource-intensive tool, and on free hosting tiers like Vercel's you're likely to run into "Request Timeout" issues, since Vercel's default function timeout is only 10 seconds.
To address this, you need to extend the maximum timeout duration for your routes. This can be achieved by configuring the timeout setting in your API route, as shown below:
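In the App Router this is the `maxDuration` route segment config, exported from the route file. For example:

```ts
// app/api/scraper/basic/route.ts
// Raise the maximum execution time for this route to 60 seconds.
export const maxDuration = 60;
```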
What's next?
You can combine Puppeteer with various tools to extend its functionality.
For example, you can upload images generated by Puppeteer to the cloud and return the corresponding URL. Another use case is generating HTML using component rendering methods and sending it to specific email addresses with email tools like MailKit, Resend, or SendGrid.
Additionally, you can use Puppeteer to scrape data and integrate it with AI tools to generate insights, summaries, or recommendations.
As for the rest, I'll leave it to your imagination and specific needs!