
How We Built a Free SEO Audit Tool with Puppeteer and Chromium

Crystal A. Gutierrez · 12 min read

When we started building our SEO auditing pipeline, the first architectural decision we made was also the most important one: we would never parse raw HTML.

Most SEO audit tools still rely on that approach because it is lightweight and efficient for static websites. But frontend frameworks like React, Next.js, and Vue changed the landscape completely.

We kept running into the same issue: traditional parsers were auditing code that users and Google never actually saw.

So we took a different route: real browser rendering with Puppeteer and headless Chromium.

Here is how we built the system and what we learned along the way.

The Core Problem With HTML Parsers

Most SEO auditors work something like this:

const response = await fetch(url);
const html = await response.text();
// parse html and inspect tags

For static sites, that works fine. But JavaScript applications often return almost empty HTML responses initially:

<html>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.chunk.js"></script>
  </body>
</html>

The actual SEO-relevant content — H1 tags, meta descriptions, structured data, canonical tags, images, internal links — does not exist yet. It gets generated after JavaScript executes in the browser. An HTML parser never sees that content.

Googlebot renders JavaScript. Your audit tool should too.

In our internal testing, more than half of the React-based sites we audited returned incomplete metadata before rendering. Some pages were missing titles, canonical tags, structured data, or even visible heading content entirely until JavaScript finished executing.

That gap became impossible to ignore.
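
To make that gap concrete, here is a minimal sketch that compares a handful of SEO signals between a raw response and a rendered DOM. The helper names (extractSignals, diffSignals) and the sample HTML are hypothetical, and the regex extraction is a deliberate simplification for illustration rather than how a production check should read the page.

```javascript
// Hypothetical helper: pull a few SEO signals out of an HTML string.
// Regexes keep the sketch dependency-free; a real check would query the DOM.
function extractSignals(html) {
  const title = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
  return {
    title: title ? title[1].trim() : null,
    h1Count: (html.match(/<h1[\s>]/gi) || []).length,
    hasCanonical: /<link[^>]+rel=["']canonical["']/i.test(html),
    hasMetaDescription: /<meta[^>]+name=["']description["']/i.test(html),
    jsonLdCount: (html.match(/<script[^>]+type=["']application\/ld\+json["']/gi) || []).length,
  };
}

// Report every signal whose value differs between raw and rendered HTML.
function diffSignals(rawHtml, renderedHtml) {
  const raw = extractSignals(rawHtml);
  const rendered = extractSignals(renderedHtml);
  return Object.keys(raw)
    .filter((key) => raw[key] !== rendered[key])
    .map((key) => ({ signal: key, raw: raw[key], rendered: rendered[key] }));
}

// Sample inputs (made up): the empty shell a fetch() would see, and the
// same page after client-side rendering.
const rawHtml = '<html><body><div id="root"></div></body></html>';
const renderedHtml =
  '<html><head><title>Pricing</title>' +
  '<link rel="canonical" href="https://example.com/pricing"></head>' +
  '<body><div id="root"><h1>Pricing</h1></div></body></html>';

console.log(diffSignals(rawHtml, renderedHtml));
```

Every entry in the returned list is a signal that a raw-HTML parser would have reported incorrectly for this page.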

Why We Chose Puppeteer

We evaluated several options before deciding on our stack:

  • Playwright — excellent tooling, but heavier than we needed for a Chromium-only workflow
  • Selenium — powerful, but designed more for browser testing than rendering audits
  • Cheerio + axios — extremely fast, but limited to static HTML parsing
  • Puppeteer — lightweight Chromium automation with a straightforward API and strong ecosystem support

Puppeteer ultimately made the most sense for our use case. We did not need multi-browser automation. We needed rendering accuracy.

The Rendering Pipeline

Here is a simplified version of the core audit flow:

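const puppeteer = require('puppeteer');

async function auditPage(url) {
  // --no-sandbox is often required in containers, but it disables Chromium's
  // sandbox, so only use it where the worker is isolated by other means.
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  try {
    const page = await browser.newPage();

    await page.setUserAgent(
      'Mozilla/5.0 (compatible; DeepAuditBot/1.0; +https://axiondeepdigital.com)'
    );

    // Collect every network request the page makes during the audit.
    const resources = [];
    page.on('request', (req) => resources.push(req));

    await page.goto(url, {
      waitUntil: 'networkidle2',
      timeout: 30000,
    });

    // Scroll the page to trigger lazy-loaded content (autoScroll is defined below).
    await autoScroll(page);

    const dom = await page.evaluate(() => document.documentElement.outerHTML);
    return { dom, resources };
  } finally {
    // Always close the browser, even if navigation or scrolling throws,
    // so a failed audit does not leave a Chromium process behind.
    await browser.close();
  }
}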

The key detail is waitUntil: 'networkidle2'. This tells Puppeteer to wait until there are no more than two active network requests for at least 500ms. Without this step, audits frequently captured incomplete pages before JavaScript finished rendering critical content.

Handling Lazy-Loaded Content

Many sites only load images and components once the user scrolls down the page. A simple page load misses large portions of the content entirely. To solve this, we implemented an incremental scrolling helper:

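// maxScroll caps the total scroll distance so infinite-scroll pages, whose
// height keeps growing, cannot keep the loop running forever.
// The 20000px default is an arbitrary illustrative value.
async function autoScroll(page, maxScroll = 20000) {
  await page.evaluate(async (limit) => {
    await new Promise((resolve) => {
      let totalHeight = 0;
      const distance = 200; // pixels per scroll step
      const timer = setInterval(() => {
        window.scrollBy(0, distance);
        totalHeight += distance;
        // Stop at the bottom of the page or once the cap is reached.
        if (totalHeight >= document.body.scrollHeight || totalHeight >= limit) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  }, maxScroll);
}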

This scrolling behavior triggers intersection observers, lazy load listeners, and deferred image requests in much the same way as real user interaction would.

Building the Audit Engine

Once we had a fully rendered page, we built the audit engine as a collection of independent modules:

checks/
  meta/
    title.js
    description.js
    og-tags.js
    canonical.js
  headings/
    h1-presence.js
    heading-hierarchy.js
  images/
    alt-text.js
    lazy-load-detection.js
    oversized-images.js
  performance/
    render-blocking.js
    resource-hints.js
    font-loading.js
  structured-data/
    json-ld-validation.js
    schema-types.js
  links/
    internal-links.js
    broken-links.js
    anchor-text.js

Each check receives the rendered DOM, network resource data, and page metrics, and returns a standardized result object:

{
  check: 'h1-presence',
  status: 'pass',
  message: 'H1 tag found: "Your Page Title"',
  impact: 'high'
}
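
As a rough illustration of that contract, a single check and a small runner might look like the sketch below. It assumes the check context carries the rendered HTML as a string in `dom`. The `fail` and `error` statuses, the `medium` and `low` impact levels, and the `runChecks` helper are all assumptions added for the example, not a description of our production engine.

```javascript
// Hypothetical check module, e.g. checks/headings/h1-presence.js.
// `context.dom` is assumed to be the rendered HTML string.
function h1Presence(context) {
  // Regex extraction is a simplification for this sketch.
  const match = /<h1[^>]*>([\s\S]*?)<\/h1>/i.exec(context.dom);
  if (!match) {
    return { check: 'h1-presence', status: 'fail', message: 'No H1 tag found', impact: 'high' };
  }
  // Strip nested tags so the message shows only the heading text.
  const text = match[1].replace(/<[^>]+>/g, '').trim();
  return { check: 'h1-presence', status: 'pass', message: `H1 tag found: "${text}"`, impact: 'high' };
}

// Assumed impact levels, used only to order the report.
const IMPACT_ORDER = { high: 0, medium: 1, low: 2 };

// Run every check in isolation so one throwing check cannot sink the audit,
// then sort the results so high-impact findings come first.
function runChecks(checks, context) {
  const results = checks.map((check) => {
    try {
      return check(context);
    } catch (err) {
      return { check: check.name, status: 'error', message: err.message, impact: 'low' };
    }
  });
  return results.sort((a, b) => IMPACT_ORDER[a.impact] - IMPACT_ORDER[b.impact]);
}
```

Because each check is a plain function of the shared context, disabling one is a matter of leaving it out of the list passed to the runner.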

This modular structure saved us repeatedly as the platform expanded. It let us disable problematic checks quickly, add new audit rules independently, prioritize issues by impact, and generate cleaner reporting output.

Challenges We Did Not Anticipate

Timeout handling. Some pages are genuinely slow. Large JavaScript bundles, third-party scripts, and API delays can dramatically increase render time. We redesigned the pipeline so incomplete audits return partial results instead of failing entirely.
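
One way to get partial results is to treat a navigation timeout as a soft failure and still capture whatever has rendered. The sketch below is an assumption about how that could look, not our exact implementation: `renderWithPartialResults` is a hypothetical name, `page` is any object with Puppeteer-style `goto()` and `content()` methods, and the code assumes timeouts surface as an error whose `name` is `TimeoutError`.

```javascript
// Navigate, but treat a navigation timeout as "partial" rather than fatal.
// `page` is expected to expose Puppeteer-style goto() and content().
async function renderWithPartialResults(page, url, timeoutMs = 30000) {
  let complete = true;
  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout: timeoutMs });
  } catch (err) {
    // Assumption: timeouts are reported as an error named 'TimeoutError'.
    // Anything else (DNS failure, invalid URL, ...) is still fatal.
    if (err.name !== 'TimeoutError') throw err;
    complete = false;
  }
  // Whatever has rendered so far is still worth auditing.
  const dom = await page.content();
  return { dom, complete };
}
```

The `complete` flag lets the report state that some checks ran against a page that had not finished loading.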

Bot detection. Some sites actively detect headless browsers and serve different content. We mitigated this using realistic user agents, browser fingerprint adjustments, and standard viewport sizes — but it remains an ongoing challenge.

Single page app routing. Dynamic client-side routing made broader crawling behavior unreliable. We simplified the pipeline to audit only the exact URL requested, which made results far more predictable.

Memory management. Chromium gets expensive fast under concurrency. Early versions launched a fresh browser instance for every audit request. Under load, memory escalated faster than we expected. Even one improperly terminated Chromium process could accumulate enough memory to destabilize a worker node. We eventually learned that browser lifecycle management mattered just as much as the audit logic itself.
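
A common way to keep that lifecycle under control is to cap how many browsers can exist and reuse them across audits. The following is a minimal sketch under stated assumptions: `BrowserPool` is a hypothetical class, and the launch function is injected (in practice it could be something like `() => puppeteer.launch({ headless: true })`) so the example does not depend on Puppeteer itself. It does not handle crashed browsers or recycling after a number of uses, which a production pool would need.

```javascript
// Minimal fixed-size pool. `launch` is an injected async factory that
// returns a browser-like object with a close() method.
class BrowserPool {
  constructor(launch, size) {
    this.launch = launch;
    this.size = size;
    this.idle = [];     // browsers ready for reuse
    this.created = 0;   // browsers launched so far
    this.waiters = [];  // resolvers for callers waiting on a free browser
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop();
    if (this.created < this.size) {
      this.created += 1;
      try {
        return await this.launch();
      } catch (err) {
        this.created -= 1; // free the slot if the launch failed
        throw err;
      }
    }
    // Pool is at capacity: wait until another caller releases a browser.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(browser) {
    const next = this.waiters.shift();
    if (next) next(browser);
    else this.idle.push(browser);
  }

  // Run `task` with a pooled browser and always hand the browser back.
  async use(task) {
    const browser = await this.acquire();
    try {
      return await task(browser);
    } finally {
      this.release(browser);
    }
  }

  // Close idle browsers; call this only after all tasks have finished.
  async close() {
    await Promise.all(this.idle.map((browser) => browser.close()));
    this.idle = [];
    this.created = 0;
  }
}
```

With a pool of size N, at most N Chromium processes exist regardless of how many audit requests arrive, which turns unbounded memory growth into queueing.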

What We Would Do Differently

If we were starting over, we would implement a reusable browser pool from day one. Launching a fresh Chromium instance for every audit works initially but becomes inefficient quickly at scale. We would also invest earlier in DOM snapshot caching — rendering is by far the most expensive part of the pipeline, and caching rendered snapshots for repeat audits would reduce both overhead and infrastructure costs.
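
For the caching idea, a minimal sketch could be an in-memory map keyed by URL with a time-to-live. Everything here is hypothetical: the `SnapshotCache` name, the injected `render` function, and the choice of TTL are assumptions, and a real deployment would more likely use a shared store than process memory.

```javascript
// Hypothetical in-memory snapshot cache keyed by URL, with a time-to-live.
// `render` is any async function that returns a rendered snapshot for a URL.
// `now` is injectable so expiry can be tested without real waiting.
class SnapshotCache {
  constructor(render, ttlMs, now = Date.now) {
    this.render = render;
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map(); // url -> { snapshot: Promise, expiresAt: number }
  }

  async get(url) {
    const hit = this.entries.get(url);
    if (hit && hit.expiresAt > this.now()) return hit.snapshot;

    // Store the promise itself so concurrent audits of one URL share a render.
    const snapshot = this.render(url);
    this.entries.set(url, { snapshot, expiresAt: this.now() + this.ttlMs });
    try {
      return await snapshot;
    } catch (err) {
      // Do not cache failures; only remove the entry this call created.
      const current = this.entries.get(url);
      if (current && current.snapshot === snapshot) this.entries.delete(url);
      throw err;
    }
  }
}
```

The TTL is the main design choice: too long and repeat audits report stale pages, too short and the cache rarely saves a render.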

Final Thoughts

Building a browser-rendered SEO auditing system proved far more demanding than parsing static HTML, but it also exposed how incomplete traditional auditing approaches had become for JavaScript-driven applications.

What started as a rendering experiment ended up forcing us to rethink most of the assumptions traditional SEO tools make about how modern, JavaScript-driven websites should be analyzed.

The result became the foundation for DeepAudit AI — our public SEO auditing platform. You can audit any page using the same Chromium-based rendering pipeline we built internally. No account required. Results in about 60 seconds.
