How We Built DeepAudit AI with Puppeteer and Chromium

Crystal A. Gutierrez | May 22, 2026 | 9 min read

Our first architectural decision was that we would never parse raw HTML.

That was the wrong lesson, and it took us a while to see it.

The right one is that we would never stop at raw HTML. The server response tells you what existed before any JavaScript ran. Google parses that response, extracts links from it, and reads its directives before rendering anything. Some crawlers never process anything else. A weak initial document is not irrelevant just because the browser eventually replaces it.

So the real design is two captures: what the server sent, and what Chromium built. The difference between them is usually the finding.

Why the browser was necessary anyway

A client-rendered application can return an initial document that is little more than a root element and some script tags. After the scripts run, the browser may add the main heading, the body copy, the navigation, the canonical, the structured data, the forms, and every internal link.

An HTML-only tool misses all of it. A browser-only tool makes the opposite mistake: it finds the finished page and never tells you the whole thing depends on JavaScript executing successfully.

We stopped arguing about which one was the real page and started saving both.
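
A minimal sketch of what saving both can look like, assuming `page` is a Puppeteer Page. The function names, the field names, and the regex-based link extraction are illustrative assumptions, not DeepAudit's actual code; a real extractor would use a proper HTML parser.

```javascript
// Capture the server's document and the rendered DOM in one navigation.
async function captureBoth(page, url) {
  const response = await page.goto(url, { waitUntil: "networkidle2", timeout: 20_000 });
  // response.text() is the body of the final response, before any script ran.
  const serverHtml = response ? await response.text() : null;
  // page.content() serializes the DOM as it stands after scripts have run.
  const renderedHtml = await page.content();
  return { url, serverHtml, renderedHtml };
}

// Crude href extraction (double-quoted attributes only), good enough for a sketch.
function hrefs(html) {
  return new Set([...html.matchAll(/<a\b[^>]*\bhref="([^"]*)"/gi)].map((m) => m[1]));
}

// Links that exist only in the rendered DOM, i.e. links that depend on JavaScript.
function linksAddedByJs(capture) {
  const before = hrefs(capture.serverHtml ?? "");
  return [...hrefs(capture.renderedHtml)].filter((href) => !before.has(href));
}
```

The interesting output is the difference, not either capture alone: a non-empty result from `linksAddedByJs` means some internal links only exist if the scripts succeed.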

Why Puppeteer

We looked at Playwright, Selenium, and static parsers.

Our requirements were narrow: one browser engine, a JavaScript API the team already knew, direct access to the page and network lifecycle, and something that would run headless in Lambda. We did not need a cross-browser test suite.

Puppeteer fit the system and the team we had. Playwright would have been a perfectly defensible choice. This was not an objectively correct architectural decision. It was a fit decision, and pretending otherwise would be dishonest.

networkidle2 is a heuristic, not a definition of done

Our navigation still uses waitUntil: "networkidle2" with a 20-second ceiling. It is a good default. It is not proof that a page finished rendering.

Network-idle rules tell you network activity stayed below a threshold for a while. They do not tell you a component reached a ready state, a font settled, a delayed timer will not fire, or a third-party widget will ever finish.

We learned this the hard way on our own site. An earlier version waited for complete network silence, which never arrives on pages where analytics and background requests keep firing. Perfectly healthy pages were being flagged as failures because the capture never resolved. Waiting harder is not the same as waiting correctly.

A timeout should not erase the evidence you already collected. Ours returns partial results and says which stages completed, because a missing schema finding is worthless if navigation died before the script that injects the schema ever ran.
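
One way to keep partial evidence is to track the stage by hand and never let the catch block discard what was already collected. This is a sketch under assumptions: `page` is a Puppeteer Page, and the stage names and result fields are hypothetical.

```javascript
// Navigate with the networkidle2 heuristic, but report which stages completed.
async function captureWithStages(page, url, { timeout = 20_000 } = {}) {
  const result = { url, completed: [], failedStage: null, error: null, status: null, html: null };
  let stage = "navigate";
  try {
    const response = await page.goto(url, { waitUntil: "networkidle2", timeout });
    result.status = response ? response.status() : null;
    result.completed.push(stage);

    stage = "snapshot";
    result.html = await page.content();
    result.completed.push(stage);
  } catch (err) {
    result.failedStage = stage;
    // Puppeteer's navigation timeout rejects with an error named "TimeoutError".
    result.error = err.name === "TimeoutError" ? "timeout" : String(err.message);
    if (stage === "navigate") {
      // The page may still hold a partly built DOM; keep it, clearly labeled.
      try {
        result.html = await page.content();
        result.completed.push("partial-snapshot");
      } catch {
        // Nothing to salvage; the result already says navigation failed.
      }
    }
  }
  return result;
}
```

A rule that looks for schema can then check `completed` first and report that it could not tell, instead of reporting that the schema is missing.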

We deliberately do not scroll

This is where we changed our minds most sharply, and where a lot of audit tools go wrong.

The obvious move is to auto-scroll the page so intersection observers fire and every lazy-loaded image and component appears. It feels thorough. It produces a rich, complete DOM.

It also produces a DOM Google will never build. Google Search does not scroll and does not click. Content that only materializes on interaction may never reach the version Google indexes.

So DeepAudit does not scroll and does not click. It loads the page, waits, and records what is there without interaction. We would rather be limited in the same way our target is limited than hand you a flattering report on content no crawler will ever see.

If you do add a scrolling pass, keep it as a *separate* capture and bound it hard: max iterations, max distance, max elapsed time, and a stable-height check. An infinite-scroll page will otherwise keep your auditor alive forever.
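
A sketch of those four bounds, assuming `page` is a Puppeteer Page. The default limits and the `settleMs` pause are illustrative assumptions, and `scrolled` counts requested distance rather than the real scroll position, which is enough to guarantee the loop ends.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Scroll in fixed steps and stop at the first limit reached.
async function boundedScroll(page, limits = {}) {
  const { maxSteps = 25, maxDistance = 20_000, maxMs = 10_000, stepPx = 800, settleMs = 250 } = limits;
  const started = Date.now();
  const readHeight = () => page.evaluate(() => document.documentElement.scrollHeight);

  let height = await readHeight();
  let scrolled = 0;
  for (let step = 0; step < maxSteps; step++) {
    if (scrolled >= maxDistance) return { reason: "max-distance", steps: step, scrolled };
    if (Date.now() - started >= maxMs) return { reason: "max-time", steps: step, scrolled };

    await page.evaluate((dy) => window.scrollBy(0, dy), stepPx);
    scrolled += stepPx;
    await sleep(settleMs); // give lazy loaders a moment to append content

    const next = await readHeight();
    // Stable height: we have requested past the bottom and nothing grew.
    if (next === height && scrolled >= next) {
      return { reason: "stable-height", steps: step + 1, scrolled };
    }
    height = next;
  }
  return { reason: "max-steps", steps: maxSteps, scrolled };
}
```

The returned `reason` belongs in the capture's metadata, so a reader can tell a page that ended from a page the auditor gave up on.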

Check the response before you trust the DOM

page.goto() returns a response object, and Puppeteer does not throw on a 404. If you ignore that return value you can produce a beautiful, confident SEO audit of an error page.

We record the status, the final URL, the redirect chain, and the headers before evaluating a single rule.
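
A sketch of that record, built from the object `page.goto()` resolves to. The `summarizeResponse` name and the output fields are assumptions; the methods called on the response are Puppeteer's.

```javascript
// Turn a Puppeteer HTTPResponse (or null) into a plain record of what came back.
function summarizeResponse(response) {
  if (!response) {
    // goto() can resolve to null, for example on a same-document navigation.
    return { ok: false, status: null, finalUrl: null, redirectChain: [], headers: {} };
  }
  const status = response.status();
  return {
    ok: status >= 200 && status < 300,
    status,
    finalUrl: response.url(),
    // redirectChain() lists the requests that were redirected on the way here.
    redirectChain: response.request().redirectChain().map((req) => req.url()),
    headers: response.headers(),
  };
}
```

If `ok` is false, the remaining rules can be skipped or marked as not evaluated, so an error page never receives a score.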

Public URLs are hostile input

This is the section the original version of this article was missing entirely, and it is the most important one for anyone building this.

An audit tool takes a URL from a stranger and loads it in a real browser. You are not running a crawler. You are running an untrusted code execution service that makes outbound network requests from inside your infrastructure. That is a server-side request forgery engine unless you build it not to be.

Ours intercepts every request and aborts anything that resolves to a private address:

// SSRF: intercept all requests, block private/internal destinations
if (isPrivateIp(host)) {
  req.abort("blockedbyclient");
}

isPrivateIp() covers RFC 1918 space, loopback, IPv4-mapped IPv6, and the 169.254.0.0/16 link-local range, which contains the AWS instance metadata endpoint. That last one matters: an unguarded headless browser pointed at 169.254.169.254 will happily fetch your cloud credentials and hand them to a page you did not write.

We resolve the host and check the resolved address, not just the string, because a hostname can resolve to a private IP and a redirect can land somewhere the original URL did not.
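
For readers building their own guard, here is a sketch of the kind of check described above, using only Node's standard library. It is an illustration, not DeepAudit's actual isPrivateIp(): the helper names and the exact ranges are assumptions, and it expects IPv6 addresses in compressed form.

```javascript
const net = require("node:net");
const dns = require("node:dns").promises;

// True for IPv4 addresses in ranges an auditor should never fetch.
function isPrivateIpv4(ip) {
  const [a, b] = ip.split(".").map(Number);
  return (
    a === 0 || a === 10 || a === 127 ||      // "this network", RFC 1918, loopback
    (a === 172 && b >= 16 && b <= 31) ||     // RFC 1918
    (a === 192 && b === 168) ||              // RFC 1918
    (a === 169 && b === 254)                 // link-local, includes 169.254.169.254
  );
}

// Unwrap ::ffff:a.b.c.d and ::ffff:hhhh:hhhh into dotted IPv4, else null.
function mappedIpv4(ip) {
  const dotted = ip.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/);
  if (dotted) return dotted[1];
  const hex = ip.match(/^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/);
  if (!hex) return null;
  const hi = parseInt(hex[1], 16);
  const lo = parseInt(hex[2], 16);
  return [hi >> 8, hi & 255, lo >> 8, lo & 255].join(".");
}

function isPrivateIp(address) {
  const family = net.isIP(address);
  if (family === 4) return isPrivateIpv4(address);
  if (family !== 6) return true; // not an IP literal: fail closed
  const ip = address.toLowerCase();
  const v4 = mappedIpv4(ip);
  if (v4) return isPrivateIpv4(v4);
  return (
    ip === "::1" || ip === "::" ||
    /^f[cd][0-9a-f]{2}:/.test(ip) ||         // unique local, fc00::/7
    /^fe[89ab][0-9a-f]:/.test(ip)            // link-local, fe80::/10
  );
}

// Check every address the name resolves to, not just the first one.
async function hostResolvesToPrivate(hostname) {
  const records = await dns.lookup(hostname, { all: true });
  return records.some((record) => isPrivateIp(record.address));
}
```

A lookup done ahead of the request is still only a check on one resolution. The browser performs its own lookup when it connects, so the check has to run for every intercepted request, redirects included.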

The rest of the job is bounded too: fixed navigation and total deadlines, HTTP and HTTPS only, and a browser closed in a finally so a throw during evaluation cannot leak a Chromium process.
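
The scheme check and the total deadline are small enough to sketch. Both helper names are hypothetical, and the deadline shown here only stops waiting: it does not cancel the underlying work, which is why the browser close still has to happen in a finally.

```javascript
// Accept only http: and https: URLs. new URL() throws on malformed input.
function parseAuditUrl(input) {
  const url = new URL(input);
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`unsupported scheme: ${url.protocol}`);
  }
  return url;
}

// Reject if the promise has not settled within ms milliseconds.
function withDeadline(promise, ms, label = "job") {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded ${ms} ms`)), ms);
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Wrapping the whole audit in `withDeadline` gives a total ceiling on top of the per-navigation timeout, so a page that trickles responses cannot hold a worker indefinitely.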

Rendering accuracy does not matter if your auditor becomes a way into your own infrastructure.

About the user agent

We should be straight about this, because it is a real methodological choice.

DeepAudit does not announce itself. It presents as Chrome on a Pixel 5, because mobile rendering is what we are trying to reproduce and because that is the experience most visitors actually get.

That is a tradeoff, not a virtue. A declared crawler UA is more transparent and gets blocked more often. A browser UA sees what a browser sees and is less honest about what it is. We chose the browser profile and we are telling you we did, which is the only defensible version of that choice.

What we do not do is try to defeat bot detection. When a site blocks us, we detect it and say so:

cloudflare: "This site uses Cloudflare bot protection and blocked our
             automated browser. The audit cannot analyze the actual page."

Reporting "we were blocked" is a real finding. Quietly evading the block and reporting a score is a lie with a number attached.
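
A sketch of what such detection can look like. The signals used here (the `cf-mitigated: challenge` response header, a `server: cloudflare` header on a 403 or 503, and the challenge page's title text) are assumptions about how a Cloudflare block typically presents, not a description of DeepAudit's detector, and they will miss some blocks.

```javascript
const CLOUDFLARE_MESSAGE =
  "This site uses Cloudflare bot protection and blocked our automated browser. " +
  "The audit cannot analyze the actual page.";

// headers is expected with lowercase names, as Puppeteer's response.headers() returns them.
function detectBlock({ status, headers = {}, bodyText = "" }) {
  const server = String(headers["server"] || "").toLowerCase();
  const challenged = String(headers["cf-mitigated"] || "").toLowerCase() === "challenge";
  const looksLikeChallenge = /just a moment|attention required/i.test(bodyText);
  const blockedStatus = status === 403 || status === 503;

  if (challenged || (server === "cloudflare" && blockedStatus && looksLikeChallenge)) {
    return { blocked: true, provider: "cloudflare", message: CLOUDFLARE_MESSAGE };
  }
  return { blocked: false, provider: null, message: null };
}
```

The detector only reports. Nothing in it retries with a different fingerprint or tries to solve the challenge.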

What a rule should return

The original design returned a shape like this:

{ check: "h1-presence", status: "pass", impact: "high" }

The problem is impact: "high". Finding an H1 is an observation. Its business impact depends on the page, the visible title, the heading structure, the search demand, and what Google actually processes. The detector cannot know any of that.

Detection and interpretation have to be separate fields. A result should carry what was observed, where the evidence came from, how confident the detector is, whether a human needs to look, and which rule version produced it.
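
A sketch of a result that carries those fields and deliberately has no `impact`. The field names and the allowed `source` values are assumptions chosen for illustration.

```javascript
const SOURCES = ["http-response", "rendered-dom", "both"];

// Build a detection result: what was seen and where, not what it is worth.
function makeResult({ check, observed, source, confidence, needsReview = false, ruleVersion }) {
  if (!SOURCES.includes(source)) {
    throw new Error(`unknown evidence source: ${source}`);
  }
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    throw new Error("confidence must be a number between 0 and 1");
  }
  if (!ruleVersion) {
    throw new Error("every result needs the version of the rule that produced it");
  }
  return Object.freeze({ check, observed, source, confidence, needsReview, ruleVersion });
}

const example = makeResult({
  check: "h1-presence",
  observed: { h1Count: 1 },
  source: "rendered-dom",
  confidence: 0.9,
  ruleVersion: "1.0.0", // hypothetical version string
});
```

Interpretation, including any notion of impact, is then a separate step that reads these results alongside what is known about the page.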

Provenance is the part people skip. "Canonical missing" is not a finding. These are three different findings:

The canonical was absent from the HTTP response.
The canonical appeared only after JavaScript ran.
The response and the rendered DOM declare *different* canonicals.

A report that collapses those into one red icon is not helping anyone.
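
Telling those cases apart needs nothing more than the canonical from each capture. A sketch, assuming both values were already extracted (null when absent); the finding labels are hypothetical.

```javascript
// Compare the canonical seen in the HTTP response with the one in the rendered DOM.
function classifyCanonical(serverCanonical, renderedCanonical) {
  const inResponse = Boolean(serverCanonical);
  const inRenderedDom = Boolean(renderedCanonical);

  let finding;
  if (!inResponse && !inRenderedDom) finding = "absent-everywhere";
  else if (!inResponse) finding = "added-by-javascript";
  else if (!inRenderedDom) finding = "removed-by-javascript";
  else if (serverCanonical !== renderedCanonical) finding = "conflicting";
  else finding = "consistent";

  return {
    finding,
    serverCanonical: serverCanonical || null,
    renderedCanonical: renderedCanonical || null,
  };
}
```

The comparison is a plain string match, so URLs should be normalized before they reach it; otherwise a trailing slash reads as a conflict.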

We also learned to keep a status for "inconclusive." A page that timed out, a rule that did not apply, and a verified defect should not look identical.

We got a rule wrong, and it shipped

Our H1 rule failed any page with more than one H1. Multiple H1s are not automatically a defect: Google reads headings for structure, and the HTML standard allows more than one on a page.

That single overly strict rule inflated the H1 failure number in every city study we published and every client report we generated, until we caught it and downgraded it to a warning. A polished report does not make a detector correct.

Every rule needs a version, a test, and a way to be wrong in public.

Chromium lifecycle mattered as much as the audit logic

That is still the truest sentence in this article.

A fresh browser per request is fine at low volume and wasteful fast. One leaked process destabilizes a worker. The browser close belongs in a finally, not at the end of the happy path, because the happy path is not where the leaks come from.
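
The pattern is small. A sketch, where `launch` is any function that resolves to a browser, such as `() => puppeteer.launch()`; the `withBrowser` name is an assumption.

```javascript
// Run work(browser) and close the browser whether work resolves or throws.
async function withBrowser(launch, work) {
  const browser = await launch();
  try {
    return await work(browser);
  } finally {
    // A failing close() must not mask the original error from work().
    await browser.close().catch(() => {});
  }
}
```

Passing `launch` in as a parameter keeps the helper testable without Chromium, and leaves room to swap in a pooled browser later without touching the callers.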

If we started again, we would build browser lifecycle management before we built most of the SEO rules.

We would also separate capture from analysis sooner. Rendering is expensive; once you have a good snapshot, many rule versions can analyze the same evidence without loading the page again. But a cached DOM is evidence, not truth. Every snapshot needs a timestamp, a browser version, a viewport, a UA, and an interaction state attached, and personalized or fast-changing pages should not be cached at all.
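
A sketch of that separation: one snapshot with its capture conditions attached, analyzed by any number of rule versions. The field names and the rule interface (`id`, `version`, `run`) are assumptions for illustration.

```javascript
// Attach the capture conditions to the evidence at the moment it is taken.
function makeSnapshot({ html, browserVersion, viewport, userAgent, interaction = "none" }, now = new Date()) {
  return Object.freeze({
    html,
    capturedAt: now.toISOString(),
    browserVersion,
    viewport,
    userAgent,
    interaction, // "none" means no scrolling and no clicking happened
  });
}

// Run rules against stored evidence; no page load is involved.
function analyze(snapshot, rules) {
  return rules.map((rule) => ({
    rule: rule.id,
    ruleVersion: rule.version,
    capturedAt: snapshot.capturedAt,
    ...rule.run(snapshot),
  }));
}

// Two versions of a toy rule reading the same snapshot.
const countH1 = (html) => (html.match(/<h1\b/gi) || []).length;
const rules = [
  { id: "h1", version: "1.0.0", run: (s) => ({ status: countH1(s.html) === 1 ? "pass" : "fail" }) },
  { id: "h1", version: "2.0.0", run: (s) => ({ status: countH1(s.html) > 1 ? "warn" : "pass" }) },
];
```

Carrying `capturedAt` into every result is what lets a reader see that a finding describes the page as it was, not as it is.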

Our first 14-site pilot

Before we opened this publicly, we ran it against 14 websites from our own prospecting list. The point was never to measure the web. It was to find out which of our rules were wrong and which captures were failing.

Of the 14 pages: 12 produced at least ten HTML-validator errors, and 10 scored under 50 on the accessibility check. One marketing agency's homepage had 196 words. One SEO agency had no structured data on its own site. And one company had 12 internal URLs, including services and contact, that never completed inside our capture window.

Two honest caveats, because we have been telling everyone else to state theirs.

We did not keep the raw scan records, and we never wrote down which accessibility scorer produced that number or what timeout the URLs blew through. So those two counts cannot be independently checked, including by us. That is a process failure, and it is exactly why the later studies ship a dataset.

And a pilot of 14 purposively chosen prospects estimates nothing. It cannot tell you what share of the market has these problems. What it did do is expose which findings we were interpreting badly, which turned out to be most of them:

A raw count of validator errors says nothing about severity. Ten obsolete attributes are not a <head> that closes early and eats your metadata. We had to classify them before the number meant anything.

A low automated accessibility score marks a page for review. It does not establish WCAG conformance either way, and no automated tool does.

The 196-word homepage was worth *reading*, not scoring. Google has no minimum word count. The real question was whether those 196 words explained what the agency did, and they did not.

The missing schema was an opportunity to make visible facts explicit, not proof that Google could not understand the site. Plenty of pages rank with no JSON-LD at all.

And 12 timed-out URLs is a finding about our scanner's capture limit until someone confirms the pages also fail for real visitors. A browser timeout and a broken contact page are not the same event, and we reported them as if they were.

That pilot taught us more than any percentage in it. A detector can identify a condition. It still needs scope, evidence, and a human before it becomes a recommendation.

Where it is now

The first public version audited one page and ran about 60 checks.

The current system runs 60-plus checks across 9 categories and does a bounded same-site deep scan of up to 50 pages. The code in this article reflects the early single-page architecture, and we would rather show you that history than pretend the mature version existed on day one.

If an SEO audit never executes JavaScript, it may never see the problems your users and search engines actually experience. And if it inspects only the rendered DOM, it may never see the weak document the server sent before the application started, the conflicting directive, or the fact that everything on the page depends on one script succeeding.

Those two sentences are what this whole system is built around.

The lesson underneath all of it has not changed:

Do not trust only the source. Do not trust only the rendered DOM. Do not trust one readiness event. And do not trust an automated rule just because it returned a confident answer.

Capture the evidence at each stage, record where it came from, and let the disagreement tell you where to look.

That is what Puppeteer actually gave us. Not a copy of Google. A repeatable browser where we could finally see what changed after the server response arrived.

Run the audit if you want to see the output. Then open View Source and compare it yourself.

Written by

Crystal A. Gutierrez

Chairperson & Infrastructure Lead, Axion Deep Digital

The reason every deployment stays up, every domain resolves, and every environment runs clean. Infrastructure and operations across all Axion Deep products and client projects.
