End-to-end a11y testing

Run a real browser, scan a real page, and decide — carefully — what you fail the build on.

~14 min read · lesson 6 of 8

End-to-end accessibility testing scans a real, rendered page in a real browser. Unlike in component tests, layout is computed, CSS is applied, and JavaScript has run. That means contrast checks, hidden-element detection, and page-level checks (a single <main>, a single <h1>) actually work.

But because end-to-end tests are slower and broader, they fail more often — and false alarms erode trust fast. The skill is twofold: get the scan running, then decide what's worth blocking a deploy over.

Why end-to-end matters

Three things only a real browser can tell you:

What the page actually looks like. Computed styles, applied themes, browser defaults — all in play.
What state the page is in. End-to-end tests can log in, navigate, fill a form, then scan after each step. Component tests render in isolation; they never see the page mid-flow.
What CSS-driven a11y problems exist. Focus indicators with outline: none and no replacement, contrast that breaks under dark mode, content that's hidden visually but still in the tab order — all CSS issues.

The standard tool combination is Playwright + @axe-core/playwright. Cypress has equivalent integrations (cypress-axe); the patterns transfer directly.

Playwright + axe

tests/a11y.spec.ts

import { test, expect } from "@playwright/test";
import AxeBuilder from "@axe-core/playwright";

test("home page has no critical or serious axe violations", async ({ page }) => {
  // relative URL: resolved against baseURL from the Playwright config
  await page.goto("/");

  const results = await new AxeBuilder({ page })
    .withTags(["wcag2a", "wcag2aa"]) // scope the rule set
    .analyze();

  // fail only on the worst impact levels
  const blockers = results.violations.filter(v =>
    v.impact === "critical" || v.impact === "serious"
  );

  expect(blockers).toEqual([]);
});

A few things worth pointing out:

withTags(["wcag2a", "wcag2aa"]) scopes the run to rules tagged WCAG 2.0 Level A and AA. Skip this and axe runs its whole default rule set, which also includes best-practice rules (experimental rules stay off unless you enable them). Most teams only want WCAG A/AA; add wcag21a and wcag21aa if you also want the WCAG 2.1 rules.
Filtering by impact is key. axe categorises violations as minor, moderate, serious, or critical. Failing only on serious and critical keeps the build honest without becoming a chore.
expect(blockers).toEqual([]) produces a useful diff: on failure, Playwright prints the violation objects it received, including each node's selector, so you can find the offending elements.
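
The raw diff can get long, because each violation object carries a lot of detail. A minimal sketch of a formatting helper, assuming you want one line per blocker: describeBlockers and the trimmed-down types are hypothetical names for illustration, and the interfaces cover only the handful of fields used here, not axe-core's full result typings.

```typescript
// Minimal subset of the axe result shape (an assumption for this sketch;
// the real, fuller types ship with axe-core).
type Impact = "minor" | "moderate" | "serious" | "critical";

interface ViolationNode {
  // Selector path to the offending element; entries may be nested arrays.
  target: Array<string | string[]>;
}

interface Violation {
  id: string;                 // axe rule id, e.g. "image-alt"
  impact?: Impact | null;     // axe may leave impact unset
  help: string;               // short human-readable description
  nodes: ViolationNode[];
}

// Hypothetical helper: keep only serious/critical violations and
// render each one as a single readable line.
function describeBlockers(violations: Violation[]): string[] {
  return violations
    .filter(v => v.impact === "critical" || v.impact === "serious")
    .map(v => {
      const selectors = v.nodes
        .map(n => n.target.map(part => (Array.isArray(part) ? part.join(" ") : part)).join(" "))
        .join(", ");
      return `[${v.impact}] ${v.id}: ${v.help} (${selectors})`;
    });
}
```

In a spec, asserting on describeBlockers(results.violations) instead of the raw objects should give a shorter failure message; whether axe-core's own typings line up with this trimmed interface is something to check in your project.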

You'll typically have one a11y test (or spec file) per logical page or flow: home, sign-up, checkout, settings.

Tip

Run the same flow logged-in and logged-out. Authenticated views often have their own modals, drawers, and dynamic state that don't appear in the public version.

Scoping the scan

AxeBuilder lets you narrow what gets scanned, which matters when third-party widgets or experimental sections would otherwise drown your real findings:

tests/a11y.spec.ts

const results = await new AxeBuilder({ page })
  .include("main")              // only inside the main landmark
  .exclude("#intercom-iframe")  // skip the third-party chat widget
  .disableRules(["region"])     // turn off a rule we handle elsewhere
  .analyze();

Three small disciplines:

Include the part you own; exclude the parts you don't. A failing axe rule inside an iframe you don't control is noise.
Disable rules deliberately, with a comment. disableRules should always sit next to a comment explaining why. Otherwise it'll outlive the reason.
Don't use exclude to hide bugs. Excluding a known-broken section is fine if there's a ticket. Quietly ignoring it is technical debt that compounds.

A scoped axe run: include what you own, exclude third-party widgets, fail on serious and critical only.
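
One way to make "disable rules deliberately" harder to skip is to keep the disabled rules in a small registry where a reason is a required field. This is only a sketch under stated assumptions: DisabledRule, DISABLED_RULES, disabledRuleIds, the reason text, and the ticket id are all made up for illustration.

```typescript
// Hypothetical registry entry: a disabled rule must say why and where it is tracked.
interface DisabledRule {
  id: string;      // axe rule id, e.g. "region"
  reason: string;  // why the rule is turned off
  ticket: string;  // tracking reference (placeholder ids below)
}

// Illustrative content only; the reason and ticket are invented for this sketch.
const DISABLED_RULES: DisabledRule[] = [
  { id: "region", reason: "Handled by a separate landmark check", ticket: "A11Y-0000" },
];

// Returns the rule ids to pass on, and refuses entries with no explanation.
function disabledRuleIds(rules: DisabledRule[]): string[] {
  const undocumented = rules.filter(
    r => r.reason.trim() === "" || r.ticket.trim() === ""
  );
  if (undocumented.length > 0) {
    const ids = undocumented.map(r => r.id).join(", ");
    throw new Error(`Disabled rules missing a reason or ticket: ${ids}`);
  }
  return rules.map(r => r.id);
}
```

The result could then be handed to the builder as .disableRules(disabledRuleIds(DISABLED_RULES)), so an unexplained entry fails the suite instead of quietly outliving its reason.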

What to fail the build on

Pick a default that's strict enough to matter and loose enough to live with. A widely used baseline:

Fail the build on critical and serious impact violations on WCAG A and AA rules.
Report moderate and minor as warnings (logged in CI, but not blocking).
Don't fail on best-practice rules unless your team has explicitly opted in.

That posture catches the bugs that genuinely break the experience while leaving room for the team to clean up the rest gradually, without every PR turning red.
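
The baseline above can be sketched as a small partition step between the scan and the assertion. splitByPolicy and the types are hypothetical names, and the sketch assumes the scan was already limited to WCAG A and AA rules with withTags.

```typescript
type ImpactLevel = "minor" | "moderate" | "serious" | "critical";

// Only the fields this sketch needs; real axe results carry more.
interface Finding {
  id: string;
  impact?: ImpactLevel | null;
}

// Hypothetical policy: critical and serious block the build,
// everything else (including findings with no impact set) is a warning.
function splitByPolicy<T extends Finding>(violations: T[]): { blocking: T[]; warnings: T[] } {
  const blocking: T[] = [];
  const warnings: T[] = [];
  for (const v of violations) {
    if (v.impact === "critical" || v.impact === "serious") {
      blocking.push(v);
    } else {
      warnings.push(v);
    }
  }
  return { blocking, warnings };
}
```

A spec could log the warnings (for example with console.warn) so they show up in CI output, then assert only that blocking is empty.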

Watch out

Be especially careful with the color-contrast rule on pages that have user-generated content. A red build because a user picked low-contrast colors for their own content isn't useful: exclude those regions so the rule only runs against content you control.

For a brand-new project, raising the bar is easy: start strict. For an existing codebase with a backlog, baseline first (we'll cover that in lesson 7), then tighten.

Managing noise

The best end-to-end a11y suite is one that fails only when something real breaks. Otherwise people start ignoring it. A few habits keep noise down:

Stable states. Don't run scans in states that depend on flaky data (e.g. before a third-party script loads). Wait for a known element to appear before scanning; networkidle also works, but Playwright's documentation discourages relying on it in tests.
One scan per state. A login flow scanned at three meaningful states (logged-out, in form, logged-in) gives more value than one giant scan after the whole flow runs.
Snapshot violations, not screenshots. Comparing the list of violations between runs is more stable than comparing pixels.
Quarantine, don't delete. When a flaky rule keeps barking, move it to a separate, non-blocking job until you understand it. Don't disable and forget.
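
"Snapshot violations" can be as simple as reducing each scan to a sorted list of rule-plus-selector strings and comparing that list between runs. A minimal sketch, assuming that level of detail is enough for your pages: fingerprint, newSince, and the trimmed ScannedViolation type are hypothetical names.

```typescript
// Only the fields needed for a fingerprint; real axe results carry more.
interface ScannedViolation {
  id: string;
  nodes: { target: Array<string | string[]> }[];
}

// One string per (rule, element) pair, de-duplicated and sorted so that
// result ordering differences between runs do not show up as changes.
function fingerprint(violations: ScannedViolation[]): string[] {
  const keys = new Set<string>();
  for (const v of violations) {
    for (const node of v.nodes) {
      const selector = node.target
        .map(part => (Array.isArray(part) ? part.join(" ") : part))
        .join(" ");
      keys.add(`${v.id} | ${selector}`);
    }
  }
  return Array.from(keys).sort();
}

// Entries present in the current run but not in the stored baseline.
function newSince(baseline: string[], current: string[]): string[] {
  const known = new Set(baseline);
  return current.filter(key => !known.has(key));
}
```

The fingerprint list is plain text, so it can be committed as a baseline file or serialised with JSON.stringify and compared with whatever snapshot mechanism your test runner offers.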

Tip

Treat your a11y CI like your other tests: when it fails, fix it that day. Don't let "merge over the failure" become a team habit — once it does, every signal you put in the suite gets ignored.

Check yourself

check your understanding

Why does end-to-end testing catch contrast issues that jest-axe usually misses?

A reasonable default is to block the build on which axe impact levels?

Your build fails because a third-party chat widget has axe violations you can't fix. The right move is:

Which of these is not a good reason to fail the build on an a11y test?
