lesson 7

Canonical, robots, hreflang

The SEO plumbing — duplicate-content control with canonical, indexing rules with robots, language alternates with hreflang.

~14 min read · lesson 7 of 8


This lesson covers the trio of <link> and <meta> tags that tell search engines which version of a page to index, whether to index it at all, and which language version to show. Get them right and search engines understand your site structure; get them wrong and you split your ranking signals across duplicate URLs, none of which ranks well.

rel=canonical

A single page can be reachable at several URLs:

/article/sourdough-season
/article/sourdough-season?utm_source=newsletter
/article/sourdough-season?ref=twitter
https://www.example.com/article/sourdough-season vs https://example.com/...
A printable /article/sourdough-season/print version

To a search engine, those are all separate URLs. Without help, it sees multiple copies of the same content and either picks one on its own (not necessarily the one you want), splits the ranking signals across all of them, or, in the worst case, treats the duplicates as low-value pages.

rel="canonical" tells search engines "the canonical URL for this page is X — treat all these other URLs as variants of it".

canonical.html

<link rel="canonical" href="https://example.com/article/sourdough-season">

Every variant URL ships the same canonical link, and that link points at the version you want indexed. Search engines treat it as a strong hint rather than a command, and normally consolidate the ranking signals onto the canonical URL.

A few rules:

Use absolute URLs. Always include the protocol and domain. Relative paths work but are easy to get wrong.

The canonical can point to itself. A page's canonical pointing at its own URL is fine — and explicit. It tells search engines "yes, this URL is the one to index".

The canonical should match the page's content closely. Pointing two unrelated pages at the same canonical is a red flag.

Canonicalize across HTTPS, www, and trailing-slash variants. Pick one canonical form and stick to it. A common choice is "https, no www, no trailing slash on articles". Whatever you pick, the canonical link makes it explicit.

utm-canonical.html

<!-- The page is /article/sourdough-season?utm_source=newsletter -->
<link rel="canonical" href="https://example.com/article/sourdough-season">

The UTM-tagged URL is what the visitor loads. The canonical strips the tracking parameters, so search engines are asked to index only the clean URL. The tracking parameter still reaches your analytics, because the visitor's browser requested the tagged URL.
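The same pattern applies to the other variants listed earlier. As a minimal sketch, assuming the printable version really is served at /article/sourdough-season/print as in the list above, the two pages would each ship the following (the block shows two separate pages, not one):

```html
<!-- Served at /article/sourdough-season/print (the printable variant) -->
<link rel="canonical" href="https://example.com/article/sourdough-season">

<!-- Served at /article/sourdough-season (the clean URL): a self-canonical -->
<link rel="canonical" href="https://example.com/article/sourdough-season">
```

Both tags are identical on purpose: every variant, including the clean URL itself, names the same single URL.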

Watch out

Don't canonicalize a translated page to its English original — they are different content. Use hreflang for translations (covered below). Canonical is for the same content reachable via multiple URLs, not for related content.

check your understanding

Your blog post is at /posts/sourdough-season. A user shared it as /posts/sourdough-season?utm_source=twitter. What canonical URL should you ship on the tagged version?

meta robots

The robots meta tag controls whether search engines index a page and whether they follow its links.

robots-noindex.html

<meta name="robots" content="noindex, nofollow">

The values you'll use most:

noindex — do not include this page in search results.
index — do include (the default; you rarely write it).
nofollow — do not follow the links on this page (do not pass ranking signals through them).
follow — do follow links (the default).
none — shorthand for noindex, nofollow.
all — shorthand for index, follow.

You can mix and match. noindex, follow is common — keep the page out of results but follow its links so search engines discover other pages.

thank-you.html

<!-- Order confirmation page: don't index, but follow internal links -->
<meta name="robots" content="noindex, follow">

Common pages to noindex: thank-you pages, internal search results, faceted filters with low-value combinations, account dashboards, draft previews.

There's also a robots.txt file at your site root that controls crawling (whether search engines fetch the page at all). The two work together but solve different problems:

robots.txt — "do not even fetch this URL".
noindex meta — "fetch it, just do not index it".

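If you Disallow a URL in robots.txt, compliant crawlers never fetch it, so they never see the meta tag, and they may still index the bare URL based on links pointing to it (typically with a "no description available" snippet). For "really do not index this", the noindex meta tag is the more reliable signal, provided the page is not also blocked in robots.txt: the crawler has to fetch the page to read the tag.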

Tip

For non-HTML files (PDFs, images), use the X-Robots-Tag HTTP header instead of the meta tag — there's nowhere in a PDF to put a meta tag. Same values, header instead of meta.

hreflang for languages

When your site has the same content in multiple languages or for multiple regions, hreflang tells search engines which version to show to which user.

hreflang.html

<!-- On every language version of this page -->
<link rel="alternate" hreflang="en" href="https://example.com/sourdough">
<link rel="alternate" hreflang="es" href="https://example.com/es/sourdough">
<link rel="alternate" hreflang="pt-BR" href="https://example.com/pt-BR/sourdough">
<link rel="alternate" hreflang="x-default" href="https://example.com/sourdough">

Each <link> lists one language version. The hreflang attribute is the BCP 47 language code (en, es, pt-BR, zh-Hans).

x-default is the version to show when no other language matches the user. It is often the English (or default) version.

A few rules:

The hreflang set is reciprocal. Every version must list every other version and itself. The Spanish page lists <link rel="alternate" hreflang="es" href="..."> for itself, plus links to English and Portuguese. If one page omits the return link, search engines may ignore the annotations for that pair.

Use real language codes. en is English (any region), en-GB is British English specifically. es is Spanish (any region), es-MX is Mexican Spanish. Region subtags are ISO 3166-1 country codes, so British English is en-GB, not en-UK.

The URL has to be different per language. If your /sourdough page changes language based on a cookie and the URL stays the same, hreflang cannot help — every version needs its own URL.
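To illustrate the language and region codes in the rules above, here is a sketch of a set that mixes generic languages with regional variants. The /en-gb/ and /es-mx/ paths are hypothetical URLs invented for this example; they are not part of the site described in this lesson.

```html
<!-- Hypothetical URLs. The same block goes on every page in the set. -->
<link rel="alternate" hreflang="en" href="https://example.com/sourdough">
<link rel="alternate" hreflang="en-GB" href="https://example.com/en-gb/sourdough">
<link rel="alternate" hreflang="es" href="https://example.com/es/sourdough">
<link rel="alternate" hreflang="es-MX" href="https://example.com/es-mx/sourdough">
<link rel="alternate" hreflang="x-default" href="https://example.com/sourdough">
```

Each variant has its own URL. Under this setup, a reader in the UK would be expected to match en-GB, while other English speakers fall back to the generic en entry.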

An easier way to maintain the hreflang set: put the hreflang links in your sitemap.xml instead of in every page's HTML. That centralizes the list and makes the reciprocity easier to keep in sync.

check your understanding

You have your post in English (/sourdough) and Spanish (/es/sourdough). On the English page, you add <link rel="alternate" hreflang="es" href="/es/sourdough"> only. Is the setup complete?

Putting them together

A typical English-language article that has a Spanish translation, with no UTM params on the URL:

combined.html

<head>
<title>How sourdough fermentation works · Daily Bread Bakery</title>
<meta name="description" content="...">

<link rel="canonical" href="https://example.com/sourdough">

<link rel="alternate" hreflang="en" href="https://example.com/sourdough">
<link rel="alternate" hreflang="es" href="https://example.com/es/sourdough">
<link rel="alternate" hreflang="x-default" href="https://example.com/sourdough">

<!-- meta robots is omitted -> defaults to index, follow -->
</head>

Each page in the set has the same hreflang block. Each has its own self-canonical (the canonical points at the page's own URL). The English page's canonical is the English URL; the Spanish page's canonical is the Spanish URL.
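For comparison, here is a sketch of what the Spanish page's head might look like under the same assumptions. The title and description are left as placeholders rather than invented.

```html
<head>
<title>...</title>
<meta name="description" content="...">

<!-- Self-canonical: points at the Spanish URL, not the English one -->
<link rel="canonical" href="https://example.com/es/sourdough">

<!-- Same hreflang block as the English page, including the entry for itself -->
<link rel="alternate" hreflang="en" href="https://example.com/sourdough">
<link rel="alternate" hreflang="es" href="https://example.com/es/sourdough">
<link rel="alternate" hreflang="x-default" href="https://example.com/sourdough">
</head>
```

Only the canonical differs between the two pages; the hreflang block is copied unchanged.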

These three tags (canonical, robots, hreflang) are the SEO plumbing. Most pages need a canonical and not much else. Pages reachable at multiple URLs benefit most from a canonical. Pages that should not be indexed need noindex. Multi-language sites need the hreflang set.

check your understanding

You have an admin dashboard at /admin. You want to make absolutely sure it doesn't appear in Google search results. Which combination is most reliable?

check your understanding

Your site has English and German versions. The English page is at /article; the German page is at /de/article. You add <link rel="canonical" href="https://example.com/article"> on the German page. What's the consequence?

check your understanding

A blog post page is reached at https://www.example.com/post-1 and at https://example.com/post-1. The two URLs are identical content. To consolidate the ranking signals onto one URL, what's the smallest change?
