HTML media · 6 / 7

Captions and subtitles

track and WebVTT — captions for accessibility, subtitles for translation, descriptions for what the picture shows.

Captions are not just for deaf and hard-of-hearing users. They are the most-used accessibility feature on YouTube, the platform that did the most to mainstream them. People watch videos in noisy cafes, in libraries, on muted feeds, in second languages. Add captions and your audience grows.

This lesson is the markup side: the <track> element that pairs caption files with a <video> or <audio>, the WebVTT file format the captions live in, and the difference between captions, subtitles, and descriptions.

The track element

<track> lives inside <video> or <audio> and points at an external file containing the timed text. The browser reads the file and renders the captions on top of the video.

captioned-video.html
<video src="bake.mp4" controls width="800" height="450" poster="bake-still.jpg">
  <track
    kind="captions"
    src="bake-en.vtt"
    srclang="en"
    label="English"
    default>

  <track
    kind="subtitles"
    src="bake-es.vtt"
    srclang="es"
    label="Español">
</video>

Each <track> has a few attributes:

  • src — URL of the WebVTT file.
  • kind — what kind of track this is (covered below).
  • srclang — language code (BCP 47, like en, es, pt-BR).
  • label — human-readable name shown in the captions menu.
  • default — this track is enabled by default.

The browser exposes the tracks in the player's CC button. The user picks which one to show, or none. Multiple tracks of the same kind are allowed — typically one per language.
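
If you generate pages from a list of caption files, the attributes above map naturally onto a small template function. The Python sketch below is one way to do that; the helper name render_track and its parameters are assumptions made for this example, and only html.escape comes from the standard library.

```python
from html import escape


def render_track(kind: str, src: str, srclang: str, label: str,
                 default: bool = False) -> str:
    """Render one <track> tag (hypothetical helper); values are HTML-escaped."""
    attrs = [
        f'kind="{escape(kind)}"',
        f'src="{escape(src)}"',
        f'srclang="{escape(srclang)}"',
        f'label="{escape(label)}"',
    ]
    if default:
        # Boolean attribute: its presence alone marks the track as default.
        attrs.append("default")
    return "<track " + " ".join(attrs) + ">"


tracks = [
    render_track("captions", "bake-en.vtt", "en", "English", default=True),
    render_track("subtitles", "bake-es.vtt", "es", "Español"),
]
print("\n".join(tracks))
```

Escaping the values matters mostly for label, which is free text and may contain quotes.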

WebVTT files

WebVTT is a plain-text file format for timed text. Lightweight, readable, easy to write by hand or generate from a transcription tool.

bake-en.vtt
WEBVTT

00:00:01.000 --> 00:00:04.000
First, mix the flour and water in a large bowl.

00:00:05.000 --> 00:00:09.000
Let it rest for thirty minutes, then add the salt.

00:00:10.500 --> 00:00:13.500
Knead until the dough passes the windowpane test.

The file starts with WEBVTT (literal text, mandatory). After that, each cue is:

  1. A start and end timestamp separated by -->.
  2. The caption text on the next line(s).
  3. A blank line to separate cues.

Timestamps are HH:MM:SS.mmm. You can drop the hours for short videos: 00:04.000. The three-digit millisecond part is required, so 00:01:30 on its own is not a valid timestamp and the browser skips that cue.
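
If your cue times come from a script or a transcription tool as plain seconds, a few lines of code can produce and check the timestamp format. The Python sketch below is a minimal illustration, not a full WebVTT implementation; the function names format_timestamp, is_valid_timestamp, and build_vtt are made up for this example.

```python
import re

# Hours are optional; minutes and seconds are two digits each;
# the dot and three millisecond digits are always present.
TIMESTAMP_RE = re.compile(r"^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$")


def format_timestamp(seconds: float) -> str:
    """Convert a non-negative number of seconds to HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def is_valid_timestamp(text: str) -> bool:
    """Check the shape of a single timestamp such as 00:04.000."""
    return TIMESTAMP_RE.match(text) is not None


def build_vtt(cues: list[tuple[float, float, str]]) -> str:
    """Build a WebVTT document from (start, end, text) tuples."""
    blocks = ["WEBVTT"]  # mandatory first line
    for start, end, text in cues:
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{timing}\n{text}")
    # A blank line separates the header and each cue.
    return "\n\n".join(blocks) + "\n"


print(build_vtt([
    (1.0, 4.0, "First, mix the flour and water in a large bowl."),
    (10.5, 13.5, "Knead until the dough passes the windowpane test."),
]))
```

The validator only checks the shape of one timestamp. It does not verify that a cue's end time comes after its start time.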

WebVTT also supports basic styling: bold (<b>), italic (<i>), classes (<c.shouting>WHAT?!</c>), positioning settings written after the end timestamp (align:start, position:30%), and even speaker labels:

dialogue.vtt
WEBVTT

00:00:01.000 --> 00:00:03.500
<v Maya>The dough is ready when it pulls away from the bowl.

00:00:03.500 --> 00:00:05.000
<v Sam>How can you tell?

<v Maya>...</v> marks the speaker. The closing </v> can be omitted when one voice covers the whole cue, as in the example above. Some players show the name; others let you style it via CSS.
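
If you post-process caption files, for example to build a transcript grouped by speaker, the voice tag is easy to pull apart. The Python sketch below assumes each cue has at most one voice tag at the start of its text; split_speaker is a name invented for this example, and the regular expression is a simplification rather than a complete WebVTT parser.

```python
import re
from typing import Optional

# Matches an opening voice tag such as <v Maya> or <v.loud Maya>.
VOICE_RE = re.compile(r"<v(?:\.[^\s>]+)?\s+([^>]+)>")


def split_speaker(cue_text: str) -> tuple[Optional[str], str]:
    """Return (speaker, text) for cue text that starts with one voice tag."""
    match = VOICE_RE.match(cue_text)
    if match is None:
        return None, cue_text  # no speaker label on this cue
    speaker = match.group(1).strip()
    text = cue_text[match.end():]
    # Strip the closing tag if the author wrote one.
    if text.endswith("</v>"):
        text = text[: -len("</v>")]
    return speaker, text


print(split_speaker("<v Maya>The dough is ready when it pulls away from the bowl."))
print(split_speaker("<v Sam>How can you tell?</v>"))
```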

The full WebVTT spec covers regions, styling cues with CSS pseudo-elements (::cue), and ruby annotations for East Asian languages. For most projects, plain timestamped lines are 95% of what you need.

Tip

Tools like Whisper, Otter, and most editing software export WebVTT directly or as SRT, a close relative that converts easily. Auto-transcribe, then proofread. Clean transcripts are a fraction of the cost they were in 2018.

Captions vs subtitles vs descriptions

The kind attribute is more nuanced than it looks. Three values, each with a real meaning.

kind="captions" — for users who cannot hear the audio. Includes dialogue and sound effects: "[door slams]", "[laughter]", "[suspenseful music]". Same language as the audio. Captions assume the user is not hearing anything.

kind="subtitles" — for users who can hear but cannot understand the language. Just dialogue, no sound effects. Often a translation: an English video with Spanish subtitles.

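kind="descriptions" — for users who cannot see the video. Text descriptions of what is happening on screen: "the baker pulls the loaf from the oven and sets it on a wire rack". Meant to be read aloud by a screen reader or speech synthesizer (or delivered as a separate audio track in some setups). Native browser support for this kind is limited, so test it with the player you ship.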

A video may legitimately have all three: captions in the original language, subtitles in 12 translations, and descriptions for blind users.

Two more values exist:

  • kind="chapters" — chapter markers shown in the timeline.
  • kind="metadata" — invisible cues you can read from JavaScript; useful for syncing other UI to the video timeline.

The default if you omit kind is subtitles. Most projects need captions for accessibility and add subtitles for translations.
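
The fallback rule can be written as a tiny function. The Python sketch below follows the HTML standard as I understand it: an omitted kind means subtitles, while an unrecognized value (a typo such as "caption") falls back to metadata, which is never displayed. The name effective_kind is an assumption made for this example.

```python
from typing import Optional

VALID_KINDS = {"subtitles", "captions", "descriptions", "chapters", "metadata"}


def effective_kind(attr: Optional[str]) -> str:
    """Return the kind a browser would use for a given attribute value."""
    if attr is None:
        return "subtitles"  # attribute omitted entirely
    value = attr.lower()    # keywords are matched case-insensitively
    # A misspelled value does not fall back to subtitles.
    return value if value in VALID_KINDS else "metadata"


print(effective_kind(None))
print(effective_kind("caption"))
```

This is why a typo in kind can make a track silently disappear from the captions menu.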

check your understanding
Your video is in English. You want to make it accessible to deaf English-speaking users. Which kind of track is right?

default and srclang

A <track> with default is enabled the moment the video loads. Without default, the user has to open the captions menu and pick one.

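Captions and subtitles share a single default slot: at most one track among them should carry default, and the same limit applies separately to descriptions and to chapters. If two captions tracks both claim default, the browser picks the first one and ignores the second.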
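
A build step can catch duplicate defaults before they reach the browser. The Python sketch below walks the tracks in document order and reports which default wins and which are ignored. It assumes, following the HTML standard, that captions and subtitles compete for the same slot; the function name pick_defaults, the dictionary shape, and the file bake-en-alt.vtt are hypothetical.

```python
# Captions and subtitles compete for the same default slot.
GROUPS = {
    "captions": "text",
    "subtitles": "text",
    "descriptions": "descriptions",
    "chapters": "chapters",
}


def pick_defaults(tracks: list[dict]) -> tuple[dict[str, str], list[str]]:
    """Return ({group: winning src}, [src of ignored defaults])."""
    winners: dict[str, str] = {}
    ignored: list[str] = []
    for track in tracks:  # document order: the first default wins
        if not track.get("default"):
            continue
        group = GROUPS.get(track["kind"])
        if group is None:
            continue  # metadata tracks are not limited this way
        if group in winners:
            ignored.append(track["src"])
        else:
            winners[group] = track["src"]
    return winners, ignored


winners, ignored = pick_defaults([
    {"kind": "captions", "src": "bake-en.vtt", "default": True},
    {"kind": "captions", "src": "bake-en-alt.vtt", "default": True},  # hypothetical file
    {"kind": "subtitles", "src": "bake-es.vtt"},
])
print(winners, ignored)
```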

srclang is the language code of the track. Use BCP 47 codes — en for English, es for Spanish, pt-BR for Brazilian Portuguese, zh-Hans for Simplified Chinese. Browsers can use srclang to match a track to the user's language preferences, and the attribute is required when kind is subtitles.

multilingual.html
<video src="bake.mp4" controls>
  <track kind="captions" src="bake-en.vtt" srclang="en" label="English" default>
  <track kind="subtitles" src="bake-es.vtt" srclang="es" label="Español">
  <track kind="subtitles" src="bake-pt-BR.vtt" srclang="pt-BR" label="Português (Brasil)">
  <track kind="subtitles" src="bake-zh-Hans.vtt" srclang="zh-Hans" label="简体中文">
  <track kind="descriptions" src="bake-desc.vtt" srclang="en" label="Audio descriptions">
</video>

English captions are on by default for everyone. Spanish, Portuguese, and Chinese subtitles are available from the menu. The descriptions track is also available for screen-reader users.

The label is what shows in the menu. Write it in the target language so a Spanish-speaker sees "Español" not "Spanish". Same for the Portuguese and Chinese entries.
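
How a language preference gets matched against srclang values is up to each browser, and the same question comes up if you pick the default track on the server. The Python sketch below shows one plausible strategy, loosely modelled on BCP 47 lookup: try each preferred tag, then progressively shorter versions of it. The name match_track is an assumption for this example, and this is not the algorithm any particular browser uses.

```python
from typing import Optional


def match_track(preferred: list[str], available: list[str]) -> Optional[str]:
    """Return the first available srclang that fits the preference list."""
    lowered = {tag.lower(): tag for tag in available}
    for want in preferred:
        parts = want.lower().split("-")
        while parts:
            candidate = "-".join(parts)
            if candidate in lowered:  # exact match, ignoring case
                return lowered[candidate]
            for key, tag in lowered.items():
                # A broader preference matches a narrower track: pt -> pt-BR.
                if key.startswith(candidate + "-"):
                    return tag
            parts.pop()  # es-MX -> es
    return None


available = ["en", "es", "pt-BR", "zh-Hans"]
print(match_track(["pt", "en"], available))  # pt-BR
```

The fallback is deliberately loose. A user who prefers zh-Hant would be offered zh-Hans here, which may not be what you want, so treat script subtags with more care in real code.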

Watch out

Videos with no caption track are inaccessible to a sizable share of your audience: Wikipedia estimates 5 to 15% of the population, depending on the country and definition. Treat captions as required for any video that conveys information through speech, the same way alt text is required for informative images.

check your understanding
You want to ship a video with English captions on by default and Spanish subtitles available in the menu. What is the right markup?

check your understanding
You write a WebVTT cue: 00:01:30 --> 00:01:34. The browser does not show this cue. What's likely wrong?

check your understanding
Your video has dialogue, background music, and on-screen text. Captions need to convey all of it to deaf viewers. What is the cleanest WebVTT pattern for the music?