Captions and subtitles
track and WebVTT — captions for accessibility, subtitles for translation, descriptions for what the picture shows.
Captions are not just for deaf and hard-of-hearing users. They are the most-used accessibility feature on YouTube, the platform that did the most to mainstream them. People watch videos in noisy cafes, in libraries, on muted feeds, in second languages. Add captions and your audience grows.
This lesson is the markup side: the <track> element that pairs caption files with a <video> or <audio>, the WebVTT file format the captions live in, and the difference between captions, subtitles, and descriptions.
The track element
<track> lives inside <video> or <audio> and points at an external file containing the timed text. The browser reads the file and renders the captions on top of the video.
<video src="bake.mp4" controls width="800" height="450" poster="bake-still.jpg">
  <track kind="captions" src="bake-en.vtt" srclang="en" label="English" default>
  <track kind="subtitles" src="bake-es.vtt" srclang="es" label="Español">
</video>
Each <track> has a few attributes:
- src: URL of the WebVTT file.
- kind: what kind of track this is (covered below).
- srclang: language code (BCP 47, like en, es, pt-BR).
- label: human-readable name shown in the captions menu.
- default: this track is enabled by default.
The browser exposes the tracks in the player's CC button. The user picks which one to show, or none. Multiple tracks of the same kind are allowed — typically one per language.
WebVTT files
WebVTT is a plain-text file format for timed text. Lightweight, readable, easy to write by hand or generate from a transcription tool.
WEBVTT

00:00:01.000 --> 00:00:04.000
First, mix the flour and water in a large bowl.

00:00:05.000 --> 00:00:09.000
Let it rest for thirty minutes, then add the salt.

00:00:10.500 --> 00:00:13.500
Knead until the dough passes the windowpane test.
The file starts with WEBVTT (literal text, mandatory). After that, each cue is:
- A start and end timestamp separated by -->.
- The caption text on the next line(s).
- A blank line to separate cues.
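The cue structure above is simple enough to split by hand. The following is a minimal sketch, not a full WebVTT parser: `split_cues` is a hypothetical helper that assumes blocks are separated by blank lines and treats any block without a --> line (NOTE or STYLE blocks, for instance) as something to skip.

```python
def split_cues(vtt_text: str) -> list[dict]:
    """Split WebVTT text into cues: a timing line plus the text lines under it."""
    lines = vtt_text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    if not lines[0].startswith("WEBVTT"):
        raise ValueError("a WebVTT file must start with WEBVTT")

    cues: list[dict] = []
    block: list[str] = []
    # The extra "" at the end guarantees the last block is flushed.
    for line in lines[1:] + [""]:
        if line.strip():
            block.append(line)
            continue
        # Blank line: the current block (if any) is complete.
        timing = next((i for i, item in enumerate(block) if "-->" in item), None)
        if timing is not None:
            start, _, rest = block[timing].partition("-->")
            after_arrow = rest.split()
            if after_arrow:
                cues.append({
                    "start": start.strip(),
                    # Anything after the end time on this line is cue settings.
                    "end": after_arrow[0],
                    "text": "\n".join(block[timing + 1:]),
                })
        block = []
    return cues
```

Feeding it the three-cue baking file from earlier should give three dictionaries, each with a start, an end, and the caption text. A production pipeline would be better served by a maintained WebVTT library.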
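Timestamps are HH:MM:SS.mmm. You can drop the hours for short videos: 00:04.000. The milliseconds are not optional: every timestamp needs exactly three digits after the decimal point, and a cue whose timing line is malformed is skipped.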
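A strict timestamp check catches most hand-editing mistakes before the file reaches a browser. This sketch assumes the rules as I read them in the WebVTT spec: hours are optional (two or more digits), minutes and seconds are two digits from 00 to 59, and the fraction is exactly three digits. The name `parse_timestamp` is made up for this example.

```python
import re

# Optional hours, then MM:SS.mmm with a mandatory three-digit fraction.
_TIMESTAMP = re.compile(r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})")


def parse_timestamp(value: str) -> float:
    """Convert a WebVTT timestamp to seconds, or raise ValueError if malformed."""
    match = _TIMESTAMP.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid WebVTT timestamp: {value!r}")
    hours, minutes, seconds, millis = match.groups()
    return (
        int(hours or 0) * 3600
        + int(minutes) * 60
        + int(seconds)
        + int(millis) / 1000
    )
```

With this, `parse_timestamp("00:04.000")` returns 4.0, while a value without the fractional part, such as "00:01:30", raises ValueError.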
WebVTT also supports basic styling inside the cue text: bold (<b>), italic (<i>), and classes (<c.shouting>WHAT?!</c>). Positioning settings such as align:start and position:30% go on the timing line, after the end timestamp. There are even speaker labels:
WEBVTT

00:00:01.000 --> 00:00:03.500
<v Maya>The dough is ready when it pulls away from the bowl.

00:00:03.500 --> 00:00:05.000
<v Sam>How can you tell?
<v Maya>...</v> marks the speaker. The closing </v> can be left out when one voice covers the whole cue, as in the example above. Some players show the name; others let you style it via CSS.
The full WebVTT spec covers regions, styling cues with CSS pseudo-elements (::cue), and ruby annotations for East Asian languages. For most projects, plain timestamped lines are 95% of what you need.
Tools like Whisper, Otter, and most editing software export WebVTT directly. Auto-transcribe, then proofread. Clean transcripts are a fraction of the cost they were in 2018.
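If a transcription tool hands you raw timings instead of a finished file, writing the WebVTT yourself takes a few lines. This is a sketch under an assumption: the segments arrive as (start_seconds, end_seconds, text) tuples. That shape is hypothetical, not any particular tool's output format, and `format_timestamp` and `segments_to_vtt` are names invented here.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm, always with three millisecond digits."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments: list[tuple[float, float, str]]) -> str:
    """Build WebVTT text from (start, end, text) tuples (an assumed input shape)."""
    blocks = ["WEBVTT"]
    for start, end, text in segments:
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        # Caption text must not contain blank lines, so surrounding whitespace is trimmed.
        blocks.append(f"{timing}\n{text.strip()}")
    # One blank line between the header and each cue, and a final newline.
    return "\n\n".join(blocks) + "\n"
```

The proofreading step still applies: generated timings are usually fine, but names, jargon, and punctuation are where automatic transcripts tend to slip.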
captions vs subtitles vs descriptions
The kind attribute is more nuanced than it looks. Three values, each with a real meaning.
kind="captions": for users who cannot hear the audio. Includes dialogue and sound effects: "[door slams]", "[laughter]", "[suspenseful music]". Same language as the audio. Captions assume the user is not hearing anything.
kind="subtitles": for users who can hear but cannot understand the language. Just dialogue, no sound effects. Often a translation: an English video with Spanish subtitles.
kind="descriptions": for users who cannot see the video. Text descriptions of what is happening on screen: "the baker pulls the loaf from the oven and sets it on a wire rack". Meant to be read aloud by the screen reader (or delivered as a separate audio track in some setups). Browser support for this kind is uneven, so test it with the player you ship.
A video may legitimately have all three: captions in the original language, subtitles in 12 translations, and descriptions for blind users.
Two more values exist:
- kind="chapters": chapter markers shown in the timeline.
- kind="metadata": invisible cues you can read from JavaScript; useful for syncing other UI to the video timeline.
The default if you omit kind is subtitles. Most projects need captions for accessibility and add subtitles for translations.
default and srclang
A <track> with default is enabled the moment the video loads. Without default, the user has to open the captions menu and pick one.
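Mark at most one subtitles or captions track as default: the HTML spec counts those two kinds together for this purpose, while descriptions and chapters each get their own single default. If two captions tracks both claim default, the browser picks the first one and ignores the second.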
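A conflicting default is easy to miss in a long list of tracks, so it can be worth checking in a build step. The sketch below uses Python's standard html.parser and rests on two assumptions: the markup contains a single media element, and (my reading of the HTML spec) subtitles and captions share one default slot while descriptions and chapters each have their own. `find_default_conflicts` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Assumed grouping: subtitles and captions compete for the same default slot.
DEFAULT_GROUPS = {
    "subtitles": "subtitles/captions",
    "captions": "subtitles/captions",
    "descriptions": "descriptions",
    "chapters": "chapters",
}


class TrackCollector(HTMLParser):
    """Collects the attributes of every <track> start tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.tracks: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "track":
            self.tracks.append(dict(attrs))


def find_default_conflicts(html: str) -> list[str]:
    """Return one message per <track default> that loses to an earlier default."""
    collector = TrackCollector()
    collector.feed(html)
    collector.close()

    winners: dict[str, str] = {}
    problems: list[str] = []
    for track in collector.tracks:
        if "default" not in track:
            continue
        # A missing kind attribute means subtitles.
        kind = (track.get("kind", "subtitles") or "").lower()
        group = DEFAULT_GROUPS.get(kind)
        if group is None:
            continue  # metadata and unrecognised kinds are not limited here
        src = track.get("src") or "(no src)"
        if group in winners:
            problems.append(
                f"{src}: default ignored, {winners[group]} already claims {group}"
            )
        else:
            winners[group] = src
    return problems
```

An empty list means no two tracks are competing for the same default slot.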
srclang is the language code of the track, and it is required when kind is subtitles. Use BCP 47 codes: en for English, es for Spanish, pt-BR for Brazilian Portuguese, zh-Hans for Simplified Chinese. The browser may use srclang to match a track to the user's language preferences; default is the fallback when those preferences do not point to another track.
<video src="bake.mp4" controls>
  <track kind="captions" src="bake-en.vtt" srclang="en" label="English" default>
  <track kind="subtitles" src="bake-es.vtt" srclang="es" label="Español">
  <track kind="subtitles" src="bake-pt-BR.vtt" srclang="pt-BR" label="Português (Brasil)">
  <track kind="subtitles" src="bake-zh-Hans.vtt" srclang="zh-Hans" label="简体中文">
  <track kind="descriptions" src="bake-desc.vtt" srclang="en" label="Audio descriptions">
</video>
English captions are on by default for everyone. Spanish, Portuguese, and Chinese subtitles are available from the menu. The descriptions track is also available for screen-reader users.
The label is what shows in the menu. Write it in the target language so a Spanish-speaker sees "Español" not "Spanish". Same for the Portuguese and Chinese entries.
Videos with no caption track are inaccessible to a sizable share of your audience (Wikipedia estimates 5 to 15% of the population, depending on the country and definition). Treat captions as required for any video that conveys information through speech, the same way alt text is required for informative images.
A cue's timing line reads 00:01:30 --> 00:01:34. The browser does not show this cue. What's likely wrong?