Language

Overview

What the language setting is

Every document is written in a language. A screen reader needs to know which one so it can pronounce the words correctly. The language setting is a small piece of information stored inside your PDF that names that language, such as English, Hindi, or French.

A screen reader does not understand the words it reads. It converts written text into speech using pronunciation rules, and those rules are different for every language. The same letters sound completely different depending on which language the reader thinks the text is in. So before it speaks a single word, the screen reader checks the declared language and loads the matching set of rules.

When the language is declared correctly, the document sounds right. When it is missing, the screen reader has to guess, and it often guesses wrong. When the wrong language is declared, every word is run through the wrong rules. A reader with a print disability then hears speech that is hard to follow, or in the worst cases impossible to understand.

What a reader loses when the language is wrong

A screen reader user opens an English document that has no declared language. The screen reader falls back to its own default, which may be a different language entirely, or it applies English rules unevenly. Names, dates, and ordinary words come out mispronounced. The reader can usually still work out the meaning, but every sentence takes more effort, and some words are simply lost.

The harder case is a passage in a second language. An English report quotes one sentence in French. If that sentence is not marked as French, the screen reader reads it with English pronunciation rules. The result is not a French accent that a listener can decode. It is a string of sounds that does not match any real French word, so the quotation becomes meaningless. The reader knows something was there, but cannot tell what it said.

In depth

The whole document language declared correctly

This is the correct state. The document carries a single declared language that matches the language the document is actually written in. A screen reader reads it from start to finish using the right pronunciation rules, and the speech sounds natural.

For a PDF, the document language is set in one place: the catalog, which is the top-level dictionary of the file. The entry is named /Lang. When it holds a valid language code, for example en for English or hi for Hindi, the screen reader has everything it needs for a document written entirely in that one language.

A concrete example. Your document is a course syllabus written in English. The catalog declares /Lang as en. A screen reader user opens it and hears clear, correctly pronounced English from the title to the final line. Nothing in the language setting stands between the reader and the content.

A machine can confirm that this entry exists and that the code is a real, valid language code. What a machine cannot confirm on its own is whether the declared language is the right one. That gap is the next case.
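The machine side of this check can be sketched in a few lines of Python. This is a minimal illustration, not a real PDF checker: the dictionary stands in for a catalog that some PDF library has already parsed, the function name `check_document_lang` is hypothetical, and the pattern only tests the shape of a tag such as en or fr-CA. It does not look the code up in any language registry.

```python
import re

# Simplified shape check for a language tag such as "en", "hi", or "fr-CA".
# Assumption: this is NOT a full BCP 47 validator. It tests the pattern only
# and does not confirm that the subtags are registered.
TAG_SHAPE = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z0-9]{1,8})*")


def check_document_lang(catalog: dict) -> str:
    """Classify the /Lang entry of a (hypothetical) parsed catalog dict."""
    lang = catalog.get("/Lang")
    if lang is None:
        return "missing"
    if not isinstance(lang, str) or not lang.strip():
        return "empty"
    if not TAG_SHAPE.fullmatch(lang.strip()):
        return "malformed"
    return "well-formed"


print(check_document_lang({"/Type": "/Catalog", "/Lang": "en"}))
print(check_document_lang({"/Type": "/Catalog"}))
print(check_document_lang({"/Type": "/Catalog", "/Lang": "english!"}))
```

Note that the function would return the same "well-formed" result for fr on a document written in English. It inspects the code, never the words on the page.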

The wrong language declared

A document can declare a language and still be wrong, because the declared language does not match the language on the page. This passes the basic machine check, since a valid code is present, but it fails the reader, because the code names the wrong language.

A concrete example. Your document is written in English, but the /Lang entry says fr for French. A machine sees a valid language code and may report success. A screen reader, trusting that code, reads your English text using French pronunciation rules. The word "the" does not sound like "the." Names and technical terms come out distorted. The whole document is hard to follow, even though every word on the page is correct English.

This is why a present language code is not the same as a correct one. A machine can check that a language is declared and that it is a valid code. A person still has to confirm that the declared language is the language the document is actually in.

A passage in another language not marked

A document can have the right overall language and still mishandle parts of itself. When a passage inside the document is written in a different language from the rest, that passage needs its own language marking. In a PDF, a part in another language can carry its own /Lang on the structure that wraps it, separate from the document-level setting.

A concrete example, before and after.

Before: an English report includes the sentence "La liberté est le droit de faire tout ce que les lois permettent" as a quotation. The document declares English, and nothing marks this sentence as French. A screen reader reads it with English rules. The listener hears a run of sounds that match no French words and carry no meaning. The quotation is lost.

After: the same sentence is wrapped in a structure element whose /Lang is set to fr for French. The screen reader switches to French pronunciation for just that sentence, then returns to English for the surrounding text. The quotation now sounds like French and a French-speaking listener can understand it.
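The switch described above can be modelled as simple inheritance: each element uses its own language if it has one, and otherwise the language of whatever contains it. The sketch below assumes a hypothetical `Node` class as a stand-in for a structure element. It is not the API of any PDF library, and the two English sentences are invented sample text.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a structure element in the tag tree."""
    tag: str
    text: str = ""
    lang: Optional[str] = None  # the element's own /Lang, if it has one
    children: list[Node] = field(default_factory=list)


def speech_runs(node: Node, inherited: str) -> Iterator[tuple[str, str]]:
    """Yield (language, text) pairs in reading order.

    An element's own language overrides the inherited one for that element
    and everything inside it. Its siblings are unaffected.
    """
    lang = node.lang or inherited
    if node.text:
        yield lang, node.text
    for child in node.children:
        yield from speech_runs(child, lang)


document_lang = "en"  # the document-level /Lang from the catalog
report = Node("Document", children=[
    Node("P", text="The author opens with a quotation."),
    Node("BlockQuote", lang="fr",
         text="La liberté est le droit de faire tout ce que les lois permettent"),
    Node("P", text="The next chapter returns to the argument."),
])

for lang, text in speech_runs(report, document_lang):
    print(f"[{lang}] {text}")
```

Removing `lang="fr"` from the quotation reproduces the "before" state: all three runs come out tagged en, and the French sentence would be read with English rules.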

This applies to any second-language content of meaningful length: a quotation, a phrase, a passage. Single proper names and technical terms are usually left alone, in line with the exceptions in WCAG Success Criterion 3.1.2, but a sentence or a passage in another language should be marked.

Where machine checking stops

A machine can do two useful things here. It can confirm that a document language is declared at all, and it can confirm that the declared value is a valid language code rather than a typo or an empty entry. Those checks are real and worth running.

A machine cannot do two other things. It cannot confirm that the declared language is the correct one for the words on the page, because that requires reading and recognising the language. And it cannot reliably find every passage written in a second language and confirm each one is marked, because that too requires recognising the languages involved. Those judgements need a person. For more on this division of labour, see the topic on what automated checking can and cannot find.
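One way to keep this division visible in practice is to have a checking script report both halves: what it verified, and what it has to hand to a person. The sketch below is an assumption about how such a report might be shaped, not the behaviour of any existing tool. The function name `audit_language` and its inputs are hypothetical, and the tag pattern is a simplified shape check rather than a full validator.

```python
from __future__ import annotations

import re

# Simplified shape check for tags such as "en" or "fr-CA" (an assumption,
# not a complete language-tag validator).
TAG_SHAPE = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z0-9]{1,8})*")


def audit_language(doc_lang: str | None, part_langs: list[str]) -> dict:
    """Split a language audit into machine findings and human follow-ups."""
    declared = bool(doc_lang and doc_lang.strip())
    machine = {
        "document language declared": declared,
        "document language well-formed":
            declared and bool(TAG_SHAPE.fullmatch(doc_lang.strip())),
        "part languages well-formed":
            all(TAG_SHAPE.fullmatch(p.strip()) for p in part_langs),
    }
    # These questions stay open whatever the machine results are.
    person = [
        "Does the declared document language match the language on the page?",
        "Is every passage in another language marked with its own language?",
    ]
    return {"machine": machine, "person": person}


result = audit_language("fr", part_langs=[])
for check, passed in result["machine"].items():
    print("PASS" if passed else "FAIL", check)
for question in result["person"]:
    print("REVIEW", question)
```

Every machine check passes for this input, yet both review questions remain. If the document is in fact English, only the person answering the first question will catch it.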

Reference detail

Standards mapping

Standard	Identifier	What it requires
WCAG, Language of Page	Success Criterion 3.1.1, Level A	The default human language of the document can be programmatically determined
WCAG, Language of Parts	Success Criterion 3.1.2, Level AA	The human language of each passage or phrase can be programmatically determined, except for proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the surrounding text
Matterhorn Protocol	Checkpoint 11, Declared natural language	The document and its parts declare their natural language

Level A is the minimum level. Level AA is the level most laws and policies require, and it includes everything in Level A. For how these levels fit together, see the topic on WCAG and the four POUR principles.

Where the language lives in a PDF

Setting	Where it is stored	Purpose
Document language	The `/Lang` entry in the PDF catalog	Names the default language for the whole document
Language of a part	A `/Lang` entry on the structure element that wraps the passage	Overrides the document language for that passage only

The catalog is the top-level dictionary of the PDF file. A structure element is a node in the tag tree, the structure tree a screen reader follows. A part in a different language carries its own /Lang so the screen reader switches pronunciation for that part and then returns to the document default.

Common mistakes

Mistake	Effect on the reader
No document language set	The screen reader guesses, and may read the text with the wrong pronunciation rules
The wrong language set	The screen reader reads every word using rules for a language the document is not in
Foreign-language passages not marked	A passage in another language is read with the wrong pronunciation and may be unintelligible

The fix in all three cases: declare the correct document language, and mark any passage written in a different language with its own language.

Authoritative sources

W3C, "Understanding Success Criterion 3.1.1: Language of Page" https://www.w3.org/WAI/WCAG21/Understanding/language-of-page.html 2024
W3C, "Understanding Success Criterion 3.1.2: Language of Parts" https://www.w3.org/WAI/WCAG21/Understanding/language-of-parts.html 2024
PDF Association, "The Matterhorn Protocol 1.1" https://pdfa.org/resource/the-matterhorn-protocol/ 2021
WebAIM, "PDF Accessibility" https://webaim.org/ 2024