Fonts and text

Overview

What "real text" means in a PDF

A PDF can show words on a page in two very different ways, and only one of them is readable by anyone other than a sighted reader.

The first way is real, selectable text. The letters are stored as characters, drawn on the page using an embedded font, and each character can be recovered as the letter it actually is. You can select it with your mouse, copy it, search for it, and a screen reader can read it aloud. This is what you want.

The second way is a picture of words. A scanned page, or a screenshot of a paragraph, is an image. It looks like text to a sighted reader, but there is nothing behind it. There are no characters to select, copy, search, or read aloud. To a screen reader, the page is blank.

For real text to work, two things have to be right. The font has to be embedded in the file, so the right letter shapes are available wherever the file is opened. And each font needs a mapping, called a ToUnicode mapping, that says which character each shape stands for. ToUnicode is what lets the actual letters be recovered, so the word "electric" comes back as "electric" and not as a string of nonsense.

What a reader loses when fonts or text are wrong

When the text is a picture with no real text behind it, a screen reader user gets silence. The page may be full of words a sighted reader can see, and the reader with a print disability hears nothing at all. The fix here is not a text description. The fix is to run optical character recognition, called OCR, which reads the picture and adds a real text layer underneath it.

When the font is missing its ToUnicode mapping, the text is there but it comes out wrong. The screen reader reads gibberish, or skips the word, or speaks a string of unrelated letters. Copy and search break in the same way: you select the word "electric" and paste a meaningless string. The page looks perfect on screen and is unusable underneath.

In depth

Fonts embedded with a correct ToUnicode mapping

This is the case where everything works. The font is embedded in the file, so the letter shapes travel with the document and render the same on any machine. Each font carries a ToUnicode mapping, so every shape on the page can be traced back to the character it represents.

A short example. Your report contains the sentence "The electric grid failed in 2019." The font is embedded, and its ToUnicode mapping is complete. A screen reader user hears "The electric grid failed in 2019." A reader using copy and paste gets exactly those characters. A reader searching the file for "electric" finds it. Nothing is lost between what is on the page and what reaches the reader.

A machine can confirm this case. Software can open the file, check that each font is embedded, and check that each font has a ToUnicode mapping. When both are present, this part of the document is sound, and the check is reliable.
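
The check described above can be sketched in a few lines of Python. This is a minimal sketch, not a real checker: it assumes the font dictionaries have already been read out of the PDF into plain Python dicts, and the function names and sample fonts are hypothetical. The key names (/Subtype, /FontDescriptor, /FontFile, /FontFile2, /FontFile3, /DescendantFonts, /ToUnicode) are the ones the PDF format uses. It only tests whether a ToUnicode entry is present, not whether its contents are correct.

```python
from typing import Any, List, Mapping, Tuple

# Keys under which a font descriptor can carry an embedded font program.
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def is_embedded(font: Mapping[str, Any]) -> bool:
    """Return True if the font dictionary carries its own glyph data."""
    subtype = font.get("/Subtype")
    # Type 3 fonts define their glyphs inside the PDF itself,
    # so there is no separate font file to look for.
    if subtype == "/Type3":
        return True
    # Composite (Type 0) fonts keep the descriptor on their descendant font.
    if subtype == "/Type0":
        descendants = font.get("/DescendantFonts", [])
        return bool(descendants) and all(is_embedded(d) for d in descendants)
    descriptor = font.get("/FontDescriptor", {})
    return any(key in descriptor for key in FONT_FILE_KEYS)


def has_to_unicode(font: Mapping[str, Any]) -> bool:
    """Return True if the font dictionary has a /ToUnicode entry at all."""
    return "/ToUnicode" in font


def audit_fonts(fonts: Mapping[str, Mapping[str, Any]]) -> List[Tuple[str, str]]:
    """List (font name, problem) pairs for every font that fails a check."""
    problems = []
    for name, font in fonts.items():
        if not is_embedded(font):
            problems.append((name, "not embedded"))
        if not has_to_unicode(font):
            problems.append((name, "no ToUnicode mapping"))
    return problems


# Hypothetical sample data standing in for fonts read from a real file.
sample_fonts = {
    "/F1": {"/Subtype": "/TrueType",
            "/FontDescriptor": {"/FontFile2": b"..."},
            "/ToUnicode": b"..."},
    "/F2": {"/Subtype": "/TrueType",
            "/FontDescriptor": {"/FontFile2": b"..."}},
    "/F3": {"/Subtype": "/Type1",
            "/FontDescriptor": {},
            "/ToUnicode": b"..."},
}

report = audit_fonts(sample_fonts)
```

In this sample, /F1 passes both checks, /F2 is flagged for its missing mapping, and /F3 is flagged as not embedded. A production checker has more cases to consider than this sketch covers.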

A font with a missing or broken ToUnicode mapping

Here the words are real text, not a picture, but the link between the shapes and the characters is broken. The font draws the right shapes on screen, so a sighted reader sees the word correctly. Underneath, the ToUnicode mapping is missing or wrong, so the characters cannot be recovered.

A before-and-after example. Before: the word "electric" is set in a font that was embedded without a correct ToUnicode mapping. On screen it reads "electric." A screen reader user hears a run of meaningless letters, or silence. A reader who copies the word and pastes it gets a run of unprintable symbols, or something like "e1ec7r1c." Searching the file for "electric" finds nothing, because the file does not know the word is there. After: the font is re-embedded with a correct ToUnicode mapping. The shapes on screen are unchanged, but now the characters can be recovered. The screen reader says "electric," copy and paste returns "electric," and search finds it.

A machine can detect this case too. Software can see that a font is embedded but lacks a usable ToUnicode mapping, and can flag it. This is a common and damaging fault precisely because it is invisible to the eye: the page looks finished, so the problem is easy to miss without a check.
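
To make the mechanism concrete, here is a small Python sketch of how text recovery depends on the mapping. It is an illustration under stated assumptions, not a full implementation: the character codes and the sample mapping are invented for the example, the function names are hypothetical, and the parser reads only the simple one-to-one "bfchar" entries of a ToUnicode mapping, ignoring the range form that real files also use.

```python
import re
from typing import Dict, List

# One bfchar entry looks like: <01> <0065>
# (character code in the font, then the Unicode value as UTF-16BE hex).
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> Dict[int, str]:
    """Build a code -> character table from the bfchar sections of a CMap."""
    mapping: Dict[int, str] = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for code_hex, unicode_hex in BFCHAR_ENTRY.findall(section):
            char = bytes.fromhex(unicode_hex).decode("utf-16-be")
            mapping[int(code_hex, 16)] = char
    return mapping


def extract_text(codes: List[int], mapping: Dict[int, str]) -> str:
    """Recover characters from codes; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Invented sample: a subset font that numbers its glyphs 1 to 6.
SAMPLE_CMAP = """
6 beginbfchar
<01> <0065>
<02> <006C>
<03> <0063>
<04> <0074>
<05> <0072>
<06> <0069>
endbfchar
"""

# The codes the page uses to draw the word "electric" in that font.
ELECTRIC_CODES = [0x01, 0x02, 0x01, 0x03, 0x04, 0x05, 0x06, 0x03]

with_mapping = extract_text(ELECTRIC_CODES, parse_bfchar(SAMPLE_CMAP))
without_mapping = extract_text(ELECTRIC_CODES, {})
```

With the mapping, the codes come back as "electric". Without it, the same eight codes come back as eight replacement characters, so a search for "electric" finds nothing even though the shapes on the page have not changed.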

A picture of words with no real text behind it

This is the scanned page, or any image of text. A document is photographed or scanned, and the result is stored as an image. It shows words, but there are no characters in the file at all. There is no font to embed and no ToUnicode mapping to fix, because there is no text layer to begin with.

An example. A library scans a printed chapter and saves it as a PDF. On screen it looks like a normal page of text. A screen reader user opening it hears nothing, because the page is a single image. Selecting text selects nothing. Searching the file returns no results. The fix is not alternative text, and it is not re-embedding a font. The fix is OCR: software reads the image, recognises the words, and adds a real, selectable text layer underneath the picture. After OCR, the same page has real text behind the image, and a screen reader can read it, copy works, and search works.

This is also where PDF repair meets the WCAG rules on images of text. A page that exists only as an image of text is what WCAG criterion 1.4.5 Images of Text addresses, and a scanned page with no text layer is what OCR exists to repair.
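
One way a checker might spot a page like this is to look at the page's drawing instructions. The Python sketch below is a rough heuristic under several assumptions: the page's content stream has already been decompressed into a string, the function name and sample streams are hypothetical, and the pattern matching is deliberately naive. Tj and TJ are the PDF operators that show text, and Do paints an external object such as an image. The sketch ignores the less common text-showing operators, and it could be fooled by operator-like letters inside a text string.

```python
import re

# Tj and TJ show text; Do paints an XObject (often, but not always, an image).
TEXT_SHOW = re.compile(r"(?<![A-Za-z])(?:Tj|TJ)(?![A-Za-z])")
PAINT_XOBJECT = re.compile(r"(?<![A-Za-z])Do(?![A-Za-z])")


def looks_like_scan(content_stream: str) -> bool:
    """Return True if the page paints an object but never shows any text."""
    paints_object = PAINT_XOBJECT.search(content_stream) is not None
    shows_text = TEXT_SHOW.search(content_stream) is not None
    return paints_object and not shows_text


# Hypothetical decoded content streams for three kinds of page.
scanned_page = "q 612 0 0 792 0 0 cm /Im0 Do Q"
text_page = "BT /F1 12 Tf 72 720 Td (The electric grid failed in 2019.) Tj ET"
# After OCR: the same image, plus text drawn in rendering mode 3 (invisible).
ocr_page = scanned_page + " BT 3 Tr /F1 12 Tf 72 720 Td (The electric grid failed in 2019.) Tj ET"

results = {
    "scanned": looks_like_scan(scanned_page),
    "text": looks_like_scan(text_page),
    "ocr": looks_like_scan(ocr_page),
}
```

Only the scanned page is flagged. The page that has been through OCR still paints the image, but it now also shows text, so it is no longer treated as a bare scan.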

Where machine checking stops for fonts and text

Most of this element is machine-checkable, which is unusual. Software can confirm that fonts are embedded, that each font has a ToUnicode mapping, and whether a given page has no text layer, which marks it as a likely scan that needs OCR. These are reliable, automatable checks.

What a machine cannot fully judge is the quality of what OCR produces. OCR can add a text layer, but it can also misread letters, especially on faint scans, unusual fonts, or non-Latin scripts. A machine can confirm that a text layer now exists. Only a person, or a careful comparison against the image, can confirm that the recovered text actually matches the words on the page. So the presence of fonts, mappings, and a text layer is machine-checkable; the correctness of OCR output still benefits from a human check. For more on this split, see the topic on what automated checking can and cannot find.
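
Where a person has typed out or verified the words on a page, the comparison can be partly mechanised. The sketch below uses Python's standard difflib module to score OCR output against such a reference and to list the differences. It is a sketch under assumptions: the function names are hypothetical, the misread sentence is invented for illustration, and a reference transcript has to come from a human in the first place, so this supports the human check rather than replacing it.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(ocr_text: str, reference: str) -> float:
    """Score from 0.0 to 1.0 for how closely the OCR text matches."""
    return SequenceMatcher(None, reference, ocr_text, autojunk=False).ratio()


def mismatches(ocr_text: str, reference: str) -> List[Tuple[str, str]]:
    """List (expected, found) pairs wherever the OCR text differs."""
    matcher = SequenceMatcher(None, reference, ocr_text, autojunk=False)
    return [
        (reference[i1:i2], ocr_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Reference verified by a person, and an invented OCR misreading of it.
reference = "The electric grid failed in 2019."
ocr_output = "The e1ectric grid fai1ed in 2O19."

score = similarity(ocr_output, reference)
differences = mismatches(ocr_output, reference)
```

Here the differences are the classic OCR confusions: a lowercase "l" read as the digit "1" twice, and a zero read as a capital "O". A text layer exists in both cases, which is why its presence alone says nothing about its accuracy.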

Reference detail

Standards mapping

Standard	Identifier or detail	What it covers here
Matterhorn Protocol 1.1	Checkpoint 31 Fonts	Fonts must be embedded
Matterhorn Protocol 1.1	Checkpoint 10 Character mappings	Each font needs a correct ToUnicode mapping so characters can be recovered
WCAG	1.4.5 Images of Text, Level AA	Use real text rather than a picture of text where possible; a scanned page is an image of text until OCR adds a text layer

WCAG relates to this element indirectly, through real, selectable text and through criterion 1.4.5 Images of Text. The font and mapping rules themselves are the PDF/UA side, expressed as Matterhorn checkpoints 31 and 10. For how these standards fit together, see the topic on PDF/UA and the Matterhorn Protocol, and the topic on WCAG and the four POUR principles.

What carries the text in the file

Element	Role
Embedded font	Carries the letter shapes inside the file, so text renders the same anywhere
ToUnicode mapping	Maps each shape to the character it represents, so text can be read aloud, copied, and searched
Text layer	The real, recoverable characters on a page; absent on a scan until OCR adds one

Common mistakes

Mistake	What happens to the reader	Fix
Fonts not embedded	Text may render with the wrong shapes or fail to display correctly on another machine	Embed the fonts in the file
Missing ToUnicode mapping	Text reads as gibberish or silence, and copy and search return nonsense	Ensure each font has a correct ToUnicode mapping
Scanned document with no OCR text layer	A screen reader user gets a blank, silent page	Run OCR to add a real text layer

Authoritative sources

PDF Association, "The Matterhorn Protocol 1.1", 2021. https://pdfa.org/resource/the-matterhorn-protocol/
W3C, "Understanding WCAG", 2024. https://www.w3.org/WAI/WCAG21/Understanding/
WebAIM, "WebAIM: Web Accessibility In Mind", 2024. https://webaim.org/