EquitableDocs Document Accessibility Guide

What a tagged PDF actually is

Overview

A PDF carries two layers, and the screen reader only sees one

Your PDF holds two layers of information at once.

The first layer is the visual layer. This is everything you see when you open the file: the words, the pictures, the columns, the page numbers, the lines and boxes. It is laid out for the eye.

The second layer is the structure tree, also called the tag tree. This is a separate, ordered list of the meaningful parts of your document, with each part labelled for what it is: this is a heading, this is a paragraph, this is a list, this is a table, this is a figure. A sighted reader never sees the tag tree. A screen reader, which is the software a reader with a print disability uses to hear or feel the page, follows the tag tree and nothing else.

When the tag tree is built well, the screen reader announces a real heading as a heading, reads a table row by row with its column names, and describes a figure with the words you wrote for it. When the tag tree is missing or thin, the screen reader gets very little. It may read the words as one long run with no headings to jump between, or it may read them in the wrong order, or it may say nothing at all for a scanned page that is only a picture of text. The visual layer can look perfect on screen and still carry almost nothing in the tag tree.

Tagged means in the tree, artifact means skipped

Two words describe how each piece of content sits in a tagged PDF. Content that carries meaning is tagged, which means it has a place in the tag tree. Content that is only decoration (a background colour, a divider line, a repeated logo) is marked as an artifact, which means it is left out of the tag tree and the screen reader skips it. Getting this split right is most of what makes a tagged PDF usable: every meaningful thing in the tree with the correct label, and every decorative thing kept out of the way.

In depth

The tag tree is an ordered, labelled list of your content

The tag tree is a structured list. It starts at the top of the document and runs to the end, and each entry names a type of content and holds the actual text or image for that piece. A heading entry holds the heading text and is labelled as a heading. A list entry holds the list and is labelled as a list, with each item inside it. A table entry holds the rows and cells, with the header cells marked so the screen reader can say "this column is Price" before it reads a number.

The screen reader walks this list from top to bottom. It does not look at where things sit on the page. It reads the tree in tree order. So the tag tree is the document the screen reader actually receives, and the visual page is a separate thing that happens to be made from the same content.
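
The walk described above can be sketched in a few lines of Python. This is a toy model only: the `Node` class and the `read_aloud` function are hypothetical names invented for illustration, not part of any PDF library, and a real screen reader does far more than this. The sketch shows one thing, that the output depends only on the order of entries in the tree.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a toy tag tree (hypothetical model, not a PDF library type)."""
    tag: str                                    # label such as "H1", "P", "L", "LI"
    text: str = ""                              # content held directly by this entry
    children: list[Node] = field(default_factory=list)


def read_aloud(node: Node) -> list[str]:
    """Walk the tree depth-first, top to bottom, ignoring page position."""
    lines = []
    if node.text:
        lines.append(f"{node.tag}: {node.text}")
    for child in node.children:
        lines.extend(read_aloud(child))
    return lines


doc = Node("Document", children=[
    Node("H1", "Price list"),
    Node("P", "Prices apply from January."),
    Node("L", children=[
        Node("LI", "Tea"),
        Node("LI", "Coffee"),
    ]),
])

for line in read_aloud(doc):
    print(line)
```

Nothing in `Node` records where an item sits on the page, which is the point: the walk has only the tree to go on.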

A library catalogue makes a useful comparison. The book on the shelf is the visual page: you find it by walking to the right spot. The catalogue card is the tag tree entry: it records what the book is and where it belongs in the order, in a form you can read without seeing the shelf. A screen reader user never walks the shelves. They read the cards.

Tagged content versus artifacts, and what goes wrong

Deciding what is meaningful and what is decoration is a real choice, and tools get it wrong in both directions.

Sometimes decoration ends up tagged. A thin horizontal line that just separates two sections has no meaning to read aloud, but if the software that made the PDF dropped it into the tag tree, the screen reader may announce a stray figure or an empty block in the middle of a sentence. The fix is to mark that line as an artifact so it is skipped.

Sometimes meaningful content ends up as an artifact, or never tagged at all. A page number you can ignore, but a caption under a chart carries information, and if it was marked as an artifact the screen reader user never hears it. Worse, on a scanned page that is only an image of text, none of the words are tagged, because there is no real text in the file at all, only a picture. A sighted reader sees a full page. A screen reader user gets silence until the page is run through optical character recognition, which is software that reads the picture and turns it into real, taggable text. The topic on what automated checking can and cannot find covers why a machine often cannot tell a meaningful graphic from a decorative one.
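
A minimal sketch of the tagged-versus-artifact split, assuming a simplified model in which each piece of page content carries a single `artifact` flag. The `Item` class and `heard_by_screen_reader` function are hypothetical and exist only to show the effect of a wrong flag. Here the caption has been marked as an artifact by mistake, so it drops out along with the divider line and the page number.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One piece of page content in a toy model (all names hypothetical)."""
    kind: str        # what the content is, e.g. "heading", "rule", "caption"
    text: str
    artifact: bool   # True means decoration: left out of the tag tree


def heard_by_screen_reader(page: list[Item]) -> list[str]:
    """Return only the content that is in the tree, i.e. not marked as an artifact."""
    return [item.text for item in page if not item.artifact]


page = [
    Item("heading", "Sales by region", artifact=False),
    Item("rule", "(horizontal line)", artifact=True),        # decoration, correctly skipped
    Item("caption", "Figure 1: Sales rose in the north.", artifact=True),  # mistake
    Item("page number", "12", artifact=True),                # safe to skip
]

print(heard_by_screen_reader(page))
```

Flipping the caption's flag to `False` puts it back in the output. Deciding which flags are right is the judgement the surrounding text describes.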

Role mapping lets a custom tag name still work

PDF has a fixed set of standard tag names that screen readers understand, names like Heading, Paragraph, List, Table, Figure. But the software that creates PDFs does not always use those exact names. A publishing tool might tag every chapter title with its own custom name, something like "ChapterTitle," because that is what the layout template called it. A screen reader does not know what "ChapterTitle" means.

Role mapping is the bridge. It is a small table inside the file that says "the custom tag ChapterTitle should be treated as the standard tag Heading." With that mapping in place, the screen reader sees ChapterTitle, looks it up, learns it means Heading, and announces a heading. Without the mapping, the custom tag is an unknown label, and the screen reader may treat its content as plain text with no heading behaviour. Role mapping is how a document with its own private tag names can still be read correctly, as long as every custom name points to a real standard type.
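
The lookup can be sketched as a dictionary resolution. This is an illustration under stated assumptions: `STANDARD_TYPES` is a deliberately partial set, `resolve` is a hypothetical helper, and the custom names `Pullquote` and `Callout` are invented. The loop follows a mapping that points to another custom name and gives up on a cycle or a dead end, which is one plausible way a reader of the file might handle those cases.

```python
from __future__ import annotations

# Partial set of standard structure types, enough for the sketch.
STANDARD_TYPES = {"H", "H1", "H2", "P", "L", "LI", "Table", "Figure"}


def resolve(tag: str, role_map: dict[str, str]) -> str | None:
    """Follow the role map until a standard type is reached.

    Returns None when the chain ends on an unmapped custom name or loops.
    """
    seen = set()
    while tag not in STANDARD_TYPES:
        if tag in seen or tag not in role_map:
            return None
        seen.add(tag)
        tag = role_map[tag]
    return tag


role_map = {
    "ChapterTitle": "H1",     # custom name mapped to a standard heading
    "Pullquote": "Callout",   # points at another custom name that is never mapped
}

print(resolve("ChapterTitle", role_map))
print(resolve("Pullquote", role_map))
print(resolve("P", role_map))
```

`ChapterTitle` resolves to `H1`, `Pullquote` resolves to nothing because its chain never reaches a standard type, and `P` is already standard so it needs no mapping.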

Reading order is tree order, not page order

The order a screen reader reads in is the order of the tag tree, and that order can differ from where things sit on the page.

Here is a concrete case. A page has a main article in the centre and a short sidebar note in a box across the top, above the article. Your eye reads the sidebar first because it is at the top. But the tag tree is built by the software that made the file, and that software may have written the main article into the tree first and the sidebar box last. So the screen reader reads the entire article, then reads the sidebar note at the very end, even though the sidebar sits visually at the top of the page. Nothing looks wrong on screen. The reading order is still wrong for the person listening.

This is why a document can have every part tagged and still read in a confusing order. The tags can all be correct types, and the tree can still be sequenced in a way no human would choose. A machine can confirm that a tag tree exists and that each item has a label. Only a person listening can confirm that the order makes sense, that the sidebar comes where a reader would expect it, that a caption follows its figure, that a footnote does not interrupt a sentence. Checking the order is human work.
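
The sidebar case can be modelled as two sort orders over the same blocks. The `Block` class and its fields are hypothetical, and sorting by distance from the top of the page is only a crude stand-in for how an eye scans a layout. A mismatch between the two lists is a hint for a person to listen to the page, not a verdict.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A region of content on one page (toy model, hypothetical fields)."""
    name: str
    top: float        # distance from the top of the page, arbitrary units
    tree_index: int   # position of this block in the tag tree


blocks = [
    Block("sidebar note", top=40.0, tree_index=1),    # visually first, last in the tree
    Block("main article", top=200.0, tree_index=0),
]

visual_order = [b.name for b in sorted(blocks, key=lambda b: b.top)]
tree_order = [b.name for b in sorted(blocks, key=lambda b: b.tree_index)]

print("Eye reads:          ", visual_order)
print("Screen reader reads:", tree_order)
if visual_order != tree_order:
    print("Orders differ: a person should listen to this page.")
```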

Where the machine check stops

A validator can open your file and confirm a number of things about the tag tree: that the tree exists, that items carry labels, that custom tags have a role mapping to a standard type, and that a figure has some alternative text attached. These are real checks and they catch real problems, such as an untagged document, a missing role map, or an image with no alt text at all.

What the machine cannot do is judge meaning. It cannot tell whether the heading tag is on a real heading or on a line of bold body text. It cannot tell whether a figure's alternative text actually describes the figure or just says "image." It cannot tell whether the tree order matches the order a reader expects. It cannot always tell whether that horizontal line is a meaningful graphic or decoration. So a clean machine result means the plumbing is present, not that the document reads well. The topic on what automated checking can and cannot find goes through this split in detail, and the topic on PDF/UA and the Matterhorn Protocol explains which standards draw the line where.
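
The limit can be made concrete with a toy checker. This is a sketch, not how any real validator works: `machine_check`, the dictionary layout of each element, and the partial `STANDARD` set are all assumptions made for illustration. The checker tests only for presence (a tree, a label, a mapping, some alt text), so the sample below passes even though its heading is really bold body text and its alt text says only "image".

```python
from __future__ import annotations

# Partial set of standard structure types, enough for the sketch.
STANDARD = {"Document", "H1", "P", "L", "LI", "Table", "Figure"}


def machine_check(elements: list[dict], role_map: dict[str, str]) -> list[str]:
    """Return the problems a purely mechanical presence check could report."""
    if not elements:
        return ["no tag tree"]
    problems = []
    for i, el in enumerate(elements):
        tag = el.get("tag", "")
        if not tag:
            problems.append(f"item {i}: no label")
        elif tag not in STANDARD and role_map.get(tag) not in STANDARD:
            problems.append(f"item {i}: custom tag {tag!r} has no standard mapping")
        if tag == "Figure" and not el.get("alt"):
            problems.append(f"item {i}: figure has no alternative text")
    return problems


elements = [
    {"tag": "H1", "text": "Please note"},        # actually bold body text, not a heading
    {"tag": "ChapterTitle", "text": "Chapter 1"},
    {"tag": "Figure", "alt": "image"},           # alt text present but useless
]

print(machine_check(elements, {"ChapterTitle": "H1"}))  # no problems reported
print(machine_check([{"tag": "Figure"}], {}))           # missing alt text is caught
```

The empty result for the first call is the "plumbing is present" outcome. Nothing in the function could tell that the heading or the alt text is wrong.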

Reference detail

The standards behind a tagged PDF

Four standards sit behind a tagged PDF, and each does a different job.

Standard | Identifier | Job
PDF 1.7 | ISO 32000-1:2008 | The PDF file format itself: how text, images, fonts, and structure are stored. It makes tagging possible but requires no accessibility features. [3]
PDF/UA-1 | ISO 14289-1:2014 | The accessibility profile for PDF: which parts of the format you must use, and how, so the file is usable. [2]
Matterhorn Protocol 1.1 | PDF Association application note, 2021-04 | Turns PDF/UA rules into testable failure conditions. Not an ISO standard. [1]
WCAG 2.0 | ISO/IEC 40500:2012 | Defines what the result must achieve for a reader. Broader than PDF. The current WCAG version is 2.2 (2023); the ISO mapping covers 2.0. [4]

PDF/UA conformance covers WCAG for the PDF's own page content. WCAG adds things PDF/UA does not test, colour contrast being the main one.

Standard structure element types

The tag tree is built from standard structure element types. Each tagged item carries one of these types, or a custom name that is role-mapped to one. The table below lists the kinds of content carried in the tree alongside the Matterhorn Protocol checkpoint that governs each.

Content carried in the tree | Matterhorn checkpoint that governs it
Appropriate, valid tags in a sound tree | 09 Appropriate tags
Custom tag names mapped to standard types | 02 Role mapping
Whether all real content is tagged at all | 01 Real content tagged
Graphics and figures | 13 Graphics
Headings | 14 Headings
Tables | 15 Tables
Lists | 16 Lists
Mathematical expressions | 17 Mathematical expressions
Page headers and footers | 18 Page headers and footers
Notes and references | 19 Notes and references

The PDF format defines a set of standard structure element types, the tags themselves. The common ones group as follows: grouping elements (Document, Part, Article shown as Art, Section shown as Sect, Division shown as Div); paragraphs and headings (Paragraph shown as P, a generic Heading shown as H, and the numbered headings H1 to H6); lists (List shown as L, List item shown as LI, Label shown as Lbl, List body shown as LBody); tables (Table, Table row shown as TR, Table header cell shown as TH, Table data cell shown as TD, with optional THead, TBody, TFoot, and Caption); and inline and illustration elements (Span, Link, Note, Reference, Figure, Formula, and Form). Any tag in a real document is either one of these standard types or a custom name that role mapping points back to one of them. [1]

How reading order is determined

Reading order is the order of items in the tag tree, read from top to bottom, not the order in which content appears on the page. A reader with a print disability hears the document in tree order. When the tree order and the visual order disagree, the screen reader follows the tree. Confirming that the tree order matches the order a reader expects is a human judgement check, not a machine check.

What the checking tools actually claim

Tool | What a pass means
veraPDF | Implements only the machine-verifiable subset by design. A pass means "no machine-detectable PDF/UA failures," not "accessible." [5]
PAC (axes4) | Runs the machine checks and gives a person tools for the human checks, including a screen-reader preview and a structure-tree view. [6]

Different tools implement different amounts even of the machine subset, so a pass from one tool is not the same claim as a pass from another, and neither equals conformance.

Authoritative sources


  1. PDF Association, "The Matterhorn Protocol 1.1", 2021. https://pdfa.org/resource/the-matterhorn-protocol/
  2. International Organization for Standardization, ISO 14289-1:2014 (PDF/UA-1), 2014.
  3. International Organization for Standardization, ISO 32000-1:2008 (PDF 1.7), 2008.
  4. W3C, "Web Content Accessibility Guidelines (WCAG)", 2023. https://www.w3.org/WAI/standards-guidelines/wcag/
  5. veraPDF Consortium, "veraPDF Documentation", 2015. https://docs.verapdf.org/
  6. axes4, "PDF Accessibility Checker (PAC)", 2024. https://pac.pdf-accessibility.org/
  7. PDF Association, "PDF/UA in a Nutshell", 2024. https://pdfa.org/resource/pdfua-in-a-nutshell/