What Makes Documentation AI-Ready: Media

In previous posts, I’ve explored how the structural principles, writing quality, and code-example pedagogy that serve human readers also serve AI tools. Those posts focused on the text-based content found in all technical documentation.

This post addresses media: images, videos, diagrams, and audio. These elements appear less frequently in API documentation than in other technical content, but they’re valuable for tutorials, demonstrations, system architecture explanations, and physical procedures.

The patterns that make media accessible to AI tools aren’t new requirements. They’re the same accessibility standards that have served vision-impaired readers, users who have difficulty hearing audio content, and people with cognitive disabilities for decades. Once again, the fundamentals endure.

What accessibility standards tell us about media

The Web Content Accessibility Guidelines (WCAG) have provided clear requirements for accessible media since the late 1990s. These requirements weren’t created for AI—they were created for people who can’t see images, can’t hear audio, or need alternative ways to process information. To address these cases, they recommend:

For images:

  • Provide text alternatives (alt text) that describe content and function
  • Include longer descriptions for complex images like diagrams
  • Don’t rely on color alone to convey information
  • Ensure text in images is also available as actual text
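In HTML, these requirements translate directly into markup. A minimal sketch, with hypothetical filenames and wording:

```html
<!-- Alt text describes both content and function -->
<img src="save-button.png" alt="Save button with disk icon">

<!-- For complex images such as diagrams, pair brief alt text
     with a longer description in a visible caption -->
<figure>
  <img src="architecture.png"
       alt="System architecture diagram of a three-tier application">
  <figcaption>
    The web server connects to the application server,
    which connects to the database.
  </figcaption>
</figure>
```

The caption carries the detail, so sighted readers, screen reader users, and AI tools all get the same information.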

For video:

  • Provide synchronized captions for deaf and hard-of-hearing users
  • Provide audio descriptions for blind and vision-impaired users
  • Provide transcripts that include both dialogue and important visual information
  • Identify speakers in multi-person videos
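In HTML, the video requirements look roughly like this (filenames are hypothetical; audio descriptions can also be delivered as a separately described version of the video):

```html
<video controls>
  <source src="setup-tutorial.mp4" type="video/mp4">
  <!-- Synchronized captions for deaf and hard-of-hearing viewers -->
  <track kind="captions" src="setup-tutorial.en.vtt"
         srclang="en" label="English">
</video>
<!-- A transcript covers dialogue plus important visual information -->
<p><a href="setup-tutorial-transcript.html">Read the full transcript</a></p>
```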

For audio:

  • Provide text transcripts
  • Identify speakers
  • Note non-speech sounds that convey meaning

These requirements are the result of decades of refinement based on how people with disabilities actually use content.

How AI tools process images

I asked Claude how it processes images in documentation. The patterns it described align exactly with accessibility requirements.

Claude can process images, but only when they’re provided directly—screenshots, diagrams, photos, charts. Claude can’t extract images from web pages: when it encounters <img src="diagram.png" alt="System architecture"> in a page, it sees only the alt text (“System architecture”), not the actual image. Screen reader users face the same limitation. In both cases, alt text becomes the primary interface to the images on the page.

In our conversation, Claude identified these patterns for processing images. Each one aligns with accessibility requirements. I’ve added comments to Claude’s responses in italics for additional clarification and context.

What helps Claude process images

These aspects help Claude process documents with images and help people using screen readers.

Screenshots with readable text

  • UI elements, error messages, code snippets
  • Text must be clear enough to read/OCR

Diagrams with text labels

  • Architecture diagrams, flowcharts, system maps
  • Labels identify components and show relationships
  • Arrows and connections have descriptive text

Charts and graphs with data labels

  • Axis labels, data points, legends
  • Text alternatives provide the data the chart represents

Annotated images

  • Arrows, callouts, highlights pointing to specific elements
  • Text explains what’s being indicated

What creates barriers

These situations complicate Claude’s image processing and make it hard for people using screen readers.

Low resolution or heavy compression

  • Text becomes unreadable for OCR
  • Details blur, making description difficult

Color-coded information without labels

  • “The red line shows X” with no legend
  • Relies on color perception alone

Images without context

  • Screenshot with no surrounding explanation
  • Diagram with unlabeled components
  • Reference to “upper right” without clear landmark

Text embedded as images

  • Code as screenshots instead of actual text
  • Can’t be copied, searched, or reliably read
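The last barrier has a direct fix: publish code as text rather than as a screenshot. A sketch in HTML (the filename and command are illustrative):

```html
<!-- Barrier: code shipped as a screenshot can't be copied,
     searched, or reliably read -->
<img src="curl-example.png" alt="Code example">

<!-- Fix: the same code as actual text -->
<pre><code>curl -H "Authorization: Bearer $TOKEN" \
  https://api.example.com/v1/users</code></pre>
```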

These barriers affect vision-impaired users and AI tools equally. The solutions are the same for both audiences: provide text alternatives, use clear labels, and add context.

Writing effective alt text

Write alt text as though it’s the only way your reader will learn the information in the image. One way to do this is to describe the information in the document content that precedes the image, using the image as reinforcement. Another is to put the description in the image’s alt attribute.

WCAG provides clear guidance on alt text

The WCAG guidelines for effective alt text are summarized here.

Describe content and function

  • Not: “Image of a button”
  • Better: “Save button with disk icon”

Be concise but complete

  • Not: “Screenshot”
  • Better: “Error message: ‘Connection timeout after 30 seconds. Check network settings.’”

Provide context

  • Not: “Diagram”
  • Better: “System architecture diagram showing three-tier application: web server connects to application server, which connects to database”

For complex images, use longer descriptions [in the text]

Alt text gives an overview; a longer description (in a caption or linked document) provides the details.

Don’t be redundant

If surrounding text says, “The following diagram shows the authentication flow,” alt text doesn’t need to repeat “diagram”
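Applying this guidance together, a sketch of the authentication-flow case (markup and wording are illustrative):

```html
<p>The following diagram shows the authentication flow.</p>
<!-- Surrounding text already says "diagram", so the alt text skips
     that word and goes straight to content and function -->
<img src="auth-flow.png"
     alt="Client sends credentials to the login service, which returns
          a token; the client includes the token on every later request">
```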

How AI tools process video

It doesn’t (without your help).

AI tools can’t process video directly, so they have no access to its content: no frames, no audio, no visual information from video files.

This is identical to how blind and vision-impaired users experience video: they rely entirely on audio descriptions and transcripts.

Claude identified what makes video content accessible. As with images, each requirement aligns with accessibility standards:

What makes video accessible to all audiences

Claude described the following aspects of video and audio transcripts that make them easier to process. Visually impaired audiences also benefit from these recommendations.

Human-edited transcripts

  • Proper punctuation and paragraph breaks
  • Logical topic sections
  • Technical terms spelled correctly (not auto-transcript errors)

Speaker identification

  • “Sarah: [text]” vs unmarked dialogue
  • Critical for interviews and panel discussions

Timestamps with section headers

  • “00:45 – Introduction to the problem”
  • “03:20 – Proposed solution”
  • Allows navigation to specific information

Visual content described explicitly

  • Not: “As you can see here…”
  • Better: “The diagram shows three components: authentication layer, API gateway, and database. The auth layer connects to the gateway via OAuth…”

This is valuable information if you’re writing video scripts. Describe things as though the viewer can’t see the images in the video.

Non-verbal content noted

  • “[demonstrates on screen]”
  • “[shows code example]”
  • “[points to diagram]”

Key resources linked

  • Code repositories mentioned
  • Slides or presentation materials
  • Related documentation
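Put together, these practices yield a transcript excerpt like the following (the speakers, timestamps, and content are invented for illustration):

```text
00:45 – Introduction to the problem

Sarah: Our deployment pipeline was failing intermittently. [shows error
log on screen] The log reports "connection refused" from the build agent.

03:20 – Proposed solution

Miguel: The diagram shows three components: the build agent, the artifact
store, and the deploy service. The agent pushes artifacts to the store,
and the deploy service pulls from it. [points to diagram]

Resources: the code repository and slides are linked below the video.
```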

What creates barriers

Claude described these aspects that make transcripts more confusing and harder to process.

Wall of text with no structure

  • No paragraph breaks
  • All one run-on without topic divisions

Missing visual context

  • “This part right here does X” (which part?)
  • “Look at how this changes” (describes screen without narrating)
  • References to “above” or “below” without description

Auto-generated transcripts without editing

  • Technical terms mangled (“Kubernetes” becomes “cooper nettys”)
  • No punctuation or paragraph structure
  • Speaker changes unmarked

Ambiguous references

  • “It connects to this, then it processes that”
  • Worse than in ordinary written text, because transcript readers can’t see the on-screen referents

Why video transcripts matter

Transcripts serve multiple audiences beyond accessibility compliance:

  • Deaf and hard-of-hearing users – Can’t access audio information
  • People in sound-sensitive environments – Libraries, open offices, public transit
  • Non-native speakers – Reading is often easier than listening
  • People who prefer reading – Faster to scan text than watch video
  • Search engines – Can’t index video content without transcripts
  • AI tools – Can’t process video without text alternatives

A quality transcript makes video content accessible to all these audiences simultaneously.

The AI perspective confirms established practice

When I asked Claude what it needs from images and video, every requirement it identified matched WCAG accessibility standards.

AI tools don’t introduce new requirements for media. They validate requirements that accessibility advocates have documented and refined over the past thirty years. The surprise isn’t that AI needs text alternatives—it’s that we’ve had clear standards for providing them since 1999, yet many organizations still treat accessibility as optional.

Bringing it together: Media for all readers

The previous articles in this series showed that AI tools need the same structural principles, writing quality, and pedagogical approaches that serve human readers. Media follows the same pattern.

The fundamentals persist. Whether you’re serving a screen reader user, a deaf student, someone in a quiet library, or an AI tool processing your documentation, the same principles apply: provide text alternatives, describe visual content explicitly, structure information clearly.

Meeting accessibility standards serves all your audiences—human and AI alike.

Further Reading

For readers interested in learning more about media accessibility:

Web Accessibility Standards:

W3C Web Content Accessibility Guidelines (WCAG) – Current standards for accessible web content, including detailed requirements for images, video, and audio.

WCAG 2.2 Quick Reference – Filterable reference guide to all WCAG requirements with techniques and examples.

Accessibility Resources:

WebAIM: Alternative Text – Comprehensive guide to writing effective alt text.

WebAIM: Captions, Transcripts, and Audio Descriptions – Detailed guidance on making video and audio accessible.

Universal Design:

The Center for Universal Design – Principles and applications of universal design that benefit all users.

Additional Resources:

For a comprehensive bibliography tracing web accessibility principles from the 1980s through current practice, see Bibliography of Web Design Principles.
