The Audio Layer

Type: Reference
Created:
Team: Platform

draft

Why narration is a separate file

Text-to-speech reading raw markdown is unlistenable — it speaks the punctuation, the code, the syntax, and the heading hashes. Every doc that wants audio gets a hand-written .audio.txt companion. The narration adapts the doc for the ear: paraphrasing diagrams, smoothing acronyms, structuring breath. No file → no audio reader on the page.

File shape

Path mirrors the markdown URL. docs/foo/bar.md → docusaurus/static/audio/foo/bar.audio.txt.
Open with one or two sentences before the first ##. This becomes a synthetic "Introduction" section that the reader speaks under the page title. Without it, the listener jumps straight to the first H2 and never hears the doc title. Don't restate the title (the reader speaks it for you at higher pitch); use the intro to add real context: what the doc covers, who it's for.
## Title → H2 section. ### Title → H3 section. Don't go deeper.
Section titles must mirror the markdown headings exactly — that's how the per-heading play triggers map to sections.
Don't echo a heading in its first sentence. The reader speaks each heading for you at higher pitch.
Skip frontmatter, page-level metadata, and visual-only callouts.
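
A minimal sketch of the file shape above, assuming a Python helper (the real reader and tooling may be implemented differently). `audio_path_for` and `parse_sections` are hypothetical names. The sketch maps a doc path to its companion path and splits a narration file into the synthetic intro plus H2/H3 sections.

```python
import re
from pathlib import PurePosixPath

# Taken from the example path above; adjust if the repo layout differs.
AUDIO_ROOT = PurePosixPath("docusaurus/static/audio")

# Only ## and ### open a section; deeper headings are treated as body text.
_HEADING = re.compile(r"^(#{2,3})\s+(.*\S)\s*$")


def audio_path_for(doc_path: str) -> str:
    """Map docs/foo/bar.md to docusaurus/static/audio/foo/bar.audio.txt."""
    rel = PurePosixPath(doc_path).relative_to("docs")
    return str(AUDIO_ROOT / rel.with_name(rel.stem + ".audio.txt"))


def parse_sections(text: str) -> list[tuple[int, str, str]]:
    """Return (level, title, body) tuples in file order.

    Level 1 is the synthetic "Introduction" built from the text before
    the first ##. It is dropped when that text is empty.
    """
    sections = []
    level, title, body = 1, "Introduction", []
    for line in text.splitlines():
        match = _HEADING.match(line)
        if match:
            sections.append((level, title, "\n".join(body).strip()))
            level, title, body = len(match.group(1)), match.group(2), []
        else:
            body.append(line)
    sections.append((level, title, "\n".join(body).strip()))
    return [s for s in sections if s[0] != 1 or s[2]]
```

Calling `parse_sections` on a file with no opening sentences returns no level 1 entry, which is the case where the listener lands directly on the first H2.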

Sentence rules

A period (or ! or ?) ends an utterance. The TTS engine plays each utterance independently, which gives prev/next finer granularity and dodges Chrome's silent-cutoff bug on long sentences.
15–25 words per sentence. Break run-ons.
Convert mid-sentence colons and semicolons to periods: "Two things: A and B" → "Two things. First, A. Second, B."
Em-dashes stay inside an utterance.
Aim for 30 seconds to 2 minutes per section. The title sentence is the natural breath — don't try to add pauses with ... (SpeechSynthesis ignores them).
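
A rough lint for the sentence rules above, as a sketch only. The function names are hypothetical, and the 150 words-per-minute speaking rate is an assumption, not a measured value for the on-page reader.

```python
import re

# Whitespace that follows a period, "!" or "?" separates two utterances.
# Colons, semicolons and dashes do not split, matching the rules above.
_UTTERANCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def split_utterances(text: str) -> list[str]:
    """Split narration text into the utterances the reader would play."""
    parts = _UTTERANCE_BREAK.split(text.strip())
    return [p.strip() for p in parts if p.strip()]


def long_utterances(text: str, max_words: int = 25) -> list[str]:
    """Return utterances that exceed the word budget and need breaking up."""
    return [u for u in split_utterances(text) if len(u.split()) > max_words]


def estimated_seconds(text: str, words_per_minute: int = 150) -> float:
    """Estimate spoken length of a section (assumed rate, see lead-in)."""
    return len(text.split()) * 60 / words_per_minute
```

A section whose estimate falls outside roughly 30 to 120 seconds is a candidate for merging or splitting.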

Speaking the unspeakable

Source	Write as
Times — `09:00`, `5:30 PM`	`9 A M`, `5 30 P M`
Numbered steps — `Step 1`	`Step one`
Acronyms not pronounced as words	Spell with spaces: `U R L`, `A P I`, `I D`
Code identifiers in prose — `MANUAL_ENTRY_REQUEST`	Natural names: `Manual Entry Request`. Keep literal flag names when discussing the flag itself: `isCompleted flips true`.
Latin abbreviations	`e.g.` → `for example`, `i.e.` → `that is`, `etc.` → `and so on`
Markdown syntax — `**bold**`, `_italic_`, backticks, link URLs	Drop entirely (keep link text)
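
The table can be approximated with a few regex passes. This is a first-pass sketch under stated assumptions, not the generator's actual logic: `narrate` and its helpers are hypothetical names, the acronym list holds only the three examples above, and times without AM or PM are assumed to be 24-hour. It cannot tell when a literal flag name should be kept, so its output still needs a human read.

```python
import re

LATIN = {"e.g.": "for example", "i.e.": "that is", "etc.": "and so on"}
SPELLED = ("URL", "API", "ID")  # acronyms read letter by letter
NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]

_LINK = re.compile(r"\[([^\]]+)\]\([^)]+\)")      # [text](url)
_BOLD = re.compile(r"\*\*(.+?)\*\*")
_ITALIC = re.compile(r"(?<!\w)_(.+?)_(?!\w)")     # skips snake_case words
_CONSTANT = re.compile(r"\b[A-Z]+(?:_[A-Z]+)+\b")  # MANUAL_ENTRY_REQUEST
_TIME = re.compile(r"\b(\d{1,2}):(\d{2})(?:\s*(AM|PM))?\b")
_STEP = re.compile(r"\bStep (\d+)\b")
_ACRONYM = re.compile(r"\b(?:" + "|".join(SPELLED) + r")\b")


def _speak_time(match: re.Match) -> str:
    hour, minute, meridiem = int(match.group(1)), match.group(2), match.group(3)
    if meridiem is None:  # assumed 24-hour source
        meridiem = "AM" if hour < 12 else "PM"
        hour = hour % 12 or 12
    parts = [str(hour)]
    if minute != "00":
        parts.append(minute)
    parts.append(" ".join(meridiem))  # "AM" becomes "A M"
    return " ".join(parts)


def _speak_step(match: re.Match) -> str:
    number = int(match.group(1))
    if number < len(NUMBER_WORDS):
        return "Step " + NUMBER_WORDS[number]
    return match.group(0)  # leave larger numbers for a human to word


def narrate(text: str) -> str:
    """Apply the table's rewrites to one paragraph of prose."""
    text = _LINK.sub(r"\1", text)   # keep link text, drop the URL
    text = _BOLD.sub(r"\1", text)
    text = _ITALIC.sub(r"\1", text)
    text = text.replace("`", "")
    text = _CONSTANT.sub(lambda m: m.group(0).replace("_", " ").title(), text)
    text = _TIME.sub(_speak_time, text)
    text = _STEP.sub(_speak_step, text)
    for short, spoken in LATIN.items():
        # A sentence-final "etc." loses its period here; check by hand.
        text = text.replace(short, spoken)
    return _ACRONYM.sub(lambda m: " ".join(m.group(0)), text)
```

The order matters: markdown is stripped first so identifiers and acronyms inside backticks are still rewritten, and acronyms are spelled last so nothing later re-matches the spaced letters.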

Embedded content

Mermaid diagrams — describe in third person: open with the diagram type, then walk its structure. "Here is a state machine with three states. The flow begins at…"
Code blocks — don't read line-by-line. One or two sentences on what the snippet does.
Numbered lists — "Step one. Step two." or "First. Second. Third."
Bulleted lists — preface with the count if it's known: "Three things to know. First, …". Otherwise just sentences, blank-line separated.
Small tables — paraphrase the contrast in prose. Larger tables — summarise and point to the page: "See the table on the page for the full breakdown."

Cross-references

Don't skip "See also" — give each linked item one short sentence so the listener knows what's there without navigating. In body text, use "See the X chapter" / "covered in the Y section" — drop URLs.

Pronunciations

Some product, team, or feature names get mispronounced. Register phonetic spellings here so authors and the AI generator stay consistent.

Term	Spell as	Notes
Skapp	(TBD — confirm with team)	Product name

When adding an entry: pick a phonetic spelling, run a 5-second test on the on-page reader to confirm it sounds right, then commit.

Validation

The audio file's sections must line up exactly with the markdown's headings. Every ## or ### in the markdown should have a matching one in the audio file, and vice versa. CI fails the PR when they don't match.

Two ways things break:

A heading exists in the markdown but not the audio file. Clicking the heading's play button would do nothing.
A marker exists in the audio file but not the markdown. The reader would announce a section that's gone from the page.

The check is forgiving on capitals and punctuation — ## Foo: Bar and ## foo bar count as the same.
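
A sketch of how such a check could work, not the CI job's actual code. `heading_keys` and `check_headings` are hypothetical names. Comparing heading level as well as title is an assumption, and the sketch does not skip `##` lines that sit inside fenced code blocks.

```python
import re
import string

# ## or ### at the start of a line, then the title on the same line.
_HEADING = re.compile(r"^(#{2,3})[ \t]+(.*\S)[ \t]*$", re.MULTILINE)
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def heading_keys(text: str) -> set[tuple[int, str]]:
    """Collect (level, normalised title) for every H2 and H3."""
    keys = set()
    for hashes, title in _HEADING.findall(text):
        # Lowercase, drop punctuation, collapse whitespace.
        words = title.lower().translate(_STRIP_PUNCT).split()
        keys.add((len(hashes), " ".join(words)))
    return keys


def check_headings(markdown: str, audio: str) -> dict[str, list[tuple[int, str]]]:
    """Report headings present on one side only; both lists empty means a pass."""
    md_keys, audio_keys = heading_keys(markdown), heading_keys(audio)
    return {
        "missing_from_audio": sorted(md_keys - audio_keys),
        "missing_from_markdown": sorted(audio_keys - md_keys),
    }
```

The two result lists correspond to the two failure modes above: a play button that does nothing, and a spoken section that is no longer on the page.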
