feat: add Text to Speech endpoints with timestamp alignment and documentation #36
base: main
Conversation
Walkthrough

Introduces a new TTS endpoint with word-level timestamp alignment capabilities. Adds POST /v1/tts/with_timestamp and its streaming variant to the OpenAPI specification with new schemas, updated parameter descriptions, and corresponding documentation and navigation entries.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Pre-merge checks: 3 passed
Actionable comments posted: 0
🧹 Nitpick comments (3)
api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx (1)
44-49: Consider adding a cross-reference to the streaming endpoint.

The use cases are well-documented. For completeness, consider adding a brief note or link to the streaming variant (`/v1/tts/with_timestamp/stream`) for users who need real-time progressive alignment. As per coding guidelines, use a relative path for the internal link.

Suggested addition:

```diff
 ## Use Cases

 - **Subtitle generation**: Automatically create synchronized subtitles for video content
 - **Karaoke-style highlighting**: Highlight words as they are spoken
 - **Accessibility features**: Provide visual indicators synchronized with audio playback
 - **Audio editing**: Precisely locate and edit specific words in generated speech
+
+For real-time streaming with progressive alignment, see [Streaming Text to Speech with Timestamp](./tts-with-timestamp-stream).
```

api-reference/openapi.json (2)
755-866: Consider documenting the audio format for `audio_base64`.

The `/v1/tts/with_timestamp` endpoint returns `audio_base64` but doesn't specify the audio format (WAV, MP3, etc.) in the request parameters or response schema. Unlike the main `/v1/tts` endpoint, there's no `format` parameter. Consider adding a note in the description or schema to clarify what format the base64-encoded audio uses by default.

Suggested description enhancement:

```diff
 "/v1/tts/with_timestamp": {
   "post": {
     "summary": "Text to Speech with Timestamp Alignment",
-    "description": "Generates speech from text and returns word-level timestamp alignment. The complete audio is generated first, then aligned with the text to provide precise timing information for each segment.",
+    "description": "Generates speech from text and returns word-level timestamp alignment. The complete audio is generated first, then aligned with the text to provide precise timing information for each segment. Audio is returned as base64-encoded WAV format.",
```
359-392: Consider aligning descriptions between JSON and MessagePack schemas.

The `application/msgpack` schema descriptions for the new fields (`max_new_tokens`, `repetition_penalty`, `min_chunk_length`, etc.) are shorter and less detailed than their `application/json` counterparts. For consistency and to avoid developer confusion, consider using the same detailed descriptions across both content types.

For example, `max_new_tokens`:

- JSON: "Maximum number of audio tokens to generate per text chunk. Higher values allow generating longer audio segments per chunk..."
- msgpack: "Maximum number of tokens to generate per chunk..."
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx
api-reference/openapi.json
docs.json
🧰 Additional context used
📓 Path-based instructions (1)
**/*.mdx
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.mdx: Document just enough for user success - not too much, not too little
Prioritize accuracy and usability of information
Make content evergreen when possible
Search for existing information before adding new content. Avoid duplication unless it is done for a strategic reason
Check existing patterns for consistency
Include title in YAML frontmatter: Clear, descriptive page title
Include description in YAML frontmatter: Concise summary for SEO/navigation
Use second-person voice ("you") in documentation content
Include prerequisites at the start of procedural content
Test all code examples before publishing
Match style and formatting of existing pages
Include both basic and advanced use cases in documentation
Include language tags on all code blocks
Include alt text on all images
Use relative paths for internal links
Never use emojis in professional content
Do not skip frontmatter on any MDX file
Do not use absolute URLs for internal links
Do not include untested code examples
Files:
api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Agent
🔇 Additional comments (5)
docs.json (1)
170-171: LGTM!

The navigation entries for the new timestamp endpoints are correctly placed within the TTS & ASR (v1) group and follow the existing path naming conventions.
api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx (1)
1-7: LGTM!

The frontmatter correctly includes the required `title` and `description` fields, and the `openapi` reference links to the correct endpoint path. The icon choice is appropriate for timestamp functionality.

api-reference/openapi.json (3)
180-213: LGTM! Well-documented new parameters.

The new TTS parameters (`max_new_tokens`, `repetition_penalty`, `min_chunk_length`, `condition_on_previous_chunks`, `early_stop_threshold`) are well-documented with sensible defaults and appropriate constraints. The descriptions clearly explain the purpose and trade-offs of each parameter.
987-1104: LGTM! Streaming endpoint is well-defined.

The streaming variant correctly:

- Limits the `model` parameter to `s1` only (the only model supporting streaming alignment)
- Clearly describes the SSE format in the description
- Uses the `text/event-stream` content type
- Documents the JSON structure within each SSE event
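As a rough sketch of how a client might consume such an SSE stream: the parser below assumes each event's payload arrives in `data:` lines containing a JSON object with `audio_base64` and `alignment` fields (field names taken from the `TTSWithTimestampResponse` schema; the exact per-event shape isn't reproduced in this review, so treat this as illustrative only).

```python
import base64
import json

def parse_sse_events(raw: str):
    """Parse a text/event-stream body into a list of JSON payloads.

    Assumes each event carries its JSON in `data:` lines and is
    terminated by a blank line, per the SSE wire format.
    """
    events = []
    data_lines = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            # Blank line ends the event; multi-line data is joined.
            events.append(json.loads("\n".join(data_lines)))
            data_lines = []
    return events

# Fabricated stream body for illustration; real audio would be much larger.
raw = (
    'data: {"audio_base64": "' + base64.b64encode(b"chunk1").decode() + '", '
    '"alignment": [{"text": "Hello", "start": 0.0, "end": 0.4}]}\n'
    "\n"
    'data: {"audio_base64": "' + base64.b64encode(b"chunk2").decode() + '", '
    '"alignment": [{"text": "world", "start": 0.4, "end": 0.8}]}\n'
    "\n"
)

for event in parse_sse_events(raw):
    audio = base64.b64decode(event["audio_base64"])
    words = [seg["text"] for seg in event["alignment"]]
    print(len(audio), words)
```

A production client would read the stream incrementally from the HTTP response rather than buffering the whole body, but the event-splitting logic is the same.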
3410-3478: LGTM! Schemas are well-structured.

The `TimestampSegment` and `TTSWithTimestampResponse` schemas are correctly defined:

- `TimestampSegment` requires all fields (`text`, `start`, `end`) as expected for alignment data
- `TTSWithTimestampResponse` correctly marks only `audio_base64` as required, allowing `text` and `alignment` to be nullable for streaming use cases
- Descriptions are clear and consistent with the endpoint documentation
Pull request overview
This pull request adds two new Text-to-Speech endpoints with timestamp alignment capabilities and comprehensively improves documentation across existing TTS API parameters. The changes enable developers to generate speech with word-level timing information, useful for subtitles, karaoke-style highlighting, and accessibility features.
- Adds two new TTS endpoints with timestamp alignment (non-streaming and streaming variants)
- Updates documentation for existing TTS parameters with more detailed, user-friendly descriptions
- Changes default values for temperature, top_p, chunk_length, and opus_bitrate parameters
- Adds new parameters: max_new_tokens, repetition_penalty, min_chunk_length, condition_on_previous_chunks, and early_stop_threshold
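As an illustration of the subtitle use case mentioned above, a small sketch that converts `TimestampSegment`-shaped alignment data into SRT subtitles. The segment values are made up for the example.

```python
def to_srt(segments):
    """Render alignment segments (text, start, end in seconds) as SRT."""
    def ts(seconds):
        # SRT timestamps use HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text']}\n")
    return "\n".join(lines)

# Hypothetical alignment data matching the TimestampSegment schema.
segments = [
    {"text": "Hello", "start": 0.0, "end": 0.42},
    {"text": "world", "start": 0.42, "end": 0.80},
]
print(to_srt(segments))
```

A real subtitle pipeline would typically merge word-level segments into phrase-length cues before emitting SRT, but the timestamp plumbing is the same.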
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs.json | Adds navigation entries for the two new timestamp-enabled TTS endpoints |
| api-reference/openapi.json | Defines two new TTS endpoints with timestamp alignment, updates parameter descriptions and defaults for existing endpoints, and adds new schema definitions (TimestampSegment, TTSWithTimestampResponse) |
| api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx | Provides documentation for the non-streaming timestamp endpoint with examples and use cases |
```json
"pages": [
  "api-reference/endpoint/openapi-v1/text-to-speech",
  "api-reference/endpoint/openapi-v1/tts-with-timestamp",
  "api-reference/endpoint/openapi-v1/tts-with-timestamp-stream",
```
Copilot AI (Jan 6, 2026):
The docs.json references a documentation file 'api-reference/endpoint/openapi-v1/tts-with-timestamp-stream' that is missing from this PR. While the OpenAPI specification defines the streaming endpoint at line 987-1224, the corresponding MDX documentation file needs to be created to match the pattern used for other endpoints.
```diff
 "chunk_length": {
-  "default": 200,
+  "default": 300,
   "description": "Chunk length to be used for the speech",
```
Copilot AI (Jan 6, 2026):
The description for 'chunk_length' in the streaming endpoint is not updated with the detailed explanation. For consistency with the main TTS endpoint, consider updating this to the more comprehensive description: "Controls the size of text segments for processing. The text is split into chunks before synthesis. Higher values produce longer continuous speech segments but increase memory usage and latency. Lower values reduce latency but may affect naturalness at chunk boundaries."
Suggested change:

```diff
-"description": "Chunk length to be used for the speech",
+"description": "Controls the size of text segments for processing. The text is split into chunks before synthesis. Higher values produce longer continuous speech segments but increase memory usage and latency. Lower values reduce latency but may affect naturalness at chunk boundaries.",
```
```json
},
"min_chunk_length": {
  "default": 50,
  "description": "Minimum chunk length for text splitting",
```
Copilot AI (Jan 6, 2026):
The description for 'min_chunk_length' in the streaming endpoint is too brief and lacks the detail provided in the main TTS endpoint. For consistency and better developer understanding, consider updating this to match the more comprehensive description from line 194: "Minimum number of characters required before splitting text into a new chunk. Prevents creation of very short audio segments that might sound unnatural. Higher values ensure longer continuous segments but may increase latency for the first audio. Works in conjunction with chunk_length."
Suggested change:

```diff
-"description": "Minimum chunk length for text splitting",
+"description": "Minimum number of characters required before splitting text into a new chunk. Prevents creation of very short audio segments that might sound unnatural. Higher values ensure longer continuous segments but may increase latency for the first audio. Works in conjunction with chunk_length.",
```
```json
},
"condition_on_previous_chunks": {
  "default": true,
  "description": "Whether to condition generation on previous chunks. If references are provided, uses references only; otherwise uses first chunk as reference",
```
Copilot AI (Jan 6, 2026):
The description for 'condition_on_previous_chunks' in the streaming endpoint is inconsistent with the main TTS endpoint and potentially confusing. The main endpoint (line 202) provides a clearer explanation: "Enables voice consistency across multiple text chunks. When true, uses the audio from previous chunks as context for generating subsequent chunks, maintaining consistent voice characteristics throughout long texts. When false, each chunk is generated independently. Disable only if you want deliberately varied intonation between segments." Consider updating for consistency and clarity.
Suggested change:

```diff
-"description": "Whether to condition generation on previous chunks. If references are provided, uses references only; otherwise uses first chunk as reference",
+"description": "Enables voice consistency across multiple text chunks. When true, uses the audio from previous chunks as context for generating subsequent chunks, maintaining consistent voice characteristics throughout long texts. When false, each chunk is generated independently. Disable only if you want deliberately varied intonation between segments.",
```
```json
},
"early_stop_threshold": {
  "default": 1.0,
  "description": "Early stop if this fraction of samples are finished",
```
Copilot AI (Jan 6, 2026):
The description for 'early_stop_threshold' in the streaming endpoint is too terse. For consistency with the main TTS endpoint (line 208), consider updating to: "Controls when to stop audio generation early during batch processing. Value of 1.0 means wait for all samples to complete (best quality). Lower values (e.g., 0.8) stop when 80% of samples are done, potentially reducing latency but may cut off audio prematurely. Most users should leave this at 1.0."
Suggested change:

```diff
-"description": "Early stop if this fraction of samples are finished",
+"description": "Controls when to stop audio generation early during batch processing. Value of 1.0 means wait for all samples to complete (best quality). Lower values (e.g., 0.8) stop when 80% of samples are done, potentially reducing latency but may cut off audio prematurely. Most users should leave this at 1.0.",
```
Summary by CodeRabbit
Release Notes
New Features
- New Text to Speech with timestamp alignment endpoints, in standard (POST /v1/tts/with_timestamp) and streaming (POST /v1/tts/with_timestamp/stream) variants.

Documentation

- Added documentation and navigation entries for the new endpoints.