Conversation

@twangodev twangodev commented Jan 6, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Text-to-Speech endpoints with word-level timestamp alignment: standard (POST /v1/tts/with_timestamp) and streaming (POST /v1/tts/with_timestamp/stream) variants.
    • API responses now include precise timing data for each word segment, enabling subtitle generation, karaoke, and accessibility features.
  • Documentation

    • Added comprehensive endpoint documentation for new timestamp-aligned TTS capabilities.
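
The alignment data described above can be consumed directly; a minimal sketch, where the field names (audio_base64, text, alignment with per-segment text/start/end) follow the schemas added in this PR and the sample values are invented:

```python
# Sketch of consuming a /v1/tts/with_timestamp response. Field names follow
# the TimestampSegment / TTSWithTimestampResponse schemas added in this PR;
# the sample values below are invented for illustration.
sample_response = {
    "audio_base64": "UklGRg==",  # placeholder bytes, not real audio
    "text": "hello world",
    "alignment": [
        {"text": "hello", "start": 0.0, "end": 0.42},
        {"text": "world", "start": 0.5, "end": 0.95},
    ],
}

def word_timings(response: dict) -> list[tuple[str, float, float]]:
    """Return (word, start, end) tuples; alignment may be null when streaming."""
    return [(s["text"], s["start"], s["end"]) for s in response.get("alignment") or []]

print(word_timings(sample_response))
```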


Copilot AI review requested due to automatic review settings January 6, 2026 21:23
coderabbitai bot commented Jan 6, 2026

📝 Walkthrough

Walkthrough

Introduces a new TTS endpoint with word-level timestamp alignment capabilities. Adds POST /v1/tts/with_timestamp and its streaming variant to the OpenAPI specification with new schemas, updated parameter descriptions, and corresponding documentation and navigation entries.

Changes

Documentation (api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx)
  New MDX documentation file for TTS with Timestamp endpoint; describes POST /v1/tts/with_timestamp functionality, response schema (audio_base64, text, alignment segments), and use cases (subtitles, karaoke, accessibility, audio editing)

OpenAPI Schema (api-reference/openapi.json)
  New endpoints: POST /v1/tts/with_timestamp and POST /v1/tts/with_timestamp/stream with full operation definitions; new schemas: TimestampSegment (text, start, end) and TTSWithTimestampResponse (audio_base64, text, alignment); updated existing schemas (ProsodyControl, ReferenceAudio) with expanded descriptions; revised default values for temperature, top_p, chunk_length; added new parameters (max_new_tokens, repetition_penalty, min_chunk_length, condition_on_previous_chunks, early_stop_threshold) across TTS endpoints with detailed descriptions

Navigation (docs.json)
  Added two new navigation entries under TTS & ASR (v1) group: tts-with-timestamp and tts-with-timestamp-stream

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Timestamps now align with words so fine,
Each syllable marked with start and end time,
Karaoke dancers, subtitles bright—
Our fuzzy friends made timing right!

Pre-merge checks

✅ Passed checks (3 passed)
Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
Title check: ✅ Passed. The title 'feat: add Text to Speech endpoints with timestamp alignment and documentation' accurately describes the main changes: two new TTS endpoints (with_timestamp and with_timestamp/stream) are being added with corresponding OpenAPI documentation and MDX files.
Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx (1)

44-49: Consider adding a cross-reference to the streaming endpoint.

The use cases are well-documented. For completeness, consider adding a brief note or link to the streaming variant (/v1/tts/with_timestamp/stream) for users who need real-time progressive alignment. As per coding guidelines, use a relative path for the internal link.

Suggested addition
 ## Use Cases

 - **Subtitle generation**: Automatically create synchronized subtitles for video content
 - **Karaoke-style highlighting**: Highlight words as they are spoken
 - **Accessibility features**: Provide visual indicators synchronized with audio playback
 - **Audio editing**: Precisely locate and edit specific words in generated speech
+
+For real-time streaming with progressive alignment, see [Streaming Text to Speech with Timestamp](./tts-with-timestamp-stream).
api-reference/openapi.json (2)

755-866: Consider documenting the audio format for audio_base64.

The /v1/tts/with_timestamp endpoint returns audio_base64 but doesn't specify the audio format (WAV, MP3, etc.) in the request parameters or response schema. Unlike the main /v1/tts endpoint, there's no format parameter. Consider adding a note in the description or schema to clarify what format the base64-encoded audio uses by default.

Suggested description enhancement
 "/v1/tts/with_timestamp": {
   "post": {
     "summary": "Text to Speech with Timestamp Alignment",
-    "description": "Generates speech from text and returns word-level timestamp alignment. The complete audio is generated first, then aligned with the text to provide precise timing information for each segment.",
+    "description": "Generates speech from text and returns word-level timestamp alignment. The complete audio is generated first, then aligned with the text to provide precise timing information for each segment. Audio is returned as base64-encoded WAV format.",
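
Decoding the payload is straightforward; a hedged sketch, assuming WAV (RIFF) output as suggested above, since the schema itself does not state the format:

```python
import base64

def decode_audio(audio_base64: str) -> bytes:
    """Decode the base64 audio payload and sanity-check the RIFF header."""
    audio = base64.b64decode(audio_base64)
    if audio[:4] != b"RIFF":
        raise ValueError("payload is not RIFF/WAV; check the endpoint's audio format")
    return audio

# Placeholder payload: base64 of a fake RIFF header, not playable audio.
fake_payload = base64.b64encode(b"RIFF\x24\x00\x00\x00WAVEfmt ").decode()
audio_bytes = decode_audio(fake_payload)
# Persist with: open("speech.wav", "wb").write(audio_bytes)
```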

359-392: Consider aligning descriptions between JSON and MessagePack schemas.

The application/msgpack schema descriptions for the new fields (max_new_tokens, repetition_penalty, min_chunk_length, etc.) are shorter and less detailed than their application/json counterparts. For consistency and to avoid developer confusion, consider using the same detailed descriptions across both content types.

For example, max_new_tokens:

  • JSON: "Maximum number of audio tokens to generate per text chunk. Higher values allow generating longer audio segments per chunk..."
  • msgpack: "Maximum number of tokens to generate per chunk..."
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between decef17 and e04eb6b.

📒 Files selected for processing (3)
  • api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx
  • api-reference/openapi.json
  • docs.json
🧰 Additional context used
📓 Path-based instructions (1)
**/*.mdx

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.mdx: Document just enough for user success - not too much, not too little
Prioritize accuracy and usability of information
Make content evergreen when possible
Search for existing information before adding new content. Avoid duplication unless it is done for a strategic reason
Check existing patterns for consistency
Include title in YAML frontmatter: Clear, descriptive page title
Include description in YAML frontmatter: Concise summary for SEO/navigation
Use second-person voice ("you") in documentation content
Include prerequisites at the start of procedural content
Test all code examples before publishing
Match style and formatting of existing pages
Include both basic and advanced use cases in documentation
Include language tags on all code blocks
Include alt text on all images
Use relative paths for internal links
Never use emojis in professional content
Do not skip frontmatter on any MDX file
Do not use absolute URLs for internal links
Do not include untested code examples

Files:

  • api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Agent
🔇 Additional comments (5)
docs.json (1)

170-171: LGTM!

The navigation entries for the new timestamp endpoints are correctly placed within the TTS & ASR (v1) group and follow the existing path naming conventions.

api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx (1)

1-7: LGTM!

The frontmatter correctly includes the required title and description fields, and the openapi reference links to the correct endpoint path. The icon choice is appropriate for timestamp functionality.

api-reference/openapi.json (3)

180-213: LGTM! Well-documented new parameters.

The new TTS parameters (max_new_tokens, repetition_penalty, min_chunk_length, condition_on_previous_chunks, early_stop_threshold) are well-documented with sensible defaults and appropriate constraints. The descriptions clearly explain the purpose and trade-offs of each parameter.


987-1104: LGTM! Streaming endpoint is well-defined.

The streaming variant correctly:

  • Limits the model parameter to s1 only (the only model supporting streaming alignment)
  • Clearly describes the SSE format in the description
  • Uses text/event-stream content type
  • Documents the JSON structure within each SSE event
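
Consuming that stream amounts to splitting SSE frames and decoding the JSON in each data: line; a minimal sketch, where the event payload shape is an assumption based on the TTSWithTimestampResponse schema (alignment nullable) and the sample stream is invented:

```python
import json

def parse_sse_events(raw: str) -> list[dict]:
    """Split a raw text/event-stream body into decoded JSON event payloads."""
    events = []
    for frame in raw.strip().split("\n\n"):  # SSE frames are blank-line separated
        for line in frame.splitlines():
            if line.startswith("data:"):
                events.append(json.loads(line[len("data:"):].strip()))
    return events

# Invented two-event stream: first event carries audio only, second adds alignment.
stream = (
    'data: {"audio_base64": "QUJD", "alignment": null}\n\n'
    'data: {"audio_base64": "REVG", "alignment": [{"text": "hi", "start": 0.0, "end": 0.3}]}\n\n'
)
print(len(parse_sse_events(stream)))
```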

3410-3478: LGTM! Schemas are well-structured.

The TimestampSegment and TTSWithTimestampResponse schemas are correctly defined:

  • TimestampSegment requires all fields (text, start, end) as expected for alignment data
  • TTSWithTimestampResponse correctly marks only audio_base64 as required, allowing text and alignment to be nullable for streaming use cases
  • Descriptions are clear and consistent with the endpoint documentation
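
The subtitle use case follows directly from these schemas; a sketch that renders TimestampSegment entries as SRT, assuming start and end are in seconds (the unit is not stated in this summary):

```python
def to_srt(segments: list[dict]) -> str:
    """Render alignment segments (text/start/end, seconds assumed) as SRT."""
    def ts(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    entries = []
    for i, seg in enumerate(segments, start=1):
        entries.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text']}\n")
    return "\n".join(entries)

print(to_srt([{"text": "hello", "start": 0.0, "end": 0.42}]))
```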

Contributor

Copilot AI left a comment

Pull request overview

This pull request adds two new Text-to-Speech endpoints with timestamp alignment capabilities and comprehensively improves documentation across existing TTS API parameters. The changes enable developers to generate speech with word-level timing information, useful for subtitles, karaoke-style highlighting, and accessibility features.

  • Adds two new TTS endpoints with timestamp alignment (non-streaming and streaming variants)
  • Updates documentation for existing TTS parameters with more detailed, user-friendly descriptions
  • Changes default values for temperature, top_p, chunk_length, and opus_bitrate parameters
  • Adds new parameters: max_new_tokens, repetition_penalty, min_chunk_length, condition_on_previous_chunks, and early_stop_threshold
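
Taken together, a request body exercising the new parameters might look like the sketch below; the chunk_length, min_chunk_length, condition_on_previous_chunks, and early_stop_threshold values are the defaults quoted in the review comments, while the max_new_tokens and repetition_penalty values are purely illustrative:

```python
import json

# Hypothetical request body for POST /v1/tts/with_timestamp.
payload = {
    "text": "Hello, world.",
    # Defaults taken from the diff hunks quoted in this review:
    "chunk_length": 300,
    "min_chunk_length": 50,
    "condition_on_previous_chunks": True,
    "early_stop_threshold": 1.0,
    # Named in the PR, but these values are invented for illustration:
    "max_new_tokens": 1024,
    "repetition_penalty": 1.1,
}
print(json.dumps(payload, indent=2))
```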

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

docs.json
  Adds navigation entries for the two new timestamp-enabled TTS endpoints

api-reference/openapi.json
  Defines two new TTS endpoints with timestamp alignment, updates parameter descriptions and defaults for existing endpoints, and adds new schema definitions (TimestampSegment, TTSWithTimestampResponse)

api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx
  Provides documentation for the non-streaming timestamp endpoint with examples and use cases


"pages": [
"api-reference/endpoint/openapi-v1/text-to-speech",
"api-reference/endpoint/openapi-v1/tts-with-timestamp",
"api-reference/endpoint/openapi-v1/tts-with-timestamp-stream",
Copilot AI Jan 6, 2026

The docs.json references a documentation file 'api-reference/endpoint/openapi-v1/tts-with-timestamp-stream' that is missing from this PR. While the OpenAPI specification defines the streaming endpoint at line 987-1224, the corresponding MDX documentation file needs to be created to match the pattern used for other endpoints.

Suggested change
"api-reference/endpoint/openapi-v1/tts-with-timestamp-stream",

  "chunk_length": {
-   "default": 200,
+   "default": 300,
    "description": "Chunk length to be used for the speech",
Copilot AI Jan 6, 2026


The description for 'chunk_length' in the streaming endpoint is not updated with the detailed explanation. For consistency with the main TTS endpoint, consider updating this to the more comprehensive description: "Controls the size of text segments for processing. The text is split into chunks before synthesis. Higher values produce longer continuous speech segments but increase memory usage and latency. Lower values reduce latency but may affect naturalness at chunk boundaries."

Suggested change
"description": "Chunk length to be used for the speech",
"description": "Controls the size of text segments for processing. The text is split into chunks before synthesis. Higher values produce longer continuous speech segments but increase memory usage and latency. Lower values reduce latency but may affect naturalness at chunk boundaries.",

},
"min_chunk_length": {
"default": 50,
"description": "Minimum chunk length for text splitting",
Copilot AI Jan 6, 2026


The description for 'min_chunk_length' in the streaming endpoint is too brief and lacks the detail provided in the main TTS endpoint. For consistency and better developer understanding, consider updating this to match the more comprehensive description from line 194: "Minimum number of characters required before splitting text into a new chunk. Prevents creation of very short audio segments that might sound unnatural. Higher values ensure longer continuous segments but may increase latency for the first audio. Works in conjunction with chunk_length."

Suggested change
"description": "Minimum chunk length for text splitting",
"description": "Minimum number of characters required before splitting text into a new chunk. Prevents creation of very short audio segments that might sound unnatural. Higher values ensure longer continuous segments but may increase latency for the first audio. Works in conjunction with chunk_length.",

},
"condition_on_previous_chunks": {
"default": true,
"description": "Whether to condition generation on previous chunks. If references are provided, uses references only; otherwise uses first chunk as reference",
Copilot AI Jan 6, 2026


The description for 'condition_on_previous_chunks' in the streaming endpoint is inconsistent with the main TTS endpoint and potentially confusing. The main endpoint (line 202) provides a clearer explanation: "Enables voice consistency across multiple text chunks. When true, uses the audio from previous chunks as context for generating subsequent chunks, maintaining consistent voice characteristics throughout long texts. When false, each chunk is generated independently. Disable only if you want deliberately varied intonation between segments." Consider updating for consistency and clarity.

Suggested change
"description": "Whether to condition generation on previous chunks. If references are provided, uses references only; otherwise uses first chunk as reference",
"description": "Enables voice consistency across multiple text chunks. When true, uses the audio from previous chunks as context for generating subsequent chunks, maintaining consistent voice characteristics throughout long texts. When false, each chunk is generated independently. Disable only if you want deliberately varied intonation between segments.",

},
"early_stop_threshold": {
"default": 1.0,
"description": "Early stop if this fraction of samples are finished",
Copilot AI Jan 6, 2026


The description for 'early_stop_threshold' in the streaming endpoint is too terse. For consistency with the main TTS endpoint (line 208), consider updating to: "Controls when to stop audio generation early during batch processing. Value of 1.0 means wait for all samples to complete (best quality). Lower values (e.g., 0.8) stop when 80% of samples are done, potentially reducing latency but may cut off audio prematurely. Most users should leave this at 1.0."

Suggested change
"description": "Early stop if this fraction of samples are finished",
"description": "Controls when to stop audio generation early during batch processing. Value of 1.0 means wait for all samples to complete (best quality). Lower values (e.g., 0.8) stop when 80% of samples are done, potentially reducing latency but may cut off audio prematurely. Most users should leave this at 1.0.",

