feat: add Text to Speech endpoints with timestamp alignment and documentation #36
base: main
Conversation
Walkthrough

Introduces a new TTS endpoint with word-level timestamp alignment capabilities. Adds POST /v1/tts/with_timestamp and its streaming variant to the OpenAPI specification with new schemas, updated parameter descriptions, and corresponding documentation and navigation entries.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Pre-merge checks: 3 passed
Actionable comments posted: 0
🧹 Nitpick comments (3)
api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx (1)
44-49: Consider adding a cross-reference to the streaming endpoint.

The use cases are well-documented. For completeness, consider adding a brief note or link to the streaming variant (`/v1/tts/with_timestamp/stream`) for users who need real-time progressive alignment. As per coding guidelines, use a relative path for the internal link.

Suggested addition:

```diff
 ## Use Cases

 - **Subtitle generation**: Automatically create synchronized subtitles for video content
 - **Karaoke-style highlighting**: Highlight words as they are spoken
 - **Accessibility features**: Provide visual indicators synchronized with audio playback
 - **Audio editing**: Precisely locate and edit specific words in generated speech
+
+For real-time streaming with progressive alignment, see [Streaming Text to Speech with Timestamp](./tts-with-timestamp-stream).
```

api-reference/openapi.json (2)
755-866: Consider documenting the audio format for `audio_base64`.

The `/v1/tts/with_timestamp` endpoint returns `audio_base64` but doesn't specify the audio format (WAV, MP3, etc.) in the request parameters or response schema. Unlike the main `/v1/tts` endpoint, there's no `format` parameter. Consider adding a note in the description or schema to clarify what format the base64-encoded audio uses by default.

Suggested description enhancement:

```diff
 "/v1/tts/with_timestamp": {
   "post": {
     "summary": "Text to Speech with Timestamp Alignment",
-    "description": "Generates speech from text and returns word-level timestamp alignment. The complete audio is generated first, then aligned with the text to provide precise timing information for each segment.",
+    "description": "Generates speech from text and returns word-level timestamp alignment. The complete audio is generated first, then aligned with the text to provide precise timing information for each segment. Audio is returned as base64-encoded WAV format.",
```
359-392: Consider aligning descriptions between JSON and MessagePack schemas.

The `application/msgpack` schema descriptions for the new fields (`max_new_tokens`, `repetition_penalty`, `min_chunk_length`, etc.) are shorter and less detailed than their `application/json` counterparts. For consistency and to avoid developer confusion, consider using the same detailed descriptions across both content types.

For example, `max_new_tokens`:

- JSON: "Maximum number of audio tokens to generate per text chunk. Higher values allow generating longer audio segments per chunk..."
- msgpack: "Maximum number of tokens to generate per chunk..."
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx
api-reference/openapi.json
docs.json
🧰 Additional context used
📓 Path-based instructions (1)
**/*.mdx
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.mdx: Document just enough for user success - not too much, not too little
Prioritize accuracy and usability of information
Make content evergreen when possible
Search for existing information before adding new content. Avoid duplication unless it is done for a strategic reason
Check existing patterns for consistency
Include title in YAML frontmatter: Clear, descriptive page title
Include description in YAML frontmatter: Concise summary for SEO/navigation
Use second-person voice ("you") in documentation content
Include prerequisites at the start of procedural content
Test all code examples before publishing
Match style and formatting of existing pages
Include both basic and advanced use cases in documentation
Include language tags on all code blocks
Include alt text on all images
Use relative paths for internal links
Never use emojis in professional content
Do not skip frontmatter on any MDX file
Do not use absolute URLs for internal links
Do not include untested code examples
Files:
api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Agent
🔇 Additional comments (5)
docs.json (1)
170-171: LGTM!

The navigation entries for the new timestamp endpoints are correctly placed within the TTS & ASR (v1) group and follow the existing path naming conventions.
api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx (1)
1-7: LGTM!

The frontmatter correctly includes the required `title` and `description` fields, and the `openapi` reference links to the correct endpoint path. The icon choice is appropriate for timestamp functionality.

api-reference/openapi.json (3)
180-213: LGTM! Well-documented new parameters.

The new TTS parameters (`max_new_tokens`, `repetition_penalty`, `min_chunk_length`, `condition_on_previous_chunks`, `early_stop_threshold`) are well-documented with sensible defaults and appropriate constraints. The descriptions clearly explain the purpose and trade-offs of each parameter.
987-1104: LGTM! Streaming endpoint is well-defined.

The streaming variant correctly:

- Limits the `model` parameter to `s1` only (the only model supporting streaming alignment)
- Clearly describes the SSE format in the description
- Uses the `text/event-stream` content type
- Documents the JSON structure within each SSE event
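As a rough sketch of how a client might consume such an SSE stream: the parser below assumes each event's payload arrives in `data:` lines containing a JSON object with `audio_base64` and `alignment` fields (field names taken from the `TTSWithTimestampResponse` schema; the exact per-event shape isn't reproduced in this review, so treat this as illustrative only).

```python
import base64
import json

def parse_sse_events(raw: str):
    """Parse a text/event-stream body into a list of JSON payloads.

    Assumes each event carries its JSON in `data:` lines and is
    terminated by a blank line, per the SSE wire format.
    """
    events = []
    data_lines = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            # Blank line ends the event; multi-line data is joined.
            events.append(json.loads("\n".join(data_lines)))
            data_lines = []
    return events

# Fabricated stream body for illustration; real audio would be much larger.
raw = (
    'data: {"audio_base64": "' + base64.b64encode(b"chunk1").decode() + '", '
    '"alignment": [{"text": "Hello", "start": 0.0, "end": 0.4}]}\n'
    "\n"
    'data: {"audio_base64": "' + base64.b64encode(b"chunk2").decode() + '", '
    '"alignment": [{"text": "world", "start": 0.4, "end": 0.8}]}\n'
    "\n"
)

for event in parse_sse_events(raw):
    audio = base64.b64decode(event["audio_base64"])
    words = [seg["text"] for seg in event["alignment"]]
    print(len(audio), words)
```

A production client would read the stream incrementally from the HTTP response rather than buffering the whole body, but the event-splitting logic is the same.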
3410-3478: LGTM! Schemas are well-structured.

The `TimestampSegment` and `TTSWithTimestampResponse` schemas are correctly defined:

- `TimestampSegment` requires all fields (`text`, `start`, `end`) as expected for alignment data
- `TTSWithTimestampResponse` correctly marks only `audio_base64` as required, allowing `text` and `alignment` to be nullable for streaming use cases
- Descriptions are clear and consistent with the endpoint documentation
Pull request overview
This pull request adds two new Text-to-Speech endpoints with timestamp alignment capabilities and comprehensively improves documentation across existing TTS API parameters. The changes enable developers to generate speech with word-level timing information, useful for subtitles, karaoke-style highlighting, and accessibility features.
- Adds two new TTS endpoints with timestamp alignment (non-streaming and streaming variants)
- Updates documentation for existing TTS parameters with more detailed, user-friendly descriptions
- Changes default values for temperature, top_p, chunk_length, and opus_bitrate parameters
- Adds new parameters: max_new_tokens, repetition_penalty, min_chunk_length, condition_on_previous_chunks, and early_stop_threshold
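As an illustration of the subtitle use case mentioned above, a small sketch that converts `TimestampSegment`-shaped alignment data into SRT subtitles. The segment values are made up for the example.

```python
def to_srt(segments):
    """Render alignment segments (text, start, end in seconds) as SRT."""
    def ts(seconds):
        # SRT timestamps use HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text']}\n")
    return "\n".join(lines)

# Hypothetical alignment data matching the TimestampSegment schema.
segments = [
    {"text": "Hello", "start": 0.0, "end": 0.42},
    {"text": "world", "start": 0.42, "end": 0.80},
]
print(to_srt(segments))
```

A real subtitle pipeline would typically merge word-level segments into phrase-length cues before emitting SRT, but the timestamp plumbing is the same.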
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs.json | Adds navigation entries for the two new timestamp-enabled TTS endpoints |
| api-reference/openapi.json | Defines two new TTS endpoints with timestamp alignment, updates parameter descriptions and defaults for existing endpoints, and adds new schema definitions (TimestampSegment, TTSWithTimestampResponse) |
| api-reference/endpoint/openapi-v1/tts-with-timestamp.mdx | Provides documentation for the non-streaming timestamp endpoint with examples and use cases |
```json
"pages": [
  "api-reference/endpoint/openapi-v1/text-to-speech",
  "api-reference/endpoint/openapi-v1/tts-with-timestamp",
  "api-reference/endpoint/openapi-v1/tts-with-timestamp-stream",
```
Copilot AI (Jan 6, 2026):
The docs.json references a documentation file 'api-reference/endpoint/openapi-v1/tts-with-timestamp-stream' that is missing from this PR. While the OpenAPI specification defines the streaming endpoint at line 987-1224, the corresponding MDX documentation file needs to be created to match the pattern used for other endpoints.
```diff
 "chunk_length": {
-  "default": 200,
+  "default": 300,
   "description": "Chunk length to be used for the speech",
```
Copilot AI (Jan 6, 2026):
The description for 'chunk_length' in the streaming endpoint is not updated with the detailed explanation. For consistency with the main TTS endpoint, consider updating this to the more comprehensive description: "Controls the size of text segments for processing. The text is split into chunks before synthesis. Higher values produce longer continuous speech segments but increase memory usage and latency. Lower values reduce latency but may affect naturalness at chunk boundaries."
Suggested change:

```diff
-"description": "Chunk length to be used for the speech",
+"description": "Controls the size of text segments for processing. The text is split into chunks before synthesis. Higher values produce longer continuous speech segments but increase memory usage and latency. Lower values reduce latency but may affect naturalness at chunk boundaries.",
```
```json
},
"min_chunk_length": {
  "default": 50,
  "description": "Minimum chunk length for text splitting",
```
Copilot AI (Jan 6, 2026):
The description for 'min_chunk_length' in the streaming endpoint is too brief and lacks the detail provided in the main TTS endpoint. For consistency and better developer understanding, consider updating this to match the more comprehensive description from line 194: "Minimum number of characters required before splitting text into a new chunk. Prevents creation of very short audio segments that might sound unnatural. Higher values ensure longer continuous segments but may increase latency for the first audio. Works in conjunction with chunk_length."
Suggested change:

```diff
-"description": "Minimum chunk length for text splitting",
+"description": "Minimum number of characters required before splitting text into a new chunk. Prevents creation of very short audio segments that might sound unnatural. Higher values ensure longer continuous segments but may increase latency for the first audio. Works in conjunction with chunk_length.",
```
```json
},
"condition_on_previous_chunks": {
  "default": true,
  "description": "Whether to condition generation on previous chunks. If references are provided, uses references only; otherwise uses first chunk as reference",
```
Copilot AI (Jan 6, 2026):
The description for 'condition_on_previous_chunks' in the streaming endpoint is inconsistent with the main TTS endpoint and potentially confusing. The main endpoint (line 202) provides a clearer explanation: "Enables voice consistency across multiple text chunks. When true, uses the audio from previous chunks as context for generating subsequent chunks, maintaining consistent voice characteristics throughout long texts. When false, each chunk is generated independently. Disable only if you want deliberately varied intonation between segments." Consider updating for consistency and clarity.
Suggested change:

```diff
-"description": "Whether to condition generation on previous chunks. If references are provided, uses references only; otherwise uses first chunk as reference",
+"description": "Enables voice consistency across multiple text chunks. When true, uses the audio from previous chunks as context for generating subsequent chunks, maintaining consistent voice characteristics throughout long texts. When false, each chunk is generated independently. Disable only if you want deliberately varied intonation between segments.",
```
```json
},
"early_stop_threshold": {
  "default": 1.0,
  "description": "Early stop if this fraction of samples are finished",
```
Copilot AI (Jan 6, 2026):
The description for 'early_stop_threshold' in the streaming endpoint is too terse. For consistency with the main TTS endpoint (line 208), consider updating to: "Controls when to stop audio generation early during batch processing. Value of 1.0 means wait for all samples to complete (best quality). Lower values (e.g., 0.8) stop when 80% of samples are done, potentially reducing latency but may cut off audio prematurely. Most users should leave this at 1.0."
Suggested change:

```diff
-"description": "Early stop if this fraction of samples are finished",
+"description": "Controls when to stop audio generation early during batch processing. Value of 1.0 means wait for all samples to complete (best quality). Lower values (e.g., 0.8) stop when 80% of samples are done, potentially reducing latency but may cut off audio prematurely. Most users should leave this at 1.0.",
```
Summary by CodeRabbit
Release Notes
New Features
- New Text to Speech with timestamp alignment endpoints, in standard (POST /v1/tts/with_timestamp) and streaming (POST /v1/tts/with_timestamp/stream) variants.

Documentation

- Added documentation and navigation entries for the new endpoints.