Last updated on November 25, 2025
Amazon Polly Cheat Sheet
- A text-to-speech (TTS) service
- Uses advanced deep learning technologies to convert text into natural, lifelike speech
- Outputs the synthesized speech in MP3, OGG (Ogg Vorbis), and PCM formats.
- Offers Standard and Neural TTS (NTTS)
Common Use Cases
- Increase customer engagement
- Language learning applications
- Helps visually impaired individuals to consume digital content
- Testing in-game dialogs
- Voice response
Concepts
- Speech Synthesis Markup Language (SSML)
- Uses XML-based tags to modify different aspects of the text-to-speech output.
- Can control pitch, speaking style, speech rate, and volume.
- Standard TTS
- Concatenates short speech snippets together.
- Limited in terms of producing different speaking styles.
- Neural TTS
- Produces higher quality speech output than Standard TTS.
- Neural TTS supports two speaking styles:
- Conversational
- Newscaster
- Speech Mark
- Refers to the metadata that describes the synthesized speech
- Speech Mark has four types:
- Sentence
- Word
- Viseme
- SSML
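To make the speech mark types above concrete, here is a minimal sketch of reading speech mark metadata. It assumes the marks arrive as line-delimited JSON objects with "time", "type", and "value" fields (plus "start"/"end" offsets for sentence and word marks). The sample data is hand-written for illustration and was not captured from a real Amazon Polly response; the helper names are hypothetical.

```python
import json
from collections import defaultdict

# Illustrative sample only (assumption): hand-written marks, not real Polly output.
SAMPLE_MARKS = """\
{"time": 0, "type": "sentence", "start": 0, "end": 23, "value": "Mary had a little lamb."}
{"time": 6, "type": "word", "start": 0, "end": 4, "value": "Mary"}
{"time": 6, "type": "viseme", "value": "p"}
{"time": 373, "type": "word", "start": 5, "end": 8, "value": "had"}
"""


def parse_speech_marks(raw):
    """Parse line-delimited JSON speech marks into a list of dicts."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def group_by_type(marks):
    """Group marks by their "type" field (sentence, word, viseme, ssml)."""
    grouped = defaultdict(list)
    for mark in marks:
        grouped[mark["type"]].append(mark)
    return dict(grouped)


marks = parse_speech_marks(SAMPLE_MARKS)
grouped = group_by_type(marks)

# (time in ms, word) pairs: enough to highlight each word as it is spoken.
word_timings = [(m["time"], m["value"]) for m in grouped["word"]]
print(word_timings)
```

The word timings could drive text highlighting, while the viseme marks could drive mouth shapes for lip-syncing.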
Features
- Accepts plain text or SSML as input, encoded in UTF-8.
- Pronounces abbreviations and acronyms
- Interprets dates, times, and units of measurement.
- Homograph disambiguation
- For example, “St.” can be read as “saint” or “street.” Amazon Polly can tell the two apart from the surrounding context.
- Custom lexicon
- Supports customizing the pronunciation of words uncommon to the selected language.
- Speech Synthesis Markup Language (SSML) additions:
- emphasis to stress words.
- break to insert pauses of custom lengths.
- prosody for adjusting pitch, rate, and volume dynamically.
- say-as to specify interpretation (dates, times, numbers, phone numbers).
- Speech Marks (Metadata) additions:
- Can be used for lip-syncing in animations (especially viseme).
- Useful for highlighting text as spoken in apps.
- Generative (Gen) Voice Engine
- Beyond Standard, Neural (NTTS), and Long‑Form, Amazon Polly now supports a Generative voice engine.
- The generative engine is more expressive, with context-dependent prosody (intonation, pausing, etc.).
- As of August 2025, seven new expressive generative voices had been added (e.g., US English Salli, Canadian French Liam, Polish Ola/Ewa).
- In November 2025, five more generative voices were launched (Austrian German Hannah, Irish English Niamh, Brazilian Portuguese Camila, Belgian Dutch Lisa, Korean Seoyeon) plus expansion into new AWS regions (Seoul, Singapore, Tokyo).
- Voice Persona Across Languages (“Polyglot / Multilingual Identity”)
- Polly supports polyglot voices, meaning a single voice persona (same “speaker identity”) can speak in multiple languages.
- Example: Matthew (US English) voice identity is used in other locales (Pedro in US Spanish, Daniel in German, Liam in Canadian French, Andrés in Mexican Spanish, Sergio in European Spanish, Rémi in French).
- This is very useful for brand consistency across regions.
- Additional Regional / Voice Support
- New AWS Region for Neural voices: Europe (Zurich).
- Polly now supports Asia Pacific (Malaysia) region for both Neural and Standard voices.
- New neural voice: Korean Jihye.
- New neural English (Singapore) voice: Jasmine.
- Brand Voice
- Brand Voice (custom neural voice) is still supported: organizations can work with AWS to build a unique voice persona exclusive to them.
- This can help distinguish your brand with a vocal identity in IVR, contact centers, or other applications.
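As an illustration of the SSML tags listed under Features, the sketch below assembles an SSML document from small helper functions. The helper names, sample sentence, and phone number are hypothetical. Tag support differs between voice engines, so treat this as a sketch to check against the Amazon Polly SSML documentation rather than a definitive recipe.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def emphasis(text, level="moderate"):
    """Wrap text in an <emphasis> tag to stress it."""
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def pause(duration_ms):
    """Insert a pause of the given length in milliseconds."""
    return f'<break time="{duration_ms}ms"/>'


def prosody(text, rate="medium", volume="medium"):
    """Adjust speaking rate and volume for a span of text."""
    return f'<prosody rate="{rate}" volume="{volume}">{escape(text)}</prosody>'


def say_as(text, interpret_as):
    """Tell the engine how to interpret text (date, telephone, etc.)."""
    return f'<say-as interpret-as="{interpret_as}">{escape(text)}</say-as>'


def speak(*parts):
    """Join already-escaped fragments inside the <speak> root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("Call us at "),
    say_as("2025550125", "telephone"),
    pause(300),
    prosody("We are open now.", rate="slow", volume="loud"),
    emphasis("today"),
)

# Raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
root = ET.fromstring(ssml)
print(ssml)
```

With boto3, a string like this would typically be passed to the Polly client's synthesize_speech call as Text, with TextType set to "ssml". That step needs AWS credentials, so it is left out of the sketch.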
Standard vs Neural TTS
- Neural TTS updates:
- Supports more expressive emotions, not just Conversational or Newscaster.
- Can produce multiple voices per language, including brand voices (custom neural voices).
- Standard TTS limits:
- Cannot produce multiple speaking styles simultaneously.
- Less natural prosody (intonation and rhythm).
Amazon Polly Pricing
- Standard TTS
- $4.00 per 1 million characters
- Neural TTS
- $16.00 per 1 million characters
- Free tier: 5 million characters per month for the first 12 months (Standard TTS).
- Additional cost considerations:
- Charges are per character, not per request.
- Neural voices cost four times as much as Standard voices ($16.00 vs. $4.00 per 1 million characters).
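The per-character pricing above can be turned into a quick estimate. The sketch below uses only the rates and the Standard free-tier allowance listed in this section; the function and variable names are hypothetical, and actual bills depend on current AWS pricing.

```python
from decimal import Decimal

# Rates per 1 million characters, taken from the list above.
PRICE_PER_MILLION = {
    "standard": Decimal("4.00"),
    "neural": Decimal("16.00"),
}

# Free-tier allowance per month as listed above (Standard TTS, first 12 months).
STANDARD_FREE_CHARACTERS = 5_000_000


def estimate_cost(characters, engine, free_characters=0):
    """Estimate the monthly cost in USD for a number of synthesized characters."""
    billable = max(characters - free_characters, 0)
    return Decimal(billable) / Decimal(1_000_000) * PRICE_PER_MILLION[engine]


# 2.5 million characters on Neural voices, no free tier applied.
neural_cost = estimate_cost(2_500_000, "neural")

# 6 million characters on Standard voices with the free tier applied.
standard_cost = estimate_cost(
    6_000_000, "standard", free_characters=STANDARD_FREE_CHARACTERS
)

print(neural_cost, standard_cost)
```

Because billing is per character, splitting the same text across more requests does not change the estimate.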
Validate Your Knowledge
Question 1
A Business Process Outsourcing (BPO) company uses Amazon Polly to convert plaintext documents to speech for its voice response system. During testing, some acronyms and business-specific terms were pronounced incorrectly.
Which approach will fix this issue?
- Use a viseme Speech Mark.
- Use pronunciation lexicons.
- Convert the scripts into Speech Synthesis Markup Language (SSML) and use the pronunciation tag.
- Convert the scripts into Speech Synthesis Markup Language (SSML) and use the emphasis tag to guide the pronunciation.
Amazon Polly Cheat Sheet References:
https://aws.amazon.com/polly/faqs/
https://aws.amazon.com/polly/features/
https://docs.aws.amazon.com/polly/latest/dg/how-text-to-speech-works.html
https://aws.amazon.com/polly/pricing/