Amazon Textract Cheat Sheet

Last updated on November 25, 2025

A fully managed document analysis service for detecting and extracting information from scanned documents.
Can return form data as key-value pairs (e.g., Name: John Doe).
Supports virtually any type of document.
Can detect printed text in English, Spanish, German, Italian, French, and Portuguese, along with standard ASCII symbols.
Integrates with the AWS SDKs (Python via Boto3, Java, Node.js, etc.) and the AWS CLI.
Supports image files (PNG, JPEG, TIFF) and PDFs (single and multipage).
Can detect handwriting as well as printed text; handwriting extraction is supported for English.
Works natively with S3 (you can process documents stored in S3 directly).

Common Use Cases:

Text extraction for natural language processing (NLP) applications
Maintaining document compliance
Invoice processing: Extract invoice number, date, and total amount automatically.
Receipt scanning: Automatically categorize expenses for accounting.
Legal document review: Identify contract clauses, parties, and dates.
Healthcare forms: Extract patient information and medical codes.
Data migration: Move data from legacy paper forms into structured databases.

Amazon Textract returns a confidence score (0 to 100) for each identified element, which indicates how likely it is that the prediction is correct.
Predictions with a low confidence score can be routed to Amazon Augmented AI (A2I) for human review.
The asynchronous operations allow you to process multipage PDF documents.
Detect Document Text API
- Uses optical character recognition (OCR) technology to extract printed text and handwriting from a document.
Analyze Document API
- Extracts printed text and handwriting, plus structured data such as tables and key-value pairs from forms.
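The confidence scores described above can be used to decide which results are accepted automatically and which are set aside for human review. The following is a minimal sketch, not an official pattern: the function name split_by_confidence, the 90.0 threshold, and the sample blocks are assumptions chosen for illustration. Only the Block fields BlockType, Text, and Confidence come from the Textract response format, and the actual hand-off to an A2I review workflow is not shown.

```python
from typing import Dict, List, Tuple


def split_by_confidence(
    blocks: List[Dict], threshold: float = 90.0
) -> Tuple[List[str], List[str]]:
    """Split LINE blocks into accepted text and text needing human review.

    `threshold` is a hypothetical cutoff on Textract's 0-100 confidence scale.
    """
    accepted: List[str] = []
    needs_review: List[str] = []
    for block in blocks:
        # Only LINE blocks are considered; PAGE and WORD blocks are skipped.
        if block.get("BlockType") != "LINE":
            continue
        if block.get("Confidence", 0.0) >= threshold:
            accepted.append(block["Text"])
        else:
            needs_review.append(block["Text"])
    return accepted, needs_review


# Hand-written sample data shaped like Textract blocks (not real output).
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Name: John Doe", "Confidence": 99.2},
    {"BlockType": "LINE", "Text": "T0tal: 4Z.10", "Confidence": 61.5},
]

accepted, needs_review = split_by_confidence(sample_blocks)
print("accepted:", accepted)
print("needs review:", needs_review)
```

The right threshold depends on the use case; a stricter cutoff sends more items to reviewers but lowers the risk of accepting a wrong value.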

Detect Document Text API:
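A minimal sketch of a synchronous call with Boto3, assuming the document is a single-page image or PDF already stored in S3 and that AWS credentials and a region are configured. The helper names extract_lines and detect_text_in_s3, and any bucket or key you pass in, are illustrative rather than part of the service.

```python
from typing import Dict, List


def extract_lines(response: Dict) -> List[str]:
    """Return the text of every LINE block in a DetectDocumentText response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


def detect_text_in_s3(bucket: str, key: str) -> List[str]:
    """Run synchronous text detection on an object stored in S3."""
    # Imported lazily so extract_lines can be used without boto3 installed.
    import boto3

    client = boto3.client("textract")
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return extract_lines(response)


# Hand-written sample shaped like a response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice #1001"},
        {"BlockType": "WORD", "Text": "Invoice"},
        {"BlockType": "LINE", "Text": "Total: 42.10"},
    ]
}
print(extract_lines(sample_response))
```

Keeping the parsing in its own function makes it possible to test it against saved responses without calling AWS.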

Analyze Document API:
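A minimal sketch of extracting key-value pairs from a form with Boto3, assuming the FORMS feature type and a document stored in S3. In the response, KEY_VALUE_SET blocks are linked to each other and to their WORD blocks through Relationships. The helper names below and the sample blocks are illustrative, and selection elements such as checkboxes are ignored for brevity.

```python
from typing import Dict, List


def _text_of(block: Dict, by_id: Dict[str, Dict]) -> str:
    """Join the text of the WORD blocks that are children of `block`."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(blocks: List[Dict]) -> Dict[str, str]:
    """Map each form key to its value using KEY_VALUE_SET relationships."""
    by_id = {block["Id"]: block for block in blocks}
    pairs: Dict[str, str] = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                values.extend(_text_of(by_id[vid], by_id) for vid in rel["Ids"])
        pairs[_text_of(block, by_id)] = " ".join(values)
    return pairs


def analyze_form_in_s3(bucket: str, key: str) -> Dict[str, str]:
    """Run synchronous form analysis on an object stored in S3."""
    # Imported lazily so the parsing helpers work without boto3 installed.
    import boto3

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )
    return key_value_pairs(response["Blocks"])


# Hand-written sample shaped like AnalyzeDocument blocks (not real output).
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "John"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]
print(key_value_pairs(sample_blocks))
```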

Asynchronous Operations:

StartDocumentTextDetection / StartDocumentAnalysis: asynchronous APIs for large or multipage documents.
GetDocumentTextDetection / GetDocumentAnalysis: retrieve results for asynchronous operations.
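The start/get pair above can be sketched as follows with Boto3. This is an illustrative polling loop, not a recommended production design: the function names, the poll interval, and the retry limit are assumptions, and real workloads often use the optional SNS notification channel instead of polling. Results are paginated, so the sketch follows NextToken until it is absent.

```python
import time
from typing import List


def collect_async_lines(
    client, job_id: str, poll_seconds: float = 5.0, max_polls: int = 120
) -> List[str]:
    """Wait for a text detection job, then gather LINE text from all pages."""
    for _ in range(max_polls):
        response = client.get_document_text_detection(JobId=job_id)
        if response["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Job {job_id} did not finish in time")

    if response["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Job {job_id} ended as {response['JobStatus']}")

    lines: List[str] = []
    while True:
        lines.extend(
            block["Text"]
            for block in response.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        token = response.get("NextToken")
        if not token:
            return lines
        # Fetch the next page of results for the same job.
        response = client.get_document_text_detection(
            JobId=job_id, NextToken=token
        )


def start_and_collect(bucket: str, key: str) -> List[str]:
    """Start an asynchronous job for an S3 document and return its lines."""
    # Imported lazily so collect_async_lines can be tested without boto3.
    import boto3

    client = boto3.client("textract")
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return collect_async_lines(client, job["JobId"])
```

The same structure applies to StartDocumentAnalysis and GetDocumentAnalysis, with FeatureTypes added to the start call.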