Azure AI Vision Cheat Sheet
Azure AI Vision is a cloud-based service that extracts information from images and videos. Its features range from basic image analysis, which identifies common elements in a picture, to object detection, which locates specific items or people within a scene. The service also offers facial recognition, which identifies individuals by their facial features, and spatial understanding, which interprets the relationships and context of objects in an environment. Developers can use the pre-trained models as they are or customize them for specific needs, then integrate the results into their applications.
Major Product Categories
Image Analysis
- Image Tagging – Automatically identifies thousands of recognizable objects, living beings, scenery, and actions within images and returns a list of relevant tags with confidence scores. Useful for image categorization and search.
- Object Detection – Pinpoints the location of specific objects within an image by providing bounding box coordinates (x, y, width, height) and assigning labels with confidence scores. Crucial for tasks like inventory management, defect detection, and autonomous driving.
- Instance Segmentation – Goes beyond object detection by outlining the precise boundaries (polygon masks) of each individual instance of an object in an image. Enables pixel-level understanding of the visual scene.
- Image Classification – Categorizes an entire image into one or more predefined classes based on its overall content. Useful for content moderation and automated image sorting.
- Brand Detection – Identifies the presence of commercial brands in images by recognizing logos. Valuable for market research and brand monitoring.
- Face Detection – Detects human faces within an image and generates rectangle coordinates for each detected face.
- Adult Content Detection – Analyzes images and returns a score indicating the likelihood of containing adult or racy content. Important for content filtering and safety measures.
- Image Type Detection – Determines if an image is a photograph, clip art, or a line drawing. Useful for optimizing image processing pipelines.
- Color Scheme Analysis – Identifies the dominant foreground and background colors in an image, along with accent colors. Useful for aesthetic analysis and design applications.
- Smart Cropping – Suggests optimal rectangular crops of an image based on content saliency, helping to create visually appealing thumbnails and previews.
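Tagging and object detection both return labels with confidence scores, and object detection adds a bounding box. The following is a minimal sketch of post-processing such a result. It assumes a response shaped like the sample dictionary; the field names (`objects`, `rectangle`, `object`, `confidence`) and the 0.5 threshold are illustrative and should be checked against the API version you actually call.

```python
from typing import Any, Dict, List

# Illustrative response only; the field names are an assumption, not a guaranteed schema.
SAMPLE_RESPONSE: Dict[str, Any] = {
    "objects": [
        {"rectangle": {"x": 25, "y": 43, "w": 172, "h": 140}, "object": "dog", "confidence": 0.91},
        {"rectangle": {"x": 300, "y": 80, "w": 60, "h": 45}, "object": "ball", "confidence": 0.42},
    ]
}


def confident_objects(response: Dict[str, Any], threshold: float = 0.5) -> List[Dict[str, Any]]:
    """Keep detections at or above the threshold and add corner coordinates."""
    kept = []
    for item in response.get("objects", []):
        if item["confidence"] < threshold:
            continue  # drop low-confidence detections
        rect = item["rectangle"]
        kept.append({
            "label": item["object"],
            "confidence": item["confidence"],
            # Convert (x, y, width, height) into left/top/right/bottom corners.
            "left": rect["x"],
            "top": rect["y"],
            "right": rect["x"] + rect["w"],
            "bottom": rect["y"] + rect["h"],
        })
    return kept


if __name__ == "__main__":
    for detection in confident_objects(SAMPLE_RESPONSE):
        print(detection)
```

Raising the threshold trades recall for precision, so the right value depends on how costly a false positive is in your application.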
Optical Character Recognition (OCR)
- Extracts printed and handwritten text from images, documents, and even scenes in videos.
- Provides the text content along with bounding boxes indicating the location of each word or text line.
- Offers options for synchronous (for smaller images) and asynchronous (for larger, multi-page documents) processing.
- Supports a wide variety of languages, enabling global applicability.
- Can detect text orientation and automatically correct it for better readability.
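Asynchronous OCR generally follows a submit-then-poll pattern: you submit the document, receive a reference to the pending operation, and check it until the work finishes. The sketch below shows only the polling half. The `fetch_status` callable is a hypothetical stand-in for an HTTP GET against the operation URL, and the status strings (`running`, `succeeded`, `failed`) are assumptions to verify against the API reference.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal status values; confirm them for the API version in use.
TERMINAL_STATES = {"succeeded", "failed"}


def wait_for_result(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_seconds: float = 1.0,
    max_attempts: int = 30,
) -> Dict[str, Any]:
    """Call fetch_status until it reports a terminal state or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch_status()
        if result.get("status") in TERMINAL_STATES:
            return result
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)  # wait before asking again
    raise TimeoutError(f"no terminal status after {max_attempts} attempts")


if __name__ == "__main__":
    # Canned responses standing in for real HTTP calls.
    canned = iter([{"status": "running"}, {"status": "succeeded", "lines": ["Hello"]}])
    print(wait_for_result(lambda: next(canned), interval_seconds=0.0))
```

Capping the number of attempts keeps a stuck operation from blocking the caller indefinitely.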
Face service
- Face Detection – Locates human faces in images and returns bounding boxes for each detected face. Can estimate the number of faces and their rough positions.
- Face Recognition – Identifies previously enrolled faces from a database. Requires a training step to associate faces with unique person IDs. Useful for security, access control, and personalized experiences.
- Face Verification – Determines the likelihood that two faces belong to the same person by comparing their facial features. Used for authentication and identity confirmation.
- Face Grouping – Organizes a set of unidentified faces into groups based on visual similarity. Helpful for organizing photo collections.
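- Face Attributes – Analyzes detected faces and predicts attributes such as age, gender, emotion (happiness, sadness, anger, etc.), presence of facial hair, makeup, and head pose, giving deeper insight into facial characteristics. Microsoft announced in 2022 that it would retire attributes that infer emotional state or identity (including emotion, age, and gender), so check the current documentation for which attributes are still available.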
- Liveness Detection (Preview) – Helps to prevent spoofing attacks by verifying that a detected face is a real, live person and not a photograph or mask.
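Face verification returns a likelihood rather than a hard yes or no, so applications usually apply their own acceptance rule on top. The sketch below assumes a verification result with an `isIdentical` flag and a `confidence` value; both field names and the 0.8 threshold are assumptions for illustration, not recommended settings.

```python
from typing import Any, Dict


def accept_match(verify_result: Dict[str, Any], min_confidence: float = 0.8) -> bool:
    """Accept only if the service reports a match AND confidence clears our own bar."""
    is_identical = bool(verify_result.get("isIdentical"))
    confidence = verify_result.get("confidence", 0.0)
    return is_identical and confidence >= min_confidence


if __name__ == "__main__":
    print(accept_match({"isIdentical": True, "confidence": 0.93}))  # True
    print(accept_match({"isIdentical": True, "confidence": 0.55}))  # False
```

A stricter threshold suits authentication flows, where a false accept is worse than asking the user to try again.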
Key Concepts
- Computer Vision Resource – Your gateway to accessing all Azure AI Vision services. You create and manage this resource in the Azure portal. It provides the necessary endpoints and allows you to manage your API keys and billing.
- Endpoints – Your resource has a base endpoint URL, and each Azure AI Vision feature is reached through its own API path under it (e.g., one path for image analysis, another for OCR). You’ll need to call the correct endpoint for the functionality you want to use.
- API Keys – Unique authentication keys associated with your Computer Vision resource. These keys are required in the headers of your API requests to authorize access to the services. Treat these keys as secrets.
- Transactions – Most Azure AI Vision features are billed based on the number of transactions. Understanding what constitutes a transaction for each feature is crucial for cost management. For example, analyzing one image might be one transaction, while processing a multi-page document with OCR could involve multiple transactions.
- Bounding Boxes – Represented as a set of coordinates (typically top-left x, top-left y, width, height) that define a rectangular region around a detected object or text element within an image.
- Confidence Scores – A numerical value between 0 and 1 (or 0% and 100%) that indicates the AI model’s certainty in its prediction. Higher scores generally mean a more reliable prediction. You can often set thresholds to filter results based on confidence scores.
- Vector Embeddings – Dense numerical representations of images learned by deep learning models. Images with similar visual content have embeddings that lie closer together in the high-dimensional space, enabling efficient similarity searches.
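To make the vector embedding idea concrete, here is a small sketch of similarity search using cosine similarity, a common way to compare embeddings. The three-dimensional vectors and the catalog names are made up for illustration; real image embeddings have far more dimensions, and the metric a given service recommends may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], catalog: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Score every catalog entry against the query, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy embeddings, invented for this example.
    catalog = {
        "beach": [0.9, 0.1, 0.0],
        "forest": [0.1, 0.9, 0.2],
        "city": [0.0, 0.2, 0.9],
    }
    for name, score in rank_by_similarity([0.8, 0.2, 0.1], catalog):
        print(f"{name}: {score:.3f}")
```

A linear scan like this is fine for small catalogs; larger collections typically rely on a vector index instead.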
Vision Studio
Vision Studio is a web-based user interface for exploring the capabilities of Azure AI Vision without writing any code. It provides a visual way to test different features, upload images and videos, and see the results in real time.
- No-Code Exploration – The primary purpose of Vision Studio is to allow users, regardless of their technical expertise, to try out the different Vision API features easily. You can upload your images or provide URLs and see the results in a user-friendly interface.
- Feature Showcase – It provides a clear and organized way to access and test features like:
  - Image Analysis – Tagging, captioning, object detection, instance segmentation, smart cropping, and more.
  - OCR (Optical Character Recognition) – Extracting text from images and documents.
  - Face service – Detecting faces, analyzing attributes, comparing faces, and even liveness detection.
  - Spatial Analysis (Preview) – Analyzing people’s movement and interactions in video.
  - Image Retrieval (Preview) – Searching for visually similar images.
  - Custom Vision (Integration) – Interacting with and managing your custom-trained models.
- When you upload an image or video and select a feature to test, Vision Studio displays the results in real time.
- Vision Studio allows you to adjust parameters, such as confidence thresholds for object detection or the language for OCR, giving you more control over the analysis.
- It provides sample code snippets in various programming languages (like Python, C#, Java) that demonstrate how to achieve the same results programmatically using the Azure AI Vision SDKs.
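For a sense of what a programmatic call involves beneath the SDKs, the sketch below builds (but does not send) a raw REST request using only the Python standard library. It assumes the `/vision/v3.2/analyze` route, the `visualFeatures` query parameter, and the `Ocp-Apim-Subscription-Key` header; confirm all three against the current API reference. The `VISION_ENDPOINT` and `VISION_KEY` environment variable names are hypothetical.

```python
import json
import os
import urllib.parse
import urllib.request
from typing import Sequence


def build_analyze_request(
    endpoint: str, api_key: str, image_url: str, features: Sequence[str]
) -> urllib.request.Request:
    """Assemble a POST request for image analysis without sending it."""
    query = urllib.parse.urlencode({"visualFeatures": ",".join(features)})
    # Assumed route; the path and version depend on the API release you target.
    url = f"{endpoint.rstrip('/')}/vision/v3.2/analyze?{query}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": api_key,  # the resource's API key
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Read secrets from the environment rather than hard-coding them.
    endpoint = os.environ.get("VISION_ENDPOINT", "https://example.cognitiveservices.azure.com")
    key = os.environ.get("VISION_KEY", "<your-key>")
    request = build_analyze_request(endpoint, key, "https://example.com/photo.jpg", ["Tags", "Objects"])
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

Keeping request construction separate from sending makes the URL, headers, and body easy to inspect or unit test before any billable transaction occurs.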
Language Support
- OCR – Supports a broad range of languages for text extraction, including English, Spanish, French, German, Chinese (Simplified and Traditional), Japanese, Korean, Portuguese, Russian, and many more. The specific list of supported languages can be found in the official Azure AI Vision documentation.
- Image Analysis (Captioning, Tagging) – Supports generating captions and tags in multiple languages, including English, Chinese, French, German, Italian, Japanese, Korean, Portuguese, and Spanish. The quality and availability of language support may vary depending on the specific analysis feature.
- Face service – The core face detection and recognition functionalities are language-independent because they analyze visual facial features. However, reading text that appears in the same image (e.g., on a t-shirt worn by a detected person) depends on the OCR capabilities.
Pricing
Azure AI Vision follows a pay-as-you-go model, with a free tier available for initial use. Pricing varies depending on the specific feature and the volume of transactions. Key aspects of pricing include:
- Free Tier – Offers limited free transactions per month for many features.
- Standard Tier – Priced per 1,000 transactions, with different rates for different API calls (e.g., basic analysis vs. object detection).
- OCR – Priced per 1,000 image reads.
- Face service – Priced per 1,000 transactions for operations like detection, identification, and verification.
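Because billing is expressed per 1,000 transactions, a rough monthly estimate is simple arithmetic. The sketch below uses placeholder rates and usage figures invented for illustration; they are not actual Azure prices, and real pricing may also be tiered by volume, which this flat-rate calculation ignores.

```python
from typing import Dict


def estimate_cost(transactions: int, price_per_1000: float) -> float:
    """Flat-rate cost for a number of transactions at a per-1,000 price."""
    if transactions < 0 or price_per_1000 < 0:
        raise ValueError("transactions and price must be non-negative")
    return transactions / 1000 * price_per_1000


def estimate_total(usage: Dict[str, int], rates: Dict[str, float]) -> float:
    """Sum the flat-rate cost across features, rounded to cents."""
    total = sum(estimate_cost(count, rates[feature]) for feature, count in usage.items())
    return round(total, 2)


if __name__ == "__main__":
    # Placeholder numbers, NOT real Azure prices.
    rates = {"image_analysis": 1.00, "ocr": 1.50, "face": 1.00}
    usage = {"image_analysis": 40_000, "ocr": 12_500, "face": 8_000}
    print(estimate_total(usage, rates))  # 66.75
```

Substitute the rates from the official pricing page for your region and tier before relying on the result.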