Azure AI Video Indexer

Azure AI Video Indexer Cheat Sheet

Azure AI Video Indexer is a cloud-based service within Azure AI that combines multiple Azure AI technologies, such as Face, Translator, Vision, and Speech, to analyze video and audio content. It runs more than 30 AI models to extract detailed insights from videos, which makes multimedia content easier to understand, search, and index.

How can Azure AI Video Indexer be used?

Azure AI Video Indexer provides insights that can support a wide range of use cases, including:

Deep Search – Enables searching within videos for spoken words, faces, and co-occurrences to find specific moments quickly. Useful for news agencies, education, broadcasters, entertainment, enterprises, and any video library.
Content Creation – Facilitates creating trailers, highlight reels, and clips using keyframes, scene markers, and timestamps to easily navigate and edit video content.
Accessibility – Provides multi-language transcription and translation to support people with disabilities and distribute content globally in various languages.
Monetization – Enhances video value by supplying detailed insights to ad servers, helping industries deliver more relevant ads and increase revenue.
Content Moderation – Uses text and visual moderation models to detect inappropriate content, ensuring compliance with organizational standards by blocking or alerting users.
Recommendations – Improves user engagement by tagging videos with rich metadata to suggest relevant videos and highlight important moments tailored to user preferences.
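
As a rough illustration of the Deep Search idea, the sketch below looks up a spoken phrase in a list of timestamped transcript lines and returns the matching moments. The TranscriptLine structure, the find_moments function, and the sample data are hypothetical and exist only for this example; they are not the Video Indexer output schema or API.

```python
from dataclasses import dataclass


@dataclass
class TranscriptLine:
    """One timestamped line of speech (hypothetical structure, not the service schema)."""
    start_seconds: float
    end_seconds: float
    text: str


def find_moments(lines, phrase):
    """Return the transcript lines whose text contains the phrase, ignoring case."""
    needle = phrase.casefold()
    return [line for line in lines if needle in line.text.casefold()]


# Made-up sample data for illustration only.
transcript = [
    TranscriptLine(0.0, 4.2, "Welcome to the quarterly update."),
    TranscriptLine(4.2, 9.8, "Revenue grew in the cloud segment."),
    TranscriptLine(9.8, 15.0, "Next, the cloud roadmap for next year."),
]

# Print the start time of every moment where "cloud" is spoken.
for hit in find_moments(transcript, "cloud"):
    print(f"{hit.start_seconds:.1f}s: {hit.text}")
```

A real integration would read the timestamps from the indexed insights rather than from a hand-built list, but the lookup pattern stays the same.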

🎥 Video Models

Face Detection: Finds and groups faces appearing in videos.
Celebrity Identification: Recognizes over 1 million celebrities, including actors, leaders, and public figures, using data from public sources.
Account-based Face Identification: Custom trains face recognition models specific to an account to identify faces in videos.
Face Thumbnail Extraction: Selects the best-quality face image per group and extracts it as an asset.
Optical Character Recognition (OCR): Extracts text from images such as signs or product labels within videos.
Visual Content Moderation: Detects adult or racy visual content for compliance.

Labels Identification: Recognizes objects and actions shown in video frames.
Scene Segmentation: Detects when scenes change by analyzing visual cues; scenes are made of related shots.
Shot Detection: Identifies shot boundaries based on visual differences; shots are continuous frames from a single camera take.
Black Frame Detection: Finds black frames within video content.
Keyframe Extraction: Identifies stable frames that represent important moments in videos.
Rolling Credits Detection: Locates the start and end of credits in TV shows or movies.
Editorial Shot Type Detection: Tags shots by type (e.g., close-up, wide shot, indoor/outdoor).
Observed People Detection: Tracks people’s locations in video frames with bounding boxes and timestamps.
- Matched Person: Links observed people with detected faces, providing confidence scores.
- Detected Clothing: Identifies clothing types worn by people with timing and confidence.
- Featured Clothing: Extracts images of key clothing items to support targeted advertising.
Object Detection: Detects and tracks unique objects, recognizing their reappearance in frames.
Slate Detection: Identifies movie post-production elements like clapperboards, color bars, and textless slates.
Textual Logo Detection: Matches specific predefined text (e.g., brand names) detected via OCR to identify logos.

Note: Some features require special authorization or have privacy restrictions due to regulatory requirements.
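
To make the ideas behind Black Frame Detection and Shot Detection more concrete, here is a toy sketch that works on tiny grayscale frames. It is a simple heuristic for illustration, assuming frames are flat lists of 0-255 pixel values; the function names and thresholds are made up, and this is not how the service's models work internally.

```python
def mean_brightness(frame):
    """Average pixel value of a frame given as a flat list of 0-255 grayscale values."""
    return sum(frame) / len(frame)


def find_black_frames(frames, threshold=10.0):
    """Indices of frames whose average brightness is below the (assumed) threshold."""
    return [i for i, frame in enumerate(frames) if mean_brightness(frame) < threshold]


def find_shot_boundaries(frames, threshold=50.0):
    """Indices of frames that differ sharply from the previous frame.

    Uses the mean absolute pixel difference as a simple stand-in
    for "visual differences" between consecutive frames.
    """
    boundaries = []
    for i in range(1, len(frames)):
        diff = sum(abs(a - b) for a, b in zip(frames[i], frames[i - 1])) / len(frames[i])
        if diff > threshold:
            boundaries.append(i)
    return boundaries


# Tiny made-up "video" with 4-pixel frames.
frames = [
    [200, 210, 205, 198],  # shot A
    [201, 209, 204, 199],  # shot A
    [0, 2, 1, 0],          # black frame
    [90, 95, 100, 92],     # shot B
    [91, 96, 99, 93],      # shot B
]

print(find_black_frames(frames))     # [2]
print(find_shot_boundaries(frames))  # [2, 3]
```

The black frame is reported as a boundary on both sides because it differs sharply from the frames before and after it.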

🔉 Audio Models

Audio Transcription: Converts spoken words to text in over 50 languages, supporting extensions.
Automatic Language Detection: Identifies the primary spoken language; defaults to English if uncertain.
Multi-language Speech Identification: Detects and transcribes multiple languages within different audio segments, combining them into a single transcript.
Closed Captioning: Generates captions in VTT, TTML, and SRT formats.
Two-Channel Processing: Separates and merges transcripts from two audio channels into one timeline.
Noise Reduction: Cleans up noisy or telephony audio using filters based on those used in Skype.
Transcript Customization (CRIS): Allows training of custom speech-to-text models for industry-specific terminology.
Speaker Enumeration: Identifies up to 16 different speakers, mapping who spoke when.
Speaker Statistics: Provides data on the proportion of speech per speaker.
Textual Content Moderation: Detects explicit language in transcripts.
Text-Based Emotion Detection: Analyzes transcripts to detect emotions such as joy, sadness, anger, and fear.
Translation: Translates transcripts into multiple languages.
Audio Effects Detection: Recognizes non-speech sounds like alarms, dog barking, crowd reactions, gunshots, laughter, breaking glass, and silence; included in downloadable caption files.
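
SRT is one of the caption formats listed above, and its layout is simple enough to sketch: a running number, a time range written as HH:MM:SS,mmm --> HH:MM:SS,mmm, the caption text, and a blank line. The Python example below builds SRT text from timestamped segments. The function names, the sample segments, and the bracketed style of the audio effect caption are assumptions for illustration, not output taken from the service.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for number, (start, end, text) in enumerate(segments, start=1):
        time_range = f"{format_srt_time(start)} --> {format_srt_time(end)}"
        blocks.append(f"{number}\n{time_range}\n{text}\n")
    # Caption blocks are separated by a blank line.
    return "\n".join(blocks)


# Made-up segments: one spoken line and one non-speech audio effect.
segments = [
    (0.0, 2.5, "Hello and welcome."),
    (2.5, 4.0, "[Laughter]"),
]

print(to_srt(segments))
```

VTT uses a very similar layout, with a WEBVTT header line and a period instead of a comma before the milliseconds.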

Audio and Video Models (Multi-Channel)

Keywords Extraction: Extracts important keywords from both speech and visual text.
Named Entities Extraction: Identifies brands, locations, and people using natural language processing (NLP) on speech and visual text.
Topic Inference: Determines topics by analyzing keywords, using ontologies like IPTC, Wikipedia, and Video Indexer’s hierarchical topic ontology.
Artifacts Extraction: Provides detailed metadata and insights related to each AI model’s output.
Sentiment Analysis: Detects positive, negative, and neutral sentiments from spoken words and visual text.

Note: Some model results may be partial when indexing from a single channel.

Using Azure AI Video Indexer Safely and Legally

Users must follow all applicable laws when using Azure AI Video Indexer and not violate others’ rights or cause harm.
Before uploading videos or images, users must have all necessary legal rights and consents, especially for individuals appearing in the content.
Some jurisdictions have special rules for sensitive data like biometric information; users must ensure compliance with these laws when processing or storing such data.
For detailed information on compliance, privacy, and security, users should consult the Microsoft Trust Center.
The Privacy Statement, Online Services Terms (OST), and Data Processing Addendum (DPA) explain Microsoft’s privacy practices, data handling, retention, and deletion policies.
Using Azure AI Video Indexer means agreeing to these legal terms (OST, DPA, Privacy Statement).

💰 Pricing Overview

Free Trial Account:
- Web Users: Up to 10 hours (600 minutes) of free indexing.
- API Users: Up to 40 hours (2,400 minutes) of free indexing.
Paid Unlimited Account:
- Designed for larger-scale indexing. Requires an Azure subscription.
- Pricing is based on the duration of the input file and the selected analysis presets.
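
The free trial limits above translate into a simple budget check. The sketch below, using hypothetical helper names, tracks how many free indexing minutes remain and whether a given video still fits. The 600 and 2,400 minute quotas come from the figures listed above; everything else is an assumption for illustration and is not a billing calculation.

```python
# Free trial quotas in minutes, taken from the limits listed above.
FREE_TRIAL_MINUTES = {"web": 600, "api": 2400}


def remaining_free_minutes(channel, minutes_used):
    """Minutes of free indexing left for the given channel ("web" or "api")."""
    quota = FREE_TRIAL_MINUTES[channel]
    return max(quota - minutes_used, 0)


def fits_in_free_trial(channel, minutes_used, video_minutes):
    """True if a video of the given length still fits in the remaining free quota."""
    return video_minutes <= remaining_free_minutes(channel, minutes_used)


print(remaining_free_minutes("web", 450))    # 150
print(fits_in_free_trial("api", 2300, 90))   # True
print(fits_in_free_trial("web", 590, 15))    # False
```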

Written by: Irene Bonso

Irene Bonso is a Software Engineer at Tutorials Dojo and an active member of the AWS Community Builder Program. She focuses on gaining knowledge and making it accessible to a broader audience through her contributions and insights.
