Last updated on August 14, 2023
Azure Machine Learning Data Concepts
URI
- A Uniform Resource Identifier (URI) represents a storage location on a local computer, in Azure storage, or at a publicly available http(s) location.
- A URI can serve as an input or an output of an Azure Machine Learning job and can be mapped to the compute target filesystem in one of four modes: read-only mount, read-write mount, download, or upload.
- URIs use identity-based authentication to connect to storage services, with either your Azure Active Directory identity (Azure AD is now named Microsoft Entra ID) or a managed identity.
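The kinds of location a URI can point to can be told apart by scheme. The following is a minimal sketch using only the Python standard library. The helper name classify_uri, the category labels, and the example account and datastore names are illustrative assumptions, not part of the Azure Machine Learning SDK.

```python
from urllib.parse import urlparse


def classify_uri(uri: str) -> str:
    """Return a rough category for a storage URI (hypothetical helper)."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in ("http", "https"):
        return "http"            # public URL or storage endpoint over http(s)
    if scheme in ("wasbs", "abfss"):
        return "azure-storage"   # Blob or ADLS Gen2 style URI
    if scheme == "azureml":
        return "datastore"       # path inside an Azure Machine Learning datastore
    return "local"               # no recognized scheme: treat as a local path


# Example locations; the account and datastore names are made up.
for example in (
    "./data/train.csv",
    "https://example.com/data/train.csv",
    "abfss://myfs@myaccount.dfs.core.windows.net/data/train.csv",
    "azureml://datastores/mystore/paths/data/train.csv",
):
    print(example, "->", classify_uri(example))
```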
Data types
- Azure Machine Learning supports three data types: File (uri_file), Folder (uri_folder), and Table (mltable).
- File: References a single file, which can be in any format.
- Folder: References a single folder. Useful for deep-learning tasks that read many files such as images, text, audio, or video.
- Table: References a data table. Suitable when the schema is complex or changes frequently, or when you need only a subset of a large tabular dataset.
Data runtime capability
- Azure Machine Learning uses its own data runtime for mounts, uploads, downloads, and materialization of tabular data into pandas or Spark.
- The data runtime is written in Rust for high speed and memory efficiency.
- It has no dependencies on other technologies (a JVM, for example), so it installs quickly on compute targets.
- It supports multi-process (parallel) data loading and pre-fetching, which improves GPU utilization in deep-learning operations.
- It provides seamless authentication to cloud storage.
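To illustrate why pre-fetching helps GPU utilization, the sketch below loads items on a background thread while the consumer processes the previous one. It is a generic standard-library illustration of the idea only, not how the Azure Machine Learning data runtime is implemented. The function name prefetch and its buffer_size argument are assumptions.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the source


def prefetch(source: Iterable[T], buffer_size: int = 2) -> Iterator[T]:
    """Yield items from source while a background thread reads ahead."""
    buf: "queue.Queue" = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        try:
            for item in source:
                buf.put(item)  # blocks when the buffer is full
        finally:
            buf.put(_DONE)     # always signal completion to the consumer

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


# Items arrive in order while the next ones are fetched in the background.
batches = list(prefetch(range(5), buffer_size=2))
print(batches)
```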
Datastore
- An Azure Machine Learning datastore is a reference to an existing Azure storage account.
- It provides a common API for interacting with different storage types (Blob, Files, ADLS).
- Datastores make useful storage easier to discover in team operations. For credential-based access, the datastore secures the connection information so that it does not have to appear in scripts.
- Authentication methods include credential-based (service principal, SAS, or key) and identity-based (Azure Active Directory or managed identity).
Data asset
- An Azure Machine Learning data asset is a reference to a frequently used data source, stored under a friendly name.
- Creating a data asset records metadata and a reference to the data source location. Because the data stays where it is, there is no extra storage cost and no risk to the integrity of the data source.
- Data assets can be created from Azure Machine Learning datastores, Azure Storage, public URLs, or local files.
Data splits & cross-validation (Python)
Data Splits
- In Azure automated machine learning, the data is typically split at random, by rows, into a training set and a validation set.
- The AutoMLConfig object holds the configuration for submitting an automated ML experiment in Azure Machine Learning, including the experiment parameters and the training data.
Provide validation data
- Supply a separate validation dataset (the validation_data parameter of AutoMLConfig) so that the model's performance is assessed on data it did not see during training.
Provide validation set size
- If no validation dataset is supplied, control the size of the validation set by specifying the fraction of the training data to hold out (the validation_size parameter of AutoMLConfig, a value between 0.0 and 1.0). The held-out rows are used to evaluate how well the model generalizes.
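The hold-out idea above can be sketched in plain Python. This is an illustration under stated assumptions, not the splitting code that automated ML runs. The helper name holdout_split and its seed argument are hypothetical.

```python
import random
from typing import List, Tuple


def holdout_split(
    n_rows: int, validation_size: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly assign row indices to a training set and a validation set."""
    if not 0.0 < validation_size < 1.0:
        raise ValueError("validation_size must be strictly between 0.0 and 1.0")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)               # random, row-based split
    n_val = max(1, round(n_rows * validation_size))    # at least one validation row
    validation = sorted(indices[:n_val])
    training = sorted(indices[n_val:])
    return training, validation


train_idx, val_idx = holdout_split(n_rows=10, validation_size=0.2)
print(len(train_idx), "training rows,", len(val_idx), "validation rows")
```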
K-fold cross-validation
- Divides the data into K subsets, or “folds.” Each fold serves once as the validation set while the model trains on the remaining K-1 folds, and the results are averaged across the K iterations for a more robust evaluation. In AutoMLConfig, K is set with the n_cross_validations parameter.
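A minimal sketch of how K folds can be generated from row indices follows. It uses contiguous folds without shuffling to keep the logic visible. The helper name k_fold_indices is an assumption, and automated ML's own fold assignment may differ.

```python
from typing import List, Tuple


def k_fold_indices(n_rows: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (training, validation) index pairs covering every row once."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n_rows))
        folds.append((training, validation))
        start += size
    return folds


for fold_number, (training, validation) in enumerate(k_fold_indices(10, 3), start=1):
    print(f"fold {fold_number}: validate on {validation}, train on {len(training)} rows")
```

Each row lands in exactly one validation fold, so every observation contributes to the evaluation once.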
Monte Carlo cross-validation
- Generates multiple random training and validation splits and evaluates the model on each one, which mitigates the bias that any single split could introduce into the evaluation. In AutoMLConfig, specify both validation_size and n_cross_validations.
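The repeated random splitting described above can be sketched as follows. Unlike k-fold, a row may appear in several validation sets or in none. The helper name monte_carlo_splits and its arguments are illustrative assumptions, not the automated ML implementation.

```python
import random
from typing import List, Tuple


def monte_carlo_splits(
    n_rows: int, validation_size: float, n_splits: int, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Draw n_splits independent random (training, validation) splits."""
    if not 0.0 < validation_size < 1.0:
        raise ValueError("validation_size must be strictly between 0.0 and 1.0")
    rng = random.Random(seed)
    n_val = max(1, round(n_rows * validation_size))
    splits = []
    for _ in range(n_splits):
        validation = set(rng.sample(range(n_rows), n_val))  # fresh draw each time
        training = [i for i in range(n_rows) if i not in validation]
        splits.append((training, sorted(validation)))
    return splits


for training, validation in monte_carlo_splits(n_rows=10, validation_size=0.2, n_splits=3):
    print("validate on", validation, "train on", len(training), "rows")
```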
Specify custom cross-validation data folds
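- Supply your own cross-validation folds by naming CV split columns in the model configuration (the cv_split_column_names parameter of AutoMLConfig). Each CV split column defines one split, marking every row as either training (1) or validation (0), which gives you control over how the data is divided.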
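As a sketch of what CV split columns encode, the example below reads two hypothetical columns, cv1 and cv2, from a small in-memory table and turns each into a training/validation split, assuming the convention that 1 marks a training row and 0 a validation row. The data, column names, and helper name are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Toy rows: "x" and "label" are data, "cv1" and "cv2" are the split columns.
rows: List[Dict[str, float]] = [
    {"x": 0.1, "label": 0, "cv1": 1, "cv2": 0},
    {"x": 0.4, "label": 1, "cv1": 1, "cv2": 1},
    {"x": 0.7, "label": 0, "cv1": 0, "cv2": 1},
    {"x": 0.9, "label": 1, "cv1": 0, "cv2": 1},
]


def splits_from_columns(
    table: Sequence[Dict[str, float]], cv_columns: Sequence[str]
) -> List[Tuple[List[int], List[int]]]:
    """Build one (training, validation) index pair per CV split column."""
    result = []
    for column in cv_columns:
        training = [i for i, row in enumerate(table) if row[column] == 1]
        validation = [i for i, row in enumerate(table) if row[column] == 0]
        result.append((training, validation))
    return result


for column, (training, validation) in zip(
    ("cv1", "cv2"), splits_from_columns(rows, ["cv1", "cv2"])
):
    print(column, "train:", training, "validate:", validation)
```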
Metric calculation for cross-validation in machine learning
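- With k-fold or Monte Carlo cross-validation, metrics are calculated on each validation fold and then aggregated (scalar metrics are averaged, charts are summed), which gives a more reliable assessment of model performance than a single split.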
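Averaging scalar metrics across folds can be sketched in a few lines. The metric names and values below are made-up placeholders, and the helper name aggregate_scalar_metrics is an assumption.

```python
from statistics import mean
from typing import Dict, List


def aggregate_scalar_metrics(fold_metrics: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each scalar metric over all validation folds."""
    if not fold_metrics:
        raise ValueError("at least one fold is required")
    names = fold_metrics[0].keys()
    return {name: mean(fold[name] for fold in fold_metrics) for name in names}


# Placeholder per-fold results, one dict per validation fold.
per_fold = [
    {"accuracy": 0.80, "auc": 0.85},
    {"accuracy": 0.90, "auc": 0.95},
    {"accuracy": 1.00, "auc": 0.90},
]
print(aggregate_scalar_metrics(per_fold))
```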