Last updated on August 14, 2023
Azure Machine Learning Data Concepts
URI
- A Uniform Resource Identifier (URI) represents a storage location on a local computer, in Azure storage, or at a publicly available http(s) location.
- URIs can serve as inputs or outputs of an Azure Machine Learning job, and can be mapped to the compute target filesystem in one of four modes: read-only mount, read-write mount, download, or upload.
- URIs use identity-based authentication to connect to storage services, with either your Azure Active Directory identity or a managed identity.
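As a rough illustration of the kinds of location a URI can point to, the sketch below classifies a URI string by its scheme. The scheme names follow the URI forms documented for Azure Machine Learning; the `classify_uri` helper, the category labels, and the example names are hypothetical and only serve the illustration.

```python
from urllib.parse import urlparse

# Map URI schemes to a human-readable category. The labels are our own wording.
SCHEME_KINDS = {
    "": "local path",
    "file": "local path",
    "http": "http(s) location",
    "https": "http(s) location",
    "wasbs": "Azure Blob storage",
    "abfss": "Azure Data Lake Storage Gen2",
    "azureml": "Azure Machine Learning datastore",
}


def classify_uri(uri: str) -> str:
    """Return a coarse category for a URI based only on its scheme."""
    scheme = urlparse(uri).scheme  # urlparse already lowercases the scheme
    return SCHEME_KINDS.get(scheme, "unknown")


# Example names (mystore, example.com) are placeholders.
for example in (
    "./data/train.csv",
    "https://example.com/train.csv",
    "azureml://datastores/mystore/paths/data/train.csv",
):
    print(example, "->", classify_uri(example))
```

Note that an `https://` URI may also point at an Azure storage account; the scheme alone does not distinguish that case from any other public web location.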
Data types
- Azure Machine Learning supports three data types: File, Folder, and Table.
- File: References a single file, which can have any format.
- Folder: References a single folder; useful for deep-learning tasks that read many files of various types, such as images, text, audio, and video.
- Table: References a data table; suitable when the schema is complex and changes frequently, or when you need a subset of large tabular data.
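In the v2 SDK and CLI these three types are named `uri_file`, `uri_folder`, and `mltable`. The sketch below suggests a type for a local path, assuming the convention that a table is a folder containing a file named `MLTable`. The `suggest_data_type` helper is hypothetical and is not part of any Azure library.

```python
from pathlib import Path


def suggest_data_type(path: str) -> str:
    """Suggest a data type name for a local path (illustrative only)."""
    p = Path(path)
    if p.is_file():
        return "uri_file"  # File: a single file of any format
    if p.is_dir():
        # Table: assumed here to be a folder that holds an "MLTable" file
        if (p / "MLTable").is_file():
            return "mltable"
        return "uri_folder"  # Folder: a directory of arbitrary files
    raise FileNotFoundError(path)
```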
Data runtime capability
- Azure Machine Learning uses its own data runtime for mounts, uploads, downloads, and materialization of tabular data into pandas or Spark.
- The data runtime is written in Rust for speed and efficiency.
- It has no dependencies on other technologies (a JVM, for example), which allows quick installation on compute targets.
- It supports multi-process data loading and pre-fetching to improve GPU utilization in deep-learning operations.
- It provides seamless authentication to cloud storage.
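To give a feel for why pre-fetching helps keep an accelerator busy, here is a minimal sketch of the general idea: a background worker loads upcoming items into a bounded buffer while the consumer processes the current one. This is a plain-Python stand-in that uses a single thread rather than multiple processes; it is not the Azure Machine Learning runtime, and the `prefetch` function is hypothetical.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, loading up to `buffer_size` items ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel that marks the end of the stream

    def producer():
        try:
            for item in iterable:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(done)  # always signal completion, even on error

    # Daemon thread: it does not keep the program alive if the consumer stops early.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


# The consumer sees the items in their original order.
print(list(prefetch(range(5))))
```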
Datastore
- An Azure Machine Learning datastore is a reference to an existing Azure storage account.
- It provides a common API for interacting with different storage types (Blob/Files/ADLS) and makes useful datastores easier to discover in team operations.
- For credential-based access, the datastore secures the connection information, so that information does not have to be placed in scripts.
- Authentication is either credential-based (service principal, SAS token, or account key) or identity-based (Azure Active Directory or managed identity).
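The idea of a datastore as a named reference can be sketched in a few lines. The example below assumes the `azureml://datastores/<name>/paths/<path>` URI form and the standard Blob endpoint format; the `BlobDatastore` class and the account, container, and datastore names are hypothetical, and real datastores also carry credentials, which are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobDatastore:
    """Hypothetical stand-in for a datastore that points at a Blob container."""
    name: str
    account: str
    container: str

    def resolve(self, uri: str) -> str:
        """Translate a datastore URI into the underlying Blob URL."""
        prefix = f"azureml://datastores/{self.name}/paths/"
        if not uri.startswith(prefix):
            raise ValueError(f"URI does not belong to datastore {self.name!r}")
        relative_path = uri[len(prefix):]
        return (
            f"https://{self.account}.blob.core.windows.net/"
            f"{self.container}/{relative_path}"
        )


store = BlobDatastore(name="training_data", account="examplestorage", container="datasets")
print(store.resolve("azureml://datastores/training_data/paths/images/cat.png"))
```

Scripts that use only the datastore name keep working if the team later points the datastore at a different account or container.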
Data asset
- An Azure Machine Learning data asset lets users reference a frequently used data source by a friendly name.
- Creating a data asset stores metadata and a reference to the data source location. The data itself is not copied, so there is no extra storage cost and no risk to the integrity of the data source.
- Data assets can be created from Azure Machine Learning datastores, Azure Storage, public URLs, or local files.
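A data asset can be pictured as a small metadata record that points at data where it already lives. The sketch below models that with a plain in-memory registry and resolves references of the form `azureml:<name>:<version>`. The `DataAsset` and `AssetRegistry` classes and all example values are hypothetical; they are not the Azure SDK types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    name: str
    version: str
    path: str  # reference to the existing location; nothing is copied
    description: str = ""


class AssetRegistry:
    """In-memory stand-in for a workspace's list of data assets."""

    def __init__(self):
        self._assets = {}

    def register(self, asset: DataAsset) -> None:
        key = (asset.name, asset.version)
        if key in self._assets:
            raise ValueError(f"{asset.name}:{asset.version} is already registered")
        self._assets[key] = asset

    def resolve(self, reference: str) -> str:
        """Turn 'azureml:<name>:<version>' into the stored path."""
        prefix, name, version = reference.split(":", 2)
        if prefix != "azureml":
            raise ValueError(f"unsupported reference: {reference!r}")
        return self._assets[(name, version)].path


registry = AssetRegistry()
registry.register(DataAsset("titanic", "1", "https://example.com/titanic.csv"))
print(registry.resolve("azureml:titanic:1"))
```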
Data splits & cross-validation (Python)
Data Splits
- In Azure automated machine learning (AutoML), the recommended approach is to split the data randomly by row into training and validation sets.
- The AutoMLConfig object holds the configuration for submitting an automated ML experiment in Azure Machine Learning, including the parameters and training data for the experiment run.
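The validation options described in the following sections correspond to `AutoMLConfig` keyword arguments: `validation_data`, `validation_size`, `n_cross_validations`, and `cv_split_column_names`. The sketch below keeps the settings in a plain dictionary so it runs without the Azure SDK, and reports which validation approach a given combination implies. The `validation_strategy` helper, the order in which it checks the options, and the column name `label` are assumptions made for illustration.

```python
def validation_strategy(settings: dict) -> str:
    """Describe the validation approach implied by AutoMLConfig-style settings."""
    if settings.get("validation_data") is not None:
        return "user-provided validation data"
    if settings.get("cv_split_column_names"):
        return "custom cross-validation folds"
    has_folds = settings.get("n_cross_validations") is not None
    has_size = settings.get("validation_size") is not None
    if has_folds and has_size:
        return "Monte Carlo cross-validation"
    if has_folds:
        return "k-fold cross-validation"
    if has_size:
        return "train-validation split"
    return "AutoML default"


# Settings that could be passed as AutoMLConfig(**settings, training_data=...).
# training_data is left out because it requires a dataset object from the SDK.
settings = {
    "task": "classification",
    "primary_metric": "accuracy",
    "label_column_name": "label",  # hypothetical column name
    "n_cross_validations": 5,
}
print(validation_strategy(settings))
```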
Provide validation data
- Provide a separate validation set so that the model’s performance is assessed on data it did not see during training. In AutoMLConfig this is the validation_data parameter.
Provide validation set size
- Control the size of the validation set by specifying the fraction of the data to hold out for validation. In AutoMLConfig this is the validation_size parameter, a value between 0.0 and 1.0 (exclusive). The held-out data is used to tune the model and evaluate how well it generalizes.
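A holdout split by fraction can be sketched as follows. This illustrates the general technique only; it is not how AutoML implements the split internally, and the `holdout_split` function and its `seed` parameter are hypothetical.

```python
import random


def holdout_split(rows, validation_size, seed=0):
    """Randomly hold out a fraction of the rows for validation."""
    if not 0.0 < validation_size < 1.0:
        raise ValueError("validation_size must be strictly between 0.0 and 1.0")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_valid = int(round(len(rows) * validation_size))
    valid_indices = set(indices[:n_valid])
    # Preserve the original row order within each subset.
    train = [row for i, row in enumerate(rows) if i not in valid_indices]
    valid = [row for i, row in enumerate(rows) if i in valid_indices]
    return train, valid


train, valid = holdout_split(list(range(10)), validation_size=0.2)
print(len(train), len(valid))
```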
K-fold cross-validation
- K-fold cross-validation divides the data into K subsets, or “folds,” and uses each fold in turn as the validation set while training on the remaining folds. Averaging the results across the K iterations gives a more robust evaluation than a single split.
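The fold construction can be sketched with row indices alone. The version below assigns contiguous blocks of rows to folds without shuffling, which is a simplification; the `k_fold_indices` function is hypothetical.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    base, extra = divmod(n_rows, k)  # the first `extra` folds get one more row
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, validation
        start = stop


for train, validation in k_fold_indices(n_rows=10, k=3):
    print("train:", train, "validation:", validation)
```

Every row appears in exactly one validation fold, so each row contributes to the evaluation once.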
Monte Carlo cross-validation
- Monte Carlo cross-validation generates multiple random training and validation splits, which mitigates the bias that any one particular split would introduce into the model evaluation.
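A minimal sketch of the repeated random splits follows. Unlike k-fold, a row may land in the validation set in several splits or in none. The `monte_carlo_splits` function and its parameters are hypothetical and only illustrate the technique.

```python
import random


def monte_carlo_splits(n_rows, n_splits, validation_size, seed=0):
    """Yield n_splits independent random (train, validation) index splits."""
    if n_rows < 2 or not 0.0 < validation_size < 1.0:
        raise ValueError("need at least 2 rows and 0.0 < validation_size < 1.0")
    rng = random.Random(seed)
    # Keep at least one row on each side of the split.
    n_valid = min(n_rows - 1, max(1, int(round(n_rows * validation_size))))
    for _ in range(n_splits):
        indices = list(range(n_rows))
        rng.shuffle(indices)  # a fresh shuffle for every split
        yield sorted(indices[n_valid:]), sorted(indices[:n_valid])


for train, validation in monte_carlo_splits(n_rows=20, n_splits=4, validation_size=0.25):
    print("validation:", validation)
```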
Specify custom cross-validation data folds
- Specify custom cross-validation data folds using CV split columns in the model configuration, which gives you control over how the data is divided for validation.
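The sketch below shows how CV split columns can define folds, assuming the convention that each split column marks a row with 1 for training and 0 for validation, and that each column describes one split. The rows, the column names `cv1` and `cv2`, and the `folds_from_split_columns` helper are hypothetical.

```python
def folds_from_split_columns(rows, cv_split_column_names):
    """Yield (column, train_rows, validation_rows) for each split column."""
    for column in cv_split_column_names:
        train = [row for row in rows if row[column] == 1]       # 1 = training
        validation = [row for row in rows if row[column] == 0]  # 0 = validation
        yield column, train, validation


rows = [
    {"x": 1, "cv1": 1, "cv2": 0},
    {"x": 2, "cv1": 0, "cv2": 1},
    {"x": 3, "cv1": 1, "cv2": 1},
]
for column, train, validation in folds_from_split_columns(rows, ["cv1", "cv2"]):
    print(column, "train:", [r["x"] for r in train], "validation:", [r["x"] for r in validation])
```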
Metric calculation for cross-validation in machine learning
- Metrics are calculated on each validation fold and then aggregated, which gives an assessment of model performance that does not depend on any single fold.
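For scalar metrics, the aggregation step can be sketched as a simple mean across folds. The metric names and values below are made up for illustration, and the `aggregate_fold_metrics` function is hypothetical; other kinds of metrics (charts, for example) may be aggregated differently.

```python
from statistics import mean


def aggregate_fold_metrics(fold_metrics):
    """Average each scalar metric across the per-fold results."""
    metric_names = fold_metrics[0].keys()
    return {name: mean(fold[name] for fold in fold_metrics) for name in metric_names}


# Made-up per-fold scores, one dictionary per validation fold.
per_fold = [
    {"accuracy": 0.5, "f1": 0.25},
    {"accuracy": 1.0, "f1": 0.75},
]
print(aggregate_fold_metrics(per_fold))
```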
References:
Data concepts in Azure Machine Learning – Azure Machine Learning
Data splits and cross-validation in automated machine learning – Azure Machine Learning
azureml.train.automl.automlconfig.AutoMLConfig class – Azure Machine Learning Python
Secure data access in the cloud v1 – Azure Machine Learning
Evaluate AutoML experiment results – Azure Machine Learning
Azure Machine Learning – ML as a Service | Microsoft Azure