Data Concepts in Azure Machine Learning

Last updated on August 14, 2023

URI

A Uniform Resource Identifier (URI) represents a storage location on a local computer, Azure storage, or a publicly available http(s) location.
URIs can be used as inputs or outputs to an Azure Machine Learning job and can be mapped to the compute target filesystem in different modes: read-only mount, read-write mount, download, or upload.
URIs use identity-based authentication to connect to storage services, with either your Azure Active Directory identity or a managed identity.
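The storage locations above correspond to a handful of URI schemes. The sketch below is a hypothetical helper, not part of any Azure SDK, that classifies a URI by its scheme. The function name, the labels, and the example paths are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Scheme -> storage kind. The labels are illustrative, not official names.
_SCHEMES = {
    "azureml": "Azure Machine Learning datastore",
    "wasbs": "Azure Blob Storage",
    "abfss": "Azure Data Lake Storage Gen2",
    "http": "http(s) location",
    "https": "http(s) location",
}


def classify_uri(uri: str) -> str:
    """Return a human-readable storage kind for a data URI (hypothetical helper)."""
    scheme = urlparse(uri).scheme.lower()
    # No scheme, file://, or a single letter (a Windows drive) means a local path.
    if scheme in ("", "file") or len(scheme) == 1:
        return "local path"
    return _SCHEMES.get(scheme, "unknown")


for example in (
    "./data/train.csv",
    "https://example.com/data/train.csv",
    "azureml://datastores/workspaceblobstore/paths/data/train.csv",
):
    print(example, "->", classify_uri(example))
```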

Data types

Azure Machine Learning supports three data types: File (uri_file), Folder (uri_folder), and Table (mltable).
File: References a single file of any format.
Folder: References a single folder, for example a folder of image, text, audio, or video files used for deep learning.
Table: References a data table and is suitable when the schema is complex and changes frequently, or when you need a subset of large tabular data.
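As a rough illustration of how the three types differ, the sketch below guesses the type of a local path. It assumes that a table is a folder containing an `MLTable` definition file; the `DataType` enum and `infer_data_type` helper are hypothetical names, not SDK calls.

```python
from enum import Enum
from pathlib import Path


class DataType(str, Enum):
    """The three data types and their type identifiers."""
    FILE = "uri_file"
    FOLDER = "uri_folder"
    TABLE = "mltable"


def infer_data_type(path: str) -> DataType:
    """Guess the data type of a local path (hypothetical helper)."""
    p = Path(path)
    if p.is_file():
        return DataType.FILE
    if p.is_dir():
        # Assumption: a table is a folder that holds an `MLTable` definition file.
        if (p / "MLTable").is_file():
            return DataType.TABLE
        return DataType.FOLDER
    raise FileNotFoundError(path)
```

Calling `infer_data_type` on a CSV file would return `DataType.FILE`, while a folder of images would return `DataType.FOLDER`.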

Data runtime capability

Azure Machine Learning uses its own data runtime for mounts, uploads, downloads, and materializing tabular data into pandas or Spark.
The data runtime is written in Rust for speed and efficiency.
It has no dependencies on other technologies, so it installs quickly on compute targets.
It supports multi-process data loading and pre-fetching, which improves GPU utilization in deep-learning workloads.
It provides seamless authentication to cloud storage.

Datastore

An Azure Machine Learning datastore is a reference to an existing Azure storage account.
It provides a common API for interacting with different storage types (Blob/Files/ADLS).
Creating and using datastores makes useful storage easier for a team to discover and, for credential-based access, keeps the connection information secured so that it does not need to appear in scripts.
Authentication methods include credential-based (service principal/SAS/key) and identity-based (Azure Active Directory or managed identity).
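Because a datastore is addressed by name, scripts can point at data without carrying any connection details. The sketch below builds a datastore URI from a datastore name and a relative path. `DatastoreRef` is a hypothetical stand-in for illustration, not an SDK class, and `workspaceblobstore` is used only as an example name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatastoreRef:
    """Hypothetical stand-in for a registered datastore: a name, no secrets."""
    name: str

    def uri_for(self, relative_path: str) -> str:
        """Build the datastore URI for a path inside this datastore."""
        return f"azureml://datastores/{self.name}/paths/{relative_path.lstrip('/')}"


store = DatastoreRef("workspaceblobstore")
print(store.uri_for("raw/train.csv"))
```

The credentials or identity used to reach the underlying storage account stay with the workspace; the script only ever sees the name.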

Data asset

An Azure Machine Learning data asset allows users to create a reference to frequently used data sources with a friendly name.
Creating a data asset stores only metadata and a reference to the data source location, so it incurs no extra storage cost and does not put the integrity of the data source at risk.
Data assets can be created from Azure Machine Learning datastores, Azure Storage, public URLs, or local files.
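The sketch below models what a data asset holds: a friendly name, some metadata, and a pointer to where the data already lives. `DataAssetRef` is a hypothetical class for illustration only, and the `azureml:<name>:<version>` string it produces is assumed here to match the short form used to reference an asset by name and version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataAssetRef:
    """Illustrative record of a data asset: metadata plus a pointer, no data copy."""
    name: str                      # friendly name
    path: str                      # where the data already lives
    type: str = "uri_file"         # uri_file, uri_folder, or mltable
    version: Optional[str] = None  # None is treated here as "latest"
    description: str = ""

    @property
    def reference(self) -> str:
        """Short reference to the asset by name and version (assumed format)."""
        if self.version is None:
            return f"azureml:{self.name}@latest"
        return f"azureml:{self.name}:{self.version}"


asset = DataAssetRef(
    name="titanic",  # example name, not taken from the source
    path="azureml://datastores/workspaceblobstore/paths/raw/titanic.csv",
    version="1",
)
print(asset.reference)
```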

Data splits & cross-validation (Python)

Data Splits

In Azure Automated Machine Learning, the recommended approach is to randomly split the data into training and evaluation sets based on rows.
The AutoMLConfig object represents the configuration for submitting an automated ML experiment in Azure Machine Learning, containing parameters and training data for the experiment run.

Provide validation data

Supply a separate validation set so that the model’s performance is assessed on data it has not seen during training. In AutoMLConfig, pass it through the validation_data parameter.

Provide validation set size

Control the size of the validation set by specifying the fraction of the training data (a value between 0.0 and 1.0) to hold out for validation, so that the model can be tuned and its generalization ability evaluated. In AutoMLConfig, this is the validation_size parameter.
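To make the idea of a row-based hold-out concrete, here is a minimal pure-Python sketch. It is not how automated ML implements the split internally; `holdout_split` is a hypothetical helper whose `validation_size` argument is a fraction of the rows.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    rows: Sequence[T], validation_size: float, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly hold out a fraction of the rows for validation."""
    if not 0.0 < validation_size < 1.0:
        raise ValueError("validation_size must be between 0.0 and 1.0")
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Keep at least one row on each side of the split.
    n_valid = min(len(rows) - 1, max(1, round(len(rows) * validation_size)))
    valid_idx = set(indices[:n_valid])
    train = [row for i, row in enumerate(rows) if i not in valid_idx]
    valid = [row for i, row in enumerate(rows) if i in valid_idx]
    return train, valid


train_rows, valid_rows = holdout_split(list(range(10)), validation_size=0.2)
print(len(train_rows), len(valid_rows))
```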

K-fold cross-validation

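K-fold cross-validation divides the data into K subsets, or “folds,” and uses each fold once as the validation set while training on the remaining folds. Averaging the results across the K iterations gives a more robust evaluation. In AutoMLConfig, set the number of folds with the n_cross_validations parameter.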
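The fold mechanics can be sketched in a few lines of plain Python. This is an illustration of the general technique, not the automated ML implementation; `k_fold_indices` is a hypothetical helper that yields row indices.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_rows: int, k: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


for fold_number, (train, valid) in enumerate(k_fold_indices(10, k=5), start=1):
    print(fold_number, sorted(valid))
```

Every row lands in exactly one validation fold, so each row is used for validation once and for training K - 1 times.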

Monte Carlo cross-validation

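Monte Carlo cross-validation generates multiple random training and validation splits, which mitigates the evaluation bias that any one particular split could introduce. In AutoMLConfig, specify both the validation_size and n_cross_validations parameters.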
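Unlike k-fold, each Monte Carlo split is drawn independently, so a row may appear in several validation sets or in none. The sketch below illustrates the general technique under that assumption; `monte_carlo_splits` is a hypothetical helper, not an automated ML function.

```python
import random
from typing import Iterator, List, Tuple


def monte_carlo_splits(
    n_rows: int, n_splits: int, validation_size: float, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield n_splits independent random (train, validation) index splits."""
    if not 0.0 < validation_size < 1.0:
        raise ValueError("validation_size must be between 0.0 and 1.0")
    if n_rows < 2 or n_splits < 1:
        raise ValueError("need at least two rows and one split")
    rng = random.Random(seed)
    # Keep at least one row on each side of every split.
    n_valid = min(n_rows - 1, max(1, round(n_rows * validation_size)))
    for _ in range(n_splits):
        indices = list(range(n_rows))
        rng.shuffle(indices)  # a fresh shuffle for every split
        yield sorted(indices[n_valid:]), sorted(indices[:n_valid])


for train, valid in monte_carlo_splits(20, n_splits=3, validation_size=0.25):
    print(valid)
```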

Specify custom cross-validation data folds

Specify custom cross-validation data folds with the cv_split_column_names parameter of AutoMLConfig. Each named column defines one split and marks every row as 1 for training or 0 for validation, giving you control over how the data is divided.
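The sketch below shows how CV split columns can be read as folds, assuming the convention that 1 marks a training row and 0 marks a validation row. The column names `cv1` and `cv2`, the example rows, and the `custom_folds` helper are all made up for illustration.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

# Example rows: `x` is a feature, `cv1` and `cv2` are CV split columns.
example_rows = [
    {"x": 1, "cv1": 1, "cv2": 0},
    {"x": 2, "cv1": 1, "cv2": 1},
    {"x": 3, "cv1": 0, "cv2": 1},
    {"x": 4, "cv1": 0, "cv2": 1},
]


def custom_folds(
    rows: Sequence[Dict[str, int]], cv_split_column_names: Sequence[str]
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (column, train_indices, validation_indices) for each split column."""
    for column in cv_split_column_names:
        train = [i for i, row in enumerate(rows) if row[column] == 1]
        valid = [i for i, row in enumerate(rows) if row[column] == 0]
        yield column, train, valid


for column, train, valid in custom_folds(example_rows, ["cv1", "cv2"]):
    print(column, "train:", train, "validation:", valid)
```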

Metric calculation for cross-validation in machine learning

Automated ML calculates metrics on each validation fold and then aggregates them, which gives a more reliable assessment of model performance than a single split.
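For scalar metrics, the aggregation can be sketched as a simple average across folds. This assumes averaging as the aggregation; the `aggregate_fold_metrics` helper and the metric values are invented for illustration.

```python
from statistics import mean
from typing import Dict, List


def aggregate_fold_metrics(fold_metrics: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each scalar metric across the validation folds."""
    if not fold_metrics:
        raise ValueError("no fold metrics to aggregate")
    names = fold_metrics[0].keys()
    return {name: mean(m[name] for m in fold_metrics) for name in names}


# Made-up per-fold results, one dict per validation fold.
per_fold = [
    {"accuracy": 0.5, "auc": 0.25},
    {"accuracy": 0.75, "auc": 0.5},
    {"accuracy": 1.0, "auc": 0.75},
]
print(aggregate_fold_metrics(per_fold))
```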


Written by: Maine Cruz

Charmaine is a DevOps engineer and a Cloud instructor at Tutorials Dojo. She is also an AWS BuildHers+ Mentor in AWS User Group Philippines. Certified on both the AWS and Azure cloud platforms, Charmaine specializes in automating solutions and CI/CD.
