In the rapidly evolving domain of machine learning, ensuring fairness and explainability in model predictions has become crucial. With Amazon SageMaker Clarify, these critical aspects are not just an afterthought but integral components of the model development and deployment process. This article delves into the world of SageMaker Clarify, offering a comprehensive guide to its capabilities and practical applications.

We commence our journey with a high-level understanding of what SageMaker Clarify is and its importance in the day-to-day tasks of machine learning modeling. Our exploration is anchored in a hands-on example, utilizing a specially crafted dataset that simulates loan approval scenarios in the Philippines. This dataset, designed to exhibit certain biases, serves as a perfect canvas to demonstrate the prowess of SageMaker Clarify in identifying and addressing fairness issues in machine learning models.

As we navigate through the intricate paths of machine learning model development, we’ll be using AWS’s Python SDK, closely following the documentation with some adaptations to suit our unique dataset. Our focus will be on a range of critical topics, from the prerequisites of using SageMaker Clarify to the training of an XGBoost model. We’ll then delve into how SageMaker Clarify helps in detecting bias in the model predictions and explains these predictions in a transparent and understandable manner.

Join us as we embark on this enlightening journey to master SageMaker Clarify, and arm ourselves with the knowledge and tools to build not only effective but also fair and explainable machine learning models.

Amazon SageMaker Clarify is a powerful tool designed to bring transparency and fairness into the realm of machine learning. In a world where AI-driven decisions increasingly impact every aspect of our lives, SageMaker Clarify stands as a beacon of accountability and understanding.
It serves as a crucial component in the Amazon SageMaker suite, ensuring that machine learning models are not only efficient but also equitable and interpretable.

SageMaker Clarify seamlessly integrates into your existing AWS machine learning workflow. Whether you’re starting from scratch or have a pre-existing model, Clarify can be incorporated at various stages, from the initial data preparation phase to post-deployment. This flexibility allows for continuous monitoring and improvement of your models, ensuring they remain fair and understandable throughout their lifecycle.

In our case study, we’ll be using an artificial dataset simulating loan approvals in the Philippines. This dataset, purposefully designed to exhibit biases, is an ideal testbed for demonstrating the capabilities of SageMaker Clarify. Through this example, we will witness firsthand how Clarify detects biases in the dataset and in the machine learning model. This practical application not only underscores the importance of fairness in AI but also showcases the ease with which SageMaker Clarify can be integrated into everyday machine learning tasks.

In conclusion, SageMaker Clarify is not just a tool; it’s a commitment to responsible AI. By ensuring fairness and explainability, it empowers developers and businesses to create machine learning models that are not only high-performing but also equitable and transparent, fostering trust and reliability in AI-driven decisions.

The first step in our journey involves setting up the Python environment with the necessary libraries. This setup ensures that all tools required for data manipulation, machine learning, and interaction with AWS services are readily available.

The following libraries form the foundation of our work: SageMaker-specific libraries like session and get_execution_role for managing SageMaker sessions and roles.

Setting up the SageMaker session and defining the role is crucial for integrating our local environment with AWS services.
This step allows us to interact seamlessly with SageMaker and other AWS services throughout our project.

We’ll be using a pre-prepared dataset that represents loan applications in the Philippines. This dataset is specifically designed to showcase potential biases and will serve as the foundation for our analysis with SageMaker Clarify. You can download this dataset via this link.

Preprocessing involves normalizing numerical features and encoding categorical ones, preparing the dataset for machine learning models.

A thorough understanding of our dataset is critical for identifying and addressing potential biases. It includes: The target variable is the loan approval status, which we will analyze for bias using SageMaker Clarify. By understanding these features, we can better comprehend how a model might develop biases and work proactively towards creating a more equitable machine learning solution.

In this section, we will go through the process of training an XGBoost model using our prepared dataset.

Before training, we need to upload our dataset to Amazon S3, AWS’s scalable storage service. This process ensures that our data is accessible to the SageMaker training job.

XGBoost is a popular and efficient open-source implementation of gradient-boosted trees, renowned for its performance and speed. In this step, we’ll configure and initiate the training of an XGBoost model on our dataset.

Once the training is complete, the next step is to create a SageMaker model. This model will be used for making predictions and will also be the subject of our fairness and explainability analysis with SageMaker Clarify.

In this section, we have successfully uploaded our data to S3, trained an XGBoost model, and created a SageMaker model. These steps lay the groundwork for the subsequent stages where we will use SageMaker Clarify to detect bias and explain predictions made by our model.

Detecting and addressing bias is a pivotal aspect of responsible AI practices.
In this section, we explore how Amazon SageMaker Clarify helps in identifying and mitigating biases in machine learning models.

Bias in machine learning refers to the unfair and prejudicial treatment of certain groups based on their characteristics, like gender or ethnicity. This unfair treatment often stems from the data the model is trained on or the way the model processes data. Biases can significantly impact individuals and communities, leading to skewed and unjust outcomes. Therefore, it’s crucial to detect and mitigate these biases to ensure fairness and equity in AI-driven decisions.

SageMaker Clarify provides tools to detect both pre-training and post-training biases using a variety of metrics. Pre-training bias arises from the training data itself, while post-training bias may develop during the model’s learning process.

To start with, we initialize a SageMakerClarifyProcessor, which will compute the bias metrics and model explanations:

DataConfig informs SageMaker Clarify about the data used for the bias analysis:

This configuration specifies the S3 paths for input data and output reports, the target label, column headers, and the dataset type.

ModelConfig defines the trained model details:

BiasConfig is used to specify parameters for bias detection:

In our example, we focus on gender as the sensitive attribute and age as the subgroup for measuring bias.

In our scenario, pre-training bias would relate to any inherent biases in the dataset, such as disproportionate representation of certain genders or ethnicities. Post-training bias would concern biases that the model may develop as it learns from this data, potentially exacerbating or creating new biases.

Finally, we run the bias analysis using SageMaker Clarify:

This process comprehensively examines both pre-training and post-training biases, offering insights into areas where the model might be exhibiting unfair biases. By addressing these biases, we can work towards more fair and equitable AI systems.
After running the SageMaker Clarify analysis, you can view the results of the bias report. If you are following the demo locally, you can access the report by navigating to the output of the following command:

You can then download the report from this path and view it. If you are following the demo using SageMaker Studio, the results can be viewed directly in the “Experiments” tab.

The Amazon SageMaker Clarify bias report is comprehensive, structured into various sections:

Each of these sections provides valuable insights into different aspects of bias in the machine learning model, allowing for a comprehensive understanding of where biases might exist and how they manifest in both the data and the model’s predictions. You can check the whole bias report in this link.

In the realm of machine learning, especially in applications with significant social impacts like loan approvals, understanding the ‘why’ behind a model’s decision is as important as the decision itself. Amazon SageMaker Clarify employs Kernel SHAP (SHapley Additive exPlanations) to elucidate the contribution of each input feature to the final decision. This method, grounded in cooperative game theory, offers a way to interpret complex model predictions by assigning each feature an importance value for a particular prediction.

For running the run_explainability API call, SageMaker Clarify requires configurations similar to those used for bias detection, including DataConfig and ModelConfig. Additionally, SHAPConfig is introduced specifically for the Kernel SHAP algorithm. In our demonstration, we configure SHAPConfig with the following parameters:

The actual execution of the explainability analysis involves running the run_explainability method, which takes about 10-15 minutes:

The Explainability Report generated by SageMaker Clarify offers an in-depth look at how different features influenced the model’s predictions.
The report includes: This detailed breakdown enables a deeper understanding of the model’s decision-making process, highlighting the factors that are most influential in predictions. Such transparency is crucial not only for regulatory compliance but also for building trust in machine learning systems among users and stakeholders. You can check the whole explainability report in this link.

As we conclude our exploration of Amazon SageMaker Clarify, it’s clear that this tool is pivotal in fostering fairness and transparency in machine learning models. Through our journey, from setting up our environment to training an XGBoost model and using SageMaker Clarify, we’ve seen firsthand the impact and necessity of these tools in contemporary machine learning practices.

As machine learning continues to evolve and integrate more deeply into various sectors, the importance of tools like SageMaker Clarify cannot be overstated. They are essential for building models that not only perform well but also align with our ethical standards and societal values. The journey towards responsible AI is ongoing, and SageMaker Clarify is a powerful ally in this endeavor.

We encourage practitioners in the field of machine learning and data science to leverage SageMaker Clarify in their projects. By doing so, we can collectively work towards more equitable and transparent AI systems. Remember, the goal is not just to create intelligent machines but to ensure that these machines make decisions that are fair, understandable, and accountable.

https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/clarify-model-explainability.html
https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker-clarify

Introduction
What is SageMaker Clarify?
Core Functions
Integrating with Your Machine Learning Workflow
Why SageMaker Clarify Matters
Prerequisites and Data
Importing Libraries
import pandas as pd
import numpy as np
import os
import boto3
from datetime import datetime
from sagemaker import session, get_execution_role
from sklearn.model_selection import train_test_split
Initializing Configurations
# Initialize SageMaker session
sagemaker_session = session.Session()
region = sagemaker_session.boto_region_name
print(f"Region: {region}")
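# Define the role based on your environment.
# Local environment: pass the ARN of your SageMaker execution role explicitly
role = "arn:aws:iam::123123123:role/service-role/AmazonSageMaker-ExecutionRole-123123123"
# or, if using SageMaker Studio, use the attached execution role instead:
# role = get_execution_role()
print(f"Role: {role}")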
Downloading the Data
Preprocessing
Scaling the numerical features:
# Scale the numerical features
from sklearn.preprocessing import StandardScaler
numerical_features = [
    "monthly_income", "credit_score", "employment_years",
    "age", "debt_to_income", "other_obligations",
]
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[numerical_features])
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=numerical_features)
df = df.drop(columns=numerical_features, axis=1)
df = pd.concat([df, scaled_features_df], axis=1)
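To make the scaling step less of a black box, here is a minimal sketch of the arithmetic behind standard scaling: subtract the column mean, then divide by the population standard deviation. The income values are hypothetical and are not taken from the loan dataset, and the helper name standard_scale is ours, not part of scikit-learn.

```python
from statistics import fmean, pstdev

def standard_scale(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    # Population std (ddof=0) matches what StandardScaler computes.
    # This sketch does not handle a constant column (std == 0).
    std = pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical monthly incomes, for illustration only
incomes = [18000.0, 25000.0, 32000.0, 60000.0, 95000.0]
scaled = standard_scale(incomes)
print([round(z, 3) for z in scaled])
```

After scaling, the column has zero mean and unit variance, so features measured in very different units (pesos, years, ratios) end up on a comparable scale.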
Splitting the dataset:
training_data, testing_data = train_test_split(df, test_size=0.2, random_state=0)
Encoding categorical columns:
from sklearn import preprocessing
def number_encode_features(df):
    result = df.copy()
    encoders = {}
    for column in result.columns:
        if result.dtypes[column] == object:
            encoders[column] = preprocessing.LabelEncoder()
            result[column] = encoders[column].fit_transform(result[column].fillna("None"))
    return result, encoders
training_data = pd.concat([training_data["loan_approved"], training_data.drop(["loan_approved"], axis=1)], axis=1)
training_data, _ = number_encode_features(training_data)
training_data.to_csv("train_data.csv", index=False, header=False)
testing_data, _ = number_encode_features(testing_data)
test_features = testing_data.drop(["loan_approved"], axis=1)
test_target = testing_data["loan_approved"]
test_features.to_csv("test_features.csv", index=False, header=False)
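One caveat worth keeping in mind: number_encode_features fits a fresh LabelEncoder on whichever DataFrame it receives, so the training and testing splits are encoded independently. LabelEncoder assigns codes by sorted unique value, which means the codes only line up when both splits contain the same set of categories. The sketch below mimics that behavior in plain Python; the column values are hypothetical and fit_label_mapping is an illustrative helper, not a scikit-learn function.

```python
def fit_label_mapping(values):
    """Mimic LabelEncoder: sorted unique values map to 0..n-1."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

# Hypothetical categorical column, split into train and test rows
train_col = ["employed", "self-employed", "unemployed", "employed"]
test_col = ["employed", "unemployed"]  # "self-employed" is absent here

train_mapping = fit_label_mapping(train_col)
test_mapping = fit_label_mapping(test_col)
print(train_mapping)
print(test_mapping)

# Reusing the mapping fitted on the training split keeps codes consistent
test_encoded = [train_mapping[v] for v in test_col]
print(test_encoded)
```

If a category can be missing from one split, reusing the encoders returned for the training data when transforming the test data is one way to avoid the mismatch.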
Data Definition
Model Training
Putting Data into S3
from sagemaker.s3 import S3Uploader
from sagemaker.inputs import TrainingInput
bucket = "your-s3-bucket-name"
prefix = "sagemaker-clarify-article/philippines-loan"
# Upload training and testing data to S3
train_uri = S3Uploader.upload("train_data.csv", f"s3://{bucket}/{prefix}")
train_input = TrainingInput(train_uri, content_type="csv")
test_uri = S3Uploader.upload("test_features.csv", f"s3://{bucket}/{prefix}")
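The built-in XGBoost algorithm reads CSV training data with the target in the first column and without a header row, which is why the preprocessing step moved loan_approved to the front and wrote the files with header=False. As a rough illustration of that layout, the sketch below serializes two made-up rows; to_xgboost_csv and the sample values are hypothetical and only stand in for what pandas does in the article's code.

```python
import csv
import io

def to_xgboost_csv(rows, columns, label):
    """Serialize dict rows as header-less CSV with the label column first."""
    ordered = [label] + [c for c in columns if c != label]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[c] for c in ordered])
    return ordered, buffer.getvalue()

# Hypothetical rows, for illustration only
columns = ["gender", "age", "loan_approved"]
rows = [
    {"gender": 0, "age": 0.5, "loan_approved": 1},
    {"gender": 1, "age": -1.2, "loan_approved": 0},
]
ordered, payload = to_xgboost_csv(rows, columns, "loan_approved")
print(ordered)
print(payload)
```

Because the file carries no header, the column order has to be communicated separately, which is what the headers argument of DataConfig does later on.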
Training an XGBoost Model
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
# Retrieve the XGBoost image
xgboost_image_uri = retrieve("xgboost", region, version="1.5-1")
# Configure the XGBoost model
xgb = Estimator(
    xgboost_image_uri,
    role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    disable_profiler=True,
    sagemaker_session=sagemaker_session,
)
# Set hyperparameters for XGBoost
xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    objective="binary:logistic",
    num_round=800,
)
# Start the training job
xgb.fit({"train": train_input}, logs=False)
Create a SageMaker Model
model_name = "DEMO-clarify-model-{}".format(datetime.now().strftime("%d-%m-%Y-%H-%M-%S"))
# Create a SageMaker model
model = xgb.create_model(name=model_name)
container_def = model.prepare_container_def()
sagemaker_session.create_model(model_name, role, container_def)
Amazon SageMaker Clarify
Detecting Bias
Understanding Bias in Machine Learning
SageMaker Clarify for Bias Detection
Initializing Clarify
from sagemaker import clarify
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=sagemaker_session,
)
DataConfig: Setting Up Data for Bias Analysis
bias_report_output_path = f"s3://{bucket}/{prefix}/clarify-bias"
bias_data_config = clarify.DataConfig(
    s3_data_input_path=train_uri,
    s3_output_path=bias_report_output_path,
    label="loan_approved",
    headers=training_data.columns.to_list(),
    dataset_type="text/csv",
)
ModelConfig and ModelPredictedLabelConfig: Configuring the Model
model_config = clarify.ModelConfig(
    model_name=model_name,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
    content_type="text/csv",
)
ModelPredictedLabelConfig sets up how SageMaker Clarify interprets the model’s predictions:
predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.8)
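Since the model was trained with the binary:logistic objective, it returns a probability rather than a class, and probability_threshold tells Clarify where to cut that probability into an approve/reject label. The sketch below shows the effect of moving the cut from 0.5 to 0.8 on a few made-up scores. Treating a score exactly equal to the threshold as negative is our assumption here, and to_labels is an illustrative helper rather than a Clarify function.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities to binary labels."""
    # Assumption: only scores strictly above the threshold count as positive
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical model scores, for illustration only
scores = [0.55, 0.79, 0.81, 0.95]
default_labels = to_labels(scores)
strict_labels = to_labels(scores, threshold=0.8)
print(default_labels)
print(strict_labels)
```

A higher threshold means fewer applicants are counted as predicted approvals, which in turn changes every post-training bias metric that is computed from predicted labels.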
BiasConfig: Specifying Bias Parameters
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="gender",
    facet_values_or_threshold=[0],
    group_name="age",
)
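The group_name argument matters for Conditional Demographic Disparity (CDD), one of the metrics Clarify reports. As we understand the metric, demographic disparity for a facet is its share of rejected outcomes minus its share of accepted outcomes, and CDD averages that quantity over subgroups, weighted by subgroup size. The sketch below follows that definition on made-up data with two hypothetical age bands; it is an illustration of the formula, not Clarify's implementation, and it assumes every subgroup contains at least one accepted and one rejected row.

```python
from collections import defaultdict

def demographic_disparity(facet, labels, disadvantaged_value=0):
    """DD_d: facet d's share of rejections minus its share of acceptances."""
    rejected = [f for f, y in zip(facet, labels) if y == 0]
    accepted = [f for f, y in zip(facet, labels) if y == 1]
    share_rejected = rejected.count(disadvantaged_value) / len(rejected)
    share_accepted = accepted.count(disadvantaged_value) / len(accepted)
    return share_rejected - share_accepted

def conditional_demographic_disparity(facet, labels, groups, disadvantaged_value=0):
    """CDD: size-weighted average of DD_d computed within each subgroup."""
    by_group = defaultdict(list)
    for f, y, g in zip(facet, labels, groups):
        by_group[g].append((f, y))
    weighted_sum = 0.0
    for members in by_group.values():
        group_facet = [f for f, _ in members]
        group_labels = [y for _, y in members]
        dd = demographic_disparity(group_facet, group_labels, disadvantaged_value)
        weighted_sum += len(members) * dd
    return weighted_sum / len(labels)

# Hypothetical data: gender facet, approval label, and an age band per applicant
facet = [0, 0, 1, 1, 0, 0, 1, 1]
labels = [0, 1, 1, 1, 0, 0, 0, 1]
groups = ["young", "young", "young", "young", "older", "older", "older", "older"]

dd = demographic_disparity(facet, labels)
cdd = conditional_demographic_disparity(facet, labels, groups)
print(round(dd, 4), round(cdd, 4))
```

Comparing the overall disparity with the conditional one helps show whether an apparent gap persists once applicants are compared within the same subgroup.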
Pre-training vs Post-training Bias
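To make the distinction concrete, the sketch below computes two pre-training metrics from observed labels, Class Imbalance (CI) and Difference in Proportions of Labels (DPL), and two post-training metrics from predicted labels, Difference in Positive Proportions in Predicted Labels (DPPL) and Disparate Impact (DI). The data is made up, bias_metrics is an illustrative helper, and the formulas follow the definitions in the Clarify documentation as we read them; Clarify itself computes many more metrics.

```python
def positive_rate(labels):
    """Fraction of positive (1) labels."""
    return sum(labels) / len(labels)

def bias_metrics(facet, observed, predicted, disadvantaged_value=0):
    """Compute CI, DPL, DPPL, and DI for a binary facet."""
    idx_d = [i for i, f in enumerate(facet) if f == disadvantaged_value]
    idx_a = [i for i, f in enumerate(facet) if f != disadvantaged_value]
    n_a, n_d = len(idx_a), len(idx_d)
    q_a = positive_rate([observed[i] for i in idx_a])
    q_d = positive_rate([observed[i] for i in idx_d])
    qhat_a = positive_rate([predicted[i] for i in idx_a])
    qhat_d = positive_rate([predicted[i] for i in idx_d])
    return {
        "CI": (n_a - n_d) / (n_a + n_d),  # pre-training: facet representation
        "DPL": q_a - q_d,                 # pre-training: gap in observed approvals
        "DPPL": qhat_a - qhat_d,          # post-training: gap in predicted approvals
        "DI": qhat_d / qhat_a,            # post-training: ratio of predicted approvals
    }

# Hypothetical toy data; facet value 0 plays the role of the group
# selected by facet_values_or_threshold in BiasConfig
facet = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
observed = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
predicted = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

metrics = bias_metrics(facet, observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy example the gap in predicted approvals (DPPL) is larger than the gap in observed approvals (DPL), which is the kind of amplification post-training metrics are meant to surface.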
Running Bias Report Processing
clarify_processor.run_bias(
    data_config=bias_data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predictions_config,
    pre_training_methods="all",
    post_training_methods="all",
)
Viewing the Bias Report
Accessing the Report
print(bias_report_output_path)
Report Overview
Explaining Predictions with Kernel SHAP
Explainability Report Configuration
explainability_output_path = f"s3://{bucket}/{prefix}/clarify-explainability"
explainability_data_config = clarify.DataConfig(
    s3_data_input_path=train_uri,
    s3_output_path=explainability_output_path,
    label="loan_approved",
    headers=training_data.columns.to_list(),
    dataset_type="text/csv",
)
baseline = [training_data.mean().iloc[1:].values.tolist()]
shap_config = clarify.SHAPConfig(
    baseline=baseline,
    num_samples=15,
    agg_method="mean_abs",
    save_local_shap_values=True,
)
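To build some intuition for what the baseline does, the sketch below computes exact Shapley values for a tiny hand-written scoring function by enumerating every feature coalition, with features outside a coalition replaced by their baseline value. Kernel SHAP approximates these values from a sample of coalitions instead of enumerating all of them, which is where num_samples comes in. The function toy_model, its coefficients, and the inputs are hypothetical and unrelated to the trained XGBoost model.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi

# Hypothetical scoring function with an interaction term
def toy_model(features):
    income, credit, debt = features
    return 0.5 * income + 0.3 * credit - 0.4 * debt + 0.2 * income * credit

x = [1.0, 2.0, 0.5]
baseline_point = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline_point)
print([round(p, 4) for p in phi])
```

The attributions add up to the difference between the prediction for x and the prediction for the baseline, so the choice of baseline (the feature means in our configuration) determines what each SHAP value is measured against.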
Running Explainability Report Processing
clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
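With agg_method="mean_abs", the per-row (local) SHAP values are rolled up into one global importance score per feature by averaging their absolute values. A minimal sketch of that aggregation is shown below; the local SHAP values are made up and aggregate_mean_abs is an illustrative helper, not part of the SageMaker SDK.

```python
def aggregate_mean_abs(local_shap_rows):
    """Global importance per feature: mean of absolute local SHAP values."""
    n_rows = len(local_shap_rows)
    n_features = len(local_shap_rows[0])
    return [
        sum(abs(row[j]) for row in local_shap_rows) / n_rows
        for j in range(n_features)
    ]

# Hypothetical local SHAP values: one row per applicant, one column per feature
local_shap = [
    [0.30, -0.10, 0.05],
    [-0.20, 0.40, 0.00],
    [0.10, -0.10, -0.10],
]
global_importance = aggregate_mean_abs(local_shap)
print([round(v, 4) for v in global_importance])
```

Taking absolute values first keeps positive and negative contributions from cancelling out, so a feature that pushes some applicants up and others down still registers as important.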
Viewing the Explainability Report
Wrapping Up
Embracing Fairness and Explainability in Machine Learning
Key Takeaways
Moving Forward
Final Thoughts
Resources:
Amazon AI Fairness and Explainability with Amazon SageMaker Clarify