Deploying a Trained CTGAN Model on an EC2 Instance: A Step-by-Step Guide

Last updated on November 30, 2023

Welcome to the first entry in our series on deploying machine learning models in AWS. As cloud computing and machine learning continue to evolve and intersect, understanding the dynamics of deployment becomes invaluable. Whether you’re an enthusiast, a budding data scientist, or a seasoned professional, the insights offered by this series are tailored to empower you to make the most of AWS’s vast ecosystem.

One recurrent pitfall in the journey of many machine learning beginners is the confinement of their models within the boundaries of a Jupyter notebook. Picture this: after hours or even days of data wrangling, feature engineering, model training, and validation, you have a model that boasts of impressive metrics. But what next? All too often, these models do not transition from the experimental phase to a real-world application. They remain trapped within notebooks, unused and unappreciated, even when they have immense potential.

Enter CTGAN, the Conditional Tabular GAN: a generative adversarial network designed for tabular data synthesis. In this guide, we’ll shine a spotlight on CTGAN and particularly how to deploy a trained CTGAN model on an EC2 instance. But we won’t stop there. Our deployment will go a step further by setting up an API that, once triggered, will allow our model to generate data and seamlessly upload it to an S3 bucket. Imagine the potential of an on-demand data generator that populates your storage with synthetic yet realistic datasets!

Before we embark on this journey, a small note for our readers: this guide assumes that you already have an active AWS account and possess a foundational understanding of AWS’s basic concepts and services. If you’re new to AWS, it might be beneficial to familiarize yourself with its core functionalities. Now, with that out of the way, let’s dive into the world of CTGAN deployment on AWS.

Prerequisites

To ensure a smooth and hassle-free deployment, it’s essential to have everything in place. Below are the prerequisites that you should have ready before we dive into the deployment:

  1. AWS Account: If you don’t have one already, sign up for an AWS account. As mentioned in the introduction, a basic understanding of AWS is assumed for this guide.
  2. Python Environment: Ensure you have Python set up on your machine. For this deployment, we’ll be using the following Python libraries:
      • sdv: For synthesizing data.
      • Flask: To build our API.
      • pandas: For data manipulation.
      • boto3: AWS SDK for Python, to interact with AWS services.

3. Pre-trained CTGAN Model: This is the backbone of our project. Have your CTGAN model trained and ready. If you don’t have one, there are several resources online where you can learn to train a CTGAN model or even find pre-trained models.

4. Docker: As we’ll be embracing containerization for this deployment, Docker needs to be installed on your machine. Containers allow us to package our application with all its dependencies into a single unit, ensuring consistent behavior across different environments.

5. EC2 Instance Configuration:

      • Instance Type: We’re using a t2.medium instance. This type offers a balance of compute, memory, and network resources, making it suitable for our deployment.
      • Amazon Machine Image (AMI): The instance will run on the Amazon Linux 2 AMI, which is a general-purpose and widely used AMI.
      • Security Group: Our security group, named “CTGANSynthesizerSG”, has been configured to allow SSH connections via port 22 and Flask API connections via port 5000.
      • Storage: The EC2 instance has a block storage (EBS) of size 30GB with type gp2, which is a general-purpose SSD volume type.
      • Key Pair: For this project, we created a key pair that allows secure SSH access to the EC2 instance, keeping the deployment both safe and easily accessible.
      • IAM Role: To keep things simple, we granted the EC2 instance an IAM role named “EC2S3FullAccess”, which provides full permissions to interact with S3 so that our application can seamlessly upload generated data to the AWS storage service. Since the API will also push logs and metrics to Amazon CloudWatch later in this guide, the role should include CloudWatch permissions as well (a quick way to sanity-check the role from the instance is shown after this list).

Ensure that you can SSH into this EC2 instance from your local machine. If you’re not familiar with how to do this, AWS provides comprehensive documentation on connecting to your EC2 instance.
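
Once the instance is running, it’s worth confirming that the attached IAM role can actually reach S3 before deploying anything. Below is a minimal sanity check using boto3; it assumes the same mlops-python bucket and models/ctgan/ prefix used later in this guide, so substitute your own names if they differ.

import boto3
from botocore.exceptions import ClientError

# Quick sanity check: can this instance's IAM role reach the target bucket?
s3 = boto3.client('s3')
try:
    response = s3.list_objects_v2(Bucket='mlops-python', Prefix='models/ctgan/', MaxKeys=5)
    print("S3 access OK, found", response.get('KeyCount', 0), "objects under models/ctgan/")
except ClientError as err:
    print("S3 access failed - check the instance's IAM role:", err)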

Architecture Overview

The foundation of any robust solution lies in its architecture. A clear and scalable architecture ensures smooth operation and eases future upgrades or modifications. As we dive deep into deploying our CTGAN model on an Amazon EC2 instance, it’s vital to understand the architectural flow. The following is a visual blueprint of the solution:

Architecture diagram: deploying a trained CTGAN model on an EC2 instance

Description of the Architecture:

  1. Users: The starting point of our flow. Users will send their requests to our deployed model on the EC2 instance.
  2. EC2 Instance Container:
      • The heart of our deployment, the EC2 instance, hosts the Docker container, which in turn runs our CTGAN model.
      • It’s shielded by specific security rules to ensure only necessary traffic (like SSH or specific API requests) gets through.
      • Docker offers an isolated environment ensuring that the model runs in a consistent setup, unaffected by external factors.

3. Trained CTGAN Model:

      • CTGAN (Conditional Tabular GAN) specializes in generating synthetic tabular data.
      • In our case, it’s been pre-trained and is ready to generate data upon receiving requests from users.

4. Amazon S3:

      • We interact with S3 in two ways. First, the trained model is stored in and retrieved from an S3 bucket.
      • Second, once the model generates synthetic data, this data is then uploaded to an S3 bucket, making data storage and retrieval seamless.
      • S3 offers a durable and scalable storage solution, ensuring our generated data is safely stored and readily accessible.

The beauty of this architecture lies in its modularity. Each component operates independently but in harmony with the others. This setup not only provides flexibility but also eases troubleshooting, should any issues arise.

API Code Explanation

From our architectural overview, we now delve into the code that makes our synthetic data generation API function. This API is structured to manage the entire data synthesis lifecycle, from model initialization to data generation and quality evaluation.

Remember, while this code may appear complex at first glance, it’s structured for clarity and modularity. This ensures ease of understanding and makes future modifications straightforward.

In this section, we set up the necessary imports, initialize the Flask app, and configure logging and warning filters. We also establish a connection with Amazon S3, which is used to store and retrieve models and data.

Setup

Before diving deep into the functionality, we prepare the environment. We begin by importing the necessary libraries. We initialize Flask to build our API and also set up logging to keep track of important information. The connection with Amazon S3 is set up using boto3 to manage our models and data.

from flask import Flask, jsonify
import pandas as pd
import random
import boto3
from datetime import datetime
import logging
import warnings

# CTGANSynthesizer is used further below; this import path assumes the SDV 1.x API
from sdv.single_table import CTGANSynthesizer

logging.basicConfig(level=logging.INFO)
warnings.filterwarnings("ignore")
s3 = boto3.client('s3')
app = Flask(__name__)

Model Initialization

After setting up the API, we want to ensure our machine learning model (in this case, CTGAN) is ready for use. This section deals with fetching the latest model from Amazon S3 and initializing it. If no model is found, a new synthesizer is trained.

Upon starting our application, we attempt to fetch the most recent model from our S3 bucket. If we can’t find a model, we default to training a new synthesizer, ensuring that our API always has a model to work with.

# Look for previously trained models saved under the models/ctgan/ prefix
objects = s3.list_objects_v2(Bucket='mlops-python', Prefix='models/ctgan/')

try:
    latest_model = objects['Contents'][-1]['Key']
except (KeyError, IndexError):
    logging.info("No models found in S3. Training a new synthesizer...")
    train_synthesizer()  # sketched further below; trains and uploads a fresh model
    # Refresh the listing so the newly uploaded model can be picked up
    objects = s3.list_objects_v2(Bucket='mlops-python', Prefix='models/ctgan/')
    latest_model = objects['Contents'][-1]['Key']

# Download the latest model artifact and load it into memory
s3.download_file('mlops-python', latest_model, '/tmp/telco_customer_churn_synthesizer.pkl')
ctgan = CTGANSynthesizer.load('/tmp/telco_customer_churn_synthesizer.pkl')
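
Note that the train_synthesizer() helper invoked above is not shown in the snippet. A minimal sketch of what it could look like is given below; it assumes the SDV 1.x single-table API and a training CSV stored in the same bucket (the data/telco_customer_churn.csv key and the epoch count are placeholders to adapt to your own project).

# SDV 1.x imports; adjust if your project uses a different SDV version
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

def train_synthesizer():
    # Pull the training data from S3 (placeholder key; point this at your dataset)
    s3.download_file('mlops-python', 'data/telco_customer_churn.csv',
                     '/tmp/telco_customer_churn.csv')
    real_data = pd.read_csv('/tmp/telco_customer_churn.csv')

    # Infer metadata and fit a fresh CTGAN synthesizer
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_data)
    synthesizer = CTGANSynthesizer(metadata, epochs=300)
    synthesizer.fit(real_data)

    # Persist the model locally, then upload it under the models/ctgan/ prefix
    # so the startup code above can discover it on the next listing
    synthesizer.save('/tmp/telco_customer_churn_synthesizer.pkl')
    timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
    s3.upload_file('/tmp/telco_customer_churn_synthesizer.pkl', 'mlops-python',
                   f'models/ctgan/ctgan_{timestamp}.pkl')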

Data Generation and Validation

This section is dedicated to the core functionality of the API – generating synthetic data using CTGAN. Additionally, there’s code for validating the generated synthetic data to ensure its quality.

Using our trained model, we can generate synthetic data samples. We’ve also implemented a validation function to assess the quality and integrity of the generated synthetic data, ensuring that it matches our expectations.

def generate_synthetic_data():
    global latest_synthetic_data
    samples = random.randint(100, 1000)
    latest_synthetic_data = ctgan.sample(samples)

def validate_synthetic_data(data):
    # ... (data validation logic here) ...
    pass

@app.route('/generate_data', methods=['GET'])
def generate_data_endpoint():
    generate_synthetic_data()
    # ... (response logic here) ...
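
For reference, the elided response logic is also where the generated rows get pushed to S3, as described in the architecture overview. An expanded version of the endpoint might look like the sketch below; the data/synthetic/ prefix and the response fields are placeholders rather than part of the original code.

@app.route('/generate_data', methods=['GET'])
def generate_data_endpoint():
    generate_synthetic_data()

    # Persist the freshly generated rows to S3 so downstream consumers can pick them up
    timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
    key = f'data/synthetic/synthetic_{timestamp}.csv'
    latest_synthetic_data.to_csv('/tmp/synthetic.csv', index=False)
    s3.upload_file('/tmp/synthetic.csv', 'mlops-python', key)

    logging.info("Generated %d rows and uploaded to s3://mlops-python/%s",
                 len(latest_synthetic_data), key)
    return jsonify({'rows_generated': len(latest_synthetic_data), 's3_key': key})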

Data Quality Evaluation and Cloud Metrics

This section evaluates the quality of the generated synthetic data and sends the quality score to Amazon CloudWatch. If the quality is below a threshold, the synthesizer is retrained.

It’s crucial to ensure that our synthetic data is of high quality. After validating the data, we evaluate its quality and compare it to the real data. This quality score is then sent to Amazon CloudWatch. If the quality isn’t up to par, we take corrective measures by retraining the synthesizer.

def send_metric_to_cloudwatch(metric_name, metric_value):
    # ... (cloudwatch logic here) ...
    pass


@app.route('/evaluate_quality', methods=['GET'])
def evaluate_quality_endpoint():
    generate_data_endpoint()
    # ... (validation, evaluation, and response logic here) ...
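
The elided evaluation logic can be implemented with SDV’s built-in quality report. The sketch below is one way to do it, assuming the SDV 1.x evaluation API, the same training CSV used by the training helper, and an arbitrary quality threshold of 0.8; adjust all of these to your own setup.

from sdv.evaluation.single_table import evaluate_quality  # SDV 1.x
from sdv.metadata import SingleTableMetadata

QUALITY_THRESHOLD = 0.8  # arbitrary cut-off chosen for this example

@app.route('/evaluate_quality', methods=['GET'])
def evaluate_quality_endpoint():
    generate_data_endpoint()

    # Compare the synthetic rows against the real dataset used for training
    s3.download_file('mlops-python', 'data/telco_customer_churn.csv',
                     '/tmp/telco_customer_churn.csv')
    real_data = pd.read_csv('/tmp/telco_customer_churn.csv')
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_data)
    report = evaluate_quality(real_data, latest_synthetic_data, metadata)
    score = report.get_score()

    # Publish the score to CloudWatch and retrain if it falls below the threshold
    send_metric_to_cloudwatch('QualityScore', score)
    if score < QUALITY_THRESHOLD:
        logging.info("Quality score %.3f is below threshold, retraining the synthesizer", score)
        train_synthesizer()

    return jsonify({'quality_score': score})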

Docker Local Testing and Deployment on Amazon EC2

In the realm of software deployment, it’s crucial to ensure that your applications are running flawlessly in every environment, especially when you’re dealing with complex architectures and data. Docker, a powerful tool for creating, deploying, and running applications inside containers, makes this process seamless. Alongside Amazon Web Services (AWS), it provides a robust infrastructure to deploy your applications at scale. Let’s walk through how we can locally test our synthetic data generation API using Docker and then deploy it on an Amazon EC2 instance.

Docker Local Testing

Before deploying our application to a production or cloud environment, we should test it locally to catch any potential issues. This ensures that our application behaves as expected and aids in debugging if necessary.

  1. Build the Docker Image:
      • This command creates a Docker image of our application. An image is a lightweight standalone executable package containing everything required to run a piece of software.
docker build -t data_generation .

2. Run the Docker Image:

      • Once our image is built, we can run it. The -p flag maps the port on your machine to the port on which your application runs in the Docker container.
docker run -p 5001:5000 -v ~/.aws:/root/.aws data_generation

3. Test the Data Generation API:

      • With our Docker container up and running, we can test the API endpoint responsible for data generation.
curl http://localhost:5001/generate_data

4. Test the Data Quality Evaluation Endpoint:

      • Similarly, we test the endpoint that evaluates the quality of our generated data.
curl http://localhost:5001/evaluate_quality

Deploying to an Amazon EC2 Instance

Having tested locally, we’re now confident about deploying our application on an Amazon EC2 instance. EC2 provides scalable computing capacity in the cloud, which makes it easy to deploy applications at scale.

Note: This section assumes you already have an EC2 instance set up and can SSH into it.

  1. Prepare Your Deployment Files:
      • To transfer files to EC2, it’s practical to compress them into a single .zip file.
zip -r app.zip app.py requirements.txt Dockerfile

2. Transfer the Zip to the EC2 Instance:

      • Use the scp command to copy the zip file over to your EC2 instance.
scp -i <path_to_pem_file> app.zip <ec2_user>@<ec2_public_dns>:~/app.zip

3. SSH into the EC2 Instance:

      • Access your EC2 instance.
ssh -i <path_to_pem_file> <ec2_user>@<ec2_public_dns>

4. Update and Install Docker on EC2:

      • For a fresh instance, ensure you update and then install Docker.
sudo yum update -y
sudo yum install docker -y


5. Unzip the Transferred Files:

      • Extract the files from the zip we transferred.
unzip app.zip

6. Start Docker and Build the Image:

      • Kickstart Docker and then build the Docker image. On a fresh instance, the docker commands need sudo unless you add your user to the docker group.
sudo service docker start
sudo docker build -t data_generation .

7. Run the Docker Image with AWS Logging:

      • Now run the Docker image, this time configuring the awslogs log driver so container logs are shipped to Amazon CloudWatch. Note that the log group (data-synthesizer-logs here) must already exist in the target region, and the instance’s IAM role needs permission to write to CloudWatch Logs.
sudo docker run --name data_generation \
  --log-driver=awslogs \
  --log-opt awslogs-group=data-synthesizer-logs \
  --log-opt awslogs-region=us-west-2 \
  -p 5000:5000 data_generation:latest

8. Test the Data Generation API on EC2:

      • Similar to our local tests, let’s test our endpoints, but this time on the cloud.
curl http://<ec2_public_dns>:5000/generate_data

9. Test the Data Quality Evaluation Endpoint on EC2:

      • Lastly, test the data quality evaluation endpoint.
curl http://<ec2_public_dns>:5000/evaluate_quality

With this, you’ve successfully tested your application locally using Docker and then deployed it onto an Amazon EC2 instance. These steps ensure a robust testing and deployment cycle, guaranteeing the reliability and scalability of your application in real-world scenarios.

Data Quality Evaluation and Monitoring

Most model deployment blog posts stop here, focusing solely on the process of setting up and deploying the model. But there’s a critical aspect often overlooked: the continuous evaluation and monitoring of data quality. In the domain of machine learning operations (MLOps) and data-centric applications, understanding and maintaining the integrity of your data post-deployment is pivotal.

In the realm of synthetic data generation, for instance, the quality of generated data can influence the effectiveness of machine learning models. Poor quality data can lead to misleading model outputs, which can compromise the overall reliability of systems relying on that data.

The Importance of Monitoring

  1. Assurance of Data Integrity: As data is the backbone of any data-driven system, its quality directly impacts the system’s performance. Continuous monitoring guarantees that the system is functioning on trustworthy data.
  2. Early Detection of Anomalies: Regular monitoring helps in detecting and rectifying anomalies or outliers in the data, which if left untreated, can skew the results and predictions of ML models.
  3. Adaptability: In dynamic environments, data streams can change over time. Monitoring ensures that your system adapts to these changes and remains accurate.
  4. Stakeholder Trust: Consistent monitoring and reporting increase the trust of stakeholders, as they can be sure of the system’s reliability and accuracy.
  5. Compliance and Regulations: Especially in industries like finance and healthcare, data quality and its monitoring are not just best practices but mandatory for regulatory compliance.

Pushing Quality Metrics to CloudWatch

Amazon CloudWatch provides real-time monitoring services for AWS resources and applications. By pushing quality metrics to CloudWatch, you get a holistic view of how your synthetic data generation process is performing over time.

Here’s a function that helps push these metrics to CloudWatch:

def send_metric_to_cloudwatch(metric_name, metric_value):
    cloudwatch = boto3.client('cloudwatch', region_name='us-west-2')
    cloudwatch.put_metric_data(
        Namespace='MLOps/QualityMetrics',
        MetricData=[
            {
                # Use the metric name passed in by the caller (e.g. 'QualityScore')
                'MetricName': metric_name,
                'Dimensions': [
                    {
                        'Name': 'ModelName',
                        'Value': 'CTGANSynthesizer'
                    },
                ],
                'Value': metric_value,
                'Unit': 'None'
            },
        ]
    )

Practical Use of Monitoring

Imagine you are working on a finance application that uses machine learning models to predict stock prices. These predictions are based on various parameters, one of which is synthetic data generation for modeling possible future scenarios. As markets are dynamic, the data used for these predictions must be of the highest quality.

One day, due to some issues, the quality of the synthetic data starts deteriorating. Without a proper monitoring system in place, this drop in quality can go unnoticed. This can result in inaccurate stock price predictions, leading to significant financial losses and a decrease in user trust.

However, with the CloudWatch metrics in place, as soon as there’s a dip in the quality of synthetic data, an alert is triggered. The team can immediately look into the cause and rectify it before it impacts the stock price prediction model.

In essence, the continuous evaluation and monitoring of data quality are not just about maintaining system performance; it’s about risk mitigation and ensuring that the insights drawn from the data are accurate and reliable. This proactive approach, backed by real-time monitoring tools like CloudWatch, ensures that any potential issues are addressed promptly, safeguarding the integrity of the entire system.
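
Wiring up such an alert is straightforward once the metric lands in CloudWatch. The following is a minimal sketch using boto3’s put_metric_alarm; the alarm name, threshold, and SNS topic ARN are placeholders, while the namespace and dimensions match the send_metric_to_cloudwatch function shown earlier.

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-west-2')
cloudwatch.put_metric_alarm(
    AlarmName='ctgan-quality-score-low',            # placeholder alarm name
    Namespace='MLOps/QualityMetrics',
    MetricName='QualityScore',
    Dimensions=[{'Name': 'ModelName', 'Value': 'CTGANSynthesizer'}],
    Statistic='Average',
    Period=300,                                     # evaluate over 5-minute windows
    EvaluationPeriods=1,
    Threshold=0.8,                                  # alert when quality drops below 0.8
    ComparisonOperator='LessThanThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:us-west-2:123456789012:data-quality-alerts'],  # placeholder SNS topic
)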

Final Remarks

Deploying a machine learning model into a production environment is a multifaceted endeavor that goes beyond just the coding and training phase. Through the course of this article, we’ve traversed the complexities of architecture design, API integration, containerization with Docker, deployment to cloud platforms like Amazon EC2, and the indispensable act of monitoring data quality in real-time.

Our journey began with a comprehensive overview of the architecture that provides a high-level understanding of how different components interconnect. The API code was then dissected to understand its functionalities and how it plays a pivotal role in our deployment. Subsequently, the process of testing locally with Docker and deploying to an EC2 instance was elucidated, offering a hands-on approach to bring our ML models to life in the real world.

Yet, as highlighted, the process doesn’t conclude with deployment. The subsequent monitoring of data quality, especially in applications involving synthetic data, becomes the lynchpin for maintaining the reliability and performance of our system.

For any data scientist, developer, or machine learning enthusiast, the insights from this article serve as a comprehensive guide for model deployment, emphasizing not just the technical aspects but also the significance of continuous evaluation post-deployment. It is a testament to the idea that in the world of MLOps, the journey doesn’t end once the model is deployed; it merely begins. As the landscape of technology and data continues to evolve, so must our methodologies and practices to ensure that we’re delivering the best possible results with utmost reliability.

Whether you’re a novice taking your first steps in model deployment or a seasoned practitioner, it’s hoped that this article sheds light on the intricacies involved and inspires you to adopt a holistic approach to MLOps, from inception to continuous monitoring. Remember, in the dynamic universe of machine learning, staying informed, adaptable, and vigilant is the key to success.

Resources and References:

https://docs.aws.amazon.com/ec2/

https://docs.docker.com/

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cloudwatch.html

https://sdv.dev/SDV/user_guides/single_table/ctgan.html

 


Written by: John Patrick Laurel

Pats is the Head of Data Science at a European short-stay real estate business group. He boasts a diverse skill set in the realm of data and AI, encompassing Machine Learning Engineering, Data Engineering, and Analytics. Additionally, he serves as a Data Science Mentor at Eskwelabs. Outside of work, he enjoys taking long walks and reading.
