Last updated on July 8, 2023
Before we begin, let’s quickly talk about what Amazon SageMaker is and what it is used for. If this is your first time learning about Amazon SageMaker, it is the machine learning platform of AWS that helps solve the different requirements of data scientists, developers, and machine learning practitioners. It has several features and capabilities that assist in the different stages of the machine learning process.
In this tutorial, we will focus on SageMaker Ground Truth and how it helps data science teams get access to clean labeled datasets. When performing machine learning experiments, getting access to labeled datasets is rarely as straightforward as it seems. In most cases, the raw data available is “dirty” and requires a few manual steps before it can be considered ready for use in an ML experiment. When dealing with larger datasets and files that require significant labeling work, it may be more practical to set up a scalable workflow and have a dedicated workforce do the work instead of labeling everything yourself. SageMaker has a capability that addresses this specific need: Ground Truth.
SageMaker Ground Truth supports multiple workforce options: a public workforce, a private workforce, and a Vendor workforce for working with a data labeling company. When dealing with sensitive data that can’t be shared with other entities or companies, one of the recommended options is to work with a private workforce. With this option, the machine learning practitioner assigns the labeling work to individuals from their own company or to a group of trusted data labelers. The great thing here is that creating a private workforce and assigning labeling tasks is straightforward when using SageMaker Ground Truth. We’ll divide the steps into 3 parts:
- Creating and preparing the private workforce
- Creating and preparing the labeling job
- Using the Worker Portal to perform the labeling job
Let’s begin!
PART I. CREATING AND PREPARING THE PRIVATE WORKFORCE
1. Go to the SageMaker console
2. Using the sidebar, navigate to the Labeling Workforces section (under Ground Truth)
3. Navigate to the Private workforce tab
4. Invite Workers by clicking the “Invite new workers” button
5. Specify the email addresses of the workers you want to invite inside the text area then click the “Invite new workers” button.
6. Verification emails will be sent to the email addresses specified.
7. Create a new private team by clicking the “Create private team” button in the Private workforce tab.
8. Specify a team name and leave the defaults as is before clicking the “Create private team” button.
9. Once the private team has been created, navigate to the specific private team details page by clicking the name in the Private teams pane.
10. Navigate to the “Workers” tab and click “Add workers to team”. Select the workers you want to add to the private team then click the “Add workers to team” button.
After this step, we can now proceed with creating and preparing the labeling job!
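If you prefer to script this part instead of clicking through the console, the private team can also be created with the `create_workteam` call of the boto3 SageMaker client. The snippet below is only a minimal sketch: it assumes a private workforce backed by an Amazon Cognito user pool already exists, and the team name, user pool ID, user group, and app client ID are hypothetical placeholders that you would replace with your own values. The actual AWS call is wrapped in a function and is not executed here.

```python
def build_workteam_request(team_name, user_pool_id, user_group, client_id):
    """Build the keyword arguments for the SageMaker create_workteam call."""
    return {
        "WorkteamName": team_name,
        "Description": f"Private labeling team: {team_name}",
        "MemberDefinitions": [
            {
                "CognitoMemberDefinition": {
                    "UserPool": user_pool_id,
                    "UserGroup": user_group,
                    "ClientId": client_id,
                }
            }
        ],
    }


def create_workteam(request):
    """Send the request to AWS (requires boto3 and valid credentials)."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    client = boto3.client("sagemaker")
    return client.create_workteam(**request)


# All values below are hypothetical placeholders, not real resources.
request = build_workteam_request(
    team_name="my-private-team",
    user_pool_id="us-east-1_EXAMPLE",
    user_group="my-private-team-group",
    client_id="exampleclientid",
)
print(request["WorkteamName"])
```

Calling `create_workteam(request)` with real values should return a response that includes the ARN of the new team. The console steps above take care of the Cognito setup for you, so this route mostly makes sense when you need to automate team creation.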
PART II. CREATING AND PREPARING THE LABELING JOB
1. Navigate to the Amazon S3 console
2. Create a new S3 bucket (e.g., sagemaker-cookbook-ground-truth)
3. Upload 3 text files to the S3 bucket you just created, using the following filenames and contents:
· 1.txt – 42
· 2.txt – 19
· 3.txt – 21
4. Create another S3 bucket where the output files are going to be stored
5. Navigate to the Amazon SageMaker console
6. Using the sidebar, navigate to the Labeling Jobs section under Ground Truth
7. Click the “Create labeling job” button
8. Specify the labeling job details
9. Under Data Setup – S3 location for input datasets, select an S3 bucket using the “Browse S3” button. Use the bucket that contains the 3 text files.
10. Under Data Setup – S3 location for output datasets, select the “Specify a new location” option then click the “Browse S3” button. Use the second S3 bucket, which was created for the output files.
11. Set the Data Type to Text
12. Specify the IAM Role (create a new one or use an existing one)
In this example, we’ve selected the “Any S3 bucket” option but feel free to select the “Specific S3 buckets” option for a more secure setup.
13. Click the “Complete data setup” button
14. Under “Task type”, select the desired task selection option. In this example, choose “Text Classification (single label)” under Task selection.
15. Click the “Next” button
16. Specify the labeling job configuration under the “Select workers and configure tool” pane
- Worker types – Private
- Private teams – [Select private team]
- Task timeout – 1 hour
- Task expiration time – 10 days
17. Specify the text classification labeling job details as seen below:
18. Click the “Preview” button to see a quick preview of what the workers will see when they receive the job instructions.
Once the labeling job has been created, it should be visible in the worker portal after a few minutes. In the last part of this tutorial, we will assume the role of the worker from the Private workforce and perform the actual labeling job.
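To make the data setup and the worker configuration a bit more concrete, here is a small sketch of two things the console handles behind the scenes. First, a labeling job reads its input from a manifest file in JSON Lines format, where each line points to one data object through a `source-ref` key (or carries inline text through a `source` key). Second, the task timeout and task expiration time correspond to the `TaskTimeLimitInSeconds` and `TaskAvailabilityLifetimeInSeconds` fields of the `HumanTaskConfig` used by the `create_labeling_job` API. The helper function names are made up for illustration, and the manifest that the console generates for you may not look exactly like this one.

```python
import json

# Bucket name and files taken from the examples in steps 2 and 3.
INPUT_BUCKET = "sagemaker-cookbook-ground-truth"
FILES = {"1.txt": "42", "2.txt": "19", "3.txt": "21"}


def build_manifest(bucket, filenames):
    """Return JSON Lines text with one "source-ref" entry per S3 object."""
    lines = [
        json.dumps({"source-ref": f"s3://{bucket}/{name}"})
        for name in sorted(filenames)
    ]
    return "\n".join(lines) + "\n"


def build_task_timing(timeout_hours, expiration_days):
    """Convert the console values into the seconds-based API fields."""
    return {
        "TaskTimeLimitInSeconds": timeout_hours * 60 * 60,
        "TaskAvailabilityLifetimeInSeconds": expiration_days * 24 * 60 * 60,
    }


manifest = build_manifest(INPUT_BUCKET, FILES)
timing = build_task_timing(timeout_hours=1, expiration_days=10)

print(manifest, end="")
print(timing)
```

With the values used in this tutorial, the 1 hour timeout becomes 3600 seconds and the 10 day expiration becomes 864000 seconds.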
PART III. USING THE WORKER PORTAL TO PERFORM THE LABELING JOB
1. Worker Portal: Using the link provided in the verification email, access the worker’s portal then use the credentials to sign in (and change the password)
2. Worker Portal: Select the job then click “Start working”
3. Worker Portal: You’ll see a screen similar to the Preview page
4. Once completed, the results should be reflected in the account that created the labeling job
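Once the job is complete, Ground Truth writes an output manifest to the output S3 location, again in JSON Lines format. For a text classification job, each line typically repeats the input entry and adds the label under the label attribute name, along with a metadata entry that includes the chosen class name. The sketch below parses a few such lines; the job name (`my-labeling-job`), the label names, and the sample records are all hypothetical, so treat the exact field layout as an assumption and compare it against your own output file.

```python
import json

# Hypothetical label attribute name; this is usually the labeling job name.
LABEL_ATTRIBUTE = "my-labeling-job"

# Hypothetical sample records shaped like output manifest lines.
sample_records = [
    {
        "source-ref": "s3://sagemaker-cookbook-ground-truth/1.txt",
        LABEL_ATTRIBUTE: 0,
        f"{LABEL_ATTRIBUTE}-metadata": {"class-name": "LABEL_A", "human-annotated": "yes"},
    },
    {
        "source-ref": "s3://sagemaker-cookbook-ground-truth/2.txt",
        LABEL_ATTRIBUTE: 1,
        f"{LABEL_ATTRIBUTE}-metadata": {"class-name": "LABEL_B", "human-annotated": "yes"},
    },
]
sample_manifest = "\n".join(json.dumps(record) for record in sample_records) + "\n"


def parse_labels(manifest_text, label_attribute):
    """Map each data object to the class name chosen by the workers."""
    results = {}
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        metadata = record.get(f"{label_attribute}-metadata", {})
        key = record.get("source-ref") or record.get("source")
        results[key] = metadata.get("class-name")
    return results


labels = parse_labels(sample_manifest, LABEL_ATTRIBUTE)
for source, class_name in labels.items():
    print(source, class_name)
```

In a real setup you would download the output manifest from the output bucket first and pass its contents to `parse_labels` instead of the sample text.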
That’s pretty much it! We were able to perform the steps in the workflow from start to finish using SageMaker Ground Truth. Involving more users in the labeling tasks will not be a problem, as Ground Truth helps us manage the work and the results with the appropriate workflow and interfaces. We can also perform labeling tasks with images and other types of data as needed. There are definitely more options and features not discussed here, so feel free to take a look at the official documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html
What’s next?
If you want to dig deeper into what Amazon SageMaker can do, feel free to check the 762-page book I’ve written here: https://amzn.to/3CCMf0S. Working on the hands-on solutions in this book will make you an advanced ML practitioner using SageMaker in no time.
You should find the other features and capabilities of SageMaker, such as SageMaker Clarify, SageMaker Model Monitor, and SageMaker Debugger, covered in the book as well.
That’s all for now and stay tuned for more!