2025-11-08: Using Amazon SageMaker Ground Truth for Crowdsourcing
Image from https://aws.amazon.com/sagemaker-ai/groundtruth/
Imagine you are training a machine learning (ML) model to identify dog breeds from images. You have thousands of photos of Golden Retrievers, Poodles, and Dachshunds, but the model cannot distinguish a Chihuahua from a Bulldog without labeled data. This is where data labeling becomes essential. Human annotators tag each image so the ML model can learn to recognize patterns. However, manually labeling thousands of images is time-consuming and resource-intensive. A more efficient approach is to distribute the work among many people online through crowdsourcing, which is the process of outsourcing tasks to a large group of people, often over the internet, instead of relying on a single employee or a small team. In machine learning, crowdsourcing is commonly used for data labeling, the process of annotating raw data such as images, text, or videos to make it suitable for training ML models.
Amazon SageMaker Ground Truth is a managed service that simplifies data labeling by integrating crowdsourcing directly into the ML workflow. Instead of manually managing workers, tracking progress, and consolidating labels, Ground Truth automates much of the process. Key features of Amazon SageMaker Ground Truth include:
- Built-in workforce options – Use your own team (private workforce), Amazon Mechanical Turk (MTurk) workers, or third-party vendors.
- Pre-built labeling templates – Supports common tasks like image classification, object detection, text sentiment analysis, and more.
- Automated quality control – Uses techniques such as majority voting to improve label accuracy.
Setting Up for Crowdsourcing with SageMaker Ground Truth
Before launching into crowdsourcing with SageMaker Ground Truth, you need to complete a few prerequisites. Two key components are deciding where your data will reside (Amazon S3) and defining access permissions (IAM roles).
S3 Buckets for Storage
Amazon S3 (Simple Storage Service) is where your files live within an AWS environment. In our scenario this includes images, text data, video files, and any other file format your ML model needs to learn from. SageMaker Ground Truth reads its input data from S3 and writes the labeled output back to it.
To make this process easier to follow, I’ll be using examples from a project we recently worked on, where we labeled citation contexts into one of three reproducibility-oriented sentiment categories using SageMaker Ground Truth. For our project, we followed the steps below in S3.
- Go to the S3 service in the AWS Console.
- Click “Create Bucket” and create a new S3 bucket.
- After the bucket is created, upload the input files into an “input/” folder (to prepare the input files, follow the instructions in this tutorial series).
For our project, we had all the citation contexts organized with one context per row in a CSV file, without any headers.
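As a point of reference, the sketch below shows how such a headerless CSV can be converted into the JSON Lines manifest format that Ground Truth reads from S3. The bucket name and file names here are placeholders, not the ones from our project.

```python
# Minimal sketch (placeholder names): convert a headerless CSV of citation
# contexts into a JSON Lines manifest and upload it to the "input/" prefix.
import csv
import json
import boto3

bucket = "my-labeling-bucket"          # placeholder bucket name
s3 = boto3.client("s3")

# citation_contexts.csv: one citation context per row, no header
with open("citation_contexts.csv", newline="", encoding="utf-8") as src, \
     open("input.manifest", "w", encoding="utf-8") as dst:
    for row in csv.reader(src):
        if row:  # skip empty rows
            # each manifest line is a JSON object whose "source" field holds the text
            dst.write(json.dumps({"source": row[0]}) + "\n")

s3.upload_file("input.manifest", bucket, "input/input.manifest")
```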
Creating Custom IAM Roles
AWS uses IAM (Identity and Access Management) roles to decide who gets access to what. For Ground Truth, you’ll need a role that gives it permission to access your S3 buckets and, optionally, invoke Lambda functions if you use them later. We created a simple IAM role that had AmazonS3FullAccess, AmazonSageMakerFullAccess, AmazonSageMakerGroundTruthExecution, and AWSLambda_FullAccess.
Steps for creating the IAM Role (a scripted sketch of the same steps follows the list):
- Go to the IAM service in the AWS Console.
- Under “Roles”, click “Create Role”.
- Select “AWS Service” as the trusted entity and choose “SageMaker” as the service.
- Attach policies (choose based on your use case).
- Use a uniquely identifiable name for the role.
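For those who prefer scripting over the console, the same role can be created with boto3 roughly as follows. The role name is a placeholder, and the attached policies simply mirror the ones we listed above; in a production setting you would likely scope them down to least privilege.

```python
# Hedged sketch of creating the labeling role with boto3 instead of the console.
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets SageMaker assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="GroundTruthLabelingRole",  # placeholder role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed policies mentioned in the text above
for policy_arn in [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
    "arn:aws:iam::aws:policy/AmazonSageMakerGroundTruthExecution",
    "arn:aws:iam::aws:policy/AWSLambda_FullAccess",
]:
    iam.attach_role_policy(RoleName="GroundTruthLabelingRole", PolicyArn=policy_arn)
```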
Once your data is ready and permissions are set up, the next question is: who’s actually going to label your data? SageMaker Ground Truth provides three options: use your own team (private workforce), outsource to a professional vendor (vendor workforce), or crowdsource through MTurk. The choice depends on your budget, task complexity, and how much control you want.
A “private workforce” is basically your internal team. You create an access portal where only approved users can log in and complete labeling tasks. This is useful when you need confidentiality or when domain expertise is required, such as for medical images or legal text.
The “vendor workforce” is for when you want a managed, professional service to handle labeling. These vendors are pre-vetted by AWS and often used for complex or large-scale projects.
Amazon Mechanical Turk (MTurk) connects researchers and organizations to a global pool of crowd workers who can complete small tasks, such as labeling images or text, for a small fee per task. It is an effective platform because it allows easy scalability, enabling hundreds or even thousands of labelers to participate simultaneously. MTurk is particularly suitable for simple tasks and is often more cost-effective than hiring a dedicated labeling team. In our project, we chose the MTurk workforce option when creating the labeling job.
After selecting the workforce, the next step is deciding what kind of labeling task you will be doing. Currently, SageMaker Ground Truth offers 14 built-in templates for common tasks, organized into four input types (image, video, text, and 3D point cloud); it also allows you to create custom workflows for more specific needs.
Since our project was a single-choice text classification problem (one label out of three), we used the built-in single-label text classification template and customized it using crowd elements.
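To give a feel for what a crowd-element template looks like, here is a rough sketch of a single-label text classification template saved as a Liquid/HTML file and uploaded to S3. The category names, header text, and file paths are illustrative placeholders rather than our project’s exact template; the `{{ task.input.taskObject }}` variable is where Ground Truth injects the text to be labeled.

```python
# Hedged sketch: a single-label text classification worker template built from
# crowd elements, uploaded to S3 so a labeling job can reference it.
import boto3

template = """
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <crowd-classifier
    name="sentiment"
    categories='["Positive", "Negative", "Neutral"]'
    header="Choose the label that best describes the citation context.">
    <classification-target>{{ task.input.taskObject }}</classification-target>
    <full-instructions header="Instructions">
      Read the citation context and select exactly one label.
    </full-instructions>
    <short-instructions>Select one label per item.</short-instructions>
  </crowd-classifier>
</crowd-form>
"""

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-labeling-bucket",                              # placeholder bucket
    Key="templates/text-classification.liquid.html",          # placeholder key
    Body=template.encode("utf-8"),
)
```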
Once the input data, roles, and configurations are ready as described in the previous sections, you can create a labeling job directly in the SageMaker console by following a few straightforward steps (a scripted sketch of the same flow follows the list).
- Sign in to the SageMaker console and open the Ground Truth Labeling jobs section.
- Choose “Create labeling job” and provide a job name.
- Set up input and output locations in Amazon S3 and assign an IAM role with the required permissions.
- Select the task type and category that matches your data (text, image, video, or point cloud).
- Choose your workforce (MTurk, private, or vendor) and, if needed, adjust worker settings such as timeouts and the number of workers per task.
- Configure the worker interface by providing instructions and label categories, or upload a custom template using crowd elements if required.
- (Optional) Add pre-annotation or post-annotation Lambda functions to process data automatically.
- Review the setup and click “Create” to start the labeling job.
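For completeness, the console wizard above can also be expressed as a single boto3 call. The sketch below is an assumption-laden approximation of what the console assembles, not the exact configuration we used: every name, ARN, and S3 URI is a placeholder, and the region-specific pre-annotation and consolidation Lambda ARNs are the ones the console or documentation shows for your region.

```python
# Hedged sketch of creating a text classification labeling job with boto3.
import boto3

sm = boto3.client("sagemaker")

sm.create_labeling_job(
    LabelingJobName="citation-context-sentiment",            # placeholder job name
    LabelAttributeName="sentiment",
    InputConfig={
        "DataSource": {
            "S3DataSource": {
                "ManifestS3Uri": "s3://my-labeling-bucket/input/input.manifest"
            }
        }
    },
    OutputConfig={"S3OutputPath": "s3://my-labeling-bucket/output/"},
    RoleArn="arn:aws:iam::123456789012:role/GroundTruthLabelingRole",  # placeholder
    # JSON file listing the three label categories (placeholder path)
    LabelCategoryConfigS3Uri="s3://my-labeling-bucket/input/label-categories.json",
    HumanTaskConfig={
        # Public (MTurk) workforce team ARN for your region, as shown in the console
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/public-crowd/default",
        "UiConfig": {
            "UiTemplateS3Uri": "s3://my-labeling-bucket/templates/text-classification.liquid.html"
        },
        # Region-specific AWS-provided Lambdas for text classification (placeholder account)
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:PRE-TextMultiClass",
        "TaskTitle": "Classify the sentiment of a citation context",
        "TaskDescription": "Read the text and choose one of three labels.",
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn":
                "arn:aws:lambda:us-east-1:123456789012:function:ACS-TextMultiClass",
        },
        # Price paid to each worker per task (here, 3 cents)
        "PublicWorkforceTaskPrice": {
            "AmountInUsd": {"Dollars": 0, "Cents": 3, "TenthFractionsOfACent": 0}
        },
    },
)
```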
Once created, SageMaker Ground Truth distributes your data to selected workers, collects annotations, and stores the labeled output in your S3 bucket. From there, you can monitor progress and review results through the “Labeling jobs” console.
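The same monitoring information shown in the “Labeling jobs” console can also be polled programmatically; the job name below is the placeholder used in the earlier sketch.

```python
# Hedged sketch: check job status and label counters from a script.
import boto3

sm = boto3.client("sagemaker")
resp = sm.describe_labeling_job(LabelingJobName="citation-context-sentiment")
print(resp["LabelingJobStatus"])   # e.g. InProgress, Completed, Failed
print(resp["LabelCounters"])       # counts of labeled / unlabeled / failed objects
```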
Additional Configurations and Enhancements
Once the basic labeling job is set up, SageMaker Ground Truth provides extra tools to improve accuracy and automation. One key feature is the use of AWS Lambda functions for annotation consolidation, which improves label quality. Lambda functions allow you to customize how multiple worker responses are combined, apply rules to filter or transform outputs, and calculate confidence scores. For instance, if three workers label the same item and two agree, that label becomes the final result. In our project, we tested using three workers per citation context. This increased the cost but improved label reliability. Ground Truth also saves all individual responses and the final consolidated label in the output file.
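As a simplified illustration of the majority-voting idea (this is not the exact event/response contract of a Ground Truth consolidation Lambda, just the core logic):

```python
# Simplified majority-vote consolidation over one object's worker responses.
from collections import Counter

def consolidate(worker_labels):
    """Return the majority label and a naive agreement-based confidence score."""
    counts = Counter(worker_labels)
    label, votes = counts.most_common(1)[0]
    confidence = votes / len(worker_labels)
    return label, confidence

# Three workers labeled the same citation context; two agree, so "Negative" wins.
print(consolidate(["Negative", "Negative", "Neutral"]))   # ('Negative', 0.67)
```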
Managing Human Responses
After workers submit their labels, the results are saved under your Amazon S3 “output/” folder. Each labeling job generates output manifest files in JSON format, containing individual worker responses and the consolidated output for each labeling item (in our case, each citation context), so you can inspect both what each worker answered and the final label assigned to that object.
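Roughly, the consolidated results can be read line by line from the output manifest as sketched below. The field names assume the label attribute was called “sentiment”, which is a placeholder, and the exact metadata keys can vary by task type, so treat this as an approximation rather than the precise schema of our output.

```python
# Hedged sketch of reading consolidated labels from an output manifest
# (one JSON object per line); individual worker responses are stored in
# separate worker-response files under the job's output prefix.
import json

with open("output.manifest", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        text = record["source"]                     # the original citation context
        consolidated = record["sentiment"]          # consolidated label for this object
        meta = record["sentiment-metadata"]         # e.g. class name, confidence, job name
        print(consolidated, meta.get("class-name"), meta.get("confidence"), text[:60])
```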
Payments and Task Management
When using Amazon Mechanical Turk through SageMaker Ground Truth, AWS handles task distribution and payment processing. You set the price per labeling task in the job configuration, while AWS adds a small service fee and bills you through your account. In our project, each text snippet was a simple classification task, so we kept the price low (~$0.03 per label) and monitored progress in the SageMaker console. Ground Truth provides tools to track completed tasks, worker responses, and errors, though it does not allow direct communication with MTurk workers or access to detailed metrics.
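As a rough back-of-the-envelope check, the worker payout scales with the number of items, the number of workers per item, and the per-task price; the item count below is hypothetical, and the AWS service fee mentioned above is billed on top of this figure.

```python
# Back-of-the-envelope payout estimate (excludes the AWS service fee).
items = 1000                 # hypothetical number of citation contexts
workers_per_item = 3         # as configured in our labeling job
price_per_task = 0.03        # USD paid per label

payout = items * workers_per_item * price_per_task
print(f"Worker payout (before AWS fee): ${payout:.2f}")   # $90.00
```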
Limitations of SageMaker Ground Truth
SageMaker Ground Truth offers convenience but less flexibility than native MTurk. It does not support custom worker qualifications, bonus handling, or manual worker management. While AWS recommends using SageMaker Ground Truth for crowdsourcing, projects that require direct communication with workers or reusing previous workers may benefit from using MTurk directly.
Conclusion
Crowdsourcing through SageMaker Ground Truth streamlines data labeling and reduces manual work. For our three-category text classification project, we used MTurk as the workforce, configured tasks with minimal setup, and obtained accurate results efficiently. Ground Truth is ideal for researchers and developers who want to focus on model development while easily managing labeling at scale.
– Rochana R. Obadage

