SRA Toolkit + AWS: Revolutionizing Bioinformatics Data Prep

Last updated on March 24, 2025

It takes too long. It’s boring. I want the good stuff already…

Sound familiar? These complaints are not entirely wrong, but data preparation is still an essential part of any data analytics job and must be done carefully. It is a step that can easily take up to 80% of a project's total working time. Insightful results depend on properly prepared data, and this article covers the kind of data you will deal with when working in bioinformatics. Fortunately, the cloud can ease this process, and in this case that cloud is Amazon Web Services (AWS).

For context, genomics relies on a process called DNA sequencing, which determines the order of the nucleotides in a DNA strand. Sequencing helps researchers understand important genomic elements, from functional gene sequences to critical regulatory elements. There are two main sequencing approaches: Sanger sequencing and Next-Generation Sequencing (NGS). Sanger sequencing is slower and has limited ability to identify gene variants and mutations at scale. It was used to sequence the first human genome over a period of 13 years. In comparison, NGS is cheaper per base, much faster, and more sensitive to low-frequency variants. It can sequence a human genome in only a few days. Illumina sequencing by synthesis is one of the most common approaches to NGS, and you will likely encounter it when looking for data to work with. This is important to note, as it is the kind of data that will be prepared in this article.

Understanding the SRA Toolkit

Without getting into too many technicalities on the biological side, let’s understand what you’re going to work with: the SRA Toolkit combined with the power of AWS. The Sequence Read Archive (SRA) is an archive of high-throughput sequencing data. It stores raw sequencing data and alignment information, which improves reproducibility and supports new discoveries through data analysis.

SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) clouds, which makes it easier to work with later in this article. The SRA Toolkit is a set of tools that bioinformaticians use to work with NGS data. With it, you can download, convert, and analyze sequencing data.

Core Components and Tools of the SRA Toolkit

The following are the core components you should familiarize yourself with when working with the SRA Toolkit. This list is not exhaustive.

Component	Description
fastq-dump	Converts data from the SRA format (stored in compressed .sra files) into the FASTQ format used in bioinformatics analysis. Its faster, multithreaded successor, fasterq-dump, is the one used later in this article.
prefetch	Used to download raw SRA data files from the SRA repository.
vdb-config -i	Opens an interactive screen where you configure and manage the settings for working with SRA data, including the AWS settings.
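
To see how these components fit together, here is a minimal sketch that downloads one run and converts it to FASTQ. It assumes the toolkit's tools are on your PATH. SRR000001 is only a placeholder accession, and the function name fetch_and_convert and the fastq output folder are made up for illustration.

```shell
# Sketch: download one run with prefetch, then convert it to FASTQ.
# fetch_and_convert and the "fastq" output folder are illustrative names.
fetch_and_convert() {
  local accession="$1"
  local outdir="${2:-fastq}"

  # prefetch downloads the run's .sra file
  prefetch "$accession" || return 1

  # fasterq-dump converts the run and writes the FASTQ files into $outdir
  fasterq-dump "$accession" --outdir "$outdir"
}

# Example call with a placeholder accession; skipped when the toolkit is missing.
if command -v prefetch >/dev/null 2>&1 && command -v fasterq-dump >/dev/null 2>&1; then
  fetch_and_convert SRR000001 fastq
else
  echo "SRA Toolkit not found on PATH; nothing downloaded"
fi
```

Running prefetch first is optional, since fasterq-dump can fetch a run on its own, but a separate download step makes it easier to retry a failed transfer without repeating the conversion.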

AWS and its role in SRA data

Now that you know the essentials of the SRA, what exactly is the role of AWS? With SRA data on AWS, you get the following:

Access to originally submitted files

Faster download speeds

Unlimited concurrent downloads from NCBI's cloud buckets to your own buckets

Did you catch that? Faster download speed… Downloading SRA data directly from the NCBI database can take time, especially with larger datasets.

Fortunately, the NIH NCBI Sequence Read Archive (SRA) on AWS is a dataset offered through the AWS Open Data Sponsorship Program, an AWS initiative that supports the availability and accessibility of public datasets. AWS sponsors the storage and hosting of various datasets that are open for anyone to use, mainly for research. In other words, AWS also hosts SRA data that you can use.
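
As a rough sketch of what this hosting looks like in practice, the snippet below builds the S3 location of a run and shows how you could list it with the AWS CLI. It assumes the open-data bucket is named sra-pub-run-odp and stores runs under sra/<accession>/<accession>, which you should verify against the Registry of Open Data on AWS before relying on it. The helpers sra_s3_uri and list_run are hypothetical, and SRR000001 is a placeholder.

```shell
# Assumed bucket and layout: s3://sra-pub-run-odp/sra/<accession>/<accession>
SRA_BUCKET="sra-pub-run-odp"

# Hypothetical helper: print the S3 URI where a run is expected to live.
sra_s3_uri() {
  printf 's3://%s/sra/%s/%s\n' "$SRA_BUCKET" "$1" "$1"
}

# Hypothetical helper: list a run's objects without signing the request,
# which is how public open-data buckets are usually read.
list_run() {
  aws s3 ls "s3://${SRA_BUCKET}/sra/$1/" --no-sign-request
}

# Print the expected location of a placeholder run.
sra_s3_uri SRR000001

# On a machine with the AWS CLI installed, you could then run:
# list_run SRR000001
```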

Integrating SRA Toolkit with AWS

Before starting, you should first look for your data of interest from the NCBI database.

Preparing Data with SRA Toolkit on AWS

The following steps show how to use AWS with the SRA Toolkit. You will need an AWS account to access the AWS console.

Setting up the SRA Toolkit on AWS EC2 instances

1. Provisioning an EC2 Instance

Go to the AWS EC2 service and select the ‘Launch instance’ option.

Select the Amazon Machine Image (AMI) of your choice. In this article, we’ll use Amazon Linux since it is covered by Amazon’s Free Tier.

Scroll down and choose an instance type. Again, this article will work with the ones under the Free Tier.

Select an existing key pair or create a new one. The key pair secures your connection to the instance, so save the key file somewhere you can access later.

Leave the remaining settings at their defaults and launch the instance. Make sure that the region is set to us-east-1 (N. Virginia) for free and efficient retrieval of SRA data.

Connect to the instance through the SSH client and follow its instructions.

If the connection succeeds, your terminal shows a shell prompt on the instance.
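
The connection step can be sketched as follows, assuming an Amazon Linux instance, whose default login user is ec2-user. The key file name and host name in the example are placeholders, and connect_to_instance is a hypothetical helper.

```shell
# Hypothetical helper: connect to an Amazon Linux instance over SSH.
#   $1 = path to the key pair file (.pem) saved when launching the instance
#   $2 = the instance's public DNS name or IP address
connect_to_instance() {
  local key_file="$1"
  local host="$2"

  # ssh refuses private keys that other users can read
  chmod 400 "$key_file" || return 1

  # ec2-user is the default login on Amazon Linux AMIs
  ssh -i "$key_file" "ec2-user@${host}"
}

# Example with placeholder values:
# connect_to_instance ~/my-key.pem ec2-203-0-113-10.compute-1.amazonaws.com
```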

2. Installing the SRA Toolkit. Since we are working with Amazon Linux, the CentOS Linux 64-bit build should work. It can be accessed here together with its instructions.

Make sure your preferred terminal (for example, Bash on Linux or macOS, or Command Prompt on Windows) is already connected to the EC2 instance you created.

From there, download the SRA Toolkit for this activity. Type:

wget --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gz

Extract the contents of the tar file:

tar -vxzf sratoolkit.tar.gz

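Navigate to the bin directory inside the extracted folder, for example /home/ec2-user/sratoolkit.3.1.1-centos_linux64/bin. The version number in the folder name depends on the release you downloaded.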
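
Because the folder name changes with each toolkit release, a small helper can locate the bin directory instead of hard-coding the version. This is only a sketch, assuming the archive was extracted in your home directory; find_toolkit_bin is a hypothetical helper.

```shell
# Hypothetical helper: print the bin directory of an extracted SRA Toolkit
# found under the given base directory (defaults to $HOME).
find_toolkit_bin() {
  local base="${1:-$HOME}"
  local dir
  for dir in "$base"/sratoolkit.*-centos_linux64; do
    # An unmatched pattern stays literal, so confirm the directory exists
    if [ -d "$dir/bin" ]; then
      printf '%s\n' "$dir/bin"
      return 0
    fi
  done
  return 1
}

# Add the tools to PATH for the current shell session, if the toolkit is there.
if bin_dir=$(find_toolkit_bin "$HOME"); then
  export PATH="$bin_dir:$PATH"
  echo "SRA Toolkit tools added to PATH from $bin_dir"
else
  echo "No extracted SRA Toolkit found under $HOME"
fi
```

With the bin directory on PATH, the tools can be called from any folder without the ./ prefix.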

3. Configuring the SRA Toolkit.

Now, from the bin directory, type the command

./vdb-config -i

This opens an interactive configuration screen. Make sure that remote access is enabled.

Press ‘A’ to go to the AWS tab, and press ‘R’ to enable the report cloud instance identity option. Then press ‘X’ to exit, and confirm to save the changes.
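
If you would rather script this step than use the interactive screen, the same setting can be passed on the command line. This is a sketch that assumes your toolkit version supports the --report-cloud-identity option (check vdb-config --help) and that the bin directory is on your PATH; enable_cloud_identity is a hypothetical wrapper.

```shell
# Hypothetical wrapper: the non-interactive counterpart of enabling
# "report cloud instance identity" in the AWS tab of vdb-config -i.
# Assumes a toolkit release that accepts --report-cloud-identity.
enable_cloud_identity() {
  vdb-config --report-cloud-identity yes
}

# Apply the setting only when vdb-config is actually available.
if command -v vdb-config >/dev/null 2>&1; then
  enable_cloud_identity
else
  echo "vdb-config not found on PATH; skipping configuration"
fi
```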

4. Accessing the SRA file

From the NCBI database, get the accession number of your dataset, which is used to access it.

In the terminal, type:

./fasterq-dump <accession_number>

Then confirm that the download succeeded. For a paired-end run, two (2) FASTQ files appear in the current directory, named <accession_number>_1.fastq and <accession_number>_2.fastq.
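
One simple way to confirm the result is to count the reads in each file. The sketch below assumes a paired-end run whose two files follow the _1.fastq and _2.fastq naming, with four lines per read; count_fastq_reads and check_pair are hypothetical helpers.

```shell
# Count the reads in a FASTQ file, assuming four lines per read.
count_fastq_reads() {
  local lines
  lines=$(wc -l < "$1") || return 1
  echo $(($lines / 4))
}

# Report the read counts of both mate files of a paired-end run and
# succeed only if they match.
check_pair() {
  local accession="$1"
  local r1 r2
  r1=$(count_fastq_reads "${accession}_1.fastq") || return 1
  r2=$(count_fastq_reads "${accession}_2.fastq") || return 1
  echo "${accession}: ${r1} reads in _1, ${r2} reads in _2"
  [ "$r1" -eq "$r2" ]
}

# Example, run from the folder that holds the FASTQ files:
# check_pair <accession_number>
```

Matching counts do not prove the data is correct, but a mismatch is a quick sign that the conversion was interrupted.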

There you go, you can start working with those files! How is this different from accessing SRA data directly through NCBI? Amazon EC2 can provide a more secure and high-performance environment to process and analyze genomic datasets.

There is also another way of downloading or accessing the data: the Cloud Data Delivery Service. The Sequence Read Archive (SRA) delivers different file types through the SRA Toolkit, but not all of the original files submitted to SRA are available that way. To provide access to these files, SRA developed this cloud service, which moves source files and other formats from NCBI cold storage to users’ data storage in AWS and GCP. A guide can be accessed here.

The usefulness of AWS doesn’t stop here. You can also do the following:

Write your own SQL queries to find specific sets of data.
Retrieve search results quickly and at a low cost.
Compute statistics on the SRA’s available data.
Access this data through various API libraries.

All of these are available through Amazon Athena and are covered in this guide.
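
As a rough illustration of the first point, the sketch below builds a SQL query and shows how it could be submitted with the AWS CLI. It assumes you have already registered the SRA metadata in Athena as a table named metadata in a database named sra, with columns acc, organism, and platform. Those names, the build_sra_query and submit_query helpers, and the results bucket are all assumptions to adapt to your own setup.

```shell
# Hypothetical helper: build a query for runs of one organism on one platform.
# Table and column names (metadata, acc, organism, platform) are assumptions.
build_sra_query() {
  local organism="$1"
  local platform="$2"
  local limit="${3:-10}"
  printf "SELECT acc, organism, platform FROM metadata WHERE organism = '%s' AND platform = '%s' LIMIT %s" \
    "$organism" "$platform" "$limit"
}

# Hypothetical helper: submit a query to Athena.
#   $1 = SQL text
#   $2 = S3 location where Athena should write the results
submit_query() {
  local sql="$1"
  local results_location="$2"
  aws athena start-query-execution \
    --region us-east-1 \
    --query-string "$sql" \
    --query-execution-context Database=sra \
    --result-configuration "OutputLocation=${results_location}"
}

# Print an example query for human Illumina runs.
build_sra_query "Homo sapiens" "ILLUMINA"
echo

# With the AWS CLI configured, you could then submit it (placeholder bucket):
# submit_query "$(build_sra_query 'Homo sapiens' ILLUMINA)" s3://my-athena-results/
```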

Conclusion

After all these steps, you are finally free to work with your SRA data! It might seem overwhelming at first, but using AWS for this task can help you work faster and more efficiently, especially if the rest of your workflow already runs on AWS services such as AWS HealthOmics.


Written by: Samantha Servo

Samantha is a fourth-year Computer Science student at Pamantasan ng Lungsod ng Maynila and an IT intern at Tutorials Dojo. Actively involved in campus organizations such as GDSC PLM and AWS Cloud Clubs Haribon, she's passionate about the convergence of medicine and technology, particularly data science. Samantha aims to contribute to advancements in these dynamic fields.
