
SRA Toolkit + AWS: Revolutionizing Bioinformatics Data Prep

It takes too long. It’s boring. I want the good stuff already…

Sound familiar? These complaints are not entirely wrong, but data preparation is an essential part of any data analytics job and must be done carefully. It is often cited as taking up to 80% of a project's total working time. Insightful results depend on properly prepared data, and this article covers the kind of data you will deal with in bioinformatics. Fortunately, the cloud can ease this process, and in this case that means Amazon Web Services (AWS).

For context, genomics relies on a process called DNA sequencing, which determines the order of the nucleotides in a DNA strand. Sequencing helps researchers understand important genomic elements, from functional gene sequences to critical regulatory elements. There are two main sequencing approaches: Sanger sequencing and Next Generation Sequencing (NGS). Sanger sequencing is slower and has limited ability to identify gene variants and mutations; it was used to sequence the first human genome over a period of 13 years. In comparison, NGS is cheaper and faster, with far higher throughput, and can sequence a human genome in only a few days. Illumina sequencing by synthesis is one of the most common NGS approaches, and you are likely to encounter it when looking for data to work with. This matters because it is the kind of data prepared in this article.

Understanding the SRA Toolkit

Without getting into too many technicalities on the biological side, let’s look at what you’re going to work with: the SRA Toolkit combined with the power of AWS. The SRA (Sequence Read Archive) is an archive of high-throughput sequencing data. It stores raw sequencing data and alignment information, which supports reproducibility and enables new discoveries through data analysis.

SRA data is also available on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) clouds, which makes it easier to work with later in this article. The SRA Toolkit is a set of tools that bioinformaticians use to download, convert, and analyze NGS data.

Core Components and Tools of the SRA Toolkit

The following are the core components you should familiarize yourself with when working with the SRA Toolkit. Take note that the components are not limited to the list below. 

Component | Description
fastq-dump | Converts data from the SRA format (stored in compressed .sra files) into the FASTQ format used in bioinformatics analysis. Its faster, multithreaded successor, fasterq-dump, is the one used later in this article.
prefetch | Downloads raw SRA data files from the SRA repository.
vdb-config -i | Opens the interactive screen where you configure and manage the settings for working with SRA data, including the AWS settings.

AWS and its role in SRA data

Now that you have the necessary background on the SRA, what exactly is the role of AWS? With SRA data on AWS, you get the following:

Access to the originally submitted files

Faster download speeds

Unlimited concurrent downloads from NCBI's cloud buckets to your own buckets

Did you catch that? Faster download speed… Downloading SRA data directly from the NCBI database can take time, especially with larger datasets. 

Fortunately, the NIH NCBI Sequence Read Archive (SRA) on AWS is a dataset offered through the AWS Open Data Sponsorship Program, an AWS initiative that supports the availability and accessibility of public datasets. AWS sponsors the storage and hosting of various datasets that are open for anyone to use, mainly for research. In other words, AWS also hosts SRA data that you can use.
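
As a rough illustration of what "hosted on AWS" looks like in practice, the sketch below builds the S3 location of a run so it can be listed with the AWS CLI. The bucket name sra-pub-run-odp and the sra/<accession>/ key layout are assumptions, not something stated in this article, and sra_s3_uri is a hypothetical helper; check the dataset's entry in the Registry of Open Data on AWS for the current locations.

```shell
#!/bin/sh
# Assumed bucket for SRA runs in the AWS Open Data program; verify before use.
SRA_BUCKET="sra-pub-run-odp"

# Hypothetical helper: print the assumed S3 prefix for a run accession.
sra_s3_uri() {
    printf 's3://%s/sra/%s/\n' "$SRA_BUCKET" "$1"
}

# Example (requires the AWS CLI; --no-sign-request skips credential signing
# so that a public bucket can be read anonymously):
#   aws s3 ls "$(sra_s3_uri SRR000001)" --no-sign-request
```

The helper only builds a string, so it is safe to run anywhere; the commented aws s3 ls line is the part that actually contacts AWS.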

Integrating SRA Toolkit with AWS

Before starting, you should first look for your data of interest from the NCBI database.  

Preparing Data with SRA Toolkit on AWS

The following steps show how to use AWS with the SRA Toolkit. Note that you need an AWS account to access the AWS console.

Setting up the SRA Toolkit on AWS EC2 instances

1. Provisioning an EC2 Instance 

Go to the AWS EC2 service and select the ‘Launch instance’ option. 

Select the Amazon Machine Image (AMI) of your choice. In this article, we’ll use Amazon Linux since it is covered by Amazon’s Free Tier.

Scroll down and choose an instance type. Again, this article will work with the ones under the Free Tier.

Select or create a new key pair. The key pair secures your connection to the instance. Make sure to save the private key file in a place where you can access it later.

Leave everything else at its default and launch the instance. Make sure that the region is set to us-east-1 (N. Virginia) for free and efficient retrieval of SRA data.

Connect to the instance with an SSH client by following the instructions shown for the instance in the console.

You know you are connected when your terminal shows a shell prompt on the instance, logged in as ec2-user.

2. Install the SRA Toolkit. Since we are working with Amazon Linux, the CentOS Linux 64-bit build should work. The build and its installation instructions are available from NCBI.

In your preferred terminal (for example, Bash or Windows cmd), make sure you are already connected to the EC2 instance you’ve created.

From there, download the SRA Toolkit for this activity. Type:

wget --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gz

Extract the contents of the tar file:

tar -vxzf sratoolkit.tar.gz

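Navigate to the bin directory of the extracted folder, for example /home/ec2-user/sratoolkit.3.1.1-centos_linux64/bin. The version number in the folder name depends on the release you downloaded.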
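
To avoid changing into that directory every time, one option is to add the toolkit's bin directory to PATH. This is a small sketch that assumes the archive was extracted into the home directory and that the folder name matches version 3.1.1; SRA_BIN is a variable name chosen here for illustration, so adjust it to the folder you actually have.

```shell
# Assumed location of the extracted toolkit; change the version to match your folder.
SRA_BIN="$HOME/sratoolkit.3.1.1-centos_linux64/bin"

# Put the toolkit first on PATH for the current shell session.
export PATH="$SRA_BIN:$PATH"

# Optional sanity check: prints the resolved path if the tool is found.
command -v fasterq-dump || echo "fasterq-dump not found; check SRA_BIN"
```

The change only lasts for the current session. With it in place, the tools can be called by name from any directory.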

3. Configuring the SRA Toolkit.

Now, from the bin directory, run the command:

./vdb-config -i

This opens an interactive configuration screen. Make sure that remote access is enabled.

Press ‘A’ to navigate to the AWS option, and press ‘R’ to enable ‘report cloud instance identity’. Then press ‘X’ to exit, and confirm to save the changes.

4. Accessing the SRA file

From the NCBI database, get the accession number of the dataset you want (for example, a run accession beginning with SRR). You will use it to access the data.

In the terminal, type:

./fasterq-dump <accession_number>

Then confirm that the download finished, for example by listing the contents of the directory. In the example used for this article, two (2) FASTQ files were downloaded.
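
The download-and-check step can also be wrapped in a small script. This is a sketch under a few assumptions: fasterq-dump is on PATH (otherwise replace it with ./fasterq-dump inside the bin directory), the accession is a run accession (SRR, ERR, or DRR followed by digits), and the function names is_run_accession and fetch_run are made up for illustration.

```shell
#!/bin/sh
# Return success if the argument looks like a run accession (SRR/ERR/DRR + digits).
is_run_accession() {
    printf '%s\n' "$1" | grep -Eq '^[SED]RR[0-9]+$'
}

# Download a run as FASTQ into an output directory, then list what was written.
fetch_run() {
    acc="$1"
    outdir="${2:-.}"   # default to the current directory
    if ! is_run_accession "$acc"; then
        echo "not a run accession: $acc" >&2
        return 1
    fi
    # Stop here if the download itself fails.
    fasterq-dump "$acc" --outdir "$outdir" || return 1
    # Show the resulting FASTQ files and their sizes.
    ls -lh "$outdir"/"$acc"*.fastq
}

# Example: fetch_run SRR000001 ./fastq
```

Validating the accession first keeps a typo from turning into a long, confusing wait on a download that was never going to succeed.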

There you go, you can start working with those files! How is this different from accessing SRA data directly through NCBI? Amazon EC2 can provide a more secure and high-performance environment to process and analyze genomic datasets. 

There is also another way of downloading and accessing the data: the Cloud Data Delivery Service. The Sequence Read Archive (SRA) delivers different file types through the SRA Toolkit, but not all original files submitted to SRA are available that way. To provide access to these files, SRA developed this cloud service, which moves source files and other formats from NCBI cold storage to users’ own data storage in AWS and GCP. NCBI provides a guide for this service.

The utilization of AWS doesn’t stop here–you can also do the following:

  • Write your own SQL queries to find specific sets of data.

  • Retrieve search results quickly and at a low cost. 

  • Compute statistics on the SRA’s available data. 

  • Access this data through various API libraries.

All of these are possible through Amazon Athena, and NCBI provides a guide for setting it up.

Conclusion

After all these steps, you are finally free to work with your SRA data! It might seem overwhelming at first, but using AWS for this task can help you work faster and more efficiently, especially if your workflow already relies on AWS services such as AWS HealthOmics.

Written by: Samantha Servo

Samantha is a fourth-year Computer Science student at Pamantasan ng Lungsod ng Maynila and an IT intern in Tutorials Dojo. Actively involved in campus organizations such as GDSC PLM and AWS Cloud Clubs Haribon, she's passionate about the convergence of medicine and technology, particularly data science. Samantha aims to contribute to advancements in these dynamic fields.
