It takes too long. It’s boring. I want the good stuff already…
Sound familiar? These complaints are not entirely wrong, but data preparation is an important part of any data analytics job and must be done carefully. It can easily take up to 80% of a project’s total working time. Insightful results depend on properly prepared data, and this article covers the kind of data you will deal with in bioinformatics. Fortunately, the cloud can ease this process; in this case, that means Amazon Web Services (AWS).
For context, genomics relies on a process called DNA sequencing, which determines the order of the nucleotides in a DNA strand. It helps researchers understand important genomic elements, from functional gene sequences to critical regulatory elements. There are two main sequencing approaches: Sanger sequencing and Next Generation Sequencing (NGS). The former is slower and has limited ability to identify gene variants and mutations. It was used to sequence the first human genome over a period of 13 years. In comparison, NGS is cheaper and faster, and it detects low-frequency variants more sensitively. It can sequence a human genome in only a few days. Illumina sequencing by synthesis is one of the most common NGS approaches, and you will likely encounter it when looking for data to work with. This matters because it is the kind of data prepared in this article.
Understanding the SRA Toolkit
Without getting into too many technicalities on the biological side, let’s understand what you’re going to work with: the SRA Toolkit combined with the power of AWS. The SRA (Sequence Read Archive) is an archive of high-throughput sequencing data. It stores raw sequencing data and alignment information, which improves reproducibility and supports new discoveries through data analysis.
SRA data is also available on Google Cloud Platform (GCP) and Amazon Web Services (AWS), which makes it easier to work with later in this article. The SRA Toolkit is a set of tools that bioinformaticians use to work with this data: with it, you can download, convert, and analyze sequencing data.
Core Components and Tools of the SRA Toolkit
The following are the core components you should familiarize yourself with when working with the SRA Toolkit. Note that the toolkit includes more tools than the ones listed here.
Component | Description |
fastq-dump | Converts data from the SRA format (stored in compressed .sra files) into the FASTQ format used in bioinformatics analysis. Its faster, multithreaded successor, fasterq-dump, is the one used later in this article. |
prefetch | Downloads raw SRA data files from the SRA repository. |
vdb-config -i | Opens the interactive screen where you configure and manage the settings for working with SRA data, including the AWS settings. |
AWS and its role in SRA data
Now that you know the basics of the SRA, what exactly is the role of AWS? With AWS, you get the following:
Access to the originally submitted files
Faster download speeds
Unlimited concurrent downloads from NCBI’s cloud buckets to your own buckets
Did you catch that? Faster download speed… Downloading SRA data directly from the NCBI database can take time, especially with larger datasets.
Fortunately, the NIH NCBI Sequence Read Archive (SRA) on AWS is a dataset hosted as part of the AWS Open Data Sponsorship Program. This program is an AWS initiative that supports the availability and accessibility of public datasets. AWS sponsors the storage and hosting of various datasets that are open for anyone to use, mainly for research. In other words, AWS also hosts SRA data that you can use.
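To make "AWS also hosts SRA data" a bit more concrete, here is a minimal sketch of how a run could be located in the public bucket. The bucket name (sra-pub-run-odp), the key layout, and the helper name sra_s3_uri are assumptions for illustration, so confirm them on the dataset's Registry of Open Data page before relying on them. SRR000001 is only a placeholder accession.

```shell
# Assumption: public runs live under s3://sra-pub-run-odp/sra/<accession>/<accession>.
# Confirm the bucket name and layout on the Registry of Open Data page.
SRA_BUCKET="sra-pub-run-odp"

# Hypothetical helper: print the S3 URI where a run's .sra object is expected.
sra_s3_uri() {
  printf 's3://%s/sra/%s/%s\n' "$SRA_BUCKET" "$1" "$1"
}

# Example (needs the AWS CLI; --no-sign-request allows anonymous access
# to public buckets, so no credentials are required):
#   aws s3 ls "$(sra_s3_uri SRR000001)" --no-sign-request
#   aws s3 cp "$(sra_s3_uri SRR000001)" . --no-sign-request
```

Copying the object this way gives you the .sra file itself; you would still convert it to FASTQ with the SRA Toolkit afterwards.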
Integrating SRA Toolkit with AWS
Before starting, you should first look for your data of interest in the NCBI database.
Preparing Data with SRA Toolkit on AWS
The following are the steps for using the SRA Toolkit on AWS. Note that you need an AWS account to access the AWS console.
Setting up the SRA Toolkit on AWS EC2 instances
1. Provisioning an EC2 Instance
Go to the AWS EC2 service and select the ‘Launch instance’ option.
Select the Amazon Machine Image (AMI) of your choice. In this article, we’ll use Amazon Linux since it is covered by Amazon’s Free Tier.
Scroll down and choose an instance type. Again, this article will work with the ones under the Free Tier.
Select an existing key pair or create a new one. This is used for a secure connection with the instance. Make sure to save it somewhere you can access later.
Leave everything else at its default and launch the instance. Make sure that the region is set to us-east-1 (N. Virginia) for free and efficient retrieval of SRA data.
Connect to the instance through the SSH client and follow its instructions.
Once you are connected, your terminal should look like the image below.
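The connection step usually comes down to two commands, which can be sketched as the helper below. The function name ec2_ssh_command is hypothetical, and the key path and host name are values you supply. The ec2-user login is the default account on Amazon Linux.

```shell
# Hypothetical helper: lock down the key file, then print the SSH command.
# $1 = path to the .pem key pair file
# $2 = the instance's public DNS name or IP address
ec2_ssh_command() {
  chmod 400 "$1" || return 1   # ssh refuses private keys that others can read
  printf 'ssh -i %s ec2-user@%s\n' "$1" "$2"
}

# Example (both arguments are placeholders):
#   ec2_ssh_command ~/my-key.pem <public-dns-of-your-instance>
```

The helper only prints the command so that you can review it first. Paste the printed line into your terminal to actually connect.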
2. Install the SRA Toolkit. Since we are working with Amazon Linux, the CentOS Linux 64-bit build should work. It can be accessed here together with its instructions.
Make sure that the terminal you are using is already connected to the EC2 instance you’ve created.
From there, download the SRA Toolkit. Type:
wget --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gz
Extract the contents of the tar file:
tar -vxzf sratoolkit.tar.gz
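Navigate to the bin directory inside the extracted folder, for example: /home/ec2-user/sratoolkit.3.1.1-centos_linux64/bin (the version number in the folder name may differ for you).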
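If you would rather call the tools from any directory, you can optionally put the bin directory on your PATH for the current session. The sketch below assumes the archive was extracted in your home directory. The function names are hypothetical and are not part of the toolkit.

```shell
# Add the toolkit's bin directory to PATH for this shell session.
# The folder name depends on the version you extracted, hence the glob.
add_sratoolkit_to_path() {
  for dir in "$HOME"/sratoolkit.*/bin; do
    if [ -d "$dir" ]; then
      PATH="$dir:$PATH"
      export PATH
      return 0
    fi
  done
  echo "no sratoolkit.*/bin directory found under $HOME" >&2
  return 1
}

# Succeed only if every named tool can be found on PATH.
require_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; return 1; }
  done
}

# Example:
#   add_sratoolkit_to_path && require_tools prefetch fasterq-dump vdb-config
```

With the directory on PATH, the ./ prefix in front of the toolkit commands is no longer needed.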
3. Configuring the SRA Toolkit.
Now, from inside the bin directory, type the command:
./vdb-config -i
This opens an interactive configuration screen. Make sure that remote access is enabled.
Press ‘A’ to go to the AWS option, then press ‘R’ to enable report cloud instance identity. Then press ‘X’ to exit and confirm to save the changes.
4. Accessing the SRA file
From the NCBI database, get the accession number of your dataset of interest. It will be used to access the data.
In the terminal, type:
./fasterq-dump <accession_number>
Then confirm that the download succeeded. For paired-end data, as in the image below, two FASTQ files are produced.
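The "get the accession, download, confirm" steps above can be sketched as two small checks around the download. This assumes a paired-end run, for which fasterq-dump normally writes <accession>_1.fastq and <accession>_2.fastq. The function names are hypothetical, and SRR000001 is only a placeholder.

```shell
# Succeed if $1 looks like a run accession: SRR, ERR or DRR followed by digits.
is_run_accession() {
  printf '%s\n' "$1" | grep -Eq '^[SED]RR[0-9]+$'
}

# Succeed if both paired-end FASTQ files for accession $1 exist and are
# non-empty in directory $2 (defaults to the current directory).
has_paired_fastq() {
  [ -s "${2:-.}/${1}_1.fastq" ] && [ -s "${2:-.}/${1}_2.fastq" ]
}

# Example, run from the toolkit's bin directory:
#   acc=SRR000001   # placeholder accession
#   is_run_accession "$acc" && ./fasterq-dump "$acc" \
#     && has_paired_fastq "$acc" && echo "both files present"
```

Single-end runs produce only one FASTQ file, so the second check would not apply to them.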
There you go, you can start working with those files! How is this different from accessing SRA data directly through NCBI? Amazon EC2 can provide a more secure and high-performance environment to process and analyze genomic datasets.
There is also another way of downloading/accessing the data: the Cloud Data Delivery Service. The Sequence Read Archive (SRA) delivers different file types through the SRA Toolkit, but not all original files sent to SRA are available. To access these files, SRA developed this cloud service which moves source files and other formats from NCBI cold storage to users’ data storage in AWS and GCP. A guide can be accessed here.
AWS can do more than this. You can also do the following:
- Write your own SQL queries to find specific sets of data.
- Retrieve search results quickly and at a low cost.
- Compute statistics on the SRA’s available data.
- Access this data through various API libraries.
All of these are possible through Amazon Athena, as described in this guide.
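As a rough illustration of the first point, the sketch below builds a simple query string that you could submit to Athena. The table name (metadata), the database name (sra), and the column names (acc, organism, assay_type) are assumptions about what your Athena setup contains, so check the actual names in your Athena console. The helper name build_sra_query is hypothetical.

```shell
# Print a SQL query listing run accessions for one organism.
# Assumptions: a table named "metadata" with columns acc, organism and
# assay_type; verify these names against your own Athena setup.
# $1 = organism name, $2 = maximum number of rows (defaults to 10)
build_sra_query() {
  printf "SELECT acc, organism, assay_type FROM metadata WHERE organism = '%s' LIMIT %s\n" \
    "$1" "${2:-10}"
}

# Example (needs the AWS CLI and credentials; the database name and the
# results bucket are placeholders):
#   aws athena start-query-execution \
#     --query-string "$(build_sra_query 'Homo sapiens' 5)" \
#     --query-execution-context Database=sra \
#     --result-configuration OutputLocation=s3://<your-results-bucket>/
```

The helper pastes its arguments straight into the SQL text, so only pass values you trust.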
Conclusion
After all these steps, you are finally free to work with your SRA data! It might seem overwhelming at first, but using AWS for this task can help you work faster and more efficiently, especially if your workflow already relies on AWS services such as AWS HealthOmics.
References
SRA
- NIH NCBI Sequence Read Archive (SRA) on AWS Article
- NIH NCBI Sequence Read Archive (SRA) on AWS
- SRA in the Cloud
- NCBI SRA Toolkit
- What is SRA?
- How to Access SRA Data with Amazon Web Services
- Cloud Data Service
- AWS Athena SRA