A marriage of biology and tech. With today’s technological advancements, how about we throw the cloud into the mix too? Just like… Bioinformatics in AWS HealthOmics.
Yes, you read that right!
Are you torn between the two and want to pursue them both? Do you want to sit in a laboratory working on something new and interesting without bidding farewell to “Hello World!”? Then this article will introduce a field that may interest you.
Maybe you are a biologist who wants to know how to use your domain expertise to move into the technological realm. Or perhaps you are a programmer and want to use your tech skills in this specific field. Or maybe, you just knew you wanted to be a part of both. Whatever your position is, you can use this article as your starting point in this wonderful field of… Bioinformatics!
What exactly is Bioinformatics?
First off, what is Bioinformatics? According to the National Human Genome Research Institute, it is a scientific subdiscipline involving the use of computer technology to collect, store, analyze, and disseminate biological data and information. In other words, you can use technology to understand biological processes. It’s like data science, where you collect, process, and analyze data, but applied specifically to biological data.
Bioinformatics has a lot of applications such as:
- Gene therapy
- Evolutionary studies
- Microbial applications
- Prediction of Protein Structure
- Storage and Retrieval of Data
- Drug discovery
- Biometrical Analysis
- And many more…
In a nutshell, you can use Bioinformatics to discover knowledge from the available biological data, which can drive significant medical advancements. This also includes precision and preventive medicine, which focus on developing measures to prevent, control, and cure infectious diseases.
What kind of data are you going to deal with?
In biological data, there are three building blocks you will most likely encounter: DNA, RNA, and proteins. The following is a very quick overview of those concepts; please visit this site for more information.
- DNA or Deoxyribonucleic Acid – encodes genetic information for the transmission of inherited traits (C [Cytosine], G [Guanine], A [Adenine], T [Thymine]).
- RNA or Ribonucleic Acid – a molecule similar to DNA that carries information from the DNA, which is then translated into proteins (C [Cytosine], G [Guanine], A [Adenine], U [Uracil]).
- Proteins – made up of amino acids linked together; perform most cellular functions.
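The relationship between the three building blocks can be sketched in a few lines of Python. This is only an illustration: the 21-base DNA string is made up for the example, and the codon table is deliberately tiny (a real one has 64 entries), so treat the names `CODON_TABLE`, `transcribe`, and `translate` as hypothetical helpers rather than a library API.

```python
# Tiny codon table: only the codons used in the example below,
# not the full 64-codon genetic code.
CODON_TABLE = {
    "AUG": "M",  # methionine (start codon)
    "GAU": "D",  # aspartate
    "AUC": "I",  # isoleucine
    "GCU": "A",  # alanine
    "UAA": "*",  # stop codon
}


def transcribe(dna: str) -> str:
    """Return the RNA version of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate RNA three bases at a time; unknown codons are shown as '?'."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "?")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGATGATGATATCGCTTAA"  # made-up sequence for illustration
rna = transcribe(dna)
print(rna)             # AUGGAUGAUGAUAUCGCUUAA
print(translate(rna))  # MDDDIA
```

In practice you would reach for an established library instead of a hand-written table, but the flow stays the same: DNA letters become RNA letters, and RNA codons become amino acids.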
You can find biological data to process and analyze from the following websites:
- NCBI (National Center for Biotechnology Information) – provides access to public databases for research in computational biology, genome data, and biomedical information.
- ExPasy – Swiss Institute of Bioinformatics (SIB) Bioinformatics Resource Portal which provides access to scientific databases and software tools.
- Gramene – an open-source data source for comparative plant genomics for crops and model organisms.
- AgBioData – a collection of agricultural biological databases and associated resources.
There are a lot more resources here and here for you to check out.
There are different file formats too. The common ones in Bioinformatics are listed below, and you can read more here and here.
- Sequence File Formats
- FASTA – Commonly denoted by the .fasta, .fa, or .fas extension and used by most large curated databases. There are also specific extensions for content such as nucleic acids (.fna), nucleotide coding regions (.ffn), amino acids (.faa), and non-coding RNAs (.frn).
- FASTQ – Denoted by .fastq, .sanfastq, or .fq extension. This format expands upon the simplicity of the FASTA format and was designed for use with next-generation sequencing devices. The “Q” in “FASTQ” stands for quality—the information about quality of the sequencing reads and base calls.
- Alignment File Formats
- BAM file formats – Denoted by the .bam extension, BAM stores sequence alignment data in a compressed binary form that can be indexed. It is the preferred input for the Integrative Genomics Viewer, an interactive tool for the visual exploration of genomic data.
- SAM file formats – Stands for Sequence Alignment/Map and is denoted by the .sam extension. It is the plain-text counterpart of BAM and was introduced together with Samtools, a suite of programs for interacting with high-throughput sequencing data.
- CRAM file formats – A restructured version of BAM files enabling lossless compression.
- Stockholm file formats
- VCF file formats – Stands for Variant Call Format. VCF is denoted by the .vcf extension, stores gene sequence variations, and is used in genotyping projects.
- Generic feature formats – A GFF is denoted by .gff2 or .gff3 extension, describing the sequence elements that make up a gene. Defined in the body of a GFF file are the features present within a gene including transcripts, regulatory regions, untranslated regions, exons, introns, and coding sequences.
- Gene Transfer Formats – A GTF follows the same structure as GFF files, but is used to define gene and transcript-related features exclusively.
- Unlabeled File Formats
- BED – Stands for Browser Extensible Data, and contains details about sequences that a genome browser can display.
- Tar.gz – A compressed archive type that can store bioinformatics software or raw data.
- PDB – Includes atomic coordinates and is used by the Protein Data Bank to store three-dimensional protein structures.
- PED – Denoted by .ped extension, it is the file format used for pedigree analysis.
- MAP – Denoted by the .map extension, it is the file format that accompanies the PED file when using the PLINK program and holds variant information.
- CSV – Stands for Comma Separated Values and is denoted by the .csv extension. It can be opened with spreadsheet programs and is a common file format in data science.
- JSON – Stands for JavaScript Object Notation and is denoted by .json extension. It is commonly used in programming, but can also be encountered in bioinformatics.
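Going back to the two sequence formats at the top of this list, here is a minimal sketch of how you might read them in plain Python. It assumes well-formed input and the common Phred+33 quality encoding for FASTQ; the function names `read_fasta` and `read_fastq` and the sample records are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; FASTA headers start with '>'."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in handle if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        read_id, sequence, _plus, quality = lines[i:i + 4]
        # Assumes Phred+33: quality score = ASCII code of the character - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield read_id[1:], sequence, scores


fasta_text = ">seq1 example record\nATGGAT\nGATATC\n>seq2\nGCTTAA\n"
fastq_text = "@read1\nACGT\n+\nII5!\n"

print(list(read_fasta(io.StringIO(fasta_text))))
print(list(read_fastq(io.StringIO(fastq_text))))
```

The FASTQ record carries one quality character per base, which is exactly the extra "Q" information that FASTA lacks.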
With the rise of different methods of generating and using sequencing data, new file formats keep appearing. However different they are, most share a similar structure: a header with metadata and a body with lines or fields of data. It may seem overwhelming for now, but you can get acquainted with them by exploring and experimenting, especially if you use the cloud for your work, in this case AWS.
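The header-plus-body idea is easiest to see in a VCF file, where metadata lines start with `##`, the column names start with a single `#`, and everything else is data. The sketch below assumes tab-separated text in that layout; the function name and the single variant record are invented for the example.

```python
import io


def split_header_and_body(handle):
    """Separate VCF-style text into meta lines, column names, and data rows."""
    meta, columns, rows = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):      # metadata lines
            meta.append(line[2:])
        elif line.startswith("#"):     # column header line
            columns = line[1:].split("\t")
        else:                          # body: one record per line
            rows.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, rows


# Made-up content, only to show the structure.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "##source=hypotheticalCaller\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr7\t12345\t.\tA\tG\t50\tPASS\t.\n"
)

meta, columns, rows = split_header_and_body(io.StringIO(vcf_text))
print(meta)
print(rows[0]["REF"], "->", rows[0]["ALT"])
```

Once you can spot the header and the body, most of the other text-based formats in the list become much less intimidating.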
But isn’t biological data sensitive?
Now that you have an idea of the kind of biological data and file formats you are going to work with, what about the issues around this kind of data? Unlike many other types of data, it carries personal human information, and you’re going to put it on the cloud on top of it all.
The usual concerns are data privacy and security, and the integration of current bioinformatics software with cloud platforms.
- Data privacy and security – Since bioinformatics deals with sensitive genomic data, this concern should be prioritized when choosing a cloud service provider for the job. This includes encryption and access controls.
- Software integration – As mentioned earlier, biological data comes in different formats which may need to be properly integrated with the cloud services. Additionally, bioinformatics often requires complex algorithms and large computational resources for data analysis. The compatibility between bioinformatics software and cloud services must be ensured as well.
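To make the encryption and access-control point concrete, here is a minimal sketch of how a storage bucket could be locked down from Python. It assumes you use boto3 (the AWS SDK for Python) and an Amazon S3 bucket; `harden_bucket` and `RecordingClient` are hypothetical helpers, and the bucket name is only a placeholder. The stand-in client lets you see which calls would be made without touching a real account.

```python
def harden_bucket(s3_client, bucket_name, kms_key_id=None):
    """Turn on default encryption and block public access for one bucket."""
    if kms_key_id:
        default_encryption = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_id}
    else:
        default_encryption = {"SSEAlgorithm": "AES256"}
    s3_client.put_bucket_encryption(
        Bucket=bucket_name,
        ServerSideEncryptionConfiguration={
            "Rules": [{"ApplyServerSideEncryptionByDefault": default_encryption}]
        },
    )
    s3_client.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )


class RecordingClient:
    """Stand-in that records calls instead of contacting AWS (dry run only)."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, operation):
        def record(**kwargs):
            self.calls.append((operation, kwargs))
            return {}
        return record


dry_run = RecordingClient()
harden_bucket(dry_run, "myomicsbucket")  # placeholder bucket name
print([name for name, _ in dry_run.calls])

# With real credentials you would pass an actual client instead, e.g.:
#   import boto3
#   harden_bucket(boto3.client("s3"), "your-bucket-name")
```

Who is allowed to read the data is a separate question handled through IAM policies, so this covers only part of the privacy picture.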
Even with these concerns, progress has already been made in combining bioinformatics work with cloud computing. A 2021 article by Andy Powell entitled “Scaling genomics workloads using HPC on AWS” is one example. In particular, it uses High Performance Computing for Healthcare & Life Sciences. AWS HPC can bring instant access to computing resources in order to accelerate structure-based drug design.
According to the official AWS website, AWS infrastructure accelerates the development of protein structure solutions and molecular modeling by combining faster algorithms with increased computational power. This results in improvements in speed, accuracy, and scale for a variety of applications, including virtual screening, molecular dynamics, quantum mechanics, and 3D structure solutions. To read Powell’s full article, please visit here.
Another example is a 2022 article by Swaine Chen, Austin Cherian, Sarah Geiger, and Suma Tiruvayipati about helping bioinformaticians transition to running their workloads on AWS. You can read more about the article here.
AWS supports a lot of bioinformatics-related work. Its official website has a dedicated Genomics on AWS page, which supports genomic innovations at the intersection of biology and technology.
What if you want to run your whole bioinformatics workflow? Fret not, AWS has it too! To be specific, you can explore and use their AWS HealthOmics service for genomics and other biological data.
The AWS Solutions Library has a guidance article that walks you through building and running production-grade bioinformatics workflows at scale. You will be able to use AWS services for automation, workflow analysis, storage, and operational and cost observability. It includes an architecture diagram that you can use as the foundation for your infrastructure and update as needed.
There is an important note, however. This specific guidance article requires the use of AWS CodeCommit which is no longer available to new customers. This doesn’t mean that you can’t use AWS for your bioinformatics work though! You just need to make some adjustments if you have decided to follow the article. Don’t worry though, there will be a hands-on activity that you can try and follow along in a minute!
Tasks of a bioinformatician
With all the basic concepts and issues you might encounter out of the way… Let’s say you have decided to try out this field. What tasks are you usually involved with?
In the YouTube video “Data Science for Bioinformatics” by Data Professor, he mentions four common tasks involved in bioinformatics. While they are not everything that a bioinformatician does, they are a good starting point for getting a basic idea.
- Search – Search public datasets for information on genes/proteins/RNA/pathways. You can use the datasets mentioned in the previous sections.
- Compare – Sequence alignment to discern similarity/differences amongst various genes/proteins/RNA.
- Model – Build structural models of proteins or predictive models using retrospective data. These are specific applications of model building in a data science project.
- Integrate and Curate – Combine heterogeneous data sources. Some biological data are not in the same place, so integrating them for easier analysis is also a task you will need to perform.
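The "Compare" task above can be sketched with a classic technique: a Needleman-Wunsch global alignment score, plus a simple percent identity for sequences that are already the same length. The scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions for the example, and real projects would normally use an established alignment tool rather than these hypothetical helpers.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score for aligning sequences a and b end to end."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def percent_identity(a, b):
    """Share of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


print(global_alignment_score("GATTACA", "GATTACA"))  # 7: seven matches
print(global_alignment_score("GATTACA", "GATACA"))   # 5: six matches, one gap
print(round(percent_identity("GATTACA", "GACTATA"), 1))  # 71.4
```

A higher score means the two sequences are more similar under the chosen scoring scheme, which is the basic intuition behind sequence comparison.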
You can combine these with the data science lifecycle that is also followed in other industries, which includes data collection, data preparation, data exploration, model planning, model building, data analysis, and deployment. Depending on your project, you may add or subtract from this general lifecycle. That’s where data science shines: the freedom to be creative!
Role of AWS in Bioinformatics
Now it’s time to get friendly with the AWS services that you will be encountering. In particular, you will be using the AWS HealthOmics service for storing, querying, analyzing, and generating insights from genomics and other biological data.
AWS HealthOmics has three components:
- HealthOmics Storage – Will help you store and share petabytes of genomics data efficiently and at low cost per gigabase.
- HealthOmics Analytics – Will help simplify the process of preparing genomics data for multiomics and multimodal analyses.
- HealthOmics Workflows – Will automatically provision and scale the underlying infrastructure for your bioinformatics computation.
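If you prefer code over the console, the storage and workflow components can also be reached from Python. The sketch below assumes boto3's `omics` client and only looks at the first page of results from `list_reference_stores`, `list_sequence_stores`, and `list_workflows`; `healthomics_inventory` and `CannedClient` are hypothetical helpers, and the canned responses are made up so the example runs without an AWS account.

```python
def healthomics_inventory(omics_client):
    """Count the stores and workflows visible to this client (first page only)."""
    return {
        "reference_stores": len(
            omics_client.list_reference_stores().get("referenceStores", [])
        ),
        "sequence_stores": len(
            omics_client.list_sequence_stores().get("sequenceStores", [])
        ),
        "workflows": len(omics_client.list_workflows().get("items", [])),
    }


class CannedClient:
    """Stand-in that returns made-up responses instead of contacting AWS."""

    def __init__(self, responses):
        self.responses = responses

    def __getattr__(self, operation):
        def respond(**kwargs):
            return self.responses.get(operation, {})
        return respond


fake = CannedClient({
    "list_reference_stores": {"referenceStores": [{"name": "actb-reference-store"}]},
    "list_workflows": {"items": [{"name": "wf-a"}, {"name": "wf-b"}]},
})
print(healthomics_inventory(fake))

# With real credentials (region is an assumption; use one that supports HealthOmics):
#   import boto3
#   print(healthomics_inventory(boto3.client("omics", region_name="eu-west-2")))
```

A brand-new account would simply report zeros here, which is a quick way to confirm that your credentials and region are set up correctly.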
For clarity, this AWS service also has its limits. It can only be used for “transferring, storing, formatting, or displaying of data, and for the provision of infrastructure and configuration support for managing workflows,” as written in the official documentation.
You can’t use this service to directly perform variant calling or genomic analysis and interpretation. In short, this service can’t be used as a substitute for existing third-party tools designed for genomic analyses. However, that doesn’t mean that AWS HealthOmics can’t help you with your bioinformatics work.
To read more about the AWS HealthOmics service, please refer to the documentation here.
Hands-on activity: Let’s set up your AWS account and utilize the AWS HealthOmics service!
For the exciting part… You are going to explore AWS HealthOmics!
What if you’re stuck and you need help? Can you use Stack Overflow? Yes! But you can also use its counterpart in the bioinformatics world, biostars.org, for more biology-oriented, specific queries.
Now let’s get you set up for your AWS journey in bioinformatics.
Steps to follow are below:
1. Create an AWS account. Before you can access any of the services, you are required to create your AWS account. For more assistance in creating your account, please visit their guide here. Make sure you have completed setting up your account.
2. Visit your Console Home and navigate to Services. Here, you can choose or search for ‘AWS HealthOmics’. The service is not available in every region, so London is chosen in this example. You can choose any other supported region.
After clicking ‘Getting Started’, you will see an overview of the three components (storage, workflow, analytics) of AWS HealthOmics.
3. Look for biological data. You can use the resources mentioned earlier to look for data. In this activity, you can use this FASTA file. To follow along, you can access it here. It is the ACTB actin beta (Homo sapiens) gene.
Download the gene sequences (FASTA).
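Before uploading anything, it can help to sanity-check the FASTA file you downloaded. The sketch below reports the length and GC content of each record; it runs on a short made-up sequence so it works anywhere, and the helper names `gc_content` and `summarize_fasta` are hypothetical.

```python
def gc_content(sequence):
    """Percentage of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return 100.0 * sum(base in "GC" for base in sequence) / len(sequence)


def summarize_fasta(text):
    """Return {record_id: (length, gc_percent)} for FASTA-formatted text."""
    summary, record_id, chunks = {}, None, []
    for line in text.splitlines() + [">"]:  # trailing '>' flushes the last record
        if line.startswith(">"):
            if record_id is not None:
                sequence = "".join(chunks)
                summary[record_id] = (len(sequence), round(gc_content(sequence), 1))
            record_id, chunks = line[1:].split(" ")[0], []
        elif line.strip():
            chunks.append(line.strip())
    return summary


# Made-up record, only for demonstration.
demo = ">demo_record made-up sequence\nATGGATGATG\nATATCGCTTAA\n"
print(summarize_fasta(demo))  # {'demo_record': (21, 33.3)}
```

To check your real download, you could read the file first, for example `summarize_fasta(open("gene.fna").read())`, assuming the file sits in your working directory.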
4. Create a reference store. This is to create a data store that will hold your reference genome files. Make sure that your file is also available in Amazon S3 (more on this later).
Go to ‘Storage’ > ‘Reference store’. Click ‘Import reference genome’.
Choose Manual Create and add the reference store name: actb-reference-store.
Create a new S3 bucket. This will provide redundancy and data protection for your genomic data. Go to Amazon S3 in a new tab. Click the ‘Create Bucket’ button.
Add a bucket name. Here in the example, the bucket name is myomicsbucket. Feel free to choose anything you want, and scroll down to create the bucket.
Then select the bucket you have created and upload the dataset you downloaded from NCBI earlier.
It should contain these three files:
The ‘gene.fna’ file is the most important one here, since it will be imported as the reference genome.
Going back to your reference store, scroll down to Reference Genome Details and add ‘actbHGNC132’ as the genome name. Then click ‘Browse S3’ to choose the bucket created earlier.
Scroll down to the Service access section and choose the ‘Create and use a new service role’ option. Choose any service role name.
At the very bottom, click ‘Import reference genome’. If the import is successful, you can proceed to the next step.
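The console steps in this part of the walkthrough can also be scripted. The following is a rough sketch under several assumptions: it uses boto3's `s3` and `omics` clients, it expects an IAM role that HealthOmics can assume and that can read the bucket (the console creates one for you, while in code you must supply its ARN), and the role ARN and canned IDs shown are placeholders. `import_reference` and `DryRunClient` are hypothetical helpers, and the dry-run client lets the example run without an AWS account.

```python
def import_reference(s3_client, omics_client, *, region, bucket, local_path,
                     store_name, genome_name, role_arn):
    """Create a bucket, upload the FASTA, and import it as a reference genome."""
    key = local_path.split("/")[-1]
    # Note: for us-east-1, S3 expects CreateBucketConfiguration to be omitted.
    s3_client.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    store = omics_client.create_reference_store(name=store_name)
    job = omics_client.start_reference_import_job(
        referenceStoreId=store["id"],
        roleArn=role_arn,
        sources=[{"sourceFile": f"s3://{bucket}/{key}", "name": genome_name}],
    )
    return store["id"], job["id"]


class DryRunClient:
    """Stand-in that records calls and returns made-up responses."""

    def __init__(self, responses=None):
        self.calls = []
        self.responses = responses or {}

    def __getattr__(self, operation):
        def record(**kwargs):
            self.calls.append((operation, kwargs))
            return self.responses.get(operation, {})
        return record


s3 = DryRunClient()
omics = DryRunClient({
    "create_reference_store": {"id": "store-123"},      # made-up ID
    "start_reference_import_job": {"id": "job-456"},    # made-up ID
})
ids = import_reference(
    s3, omics,
    region="eu-west-2",  # London, as in the walkthrough
    bucket="myomicsbucket",
    local_path="gene.fna",
    store_name="actb-reference-store",
    genome_name="actbHGNC132",
    role_arn="arn:aws:iam::123456789012:role/HypotheticalOmicsImportRole",
)
print(ids)
```

Swapping the two dry-run clients for `boto3.client("s3", region_name=...)` and `boto3.client("omics", region_name=...)` would send the same requests to your account, so double-check names and regions before doing that.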
5. Use the imported reference genome. You can now start using this reference genome for the other sequences you will experiment with. A reference genome is important because it provides the context needed for analyzing sequences. Take note of the generated URI so that you can use this specific reference genome later.
6. You are now ready to explore other components of AWS HealthOmics. Since you have tried the usual first step in using the storage component of this service, why not try uploading sequence data to the sequence store? You can reference the genome you’ve uploaded. Try to experiment with other biological data and build your projects to learn!
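As a possible next experiment, here is a sketch of importing a FASTQ file into a sequence store and linking it to the reference genome. It assumes boto3's `omics` client with `create_sequence_store` and `start_read_set_import_job`; the S3 path, ARNs, subject and sample IDs, and store name are all placeholders, and `import_reads` and `DryRunClient` are hypothetical helpers so the example runs without contacting AWS.

```python
def import_reads(omics_client, *, store_name, role_arn, fastq_uri,
                 reference_arn, subject_id, sample_id):
    """Create a sequence store and start importing one FASTQ file into it."""
    store = omics_client.create_sequence_store(name=store_name)
    job = omics_client.start_read_set_import_job(
        sequenceStoreId=store["id"],
        roleArn=role_arn,
        sources=[{
            "sourceFiles": {"source1": fastq_uri},
            "sourceFileType": "FASTQ",
            "subjectId": subject_id,
            "sampleId": sample_id,
            "referenceArn": reference_arn,  # ties the reads to your reference genome
        }],
    )
    return store["id"], job["id"]


class DryRunClient:
    """Stand-in that records calls and returns made-up responses."""

    def __init__(self, responses=None):
        self.calls = []
        self.responses = responses or {}

    def __getattr__(self, operation):
        def record(**kwargs):
            self.calls.append((operation, kwargs))
            return self.responses.get(operation, {})
        return record


omics = DryRunClient({
    "create_sequence_store": {"id": "seqstore-1"},      # made-up ID
    "start_read_set_import_job": {"id": "readjob-1"},   # made-up ID
})
result = import_reads(
    omics,
    store_name="my-sequence-store",                      # placeholder
    role_arn="arn:aws:iam::123456789012:role/HypotheticalOmicsImportRole",
    fastq_uri="s3://myomicsbucket/reads.fastq",          # placeholder path
    reference_arn="arn:aws:omics:eu-west-2:123456789012:referenceStore/0000000000/reference/0000000000",
    subject_id="subject-1",
    sample_id="sample-1",
)
print(result)
```

The reference ARN placeholder stands for the identifier of the genome you imported earlier, which you can copy from the reference store page in the console.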
Conclusion
That was fun! It’s a simple hands-on activity to get you acquainted with the reference store, one of the storage components of AWS HealthOmics. You can dig deeper into the other components too; check the documentation for more information.
Now you know that you don’t have to choose. You love biology and technology? Feel free to explore Bioinformatics then! You can also start building projects on your own, using this article as a starting point to expand from. Technology is a wonderful thing; it can bridge different fields of your choosing. It is the perfect platform for interdisciplinary fields, which can help advance what we currently know and can do. Biology and technology, both evolving fields, will continue to grow. Who knows what they hold for us in the future? Stay curious, experiment, fail, and learn.
References
You can read more about Bioinformatics, Biology, cloud, and AWS here.
Bioinformatics
- Definition.
- Data Science for Bioinformatics
- Beginner guide in bioinformatics
- Bioinformatics
- Bioinformatics data 1
- File formats
- Bioinformatics VS Data Science
- Overview of Bioinformatics and its applications
Biology
Cloud Computing
- Cloud computing in Bioinformatics: A game-changer for big data analysis
- Cloud Computing in Bioinformatics: Benefits and Barriers
- Helping bioinformaticians transition to running workloads on AWS
- Scaling genomics workloads using HPC on AWS
- Genomics on AWS
- Guidance for Development, Automation, Implementation, and Monitoring of Bioinformatics Workflows on AWS
- AWS for Bioinformatics
- AWS HealthOmics Documentation
- High Performance Computing for Healthcare and Life Sciences