Your DNA holds the blueprint for your life. Like everybody else's, it determines everything from your physical characteristics to your susceptibility to diseases. The challenge lies in decoding this complexity efficiently and accurately. This is where AWS HealthOmics comes in: an AWS service for storing, querying, analyzing, and generating insights from genomics and other biological data.
AWS HealthOmics helps scientists and researchers unlock the mysteries of DNA at scale. Is this your first time reading about the magic of AWS HealthOmics? If you want a basic rundown, you can check out my previous article about it.
AWS HealthOmics Analytics
The analytics part of HealthOmics is just one of its three components (storage, analytics, and workflows). AWS HealthOmics analytics helps store and study genomic variants and annotations.
It offers two main storage options:
- Variant store is for genomic variant data
- Annotation store is for more information about those variants
These stores let you save, manage, and search your genomic data. After you import the data, you can use Amazon Athena to run complex queries and discover insights.
Understanding Variant and Annotation Stores
You must understand the two storage resource types in HealthOmics analytics: the variant store and the annotation store. If you have read the previous article, you will also have encountered other types, such as reference and sequence stores. Reference and sequence stores hold raw genomic data, while the variant and annotation stores covered in this article hold data that is ready to be analyzed. For more clarity, here is a visualization for comparison.
Since this article is all about using the analytics component of AWS HealthOmics, we are going to focus on it. You can always refer to this article for the rundown of the different biological data formats mentioned (FASTQ, FASTA, VCF, etc.) in the image above.
Managing Data with HealthOmics Console and API
In this hands-on section, we will focus on the variant store. This is easier than jumping into the annotation store right away, as the variant store focuses on storing and querying standard genomic data formats like VCF or BCF.
1. Navigate to the HealthOmics console: https://console.aws.amazon.com/omics/
2. Choose Variant stores in the left navigation pane.
3. Before creating a variant store, make sure that you have imported a reference genome. The reference genome serves as the baseline for comparing and interpreting genomic data. You can read more in this article.
In this example, we will be using the genome assembly GRCh38.p14, which is a specific version of the GRCh38 human genome reference, with the updates included in patch version 14.
4. With the reference genome imported, you can proceed to create a variant store. Fill in the following information accordingly:
- Variant store name
- Description (optional)
- Reference genome
- Data Encryption
- Tags (optional)
5. Now you are ready to download any dataset of your choice!
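Since this section also mentions the API, here is a minimal sketch of how step 4 could look with boto3 instead of the console. It assumes the reference genome is already imported and that you know its ARN. The store name, description, and ARN below are placeholders, and the helper names are made up for this illustration.

```python
from typing import Optional


def build_variant_store_request(
    name: str,
    reference_arn: str,
    description: Optional[str] = None,
    kms_key_arn: Optional[str] = None,
    tags: Optional[dict] = None,
) -> dict:
    """Build keyword arguments for the HealthOmics create_variant_store call."""
    request = {"name": name, "reference": {"referenceArn": reference_arn}}
    if description:
        request["description"] = description
    if kms_key_arn:
        # Optional: encrypt the store with a customer managed KMS key.
        request["sseConfig"] = {"type": "KMS", "keyArn": kms_key_arn}
    if tags:
        request["tags"] = tags
    return request


def create_variant_store(request: dict) -> dict:
    """Send the request to AWS (needs boto3, credentials, and a region)."""
    import boto3  # imported lazily so the builder above works without boto3

    return boto3.client("omics").create_variant_store(**request)


if __name__ == "__main__":
    # Placeholder ARN: replace it with the ARN of your imported reference.
    example = build_variant_store_request(
        name="my_variant_store",
        reference_arn="arn:aws:omics:us-east-1:111122223333:referenceStore/1234567890/reference/1234567890",
        description="Variant store for the CACNA1C example",
    )
    print(example)
```

The fields mirror the console form in step 4: name, description, reference genome, data encryption, and tags. Calling create_variant_store(example) would send the request to your account.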
Downloading Your Data from NCBI
1. Go to NCBI, then search for the genomic variant you want to analyze. In this example, we will use a human genomic variant because our reference genome is human. If you want to follow along with the hands-on, make sure you choose ‘dbVar’ from the list of databases, like in the picture below.
2. After searching for any disease or gene-specific terms, you can choose from one of the results. In this example, we are using a dataset with the gene CACNA1C (which has been associated with various conditions such as bipolar disorder).
3. Go to your chosen study, then navigate to its ‘Genotype Information’. Click the ‘FTP site’ link to open the page where you can download the dataset.
4. Don’t forget to upload it to an S3 bucket so you can access it later for analysis.
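The upload in step 4 can also be scripted. The sketch below assumes the dataset has already been downloaded to your machine and that the bucket exists. The bucket name, file path, and "variants/" prefix are placeholders chosen for this example.

```python
from pathlib import Path
from typing import Tuple


def s3_destination(local_path: str, bucket: str, prefix: str = "variants/") -> Tuple[str, str]:
    """Return the (object key, S3 URI) a local file would get in the bucket."""
    key = f"{prefix}{Path(local_path).name}"
    return key, f"s3://{bucket}/{key}"


def upload_dataset(local_path: str, bucket: str, prefix: str = "variants/") -> str:
    """Upload the file and return its S3 URI (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so s3_destination works without boto3

    key, uri = s3_destination(local_path, bucket, prefix)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return uri


if __name__ == "__main__":
    # Placeholder names: only computes the destination, nothing is uploaded.
    print(s3_destination("downloads/study.vcf.gz", "my-genomics-bucket"))
```

Keeping all datasets under one prefix makes it easier to point a single crawler at them later.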
Performing Analytics with Amazon Athena
Now that you have both your reference genome and your variant dataset, you can analyze the data with Amazon Athena by doing the following:
1. Configure the query result location in the Athena console.
Open the Athena console.
Navigate to ‘Settings’ then click ‘Manage’.
Choose any S3 bucket where you would like the query results to be saved.
2. Configure a workgroup with Athena engine v3
Go to the Athena console again, then choose Workgroups from the left navigation bar.
Click the Create workgroup button, then fill in the details needed.
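Steps 1 and 2 can be combined in code, because a workgroup can carry its own query result location. This is a rough sketch using the boto3 Athena client. The workgroup name and bucket are placeholders, and the helper names are made up for this example.

```python
def build_workgroup_request(name: str, results_uri: str) -> dict:
    """Build keyword arguments for the Athena create_work_group call."""
    if not results_uri.startswith("s3://"):
        raise ValueError("results_uri must be an S3 URI such as s3://bucket/prefix/")
    return {
        "Name": name,
        "Description": "Workgroup for querying genomic variant data",
        "Configuration": {
            # Where Athena saves query results (step 1).
            "ResultConfiguration": {"OutputLocation": results_uri},
            # Pin the workgroup to Athena engine version 3 (step 2).
            "EngineVersion": {"SelectedEngineVersion": "Athena engine version 3"},
        },
    }


def create_workgroup(request: dict) -> dict:
    """Create the workgroup (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("athena").create_work_group(**request)


if __name__ == "__main__":
    # Placeholder names: prints the request without calling AWS.
    print(build_workgroup_request("genomics", "s3://my-genomics-bucket/athena-results/"))
```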
3. Create a table in the Data Catalog that points to the uploaded S3 data.
Navigate to the AWS Glue console, then navigate to the ‘Data Catalog’.
Go to the ‘Crawlers’ section from the navigation pane on the left. Then select ‘Create crawler’. Fill in the details needed.
Add the S3 bucket where your dataset is located as the data source.
Create a new IAM role.
Create a database by selecting Add database. Then add it as the target database.
You can now create the crawler and run it. This will create tables in your database.
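The crawler setup in step 3 can be scripted as well. The sketch below assumes the IAM role for the crawler already exists and can read the bucket. The crawler name, role ARN, database name, and S3 path are placeholders.

```python
def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # target database for the generated tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # data source to crawl
    }


def create_and_run_crawler(request: dict) -> None:
    """Create the database and crawler, then start it (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue")
    # create_database raises an error if the database already exists.
    glue.create_database(DatabaseInput={"Name": request["DatabaseName"]})
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


if __name__ == "__main__":
    # Placeholder values: prints the request without calling AWS.
    print(
        build_crawler_request(
            name="variant-crawler",
            role_arn="arn:aws:iam::111122223333:role/MyGlueCrawlerRole",
            database="genomics_db",
            s3_path="s3://my-genomics-bucket/variants/",
        )
    )
```

The crawler runs asynchronously, so the tables appear in the database only after the run finishes.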
4. Run a simple query with the Athena console.
Go to the ‘Editor’ tab.
Verify that the ‘Data source’ is AwsDataCatalog, and that the ‘Database’ is the same as the one created in the previous step.
You can now try querying your dataset! (Note: There are cases in which you might need to manually change the table schema if it’s not detected properly.)
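If you prefer to run the query from a script, the sketch below shows one way to do it with boto3. The table name depends on what the crawler detected, so "cacna1c_variants", "genomics_db", and "genomics" are placeholders, and the preview query is only a starting point for exploring the detected schema.

```python
import time
from typing import List, Optional


def build_preview_query(table: str, limit: int = 10) -> str:
    """Return a SELECT that previews the first rows of a crawled table."""
    # Accept only letters, digits, and underscores in the table name.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    return f'SELECT * FROM "{table}" LIMIT {int(limit)}'


def run_query(sql: str, database: str, workgroup: str, poll_seconds: float = 1.0) -> List[List[Optional[str]]]:
    """Run a query in Athena and return the first page of rows as lists of strings."""
    import boto3  # imported lazily so build_preview_query works without boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
    )["QueryExecutionId"]

    # Poll until the query reaches a final state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {state}")

    # For SELECT queries, the first returned row holds the column names.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]


if __name__ == "__main__":
    # Placeholder table name: prints the SQL without calling AWS.
    print(build_preview_query("cacna1c_variants"))
```

A call such as run_query(build_preview_query("cacna1c_variants"), "genomics_db", "genomics") would then execute the preview in the workgroup configured earlier.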
Best Practices
Now that you understand how to manage and analyze genomic data, here are the best practices you should keep in mind for better implementation:
- Ensure proper format conversion before import. Refer to the documentation to make sure your data is in a supported format.
- Data storage. You have learned about the different store types (reference, sequence, variant, and annotation) and their appropriate uses. Use them accordingly.
- Access control. Don’t forget to set up IAM policies to manage user permissions and restrict access to your biological data.
- Integration with other AWS services. Don’t limit yourself to AWS HealthOmics. It works alongside other services, such as Amazon S3 for storage, AWS Glue for cataloging, and Amazon Athena for querying, as shown in this article.
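To make the access control point more concrete, here is a rough sketch of a read-only IAM policy document built in Python. The variant store ARN and bucket name are placeholders, and the set of actions is an assumption for illustration. Check the IAM documentation for the exact permissions your workflow needs.

```python
import json


def build_read_only_policy(variant_store_arn: str, bucket: str) -> dict:
    """Return an IAM policy document granting read-only access to the example data."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadVariantStore",
                "Effect": "Allow",
                "Action": ["omics:GetVariantStore"],
                "Resource": variant_store_arn,
            },
            {
                "Sid": "ListDatasetBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "ReadDatasetObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN and bucket: prints JSON you could paste into the IAM console.
    policy = build_read_only_policy(
        "arn:aws:omics:us-east-1:111122223333:variantStore/my_variant_store",
        "my-genomics-bucket",
    )
    print(json.dumps(policy, indent=2))
```

Starting from read-only statements like these and adding permissions only when needed keeps access to the data narrow.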
The mysteries encoded in our DNA hold the key to understanding life, health, and illness. With AWS HealthOmics, managing and analyzing genetic data has become less difficult and more efficient. As you use these services, don’t forget to adopt the best practices above.
Remember, you are not only decoding genomes but also paving the path for the future of personalized medicine and different genetic discoveries. AWS HealthOmics is your partner for transforming genetic data into life-changing insights.
References
Genomics: Scalable Data Storage and Processing with Cloud Computing
Introduction to AWS HealthOmics: Streamlining Bioinformatics with Amazon’s Latest Service
Configuring Athena for queries