Bioinformatics Program

Whole-genome sequencing and other genomic, proteomic and metabolomic technologies generate massive amounts of data — your genome alone contains some 6 billion pieces of code in the form of the letters A, G, T and C. Finding genetic and molecular variants within these data that are related to disease is like searching for the proverbial needle in a haystack.

The Bioinformatics Program takes "raw" genomic sequencing data and processes, analyzes and interprets it, advancing research within the center's translational programs, ultimately leading to individualized tests and treatments for patients. When needed, the program develops novel bioinformatics methods to further advance our data processing and data interpretation capabilities.

The program builds on the already extensive bioinformatics resources and activities in the Division of Biomedical Statistics & Informatics at Mayo Clinic, as well as Mayo collaborations with the University of Minnesota, Arizona State University, the University of Illinois at Urbana-Champaign and the National Center for Supercomputing Applications.

Areas of focus

Data pre-processing

Data pre-processing aims to convert "raw" genomic data into biologically interpretable information. This analytical step is the most time-consuming and requires the largest amount of storage space.

Our program has designed and implemented a suite of pre-processing workflows, including several open-source or in-house applications that have been calibrated for optimal variant calling. Workflows are also engineered to run on highly parallel systems and enabled for cloud computing.

Workflows are available to preprocess:

  • DNA-seq data for the characterization of:
    • Variants (SNVs, indels)
    • Structural variants (CNVs, translocations, inversions)
  • mRNA-seq data for the characterization of:
    • Variants found in DNA-seq data
    • Expression level of genes
    • Splice variants
    • Fusion genes
  • Methyl-seq data from the RRBS protocol or whole-genome sequencing
  • ChIP-seq data for the identification of transcription factor binding sites and histone modifications
  • ChIP-exo data for the identification of transcription factor binding sites and histone modifications
  • miRNA-seq data for the quantification of microRNAs
  • linc-RNA data for identification and quantification of long non-coding RNAs
  • Microbiome and metagenomics data
  • Proteomics/proteogenomics data
  • Metabolomics data
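
As a small illustration of the variant categories listed above, the sketch below labels a variant as an SNV, insertion or deletion by comparing its reference and alternate alleles. It is a minimal example and not the program's actual workflow: the function name, the VCF-style allele representation and the sample records are all assumptions made for illustration.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a variant from its reference and alternate alleles.

    Assumes VCF-style alleles (plain strings of A, C, G, T); this
    convention is an assumption, not something stated by the program.
    """
    ref, alt = ref.upper(), alt.upper()
    if len(ref) == 1 and len(alt) == 1:
        # One base replaced by another: single-nucleotide variant
        return "SNV" if ref != alt else "no change"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    # Same length, more than one base: multi-nucleotide substitution
    return "MNV"


# Hypothetical records: (chromosome, position, ref allele, alt allele)
variants = [
    ("chr1", 1000, "A", "G"),
    ("chr2", 2000, "AT", "A"),
    ("chr3", 3000, "C", "CGG"),
]

labels = [classify_variant(ref, alt) for _, _, ref, alt in variants]
print(labels)  # ['SNV', 'deletion', 'insertion']
```

Insertions and deletions together make up the "indels" category. Structural variants such as CNVs, translocations and inversions span much larger regions and cannot be detected from a simple allele comparison like this one.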

Data integration

After pre-processing, genomic results obtained from patients with similar diseases, tumor type or tumor stage are integrated into a centralized Mayo Clinic repository called the Biologically Oriented Repository Architecture (BORA). In this repository, data are modeled to follow the organization of a biological system. BORA provides interfaces to applications for annotation of variants.

Importantly, BORA can integrate genomic data from Mayo patients with similar data retrieved from external publicly available data sources, such as The Cancer Genome Atlas, a National Institutes of Health catalog of genomic changes in cancer, and the Gene Expression Omnibus, a National Center for Biotechnology Information public repository.

BORA is interfaced to the Enterprise Data Trust, a Mayo Clinic data warehouse that stores the clinical history of patients. This interface provides centralized access to comprehensive clinical and genomics information on Mayo patients as well as biological knowledge from both Mayo Clinic and public sources, allowing the original genomic data to be modeled and studied in the context of a larger biological system.

Data analysis and interpretation

In this final step, bioinformaticians analyze genomic results, help with their interpretation, and then prioritize and report their findings to investigators.

Association methods are used to identify variants that are significantly associated with the disease phenotype or directly involved in the biological mechanisms underlying the disease. So in addition to looking for the "what," bioinformaticians investigate the "why" — in other words, the cascade of molecular events that lead to the development of a disease.
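
One simple form of association testing can be sketched as follows. The example applies a two-sided Fisher's exact test to a 2x2 table counting how many cases and controls carry a given variant. The choice of test, the function name and the counts are assumptions for illustration only; the program's actual association methods are not described here and may differ.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: cases, controls. Columns: carriers, non-carriers.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x carriers among the cases,
        # with all row and column totals held fixed
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one
    # (small tolerance guards against floating-point ties)
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: 8 of 10 cases carry the variant, 1 of 6 controls does
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p_value:.4f}")  # p = 0.0350
```

A small p-value suggests that the variant is carried more often by cases than chance alone would explain. In a genome-wide setting, many variants are tested at once, so significance thresholds need to be corrected for multiple testing.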

The mechanistic understanding of a disease may lead to new or better ways of treating patients by identifying existing drugs that can be repositioned to treat specific conditions or discovering new drug targets for which new therapies can be developed.

Methods development

The Bioinformatics Program is actively engaged in the development of novel methods that are published in peer-reviewed journals. Methods and applications are made freely available via bioinformatics software packages.

Program leader

Jean-Pierre A. Kocher, Ph.D., Director