Wednesday, March 9, 2016

GRC reference assembly curation with BioNano maps

Commercial whole genome mapping systems, such as OpGen and BioNano, are playing an increasingly important role in a variety of genomic analyses, including de novo assembly, structural variant detection and assembly curation.

Comparison of a reference assembly to a collection of whole genome maps can help curators find potential regions of misassembly and identify genomic variations that are candidates for representation in alternate loci scaffolds. For example, optical maps played a key role in the GRC's resolution of the human 10q11.22 tiling path, as described in this prior GRC blog post.

To assist with curation efforts, the Wellcome Trust Sanger Institute, a GRC member, has integrated BioNano data sets for human, mouse and zebrafish assemblies into the gEVAL browser. They have also provided data from OpGen maps and optical maps from David Schwartz (PMID: 20534489). Common curation use cases for these browser data include:

  • Gap sizing
  • Confirmation of assembly components
  • Identification of problems in assembly components
  • Identification of assembly regions missing sequence
  • Detection of haplotype over-expansions

The example below illustrates how these data can be used.

The issue (HG-172) reports a possible assembly error or missing sequence in reference component AL691432.54 that affects CDC2L1 (GeneID:984) in GRCh37. To investigate, navigate to this region in gEVAL and turn on the following tracks:
  • REFSEQ transcript mappings
  • BspQI in silico digest
  • JIRA issue entries
  • Bionano genome maps
    • NA12878
    • NA24143 (Ashkenazim trio mother)
    • NA24149 (Ashkenazim trio father)
    • NA24385 (Ashkenazim trio son)
    • NA24631 (Chinese trio son)
In the REFSEQ track there are 4 distinct transcript mapping groups, most of which are orange, indicating that the mapping is incomplete in terms of coverage. This suggests there may be an issue in the assembly. From left to right, the groups are mib2, mmp23A/B, cdc2L1 (cdk11b) and slc35e2.
Jump to this Region in gEVAL
Zooming in on the red features in the Bionano genome maps clearly shows the fragment sizes between Bionano BspQI nicks/labels. The red features within each map indicate a discordance in size with the BspQI in silico digest.
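
For readers unfamiliar with the digest track, here is a minimal Python sketch of how such a digest can be computed from assembly sequence. It assumes the BspQI recognition motif GCTCTTC (searched on both strands), ignores the exact nick offset, and uses a toy sequence. The function names are hypothetical and this is not gEVAL's implementation.

```python
import re

BSPQI_SITE = "GCTCTTC"      # assumed recognition motif; exact nick offset is ignored
BSPQI_SITE_RC = "GAAGAGC"   # reverse complement, for sites on the opposite strand

# A lookahead is used so that overlapping motif occurrences are all reported.
_SITE_PATTERN = re.compile(f"(?=({BSPQI_SITE}|{BSPQI_SITE_RC}))")


def label_positions(sequence):
    """Return sorted 0-based start positions of the motif on either strand."""
    return [m.start() for m in _SITE_PATTERN.finditer(sequence.upper())]


def fragment_sizes(sequence):
    """Return the distances (in bp) between consecutive label positions."""
    positions = label_positions(sequence)
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy sequence with one site on each strand (not real assembly sequence).
toy = "AAGCTCTTCAAAAAGAAGAGCTT"
print(label_positions(toy))
print(fragment_sizes(toy))
```

The fragment sizes from a digest like this are what the genome map fragments are compared against in the browser.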

The in silico digest shows that the region in question has a fragment size of ~26.5 kb on the assembly, whereas all 5 genome maps show two fragments totaling ~30 kb. This discordant region also encompasses the cdc2L1 gene, for which further analysis indicates 2 missing exons. Together, this is evidence that the region is not represented correctly and is missing roughly 3-4 kb of sequence.
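
The size comparison above can be written out as a rough Python sketch. The two sizes are the approximate figures quoted in this post, applied to all five maps because only one figure is given. The 1 kb tolerance is a hypothetical cutoff chosen for illustration, not a gEVAL or BioNano setting.

```python
IN_SILICO_KB = 26.5   # approximate in silico fragment size on GRCh37 (from this post)
MAP_TOTAL_KB = 30.0   # approximate total of the two map fragments (from this post)
TOLERANCE_KB = 1.0    # hypothetical cutoff for calling a size discordance

SAMPLES = ["NA12878", "NA24143", "NA24149", "NA24385", "NA24631"]

# The post gives one approximate figure for all five maps, so each sample
# is assigned the same value here.
observed_kb = {sample: MAP_TOTAL_KB for sample in SAMPLES}


def size_difference_kb(map_kb, reference_kb):
    """Positive values mean the map is longer than the reference fragment."""
    return map_kb - reference_kb


def is_discordant(map_kb, reference_kb, tolerance_kb=TOLERANCE_KB):
    """Flag a map fragment whose size differs from the reference by more than the cutoff."""
    return abs(size_difference_kb(map_kb, reference_kb)) > tolerance_kb


for sample, map_kb in observed_kb.items():
    diff = size_difference_kb(map_kb, IN_SILICO_KB)
    print(sample, f"{diff:+.1f} kb", "discordant" if is_discordant(map_kb, IN_SILICO_KB) else "concordant")
```

With these approximate inputs every map is about 3.5 kb longer than the reference fragment, in line with the estimate of roughly 3-4 kb of missing sequence.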

Upon further review and analysis, the GRC curation team was able to add new clone sequence to represent this region correctly. The new path is shown in GRCh38 below. Clone FO704657, added to the assembly path, contains the complete coding region for the cdc2L1 (cdk11b) gene. It also corrects the flanking gene mappings, as indicated by the green mappings of the REFSEQ transcripts. The discordant red features in the Bionano genome maps are no longer present, replaced by concordant fragment mappings. The issue can be updated as resolved.
Jump to this Region in GRCh38

For more information on how to use BioNano displays in gEVAL, check out the gEVAL browser blog.