Friday, February 9, 2018

New technique closes gaps in GRCm38.p6

Ongoing efforts to close gaps and to correct clone problems remaining in the GRCm38 mouse reference assembly have proved difficult. The available clone library resources have been exhausted, and the remaining gaps are recalcitrant to cloning, with either no clones available or gap-spanning clones deleted for the expected genomic sequence. The GRC has previously used contigs from publicly available whole genome shotgun assemblies to provide sequence at some of these gaps, and in some cases has been able to close gaps entirely with this approach. Nonetheless, several hundred sequence gaps, many of which are known to contain genes, remain.

With the release of 17 strain-specific genome assemblies from the Mouse Genomes Project, the GRC evaluated alignments between C57BL/6NJ, the most closely related strain, and the GRCm38 reference (C57BL/6J). This evaluation found that genes missing from the reference assembly were present in the new strain assembly. Utilising the C57BL/6J read set (PRJNA51977) deposited in GenBank by the Broad Institute, and used in the production of the C57BL/6J ALLPATHS WGS assembly GCA_000185105.2, the GRC sought to generate local assemblies from these reads that could be used for curation of the GRCm38 reference. The read set was first aligned to the C57BL/6NJ assembly using bwa-mem. Reads aligning to the regions of the C57BL/6NJ assembly that correspond to GRCm38 gaps and to the locations of clone-assembly problems in the GRCm38 reference were then identified and assembled using the Geneious software platform (version 10.1.3). The resulting assembly BAMs were loaded into GAP5 for manual curation, and the assembled WGS contigs were submitted to GenBank.
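The read-selection step described above can be illustrated with a minimal Python sketch. It is not the GRC's actual tooling: the function names, the `flank` parameter, and the coordinates in the usage note are hypothetical. It assumes the alignments are available as SAM text (the format bwa-mem writes) and that the regions of the strain assembly corresponding to reference gaps are given as 0-based, half-open intervals.

```python
import re
from collections import defaultdict

# CIGAR operations that consume reference bases (per the SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def reference_span(pos_1based, cigar):
    """Return the 0-based, half-open reference interval covered by an alignment."""
    start = pos_1based - 1
    length = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return start, start + length


def reads_in_regions(sam_lines, regions, flank=0):
    """Collect names of mapped reads overlapping any (chrom, start, end) region.

    `flank` (hypothetical parameter) widens each region on both sides so that
    reads anchoring the edges of a gap are also kept.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((max(0, start - flank), end + flank))

    selected = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        name, flag, chrom = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":  # skip unmapped reads
            continue
        a_start, a_end = reference_span(pos, cigar)
        if any(a_start < r_end and r_start < a_end
               for r_start, r_end in by_chrom.get(chrom, ())):
            selected.add(name)
    return selected
```

For example, calling `reads_in_regions(open("aln.sam"), [("chr15", 140, 200)], flank=500)` (file name and coordinates invented for illustration) would return the set of read names to pass on to a local assembler. A real pipeline would more likely query an indexed BAM file than scan SAM text line by line.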

The patch release GRCm38.p6 addresses 20 regions with these newly created and submitted sequences. These contigs fix and improve representation for several genes, examples of which are shown in Table 1 and Figure 1.

Table 1: Examples of issues fixed in GRCm38.p6 using assembled Illumina reads.

Figure 1. Top: Incomplete representation of the Anxa13 gene in GRCm38 due to a deletion in reference component AC152395.9. Middle: Clone error corrected in GRCm38.p6. The fix patch uses MF597750.1 and MF597749.1 to add the deleted sequence to AC152395.9, providing a complete representation of Anxa13. Bottom: Representation of Anxa13 by reference chr. 15 and by the fix patch, highlighting the complete representation of Anxa13 (NM_027211.2) in the patch.