Ongoing efforts to close gaps and to correct clone problems remaining in the GRCm38 mouse reference assembly have proved difficult. The available clone library resources have been exhausted, and the remaining gaps are recalcitrant to cloning, with either no clones available or gap-spanning clones deleted for the expected genomic sequence. The GRC has previously used contigs from publicly available whole genome shotgun assemblies to provide sequence at some of these gaps, and in some cases has been able to close gaps entirely with this approach. Nonetheless, several hundred sequence gaps, many of which are known to contain genes, remain.
With the release of 17 strain-specific genome assemblies from the Mouse Genomes Project, the GRC evaluated alignments between C57BL/6NJ, the most closely related strain, and the GRCm38 reference (C57BL/6J). This evaluation found that genes missing from the reference assembly were present in the new strain assembly. Utilising the C57BL/6J read set (PRJNA51977) deposited in GenBank by the Broad Institute, and used in the production of the C57BL/6J ALLPATHS WGS assembly GCA_000185105.2, the Genome Reference Consortium sought to generate local assemblies from these reads that could be used for curation of the GRCm38 reference. The read set was first aligned to the C57BL/6NJ assembly using bwa-mem. Reads aligning to the C57BL/6NJ regions corresponding to GRCm38 gaps, and to the locations of clone-assembly problems in the GRCm38 reference, were then identified and assembled using the Geneious software platform (version 10.1.3). The resulting assembly BAMs were loaded into GAP5 for manual curation, and the assembled WGS contigs were submitted to GenBank.
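The read-selection step described above (keeping only reads whose alignments fall in the C57BL/6NJ regions that correspond to reference gaps or clone problems) can be sketched in a few lines. This is a minimal illustration, not the GRC's actual pipeline: it assumes the bwa-mem alignments have already been parsed into simple records and that the target regions are known as half-open coordinates on the C57BL/6NJ assembly. The names `Alignment`, `Region`, and `reads_by_region`, and the region labels, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class Alignment(NamedTuple):
    """One aligned read (hypothetical record; fields parsed from a BAM elsewhere)."""
    read_name: str
    contig: str  # sequence name in the C57BL/6NJ assembly
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive


class Region(NamedTuple):
    """A C57BL/6NJ interval corresponding to a reference gap or clone problem."""
    label: str   # hypothetical identifier for the issue being addressed
    contig: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive


def reads_by_region(
    alignments: Iterable[Alignment], regions: Iterable[Region]
) -> Dict[str, Set[str]]:
    """Return, for each region label, the names of reads overlapping that region."""
    regions = list(regions)

    # Group regions by contig so each alignment is only compared with
    # regions on the sequence it aligned to.
    by_contig: Dict[str, List[Region]] = defaultdict(list)
    for region in regions:
        by_contig[region.contig].append(region)

    selected: Dict[str, Set[str]] = {region.label: set() for region in regions}
    for aln in alignments:
        for region in by_contig.get(aln.contig, []):
            # Two half-open intervals overlap when each starts before the other ends.
            if aln.start < region.end and region.start < aln.end:
                selected[region.label].add(aln.read_name)
    return selected
```

Each resulting set of read names would then be the input to a separate local assembly. With an indexed BAM, the same selection is more commonly done by querying the file by region rather than scanning every alignment; the linear scan here is only meant to make the overlap logic explicit.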
The patch release GRCm38.p6 addresses 20 regions with these newly created and submitted sequences. These contigs fix and improve representation for several genes, examples of which are shown in Table 1 and Figure 1.
| Issue number | Issue type | GenBank_ID | RefSeq_ID | Gene (Gene ID) |
|---|---|---|---|---|
| | | | | Dgkk (331374) |
| | | | | 933416I08Rik (71159) |
| | | | | Cylc1 (67407) |
| | | | | Atg4a (666468) |
| | | | | Pstpip2 (19201) |
| | | | | Baalc (118452) |
| | | | | Dnah12 (110083) |
| | | | | Ahnak2 (100041194) |
| | | | | Efcab7 (230500) |
| | | | | Kazn (71529) |
| | | | | Intu (380614) |
| | | | | Anxa13 (69787) |
| | | | | Trerf1 (224829) |
Table 1: Examples of issues fixed in GRCm38.p6 using assembled Illumina reads.
Figure 1. Top: Incomplete representation of the Anxa13 gene in GRCm38 due to a deletion in reference component AC152395.9. Middle: Clone error corrected in GRCm38.p6. The fix patch uses MF597750.1 and MF597749.1 to add the deleted sequence to AC152395.9 and provides a complete representation of Anxa13. Bottom: Representation of Anxa13 by reference chr. 15 and by the fix patch, highlighting the complete representation of Anxa13 (NM_027211.2).