Wednesday, July 22, 2020

GRCm39: the new mouse reference genome assembly

The GRC is pleased to announce the release of GRCm39 (GCA_000001635.9), the latest version of the mouse reference genome assembly. 

GRCm39 is the first coordinate-changing update to the mouse reference since the 2012 release of GRCm38. More than 400 reported issues were resolved in the production of the new assembly, which also incorporates the sequence edits released as scaffolds in the six GRCm38 patch releases.

The new reference assembly exhibits substantial improvements in contiguity. As shown in Fig 1, the scaffold N50 has increased by 95% to 106.1 Mb in GRCm39, and 1.9 Mb of non-N bases were added to the assembly. The gap count has been nearly cut in half, with the total gap length reduced by 4.5 Mb. The decrease in gap length reflects in part the use of optical map data to size the remaining gaps wherever possible, replacing many of the default 50 kb gaps found in GRCm38. Sequences used for gap closures included clones, GRC-constructed contigs, and contigs from the C57BL/6J long-read based assembly ASM377452v2.
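
For readers less familiar with the N50 statistic quoted above, the following is a minimal sketch of how a scaffold N50 can be computed from a list of scaffold lengths. The function name `n50` and the toy lengths are illustrative assumptions, not GRC code or real GRCm38/GRCm39 data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one non-zero length")


# Hypothetical toy assembly (arbitrary units), before and after
# two scaffolds (10 and 8) are joined by closing the gap between them.
before = [10, 8, 6, 4, 2]
after = [18, 6, 4, 2]

print(n50(before), n50(after))
```

In this toy case the total length is unchanged, yet joining two scaffolds raises the N50. That is the general reason gap closures and scaffold joins tend to increase the statistic.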

Figure 1: GRCm39 Assembly Statistics


As in prior assembly versions, the GRCm39 chromosome sequences continue to represent the C57BL/6J strain. However, the alternate loci scaffolds that provided additional strain representations for highly variant genomic regions in GRCm38 and MGSCv37 have been removed from the assembly. The relatively low usage of these scaffolds, coupled with the growing number of high-quality strain-specific genome assemblies available in public sequence databases, such as those generated by the Mouse Genomes Project, has reduced the need to include these sequences in the reference genome assembly. Although no longer affiliated with the reference assembly, these sequences remain available in the INSDC databases (GenBank/ENA/DDBJ).

The new reference assembly will be annotated by GENCODE and RefSeq in the coming months. An in-depth transcript alignment analysis of a pre-release version of the GRCm39 assembly, presented at the 2019 IMGC meeting, demonstrated that there is improved representation for more than 50 genes. A list of these genes is provided in our earlier blog post. The GRC will provide a complete list of genes improved in GRCm39 as the annotation effort progresses.

Notable curation activities represented in the new assembly, but not in previous patch releases, include: the targeted update of more than 1,500 individual bases at which the GRCm38 allele representation was either erroneous or an unsupported C57BL/6J variant; a substantial retiling of the chr X pseudo-autosomal region (PAR) that provides representation for several genes missing from GRCm38 (Fig 2); removal of a false triplication involving the Duxbl locus; and correction of a 16 Mb inversion at the proximal end of chromosome 14.

Figure 2: Genes in GRCm39 chr X PAR


The GRC wishes to thank the many members of the mouse community who have reported assembly issues and contributed their time, expertise, and data to assist in curation efforts. Updates to the GRC website will be made to reflect the new assembly. With the release of GRCm39, the GRC's curation of the mouse genome reference assembly will be limited to the resolution of community-reported problems. We encourage you to contact the GRC for additional information on the curation of assembly regions of interest. You can also subscribe to the grc-announce email list to receive email notifications for all GRC assembly updates.