After 2.5 years of assembly
curation, the GRC is proud to present the new zebrafish reference genome
assembly, GRCz11.
This latest assembly has
been refined by the addition of nearly 1000 finished clone sequences and by
the resolution of more than 400 assembly issues. These changes reduced the number of
scaffolds from 3399 to 1905 and raised the scaffold N50 from 2.18 Mb to 7.5 Mb,
whilst the overall genome size was not affected. Figure 1 shows contig and
scaffold N50s over time, illustrating the progress of assembly curation.
Figure 1: Contig
vs. scaffold N50s for zebrafish reference genome assemblies. Release dates: Zv7:
2008, Zv8: 2009, Zv9: 2010, GRCz10: 2014, GRCz11: 2017.
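For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the summed length. The function name `n50` and the toy lengths are illustrative assumptions, not GRC code or GRCz11 data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy example (made-up lengths): joining the three smallest scaffolds
# into one reduces the scaffold count and raises the N50, while the
# total assembled length stays the same.
before = [8, 5, 4, 2, 1]
after = [8, 7, 5]
print(len(before), sum(before), n50(before))
print(len(after), sum(after), n50(after))
```

The toy example mirrors the pattern reported for GRCz11: fewer scaffolds and a higher N50 at an unchanged total size.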
Alignments of 16133 RefSeq
sequences showed a further improvement over past assemblies: only 31 sequences could not be
found (down from 34), 105 transcripts are still split between locations (down from 205) and only 441 exhibit less than 95% CDS coverage (down from 566). Figure 2
shows an example of an improved region in which the representation of two genes has been corrected.
Figure 2:
gEVAL screenshot of the supt4h1 gene (red arrow) in GRCz10 (top) and GRCz11 (bottom). In
GRCz10 the supt4h1 gene on chromosome 5 is incomplete, missing its first exon,
and surrounded by a truncated, duplicated copy of rnf150b (blue arrow). In GRCz11, the
supt4h1 gene is complete and neighbouring the hsf5 gene, as seen in other
vertebrates, whereas the rnf150b gene is now complete and located solely on
chromosome 23. gEVAL, the GRC’s genome assembly evaluation browser, indicates
completeness of genes and other features via colours (green > 98% coverage,
yellow = 50-98% coverage, red < 50% coverage).
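The colour thresholds in the caption can be expressed as a small lookup. This is only a sketch of the stated rule; the function name `coverage_colour` is hypothetical and does not correspond to any gEVAL code or API.

```python
def coverage_colour(coverage_percent):
    """Map a feature's coverage (in percent) to the colour scheme
    described in the Figure 2 caption.

    green:  > 98% coverage
    yellow: 50-98% coverage
    red:    < 50% coverage
    """
    if not 0 <= coverage_percent <= 100:
        raise ValueError("coverage must be between 0 and 100 percent")
    if coverage_percent > 98:
        return "green"
    if coverage_percent >= 50:
        return "yellow"
    return "red"


# Illustrative values only, not measurements from the figure.
for value in (99.5, 98.0, 75.0, 50.0, 12.0):
    print(value, coverage_colour(value))
```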
GRCz11 was built as
described previously using clone sequences ordered and oriented
according to genetic markers and BioNano data, the latter greatly influencing
the scaffolding. Remaining gaps were filled with selected contigs from whole genome sequencing assemblies, mainly WGS31 and, in
a few cases, WGS32.
For the first time in a
zebrafish assembly, GRCz11 also features alternate loci scaffolds (ALT_REF_LOCI). The alternate loci provide
variant sequence representations for certain genomic regions. They were selected from a pool
of 1895 finished clones that were found to map to an assembly region already
occupied by clone sequence and were therefore not included in the primary chromosomal path.
All surplus clones that exhibited at least 5 kb of unique sequence not
present in the primary chromosomal path were added to the assembly as alternate loci scaffolds, totaling
186 Mb of additional sequence in 1150 clones. The alignments of the alternate loci scaffolds to the primary chromosomal path are also included in the GRCz11 assembly to provide the chromosome context for these alternate
sequences. The alternate loci will be represented in genome browsers in the same way as
human and mouse ALT_REF_LOCI. Additional smaller-scale variation will be submitted
to dbSNP/dbVar/EVA.
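The selection rule for alternate loci described above (surplus clones with at least 5 kb of sequence not present in the primary path) amounts to a simple threshold filter. The sketch below assumes the amount of unique sequence per clone has already been determined by alignment; the class `SurplusClone`, the function `select_alternate_loci` and the clone records are all hypothetical and are not taken from the GRC pipeline.

```python
from dataclasses import dataclass

# Threshold stated in the text: at least 5 kb of unique sequence.
MIN_UNIQUE_BP = 5_000


@dataclass
class SurplusClone:
    name: str        # hypothetical clone identifier
    length_bp: int   # total clone length in base pairs
    unique_bp: int   # bases not present in the primary chromosomal path


def select_alternate_loci(clones, min_unique_bp=MIN_UNIQUE_BP):
    """Keep surplus clones carrying at least `min_unique_bp` of unique sequence."""
    return [clone for clone in clones if clone.unique_bp >= min_unique_bp]


# Made-up records for illustration only.
pool = [
    SurplusClone("clone_A", 160_000, 12_500),
    SurplusClone("clone_B", 155_000, 4_999),
    SurplusClone("clone_C", 170_000, 5_000),
]
selected = select_alternate_loci(pool)
print([clone.name for clone in selected])
print(sum(clone.length_bp for clone in selected))
```

Note that the filter keeps whole clones, so the total added sequence is the sum of the selected clone lengths rather than of the unique bases alone.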
Following the release of this
assembly, the Sanger team is transferring the maintenance of
the zebrafish reference to ZFIN, a new member of the GRC. ZFIN will take on the future
curation of the assembly and invites user reports on assembly issues. Future
updates to the assembly will be issued as patch releases, which add sequence
without changing the chromosomal coordinates.