Wednesday, September 17, 2014

GRCz10 - The GRC's first zebrafish genome reference assembly

When the Zv9 assembly was released in July 2010, the zebrafish genome sequence was handed over to the GRC for future improvement and maintenance. After four years of hard work (and zebrafish IS hard work), we have now produced a new reference assembly, GRCz10.

The previous assembly was already of high value to the scientific community, and served well both for investigating isolated gene loci and for addressing overall bioinformatics questions (Howe et al. 2013), but it still featured many gaps and suffered from sub-optimal long-range continuity. To address this, we sequenced more than 1500 additional BAC and fosmid clones and added them to the assembly. We reviewed clone overlaps and clone placements with a variety of techniques. In collaboration with the Stemple lab, using the MGH panel, we generated a new meiotic map to fill remaining gaps in the high-density meiotic map SATMAP. This new map, GAPMAP, helped to place previously unlocalised contigs onto chromosomes, and allowed us to assess and improve the order of existing chromosome placements. The creation of an optical map further improved the clone assignments, with a notable impact on the structure of the repeat-rich chromosome 4. Thanks to a collaboration with Mark Hills from the Lansdorp lab, we gained additional insight into the orientation of assembly components, leading to more than 250 orientation changes and re-placements. In total, more than 4000 genome issues were reviewed and resolved. The remaining gaps in the clone path were filled with sequence from the WGS31 whole genome assembly, as was done for Zv9.

The most notable changes in the chromosome landscape since Zv9 are on chromosome 4, which has gained about 15 Mb in length. In addition, 94 of the 112 previously unplaced clone-contigs have found a home on a chromosome. Whilst 85% of all publicly available cDNAs could be assigned a place on Zv9 with at least 97% identity and at least 90% coverage, we now find 87% in GRCz10. If we classify cDNAs with less than 97% identity and less than 40% coverage as not found, then Zv9 was missing 7% of the cDNAs, whilst GRCz10 is now missing only 3%. Now that the assembly has been released, the Havana team at the Sanger Institute is busy manually (re-)annotating genes, and the Ensembl team is working on generating an automated gene build and integrating it with these manually produced models. The NCBI eukaryotic genome annotation pipeline (gpipe) will also annotate the GRCz10 RefSeq assembly.
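
For readers who want to apply the same kind of cDNA thresholds to their own alignments, here is a minimal sketch in Python. It is not the pipeline we used. The function names, the "partial" category for everything between the two thresholds, and the toy input values are all assumptions made for illustration, and the "not found" rule follows a literal reading of the criteria above (identity below 97% and coverage below 40%).

```python
from collections import Counter


def classify_cdna(identity_pct, coverage_pct):
    """Classify one cDNA alignment by percent identity and percent coverage.

    Thresholds are taken from the text; the 'partial' bucket is an
    assumption covering everything that is neither placed nor not found.
    """
    if identity_pct >= 97.0 and coverage_pct >= 90.0:
        return "placed"
    if identity_pct < 97.0 and coverage_pct < 40.0:
        return "not_found"
    return "partial"


def summarise(alignments):
    """Return the percentage of cDNAs in each category.

    `alignments` is an iterable of (identity_pct, coverage_pct) pairs,
    one pair per cDNA (hypothetical input format).
    """
    counts = Counter(classify_cdna(i, c) for i, c in alignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


if __name__ == "__main__":
    # Invented values, purely to show the call; not real alignment results.
    toy_alignments = [(99.1, 98.0), (97.5, 60.0), (92.0, 35.0), (98.2, 91.5)]
    print(summarise(toy_alignments))
```

Each cDNA is represented here by its single best alignment; how that best alignment is chosen is left to whatever aligner you use.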

If you are working with the zebrafish genome assembly, we'd be very happy to hear your feedback. You can either fill in the form on the GRC home page, or email us at zfish-help@sanger.ac.uk.