Monday, November 30, 2020

A New Rat Genome Assembly Sparks Membership of Rat and RGD in the Genome Reference Consortium

The Rat Genome Database (RGD) is very pleased to announce the release of mRatBN7.1, the new rat genome assembly! The mRatBN7 assembly, generated by the Darwin Tree of Life Project at the Wellcome Sanger Institute, is significantly improved over Rnor6.0 and previous assemblies. mRatBN7 was derived from a male BN/NHsdMcwi rat that is a direct descendant of the female BN rat previously sequenced. The new BN rat reference genome was generated using multiple technologies, including PacBio long reads, 10X linked reads, Bionano maps and Arima Hi-C. Its quality is a substantial improvement over any of the previous assemblies: it has just 175 scaffolds (N50 >135 Mb) and 756 contigs (N50 >29 Mb), giving a contiguity similar to that of the human or mouse reference assemblies.

The assembly has been submitted to the International Nucleotide Sequence Database Consortium (INSDC), and the initial GenBank record for it is now available at https://www.ncbi.nlm.nih.gov/assembly/GCA_015227675.1. Genome annotation, i.e. the assignment of gene positions and prediction of new genes and other genomic elements, will be generated by both NCBI and Ensembl.

We are also pleased to announce that rat and the mRatBN7 assembly have been accepted into the Genome Reference Consortium (GRC), and that RGD has been approved to represent the rat research community and participate in the ongoing work of curating the assembly. RGD will work closely with curators from the GRC, with the International Rat Omics Consortium (IROC), a grassroots community of rat genomics researchers, and with the rat research community to identify candidate regions for focused genome curation. Stay tuned for the appearance of rat on the GRC website!
Hi-C 2D map of mRatBN7.1 generated with HiGlass
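
For readers less familiar with the N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the combined length. The function name and the toy lengths are hypothetical and are not taken from the mRatBN7 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical toy lengths (arbitrary units), for illustration only.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

A higher N50 for the same total length means that more of the assembly sits in long, uninterrupted sequences, which is why the statistic is used here as a measure of contiguity.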

Wednesday, July 22, 2020

GRCm39: the new mouse reference genome assembly

The GRC is pleased to announce the release of GRCm39 (GCA_000001635.9), the latest version of the mouse reference genome assembly. 

GRCm39 is the first coordinate-changing update to the mouse reference since the 2012 release of GRCm38. More than 400 reported issues were resolved in the production of the new assembly, which also incorporates the sequence edits released as scaffolds in the six GRCm38 patch releases.

The new reference assembly exhibits substantial improvements in contiguity. As shown in Fig 1, the scaffold N50 has increased by 95% to 106.1 Mb in GRCm39, and 1.9 Mb of non-N bases were added to the assembly. The gap count has been nearly cut in half, with the total gap length reduced by 4.5 Mb. The decrease in gap length reflects in part the use of optical map data to size the remaining gaps wherever possible, replacing many of the default 50 kb gaps found in GRCm38. Sequences used for gap closures included clones, GRC-constructed contigs, as well as contigs from the C57BL/6J long-read based assembly ASM377452v2.

Figure 1: GRCm39 Assembly Statistics
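
The gap count, total gap length and non-N base count discussed above can be approximated directly from a sequence. The sketch below assumes, for illustration only, that every run of N characters represents one gap; official assembly statistics are derived from the assembly's own gap records, so this is not the GRC's method. The function name and toy sequence are hypothetical.

```python
import re


def gap_stats(sequence):
    """Summarise runs of N in a sequence.

    Assumption: each maximal run of N (either case) counts as one gap.
    """
    gap_lengths = [m.end() - m.start() for m in re.finditer(r"[Nn]+", sequence)]
    total_gap = sum(gap_lengths)
    return {
        "gap_count": len(gap_lengths),
        "gap_length": total_gap,
        "non_n_bases": len(sequence) - total_gap,
    }


# Hypothetical toy sequence with two gaps of 5 and 3 bases.
toy = "ACGT" + "N" * 5 + "GGCC" + "N" * 3 + "TTA"
print(gap_stats(toy))
```

Comparing these three numbers between two assembly versions shows whether gaps were closed outright (gap count falls) or merely resized (gap length changes while the count stays the same).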


As in prior assembly versions, the GRCm39 chromosome sequences continue to represent the C57BL/6J strain. However, the alternate loci scaffolds that provided additional strain representations for highly variant genomic regions in GRCm38 and MGSCv37 have been removed from the assembly. The relatively low usage of these scaffolds, coupled with a growing number of high-quality strain-specific genome assemblies available in public sequence databases, such as those generated by the Mouse Genomes Project, has reduced the need to include these sequences in the reference genome assembly. Although no longer affiliated with the reference assembly, these sequences remain available in the INSDC databases (GenBank/ENA/DDBJ).

The new reference assembly will be annotated by GENCODE and RefSeq in the coming months. An in-depth transcript alignment analysis of a pre-release version of the GRCm39 assembly, presented at the 2019 IMGC meeting, demonstrated that there is improved representation for more than 50 genes. A list of these genes is provided in our earlier blog post. The GRC will provide a complete list of genes improved in GRCm39 as the annotation effort progresses.

Notable curation activities represented in the new assembly, but not in previous patch releases, include the targeted update of more than 1,500 individual bases at which the GRCm38 allele representation was erroneous or an unsupported C57BL/6J variant, a substantial retiling of the chr X pseudo-autosomal region (PAR) that provides representation for several genes missing from GRCm38 (Fig 2), removal of a false triplication involving the Duxbl locus, and correction of a 16 Mb inversion at the proximal end of chromosome 14.

GRCm39 chromosome X
Figure 2. Genes in GRCm39 chr X PAR


The GRC wishes to thank the many members of the mouse community who have reported assembly issues and contributed their time, expertise, and data to assist in curation efforts. Updates to the GRC website will be made to reflect the new assembly. With the release of GRCm39, the GRC's curation of the mouse genome reference assembly will be limited to the resolution of community-reported problems. We encourage you to contact the GRC for additional information on the curation of assembly regions of interest. You can also subscribe to the grc-announce email list to receive email notification of all GRC assembly updates.

Monday, June 8, 2020

ZFIN and the GRC: Supporting the zebrafish reference genome assembly


ZFIN is a member of The Genome Reference Consortium (GRC), an international collaboration consisting of NCBI, the Wellcome Trust Sanger Institute, the McDonnell Genome Institute at Washington University, the European Bioinformatics Institute (EBI) and ZFIN. This consortium is tasked with ensuring that the reference assemblies for human, mouse, zebrafish and chicken are updated and improved through new data and analysis from genome centers and the research community.

The zebrafish-specific GRC webpage (Fig. 1) provides an overview of the zebrafish genome, including an ideogram of the latest zebrafish assembly (GRCz11) that highlights the location of alternate loci scaffolds, downloadable files for the current public assembly and the tiling path files reflecting the latest assembly edits, as well as links to genome assembly data and genome regions under review. Zebrafish genome issues (Fig. 2), such as sequencing errors, gaps, and path problems, can be browsed at the chromosome level, filtered by problem type or status, or searched by gene, location, clone name or accession.

If you come across what you suspect is a problem in the build in the course of your research, visit the GRC website to search the list of genome issues and if it has yet to be reported, select the "Report an Issue" tab in the header (Fig. 3) to report information about the potential problem in the build. Be as complete as possible and provide location, flanking sequences and a description of the issue. Genome annotators will evaluate the region, determine if an update to the genome is needed and submit data to create a new tiling path to improve the build with an update or "patch".

We welcome your feedback!

Figure 1. Zebrafish-specific GRC webpage

Figure 2. Zebrafish genome issues webpage

Figure 3. Report a Genome Problem webpage

Thursday, May 23, 2019

Readying the release of GRCm39


GRCm38, the current mouse reference assembly, whose chromosomes represent the C57BL/6J strain, supports a broad range of research activities. Despite being one of the highest quality mammalian genome assemblies ever produced, it still has more than 600 gaps and includes sub-optimal representations for some genes. To address these issues and provide the murine research community with an improved substrate for their work, the GRC has been applying new technologies, such as optical/genome mapping, and using new sequence resources to curate an update to the reference genome assembly. The public release of the updated assembly, GRCm39, is planned for the end of 2019/early 2020.

Since the 2012 release of GRCm38, the last coordinate-changing update to the mouse reference, the GRC has provided 6 publicly accessible minor assembly updates, the last of which (GRCm38.p6) was released in September, 2017. These non-coordinate changing assembly versions, known as patch releases, cumulatively include 65 fix patches (chromosome path changes) and 9 novel patches (alternate representations of chromosome sequences, derived from other strains). In GRCm39, these fix patches will be incorporated into the chromosomes, and the novel patches will persist as alternate loci.

In preparation for the release of GRCm39, the GRC also analyzed a non-public updated version of the reference assembly. In the production of this updated assembly, 322 reported genome issues were resolved, 70% of which addressed gaps and problems with the sequence of underlying genomic clone components. The review evaluated the value and impact of the released patches and subsequent unreleased genome updates and assessed the need for additional work prior to the GRCm39 release.

Analyses of this assembly, known informally as "GRCm38B", reveal the removal of about 200 kb of over-expanded sequence found in GRCm38, and the addition of 254 new components, of which 95 are contigs assembled from WGS reads (PRJNA51977). As a result of the curation effort, GRCm38B has fewer gaps and higher contig and scaffold N50s than GRCm38 (Table 1).


Furthermore, an analysis of mouse RefSeq transcripts aligned to GRCm38B demonstrates the improved representation for at least 50 genes (Table 2).


One example of improved gene representation in GRCm38B is shown in Figure 1. In GRCm38, an assembly gap at chromosome 4 nt 99,842,111, between components BX324127.8 and CU326395.5, results in a partial representation of Efcab7 (EF-hand calcium binding domain 7). In GRCm38B, sequence from MF597759.1 (a GRC-assembled contig of Illumina reads) closes this gap and provides the exons missing from Efcab7 transcript NM_145549.1.
Figure 1 Top: Incomplete representation of Efcab7 gene in GRCm38 due to an assembly gap. Middle: The gap is closed in GRCm38B with MF597759.1. It also provides complete representation of Efcab7. Bottom: Efcab7-mRNA partial representation in GRCm38 and complete representation in GRCm38B.

Based on GRCm38B analyses, 5 GRCm39 chromosomes will consist of a single scaffold (Chr. 11, 12, 15, 16, 18), 11 will be built from 2 scaffolds, and the remaining 5 from more than 2 scaffolds.

In the months leading up to the GRCm39 release, the GRC will continue to curate additional genome issues. Sequences from the recently published C57BL/6J long-read based assembly ASM377452v2 are providing new resources for the update or closure of assembly gaps and correction of sequencing errors. Additionally, we are investigating individual bases at which the GRCm38 sequence differs from all 17 strain-specific genome assemblies (Mouse Genome Project) with the aim of correcting confirmed erroneous bases.

Upon the release of GRCm39, the GRC's curation of the mouse genome reference assembly will be limited to the resolution of community reported problems.

You can browse the status of GRC curation activities at our website, and we encourage you to contact the GRC for additional information on the curation of assembly regions of interest. Updates to the timeline for the GRCm39 release will be provided on the Mouse Genome Overview webpage. You can also subscribe to the grc-announce email list to receive email notification of all GRC assembly updates.

Tuesday, March 26, 2019

Shining a light on human acrocentric p-arms


The GRC is excited to announce that representations for the p-arms of the human acrocentric chromosomes can now be found in the GRCh38.p13 patch update of the reference genome, thanks to work done in Brian McStay's lab. These sequences are included on the following scaffolds: ML143366.1, ML143367.1, ML143372.1, ML143377.1, and ML143380.1.

The p-arms of the human acrocentric chromosomes HSA13-15, 21 and 22 each bear ribosomal gene arrays (Figure 1) termed nucleolar organiser regions (NORs). These are the most transcriptionally active regions of the genome and direct formation of nucleoli, the largest structures in the nuclei of all human cells. Research on these critical genomic regions is hampered by the fact that acrocentric p-arms are not included in human genome drafts. They are both internally highly repetitive and share a strikingly similar sequence content, making them recalcitrant to standard sequencing approaches. Despite these issues, Brian McStay's lab previously described a collection of sequenced cosmid and BAC clones that allowed them to build a reasonable consensus for sequences both immediately proximal and distal to NORs (Floutsakou et al. 2013. Genome Res 23:2003-12). Proximal sequences are almost entirely segmentally duplicated, similar to regions bordering centromeres. In contrast, the distal sequence is predominantly unique to the acrocentric p-arms. The interphase localisation, open chromatin structure and transcriptionally active state of these distal sequences point to a role in nucleolar biology and prioritise their inclusion in a future genome draft (for discussion see McStay. 2016 Genes Dev. 30:1598-610).

The McStay lab subsequently developed a workflow that has enabled them to determine the NOR distal sequence, the Distal Junction (DJ), from all five acrocentric chromosomes and from an additional two versions of HSA21, ~3 Mb in all. A panel of mono-chromosomal somatic cell hybrids, mouse A9 cells containing individual human chromosomes, allowed them to sequence one chromosome at a time. Sequencing was performed by combining sequence capture with PacBio SMRT sequencing. Pre-capture libraries (typically in the range of 4-6 kb) were prepared from each hybrid line. Capture was performed using oligonucleotide libraries designed from their original consensus. Circular consensus sequencing (CCS) of post-capture libraries generated so-called reads of insert (ROIs), each with high sequence accuracy. This allowed the McStay group to assemble sequence contigs from the NOR distal region of each chromosome, regardless of the presence of repetitive sequences such as satellite DNA.

Their analysis of these sequences confirms sequence and presumably functional conservation between the acrocentrics. It also provides evidence for non-homologous exchanges between them. It's anticipated that extension of sequence contigs towards the telomeres will uncover increased structural variation between the acrocentric chromosomes.

Figure 1. FISH experiment showing the relative locations of the rDNA array and distal junctions on the p-arms of the human acrocentric chromosomes.



Wednesday, March 20, 2019

GRCh38.p13 has been released


The GRC is pleased to announce that GRCh38.p13 is now available! This release adds 45 new scaffolds: 43 FIX patches and 2 NOVEL patches. The FIX patch scaffolds provide assembly corrections while the NOVEL patch scaffolds deliver new alternate sequence representations. A valuable contribution to this patch release comes in the addition of the Nucleolus Organiser Region (NOR) sequences for the short arms of the acrocentric chromosomes (13, 14, 15, 21, and 22) as provided by Brian McStay's group (PMID: 23990606). The NOR additions will be discussed in detail in a separate blog.

With access to an ever-increasing pool of high-quality, long-read human assembly data, the GRC has been able to use these data in GRCh38.p13 to address genome issues that had until now persisted because of either a lack of data or their complexity. Much of the data added in this patch is derived from the CHM1 human haploid hydatidiform mole assembly (GCA_001297185.2). Originally produced as part of an assembly comparison analysis (see PMID: 28396521), the assembly was recently Pilon corrected and re-submitted to GenBank by the McDonnell Genome Institute at Washington University, a GRC center, with the specific aim of improving its base pair accuracy so that its sequences could be used to improve the Human Genome Reference.

In GRCh38.p13, a total of 28 assembly gaps have been closed. These updates, together with sequences added to correct 5 clone errors, add more than 0.5 Mb of unique data to the assembly. The majority of unique sequences added in this release come from contigs that are components of WGS assemblies derived from PacBio sequence reads, such as the CHM1 assembly mentioned above. However, genomic clone libraries still play an important role in assembly curation. In this release, BAC clones from a human cell line (CHM1htert) have provided complete, single haplotype representations of clinically important regions such as Prader-Willi on chromosome 15 and CT47 on chromosome X.

The CT47 cancer/testis antigen located on human Xq24 is organized as an array of 4.8 kb tandemly repeated units. Due to the repetitive nature of the sequence involved, coupled with the limitations of the technologies available at the time, the representation of the CT47 gene cluster in GRCh37 and GRCh38 was problematic. The region is gapped, and the flanking clones are from different haplotypes. As a consequence, the representation of the cluster in these assemblies was incomplete and biologically unsound, representing an indeterminate number of gene copies (Figure 1, top).
Studies have indicated that this polymorphic array is highly variable between haplotypes and ranges from 4 to 17 copies in length. Long-read sequencing of genomic clones has now captured the complete CT47 cluster as a single haplotype. The fix patch (ML143381.1) included in the GRCh38.p13 release now provides a contiguous and validated representation of the CT47 genomic region. This patch closes the assembly gap with sequences from BAC clone AC275592.1 (CH17-182I12) which contains a complete, 7 copy representation of the CT47 array (Figure 1, bottom). Note that this update reduces the number of CT47 genes represented as compared to GRCh37 and GRCh38.
Figure 1 Top: CT47 region in GRCh38. Incomplete representation of CT47 gene cluster in GRCh38 due to an assembly gap. Bottom: CT47 fix patch in GRCh38.p13. The gap is closed and a complete representation of CT47 cluster is provided. 
Optical mapping technology has been used to confirm that the copy number of the CT47 array is accurate for the haploid CHM1htert sample (Figure 2), from which the clone library was derived.
Figure 2: AC275592.1 alignment to CHM1 Bionano optical map.
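
As a rough worked example of the figures quoted above, the sketch below estimates the span of the CT47 array from its copy number, using the 4.8 kb repeat unit given in the text. It assumes every unit is exactly 4,800 bp and ignores flanking sequence, so the results are approximations for illustration rather than measured lengths. The constant and function names are hypothetical.

```python
UNIT_BP = 4_800  # 4.8 kb tandem repeat unit, as stated in the text


def array_length_bp(copies, unit_bp=UNIT_BP):
    """Approximate array span, assuming identical unit lengths."""
    if copies < 0:
        raise ValueError("copies must be non-negative")
    return copies * unit_bp


# Reported range across haplotypes (4 to 17 copies) and the
# 7-copy array represented in the fix patch.
for copies in (4, 7, 17):
    print(f"{copies} copies: ~{array_length_bp(copies) / 1000:.1f} kb")
```

Under these assumptions the array spans roughly 19 kb to 82 kb across haplotypes, with the 7-copy representation at about 34 kb. Distances of this size are the kind an optical map can help confirm, as described for Figure 2.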

As more data become available from the latest technologies, the GRC is able to use them to continually update and improve the reference genome. If you have questions about this process, let us know.

You can download the GRCh38.p13 assembly, including the alignments of the patches to GRCh38, from the GenBank FTP.


Friday, February 23, 2018

GRCg6: Curation of the chicken reference genome assembly transfers to the GRC



The GRC announces the release of the latest chicken reference genome assembly, GRCg6.


The chicken reference assembly defines a standard upon which other avian whole genome studies are based. Providing the best representation of the chicken genome is essential for facilitating continued progress in understanding and improving human health as this species serves as a model organism similar to mouse, zebrafish and other vertebrates.

The chicken reference genome project began as an international research collaboration coordinated by the McDonnell Genome Institute, with past funding from the National Institutes of Health (NIH) and the U.S. Department of Agriculture, whose shared goals were to determine the sequence of the chicken chromosomes and annotate all possible chicken genes. The initial genome reference was completed and published in Nature in 2004 and has since improved in quality. The reference experienced a major upgrade in 2016, termed Gallus_gallus-5.0 (GCA_000002315.3), as a result of long-read sequencing technology and added transcriptome data. In 2017, responsibility for the management of chicken genome assembly updates transferred to the Genome Reference Consortium (GRC). The generation of additional sequence coverage from long-read data, with average read lengths of 12 kb, together with improvements to de novo assembly algorithms, has resulted in another upgrade, GRCg6, that has now been released for immediate community use. Manual annotation of de novo assembled contigs that have been integrated with finished BAC clones has produced an assembly with superior quality metrics, such as an N50 contig size of 18 Mb and much lower gap counts.


Visit the chicken homepage at the GRC website to find assembly notifications, report assembly issues or contact us with questions.

The GRCg6 assembly will be available in all major genome browsers, and will be annotated by both the NCBI eukaryotic genome annotation pipeline and Ensembl.

* Photo courtesy of Dr. Jerry Dodgson