GenomeRef

Friday, February 2, 2024

GRCr8: A new rat reference assembly is released!

GRCr8 (GCA_036323735.1), the latest version of the rat reference genome assembly, is now available. GRCr8 is an evolution of mRatBN7.2 (GCA_015227675.2), the Vertebrate Genomes Project-generated rat assembly that was the first reference for this species to be adopted by the GRC for stewardship. mRatBN7.2 was an assembly of a Brown Norway (BN) male rat from the same colony at the Medical College of Wisconsin that supplied the female rat used in the 2004 RGSC_v3.4 assembly (AABR00000000.3/GCF_000001895.1). While the assembly of mRatBN7.2 was a substantial improvement over prior versions (https://pubmed.ncbi.nlm.nih.gov/37214860/), advances in sequencing technology and assembly and curation methods since its release in 2020 have for the first time resulted in the GRC releasing a new de novo assembly as a reference update instead of curating issues in the prior version.

GRCr8 was generated by Dr. Peter Doris (University of Texas Health Science Center at Houston) with colleagues Theodore Kalbfleisch (University of Kentucky) and Melissa Smith (University of Louisville) in the NHGRI-funded “Inbred Rat Genomes Project”. The assembly is based on PacBio HiFi sequences from a BN/NHsdMcwi male rat. The assembly was gap filled using contigs from the PacBio CLR reads produced for mRatBN7.2. Additional short-read genomic sequences from a BN/NHsdMcwi rat in the Hybrid Rat Diversity Program at the Medical College of Wisconsin were used for assembly polishing. In addition to yielding a consensus quality score (QV) of 59.5, GRCr8 addresses structural limitations of mRatBN7.2. The genome size is increased from 2.63Gb to 2.81Gb, largely because of incorporation of genomic regions that show structural expansion. For example, chrY has increased from 18Mb in mRatBN7.2 to 60Mb in GRCr8. Accompanying multi-tissue single molecule transcript information (PacBio IsoSeq) is available for this assembly (BioProject: PRJNA1027884). These data extend the scope of rat transcript diversity and will inform gene expression in newly incorporated regions of the genome.

The GRCr8 assembly has been submitted to the INSDC, making it available through GenBank, ENA and DDBJ. It will subsequently be annotated by groups such as RefSeq and Ensembl, after which it will be available on genome browsers at various resources, including the Rat Genome Database, NCBI, Ensembl, and UCSC.

Monday, May 9, 2022

GRCh38.p14 is now released!

GRCh38.p14 (GCA_000001405.29/GCF_000001405.40), the latest update to the human reference assembly, has been released! It adds 69 new patch scaffolds, 51 of which are FIX patches that update sequences on the GRCh38 reference chromosomes or alternate loci, while 18 are NOVEL patches, providing new alternate representations for complex genomic regions that are inadequately represented by a single sequence. Two previously released FIX patches were also updated. With this release, the reference assembly contains a total of 254 patch scaffolds (164 FIX, 90 NOVEL).

Thirty of the patches in this release include genome updates made in support of the MANE project, a joint NCBI-EBI effort to produce a minimal set of matching RefSeq and Ensembl transcripts of protein coding genes, creating a matched pair of transcripts but retaining their respective identifiers. Read more about the MANE effort in their recent Nature publication. The corresponding patch updates to the reference assembly involved changes addressing normal human variation as well as correcting errors in the underlying component sequences.

Of the 53 FIX patches in GRCh38.p14, 23 correct errors in individual assembly component sequences, resulting in updates to 12 gene representations (Table 1). Twenty are variation-related updates, 12 of which provide the coding allele for 13 polymorphic pseudogenes that are non-coding on the corresponding GRCh38 chromosomes (Table 2). Additionally, 2 provide sequence updates at chromosomal loci where it's unclear if the GRCh38 sequence is in error or a rare haplotype. Patch scaffolds in GRCh38.p14 close 6 gaps in the reference assembly and extend sequence into one other gap. Four of the closed gaps are located within chromosomes, while the remaining 2 are "pre-telomeric" gaps whose closure extends the sequence of the chromosome into the telomeric repeats.

Table 1. Gene representations updated on FIX patches addressing assembly component problems.

Table 2. Coding alleles of polymorphic pseudogenes updated by FIX patches addressing genomic variation.

An example of an important FIX patch in this release is an update to APOB, one of the genes the American College of Medical Genetics and Genomics recommends for reporting of incidental findings in clinical exome and genome sequencing. The patch scaffold provided in GRCh38.p14 represents the common allele.

There are 18 NOVEL patches in this release, providing alternate sequence representations of chromosomal sequences, including 9 genes (Table 3). Other NOVEL patches represent inversion and insertion haplotypes relative to the corresponding chromosomal region.

Table 3. Genes with alternate sequence representation on GRCh38.p14 NOVEL patches.

Shown below is an example of an update to PRDM9, a medically important gene in which naturally occurring allelic variation regulates the activity of meiotic recombination hotspots. The original GRCh38 release represents the relatively rare "B" allele on chromosome 5. With the release of GRCh38.p14, a NOVEL patch scaffold has been added to the assembly (MU273356.1/NW_025791779.1) to provide additional representation for the sequence of the more common "A" allele.

Figure 1. PRDM9 allele representation in GRCh38.p14. Top: Alignment of PRDM9 "A" (NM_001310214.3) and "B" (NM_001376900.1) allele transcripts to chromosome 5. The chromosome sequence represents the "B" allele. The red circles and arrows highlight mismatches in the alignment of the "A" allele. Bottom: Alignment of PRDM9 "A" and "B" allele transcripts to the NOVEL patch added in GRCh38.p14. The patch represents the "A" allele. The red circle highlights mismatches in the alignment of the "B" allele.

Notably, 9 of the NOVEL patches used clone sequence generated by Evan Eichler's lab as part of a published study of the evolution and population diversity of human-specific segmental duplications. The GRC also used sequences generated by the Eichler lab to create a FIX patch to improve a GRCh38 chromosome 5 alternate locus scaffold (KI270897.1/NT_187651.1) representing the haplotype from the CHM1 hydatidiform mole at the hypervariable SMA locus. Informed by CHM1 Bionano optical map data, the GRC provided a FIX patch (MU273354.1/NW_025791777.1) that corrects component order and adds sequence from several newly sequenced CHM1 BAC clones to the alternate locus scaffold.

Figure 2. A FIX patch corrects the sequence path of the GRCh38 alt locus scaffold providing representation of the CHM1 haplotype for the SMA region on chromosome 5. Top: Tiling path of component clones in the alt locus scaffold. Middle: Tiling path of component clones in the FIX patch scaffold. Blue outline: clones excluded from fix scaffold. Green outline: clones added to fix scaffold. Magenta outline: clones from alt scaffold retained in fix scaffold. Black: sequence gap. Bottom: Alignment of fix patch scaffold path to CHM1 Bionano optical map, demonstrating concordance.

This patch release also extends GRC efforts to identify and exclude problematic sequences, such as false redundancies and contamination, from the reference assembly. The companion BED file available from GenBank, which identifies such regions and can be used as a mask to exclude them from analyses, has now been updated. The latest updates reflect curation done in response to reports from GRCh38 analyses performed by the Genome In a Bottle (GIAB) and Telomere-to-Telomere (T2T) consortia. In addition to the chromosome 21p regions previously reported, the file provides coordinates for 7 other regions in which the sequence falsely duplicates other sequence found in the assembly.

We are grateful to our community collaborators for the sequences and analyses that contributed to the updates in GRCh38.p14. Please alert the GRC if you have specific assembly issues to report, or contact us for any questions or feedback. We'd love to hear from you!

Wednesday, July 21, 2021

One of these things doesn't belong: efforts to exclude problematic sequences in GRCh38

Since the release of GRCh38, the GRC has received a number of user reports alerting us to a potential false duplication involving chr 21p and 21q. Users noted that reads were aligning to both regions in GRCh38, but not GRCh37/hg19, resulting in decreased mapping scores and difficulties in variant calling throughout. Additionally, user analyses involving Multiplex Ligation-dependent Probe Amplification (MLPA), a technique for gene copy number detection, and exome studies indicated potential false duplications. The implicated regions contained several genes, including CBS (Gene ID: 875), U2AF1 (Gene ID: 7307) and KCNE1 (Gene ID: 3753). The GRC has investigated the matter and concurs that the GRCh38 assembly contains sequence on the short arm of chr 21 that should be excluded from analyses. Read on to learn more about this issue, as well as some recently detected non-human contamination in GRCh38, and ways you can find and avoid these sequences in your analyses.

The short arm of human chromosome 21, like that of the four other human acrocentric chromosomes, is where genes associated with rDNA synthesis are localized, and is characterized by highly repetitive heterochromatic sequence. The repetitive nature of these sequences, coupled with limitations in sequencing technology, has until recently made the representation of these regions in genome assemblies very difficult.

As a consequence, the GRCh37 representation of the chromosome 21 p-arm contained only 11 clone sequences. Seven were clones from the HSA21-specific BAC library CHORI-507 that had previously been experimentally localized to 21p (PMID: 17895424). In an effort to add additional sequence to this repetitive region, 23 additional components were added to 21p for GRCh38, including 18 additional CHORI-507 clones, 4 RPCI-11 clones, and 1 ABC9 fosmid. Admixture mapping localized some of these clones to this region.

In response to the user reports, the GRC re-reviewed the sequences added to 21p in GRCh38. Haploid CHM13hTERT Illumina reads generated by The McDonnell Genome Institute were aligned to GRCh38 by NCBI, and evaluated for read mapping and coverage. This analysis supported the user reports, suggesting that 5 of the newly added CHORI-507 clones (FP565260.4, CU639417.17, FP236240.8, FP475955.4 and CU633980.13) were actually redundant with sequences on chr 21q, and thus represented false duplications in GRCh38.

The GRC has now removed these sequences from the files that it uses to generate the reference assembly. However, we cannot remove them from the GRCh38 assembly without triggering the next major release of the human assembly. In order to help users recognize these regions and avoid them in their analyses, we have produced a masking file to be used as a companion to GRCh38. This BED file is available from the GenBank FTP site: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_GRC_exclusions.bed. This file provides the assembly coordinates of the 5 clones incorrectly localized to chr 21p. The Genome in a Bottle Consortium recently posted a preprint demonstrating that using this masking file greatly improves variant calling accuracy in the affected genes (https://doi.org/10.1101/2021.06.07.444885).
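One way to apply the masking file is to load its intervals and set aside any variant call or alignment that overlaps them. The Python sketch below illustrates that idea under stated assumptions: the coordinates in `EXAMPLE_BED` are placeholders rather than the real contents of the GRC file, the function names `load_exclusions` and `is_excluded` are hypothetical, and the sequence names in the BED file are assumed to match those used in your own data. It relies only on the standard BED convention of 0-based, half-open intervals.

```python
import io

# Hypothetical stand-in for the exclusions BED file. These coordinates and
# names are placeholders for illustration, NOT the real contents of the file.
EXAMPLE_BED = """\
chr21\t1000\t2000\texample_false_duplication
chr21\t5000\t6000\texample_false_duplication
chrUn_example\t0\t500\texample_contamination
"""


def load_exclusions(handle):
    """Read BED intervals into {chrom: [(start, end), ...]}, sorted by start."""
    regions = {}
    for line in handle:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def is_excluded(regions, chrom, start, end):
    """True if the 0-based, half-open interval [start, end) overlaps a masked region."""
    return any(s < end and start < e for s, e in regions.get(chrom, []))


regions = load_exclusions(io.StringIO(EXAMPLE_BED))
print(is_excluded(regions, "chr21", 1500, 1600))  # True: inside a placeholder region
print(is_excluded(regions, "chr21", 2000, 2100))  # False: starts at the region's end
```

To work with the real file, pass an open handle for the downloaded BED file to `load_exclusions` in place of the in-memory example.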

In addition to these sequences, the file also includes 2 other assembly scaffolds that were found, after the release of GRCh38, to be contaminated with non-human sequence. These include a chr Un scaffold (KI270752.1/NT_187507.1), whose sole component (AF065393.1) is now known to represent sequence from Chinese hamster (PMID:30486838), likely derived from the human-hamster CHO cell line that was the clone source, and an alternate loci scaffold (KI270825.1/NT_187580.1) whose non-anchor component (AC225822.3) was shown to be chimeric. In AC225822.3, the first 25,375 bases are human sequence matching GRCh38 chr 10 reference component and alternate scaffold anchor sequence AL391421.27, while the rest match Acidithiobacillus thiooxidans sequences from multiple WGS projects (PMID:32398145). Although all sequences in the reference assembly are screened for foreign contamination, these two were not detected at the time of release (2014). Prompted by these findings, the GRC has more recently re-screened the assembly with updated contamination databases and has not detected additional issues. As these two scaffolds are not human sequence, very few reads are likely to map well to them, but users may still want to make note of them in their analyses.

In total, the contamination represents ~800 Kb, or 0.02% of the total sequence length. GRCh38 remains an extremely high quality reference assembly. Nonetheless, the GRC remains committed to addressing assembly errors and making sure it serves as the most reliable analysis substrate possible. Check out our website to see other genomic regions under review. We welcome your feedback and reports of newly discovered issues! In the future, we plan to update the masking file with any new regions as identified and reviewed by the GRC.

Fig. A: Aligned CHM13hTERT Illumina reads viewed in Integrative Genomics Viewer (IGV). The panes labeled 'Original' show reads aligned prior to redundant sequence masking and the panes labeled 'Fixed' show reads aligned after redundant sequence masking.

The top two panes show reads aligned to the valid U2AF1 locus in GRCh38 (NC_000021.9:43,091,000-43,110,000 of 21q) and the bottom two panes show reads aligned to the falsely duplicated region (the "pseudo region") of GRCh38 (NC_000021.9:6,480,000-6,500,000 of 21p).

In IGV, sequence reads that align to 2 places in the reference (whether or not the reference is correct) yield poor/ambiguous alignments, indicated by clear, unshaded reads. This is shown in both the 'Original hg38 U2AF1' and the 'Original hg38 pseudo region' panes.

Following the masking of the known, duplicated region introduced in GRCh38, the aligned reads in the 'Fixed hg38 U2AF1 gene' pane are shaded grey, meaning they have good mapping scores to that region. And there are no reads mapping to the 'Fixed hg38 pseudo region' because the duplicated sequence is masked in the Fixed hg38 file.

Fig. B: Aligned reads to 21q region that has false duplication on 21p in GRCh38 before masking. Note the BAC clone boundary where alignment of falsely duplicated region in 21p starts. This duplication involves the CBS gene (Gene ID: 875).

Fig. C: Aligned reads to 21p region falsely duplicated (in Fig. B). You can see ambiguous read alignment and the falsely duplicated CBS gene annotated in the gene track.

Fig. D: Alignment of BAC FP236240.8 (redundant BAC added to 21p for GRCh38) to the corresponding valid region on 21q. Note the redundant BAC alignment to the region (bottom pane) and the valid read alignment depth (shown in middle pane). Since the region was falsely duplicated, read alignment in the region of redundant BAC alignment is poor.

Monday, November 30, 2020

A New Rat Genome Assembly Sparks Membership of Rat and RGD in the Genome Reference Consortium

The Rat Genome Database (RGD) is very pleased to announce the release of mRatBN7.1, the new rat genome assembly! The mRatBN7 assembly, generated by the Darwin Tree of Life Project at the Wellcome Sanger Institute, is significantly improved over the Rnor6.0 and previous assemblies. mRatBN7 was derived from a male BN/NHsdMcwi rat that is a direct descendant of the female BN rat previously sequenced. The new BN rat reference genome was generated using multiple technologies including PacBio long reads, 10X linked reads, Bionano maps and Arima Hi-C. Its quality is a substantial improvement compared to any of the previous assemblies, with just 175 scaffolds with an N50 >135Mb and 756 contigs with an N50 >29Mb, resulting in a contiguity similar to the human or mouse reference assemblies.

The assembly has been submitted to the International Nucleotide Sequence Database Consortium (INSDC), and the initial GenBank record for it is now available at https://www.ncbi.nlm.nih.gov/assembly/GCA_015227675.1. Genome annotation, i.e. the assignment of gene positions and prediction of new genes and other genomic elements, will be generated by both NCBI and Ensembl.

We are also pleased to announce that Rat and the mRatBN7 assembly have been accepted into the Genome Reference Consortium (GRC) and the RGD has been approved to represent the rat research community and participate in the ongoing work of curating the assembly. RGD will work closely with curators from the GRC, with the International Rat Omics Consortium (IROC), a grassroots community of rat genomics researchers, and the rat research community to identify any candidate regions for focused genome curation. Stay tuned for the appearance of rat on the GRC website!

Hi-C 2D map of mRatBN7.1 generated with HiGlass

Wednesday, July 22, 2020

GRCm39: the new mouse reference genome assembly

The GRC is pleased to announce the release of GRCm39 (GCA_000001635.9), the latest version of the mouse reference genome assembly.

GRCm39 is the first coordinate-changing update to the mouse reference since the 2012 release of GRCm38. More than 400 reported issues were resolved in the production of the new assembly, which also incorporates the sequence edits released as scaffolds in the six GRCm38 patch releases.

The new reference assembly exhibits substantial improvements in contiguity. As shown in Fig 1, the scaffold N50 has increased by 95% to 106.1 Mb in GRCm39, and 1.9 Mb of non-N bases were added to the assembly. The gap count has been nearly cut in half, with the total gap length reduced by 4.5 Mb. The decrease in gap length reflects in part the use of optical map data to size the remaining gaps wherever possible, replacing many of the default 50 kb gaps found in GRCm38. Sequences used for gap closures included clones, GRC-constructed contigs, as well as contigs from the C57BL/6J long-read based assembly ASM377452v2.

Figure 1: GRCm39 Assembly Statistics
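For readers less familiar with the N50 statistic cited above: it is the length L such that sequences of length L or greater together contain at least half of the assembled bases. The Python sketch below computes it for toy scaffold lengths (illustrative values, not taken from GRCm38 or GRCm39) and shows why joining scaffolds through gap closure raises the figure. The function name `n50` is illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one sequence length")


# Toy scaffold lengths, for illustration only.
fragmented = [40, 30, 20, 10]
merged = [70, 20, 10]  # as if a gap closure joined the two largest scaffolds

print(n50(fragmented))  # 30
print(n50(merged))      # 70
```

The total length is the same in both toy assemblies; only the way it is divided among scaffolds changes, which is what N50 is designed to capture.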

As in prior assembly versions, the GRCm39 chromosome sequences continue to represent the C57BL/6J strain. However, the alternate loci scaffolds that provided additional strain representations for highly variant genomic regions in GRCm38 and MGSCv37 have been removed from the assembly. The relatively low usage of these scaffolds, coupled with a growing number of high quality strain-specific genome assemblies available in public sequence databases, such as those generated by the Mouse Genomes Project, has reduced the need for the inclusion of these sequences in the reference genome assembly. Although no longer affiliated with the reference assembly, these sequences remain available in the INSDC databases (GenBank/ENA/DDBJ).

The new reference assembly will be annotated by GENCODE and RefSeq in the coming months. An in-depth transcript alignment analysis of a pre-release version of the GRCm39 assembly, presented at the 2019 IMGC meeting, demonstrated that there is improved representation for more than 50 genes. A list of these genes is provided in our earlier blog post. The GRC will provide a complete list of genes improved in GRCm39 as the annotation effort progresses.

Notable curation activities represented in the new assembly, but not in previous patch releases, include the targeted update of more than 1,500 individual bases at which the GRCm38 allele representation was erroneous or an unsupported C57BL/6J variant, a substantial retiling of the chr X pseudo-autosomal region (PAR) that provides representation for several genes missing from GRCm38 (Fig 2), removal of a false triplication involving the Duxbl locus, and correction of a 16 Mb inversion at the proximal end of chromosome 14.

Figure 2. Genes in GRCm39 chr X PAR

The GRC wishes to thank the many members of the mouse community who have reported assembly issues, and contributed their time, expertise, and data to assist in curation efforts. Updates to the GRC website will be made to reflect the new assembly. With the release of GRCm39, the GRC's curation of the mouse genome reference assembly will be limited to the resolution of community reported problems. We encourage you to contact the GRC for additional information on the curation of assembly regions of interest. You can also subscribe to the grc-announce email list to receive email notification for all GRC assembly updates.

Monday, June 8, 2020

ZFIN and the GRC: Supporting the zebrafish reference genome assembly

ZFIN is a member of The Genome Reference Consortium (GRC), an international collaboration consisting of NCBI, the Wellcome Trust Sanger Institute, the McDonnell Genome Institute at Washington University, the European Bioinformatics Institute (EBI) and ZFIN. This consortium is tasked with ensuring that the reference assemblies for human, mouse, zebrafish and chicken are updated and improved through new data and analysis from genome centers and the research community.

The zebrafish-specific GRC webpage (Fig. 1) provides an overview of the zebrafish genome, including an ideogram of the latest zebrafish assembly (GRCz11) that highlights the location of alternate loci scaffolds, downloadable files for the current public assembly and the tiling path files reflecting the latest assembly edits, as well as links to genome assembly data and genome regions under review. Zebrafish genome issues (Fig. 2), such as sequencing errors, gaps, and path problems, can be browsed at the chromosome level, filtered by problem type or status, or searched by gene, location, clone name or accession.

If you come across what you suspect is a problem in the build in the course of your research, visit the GRC website to search the list of genome issues and if it has yet to be reported, select the "Report an Issue" tab in the header (Fig. 3) to report information about the potential problem in the build. Be as complete as possible and provide location, flanking sequences and a description of the issue. Genome annotators will evaluate the region, determine if an update to the genome is needed and submit data to create a new tiling path to improve the build with an update or "patch".

We welcome your feedback!

Figure 1

Figure 2

Figure 3

Thursday, May 23, 2019

Readying the release of GRCm39

GRCm38, the current mouse reference assembly, whose chromosomes represent the C57BL/6J strain, supports a broad range of research activities. Despite being one of the highest quality mammalian genome assemblies ever produced, it still has more than 600 gaps and includes sub-optimal representations for some genes. To address these issues and provide the murine research community with an improved substrate for their work, the GRC has been applying new technologies, such as optical/genome mapping, and using new sequence resources to curate an update to the reference genome assembly. The public release of the updated assembly, GRCm39, is planned for the end of 2019/early 2020.

Since the 2012 release of GRCm38, the last coordinate-changing update to the mouse reference, the GRC has provided 6 publicly accessible minor assembly updates, the last of which (GRCm38.p6) was released in September, 2017. These non-coordinate changing assembly versions, known as patch releases, cumulatively include 65 fix patches (chromosome path changes) and 9 novel patches (alternate representations of chromosome sequences, derived from other strains). In GRCm39, these fix patches will be incorporated into the chromosomes, and the novel patches will persist as alternate loci.

In preparation for the release of GRCm39, the GRC also analyzed a non-public updated version of the reference assembly. In the production of this updated assembly, 322 reported genome issues were resolved, 70% of which addressed gaps and problems with the sequence of underlying genomic clone components. The review evaluated the value and impact of the released patches and subsequent unreleased genome updates and assessed the need for additional work prior to the GRCm39 release.

Analyses of this assembly, known informally as "GRCm38B", reveal the removal of about 200 Kb of over-expanded sequence found in GRCm38, and the addition of 254 new components, of which 95 are contigs assembled from WGS reads (PRJNA51977). Due to the curation effort, GRCm38B has fewer gaps, and increased contig and scaffold N50s in comparison to GRCm38 (Table 1).

Furthermore, an analysis of mouse RefSeq transcripts aligned to GRCm38B demonstrates the improved representation for at least 50 genes (Table 2).

One example of improved gene representation in GRCm38B is shown in Figure 1. In GRCm38, an assembly gap at chromosome 4 nt 99,842,111, between components BX324127.8 and CU326395.5, results in a partial representation of Efcab7 (EF-hand calcium binding domain 7). In GRCm38B, sequence from MF597759.1 (a GRC-assembled contig of Illumina reads) closes this gap and provides the exons missing from Efcab7 transcript NM_145549.1.

Figure 1 Top: Incomplete representation of Efcab7 gene in GRCm38 due to an assembly gap. Middle: The gap is closed in GRCm38B with MF597759.1. It also provides complete representation of Efcab7. Bottom: Efcab7-mRNA partial representation in GRCm38 and complete representation in GRCm38B.

Based on GRCm38B analyses, 5 GRCm39 chromosomes will be comprised of a single scaffold (Chr. 11, 12, 15, 16, 18), 11 will be built from 2 scaffolds and the remaining 5 from more than 2 scaffolds.

In the months leading up to the GRCm39 release, the GRC will continue to curate additional genome issues. Sequences from the recently published C57BL/6J long-read based assembly ASM377452v2 are providing new resources for the update or closure of assembly gaps and correction of sequencing errors. Additionally, we are investigating individual bases at which the GRCm38 sequence differs from all 17 strain-specific genome assemblies (Mouse Genomes Project) with the aim of correcting confirmed erroneous bases.

Upon the release of GRCm39, the GRC's curation of the mouse genome reference assembly will be limited to the resolution of community reported problems.

You can browse the status of GRC curation activities at our website, and we encourage you to contact the GRC for additional information on the curation of assembly regions of interest. Updates to the timeline for the GRCm39 release will be provided on the Mouse Genome Overview webpage. You can also subscribe to the grc-announce email list to receive email notification for all GRC assembly updates.