Wednesday, March 20, 2019

GRCh38.p13 has been released


The GRC is pleased to announce that GRCh38.p13 is now available! This release adds 45 new scaffolds: 43 FIX patches and 2 NOVEL patches. The FIX patch scaffolds provide assembly corrections while the NOVEL patch scaffolds deliver new alternate sequence representations. A valuable contribution to this patch release comes in the addition of the Nucleolus Organiser Region (NOR) sequences for the short arms of the acrocentric chromosomes (13, 14, 15, 21, and 22) as provided by Brian McStay's group (PMID: 23990606). The NOR additions will be discussed in detail in a separate blog.

With access to an ever-increasing pool of high-quality, long-read human assembly data, the GRC has been able to use these data in GRCh38.p13 to address genome issues that had persisted until now because of either a lack of data or complexity. Much of the data added in this patch is derived from the CHM1 human haploid hydatidiform mole assembly (GCA_001297185.2). Originally produced as part of an assembly comparison analysis (see PMID: 28396521), the assembly was recently Pilon-corrected and re-submitted to GenBank by the McDonnell Genome Institute at Washington University, a GRC center, with the specific aim of improving its base-pair accuracy so that its sequences can be used to improve the Human Genome Reference.

In GRCh38.p13, a total of 28 assembly gaps have been closed. These updates, together with sequences added to correct 5 clone errors, add more than 0.5 Mb of unique data to the assembly. The majority of unique sequences added in this release come from contigs that are components of WGS assemblies derived from PacBio sequence reads, such as the CHM1 assembly mentioned above. However, genomic clone libraries still play an important role in assembly curation. In this release, BAC clones from a human cell line (CHM1htert) have provided complete, single-haplotype representations of clinically important regions such as Prader-Willi on chromosome 15 and CT47 on chromosome X.

The CT47 cancer/testis antigen located on human Xq24 is organized as an array of 4.8 kb tandemly repeated units. Due to the repetitive nature of the sequence involved, coupled with the limitations of the technologies available at the time, the representation of the CT47 gene cluster in GRCh37 and GRCh38 was problematic. The region is gapped, and the flanking clones are from different haplotypes. As a consequence, the representation of the cluster in these assemblies was incomplete and biologically unsound, representing an indeterminate number of gene copies (Figure 1, top).
Studies have indicated that this polymorphic array is highly variable between haplotypes and ranges from 4 to 17 copies in length. Long-read sequencing of genomic clones has now captured the complete CT47 cluster as a single haplotype. The FIX patch (ML143381.1) included in the GRCh38.p13 release now provides a contiguous and validated representation of the CT47 genomic region. This patch closes the assembly gap with sequences from BAC clone AC275592.1 (CH17-182I12), which contains a complete, 7-copy representation of the CT47 array (Figure 1, bottom). Note that this update reduces the number of CT47 genes represented compared with GRCh37 and GRCh38.
Figure 1: (Top) CT47 region in GRCh38. Incomplete representation of the CT47 gene cluster in GRCh38 due to an assembly gap. (Bottom) CT47 FIX patch in GRCh38.p13. The gap is closed and a complete representation of the CT47 cluster is provided.
Optical mapping technology has been used to confirm that the copy number of the CT47 array is accurate for the haploid CHM1htert sample (Figure 2), from which the clone library was derived.
Figure 2: AC275592.1 alignment to CHM1 Bionano optical map.
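
To give a rough sense of what these copy numbers mean in sequence terms, the short sketch below multiplies the 4.8 kb repeat unit size by the copy numbers quoted above. It assumes every unit is exactly 4.8 kb and ignores flanking sequence, so the spans are approximations only. The function name `array_span_kb` is hypothetical and is not part of any GRC tool.

```python
# Approximate CT47 array span from copy number.
# Assumption: every tandem repeat unit is exactly 4.8 kb (the unit size
# quoted in the post); real units may vary slightly in length.
UNIT_KB = 4.8


def array_span_kb(copies: int, unit_kb: float = UNIT_KB) -> float:
    """Return the approximate span of a tandem array in kb, rounded to 0.1 kb."""
    if copies < 0:
        raise ValueError("copy number cannot be negative")
    return round(copies * unit_kb, 1)


if __name__ == "__main__":
    cases = [
        ("fewest copies reported", 4),
        ("AC275592.1 (GRCh38.p13 FIX patch)", 7),
        ("most copies reported", 17),
    ]
    for label, copies in cases:
        print(f"{label}: {copies} copies, about {array_span_kb(copies)} kb")
```

Under these assumptions the 7-copy array in the patch spans roughly 33.6 kb, while haplotypes at the extremes of the reported range would span roughly 19.2 kb to 81.6 kb.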

As more data generated with the latest technologies become available, the GRC is able to use them to continually update and improve the reference genome. If you have questions about this process, let us know.

You can download the GRCh38.p13 assembly, including the alignments of the patches to GRCh38, from the GenBank FTP.