GenomeRef

Tuesday, December 24, 2013

Announcing GRCh38

The GRC announces the public release of GRCh38, the latest version of the human reference genome assembly. This represents the first major assembly update since 2009, and introduces changes to chromosome coordinates. The GRC would like to thank the many individuals and groups that have provided helpful feedback and shared data, often ahead of publication, in efforts to improve the reference assembly. Such interactions help ensure the reference assembly is truly a community resource.

Users can download the latest version of the assembly from the GenBank FTP site: ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/

The GRC does not provide annotation for the assembly. The assembly will be picked up from this FTP site for annotation by the major browsers (UCSC, Ensembl and NCBI), who will make it available on their websites in the upcoming weeks and months.

GRCh38 highlights

Mitochondrial genome

MITOMAP, the organization responsible for managing human mitochondrial sequences, has kindly allowed the GRC to include the mitochondrial reference sequence with GRCh38. As in GRCh37, the current MT reference sequence is the Revised Cambridge Reference Sequence (rCRS), represented by GenBank accession number J01415.2 and RefSeq accession number NC_012920.1.

Sequence representation for centromeres

In previous reference assembly versions, the centromeres were represented by large, megabase-sized, gaps (N's in the assembly sequence). In GRCh38, these gaps are replaced by sequences derived from the reads generated during the sequencing of the HuRef genome. These sequences were used to create centromere models, as described in Miga et al., 2013, that provide the approximate repeat number and order for each centromere in the genome. These model centromere sequences are anticipated to be useful for read mapping and variation studies. Be on the lookout for upcoming GRC blogs with more information about these centromeres.

General assembly updates

Large-scale studies of human variation, such as the 1000 Genomes Project, identified a number of bases and indels in GRCh37 that were never seen in any individuals, suggesting they may represent errors in the assembly. Several thousand individual bases were updated in GRCh38, many of which corrected errors in coding sequence. In addition, a number of regions that were misassembled in GRCh37, such as 1q21, 10q11 and the chr. 9 peri-centromeric regions, have been retiled. Several highly variant genomic regions, such as the IGH locus, have been retiled with components derived from a single haplotype resource in order to ensure the reference assembly provides a valid haplotypic representation. More than 100 assembly gaps have also been updated; these are either closed or reduced, in many cases with publicly available WGS sequences from other genome sequencing projects.

Variation

Like GRCh37, the updated reference assembly provides alternate sequence representation for variant regions in the form of alternate loci (alt loci) scaffolds. The alt loci are stand-alone, accessioned sequences for which chromosomal context is provided via alignment to the reference chromosomes. All alternate loci include at least one anchor sequence, a component also found on the reference chromosomes, to ensure these alignments are of high quality. Alt loci belong to alternate loci assembly units: the assembly unit ALT_REF_LOCI_1 contains the first alternate sequence representation for any genomic locus, ALT_REF_LOCI_2 contains the second alternate sequence representation and so forth. GRCh38 contains 261 alt loci scaffolds, in 35 alternate assembly units. 72 of these alternate loci were previously available as NOVEL patches to GRCh37. The LRC/KIR complex on chr. 19 has the largest number of alternate sequence representations (35), followed by the MHC on chr. 6 (7).
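The numbering rule for alternate loci assembly units can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the locus labels and the scaffold names are hypothetical, and the sketch assumes scaffolds are simply numbered in the order they are listed for each locus.

```python
from collections import defaultdict


def assign_alt_units(alt_scaffolds):
    """Assign each alt scaffold to an ALT_REF_LOCI_<k> assembly unit.

    alt_scaffolds: iterable of (locus, scaffold_name) pairs.
    The k-th scaffold seen for a locus goes into ALT_REF_LOCI_<k>.
    """
    seen_per_locus = defaultdict(int)
    units = {}
    for locus, scaffold in alt_scaffolds:
        seen_per_locus[locus] += 1
        units[scaffold] = f"ALT_REF_LOCI_{seen_per_locus[locus]}"
    return units


# Hypothetical scaffold names, for illustration only.
example = [("MHC", "alt_A"), ("LRC_KIR", "alt_B"), ("MHC", "alt_C")]
units = assign_alt_units(example)
# alt_A and alt_B are each the first alternate for their locus (unit 1);
# alt_C is the second alternate for the MHC (unit 2).
```

Under this rule, the number of assembly units needed equals the largest number of alternate representations at any single locus, which is consistent with the 35 units and the 35 LRC/KIR representations quoted above.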

Friday, September 13, 2013

The non-nuclear genome: including a mitochondrial genome in GRCh38

The GRC maintains and improves the reference nuclear assembly only. The MITOMAP (http://www.mitomap.org/MITOMAP) group has served a similar function curating genetic variation of the mitochondrial genome. When the GRC released GRCh37 we only released the nuclear genome, as we didn't think we could distribute the sequence representing the mitochondrial (MT) reference (which we did not produce) as part of the GRCh37 reference. Not distributing an MT sequence with the reference nuclear assembly led to some confusion in the research community, as different groups adopted different versions of the MT sequence records to use as their reference.

The current MT reference sequence is the Revised Cambridge Reference Sequence (rCRS) represented by GenBank accession number J01415.2 and RefSeq accession number NC_012920.1. The MITOMAP group has graciously allowed the GRC to distribute their annotated version of the rCRS MT reference sequence with the nuclear assembly and this sequence was added to GRCh37 with the second patch release. This same MT reference sequence will be included in the GRCh38 assembly release.

Tuesday, September 10, 2013

Tech Tip: The Pseudo-autosomal region of the reference assembly

As part of normal cell division, maternal and paternal chromosome copies pair (that is, maternal chromosome 1 pairs with paternal chromosome 1) and exchange genetic material. This is an important step in cell division and poses a unique problem for males, who normally only have a single copy of the X and Y chromosomes. The X and Y chromosomes have regions at each end referred to as the pseudo-autosomal region, or the PAR. The PAR regions of the X and Y are homologous, which allows the X and Y to pair and exchange genetic material within these regions. Within the PAR regions, males contain two copies of genes (one on the X and one on the Y), whereas on the rest of the X and Y chromosomes they only contain one copy. There are two PAR regions, one at either end of the X and Y chromosomes.

Human chromosomes X and Y with the PAR locations highlighted in orange and denoted by gray triangles. The PAR 1 region (on Xp and Yp, to the left) is larger than the PAR 2 region (on Xq and Yq). The PAR 2 region orange highlight is not visible as it is very small with respect to the rest of the chromosome.

During the Human Genome Project (HGP) a decision was made to not separately sequence the PAR region of a Y chromosome. At that point, the reference assembly was meant to be a haploid genome representation, so only one copy of the PAR regions was necessary. In order to build the Y chromosome, the HGP made a copy of the X chromosome PAR sequence and inserted it into the chromosome Y sequence assembly. This was done in order to make a complete model of each chromosome; without the PAR representation, the Y sequence assembly would be incomplete.

Since the GRC has taken responsibility for the reference assembly, we have updated the assembly model so that the reference is no longer a single, haploid representation [PubMed]. However, when we introduce allelic duplication, we put the duplicated sequence into a separate assembly unit (refresher on the assembly model). The PAR is the only case where we represent allelic duplication within the same assembly unit. This was true for GRCh37 and will again be the case when we submit GRCh38 to GenBank.

This duplication needs to be taken into account when performing sequence analysis so that you can distinguish allelic duplication from other types of duplications, such as repeats and segmental duplication. When you sequence a female sample, reads from the PAR regions will align to both the X and Y PAR sequences. This may affect the mapping quality of reads in this region and can affect variant calling. One approach many groups take to solving this problem is to 'hard-mask' the PAR regions on the Y chromosome, which means replacing the actual sequence with Ns. This preserves the sequence coordinate space of the Y chromosome, but eliminates the duplication at this locus.
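Hard-masking can be sketched as a simple in-place substitution. The Python below is a minimal illustration, not a GRC tool: the function name is hypothetical, intervals are assumed to be 1-based and inclusive, and the toy sequence and intervals stand in for the real chromosome Y sequence and PAR coordinates, which should be taken from the assembly you are using.

```python
def hard_mask(sequence, intervals):
    """Return `sequence` with each 1-based, inclusive interval replaced by Ns.

    The output has the same length as the input, so downstream
    coordinates are unchanged.
    """
    masked = list(sequence)
    for start, end in intervals:
        if not 1 <= start <= end <= len(sequence):
            raise ValueError(f"interval {start}-{end} is outside the sequence")
        masked[start - 1:end] = "N" * (end - start + 1)
    return "".join(masked)


# Toy example: mask one region at each end of a 10-base sequence.
toy_y = "ACGTACGTAC"
toy_par = [(1, 3), (9, 10)]  # hypothetical intervals, not real PAR coordinates
masked_y = hard_mask(toy_y, toy_par)
# masked_y == "NNNTACGTNN"
```

Because the masked sequence keeps its original length, any annotation or alignment coordinates outside the masked intervals remain valid.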

Wednesday, January 9, 2013

Genome Update: Highly variant immune regions retiled as single haplotype paths

Genes encoding proteins that compose the immune system are constantly evolving in response to selective pressures from pathogens. This rapid host-pathogen co-evolution has led to large families of genes that are highly polymorphic and are often a result of gene duplication and diversification. In GRCh37, the current reference assembly, some chromosome regions encompassing such genes are comprised of components from several different genomic libraries. The lack of a single haplotype and the excess allelic variation at such regions hinder haplotype inference using traditional linkage disequilibrium based methodology. In addition, given the polymorphic nature of these genes, paralogs may be missing from the reference assembly. The CHORI-17 BAC library, derived from a hydatidiform mole, is an excellent resource for resolving loci such as these, as it is composed of germline material without any allelic variation. We sequenced clones from CHORI-17 to create a single haplotype across two of these loci: the leukocyte receptor complex (LRC) and the immunoglobulin heavy chain locus (IGH). These new paths have now been released as fix patches in GRCh37.p11.

The LRC on chromosome 19q13.4 is approximately 1 Mbp and contains many genes related to immune response including the LILR (Leukocyte Immunoglobulin-like Receptor) and KIR (Killer Immunoglobulin-like Receptor) gene families (Fig.1). The products of these genes interact with HLA molecules making them important components of the innate immune response. The GRC previously released 8 novel patches providing partial representation of the LRC region for eight different haplotypes. We have now released a fix patch (KB021647.1) for this region that provides full representation for the CHORI-17 haplotype. In GRCh38, this patch will be incorporated into the reference chromosome, replacing the GRCh37 mixed haplotype. The CHORI-17 haplotype harbors the common 6.8 kbp LILRA3 deletion, which has been associated with multiple autoimmune disorders such as psoriasis and multiple sclerosis. In addition, the KIR haplotype is the A01 haplotype, which contains the 22 bp frameshift deletion variant of the 2DS4 gene that inactivates the protein.

Fig. 1 Top: Alignment of GRCh37 chr. 19 to the LRC region fix patch. Bottom: Alignment of the fix patch and 8 LRC region novel patches to GRCh37 chr. 19. The blue bars represent the tiling paths of chr. 19 (NC_000019.9) and the fix patch (KB021647.1). The region of the fix patch comprised of CHORI-17 clones is highlighted in orange. Genes annotated on the chromosome are shown in green. The gray tracks below represent the alignments: the thin horizontal lines indicate gaps, while the small vertical red bars indicate mismatches. The red arrows show the location of the LILRA3 deletion in the CHORI-17 haplotype.

The 1 Mbp IGH locus on chromosome 14q32.33 contains genes that encode the heavy chain of immunoglobulin molecules that interact with antigen epitopes (Fig. 2). This locus is even more complicated than the LRC, given that the IGH genes are subject to somatic rearrangements, and attempts to reconcile the organization of the locus using B-lymphocyte derived material have been difficult. The GRC has now released a fix patch (KB021645.1) that provides a single haplotype representation for the majority of this locus, covering the IG variable domain encoding gene segments. The CHORI-17 haplotype adds 101 kbp of previously uncharacterized sequence, including functional IGH variable genes and four large germline copy number variants (Watson and Steinberg, in review).

Fig. 2. Top: Alignment of GRCh37 chr. 14 to the IGH region fix patch. Bottom: Alignment of the fix patch to GRCh37 chr. 14. The blue and gray bars represent the tiling paths of chr. 14 (NC_000014.8) and the fix patch (KB021645.1). The region of the fix patch comprised of CHORI-17 clones is highlighted in orange. Genes annotated on the chromosome are shown in green. The purple bars below represent the alignments: the thin regions indicate gaps, while the small vertical ticks indicate mismatches.

These two updates highlight the utility of using hydatidiform mole BAC libraries for resolving complex, highly duplicated loci of the human genome. Because these updates are released as fix patches to the reference sequence, researchers can make use of these high quality sequences to better characterize sequence variation from their own disease association studies ahead of the GRCh38 genome update.

Thursday, July 26, 2012

The GRC and the 10th International Zebrafish Genetics and Development Meeting (June 20-24, 2012 - Madison, Wisconsin)

Members from the GRC attended the 10th International Zebrafish Genetics and Development meeting in Madison, Wisconsin, to gain insight into trends of current research and provide information to the community. New approaches for the identification and targeting of mutations (e.g. Whole Genome Sequencing and the use of Transcription Activator-Like Effector Nucleases (TALENs)) were a recurring theme throughout the meeting, highlighting the importance of improved reference assemblies with accurate sequence. Whole genome sequencing is being used in forward genetic screens for the identification of mutations, whereas TALENs are being used to generate targeted gene mutations in zebrafish. A high quality reference genome assembly is of utmost importance to the success of both of these technologies to allow accurate mapping for experimental design (oligonucleotide design for TALENs) and subsequent data analysis (mapping whole genome sequence data to the reference for mutation identification). Discussions held at the meeting indicated that the zebrafish research community would welcome a new assembly.

An interim build, GRCz9b, was performed internally at the GRC prior to the conference to gauge the value added since the Zv9 release in 2010 and to assess the scope of work required for further improvements in preparation for GRCz10. This test assembly was met with high interest at the meeting, including requests for access to the data. Improvements to the assembly include the exclusion of 19 Mb of previously duplicated, over-expanded sequence (Fig. 1 and 2), the addition of 935 new clones, and an increase in scaffold N50 of 14%. The gene count does not differ significantly from Zv9.
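For readers less familiar with the scaffold N50 statistic mentioned above, here is a minimal Python sketch of the usual definition: the length of the scaffold at which the running total of lengths, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the toy lengths are illustrative only and are not taken from Zv9 or GRCz9b.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one scaffold length")


# Toy lengths (arbitrary units); total is 290, so half is 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)
# 80 + 70 = 150 >= 145, so toy_n50 == 70
```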

Fig. 1: A PGPviewer screenshot from the Zv9 and GRCz9b assemblies, showing localisation of a previously unlocalised scaffold. In Zv9, FP236471 was not localised to a chromosome and remained on a single scaffold (Scaffold Zv9_scaffold3539), whereas FP003601 was located to chromosome 4. The red block at the end of FP236471, in Zv9, indicates sequence similarity between the two clones. In GRCz9b, an overlap has been made, adjoining the two contigs, with a perfect sequence alignment, indicated by the green clone overlap block at the bottom of the screenshot. The feature tracks have been kept to a minimum for illustration purposes.

Fig. 2: A PGPviewer screenshot from the Zv9 and GRCz9b assemblies, showing the correction of an over-expanded region in Zv9. In Zv9, CU571256 and BX927241 reside next to each other with no overlap. However, the red blocks show a potential alignment between them and the end sequence alignments also indicate an over-expansion. In GRCz9b, the clones overlap with a highly variable alignment due to the haplotypic nature of the two libraries used, as highlighted by the red clone overlap block at the bottom of the screenshot. The feature tracks have been kept to a minimum for illustration purposes.

Approaches underway, to complement existing genome curation, include increasing the coverage of SATMAP, the high-density meiotic map used to allocate all genomic contigs, along with the sequencing of more than 1000 genomic clones to fill gaps and cover those genes still missing. To ensure effective utilization of the new clone sequence, GRCz10 will contain unfinished sequence with HTGS phase 2 (ordered and oriented contigs), unlike its predecessor. The GRC are planning to release GRCz10 in 2013.

Friday, July 6, 2012

Hidden assembly problems exposed

The human reference genome GRCh37 represents the highest quality mammalian genome assembly ever to be produced. It's played a major role in advancing both basic research and clinical research, and it continues to teach us that there's much we still don't know about human genomic biology. However, it is important to keep in mind that the assembly isn't perfect. While many users may be aware of some of the more visible issues with the assembly, such as gaps or missing genes, there are other assembly problems that may be less apparent. As noted in a previous GRC blog post, erroneous bases are one such problem. While the reference assembly is accurate to an error rate of ~1 in 100,000 bases, this still means that approximately 28,000 of the 2.85 billion bases are inaccurate. Using data from sources such as the 1000 Genomes Project, the GRC is working to address these sequencing errors for GRCh38, the next genome version that will be publicly released.

Component assembly problems represent another less-recognized source of error in the reference genome. The human reference genome is a clone-based assembly, sequenced using Sanger technology. Generating the consensus sequence for each of these genomic clone components involved their sub-cloning into plasmid vectors, sequencing and reassembly. During the course of the human genome project, many of the sequencing centers made note of problems encountered during the sequencing of these clones as annotations on the sequence records. Colloquially known as "black-tag" annotations, these represent regions of uncertainty due to force joins, unresolved tandem repeats, low quality or coverage sequence, and other biological and technical challenges. Although these annotations are part of public sequence records, they have historically been difficult to interpret in the context of the genome assembly because their coordinates were component, rather than scaffold or chromosome based.

The GRC has now mapped the clone assembly problems annotated on assembly components to the coordinates of the top-level molecules in which they are found (scaffold or chromosome) in the human, mouse and zebrafish reference assemblies. These data are available in GFF3 and ASN.1 format on the GRC's public FTP site. These files allow users to see whether problems have been annotated at their coordinates of interest or if the locations of other genome features, such as segmental duplications or structural variation, fall in annotated problem regions and potentially represent false calls. Additionally, the GRC is cross-checking reports of genome problems with these coordinates in an effort to understand the underlying cause of these issues (Figure 1). It should be noted, however, that not all genomic components were equivalently annotated during the course of the human genome project. Thus, the absence of an annotation in these files does not mean that no problem was encountered during component sequencing, just that none was noted in its record. We encourage users to review the interpretations of their own assembly annotations in light of these data and we welcome your feedback!
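The idea behind this coordinate conversion can be sketched as follows. This is a simplified Python illustration, not the GRC's pipeline: the `Placement` class and `to_chrom` function are hypothetical names, coordinates are assumed to be 1-based and inclusive, and each component is assumed to be placed on the chromosome as a single ungapped span in either forward or reverse orientation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Placement:
    """Where a span of a component sits on a chromosome (1-based, inclusive)."""
    chrom_start: int
    chrom_end: int
    comp_start: int
    comp_end: int
    strand: str  # "+" if the component is in forward orientation, "-" if reversed


def to_chrom(placement: Placement, comp_pos: int) -> Optional[int]:
    """Convert a component position to a chromosome position.

    Returns None if the position lies outside the span of the component
    that was actually used in the assembly.
    """
    if not placement.comp_start <= comp_pos <= placement.comp_end:
        return None
    offset = comp_pos - placement.comp_start
    if placement.strand == "+":
        return placement.chrom_start + offset
    return placement.chrom_end - offset


# Hypothetical placements: component bases 101-600 on chromosome bases 1001-1500.
forward = Placement(1001, 1500, 101, 600, "+")
reverse = Placement(1001, 1500, 101, 600, "-")
fwd_pos = to_chrom(forward, 150)   # 1001 + 49 = 1050
rev_pos = to_chrom(reverse, 150)   # 1500 - 49 = 1451
unused = to_chrom(forward, 50)     # None: base 50 is not part of the assembly
```

Note that an annotation falling in a part of the component that does not contribute to the assembly has no chromosome coordinate at all, which is why the sketch returns `None` in that case.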

Figure 1. Graphical image of a region on GRCh37 chr. 15 (NC_000015.9; 99.63-99.65 Mb). The blue bar at top represents the component from which the chromosome sequence is derived at this location (AC036108.19 (RP11-6O2)), while an NCBI gene annotated in this vicinity is shown just below (green bar). At bottom, the small vertical black tick marks show the positions of clone assembly problems that were annotated on the component sequence. The GRC received two separate reports of assembly problems in this region (HG-971 and HG-352). Review shows that both correspond to the positions of annotations indicating problems in the assembly component. Both genome errors have been corrected by the addition of a new component to the assembly and will be released as FIX patches in GRCh37.p9.

Tuesday, May 29, 2012

Updating the Assembly: What's in a base? (Part 2)

Deciding how the GRC should address rare bases, as well as common bases that result in non-functional alleles (a.k.a. polymorphic pseudogenes), in the reference assembly is a substantially more complex task than dealing with erroneous bases. It is made even more so by the wide range of opinions held by assembly users about which allele the reference should represent, which include:

Most common allele
Ancestral allele
Coding allele

All views have merit, and the opinion held by any genome user is likely to be colored by their own research needs, which may include short read mapping, variation analysis and clinical testing, among others. Regardless of view, it is important to recognize that updating bases in isolation to meet any of these goals runs the risk of creating false haplotypes, that is, haplotypes not observed in any individual. Thus, further analyses are needed to investigate the possible mechanisms by which any such changes can be made.

The GRC currently favors a model in which haplotypic integrity is retained within blocks of linkage disequilibrium (LD) as far as possible, every base is found at an MAF >5% in some population (i.e. no universally rare alleles), and coding alleles are favored over non-coding alleles, so long as they too are not universally rare. However, additional analyses will be performed before any base changes are made. Examples of genomic regions where the existing reference base is associated with disease (ASPN, PMID:15640800) or non-coding variants (CYP3A5, PMID:11279519) are presented below in Figures 1 and 2. In the former case, the reference base is the minor allele, while in the latter, it is the major allele. We invite you to consider examples such as these as you form your own views of what should be represented in the reference assembly. If you have questions or concerns about base updates for GRCh38, let us know!
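To make two of the criteria in this model concrete, the Python sketch below applies the "not universally rare" filter and the preference for coding alleles to a single site. It is a simplified illustration and not the GRC's procedure: the function names, the tie-breaking rule and all allele frequencies are invented for the example, and the sketch deliberately ignores the requirement to preserve haplotypes within LD blocks, which cannot be judged one site at a time.

```python
def universally_rare(freqs, threshold=0.05):
    """True if the allele frequency is at or below `threshold` in every population."""
    return all(freq <= threshold for freq in freqs.values())


def preferred_allele(alleles, threshold=0.05):
    """Pick one allele at a site, or None if every allele is universally rare.

    alleles: list of dicts with keys "name", "coding" (bool) and
    "freqs" (population -> allele frequency).
    Coding alleles are favored; ties are broken (an assumption of this
    sketch) by the highest frequency seen in any population.
    """
    eligible = [a for a in alleles if not universally_rare(a["freqs"], threshold)]
    if not eligible:
        return None
    best = max(eligible, key=lambda a: (a["coding"], max(a["freqs"].values())))
    return best["name"]


# Invented frequencies: the coding allele is the minor allele in pop_a
# but is not universally rare, so it is preferred.
site_1 = [
    {"name": "coding", "coding": True, "freqs": {"pop_a": 0.08, "pop_b": 0.60}},
    {"name": "non_coding", "coding": False, "freqs": {"pop_a": 0.92, "pop_b": 0.40}},
]
# Invented frequencies: the coding allele is rare everywhere, so it is excluded.
site_2 = [
    {"name": "coding", "coding": True, "freqs": {"pop_a": 0.01, "pop_b": 0.03}},
    {"name": "non_coding", "coding": False, "freqs": {"pop_a": 0.99, "pop_b": 0.97}},
]
choice_1 = preferred_allele(site_1)  # "coding"
choice_2 = preferred_allele(site_2)  # "non_coding"
```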

Fig. 1. Zoomed-in graphical view of the ASPN gene in GRCh37. The assembly sequence is shown at top. The ASPN gene is shown in green, and alignments of the corresponding RefSeq transcripts are in grey. The thin red line in the alignments corresponds to a 3 nt indel (TCA). The reference insertion creates an additional aspartic acid in a run of aspartic acids (red box). The reference allele (D14) is a minor allele (MAF between 0.05 and 0.10 in various populations) and is associated with osteoarthritis susceptibility. Other clone-based sequences exist that contain the more common, non-disease-associated allele.

Fig. 2. Zoomed-in graphical view of the CYP3A5 gene in GRCh37. The assembly sequence is shown at top. The CYP3A5 gene is shown in green. The highlighted base in the GRCh37 reference assembly represents the major allele in many populations. This allele creates a cryptic splice site that disrupts the reading frame of CYP3A5 and results in a non-coding transcript. However, in other populations, the coding allele is the major allele.