Tuesday, May 29, 2012

Updating the Assembly: What's in a base? (Part 2)

How the GRC should address rare bases in the reference assembly, as well as common bases that result in non-functional alleles (a.k.a. polymorphic pseudogenes), is a substantially more complex question than how to deal with erroneous bases. It is made even more so by the wide range of opinions held by assembly users about which allele the reference should represent, which include:
  • Most common allele
  • Ancestral allele
  • Coding allele
All views have merit, and the opinion held by any genome user is likely to be colored by their own research needs, which may include short read mapping, variation analysis and clinical testing, among others. Regardless of view, it is important to recognize that updating bases in isolation to meet any of these goals runs the risk of creating false haplotypes: those not observed in any individual. Thus, further analyses are needed to investigate the possible mechanisms by which any such changes can be made.
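The false-haplotype risk can be illustrated with a toy example. This is only a sketch: the three-site haplotypes below are invented for illustration, and `per_site_major` is a hypothetical helper, not a GRC tool. It picks the most common base at each site independently and then checks whether the resulting combination was ever observed.

```python
from collections import Counter

def per_site_major(haplotypes):
    """Pick the most common base at each site, ignoring linkage between sites."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*haplotypes))

# Invented phased haplotypes over three linked sites (toy data, not real).
observed = ["ATG"] * 3 + ["GCG"] * 3 + ["ACT"] * 3

consensus = per_site_major(observed)
is_false_haplotype = consensus not in set(observed)

print(consensus)            # ACG
print(is_false_haplotype)   # True
```

Each base in `ACG` is the major allele at its own site, yet no individual in the toy data carries that combination.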

The GRC currently favors a model in which haplotypic integrity is retained within blocks of linkage disequilibrium (LD) as far as possible, every base is found at an MAF >5% in some population (i.e. no universally rare alleles) and coding alleles are favored over non-coding alleles, so long as they too are not universally rare. However, additional analyses will be performed before any base changes are made. Examples of genomic regions where the existing reference base is associated with disease (ASPN, PMID:15640800) or non-coding variants (CYP3A5, PMID:11279519) are presented below in Figures 1 and 2. In the former case, the reference base is the minor allele, while in the latter, it is the major allele. We invite you to consider examples such as these as you form your own views of what should be represented in the reference assembly. If you have questions or concerns about base updates for GRCh38, let us know!
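As a rough sketch of the per-site part of this model, the rules can be written as a small filter. Everything here is an assumption made for illustration: the function names, the data layout, the invented frequencies, and the tie-break (highest frequency in any population). The sketch does not model LD blocks, which the GRC treats as the overriding constraint.

```python
def universally_rare(freqs_by_pop, threshold=0.05):
    """True if the allele never exceeds the threshold in any population."""
    return all(freq <= threshold for freq in freqs_by_pop.values())

def preferred_allele(alleles, threshold=0.05):
    """Drop universally rare alleles, then favor coding over non-coding alleles."""
    candidates = [a for a in alleles if not universally_rare(a["freqs"], threshold)]
    if not candidates:
        return None
    coding = [a for a in candidates if a["coding"]]
    pool = coding or candidates
    # Tie-break (an assumption): highest frequency seen in any population.
    best = max(pool, key=lambda a: max(a["freqs"].values()))
    return best["name"]

# Toy site loosely modelled on the CYP3A5 situation; frequencies are invented.
site = [
    {"name": "non-coding", "coding": False, "freqs": {"pop_A": 0.90, "pop_B": 0.30}},
    {"name": "coding", "coding": True, "freqs": {"pop_A": 0.10, "pop_B": 0.70}},
]
# Same site, but with a coding allele that is rare everywhere.
rare_coding_site = [
    {"name": "non-coding", "coding": False, "freqs": {"pop_A": 0.99, "pop_B": 0.98}},
    {"name": "coding", "coding": True, "freqs": {"pop_A": 0.01, "pop_B": 0.02}},
]

print(preferred_allele(site))              # coding
print(preferred_allele(rare_coding_site))  # non-coding
```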

Fig. 1. Zoomed-in graphical view of the ASPN gene in GRCh37. The assembly sequence is shown at top. The ASPN gene is shown in green, and alignments of the corresponding RefSeq transcripts are in grey. The thin red line in the alignments corresponds to a 3 nt indel (TCA). The reference insertion creates an additional aspartic acid in a run of aspartic acids (red box). The reference allele (D14) is a minor allele (MAF between 0.05 and 0.10 in various populations) and is associated with osteoarthritis susceptibility. Other clone-based sequences exist that contain the more common, non-disease-associated allele.


Fig. 2. Zoomed-in graphical view of the CYP3A5 gene in GRCh37. The assembly sequence is shown at top. The CYP3A5 gene is shown in green. The highlighted base in the GRCh37 reference assembly represents the major allele in many populations. This allele creates a cryptic splice site that disrupts the reading frame of CYP3A5 and results in a non-coding transcript. However, in other populations, the coding allele is the major allele.

Friday, May 18, 2012

Updating the assembly: What's in a base? (Part 1)

The human genome reference assembly is the highest quality mammalian assembly to have ever been produced. As reported in the summary paper for the human genome sequencing project (PMID:15496913), the assembly covers ~99% of the euchromatic genome and is accurate to an error rate of ~1 per 100,000 bases. With recent advances in technology driving down the cost of sequencing for individual labs, as well as the efforts of large consortia like the 1000 Genomes Project, there has been an explosion in the numbers of human genomes sequenced, providing not only more sequence representation for our species, but a means to look at human genetic variation. The availability of this sequence now presents the GRC the opportunity to identify and address incorrect and rare bases in the reference assembly.

An analysis of genotyped bases from the phase 1 data of the 1000 Genomes Project identified approximately 27,000 GRCh37 reference bases that were never observed in any of the sequenced individuals, suggesting that they may be sequencing errors. Notably, this number is consistent with the expected error rate for a finished genome of 2.85 billion bases. At an additional ~650 sites, the reference allele has a frequency <5% across all combined populations, implying that the assembly contains rare alleles at these positions. However, it is important to note that distinguishing erroneous bases from rare bases is not a trivial task. Even with the large number of individuals sequenced in the 1000 Genomes Project, it is to be expected that some of the “erroneous” bases will be reclassified as “rare” if additional individuals or populations are sequenced.
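The consistency claim can be checked with a quick calculation, using the ~1 per 100,000 error rate quoted in Part 1 and the figures in this paragraph. The `classify_reference_base` helper is a hypothetical illustration of the two categories described here, not the pipeline the GRC used.

```python
GENOME_BASES = 2_850_000_000   # finished genome size quoted in the text
BASES_PER_ERROR = 100_000      # error rate of ~1 per 100,000 bases

expected_errors = GENOME_BASES // BASES_PER_ERROR
never_observed = 27_000        # reference bases never seen in phase 1 data
ratio = never_observed / expected_errors

print(expected_errors)   # 28500
print(f"{ratio:.2f}")    # 0.95

def classify_reference_base(ref_allele_freq, rare_threshold=0.05):
    """Hypothetical labels: never observed, rare (<5% combined), or common."""
    if ref_allele_freq == 0:
        return "candidate error"
    if ref_allele_freq < rare_threshold:
        return "rare"
    return "common"
```

A site labelled "candidate error" in this way may still turn out to be "rare" once more individuals are sequenced, which is why the label is only a starting point for review.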

The GRC is reviewing these sites with respect to read depth and map quality and, where feasible, performing additional sequence analysis, with the aim of correcting confirmed erroneous bases. Special attention is being given to erroneous bases reported to affect coding sequences (approximately 300 instances), and by extension, gene annotation. More than 20 of these coding errors have already been corrected by the GRC, and they are being released publicly as FIX patches prior to the release of GRCh38. One such correction, in SLC46A1 (a gene in which mutations result in hereditary folate malabsorption disease; shown below in Fig. 1), was achieved by changing the switch point positions of existing assembly components. Providing error-free gene models in the reference assembly is a GRC priority, as it should improve the ability of clinical geneticists to use the reference assembly as a model for review of test results.

The GRC is hard at work on these base corrections. If you have concerns or questions about this process, let us know!

Next week: Dealing with "rare" bases in the reference assembly.

Fig. 1. Graphical view of the region on GRCh37 encompassing SLC46A1. The blue bars represent the underlying assembly components. The gene is shown in green, and alignments of the corresponding RefSeq transcripts and FIX patch are shown below in grey. The red marks in the alignments correspond to mismatches. The positions of clinical and cited variants of this gene are shown in purple and blue boxes, respectively. In GRCh37, a 1 nt indel in the underlying component sequence resulted in a non-functional SLC46A1 representation in the assembly. The switch point between the components has now been changed, excluding the indel and resulting in a functional SLC46A1. This change is represented in the FIX patch JH159145.1.

Friday, May 11, 2012

Updating the genome: correcting the assembly of 10q11.22

Human GRCh37 patch release 8 contains an update to previously released fix patch HG1211_PATCH.

The patch encompasses a ~3 Mb region of GRCh37 at chr10:46,256,855-49,299,273.
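For readers who want the exact span, a minimal sketch of the arithmetic follows. It assumes the coordinates are 1-based and inclusive, and `parse_region` is a hypothetical helper written for this example.

```python
def parse_region(region):
    """Split a 'chrom:start-end' string (commas and spaces allowed) into its parts."""
    chrom, span = region.replace(" ", "").split(":")
    start, end = (int(value.replace(",", "")) for value in span.split("-"))
    return chrom, start, end

chrom, start, end = parse_region("chr10: 46,256,855-49,299,273")
length = end - start + 1  # assumes 1-based, inclusive coordinates

print(chrom, length)  # chr10 3042419
```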

The tile path in the 10q11.22 region has been extensively altered from its previously fragmented state to one where a single gap remains, between BX649215.1 and AC245041.3. The reworking of the tile path in the region has been carried out using clones in the existing build and additional finished clones not previously in GRCh37.

Working with optical map data provided by the Schwartz Lab, we have been able to identify errors in the GRCh37 assembly and have consequently worked to correct them. The optical map analysis also highlighted redundancy in the assembly causing artificial duplication, which has now been addressed within this patch.
Above: Optical map consensus alignments to GRCh37 10q11.22.
Below: Optical map consensus alignments to the fix patch (JH591181.2)
Legend: Pink track: Clone path; Green: Contig gap; Blue: In silico SwaI fragments.
For the aligned optical map consensus: Gold: Concordant fragment; Red: Missing fragment (seen where the OM consensus spans a gap); Grey: Unaligned fragment.
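The in silico SwaI fragments in the legend are the fragment sizes predicted by cutting the assembly sequence at every SwaI site; these are what the optical map consensus is compared against. A minimal sketch of such a digest is shown below. It assumes the SwaI recognition sequence ATTTAAAT with a blunt cut between ATTT and AAAT, uses an invented toy sequence, and is not the software used for the analysis described here.

```python
import re

SWAI_SITE = "ATTTAAAT"  # SwaI recognition sequence (palindromic)
CUT_OFFSET = 4          # blunt cut between ATTT and AAAT

def in_silico_digest(seq, site=SWAI_SITE, cut_offset=CUT_OFFSET):
    """Return the fragment lengths produced by cutting seq at every site."""
    seq = seq.upper()
    # The lookahead lets overlapping occurrences of the site be found.
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]

# Invented toy sequence with two SwaI sites.
toy = "GG" + "ATTTAAAT" + "CCCCC" + "ATTTAAAT" + "G"
fragments = in_silico_digest(toy)

print(fragments)  # [6, 13, 5]
```

A missing or extra fragment relative to the optical map is the kind of discrepancy that points to a misplaced, misoriented or redundant clone.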

The optical map information was consistent with a path problem in this region. The map data suggested that several clones were misplaced and did not represent a valid chromosome structure. In addition to rearranging several clones (including changing the orientation of some clones in the path), three finished clones were added to the path and several redundant clones were removed. The new path contains a single gap that we estimate, based on optical mapping, to be about 90 kb. The figure below shows an alignment of the patch sequence to the current chr10 assembly.
The panel to the left shows an overview of chr. 10. The orange dots represent fix patches we've released and the blue dots represent novel patches. The arrow shows the location of the 10q11.22 fix patch. To the right, the top panel shows the chr. 10 tiling path (in grey), with the annotated RefSeq genes below that (in green) and the alignment to the fix patch below that (in purple). The bottom panel shows the patch tiling path and alignment to the chromosome.

Friday, May 4, 2012

Filling in the gaps to better understand human biology

Duplicated segments pose serious problems for the assembly and annotation of the human genome. In the human reference genome there are still large gaps that require specialized efforts to fill. Many of these gaps lie within highly duplicated segments in which the degree of sequence variation among duplicated loci approaches levels of allelic variation. Many people assume that much of the sequence that is still missing from the reference assembly is not very biologically interesting. However, it has become increasingly apparent that the segmental duplications themselves provide the molecular basis for many human genetic disorders. The resolution of these regions is therefore essential for a complete understanding of the genetic basis of human disease. 

Three patches released in GRCh37.p8, which together add almost 400 kb of novel sequence, prove the concept that sequence missing so far from the reference genome can be of crucial importance. The biological story surrounding these sequences can be found in a recent publication from the Eichler lab (Dennis et al., 2012), but here we'll tell you a little bit about how we worked with the Eichler lab to create these assembly patches.

Figure 1: Ancestral copy of SRGAP2 in chimpanzee (left) and human (right). The other red ticks on the human chromosome show the human-specific duplications added by this effort.

One of the impediments to resolving the complexity of these regions is the diploid nature of the human genome. We recently took advantage of a haploid BAC library resource (CHORI-17) from hydatidiform mole DNA to close gaps and resolve the genomic structure of segmental duplications encompassing highly identical paralogs of SRGAP2, a gene important in cortex development. Hydatidiform moles are conception abnormalities that most often arise from the fertilization of an enucleated ovum by a single X-bearing sperm. Subsequent diploidization results in a 46,XX karyotype in which all allelic variation has been eliminated, allowing the unambiguous delineation of duplicated DNA as well as haplotype characterization. Our SRGAP2 sequencing efforts resolved the sequence and structure of four copies of the gene on human chromosome 1, three of which represent human-specific duplicate truncations of the original ancestral gene.

Overall, we added >380 kbp of new sequence previously absent from the human reference genome, including 40 kbp within the conserved ancestral copy of the gene. Additionally, we discovered ~560 kbp of sequence mapped incorrectly either in orientation or position. This region in GRCh37 contained 15 gaps; in the new sequence patches, only two gaps remain. Combined, we generated or corrected 0.4% of human chromosome 1 euchromatic sequence. The sequencing of these genes has made it possible to explore the function of the human-specific duplicate copies, particularly their role in neurological traits and disorders unique to humans.
SRGAP2A (1q32 region): JH636054.1
SRGAP2B,D (1q21 region): JH636052.1
SRGAP2C (1p12 region): JH636053.1

Figure 2: View of SRGAP2 gene family on chromosome 1 ideogram (with 1q on the left). The arrows show the order and direction of duplication with the estimated time (in millions of years ago) below that. (Dennis et al., 2012)