Friday, May 18, 2012

Updating the assembly: What's in a base? (Part 1)

The human genome reference assembly is the highest-quality mammalian assembly ever produced. As reported in the summary paper for the human genome sequencing project (PMID:15496913), the assembly covers ~99% of the euchromatic genome and has an error rate of ~1 per 100,000 bases. With recent advances in technology driving down the cost of sequencing for individual labs, as well as the efforts of large consortia like the 1000 Genomes Project, there has been an explosion in the number of human genomes sequenced, providing not only more sequence representation for our species, but also a means to look at human genetic variation. The availability of this sequence now gives the GRC the opportunity to identify and address incorrect and rare bases in the reference assembly.

An analysis of genotyped bases from the phase 1 data of the 1000 Genomes Project identified approximately 27,000 GRCh37 reference bases that were never observed in any of the sequenced individuals, suggesting that they may be sequencing errors. Notably, this number is consistent with the expected error rate for a finished genome of 2.85 billion bases (at ~1 error per 100,000 bases, roughly 28,500 errors would be expected). At an additional ~650 sites, the reference allele exhibits a minor allele frequency (MAF) <5% across all combined populations, implying that the assembly contains rare alleles at these positions. However, it is important to note that distinguishing erroneous bases from rare bases is not a trivial task. Even with the large number of individuals sequenced in the 1000 Genomes Project, it is to be expected that some of the “erroneous” bases will be reclassified as “rare” if additional individuals or populations are sequenced.
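For readers who want to check the arithmetic, here is a minimal Python sketch using only the figures quoted above. The function name, the category labels and the exact handling of the 5% threshold are assumptions made for illustration; this is not the pipeline used in the analysis.

```python
# Figures quoted in the post.
GENOME_SIZE = 2_850_000_000     # bases in the finished genome
BASES_PER_ERROR = 100_000       # ~1 error per 100,000 bases
UNOBSERVED_REF_BASES = 27_000   # reference bases never seen in phase 1 data

# Expected number of erroneous bases at the published error rate.
expected_errors = GENOME_SIZE / BASES_PER_ERROR

# How the count of never-observed reference bases compares to expectation.
ratio = UNOBSERVED_REF_BASES / expected_errors


def classify_reference_allele(ref_allele_frequency, rare_threshold=0.05):
    """Label a reference base by how often it is seen in sequenced samples.

    Hypothetical helper: the labels and threshold handling are assumptions.
    """
    if ref_allele_frequency == 0:
        # Never observed: a candidate error, though it may turn out to be
        # a rare allele once more individuals are sequenced.
        return "candidate error"
    if ref_allele_frequency < rare_threshold:
        return "rare"
    return "common"


if __name__ == "__main__":
    print(f"Expected errors: {expected_errors:,.0f}")
    print(f"Observed / expected: {ratio:.2f}")
    print(classify_reference_allele(0.0))
    print(classify_reference_allele(0.01))
```

Note that a frequency of zero in a finite sample does not prove an error, which is why the sketch labels such sites "candidate error" rather than "error".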

The GRC is reviewing these sites with respect to read depth and mapping quality and, where feasible, performing additional sequence analysis, with the aim of correcting confirmed erroneous bases. Special attention is being given to erroneous bases reported to affect coding sequences (approximately 300 instances) and, by extension, gene annotation. More than 20 of these coding errors have already been corrected by the GRC, and the corrections are being released publicly as FIX patches prior to the release of GRCh38. One such correction (shown below in Fig. 1), in SLC46A1, a gene in which mutations result in hereditary folate malabsorption disease, was achieved by changing the switch point positions of existing assembly components. Providing error-free gene models in the reference assembly is a GRC priority, as it should improve the ability of clinical geneticists to use the reference assembly as a model for review of test results.

The GRC is hard at work on these base corrections. If you have concerns or questions about this process, let us know!

Next week: Dealing with "rare" bases in the reference assembly.

Fig. 1. Graphical view of the region on GRCh37 encompassing SLC46A1. The blue bars represent the underlying assembly components. The gene is shown in green, and alignments of the corresponding RefSeq transcripts and FIX patch are shown below in grey. The red marks in the alignments correspond to mismatches. The positions of clinical and cited variants of this gene are shown in purple and blue boxes, respectively. In GRCh37, a 1-nt indel in the underlying component sequence resulted in a non-functional SLC46A1 representation in the assembly. The switch point between the components has now been changed, excluding the indel and resulting in a functional SLC46A1. This change is represented in the FIX patch JH159145.1.
