Friday, April 27, 2012

Updating the Human Reference Assembly, part 1

Talking about updating the human reference assembly, currently GRCh37 (hg19), can elicit groans and howls of protest from genome scientists who have put considerable effort into analyzing a given data set against the reference assembly. To address this concern, we introduced the notion of a 'Genome Patch'; that is, scaffold-sized sequences that either add sequence representation (NOVEL patches) or fix existing problems in the current reference assembly (FIX patches). In this way, we can make our best representation of the assembly available without disrupting the reference chromosome coordinates. We are at our eighth patch release (GRCh37.p8) and we now have 69 FIX patches and 71 NOVEL patches.


It is the FIX patches we'd like to consider right now. While the patch scaffolds are easy enough to use if you are interested in a single region, most analysis pipelines have not incorporated these sequences and the improved data remains largely unused for whole genome or exome analysis. It is worth noting that NCBI and Ensembl provide gene annotation on many of the patch releases. Doing a major update to the reference assembly (making GRCh38) will allow us to incorporate these FIX patches into the chromosome assembly, making them directly accessible to analysis pipelines.

There are 66 regions (>40.5 Mb) on GRCh37 that are associated with these 69 FIX patches. In addition to other sequence changes that improve the reference assembly, the sequences in these FIX patches provide more than 4.7 Mb of novel sequence. Adding large amounts of novel sequence, like the 2.6 Mb added by JH636052.1 (1q21 region), is impressive. However, novel sequence is not the only metric to consider when evaluating FIX patches. For example, GL383543.1/NW_003315932.1 (described in HG-544) is a FIX patch for the FAM23A_MRC1 region on NC_000010.10 (chr10: 17613209-18252930) and adds no novel sequence to the reference assembly. Instead, it removes roughly 200 Kb of artificially redundant sequence and closes a gap in the assembly. The alignment of the patch to the chromosome is shown in Figure 1 (below).
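To put the FAM23A_MRC1 numbers in proportion, here is a minimal Python sketch that parses the coordinate string quoted above and computes the size of the affected chromosome region. It assumes the coordinates are 1-based and inclusive, which the post does not state, and the `Region`, `parse_region` and `region_length` names are purely illustrative rather than part of any GRC or NCBI tool.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive


def parse_region(text: str) -> Region:
    """Parse a string such as 'chr10: 17613209-18252930' into a Region."""
    match = re.fullmatch(r"\s*(\S+?)\s*:\s*([\d,]+)\s*-\s*([\d,]+)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized region string: {text!r}")
    chrom, start, end = match.groups()
    return Region(chrom, int(start.replace(",", "")), int(end.replace(",", "")))


def region_length(region: Region) -> int:
    """Length in bases, counting both endpoints (inclusive coordinates assumed)."""
    return region.end - region.start + 1


fam23a_mrc1 = parse_region("chr10: 17613209-18252930")
span = region_length(fam23a_mrc1)
redundant = 200_000  # "roughly 200 Kb" from the text; an approximation

print(f"{fam23a_mrc1.chrom} region spans {span:,} bp")
print(f"redundant sequence is roughly {redundant / span:.0%} of that span")
```

Under those assumptions, the artificially redundant sequence accounts for a sizeable fraction of the region associated with the patch, even though the patch contributes no novel sequence.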





Figure 1: FAM23A_MRC1 region on chromosome 10: The top panel shows chr10 in GRCh37. The blue/black line at the top represents the sequence. The track below that shows the GenBank components used to assemble the chromosome, followed by the NCBI genes, the alignment of the chromosome to the patch sequence and, finally, the segmental duplication track. The second panel shows the FIX patch sequence, which has no gap, along with the genes annotated on the patch and the alignment to the chromosome. The patch removes roughly 200 Kb of artificially redundant sequence (meaning the data in the segmental duplication track is an artifact) and corrects the gene annotation in the region, removing two gene models that represent false gene duplications and don't exist in the population.



The artificial duplication in the assembly not only affects the gene annotation but also has a significant effect on the alignment of short reads as shown in Figure 2 (below), or you can see the alignments

Figure 2: Alignments of 1000 Genomes data in the FAM23A_MRC1 region on chromosome 10: The 1000 Genomes data alignments are in the tracks below the orange bar, which marks where the artificial duplication exists in the reference assembly. Two low-coverage samples (NA19625 and NA19701) are aligned using BWA and Mosaic respectively. In the two Mosaic tracks, there is a visible drop off in alignment depth. This is less pronounced in the BWA alignments. The red coloring indicates mismatches in the alignments.


While it is well recognized that sequences missing from the reference, particularly missing paralogs (see Sudmant et al., 2010 for more information), have effects on next-generation sequence analysis, it should be noted that artificial duplication within the assembly, such as the example shown here, can also significantly impact such analyses. With the latest patch release we have updated 5 such regions (covering 2 Mb). Several other regions that are phenotypically important, but were represented by a mixed haplotype in GRCh37, have also been updated, including the Williams region on chr7 and the 1q21 region on chr1.
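If you want to know whether a position you care about sits in a chromosome region associated with a FIX patch, a simple interval lookup is enough. The sketch below is only an illustration: the `PatchRegion` structure and the `fix_patches_at` function are hypothetical names, the table holds just the one region described in this post, and the coordinates are assumed to be 1-based and inclusive. A real table would need to be filled in from the patch release data itself.

```python
from typing import NamedTuple


class PatchRegion(NamedTuple):
    accession: str   # patch scaffold accession
    patch_type: str  # "FIX" or "NOVEL"
    chrom: str
    start: int       # assumed 1-based, inclusive chromosome coordinate
    end: int         # assumed 1-based, inclusive chromosome coordinate


# Only the region described in this post; a real table would list every patch.
PATCH_REGIONS = [
    PatchRegion("GL383543.1", "FIX", "chr10", 17613209, 18252930),
]


def fix_patches_at(chrom: str, pos: int, regions=PATCH_REGIONS):
    """Return accessions of FIX patches whose chromosome region contains pos."""
    return [
        r.accession
        for r in regions
        if r.patch_type == "FIX" and r.chrom == chrom and r.start <= pos <= r.end
    ]


# A position inside the FAM23A_MRC1 region, then one outside it.
print(fix_patches_at("chr10", 18000000))
print(fix_patches_at("chr10", 19000000))
```

A linear scan like this is fine for a few dozen regions. Calls that land in a flagged region are good candidates for a second look against the patch scaffold itself.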

We'll talk about other things FIX patches get you in a later post. Additionally, we'll be highlighting some biologically interesting regions!

6 comments:

  1. Where is the blog post related to GRCh38???

  2. According to the GRC homepage: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml

    "We are planning to update the human reference assembly to GRCh38 in the summer of 2013"

  3. This post is about one of the types of problems that will be fixed in GRCh38. We are planning on doing the update next summer and we want to explain to folks why we think this is important. We also want to get feedback from the community about this update.

  4. There is a great deal of confusion and misunderstanding regarding patches - many people labor under the misapprehension that the primary assemblies have been updated with each patch release. The GRC must make clear who and what the patches are for, and how they can be integrated into analyses, or perhaps why they shouldn't.

  5. No sh!t!! I've been looking for a couple hours now to try to find something on how to (correctly) build an updated GRC37.p11 with no luck.
