Friday, May 4, 2012

Filling in the gaps to better understand human biology

Duplicated segments pose serious problems for the assembly and annotation of the human genome. In the human reference genome there are still large gaps that require specialized efforts to fill. Many of these gaps lie within highly duplicated segments in which the degree of sequence variation among duplicated loci approaches levels of allelic variation. Many people assume that much of the sequence that is still missing from the reference assembly is not very biologically interesting. However, it has become increasingly apparent that the segmental duplications themselves provide the molecular basis for many human genetic disorders. The resolution of these regions is therefore essential for a complete understanding of the genetic basis of human disease. 

Three patches released in GRCh37.p8, which together add almost 400 kbp of novel sequence, demonstrate that sequence still missing from the reference genome can be of crucial importance. The biological story surrounding these sequences can be found in a recent publication from the Eichler lab (Dennis et al., 2012); here we'll tell you a little about how we worked with the Eichler lab to create these assembly patches.

Figure 1: Ancestral copy of SRGAP2 in chimpanzee (left) and human (right). The other red ticks on the human chromosome show the human-specific duplications added by this effort.

One of the impediments to resolving the complexity of these regions is the diploid nature of the human genome. We recently took advantage of a haploid BAC library resource (CHORI-17), built from hydatidiform mole DNA, to close gaps and resolve the genomic structure of segmental duplications encompassing highly identical paralogs of SRGAP2, a gene important in cortex development. Hydatidiform moles are conception abnormalities that most often arise from the fertilization of an enucleated ovum by a single X-bearing sperm. Subsequent diploidization results in a 46,XX karyotype in which all allelic variation has been eliminated, allowing the unambiguous delineation of duplicated DNA as well as haplotype characterization. Our SRGAP2 sequencing efforts resolved the sequence and structure of four copies of the gene on human chromosome 1, three of which represent human-specific, truncated duplicates of the original ancestral gene.

Overall, we added >380 kbp of new sequence previously absent from the human reference genome, including 40 kbp within the conserved ancestral copy of the gene. Additionally, we discovered ~560 kbp of sequence that had been mapped incorrectly in either orientation or position. This region contained 15 gaps in GRCh37; in the new sequence patches, only two gaps remain. Combined, we generated or corrected 0.4% of the euchromatic sequence of human chromosome 1. The sequencing of these genes has made it possible to explore the function of the human-specific duplicate copies, particularly their role in neurological traits and disorders unique to humans.

The three patches and their accessions:

SRGAP2A (1q32 region): JH636054.1
SRGAP2B,D (1q21 region): JH636052.1
SRGAP2C (1p12 region): JH636053.1
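
As a rough sanity check on the 0.4% figure, the added and corrected sequence can be compared with the size of chromosome 1. The sketch below uses the approximate totals quoted in this post; the euchromatic length of chromosome 1 (about 225 Mbp) is our own round-number assumption for illustration and is not taken from the publication.

```python
# Back-of-the-envelope check of the "0.4% of chromosome 1" figure.
# Sequence totals are the approximate values quoted in the post (base pairs).
NEW_SEQUENCE_BP = 380_000        # ">380 kbp" of newly added sequence (lower bound)
CORRECTED_SEQUENCE_BP = 560_000  # "~560 kbp" previously mis-oriented or misplaced

# ASSUMPTION: round-number estimate of the euchromatic (ungapped) length of
# chromosome 1 in GRCh37. Used only for illustration.
CHR1_EUCHROMATIC_BP = 225_000_000


def percent_affected(new_bp: int, corrected_bp: int, total_bp: int) -> float:
    """Return the percentage of total_bp that was newly added or corrected."""
    return 100 * (new_bp + corrected_bp) / total_bp


percent = percent_affected(NEW_SEQUENCE_BP, CORRECTED_SEQUENCE_BP, CHR1_EUCHROMATIC_BP)
print(f"Generated or corrected: {percent:.2f}% of chromosome 1 euchromatin")
```

With these inputs the result is roughly 0.42%, in line with the 0.4% stated above. A different estimate of the euchromatic length would shift the value slightly.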

Figure 2: View of the SRGAP2 gene family on the chromosome 1 ideogram (with 1q on the left). The arrows show the order and direction of duplication, with the estimated time of each duplication (in millions of years ago) shown below. (Dennis et al., 2012)