Centromeres are specialized chromatin structures that are required for cell division. The composition of these regions is complex, as they are made up of a series of tandem repeats that are arranged into nearly identical multi-megabase arrays. The size and repetitive nature of these regions mean they are typically not represented in reference assemblies. The Human Genome Project (HGP) employed a clone-based strategy (largely BAC clones) to produce the reference assembly, but cloning centromere sequences generally requires special effort, and the approach is not readily applicable to all human centromeres (see Kouprina et al., 2003 for one such effort). With the recent widespread adoption of whole genome sequencing (WGS), alpha-satellite sequences are clearly present in the reads produced, but the repetitive nature of these sequences makes it impossible to assemble them into faithful representations of centromeres using standard techniques.
In all previous versions of the human reference assembly, each centromere region has been represented by a 3 Mb gap (that is, a stretch of 3 million Ns). Recent efforts by Karen Miga and her colleagues are helping us improve centromere representation in the reference assembly. The GRCh38 reference assembly incorporates centromere models created by Miga and colleagues, along with their model of one of the heterochromatic regions on the long arm of chromosome 7. These models replace the multi-megabase gaps that are in GRCh37.
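To make the idea of a gap as "a stretch of Ns" concrete, here is a minimal sketch of how such runs could be located in a sequence string. The function name `find_n_gaps` and the toy sequence are illustrative assumptions, not part of any GRC tool, and the toy gap is far shorter than a real 3 Mb centromere gap.

```python
import re


def find_n_gaps(sequence, min_length=1):
    """Return half-open (start, end) coordinates of runs of N.

    Only runs at least `min_length` bases long are reported.
    `find_n_gaps` is a hypothetical helper written for this example.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_length:
            gaps.append((match.start(), match.end()))
    return gaps


# Toy sequence (made up): a 10-base gap and a 2-base gap.
toy = "ACGT" + "N" * 10 + "GGCC" + "NN" + "TT"
long_gaps = find_n_gaps(toy, min_length=5)
all_gaps = find_n_gaps(toy)
```

On a real assembly the same scan would be run per chromosome with a much larger `min_length`, so that only the multi-megabase placeholder gaps are reported.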
As described in Miga et al., 2013, Karen and her colleagues used the whole genome shotgun reads that were generated as part of the Venter sequencing project (Levy et al., 2007) to build centromere models (Fig. 1). They started by identifying sequence reads containing alpha-satellite centromere sequences. They then used these reads to construct models representing the approximate repeat number and order for each of the centromeric alpha-satellite higher-order arrays in the genome. Because there are two copies of each centromere for each autosome, these centromere models represent an average of the two centromere copies. On the acrocentric chromosomes, where there is extreme inter-chromosomal array sequence homogeneity, the array models found in GRCh38 include data from all four acrocentric regions. The team was also able to use read-pair information to link the modeled scaffold arrays to the adjacent euchromatic sequence present in the Venter assembly.
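The first step, picking out reads that contain alpha-satellite sequence, can be illustrated with a toy k-mer screen. This is only a sketch of the general idea and not the method used by Miga et al.; the reference string, the function names, and the thresholds `k` and `min_fraction` are all assumptions made for the example, and the reference is a made-up sequence rather than a real alpha-satellite monomer.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def satellite_fraction(read, reference_kmers, k):
    """Fraction of the read's k-mers that also occur in the reference."""
    read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not read_kmers:
        return 0.0
    hits = sum(1 for kmer in read_kmers if kmer in reference_kmers)
    return hits / len(read_kmers)


def select_satellite_reads(reads, reference, k=8, min_fraction=0.5):
    """Keep reads whose k-mer content mostly matches the reference.

    Simplification: reverse-complement matches are ignored here.
    """
    reference_kmers = kmers(reference, k)
    return [r for r in reads
            if satellite_fraction(r, reference_kmers, k) >= min_fraction]


# Made-up stand-in for a satellite monomer (NOT a real sequence).
reference = "ACGTTGCAAGGCTTAACCGGATCCTAGG"
matching_read = reference[5:25]   # drawn from the reference
unrelated_read = "T" * 20         # shares no 8-mers with it
selected = select_satellite_reads([matching_read, unrelated_read], reference)
```

A real screen would also need to handle both strands and sequencing errors. The later steps described above (ordering repeats and estimating repeat number) are outside the scope of this sketch.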