Monday, April 18, 2016

Chicken assembly curation at GRC

Upon the release of Gallus_gallus-5.0 (GCA_000002315.3), the GRC assumed responsibility for the continued curation of the chicken reference genome assembly from the International Chicken Genome Consortium. The assembly represents the Red Jungle Fowl strain, inbred line UCD001. All sequences in the assembly are derived from a single individual from this line, female "RJF #256". The assembly is a hybrid composed primarily of WGS contigs, into which genomic clones have been integrated. Planned curation efforts focus on selecting genomic clones (BACs) for sequencing and assembly, specifically from the CHORI-261 library, to fill known gaps and resolve assembly errors.

Figure 1: Ideogram of Gallus_gallus-5.0 (GCA_000002315.3)

Information about the current assembly and curation efforts is now available on the GRC website. An interactive interface (Figure 2) provides access to genome regions currently under review, while a series of issue pages provide details and mapping information for each of these regions, as well as a graphical view (Figure 3). For more information on these pages, see our previous blog posts on issue pages and the issue overview page.

Figure 2: Overview of issues reported on the chicken reference genome assembly.

Figure 3: Example of a page with issue-specific details

The GRC welcomes feedback on the current assembly from members of the chicken research community. Users can either Report an Issue or Contact Us for more information about the assembly. A survey regarding the usability of the GRC website is also ongoing.

Thursday, April 7, 2016

Updates to DUX4 region on subtelomeric chromosome 4q


Sequence improvements and increased variant representation in the human genome reference assembly are priorities for the GRC. This blog post describes two updates in the recent GRCh38.p7 patch release affecting the DUX4 region located at the chromosome 4q sub-telomere: (1) a FIX patch correcting the chromosomal representation (KQ983257.1) and (2) a NOVEL patch representing a variant of the region (KQ983258.1). A third haplotype of the DUX4 region, which is represented in GRCh37, is also described.

The DUX4 region contains tandem arrays of a 3.3 Kb D4Z4 macrosatellite repeat located in the sub-telomeric region of chromosome 4q. The number of D4Z4 repeats is highly polymorphic, ranging from 8 to 100 copies in healthy individuals, but only 1 to 10 units in individuals with facioscapulohumeral muscular dystrophy-1 (FSHD1: MIM#158900). The contraction of the repeat arrays is believed to reduce the epigenetic repression exerted by D4Z4, leading to transcriptional activation of the DUX4 gene. Three main haplotypes, known as 4qA, 4qB and 4qA-L, have been reported for this region (1-3). The 4qB haplotype has homology to 4qA in the D4Z4 repeats, but is completely different in the distal region (1,2). 4qA, considered the reference haplotype, is the ancestral and most common haplotype. 4qA and 4qA-L are associated with FSHD, while 4qB is not (2). The different haplotypes exhibit population stratification (3).

The GRC received user feedback that the DUX4 region is incorrectly represented on chromosome 4 of the GRCh38 assembly. The repeat structure in the region complicates both sequencing and assembly. Clones representing both the 4qA and 4qB haplotypes are present in GRCh38, creating a haplotype expansion and an accompanying "false" gap at NC_000004.12: 190,123,122 bp, between components AC225782.3 (WI2-3035O22; 4qB) and AC215524.3 (ABC7-42391500H16; 4qA (partial)) (Figure 1, top). In the GRCh38.p7 fix patch scaffold KQ983257.1, the GRC provides a complete representation of the 4qA haplotype by replacing AC225782.3 with CT476828.7 (RP11-242C23) and eliminating the gap (Figure 1, bottom). The patch provides 14 copies of the DUX4 repeat. This fix patch representation of the region will be incorporated into chromosome 4 in the as-yet-unscheduled GRCh39 assembly release, and the 4qB representation currently present on the GRCh38 chromosome will be provided as an alternate locus scaffold.
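To give a rough sense of scale for the repeat arrays discussed above, the following minimal Python sketch multiplies a copy number by the approximate 3.3 Kb unit length quoted in this post. The function name and the 3,300 bp constant are illustrative assumptions, and the result ignores partial units, unit-length variation and flanking sequence, so it should be read as an estimate rather than a measured patch length.

```python
# Approximate D4Z4 unit length in bp, taken from the "3.3 Kb" figure in this post.
# Integer base pairs are used to avoid floating-point rounding.
D4Z4_UNIT_BP = 3300


def approx_array_length_bp(copies: int, unit_bp: int = D4Z4_UNIT_BP) -> int:
    """Return a rough repeat-array length in bp for a given copy number.

    Hypothetical helper for illustration only: it ignores partial units,
    unit-length variation and any flanking sequence.
    """
    if copies < 0:
        raise ValueError("copies must be non-negative")
    return copies * unit_bp


if __name__ == "__main__":
    # 14 copies, as provided by the fix patch described above.
    print(f"{approx_array_length_bp(14):,} bp")
```

By the same estimate, the 8 to 100 copy range quoted for healthy individuals corresponds to arrays of roughly 26 to 330 Kb.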


Figure 1 Top: DUX4 region in GRCh38. The DUX4 gene is incompletely represented in GRCh38 because the region mixes the 4qB haplotype with a partial 4qA haplotype. Bottom: DUX4 fix patch in GRCh38.p7. The gap is closed and a complete representation of the 4qA haplotype is provided.

The GRCh38.p7 release also includes KQ983258.1 as a novel patch scaffold providing representation for the variant 4qA-L haplotype. This variant is very similar to haplotype 4qA (Figure 2, top, green dashed line) but diverges in the distal-most DUX4 copy (Figure 2, top, red dashed line), which is about 1.6 Kb larger in 4qA-L (Figure 2, bottom). The functional implication of this difference is not yet known.
 
Figure 2 Top: Novel patch representing DUX4 variant 4qA-L. This variant is similar to haplotype 4qA, but the region distal to the last D4Z4 unit is longer in 4qA-L. Bottom: Schematic of D4Z4 repeat arrays for the 4qA and 4qA-L regions, adapted from Lemmers et al., 2010.


References:
1. van Geel, M. et al. Genomics 79, 210–217 (2002).
2. Lemmers, RJLF. et al. Nature Genetics 32, 235–236 (2002).
3. Lemmers, RJLF. et al. Science 329, 1650–1653 (2010).