Monday, October 24, 2016

Are you familiar with CYP2D6 and its importance in drug metabolism?

CYP2D6 (Gene ID: 1565) is a gene associated with the metabolism of ~25% of clinically prescribed drugs, including antidepressants, neuroleptics and opioids. CYP2D6 is located on chromosome 22 (22q13.1), near two cytochrome P450 pseudogenes, CYP2D7 and CYP2D8P. Because of its functional importance, the GRC updated the chromosomal representation for CYP2D6 to the sequence-corrected clinical standard (CYP2D6*1A) in GRCh38. The version of CYP2D6 found in the prior assembly version, GRCh37, was retained in GRCh38 as an alternate loci scaffold (KI270928.1).

The Genome Reference Consortium, in conjunction with the Pharmacogenomics Research Network, has also sought to identify and provide reference assembly representation for structural variation at the CYP2D6 locus. Much of this work was done by examining the end-sequence alignments of different fosmid libraries to the reference (Kidd et al.). As the reference assembly was known to represent CYP2D6, CYP2D7 and CYP2D8P each in single copy, we could ascertain potential duplication and deletion alleles of these genes by identification of discordant fosmid end-sequences (Figure 1).

Figure 1. Alignment of fosmid ends from the ABC12 library to GRCh38 chr. 22 in the vicinity of CYP2D6. Lines connect ends belonging to the same clone. Concordant placements (length within 3 standard deviations of the library insert average and inward facing ends) are shown in blue; clones with discordant placements are in red. 
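
The concordance rule in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustration of the stated criteria (insert length within 3 standard deviations of the library average, and inward-facing ends). The `ClonePlacement` class, the library statistics and the clone names are hypothetical and are not taken from the ABC12 data.

```python
from dataclasses import dataclass


@dataclass
class ClonePlacement:
    """Placement of one fosmid's two end sequences on the reference."""
    name: str
    left_pos: int      # reference coordinate of the left-most end
    right_pos: int     # reference coordinate of the right-most end
    left_strand: str   # "+" or "-"
    right_strand: str  # "+" or "-"


def is_concordant(p, mean_insert, sd_insert, n_sd=3.0):
    """Apply the two criteria from the caption: length and orientation."""
    span = p.right_pos - p.left_pos + 1
    inward_facing = p.left_strand == "+" and p.right_strand == "-"
    length_ok = abs(span - mean_insert) <= n_sd * sd_insert
    return inward_facing and length_ok


# Hypothetical library statistics and placements, for illustration only.
MEAN_INSERT, SD_INSERT = 40_000, 2_500
clones = [
    ClonePlacement("clone_a", 1_000, 41_000, "+", "-"),  # expected span
    ClonePlacement("clone_b", 1_000, 60_000, "+", "-"),  # span too long
    ClonePlacement("clone_c", 1_000, 41_000, "+", "+"),  # wrong orientation
]
for c in clones:
    label = "concordant" if is_concordant(c, MEAN_INSERT, SD_INSERT) else "discordant"
    print(c.name, label)
```

Under these assumptions, a clone whose ends map much farther apart than expected hints that the clone lacks sequence present in the reference, while ends that map too close together hint at extra sequence in the clone.
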
As of the GRCh38.p7 assembly release, there are 3 alt loci and 3 novel patch scaffolds that provide representation for significant structural variation in the CYP2D6 region (Table 1). Figure 2 shows the alignment of these scaffolds to the reference chromosome, highlighting the diversity in the variant representations. An example of a CYP2D6 triplication haplotype is shown in Figure 3.

Table 1: Scaffolds providing alternate sequence representations for the CYP2D6 region, as of GRCh38.p7.
The inclusion of these additional representations for the locus in the reference assembly is intended to help in the evaluation of CYP2D6 variant alleles from other samples. When reads are aligned with an alternate-aware aligner, such as bwa-mem or SRPRISM, the variant scaffolds can be included in the target assembly, which should enable identification of the haplotype that most closely matches the query sample.
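
As a rough illustration of how alignments to the variant scaffolds might be summarized, the Python sketch below counts mapped SAM records per scaffold. It assumes the aligner wrote standard SAM text and that scaffold accessions appear as reference names. Counting records is only a crude proxy for the "closest match"; a real analysis would also weigh alignment scores and mapping quality. The function name, the `exclude_flags` parameter and the example reads are hypothetical. Aligners differ in how they flag hits to alternate scaffolds, so the choice of flags to exclude is left to the caller.

```python
from collections import Counter

UNMAPPED = 0x4  # SAM flag bit for an unmapped read


def count_hits(sam_lines, scaffolds, exclude_flags=UNMAPPED, min_mapq=0):
    """Count SAM records per scaffold of interest."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete SAM record
            continue
        flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
        if flag & exclude_flags:
            continue
        if rname in scaffolds and mapq >= min_mapq:
            counts[rname] += 1
    return counts


# Synthetic records for illustration; the read data are not real.
example = [
    "@SQ\tSN:KI270928.1\tLN:1000",
    "r1\t0\tKI270928.1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tNW_014040931.1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tNW_014040931.1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
hits = count_hits(example, {"KI270928.1", "NW_014040931.1"})
print(hits.most_common())
```
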
Figure 2. Alignment of alt loci and patch scaffolds to the CYP2D6 region of GRCh38 chr. 22. The blue bars at top represent the genomic clones comprising chr. 22. The NCBI RefSeq annotation of the chromosome is shown in the middle. The grey bars at the bottom are the alignments. Thin horizontal red lines represent deletions in the scaffolds relative to the chromosome, while vertical blue bars represent the locations of scaffold insertions.

Figure 3. Graphical view of patch scaffold NW_014040931.1, with the alignment of CYP2D6, revealing a haplotype containing a triplication of this locus.

Tuesday, May 17, 2016

ZFIN joins the GRC

ZFIN, the zebrafish model organism database, is joining the GRC and is planning to take over the maintenance and curation of the zebrafish genome assembly. ZFIN curators visited with GRC curators at the Wellcome Trust Sanger Institute in January 2016 to undergo training in the use of GRC tools and processes. ZFIN anticipates taking over responsibility after the next planned release of the zebrafish assembly, GRCz11. A specific timeline for this release has not yet been established, but is likely to be towards the end of 2016. ZFIN curators will continue training through 2016 with GRC curators on zebrafish genome issues.

ZFIN will respond to potential problems in the assembly that are identified by researchers. Users can report issues with the assembly via the Report an Issue link at the GRC website. There are currently no plans to actively identify problems in the assembly. Once problems are brought to the attention of ZFIN GRC curators, updates will be issued as patches.

The GRC welcomes your feedback on this update.

Monday, April 18, 2016

Chicken assembly curation at GRC

Upon the release of Gallus_gallus-5.0 (GCA_000002315.3), the GRC assumed responsibility for the continued curation of the chicken reference genome assembly from the International Chicken Genome Consortium. The assembly represents the Red Jungle Fowl strain, inbred line UCD001. All sequences in the assembly are derived from a single individual from this line, female "RJF #256". The assembly is a hybrid comprised primarily of WGS contigs, into which genomic clones have been integrated. Planned curation efforts are focused on selecting genomic clones (BACs) for sequencing and assembly, specifically from the CHORI-261 library, to fill known gaps and resolve assembly errors.
Figure 1. Ideogram of Gallus_gallus-5.0 (GCA_000002315.3)

Information about the current assembly and curation efforts is now available on the GRC website. An interactive interface (Figure 2) provides access to genome regions currently under review, while a series of issue pages provide details and mapping information for each of these regions, as well as a graphical view (Figure 3). For more information on these pages, see our previous blog posts on issue pages and the issue overview page.

Figure 2: Overview of issues reported on the chicken reference genome assembly.

Figure 3: Example of page with issue-specific details
The GRC welcomes feedback on the current assembly from members of the chicken research community. Users can either Report an Issue or Contact Us for more information about the assembly. A survey regarding the usability of the GRC website is also ongoing.

Thursday, April 7, 2016

Updates to DUX4 region on subtelomeric chromosome 4q

Sequence improvements and increased variant representation in the human genome reference assembly are priorities for the GRC. This blog post describes two updates in the recent GRCh38.p7 patch release affecting the DUX4 region located at the chromosome 4q sub-telomere: (1) a FIX patch correcting the chromosomal representation (KQ983257.1) and (2) a NOVEL patch representing a variant of the region (KQ983258.1). A third haplotype of the DUX4 region, which is represented in GRCh37, is also described.

The DUX4 region contains tandem arrays of a 3.3 Kb D4Z4 macrosatellite repeat located in the sub-telomeric region of chromosome 4q. The number of D4Z4 repeats is highly polymorphic, ranging from 8 to 100 copies in healthy individuals, but only 1 to 10 units in individuals with Facioscapulohumeral muscular dystrophy-1 (FSHD1: MIM#158900). The contraction of the repeat arrays is believed to result in a decreased epigenetic repression effect of D4Z4 and subsequent transcriptional activation of the DUX4 gene. Three main haplotypes, known as 4qA, 4qB and 4qA-L, have been reported for this region (1-3). The 4qB haplotype has homology to 4qA in the D4Z4 repeats, but is completely different in the distal region (1,2). 4qA, considered the reference haplotype, is the ancestral and most common haplotype. 4qA and 4qA-L are associated with FSHD, while 4qB is not (2). The different haplotypes exhibit population stratification (3).
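
The copy-number ranges quoted above overlap at 8 to 10 units, which is easy to miss in prose. The short Python sketch below simply encodes the two ranges and the 3.3 Kb unit size as stated in this post. The function names are hypothetical, and the output illustrates the quoted numbers only; it is not a diagnostic rule.

```python
D4Z4_UNIT_KB = 3.3          # repeat unit size quoted in the text
HEALTHY_RANGE = (8, 100)    # copies reported in healthy individuals
FSHD1_RANGE = (1, 10)       # units reported in individuals with FSHD1


def ranges_containing(copies):
    """Return the quoted range(s) that a given copy number falls into."""
    labels = []
    if HEALTHY_RANGE[0] <= copies <= HEALTHY_RANGE[1]:
        labels.append("healthy range")
    if FSHD1_RANGE[0] <= copies <= FSHD1_RANGE[1]:
        labels.append("FSHD1 range")
    return labels


def array_length_kb(copies):
    """Approximate length of the repeat array, ignoring flanking sequence."""
    return round(copies * D4Z4_UNIT_KB, 1)


for n in (3, 9, 14):
    print(n, ranges_containing(n), array_length_kb(n), "Kb")
```
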

The GRC received user feedback that the DUX4 region is incorrectly represented on chromosome 4 of the GRCh38 assembly. The repeat structure in the region complicates both sequencing and assembly. Clones representing both the 4qA and 4qB haplotypes are present in GRCh38, creating a haplotype expansion and accompanying "false" gap at NC_000004.12: 190,123,122 bp, between components AC225782.3 (WI2-3035O22; 4qB) and AC215524.3 (ABC7-42391500H16; 4qA (partial)) (Figure 1, top). In the GRCh38.p7 fix patch scaffold KQ983257.1, the GRC provides a complete representation for the 4qA haplotype by replacing AC225782.3 with CT476828.7 (RP11-242C23) and eliminating the gap (Figure 1, bottom). The patch provides 14 copies of the DUX4 repeat. This fix patch representation for the region will be incorporated into chromosome 4 in the as-yet-unscheduled GRCh39 assembly release, and the 4qB representation currently present on the GRCh38 chromosome will be provided as an alternate loci scaffold.

Figure 1. Top: DUX4 region in GRCh38. The representation of the DUX4 gene in GRCh38 is incomplete due to a mixed-haplotype representation of 4qB and partial 4qA. Bottom: DUX4 fix patch in GRCh38.p7. The gap is closed and a complete representation of the 4qA haplotype is provided.

The GRCh38.p7 release also includes KQ983258.1 as a novel patch scaffold providing representation for the variant 4qA-L haplotype. This variant is very similar to haplotype 4qA (Figure 2, top, green dashed line) but diverges in the distal-most DUX4 copy (Figure 2, top, red dashed line), which is about 1.6 Kb larger in 4qA-L (Figure 2, bottom). The functional implication of this difference is not yet known.
Figure 2. Top: Novel patch representing DUX4 variant 4qA-L. This variant is similar to haplotype 4qA, but the region distal to the last D4Z4 unit is longer in 4qA-L. Bottom: Schematic of D4Z4 repeat arrays for 4qA and 4qA-L regions, adapted from Lemmers et al., 2010.

1. van Geel, M. et al. Genomics 79, 210–217 (2002).
2. Lemmers, RJLF. et al. Nature Genetics 32, 235–236 (2002).
3. Lemmers, RJLF. et al. Science 329, 1650–1653 (2010).

Wednesday, March 9, 2016

GRC reference assembly curation with BioNano maps

Commercial whole genome mapping systems, such as OpGen and BioNano, are playing an increasingly important role in a variety of genomic analyses, including de novo assembly, structural variant detection and assembly curation.

Comparison of a reference assembly to a collection of whole genome maps can help curators find potential regions of misassembly and identify genomic variations that are candidates for representation in alternate loci scaffolds. For example, optical maps played a key role in the GRC's resolution of the human 10q11.22 tiling path, as described in this prior GRC blog post.

To assist with curation efforts, the Wellcome Trust Sanger Institute, a GRC member, has integrated BioNano data sets for human, mouse and zebrafish assemblies into the gEVAL browser. They have also provided data from OpGen maps and optical maps from David Schwartz (PMID: 20534489). Common curation use cases for these browser data include:

  • Gap sizing
  • Confirmation of assembly components
  • Identification of problems in assembly components
  • Identification of assembly regions missing sequence
  • Detection of haplotype over-expansions
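
To make the gap sizing use case concrete, here is a minimal Python sketch of the arithmetic involved. It assumes a genome map gives the distance between two labels that flank an assembly gap, and that the assembly gives the sequence length between each label and the gap edge. The function names and all numbers are hypothetical. This is not a description of how gEVAL or the GRC computes gap sizes.

```python
from statistics import median


def estimate_gap_size(map_distance_bp, left_flank_bp, right_flank_bp):
    """Gap estimate = label-to-label distance on the map minus known sequence.

    A negative value would suggest that the flanking sequences overlap,
    i.e. the assembly may be over-expanded rather than missing sequence.
    """
    return map_distance_bp - (left_flank_bp + right_flank_bp)


def consensus_gap_size(map_distances_bp, left_flank_bp, right_flank_bp):
    """Median estimate across several genome maps spanning the same gap."""
    return median(
        estimate_gap_size(d, left_flank_bp, right_flank_bp)
        for d in map_distances_bp
    )


# Hypothetical measurements from three genome maps across one gap.
print(consensus_gap_size([52_000, 53_000, 51_500], 20_000, 18_000))
```
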

Below is an example highlighting the potential usage.

The issue (HG-172) reports a possible assembly error or missing sequence in reference component AL691432.54 that affects CDC2L1 (GeneID:984) in GRCh37. Navigate to this region in gEVAL and turn on the following tracks:
  • REFSEQ transcript mappings
  • BspQI insilico digest
  • JIRA issue entries
  • Bionano genome maps
    • NA12878
    • NA24143 (Ashkenazim trio mother)
    • NA24149 (Ashkenazim trio father)
    • NA24385 (Ashkenazim trio son)
    • NA24631 (Chinese trio son)
The REFSEQ track shows 4 distinct transcript mapping groups, most of which are coloured orange, indicating that the mapping is incomplete in terms of coverage. This suggests that there may be an issue in the assembly. From left to right, the groups are mib2, mm23A/B, cdc2L1 (cdk11b) and slc35e2.
Jump to this Region in gEVAL
Zooming in on the region of the red features in the Bionano genome maps clearly reveals the fragment sizes between Bionano BspQI nicks/labels. The red features within each map indicate a discordance in size with the BspQI insilico digest.
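
For readers unfamiliar with the idea, an in silico digest can be sketched as a motif search over the assembly sequence. The Python below assumes the BspQI recognition motif is GCTCTTC, searched on both strands. That motif is an assumption stated here, not something given in this post. The function names and the toy sequence are hypothetical. Real pipelines also account for label resolution and nearby sites that cannot be distinguished.

```python
MOTIF = "GCTCTTC"  # assumed BspQI recognition motif


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def label_positions(seq, motif=MOTIF):
    """0-based start positions of the motif on either strand."""
    seq = seq.upper()
    positions = set()
    for m in (motif, revcomp(motif)):
        start = seq.find(m)
        while start != -1:
            positions.add(start)
            start = seq.find(m, start + 1)
    return sorted(positions)


def fragment_sizes(positions):
    """Distances between consecutive labels."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy sequence with two forward-strand sites and one reverse-strand site.
toy = "AAGCTCTTC" + "A" * 20 + "GAAGAGC" + "T" * 10 + "GCTCTTC"
sites = label_positions(toy)
print(sites, fragment_sizes(sites))
```
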

The in silico digest reveals that the region in question on the assembly has a fragment size of ~26.5 kb, whereas all 5 genome maps reveal two fragments totaling ~30 kb. This discordant region encompasses the cdc2L1 gene, which further analysis indicates is missing 2 exons. Together, this provides evidence that the region is not represented correctly and is missing roughly 3-4 kb of sequence.
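
The size comparison described above amounts to a simple subtraction per genome map. The Python sketch below uses the approximate values from this post (~26.5 kb in the assembly, ~30 kb in each of the 5 maps). The function name and the 1 kb tolerance are hypothetical, and the per-map totals are the rounded figures quoted here, not measured values.

```python
def discordant_maps(insilico_bp, map_totals_bp, tolerance_bp=1_000):
    """Return maps whose fragment total differs from the in silico size.

    Values are (map total - in silico size); a positive difference
    suggests sequence missing from the assembly.
    """
    diffs = {name: total - insilico_bp for name, total in map_totals_bp.items()}
    return {name: d for name, d in diffs.items() if abs(d) > tolerance_bp}


# Approximate sizes quoted in the post; the tolerance is an assumption.
maps = {
    "NA12878": 30_000,
    "NA24143": 30_000,
    "NA24149": 30_000,
    "NA24385": 30_000,
    "NA24631": 30_000,
}
print(discordant_maps(26_500, maps))
```

With these rounded inputs every map differs by about 3.5 kb, which agrees with the estimate of roughly 3-4 kb of missing sequence.
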

Upon further review and analysis, the GRC curation team was able to set in motion the addition of new clone sequence to represent this region correctly. The new path is represented in GRCh38 below. Clone FO704657, added to the assembly path, contains the complete coding region for the cdc2L1 (cdk11b) gene. It also corrects the flanking gene mappings, as indicated by the green mappings of the REFSEQ transcripts. The discordant red features of the Bionano genome maps are no longer present, replaced with concordant fragment mappings. The issue can be updated as resolved.
Jump to this Region in GRCh38

For more information on how to use BioNano displays in gEVAL, check out the gEVAL browser blog.

Thursday, February 19, 2015

GRC Website: Individual Genome Issues

Are you looking for the latest status updates from the GRC on the human, mouse or zebrafish reference genome assemblies? As a companion to our previous post, we now explain how to use the Individual Genome Issue reports on the GRC website. As described in our last blog, you can filter and search for issues of interest using the organism-specific "Issues Under Review" pages. To provide an example for this blog post, we applied the following filtering options on the human "Issues Under Review" page: issues location = GRCh38.p2, chromosome = chr3, type = variation and scaffold type = ALT (alternative loci). We then selected HG-1291 from column 1 of the results table to go to the individual issue page, shown below in Figure 1. On this and other individual issue pages, you'll find the following information:
  • Summary fields describing the issue and its latest status updates or resolution (blue box)
  • Ideogram showing the issue's genomic location (green box)
  • Patch and/or alternate loci status and history (orange box)
  • Graphical view of genomic region to which issue is mapped (red box)
    • Note: graphical views are provided for all mapped locations in the previous and current assembly versions. For example, HG-1291 has been mapped to chr. 3 and an alternate locus scaffold in GRCh38.p2, and to chr. 3 and a novel patch in GRCh37.p13. Use the radio buttons to toggle the display between the different sequence locations.
Figure 1. GRC Issue page for HG-1291, with page features highlighted.
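
The filtering used to reach HG-1291 can be thought of as matching records against a set of field values. The Python sketch below illustrates that idea only. The `Issue` class, its field names and the two "EX-" records are hypothetical, and the sketch does not reflect how the GRC website is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    issue_id: str
    assembly: str
    chromosome: str
    issue_type: str
    scaffold_type: str


def filter_issues(issues, **criteria):
    """Keep issues whose fields equal every supplied criterion."""
    return [
        issue for issue in issues
        if all(getattr(issue, field) == value for field, value in criteria.items())
    ]


issues = [
    Issue("HG-1291", "GRCh38.p2", "chr3", "variation", "ALT"),
    Issue("EX-1", "GRCh38.p2", "chr3", "gap", "ALT"),        # hypothetical
    Issue("EX-2", "GRCh38.p2", "chr1", "variation", "ALT"),  # hypothetical
]
selected = filter_issues(
    issues,
    assembly="GRCh38.p2",
    chromosome="chr3",
    issue_type="variation",
    scaffold_type="ALT",
)
print([issue.issue_id for issue in selected])
```
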

Figure 2, below, shows the graphical view of the GRCh38.p2 alternate locus scaffold to which HG-1291 has been mapped (NW_003871060.2). Default tracks in the graphical views provide you with additional information about the assembly composition and quality. They include:

  • Assembly components
  • Alignments of alternate loci/patch scaffolds to the primary assembly
  • Annotated component assembly problems
  • All GRC issues mapped to the region
  • NCBI Gene annotation
  • Ensembl Gene annotation

In this image of HG-1291, review of the Genes and Alignment tracks reveals two exons in a region of the alternate loci that has no alignment to the chromosome (arrow and circle). This annotation supports the description in the Issue Summary fields. You can further configure the tracks or upload your own data files to the graphical view by clicking on the "Configure" button at the top right of the viewer (red box).
Figure 2. Graphical view of NW_003871060.2, the GRCh38.p2 alternate loci scaffold to which issue HG-1291 is mapped. The exons captured by the additional sequence in the scaffold are highlighted.

If you have questions about any of the issues you see, please contact the GRC and reference the issue number. If you know of a genome issue that isn't found on these pages, please report the issue to the GRC.

Tuesday, February 3, 2015

GRC Website Update: Genome Issues Under Review

GRC "Genome Issues under Review" webpage update!

Do you know how to find genome issues on the GRC website? To get started, select an organism from the top of the GRC homepage, and in the corresponding organism overview page select the link for "Issues Under Review". These pages provide you with the latest information about potential problems and other issues related to the human, mouse and zebrafish reference genome assemblies that the GRC are working on. Recent updates to these pages make them more interactive, informative and easier to navigate so you can pinpoint issues relevant to your research interests. Some of the page features are highlighted in Figure 1, which shows "Human Genome Issues".
  • Show issue locations on (blue box): Use this to define the assembly version on which you want to see mapped issues. We support issue mapping to the current assembly and the last release of the prior assembly version.
  • Ideogram (green box): The histogram above presents the number of issues related to each chromosome, and the annotations show issue locations. Looking for issues related to a single chromosome? Click on a chromosome or histogram of interest to see a more detailed ideogram with annotated issues (more on this below).
  • Search (purple box): Use this to find issues related to a specific gene/clone/accession number/chromosomal location.
  • Data table: Provides a summary of issues. Within this table, click on issue ID (brown box) to go to web pages for specific issues or View in browsers (brown box) to see the relevant genome regions in browsers at Ensembl, NCBI, and UCSC.
Figure 1. Human Genome Issues overview
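
The histogram above the ideogram is, in essence, a tally of issues per chromosome. The Python sketch below shows a minimal version of such a tally. The records, field names and field values are hypothetical and do not come from the GRC issue database. The same helper could equally tally by type or status.

```python
from collections import Counter


def tally(issues, field):
    """Count issues by the value of one field."""
    return Counter(issue[field] for issue in issues)


# Hypothetical issue records, for illustration only.
issues = [
    {"id": "EX-1", "chromosome": "chr1", "type": "gap", "status": "open"},
    {"id": "EX-2", "chromosome": "chr1", "type": "variation", "status": "resolved"},
    {"id": "EX-3", "chromosome": "chr3", "type": "variation", "status": "open"},
]

per_chromosome = tally(issues, "chromosome")
for chrom, n in sorted(per_chromosome.items()):
    # A text histogram: one '#' per issue on that chromosome.
    print(f"{chrom:<6}{'#' * n} {n}")
```
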
Additional page features shown below in Figure 2 will help you identify the issues that interest you most:
  • Filter: Located to the left of the data table, this section contains various display filters, including issue type and issue status, to help you find GRC issues meeting specified criteria.
  • Issue Annotations: In the single chromosome ideogram displays, issues are annotated below the figure.
    • Tool-tips: Click on any annotation for a summary and a link to the issue page
    • Bar charts: Click on either of the interactive bar charts below the ideogram to re-categorize the issue annotation display by Type or Status.
Figure 2. Chromosome 1 genome issues
If you have questions about any of the issues you see, please contact the GRC and reference the issue number. If you know of a genome issue that isn't found on these pages, please report the issue to the GRC.