GenomeRef: July 2012

Thursday, July 26, 2012

The GRC and the 10th International Zebrafish Genetics and Development Meeting (June 20-24, 2012 - Madison, Wisconsin)

Members of the GRC attended the 10th International Zebrafish Genetics and Development Meeting in Madison, Wisconsin, to gain insight into current research trends and to provide information to the community. New approaches for identifying and targeting mutations, such as whole genome sequencing and Transcription Activator-Like Effector Nucleases (TALENs), were a recurring theme throughout the meeting, highlighting the importance of improved reference assemblies with accurate sequence. Whole genome sequencing is being used in forward genetic screens to identify mutations, whereas TALENs are being used to generate targeted gene mutations in zebrafish. Both technologies depend on a high-quality reference genome assembly: experimental design requires accurate mapping (for example, oligonucleotide design for TALENs), as does subsequent data analysis (mapping whole genome sequence data to the reference to identify mutations). Discussions held at the meeting indicated that the zebrafish research community would welcome a new assembly.

An interim build, GRCz9b, was produced internally by the GRC prior to the conference to gauge the value added since the Zv9 release in 2010 and to assess the scope of work required for further improvements in preparation for GRCz10. This test assembly was met with high interest at the meeting, including requests for access to the data. Improvements to the assembly include the exclusion of 19 Mb of previously duplicated, over-expanded sequence (Fig. 1 and 2), the addition of 935 new clones, and a 14% increase in scaffold N50. The gene count does not differ significantly from Zv9.
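For readers less familiar with the scaffold N50 statistic quoted above: it is the length L such that scaffolds of length L or longer together contain at least half of the assembled bases. The Python sketch below shows the calculation on made-up scaffold lengths. The function name `n50` and the example numbers are illustrative assumptions and are not taken from Zv9 or GRCz9b.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together account for at least half of the total length.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (arbitrary units), before and after curation.
before = [80, 70, 50, 40, 30, 20, 10]
after = [100, 90, 50, 40, 20]

n50_before = n50(before)
n50_after = n50(after)
percent_change = 100 * (n50_after - n50_before) / n50_before
print(n50_before, n50_after, round(percent_change, 1))
```

Joining scaffolds that were previously separate, as in Fig. 1, is one way this statistic rises, because the same bases end up in fewer, longer scaffolds.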

Fig. 1: A PGPviewer screenshot from the Zv9 and GRCz9b assemblies, showing localisation of a previously unlocalised scaffold. In Zv9, FP236471 was not localised to a chromosome and remained on a single scaffold (Zv9_scaffold3539), whereas FP003601 was localised to chromosome 4. The red block at the end of FP236471 in Zv9 indicates sequence similarity between the two clones. In GRCz9b, the two contigs have been joined by an overlap with a perfect sequence alignment, indicated by the green clone overlap block at the bottom of the screenshot. The feature tracks have been kept to a minimum for illustration purposes.

Fig. 2: A PGPviewer screenshot from the Zv9 and GRCz9b assemblies, showing the correction of an over-expanded region in Zv9. In Zv9, CU571256 and BX927241 reside next to each other with no overlap. However, the red blocks show a potential alignment between them and the end sequence alignments also indicate an over-expansion. In GRCz9b, the clones overlap with a highly variable alignment due to the haplotypic nature of the two libraries used, as highlighted by the red clone overlap block at the bottom of the screenshot. The feature tracks have been kept to a minimum for illustration purposes.

To complement existing genome curation, approaches under way include increasing the coverage of SATMAP, the high-density meiotic map used to allocate all genomic contigs, and sequencing more than 1000 genomic clones to fill gaps and cover genes that are still missing. To ensure effective use of the new clone sequence, GRCz10 will, unlike its predecessor, contain unfinished sequence at HTGS phase 2 (ordered and oriented contigs). The GRC are planning to release GRCz10 in 2013.

Friday, July 6, 2012

Hidden assembly problems exposed

The human reference genome GRCh37 represents the highest-quality mammalian genome assembly ever produced. It's played a major role in advancing both basic and clinical research, and it continues to teach us that there's much we still don't know about human genomic biology. However, it is important to keep in mind that the assembly isn't perfect. While many users may be aware of some of the more visible issues with the assembly, such as gaps or missing genes, there are other assembly problems that may be less apparent. As noted in a previous GRC blog post, erroneous bases are one such problem. While the reference assembly is accurate to an error rate of ~1 in 100,000 bases, this still means that approximately 28,000 of the 2.85 billion bases are inaccurate. Using data from sources such as the 1000 Genomes Project, the GRC is working to address these sequencing errors for GRCh38, the next genome version that will be publicly released.
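The error estimate above is simple arithmetic on the two figures quoted in the text (about one error per 100,000 bases, across 2.85 billion bases). The short Python sketch below reproduces it; the variable names are our own.

```python
# Figures quoted in the post; both are approximate.
BASES_PER_ERROR = 100_000          # roughly one erroneous base per 100,000
ASSEMBLY_BASES = 2_850_000_000     # roughly 2.85 billion bases

# Integer division avoids floating-point rounding noise.
expected_errors = ASSEMBLY_BASES // BASES_PER_ERROR
print(expected_errors)
```

The result, 28,500, is consistent with the approximate figure given above.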

Component assembly problems represent another less-recognized source of error in the reference genome. The human reference genome is a clone-based assembly, sequenced using Sanger technology. Generating the consensus sequence for each of these genomic clone components involved sub-cloning into plasmid vectors, sequencing and reassembly. During the course of the human genome project, many of the sequencing centers recorded problems encountered during the sequencing of these clones as annotations on the sequence records. Colloquially known as "black-tag" annotations, these represent regions of uncertainty due to force joins, unresolved tandem repeats, low-quality or low-coverage sequence, and other biological and technical challenges. Although these annotations are part of public sequence records, they have historically been difficult to interpret in the context of the genome assembly because their coordinates were component-based rather than scaffold- or chromosome-based.
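To illustrate why component-based coordinates are awkward to use directly, the Python sketch below converts a position on a clone component into a position on the chromosome, given where (and in which orientation) the component was placed. This is a minimal sketch under stated assumptions: the `Placement` class, its field names and the example numbers are hypothetical, coordinates are treated as 1-based and inclusive, and this is not a description of the GRC's own pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Placement:
    """Hypothetical record of how one component contributes to a chromosome."""
    chrom_start: int   # first chromosome base covered (1-based, inclusive)
    chrom_end: int     # last chromosome base covered (inclusive)
    comp_start: int    # first component base used in the assembly
    comp_end: int      # last component base used in the assembly
    orientation: str   # '+' or '-'


def component_to_chromosome(pos: int, p: Placement) -> Optional[int]:
    """Map a component position to a chromosome position.

    Returns None when the position lies in a part of the component
    that does not contribute to the assembled chromosome.
    """
    if not (p.comp_start <= pos <= p.comp_end):
        return None
    offset = pos - p.comp_start
    if p.orientation == '+':
        return p.chrom_start + offset
    # Reverse-complemented component: count back from the chromosome end.
    return p.chrom_end - offset


# Made-up example: bases 101-1100 of a component cover chromosome bases 1001-2000.
forward = Placement(1001, 2000, 101, 1100, '+')
reverse = Placement(1001, 2000, 101, 1100, '-')

print(component_to_chromosome(101, forward))   # start of the used region, forward
print(component_to_chromosome(101, reverse))   # same base, reversed placement
print(component_to_chromosome(50, forward))    # outside the used region
```

A problem annotated in a part of the component that was not used in the assembly maps to nothing on the chromosome, which is why the `None` case matters.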

The GRC has now mapped the clone assembly problems annotated on assembly components to the coordinates of the top-level molecules in which they are found (scaffold or chromosome) in the human, mouse and zebrafish reference assemblies. These data are available in GFF3 and ASN.1 format on the GRC's public FTP site. These files allow users to see whether problems have been annotated at their coordinates of interest, or whether the locations of other genome features, such as segmental duplications or structural variation, fall in annotated problem regions and potentially represent false calls. Additionally, the GRC is cross-checking reports of genome problems with these coordinates in an effort to understand the underlying cause of these issues (Figure 1). It should be noted, however, that not all genomic components were equivalently annotated during the course of the human genome project. Thus, the absence of an annotation in these files does not mean that no problem was encountered during component sequencing, just that none was noted in its record. We encourage users to review the interpretations of their own assembly annotations in light of these data and we welcome your feedback!
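As a rough illustration of the kind of check described above, the Python sketch below scans GFF3 lines for annotated features that overlap a region of interest. It relies only on the standard GFF3 layout (nine tab-separated columns, with 1-based inclusive start and end in columns 4 and 5). The sample records, sequence names, feature type and `Note` attributes are invented for the example and do not reflect the contents of the GRC files.

```python
def parse_gff3(lines):
    """Yield (seqid, start, end, attributes) for each GFF3 feature line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        yield cols[0], int(cols[3]), int(cols[4]), cols[8]


def problems_overlapping(lines, seqid, start, end):
    """Return features on `seqid` overlapping the inclusive range start..end."""
    return [
        feat for feat in parse_gff3(lines)
        if feat[0] == seqid and feat[1] <= end and feat[2] >= start
    ]


# Invented records standing in for a downloaded file.
SAMPLE = [
    "##gff-version 3",
    "\t".join(["chrA", "example", "region", "100", "200", ".", "+", ".",
               "Note=hypothetical problem 1"]),
    "\t".join(["chrA", "example", "region", "500", "600", ".", "+", ".",
               "Note=hypothetical problem 2"]),
    "\t".join(["chrB", "example", "region", "150", "160", ".", "+", ".",
               "Note=hypothetical problem 3"]),
]

for hit in problems_overlapping(SAMPLE, "chrA", 180, 520):
    print(hit)
```

With a real file, `SAMPLE` would be replaced by an open file handle. Keep in mind the caveat above: an empty result means only that no problem was annotated in the region, not that none exists.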


Figure 1. Graphical image of a region on GRCh37 chr. 15 (NC_000015.9; 99.63-99.65 Mb). The blue bar at top represents the component from which the chromosome sequence is derived at this location (AC036108.19 (RP11-6O2)), while an NCBI gene annotated in this vicinity is shown just below (green bar). At bottom, the small vertical black tick marks show the positions of clone assembly problems that were annotated on the component sequence. The GRC received two separate reports of assembly problems in this region (HG-971 and HG-352). Review shows that both correspond to the positions of annotations indicating problems in the assembly component. Both genome errors have been corrected by the addition of a new component to the assembly and will be released as FIX patches in GRCh37.p9.