Thursday, July 26, 2012

The GRC and the 10th International Zebrafish Genetics and Development Meeting (June 20-24, 2012 - Madison, Wisconsin)

Members from the GRC attended the 10th International Zebrafish Genetics and Development meeting in Madison, Wisconsin, to gain insight into trends of current research and provide information to the community. New approaches for the identification and targeting of mutations (e.g. Whole Genome Sequencing and the use of Transcription Activator-Like Effector Nucleases (TALENs)) were a recurring theme throughout the meeting, highlighting the importance of improved reference assemblies with accurate sequence. Whole genome sequencing is being used in forward genetic screens for the identification of mutations, whereas TALENs are being used to generate targeted gene mutations in zebrafish. A high quality reference genome assembly is of utmost importance to the success of both of these technologies to allow accurate mapping for experimental design (oligonucleotide design for TALENs) and subsequent data analysis (mapping whole genome sequence data to the reference for mutation identification). Discussions held at the meeting indicated that the zebrafish research community would welcome a new assembly.

An interim build, GRCz9b, was produced internally by the GRC prior to the conference to gauge the value added since the Zv9 release in 2010 and to assess the scope of work required for further improvements in preparation for GRCz10. This test assembly was met with high interest at the meeting, including requests for access to the data. Improvements to the assembly include the exclusion of 19 Mb of previously duplicated, over-expanded sequence (Figs. 1 and 2), the addition of 935 new clones, and a 14% increase in scaffold N50. The gene count does not differ significantly from Zv9.
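For readers less familiar with the scaffold N50 statistic quoted above, here is a minimal Python sketch of how it can be computed and compared between two builds. The scaffold lengths are invented toy values, not Zv9 or GRCz9b data, and the function name is our own.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (bp) for two toy assemblies.
old_scaffolds = [100, 80, 60, 40, 20]
new_scaffolds = [140, 100, 40, 20]

old_n50 = n50(old_scaffolds)
new_n50 = n50(new_scaffolds)
percent_change = 100.0 * (new_n50 - old_n50) / old_n50
print(old_n50, new_n50, percent_change)
```

Joining scaffolds, as in the overlap shown in Fig. 1, tends to raise N50 even when the total amount of sequence stays the same or shrinks.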

Fig. 1: A PGPviewer screenshot from the Zv9 and GRCz9b assemblies, showing localisation of a previously unlocalised scaffold. In Zv9, FP236471 was not localised to a chromosome and remained on a single scaffold (Scaffold Zv9_scaffold3539), whereas FP003601 was located to chromosome 4. The red block at the end of FP236471, in Zv9, indicates sequence similarity between the two clones. In GRCz9b, an overlap has been made, adjoining the two contigs, with a perfect sequence alignment, indicated by the green clone overlap block at the bottom of the screenshot. The feature tracks have been kept to a minimum for illustration purposes.

Fig. 2: A PGPviewer screenshot from the Zv9 and GRCz9b assemblies, showing the correction of an over-expanded region in Zv9. In Zv9, CU571256 and BX927241 reside next to each other with no overlap. However, the red blocks show a potential alignment between them and the end sequence alignments also indicate an over-expansion. In GRCz9b, the clones overlap with a highly variable alignment due to the haplotypic nature of the two libraries used, as highlighted by the red clone overlap block at the bottom of the screenshot. The feature tracks have been kept to a minimum for illustration purposes.

Approaches underway, to complement existing genome curation, include increasing the coverage of SATMAP, the high-density meiotic map used to allocate all genomic contigs, along with the sequencing of more than 1000 genomic clones to fill gaps and cover those genes still missing. To ensure effective utilization of the new clone sequence, GRCz10 will contain unfinished sequence with HTGS phase 2 (ordered and oriented contigs), unlike its predecessor. The GRC are planning to release GRCz10 in 2013.

Friday, July 6, 2012

Hidden assembly problems exposed

The human reference genome GRCh37 represents the highest quality mammalian genome assembly ever to be produced. It's played a major role in advancing both basic research and clinical research, and it continues to teach us that there's much we still don't know about human genomic biology. However, it is important to keep in mind that the assembly isn't perfect. While many users may be aware of some of the more visible issues with the assembly, such as gaps or missing genes, there are other assembly problems that may be less apparent. As noted in a previous GRC blog post, erroneous bases are one such problem. While the reference assembly is accurate to an error rate of ~1 in 100,000 bases, this still means that approximately 28,000 of the 2.85 billion bases are inaccurate. Using data from sources such as the 1000 Genomes Project, the GRC is working to address these sequencing errors for GRCh38, the next genome version that will be publicly released.
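The arithmetic behind that estimate can be sketched in a couple of lines of Python. The two constants come straight from the paragraph above; the variable names are ours.

```python
GENOME_BASES = 2_850_000_000  # 2.85 billion bases, as quoted above
BASES_PER_ERROR = 100_000     # ~1 error per 100,000 bases

# Integer division keeps the result exact.
expected_errors = GENOME_BASES // BASES_PER_ERROR
print(f"Expected erroneous bases: {expected_errors:,}")
```

The result is in line with the approximate figure quoted above. The error rate itself is only an approximation.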

Component assembly problems represent another less-recognized source of error in the reference genome. The human reference genome is a clone-based assembly, sequenced using Sanger technology. Generating the consensus sequence for each of these genomic clone components involved their sub-cloning into plasmid vectors, sequencing and reassembly. During the course of the human genome project, many of the sequencing centers recorded problems encountered during the sequencing of these clones as annotations on the sequence records. Colloquially known as "black-tag" annotations, these represent regions of uncertainty due to forced joins, unresolved tandem repeats, low-quality or low-coverage sequence, and other biological and technical challenges. Although these annotations are part of public sequence records, they have historically been difficult to interpret in the context of the genome assembly because their coordinates were component-based, rather than scaffold- or chromosome-based.

The GRC has now mapped the clone assembly problems annotated on assembly components to the coordinates of the top-level molecules in which they are found (scaffold or chromosome) in the human, mouse and zebrafish reference assemblies. These data are available in GFF3 and ASN.1 format on the GRC's public FTP site. These files allow users to see whether problems have been annotated at their coordinates of interest, or whether the locations of other genome features, such as segmental duplications or structural variation, fall in annotated problem regions and potentially represent false calls. Additionally, the GRC is cross-checking reports of genome problems with these coordinates in an effort to understand the underlying cause of these issues (Figure 1). It should be noted, however, that not all genomic components were equivalently annotated during the course of the human genome project. Thus, the absence of an annotation in these files does not mean that no problem was encountered during component sequencing, just that none was noted in its record. We encourage users to review the interpretations of their own assembly annotations in light of these data and we welcome your feedback!
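To illustrate what moving an annotation from component coordinates to top-level coordinates involves, here is a minimal Python sketch. It is not the GRC's pipeline. The `Placement` record, its field names and the example coordinates are all hypothetical, and positions are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where a slice of a component sits on a top-level molecule."""
    chrom: str
    chrom_start: int
    chrom_end: int
    comp_start: int
    comp_end: int
    strand: str  # "+" or "-"


def lift_to_chromosome(placement, comp_pos):
    """Map a component position to a chromosome position.

    Returns None if the position lies outside the part of the
    component that is actually used in the assembly.
    """
    if not placement.comp_start <= comp_pos <= placement.comp_end:
        return None
    offset = comp_pos - placement.comp_start
    if placement.strand == "+":
        return placement.chrom_start + offset
    # Minus-strand components run backwards along the chromosome.
    return placement.chrom_end - offset


def overlaps(start_a, end_a, start_b, end_b):
    """True if two inclusive intervals share at least one base."""
    return start_a <= end_b and start_b <= end_a


# Hypothetical placements, not taken from any real assembly.
forward = Placement("chrN", 5000, 5999, 1, 1000, "+")
reverse = Placement("chrN", 1000, 1999, 101, 1100, "-")

print(lift_to_chromosome(forward, 250))
print(lift_to_chromosome(reverse, 101))
print(lift_to_chromosome(reverse, 50))
```

Once annotations are in chromosome coordinates, a simple interval test like `overlaps` is enough to ask whether a region of interest contains an annotated problem. As noted above, a negative answer does not guarantee that the underlying component was problem-free.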

Figure 1. Graphical image of a region on GRCh37 chr. 15 (NC_000015.9; 99.63-99.65 Mb). The blue bar at top represents the component from which the chromosome sequence is derived at this location (AC036108.19 (RP11-6O2)), while an NCBI gene annotated in this vicinity is shown just below (green bar). At bottom, the small vertical black tick marks show the positions of clone assembly problems that were annotated on the component sequence. The GRC received two separate reports of assembly problems in this region (HG-971 and HG-352). Review shows that both correspond to the positions of annotations indicating problems in the assembly component. Both genome errors have been corrected by the addition of a new component to the assembly and will be released as FIX patches in GRCh37.p9.

Tuesday, May 29, 2012

Updating the Assembly: What's in a base? (Part 2)

Deciding how the GRC should address rare bases, as well as common bases that result in non-functional alleles (a.k.a. polymorphic pseudogenes), in the reference assembly is substantially more complex than dealing with erroneous bases. It is made even more so by the wide range of opinions held by assembly users on which allele the reference should represent, which include:
  • Most common allele
  • Ancestral allele
  • Coding allele
All views have merit, and the opinion held by any genome user is likely to be colored by their own research needs, which may include short read mapping, variation analysis and clinical testing, among others. Regardless of view, it is important to recognize that updating bases in isolation to meet any of these goals runs the risk of creating false haplotypes, that is, haplotypes not observed in any individual. Thus, further analyses are needed to investigate the possible mechanisms by which any such changes can be made.

The GRC currently favors a model in which haplotypic integrity is retained within blocks of linkage disequilibrium (LD) as far as possible, every base is found at an MAF >5% in some population (i.e. no universally rare alleles) and coding alleles are favored over non-coding alleles, so long as they too are not universally rare. However, additional analyses will be performed before any base changes are made. Examples of genomic regions where the existing reference base is associated with disease (ASPN, PMID:15640800) or non-coding variants (CYP3A5, PMID:11279519) are presented below in Figures 1 and 2. In the former case, the reference base is the minor allele, while in the latter, it is the major allele. We invite you to consider examples such as these as you form your own views of what should be represented in the reference assembly. If you have questions or concerns about base updates for GRCh38, let us know!
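As a thought experiment, two of the criteria in the model above (no universally rare alleles, and a preference for coding alleles) can be written down as a small Python sketch. This is not the GRC's procedure. It deliberately ignores haplotype integrity within LD blocks, the tie-break by highest population frequency is our own assumption, and the allele frequencies are invented.

```python
MAF_THRESHOLD = 0.05


def is_universally_rare(freq_by_population, threshold=MAF_THRESHOLD):
    """True if the allele never exceeds the threshold in any population."""
    return all(freq <= threshold for freq in freq_by_population.values())


def choose_allele(candidates):
    """Pick an allele from a list of candidate dicts.

    Each candidate has 'allele', 'coding' (bool) and 'freqs'
    (population name -> allele frequency). Universally rare alleles
    are dropped; coding alleles are preferred; remaining ties go to
    the allele with the highest frequency in any one population
    (an assumption made for this sketch).
    """
    eligible = [c for c in candidates if not is_universally_rare(c["freqs"])]
    if not eligible:
        return None
    eligible.sort(key=lambda c: (not c["coding"], -max(c["freqs"].values())))
    return eligible[0]["allele"]


# Hypothetical site: the non-coding allele is more common in one
# population, but the coding allele is not universally rare.
site = [
    {"allele": "A", "coding": False, "freqs": {"pop1": 0.70, "pop2": 0.40}},
    {"allele": "G", "coding": True, "freqs": {"pop1": 0.30, "pop2": 0.60}},
]
print(choose_allele(site))
```

Applying a rule like this one site at a time is exactly what risks creating false haplotypes, which is why any real update would need to be evaluated in the context of the surrounding LD block.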

Fig. 1. Zoomed-in graphical view of the ASPN gene in GRCh37. The assembly sequence is shown at top. The ASPN gene is shown in green, and alignments of the corresponding RefSeq transcripts are in grey. The thin red line in the alignments corresponds to a 3 nt indel (TCA). The reference insertion creates an additional aspartic acid in a run of aspartic acids (red box). The reference allele (D14) is a minor allele (MAF between 0.05 and 0.10 in various populations) and is associated with osteoarthritis susceptibility. Other clone-based sequences exist that contain the more common, non-disease-associated allele.

Fig. 2. Zoomed-in graphical view of the CYP3A5 gene in GRCh37. The assembly sequence is shown at top. The CYP3A5 gene is shown in green. The highlighted base in the GRCh37 reference assembly represents the major allele in many populations. This allele creates a cryptic splice site that disrupts the reading frame of CYP3A5 and results in a non-coding transcript. However, in other populations, the coding allele is the major allele.

Friday, May 18, 2012

Updating the assembly: What's in a base? (Part 1)

The human genome reference assembly is the highest quality mammalian assembly to have ever been produced. As reported in the summary paper for the human genome sequencing project (PMID:15496913), the assembly covers ~99% of the euchromatic genome and is accurate to an error rate of ~1 per 100,000 bases. With recent advances in technology driving down the cost of sequencing for individual labs, as well as the efforts of large consortia like the 1000 Genomes Project, there has been an explosion in the numbers of human genomes sequenced, providing not only more sequence representation for our species, but a means to look at human genetic variation. The availability of this sequence now presents the GRC the opportunity to identify and address incorrect and rare bases in the reference assembly.

An analysis of genotyped bases from the phase 1 data of the 1000 Genomes Project identified approximately 27,000 GRCh37 reference bases that were never observed in any of the sequenced individuals, suggesting that they may be sequencing errors. Notably, this number is consistent with the expected error rate for a finished genome comprising 2.85 billion bases. At an additional ~650 sites, the reference allele exhibits a minor allele frequency (MAF) <5% across all combined populations, implying that the assembly contains rare alleles at these positions. However, it is important to note that distinguishing erroneous bases from rare bases is not a trivial task. Even with the large number of individuals sequenced in the 1000 Genomes Project, it is to be expected that some of the “erroneous” bases will be reclassified as “rare” if additional individuals or populations are sequenced.
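The distinction drawn above between "never observed" and "rare" reference bases can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the analysis the GRC ran. The function name, the category labels and the allele counts are hypothetical, and real data would also need the read depth and mapping quality checks.

```python
def classify_reference_allele(ref_counts, total_counts, maf_threshold=0.05):
    """Classify a reference allele from per-population allele counts.

    ref_counts and total_counts map population name -> number of
    chromosomes carrying the reference allele / number genotyped.
    Frequencies are computed across all populations combined.
    """
    ref = sum(ref_counts.values())
    total = sum(total_counts.values())
    if total == 0:
        raise ValueError("no genotyped chromosomes at this site")
    if ref == 0:
        return "never observed (candidate error)"
    if ref / total < maf_threshold:
        return "rare"
    return "common"


# Hypothetical counts for three sites across two populations.
totals = {"pop1": 400, "pop2": 600}
print(classify_reference_allele({"pop1": 0, "pop2": 0}, totals))
print(classify_reference_allele({"pop1": 3, "pop2": 7}, totals))
print(classify_reference_allele({"pop1": 150, "pop2": 90}, totals))
```

A site in the first category moves into the second as soon as a single carrier is sequenced, which is why the boundary between "erroneous" and "rare" depends on how many individuals and populations have been sampled.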

The GRC is reviewing these sites with respect to read depth, mapping quality and, where feasible, additional sequence analysis, with the aim of correcting confirmed erroneous bases. Special attention is being given to erroneous bases reported to affect coding sequences (approximately 300 instances) and, by extension, gene annotation. More than 20 of these coding errors have already been corrected by the GRC, and they are being released publicly as FIX patches prior to the release of GRCh38. One such correction (shown below in Fig. 1: SLC46A1, a gene in which mutations result in hereditary folate malabsorption disease) was achieved by changing the switch point positions of existing assembly components. Providing error-free gene models in the reference assembly is a GRC priority, as it should improve the ability of clinical geneticists to use the reference assembly as a model for review of test results.

The GRC is hard at work on these base corrections. If you have concerns or questions about this process, let us know!

Next week: Dealing with "rare" bases in the reference assembly.

Fig. 1. Graphical view of the region on GRCh37 encompassing SLC46A1. The blue bars represent the underlying assembly components. The gene is shown in green, and alignments of the corresponding RefSeq transcripts and FIX patch are shown below in grey. The red marks in the alignments correspond to mismatches. The positions of clinical and cited variants of this gene are shown in purple and blue boxes, respectively. In GRCh37, a 1 nt indel in the underlying component sequence resulted in a non-functional SLC46A1 representation in the assembly. The switch point between the components has now been changed, excluding the indel and resulting in a functional SLC46A1. This change is represented in the FIX patch JH159145.1.

Friday, May 11, 2012

Updating the genome: correcting the assembly of 10q11.22

Human GRCh37 patch release 8 contains an update to previously released fix patch HG1211_PATCH.

The patch encompasses a 3 Mb region in GRCh37, chr10:46,256,855-49,299,273.

The tile path in the 10q11.22 region has been extensively altered from its previously fragmented state to one where a single gap remains, between BX649215.1 and AC245041.3. The reworking of the tile path in the region has been carried out using clones in the existing build and additional finished clones not previously in GRCh37.

Working with optical map data provided by the Schwartz Lab, we have been able to identify errors in the GRCh37 assembly and have consequently worked to correct them. The optical map analysis also highlighted redundancy in the assembly causing artificial duplication, which has now been addressed within this patch.
Above: Optical map consensus alignments to GRCh37 10q11.22.
Below: Optical map consensus alignments to the fix patch (JH591181.2)
Legend: Pink track: Clone path; Green: Contig gap; Blue: In silico SwaI fragments.
For the aligned optical map consensus: Gold: Concordant fragment; Red: Missing fragment (seen where the OM consensus spans a gap); Grey: Unaligned fragment.

The optical map information was consistent with a path problem in this region. The map data suggested that several clones in the region were misplaced and did not represent a valid chromosome structure in this region. In addition to rearranging several clones (including changing the orientation of some clones in the path), 3 finished clones were added to the path and several redundant clones were removed. The new path contains a single gap that we estimate, based on optical mapping, to be about 90 Kb. The figure below shows an alignment of the patch sequence to the current chr10 assembly.
The panel to the left shows an overview of chr. 10. The orange dots represent fix patches we've released and the blue dots represent novel patches. The arrow shows the location of the 10q11.22 fix patch. To the right, the top panel shows the chr. 10 tiling path (in grey), the annotated RefSeq genes are below that (in green) and the alignment to the fix patch below that (in purple). The bottom panel shows the patch tiling path and alignment to the chromosome.
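The "in silico SwaI fragments" track in the optical map figures is the set of fragment sizes you would expect if the assembled sequence were cut at every SwaI site. A minimal Python sketch of such a digest is shown below. The post does not give the recognition sequence. The sketch uses SwaI's known site, ATTTAAAT, cut bluntly in the middle. The function name and the toy sequence are ours.

```python
SWAI_SITE = "ATTTAAAT"
SWAI_CUT_OFFSET = 4  # blunt cut between ATTT and AAAT


def in_silico_digest(sequence, site=SWAI_SITE, cut_offset=SWAI_CUT_OFFSET):
    """Return the fragment lengths produced by cutting at every site."""
    sequence = sequence.upper()
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        # Step by one so that overlapping sites are also found.
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


# Toy sequence with two SwaI sites; a real analysis would read the
# chromosome or patch sequence from a FASTA file instead.
toy = "CCCC" + "ATTTAAAT" + "GGGGGG" + "ATTTAAAT" + "CC"
fragments = in_silico_digest(toy)
print(fragments)
```

Comparing the ordered list of predicted fragment sizes with the fragment sizes measured in the optical map is what reveals missing, extra or misordered sequence in the clone path.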

Friday, May 4, 2012

Filling in the gaps to better understand human biology

Duplicated segments pose serious problems for the assembly and annotation of the human genome. In the human reference genome there are still large gaps that require specialized efforts to fill. Many of these gaps lie within highly duplicated segments in which the degree of sequence variation among duplicated loci approaches levels of allelic variation. Many people assume that much of the sequence that is still missing from the reference assembly is not very biologically interesting. However, it has become increasingly apparent that the segmental duplications themselves provide the molecular basis for many human genetic disorders. The resolution of these regions is therefore essential for a complete understanding of the genetic basis of human disease. 

Three patches released in GRCh37.p8, which together add almost 400 Kb of novel sequence, demonstrate that sequence so far missing from the reference genome can be of crucial importance. The biological story surrounding these sequences can be found in a recent publication from the Eichler lab (Dennis et al., 2012), but here we'll tell you a little bit about how we worked with the Eichler lab to create these assembly patches.

Figure 1: Ancestral copy of SRGAP2 in chimpanzee (left) and human (right). The other red ticks on the human chromosome show the human-specific duplications added by this effort.

One of the impediments to resolving the complexity of these regions is the diploid nature of the human genome. We recently took advantage of a haploid BAC library resource (CHORI-17) from hydatidiform mole DNA to close gaps and resolve the genomic structure of segmental duplications encompassing highly identical paralogs of SRGAP2, a gene important in cortex development. Hydatidiform moles are conception abnormalities that most often arise from the fertilization of an enucleated ovum by a single X-bearing sperm. Subsequent diploidization results in a 46,XX karyotype in which all allelic variation has been eliminated, allowing the unambiguous delineation of duplicated DNA as well as haplotype characterization. Our SRGAP2 sequencing efforts resolved the sequence and structure of 4 copies of the gene on human chromosome 1, three of which represent human-specific duplicate truncations of the original ancestral gene.

Overall, we added >380 kbp of new sequence previously absent from the human reference genome, including 40 kbp within the conserved ancestral copy of the gene. Additionally, we discovered ~560 kbp of sequence mapped incorrectly either in orientation or position. This region in GRCh37 contained 15 gaps; in the new sequence patches, only two gaps remain. Combined, we generated or corrected 0.4% of human chromosome 1 euchromatic sequence. The sequencing of these genes has made it possible to explore the function of the human-specific duplicate copies, particularly their role in neurological traits and disorders unique to humans.
SRGAP2A (1q32 region): JH636054.1
SRGAP2B,D (1q21 region): JH636052.1
SRGAP2C (1p12 region): JH636053.1

Figure 2: View of SRGAP2 gene family on chromosome 1 ideogram (with 1q on the left). The arrows show the order and direction of duplication with the estimated time (in millions of years ago) below that. (Dennis et al., 2012)

Friday, April 27, 2012

Updating the Human Reference Assembly, part 1

Talking about updating the human reference assembly, currently GRCh37 (hg19), can elicit groans and howls of protest from genome scientists who have put considerable effort into analyzing a given data set against the reference assembly. To address this concern, we introduced the notion of a 'Genome Patch'; that is, scaffold-sized sequences that either add additional sequence representation (NOVEL patches) or fix existing problems in the current reference assembly (FIX patches). In this way, we can make our best representation of the assembly available without disrupting the reference chromosome coordinates. We are at our eighth patch release (GRCh37.p8) and we now have 69 FIX patches and 71 NOVEL patches.

It is the FIX patches we'd like to consider right now. While the patch scaffolds are easy enough to use if you are interested in a single region, most analysis pipelines have not incorporated these sequences and the improved data remains largely unused for whole genome or exome analysis. It is worth noting that NCBI and Ensembl provide gene annotation on many of the patch releases. Doing a major update to the reference assembly (making GRCh38) will allow us to incorporate these FIX patches into the chromosome assembly, making them directly accessible to analysis pipelines.

There are 66 regions (>40.5 Mb) on GRCh37 that are associated with these 69 FIX patches. In addition to other sequence changes that improve the reference assembly, the sequences in these FIX patches provide more than 4.7 Mb of novel sequence. Adding large amounts of novel sequence, like the 2.6 Mb added by JH636052.1 (1q21 region), is impressive; however, novel sequence is not the only metric to consider when evaluating FIX patches. For example, GL383543.1/NW_003315932.1 (described in HG-544) is a FIX patch for the FAM23A_MRC1 region on NC_000010.10 (chr10: 17613209-18252930) and adds no novel sequence to the reference assembly. Instead, it removes roughly 200 Kb of artificially redundant sequence and closes a gap in the assembly. The alignment of the patch to the chromosome is shown in Figure 1 (below).
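Until the FIX patches are folded into the chromosomes, one practical use of the region list is to flag calls that fall inside a region a patch has corrected. A minimal Python sketch is given below. The single region in the table is the FAM23A_MRC1 example from the paragraph above; the table layout, function name and the assumption of 1-based inclusive coordinates are ours, and a real table would list all 66 regions.

```python
# (chromosome accession, start, end, FIX patch accession)
FIX_PATCH_REGIONS = [
    ("NC_000010.10", 17_613_209, 18_252_930, "GL383543.1"),
]


def patches_covering(chrom, pos, regions=FIX_PATCH_REGIONS):
    """Return accessions of FIX patches whose region contains pos."""
    return [
        patch
        for region_chrom, start, end, patch in regions
        if region_chrom == chrom and start <= pos <= end
    ]


# Hypothetical positions of interest on chr10.
for position in (17_000_000, 18_000_000):
    hits = patches_covering("NC_000010.10", position)
    print(position, hits if hits else "no FIX patch region")
```

A hit does not mean the call is wrong, only that the reference sequence at that position has been revised and the call is worth re-examining against the patch scaffold.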

Figure 1: FAM23A_MRC1 region on chromosome 10: The top panel shows chr10 in GRCh37. The blue/black line at the top represents the sequence, the track below that shows the GenBank components used to assemble the chromosome, below that are the NCBI genes, then the alignment of the chromosome to the patch sequence and finally the segmental duplication track. The second panel shows the FIX patch sequence, which has no gap, the genes annotated on the patch and the alignment to the chromosome. The patch removes roughly 200 Kb of artificially redundant sequence (meaning the data in the segmental duplication track is an artifact) and corrects the gene annotation in the region, removing two gene models that represent false gene duplications and don't exist in the population.

The artificial duplication in the assembly not only affects the gene annotation but also has a significant effect on the alignment of short reads, as shown in Figure 2 (below).

Figure 2: Alignments of 1000 Genomes data to the FAM23A_MRC1 region on chromosome 10: The 1000 Genomes data alignments are in the tracks below the orange bar noting where the artificial duplication exists in the reference assembly. Two low-coverage samples (NA19625 and NA19701) are aligned using BWA and Mosaic respectively. In the two Mosaic tracks, there is a visible drop off in alignment depth. This is less pronounced in the BWA alignments. The red coloring indicates mismatches in the alignments.

While it is well recognized that sequences missing from the reference, particularly missing paralogs (see Sudmant et al., 2010 for more information), affect next-generation sequence analysis, it should be noted that artificial duplication within the assembly, such as the example shown here, can also significantly impact such analyses. With the latest patch release we have updated 5 such regions (covering 2 Mb). Several other regions that are phenotypically important, but were represented by a mixed haplotype in GRCh37, have also been updated, including the Williams region on chr7 and the 1q21 region on chr1.

We'll talk about other things FIX patches get you in a later post. Additionally, we'll be highlighting some biologically interesting regions!

Tuesday, April 17, 2012

GRCh37.p8 is now available!

The latest patch release for human (patch 8) is now available! For GRCh37.p8, we've released 9 new FIX patches and 1 new NOVEL patch, and we've updated one FIX patch from a previous release. We'll provide more information about specific patches in future blog posts, but if you want to get the latest data now, go to our FTP site.