Dynamic Programming Algorithms for Haplotype Block Partitioning ...

6 downloads 247503 Views 270KB Size Report
Email: {kuizhang, fsun, msw, tingchen}@hto.usc.edu. ABSTRACT. Recent studies have shown that the human genome has a haplotype block structure such that ...
Dynamic Programming Algorithms for Haplotype Block Partitioning: Applications to Human Chromosome 21 Haplotype Data Kui Zhang, Fengzhu Sun, Michael S. Waterman, Ting Chen* Molecular and Computational Biology Program, Department of Biological Sciences University of Southern California, Los Angeles, CA 90089-1113

Email: {kuizhang, fsun, msw, tingchen}@hto.usc.edu

ABSTRACT Recent studies have shown that the human genome has a haplotype block structure such that it can be divided into discrete blocks of limited haplotype diversity. Patil et al. [6] and Zhang et al. [12] developed algorithms to partition haplotypes into blocks with minimum number of tag SNPs for the entire chromosome. However, it is not clear how to partition haplotypes into blocks with restricted number of SNPs when only limited resources are available. In this paper, we first formulated this problem as finding a block partition with a fixed number of tag SNPs that can cover the maximal percentage of a genome. Then we solved it by two dynamic programming algorithms, which are fairly flexible to take into account the knowledge of functional polymorphism. We applied our algorithms to the published SNP data of human chromosome 21 combining with the functional information of these SNPs and demonstrated the effectiveness of them. Statistical investigation of the relationship between the starting points of a block partition and the coding and non-coding regions illuminated that the SNPs at these starting points are not significantly enriched in coding regions. We also developed an efficient algorithm to find all possible long local maximal haplotypes across a subset of samples. After applying this algorithm to the human chromosome 21 haplotype data, we found that samples with long local haplotypes are not necessarily globally similar.

Categories and Subject Descriptors G.1.6 [Optimization]: Constrained Optimization and Global Optimization. G.2.1 [Combinatorics] Combinatorial Algorithms. I.2.8 [Problem Solving, Control Methods, and Search] Dynamic Programming. J.3 [LIFE AND MEDICAL SCIENCES] Biology and Genetics.

General Terms * Corresponding

author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. RECOMB’03, April 10-13, 2003, Berlin, Germany. Copyright 2003 ACM 1-58113-635-8/03/0004…$5.00.

Algorithms, Measurement, Verification

Keywords Dynamic programming, haplotype haplotypes, clustering, tag SNPs

block,

local

maximal

1. INTRODUCTION The pattern of linkage disequilibrium (LD) plays a central role in genome-wide association studies of identifying genetic variation responsible for common human diseases. Single nucleotide polymorphism (SNP) markers are preferred over microsatellite markers for association studies because of their abundance along the human genome (SNPs with minor allele frequency greater than 0.1 occur once about every 600 basepairs) [8], the low mutation rate and accessibilities to high-throughput genotyping. However, genotyping a large number of individuals for every SNP is too expensive to be practical using current technology. The number of SNPs required for genome-wide association studies depends on the LD pattern. Recent studies [1,2,3,5,6] have shown that the human genome can be partitioned into discrete blocks of extensive LD separated by shorter regions of little or no LD, such that only a small fraction of characteristic (“tag”) SNPs are sufficient to capture most of haplotype structure of the human genome in each block. The haplotype block structures with the corresponding tag SNPs can be extremely useful for association studies in which it is not necessary to genotype all SNPs. A recent simulation study [12] indicated that the genotyping effort could be significantly reduced without much loss of power for association studies. In a large-scale study of chromosome 21, Patil et al. [6] identified 20 haplotypes by a rodent-human somatic cell hybrid technique for 24,047 SNPs (at least 10% minor allele frequency) spanning over about 32.4 Mbps. They developed a greedy algorithm to partition the haplotypes into 4,135 haplotype blocks with 4,563 tag SNPs based on two criteria: (1) In each block, at least 80% of the observed haplotypes are represented more than once, and (2) the total number of tag SNPs for distinguishing at least 80% of haplotypes was as small as possible. For the same data, Zhang et al. [12] reduced the number of blocks and tag SNPs to 2,575 and 3,582, respectively, using a dynamic programming algorithm. Both studies tried to minimize the total number of tag SNPs for the entire chromosome. Thus, when resources are limited, investigators may want to restrict the number of tag SNPs used in their studies. An objective of this paper is to prioritize the SNPs

and the corresponding chromosomal regions for genotyping in association studies. We first gave the mathematical formulation for this problem, and then developed two dynamic programming algorithms for haplotype block partitioning to maximize the fraction of the genome covered using a fixed number of tag SNPs.

contained, we first define ambiguous and unambiguous haplotypes. Considering haplotypes defined as hi ,! , h j , two

The ultimate goal of association studies is to identify genetic variant responsible for human complex diseases and traits. Thus it is necessary to know the functional SNPs. Besides, to understand the biological implications of haplotype block structure, we must know if the results, such as the SNPs at the starts of the long blocks, are associated with biological functions. As an effort to assessing these questions, we first applied our algorithms to the published haplotype data of Human Chromosome 21 (Patil et al., 2001) and obtained the corresponding haplotype block partition. Then we searched the dbSNP database (http://www.ncbi.nlm.nih.gov/SNP/) and the NCBI database to identify SNPs located in regions with known biological functions, which, specifically, are genes, exons, coding regions, and nonsynonymous SNPs. At last, we statistically characterized the relationship between the SNPs at the starting points of the haplotype blocks and the SNPs in these regions to assess the biological functions of the haplotype blocks.

that hl (k )hl ( k ') ≠ 0 . A haplotype in the block is ambiguous if it is compatible with two other haplotypes that are themselves incompatible. For example, consider three haplotypes h1 = (1,0,0,2), h2 = (1,1,2,0), and h3 =(1,1,1,2). Haplotype h1 is

As the first step to understand the structure of haplotypes, we studied the local haplotypes. A local haplotype is a haplotype shared by a subset of samples, instead of all the samples. A local maximal haplotype has maximal length and is shared by a maximal number of samples. Long local haplotypes with sufficient abundance are statistically significant, because they could correspond to important historical events happened during the evolution of the species. Combined with the phenotypes of the samples, and the local information of genes and their functions, local haplotypes can be very useful for the association studies. In this paper, we developed an algorithm to search all local maximal haplotypes. Here, we report our discovery of local maximal haplotypes that are shared by at least two haplotype samples and contain at least 100 SNPs. We compared samples sharing long local haplotypes with global similarity of the samples, and found no significant association between these two.

haplotypes, the k-th and k’-th haplotypes, are compatible if the alleles are the same for the two haplotypes at the loci with no missing data, that is, hl (k ) = hl (k ') for any l , i ≤ l ≤ j such

compatible with haplotypes h2 and h3 , but h2 is not compatible with h3 because they differ at the third locus. Thus, h1 is an ambiguous haplotype, whereas h2 and h3 are unambiguous haplotypes. In the rest of paper, only unambiguous haplotypes are included in the analysis, thus compatible haplotypes will be treated as identical haplotypes. A segment of consecutive SNPs can form a block if at least α percent of unambiguous haplotypes are represented more than once in the samples [6, 12]. The tag SNPs are selected based on the measure of haplotype quality in each block. Different block quality measures have been used based on the purpose of a study. For example, as a possible definition of tag SNPs, Patil et al. [6] defined them as the minimum number of SNPs that can distinguish at least α percentage of the haplotypes. While on the basis of haplotype diversity [5], we can choose the tag SNPs as the minimum number of SNPs that can account at least β percent of overall haplotype diversity. In this study, we followed the definition of tag SNPs in [6]. Given α and the criterion for defining tag SNPs, Zhang et al. [12] developed a dynamic programming algorithm for haplotype block partitioning which finds the partition with the minimum total number of tag SNPs. For a fixed number of SNPs to be genotyped, we consider haplotype block partitions with some SNPs being excluded. Given a set of consecutive SNPs si , si +1 ,!, s j , we first define the following functions: •

2. METHODS We first formulated the problem of partitioning haplotypes into blocks using at most a fixed number of tag SNPs, and another problem of finding local maximal haplotypes for a subset of haplotypes. Then we provided algorithmic solutions to both problems.

block (i,!, j ) : a Boolean function. block (i,!, j ) =1 if and only if at least α M (α ≤ 1) unambiguous haplotypes defined by the SNPs si , si +1 ,!, s j are represented more than once,

where M ≤ K is the total number of unambiguous haplotypes defined by the SNPs si , si +1 ,!, s j .



f (i) : the number of tag SNPs within a block. Given a set of disjoint blocks, B = {B1 , B2 ,!, BI } and B1 ≺ ! ≺ BI , where B1 ≺ B2 indicates that the last SNP of B1 is located before

2.1 Haplotype Block Partitioning with a Fixed Number of Tag SNPs

the first SNP of B2 , (if the last SNP of B1 and the first SNP

Assume that we are given K haplotype samples comprised of n consecutive SNPs: s1 , s2 ,! , sn . For simplicity, the SNPs are

of B2 are not consecutive, the interval between them is excluded from this block partition), the total number of tag

referred as 1, 2,!, n in the context. Let hi , i = 1,2,!, n, be a K -

SNPs for these blocks is f ( B ) = ∑ f ( Bi ) .

I

dimensional vector with the k -th component hi (k ) = 0, 1, or 2 being the allele of the k -th haplotype at the i -th SNP locus, where 0 indicates missing data, and 1 and 2 are the two alleles. Here, we follow the definitions of ambiguous and unambiguous haplotypes, and the haplotype block proposed by Patil et al. [6] and by Zhang et al. [12]. To make the present paper self-

i =1



L(i,!, j ) : the length of a block. For different applications, the L-function can be defined by different measurements. We can simply define it as the number of SNPs in the block, then L(i,!, j ) = i − j + 1. We can also set it as the actual length or weighted length of the genome spanned from the i-th SNP

to the j-th SNP if we want to incorporate genomic location of these SNPs into analysis. Then given a set of disjoint blocks, B = {B1 , B2 ,!, BI } , the total length of these blocks is I

L( B ) = ∑ L( Bi ). i =1

Our goal of finding the haplotype block partition with a fixed number of tag SNPs to cover the maximum length of a genome, can be formulated as follows: Block Partition with a Fixed Number of Tag SNPs (FTSNP): Given K haplotypes consisting of n consecutive SNPs, and an integer m , find a set of disjoint blocks B = {B1 , B2 ,!, BI } with f ( B ) ≤ m such that L( B ) is maximized. This problem can be converted to an equivalent, “dual” problem as follows: Block Partition with a Fixed Genome Coverage (FGC) Given a chromosome of length L, K haplotypes consisting of n consecutive SNPs and β ≤ 1 , find a set of disjoint blocks B = {B1 , B2 ,!, BI } with L( B ) ≥ β L such that f ( B ) is minimized. In the following, we proposed a dynamic programming algorithm for the FTSNP problem, and then a parametric dynamic programming algorithm for the FGC problem.

2.1.1 A 2D Dynamic Programming Algorithm Let S ( j , k ) be the maximum length of the genome that is covered by at most k tag SNPs for the optimal block partition of the first j SNPs, j = 1, 2,!, n. Set S (0, k ) = 0 for any k ≥ 0 and S (0, k ) = −∞ for any k < 0 . Then, applying the dynamic programming theory,  S ( j − 1, k )  S ( j , k ) = max  S (i − 1, k − f (i,!, j )) + L(i,!, j )  if 1 ≤ i ≤ j and block (i ,! , j ) = 1. 

Let B = {B1 ,!, BJ } be the set of disjoint blocks for S ( j , k ) such that L( B) is maximum with the constraint f ( B) ≤ k . Then either the last block BJ ends before j, such that S ( j , k ) = S ( j − 1, k ) , or BJ ends exactly at j and starts at i*, 1 ≤ i* ≤ j , such that S ( j , k ) = S (i * −1, k − f ( BJ )) + L ( BJ ) . Using this recursion, we can design a dynamic programming algorithm to compute S (n, m) , the maximum length of genome that is covered by m tag SNPs. The optimal block partition B can be found by backtracking the elements of S that contribute to S (n, m) . The space complexity for this algorithm is O (m ⋅ n) . If we know the functions block (⋅) , f (⋅) and L(⋅) in advance, the time complexity of this algorithm is O ( N ⋅ m ⋅ n) , where N is the number of SNPs contained in the largest block, and N 0, let the total length of the included

blocks be β L. Then, the number of tag SNPs is S (n, λ ) − λ (1 − β ) L. In fact, S (n, λ ) − λ (1 − β ) L must be equal to the minimum number of tag SNPs that is necessary to include at least β L of the genome length. Otherwise, there must exist a partition that needs only m < S (n, λ ) − λ (1 − β ) L tag SNPs to cover at least β L of the genome, and this partition can reduce the score to m + λ (1 − β ) L < S (n, λ ) , which contradicts the assumption that S (n, λ ) is the minimum. Using similar ideas as in [9], it can be shown that S (n, λ ) : S (n, λ ) is an increase, piecewise-linear, and convex function of λ. The right-most linear segment of S (n, λ ) is constant. The intercept and slope for S (n, λ ) for each piecewise-linear segment are the total number of tag SNPs and the total length of excluded intervals, respectively.

Waterman et al. [9] proposed a method to find S (n, λ ) for all λ ≥ 0 efficiently. To make the present paper self-contained, a brief description of the idea and a brief sketch of the algorithm are given in the following. Assume that for an arbitrary λi , we obtain S (n, λi ) = a + λib , where a equals the number of tag SNPs for the included blocks, and b equals the length of the deleted intervals. Define a linear function s(λ ) = a + λb by a and b . By the definition, S (n, λ ) is minimum, so s(λ ) ≥ S (n, λ ) for any λ ≥ 0 , including λi where s(λi ) = S (n, λi ) . Since S (n, λ ) is piecewise-linear, the point (λi , S (n, λi )) must be located within some linear segment Li (λ )

of S (n, λ ) . If (λi , S (n, λi )) is not located at the end of Li (λ ) ,

the computational time can be reduced to O (n) by using the following recursion [9, 10],  S ( j − 1, λ ) + λ L( j,!, j ); S ( j , λ ) = min   S (i − 1, λ ) + f (i,!, j ), 1 ≤ i ≤ j and block (i,!, j ) = 1. So, the time complexity of this algorithm for finding S (n, λ ) is O (2 N ∧ K ⋅ N ⋅ n) + O( K 2 i N 2 in) + O ( N ⋅ S ⋅ n) , where N is the number of SNPs contained in the largest block, K are the total number of haplotype samples and S is the number of segments in S (n, λ ) and will not exceed the total number of tag SNPs.

After finding all the line segments of S (n, λ ) , we know the entire function of S (n, λ ) . At each intersection point ( x, S (n, x)) , several block partitions with different number of tag SNPs and length of excluded intervals may have the same score. We will choose the right-most one with the maximum number of tag SNPs and the minimum length of excluded intervals. For each line segment between two intersection points, the total number of tag SNPs and the total length of excluded intervals are constant along this segment, and are equivalent for the low intersection point between this segment and the previous segment. We can sort the number of tag SNPs by an ascending order according to the deletion parameters at the intersection points. The gaps between these numbers give us information as to how the block partition is affected by the deletion parameter λ .

2.2 Finding Local Maximal Haplotype for Subset of Samples

s(λ ) ≥ Li (λ ) when λ is around λi , and at the same time,

If there exists a subset of k > 1 haplotype samples that share the same haplotype from SNPs i to j, we say these k samples have the same local haplotype, specifically hi ! h j , from the i -th SNP

s(λi ) = Li (λi ) . Repeating this idea, we are able to find all such

to the j -th SNP. A haplotype hi ! h j is maximal, if there does

s (λ ) should be exactly the line defined by Li (λ ) , because

line segments for S (n, λ ) .

not exits any other haplotype hu ! hv that is shared by these k

The algorithm begins with S (n,0) and S (n, ∞). S (n,0) = 0 , and the slope of the corresponding line L0 equals the total length of

samples,

contains

hi ! h j :

either

u