Thursday, August 4, 2011

HOW SIMILAR ARE WE ;)



The evolution of life on this planet is based around DNA replication . Over 400 million years, glitches in DNA replication lead to genetic mutations, leading in turn to a variety of life forms. Molecular evidence suggests that between 8 and 4 million years ago, first the gorillas and then the chimpanzees split off from the line leading to the humans. So how similar are we and our closest pal, chimpanzee? The answer must obviously lie in the DNA. Will it surprise you if told that you can extract this solution from a DNA strand using algorithms ? ;) Then good, this blog will let you know how..:) The power of algorithms!!.. :)
Consider a DNA strand each from 2 different organisms. Let the strands be X=ACCGGTCGAGTGCGCGGAAGCCGGCCGAA and Y=GTCGTTCGGAATGCCGTTGCTCTGTAAA. As is our aim ,let us try finding similarity between X and Y. Similarity can be defined in different ways. If one strand is the substring of another, then we can say that they are similar. As you can see neither X nor Y is the substring of the other. Another way is ,if the number of changes required to turn one strand to other is small then the strands are similar. And yet another way, this the one we are interested in, is to find a subsequence common to both and whose length is the largest. Okay ,now major thought in most of your minds will be “ how the hell will we know what you meant by subsequence, forget about being 'common' ”.. Right isn’t it?.. :D So now lets get into the learning of what is a subsequence.



Subsequence of a sequence is nothing but a sequence obtained by deleting 0 or more elements. Confused !!.. As always examples to the rescue.
Given a sequence X= [A,B,C,D,B], Z=[B,D,B] is one of the subsequences with index sequence [2,4,5].
Given 2 sequence, the subsequence similar to both is called common subsequence.
Eg: Given X = [ A,B,C,B,D,A,B] ,Y=[B,D,C,A,B,A] some of the common subsequences are
1. X = [ A , $\bf{B}$ , C , B , D , $\bf{A}$ , $\bf{B}$ ] ,
Y = [ $\bf{B}$ , D , C , $\bf{A}$ , $\bf{B}$ , A ]
$\rightarrow$ [B,A,B]
2. X = [ A , $\bf{B}$ , $\bf{C}$ , B , D , $\bf{A}$ , $\bf{B}$ ] ,
Y = [ $\bf{B}$ , D , $\bf{C}$ , $\bf{A}$ , $\bf{B}$ , A ]
$\rightarrow$ [B,C,A,B]
and so on.
Of these commom subsequences the one with maximum length is called LONGEST COMMON SUBSEQUENCE [LCS]. As you can notice now,LCS that can be obtained from the given X and Y is of length 4 and is not unique. [B,C,A,B] and [B,D,A,B] are both subsequences of length 4. Hence they are the LCS.
Now that we understand what is LCS, finding a common subsequence of maximum length between X and Y shows us , to what extent they are similar. So now our aim is , given 2 sequences X = [ $x_1$ , $x_2$ ,.., $x_m$ ] and Y = [ $y_1$ , $y_2$ ,.., $y_n$ ] find a maximal length common subsequence of X and Y.
Before jumping into finding the LCS we need to know something more. Given a sequence X = GMAIL,
$X_1$ = G
$X_2$ = GM
$X_3$ = GMAI
$X_4$ = GMAIL
are also subsequences of X. They are called prefixes which are denoted by name of the sequence followed by a subscript indicating the number of characters the prefix contain. A prefix is a sequence with end deleted. To be precise, we define the ith prefix of X where i=0,1,2,…,m as X = [ $x_1$ , $x_2$ ,.., $x_i$ ]. $X_0$ is an empty string. Another point worth mentioning is LCS(X,Y) gives the actual sequence whereas lcs(X,Y) denotes the max length of the subsequence.
A straight forward method of finding LCS(X,Y) is by taking all the subsequences of X and checking if they are subsequences of Y too and keep the longest. If X is of length $m$ then there are $2^m$ subsequences. Hence calculating by this method will take exponential time which is a drawback if $m$ is large.So brute force method ruled out. What other technique can we apply. Divide and conquer eh? Okay lets give it a shot.
Consider X= MAN456 , Y= 456SIN,using divide and conquer technique we need to find LCS of MAN,456, 456,SIN separately and combine them. LCS of each one is an empty string, combining which we get LCS(X,Y) also as an empty string , which is a wrong answer because LCS(X,Y)=456 . We cannot split the given inputs X and Y into half ,find their LCS and concatenate the results of fist half and the second half to give the right answer. So the divide and conquer technique does not work here as of now. [Hirschberg hs given an algorithm based on this technique which reduces the space complexity of LCS problem.A link to it is given towards the end.]
Instead of blindly trying out our hands at different techniques , let us try to look into the properties of this LCS.
Suppose 2 sequences end with same characters i.e NEWTON and NEUTRON , then LCS will definitely contain the common characters coming at the end. So its better if we find LCS of the subsequence obtained by deleting these common last characters, and then append the deleted words to it.
1. In NEWTON and NEUTRON , go on removing the last common letters. So we end up having NEWT and NEUTR.
2. Now find the LCS of NEWT and NEUTR which by inspection is NET.
3. Then all we have to do is append the deleted common letters .i. e ON
4. So we get LCS as NETON.

So what we basically did here is, found the LCS of ‘prefixes’ and then appended the common ending elements. By doing so we are reducing the the length of the sequence whose LCS is to be found. Now that’s a plus point.Finding LCS by shortening the sequence.

The above property can be generalized as :
1. X = [ $x_1$ , $x_2$ ,.., $x_m$ ] and Y = [ $y_1$ , $y_2$ ,.., $y_n$ ] if $x_m$ = $y_n$,
then LCS(X,Y) = LCS( $X_{m-1}$ , $Y_{n-1}$ ) , $x_m$
Remember $X_{m-1}$ and $Y_{n-1}$ are prefixes and $x_m$ denotes the $m^{th}$ element in X

Now what if X and Y do not end with same characters. Then LCS(X,Y) is longest of the
LCS( $X_{m-1}$ , $Y$ ) and LCS( $X$ ,$Y_{n-1}$).
Consider this example: X= PQRSTUV ,Y=QRSVW. The LCS of 2 sequences should either end with V or not
Case 1: The LCS ends with V. Then W does not come in the LCS. In that case LCS is nothing but LCS of $X_m$ , $Y_{n-1}$.

Case 2:LCS does not end with V. Hence we can remove the last element from the sequence leading us to LCS(X , Y)=LCS ( $X_{m-1}$ , $Y_n$ )
Whatever be the case, LCS(X , Y) should be the longest. Hence the longest of
LCS ( $X_{m-1}$ , $Y_n$ ) and LCS ( $X_m$ , $Y_{n-1}$ ) should be the answer.
What if length of either X or Y is 0? Then LCS is definitely 0. Koi doubt math rakhna.. :D

As we can infer from above discussions , LCS(X , Y) can be found by finding the LCS of smaller subproblems. Solution to the main problem depends on solution to the smaller sub-problems,which depends on smaller subsubproblems. Also the subproblems are overlapping. Now which technique can we apply to solve such problems.Dynamic programming would be the right answer…

Lets begin the solving process. What say?... let c[i,j] contain the length of the
LCS ( $X_i$ , $Y_j$ )i.e lcs ( $x_i$ , $y_j$ )].Using the observations done above we can write

c[i,j]= 0 $\Rightarrow$ if i = 0 or j = 0
c[i,j]=c[i-1,j-1]+1 $\Rightarrow$ if i,j> 0 and $x_i$ = $y_j$
c[i,j]=max{c[xi,yj-1],c[xi-1,yj]} $\Rightarrow$ if i,j> 0 and $x_i$ != $y_j$

Notice that in this recursive formulae we are restricting which subproblems to consider. That is when $x_i$ = $y_j$ we solve the subproblem LCS[ $X_{i-1}$ , $Y_{j-1}$ ] and other subproblems for different conditions… Using the bottom up approach we can find the LCS of all prefix pairs of $X_i$ and $Y_j$ , from shortest to longest. We iterate through these prefix pairs filling a rectangular table with the information needed to continue and to eventually reconstruct an LCS.Consider another table b[i,j] which points to subproblem used to compute c[i,j]. Combining the tables into one , we can find the solution.
Solving using an example makes the matter clear once and for all.
Let X = [A, B, C, B, D, A, B] and Y = [B, D, C, A, B, A]. Writng X as row header and Y as column header , we calculate c[i,j] for all i,j along with b[i,j]. The first row in the table contains lcs ( $x_0$ , $y_j$ ) which turns out to be 0. Filling the next row,
c[1,0] = 0
To find c[1,1]..here $x_1$ = B and $y_1$ = B , since $x_1$ = $y_1$,
c[1,1] = c[0,0]+1= 1 $\Rightarrow$ since c[0,0] was used to compute c[1,1] ,b[1,1] points towards the cell c[0,0]..



Going onto c[1,2].. here $x_1$ = B and $y_2$ = D , Since $x_1$ != $y_2$ we try to find the max
c[1,2] = max{c[0,1],c[1,1]} = 1 $\Rightarrow$ c[1,1] was used to compute c[1,2] hence b[1,2] points to the cell c[1,1]
What if c[i-1,j] and c[i,j-1] turns out to be equal. Then some of them follow the convention of making b[i,j] point to both the cells c[i-1,j] and c[i,j-1].But here we will choose any one cell.
Now proceeding this way we get a table as shown above which contains the lcs ( $x_i$ , $y_j$ ) and the pointers.

But how do we get LCS. Simple!! JUST trace back from bottom most last cell following the direction of arrows. Wherever $x_i$ = $y_j$ the arrow points to $\nwarrow$ .That means in that $i^{th}$ position of X and $j^{th}$ position of Y the elements are common.So they definitely form the part of LCS. Hence print them out. So following the arrows and printing out elements wherever we find $\nwarrow$ we get LCS in reverse order. Reverse this to get our LCS we are looking for.From the table above LCS(X,Y) turns out to be [B, C, B, A] of length 4.There, that’s how we do it.Here is an algorithm for the same.

LCS-LENGTH(X, Y )
1. m $\leftarrow$ length[X]
2. n $\leftarrow$ length[Y]
3. for i $\leftarrow$ 1 to m
4. $\,$do c[i, 0] $\leftarrow$ 0
5. for j $\leftarrow$ 0 to n
6. $\,$do c[0, j ] $\leftarrow$ 0
7. for i $\leftarrow$ 1 to m
8. $\,$do for j $\leftarrow$ 1 to n
9. $\,$$\,$do if $x_i$ = $y_j$
10.$\,$$\,$$\,$ then c[i, j ] $\leftarrow$ c[i − 1, j − 1] + 1
11.$\,$$\,$$\,$$\,$ b[i, j ] $\leftarrow$ “ $\nwarrow$ ”
12.$\,$$\,$$\,$ else if c[i − 1, j ] $geq$ c[i, j − 1]
13.$\,$$\,$$\,$$\,$then c[i, j ] $\leftarrow$ c[i − 1, j ]
14.$\,$$\,$$\,$$\,$ b[i, j ] ← “ $\uparrow$ ”
15.$\,$$\,$$\,$ else c[i, j ] $\leftarrow$ c[i, j − 1]
16 $\,$$\,$$\,$$\,$$\,$b[i, j ] $\leftarrow$ “ $\leftarrow$ ”
17. return c and b

If the size of the inputs are $m$ and $n$ , then there are m $\times$ n entries in the table, and each entry takes O(1) time to compute. Printing out LCS by retracing can be done in linear O(m + n) time, hence overall, running time of this algorithm is O(mn) and requires O(mn)space. If the input size ,m and n, are equal,then algorithm becomes quadratic. We can reduce the asymptotic space requirement of LCS-LENGTH,as you can notice that LCS-LENGTH is computed row by row,depending only on the row being computed and the row above. This reduces the space requirement to linear. But if we want to reconstruct the LCS,then we would require the complete table occuping quadratic space. Luckily Hirschberg has provided a linear space algorithm which combines the techniques of dynamic programming we used above with the classic divide and conquer approach. If ur interested to knoe more here is a link for the same..

http://www.cs.zju.edu.cn/people/yedeshi/Alinearspace-p341-hirschberg.pdf

LCS can be applied in comparing 2 DNA sequences for similarity , 2 English words for spelling, 2 Java files for repeated codes, gas chromatography, bird song analysis and so on. An equivalent problem to LCS is the “minimum edit distance” problem, where the legal operations are insert and delete. The minimum edit distance to transform X into Y is achieved by doing |X| − LCS(X,Y) deletes and |Y| − LCS(X,Y) inserts. Its the basis of unix “diff” command(a file comparison program that outputs the difference between 2 files), where X and Y are files, and the elements of X and Y are lines of text



Remember our initial DNA strand problem of X=ACCGGTCGAGTGCGCGGAAGCCGGCCGAA and Y=GTCGTTCGGAATGCCGTTGCTCTGTAAA. Their LCS turns out to be GTCGTCGGAAGCCGGCCGAA


And what about our pal chimpanzees ?? How similar are we?..:) I know most of you are waiting for that answer. Well here it is..Its found that human and chimpanzee DNA are 96% similar... :)Now that’s very close… So next time someone shouts and calls you out as chimpanzee or a gorilla in anger,then they are almost right, so tell them “yup that’s right!!do you know HOW SIMILAR ARE WE??”









Shobitha Shetty

16 comments:

  1. Hey well started.giving the DNA touch was vry nice:)The problem is well illustrated with the examples.Liked the way the brute force,divide and conquer method was ruled out.Very well explained using the table.Reminded me of lecture on dynamic programming.This concept of dynamic programming was missing from Algorithinking.Perfect!:)
    PS-The last photo is awesome :D

    ReplyDelete
  2. Awesome start!! Hats off to the title and linkage to the dna strands. Its been more than 3 years I got to read something about ATGC, Adenine, Guanine, Thymine and Cytosine.Algorithms and biology? Well, this blog explains it all. The way in which evolution undergoes and subsequence relating to it is absolutely brilliant!!Superb photos..great job!!

    ReplyDelete
  3. Longest common subsequence nicely illustrated with examples. Various methods of finding it is explained neatly. As usual, your blog was interactive :) Pictures are beautiful :) Ending is too good :) Algorithms and DNA ?? Doctors and Engineers?? :) After reading this, i wondered if they were cousins ;)

    ReplyDelete
  4. A blog on algorithms should talk about time complexity? I don't see anything close????

    ReplyDelete
  5. Nice and intriguing.. just a few comments..
    1) A little more insight on why D&C wont work, after the illustration would help
    2)Also, how DP improves the complexity is not mentioned here.. Inclusion of that section is important.
    Otherwise, good job! :)

    ReplyDelete
  6. This comment has been removed by the author.

    ReplyDelete
  7. punctuation : ensure that there is always a space after a full stop/comma/exclamation

    ReplyDelete
  8. @ niki, shruti, shru : Thnks a lot for ur comments.. :)
    @anonymous : Thnk u so much for pointing that out.. :) I ve updated the same.. :)
    @Esha : Thnks a lot for ur inputs.. :) :) really appreciate it..Based on ur suggestions i ve updated it further.. :)

    ReplyDelete
  9. @shobitha

    what i really loved was the excellent part of the creation of the LCS table..and its use in getting LCS.

    And i marvel at what 4% of genetic difference means to the difference between two races.

    your explanation of each module of the algorithm beforehand made it a lot easier for me to understand the algo...

    As i ponder at how these algorithms were conceived...i realize that the beauty of the human mind is without bound. our subtleties and our epiphanies cannot be modeled by any programmer.except an entity with some higher level of perception.

    but i am rambling off into philosophy.

    your article rocks....

    good work,
    chetan.

    ReplyDelete
  10. This comment has been removed by the author.

    ReplyDelete
  11. " Well, is there really FOUR % diff bw a Human n a Chimp when we look at our Central Govt (Ofcourse they claim to be Humans :P ).....
    but wait, just FOUR percent difference???? :P
    People like ANNA HAZARE stand out from the crowd !! " - is the 1st raw thought tht striked me when i just started reading ur blog...Soon i realised that there's whole new theory out here based on DNA and stuff.. I cud Actually Realise tht the whole World is built on just TWO Pillars...!!!! -'SCIENCE' n 'MATHS'........
    n u proved it...!!! Nice Work :)

    ReplyDelete
  12. @chethan : thnks a lot ..:) ritefully said "beauty of human mind is widout bound "..:)
    @preetam : hehe yea..:D well said abt d difference being 4% ..:) thnks a lot ..:) really appreciate u readin dis ..:)

    ReplyDelete
  13. All of your articles are a treat to algorithm lovers....why you haven't posted since months....please keep writing...waiting for your next article :)

    ReplyDelete
    Replies
    1. Hi Sumeet, would be great if you would give us your introduction :-)

      I will soon be flooding this blog with dozens of articles, As part of 5th semester project, I have asked my students to write a popular science article on their favourite algorithm. I will upload all of them here...

      Delete
    2. Hello Mr Sudarshan Iyengar...am a B.E Comp Sci student, 7th sem.......came across your blog a few days back....Happy to know that i'll get to read some more good stuff on algorithms....keep up the good job...:)

      Delete