Sunday, September 25, 2011

MARRY ALGORITHMICALLY :P

As soon as you see the title, you might wonder, "Why 'marriage' in Algorithinking?!" Well, before you end up thinking that I am going to describe how matrimonial sites work (!!), here is an excellent algorithm I came across recently, called the "Stable Marriage Algorithm".

According to Wikipedia, the Stable Marriage Problem is stated as: "Given n men and n women, where each person has ranked all members of the opposite sex with a unique number between 1 and n in order of preference, marry the men and women together such that there are no two people of opposite sex who would both rather have each other than their current partners. If there are no such people, all the marriages are 'stable'."

If this sounds confusing, let me restate it for you. There are two disjoint sets $M$ and $W$, both of cardinality $n$. Each element of $M$ ranks all the elements of $W$ from $1$ to $n$ (a unique number for each element of the set), and vice versa; we have to find a stable matching between the two sets. Let $A \in M$ and $B \in W$, where $A$ and $B$ are not matched with each other. The matching is stable if for no such pair $(A,B)$ does it happen that $A$ prefers $B$ over its current partner while $B$ also prefers $A$ over its current partner.

When I first went through the problem statement, I wondered how anyone could possibly find such a matching; I was thinking of its real-world implications. I was flabbergasted to find out that this algorithm was originally framed to solve the problem of assigning graduating medical students to hospital jobs. And guess what, it worked!
In the year 1962, David Gale and Lloyd Shapley proved that for any $n$, it is always possible to solve the SMP (Stable Marriage Problem) and hence find a stable matching.

Here's how they explain their algorithm:

It consists of a number of rounds. In each round, every unengaged man proposes to the woman he prefers most among those he has not yet proposed to. Each woman then considers all her suitors, replies "maybe" to the one she prefers most, and gives a straight "no" to all the rest. This goes on, round after round, until everyone is matched. Whenever a woman says "maybe", she is provisionally engaged to that suitor and he to her.
To begin, in the first round each unengaged man proposes to the woman he prefers most, and each woman replies "maybe" to the suitor she prefers most and "no" to all the rest.
In the following rounds, each unengaged man proposes to the most preferred woman he has not yet proposed to, irrespective of whether she is already engaged. Each woman again replies "maybe" to the suitor she prefers most, counting her current fiance among the candidates, and "no" to the rest, irrespective of whom she is currently engaged to.

However, there is one prerequisite: every participant's preference list must be complete, and no candidate may be declared unacceptable.
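If you want to play with it right away, here is a minimal Python sketch of the proposal rounds just described; the function name and the tiny sample preference lists are mine, purely for illustration.

def stable_marriage(men_prefs, women_prefs):
    # men_prefs / women_prefs: dict mapping each person to his/her preference list
    # rank[w][m] = position of m in w's list, for O(1) "whom does she prefer?" checks
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    free_men = list(men_prefs)                 # every man starts unengaged
    next_choice = {m: 0 for m in men_prefs}    # next woman each man will propose to
    engaged_to = {}                            # woman -> her current "maybe"
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]       # his best woman not yet proposed to
        next_choice[m] += 1
        if w not in engaged_to:
            engaged_to[w] = m                  # she says "maybe"
        elif rank[w][m] < rank[w][engaged_to[w]]:
            free_men.append(engaged_to[w])     # she dumps her current suitor
            engaged_to[w] = m
        else:
            free_men.append(m)                 # a straight "no"
    return {m: w for w, m in engaged_to.items()}

men = {'Ram': ['Sita', 'Gita'], 'Shyam': ['Gita', 'Sita']}
women = {'Sita': ['Shyam', 'Ram'], 'Gita': ['Ram', 'Shyam']}
print(stable_marriage(men, women))  # every man ends up engaged, with no blocking pair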

In the end,


  • Everyone gets married: Once a woman is engaged, she stays engaged to someone for the rest of the algorithm, since in each round she keeps her most preferred suitor so far. (If she is happy with her first suitor, she may never change her decision and simply say "no" to every later candidate; even then she remains engaged to her first choice.)
    A man, if necessary, proposes to every woman. So there can be no man and no woman left unengaged.

  • The marriages are stable: Paraphrasing Wikipedia, let me explain this with an example. Let Aishwarya and Abhishek both be married at the end, but not to each other. It is impossible for both Abhishek and Aishwarya to prefer each other over their current partners. Here's why. If Abhishek prefers Aishwarya over his current wife, he must have proposed to Aishwarya before he proposed to his current partner. There are two cases:
    1. If Aishwarya had accepted his proposal, yet isn't married to him at the end, she must have dumped him for someone she prefers more, so she can't prefer Abhishek over her current partner.
    2. If Aishwarya never accepted Abhishek's proposal, then she already had a suitor she preferred over him in the first place.
Even though this algorithm works well, there are a few issues with respect to its optimality: the matching it produces is the best possible stable matching for the proposing side and the worst possible for the other side. You can refer to the Wikipedia page for more information.

You can try out and play with this algorithm; code it in any language you like, it's quite easy. In the demo below, you are asked for the orders of preference and it gives you a stable matching.
http://sephlietz.com/gale-shapley/

The following is a link to a paper studying the Gale-Shapley algorithm. It focuses on proving how appropriate stable matching mechanisms are; it says that even if a woman knows the preference lists of all the other women, she cannot cheat.
www.columbia.edu/~js1353/pubs/tst-ms01.pdf

You can also find an interactive Flash demonstration of this problem; after going through it, you can easily understand the problem.

This algorithm is also currently employed by the New York and Boston public school systems to assign students to schools.

Shruti Ranade

Thursday, August 4, 2011

Maximum Subsequence

6 interns sat in front of their laptops with enthusiasm flashing on their faces brighter than the laptop light. And why not, it was the first day of their internship and they were eager to start off. Their wait ended when their professor entered and, without wasting a single moment, gave them a problem to solve. He called it the maximum subsequence problem. It sounded trivial enough.

Given a sequence of numbers in an array, can you find a contiguous run of numbers that gives you the maximum sum? If all the numbers are positive, then the maximum subsequence is the whole array; the problem arises when we encounter negative numbers. Completing the problem statement: if all the numbers turn out to be negative, then the maximum subsequence sum is taken as zero.
If you are still not clear about the problem: it is equivalent to placing two sticks in the array at such positions that the sum between them is the highest compared to any other positions.
Within no time they had an algorithm ready. The only thing on their minds was to solve it as quickly as possible; they never really gave any thought to bettering the algorithm.
Again brute force came to their rescue: simply compute the sum for all pairs of positions, and the pair with the highest sum is the answer. It turns out the complexity is quadratic. Well, at least the problem was solved.
But the professor was adamant that the problem could be tackled in linear time.
Yup, that's the professor! :)
How can anyone say there is a linear-time algorithm for this?
The awesome proof follows. Forget about algorithms for a while and think this way.
Say there is an oracle which gives you the answer for any array of numbers you give it. But there's a catch: it takes only $n-1$ inputs. Oops, we have $n$ numbers in the array. But think for a while; there has to be a way in which it can be used.
I'll just give the oracle the whole array except for the last element, and the oracle returns the answer for those $n-1$ elements. Now we have two things in hand: one is the last element which was left out, and the other is the maximum subsequence sum of the first $n-1$ elements. Can we, with these two components, infer the actual maximum subsequence sum of the whole array? And guess what, yes we can!
There are just two scenarios:
1. The last element is negative. This is the easiest one: whatever the oracle gave becomes the final answer.
2. The last element is positive. Two cases arise here:
  1. The maximum subsequence given by the oracle ends at the $(n-1)$th element. In this case, since the last number is positive, we include the $n$th element too in the maximum subsequence and arrive at the final answer.
  2. The maximum subsequence does not end at the $(n-1)$th element. This is the trickiest one: the best subsequence using the $n$th element might be very different from the oracle's. So we find the maximum subsequence sum that ends at the last ($n$th) position; in simpler words, the maximum sum you can obtain with one stick fixed at the end. Compare this with the subsequence sum given by the oracle; whichever is greater fetches us the answer.
So the oracle is making our task very easy. But you may wonder where this discussion is heading. How do you think the oracle solves the problem? Don't you think the oracle, having $n-1$ elements, can in turn give all of its elements except one to another oracle?
Doesn't this thought trigger something? It's nothing but decrease and conquer told differently! And as you can see, this runs in linear time, provided that at each step we also carry along the maximum sum ending at the last position, so that case 2.2 costs only constant work.
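Here is a minimal Python sketch of that idea, usually credited as Kadane's linear-time algorithm: extend the oracle's answer one element at a time, carrying along the best sum that ends at the current position (the names and sample array are mine, for illustration).

def max_subsequence_sum(a):
    best = 0               # the empty subsequence is allowed, so the answer is at least 0
    best_ending_here = 0   # max sum of a contiguous run ending at the current element
    for x in a:
        # either extend the previous best run by x, or start afresh with an empty run
        best_ending_here = max(0, best_ending_here + x)
        best = max(best, best_ending_here)
    return best

print(max_subsequence_sum([4, -3, 5, -2, -1, 2, 6, -2]))  # 11, from the run 4 -3 5 -2 -1 2 6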
Interestingly, this problem has a history too. It first arose in a two-dimensional form, and in order to simplify it, it was converted into a one-dimensional form. First a cubic algorithm was formulated, after which quadratic and $O(n \log n)$ ones followed. The best part: the linear algorithm was designed in less than a minute by a person who attended a seminar where the problem was described.

How does it really matter?
You may always say, "My computer is very fast, so it doesn't really matter." OK, what's the harm in comparing anyway? Let's say a computer is as fast as 10 GHz; that is, it can perform $10^{10}$ steps in one second.

$N$        $N^{2}$
$10$       $10^{2}$
$10^{2}$   $10^{4}$
$10^{3}$   $10^{6}$
$10^{4}$   $10^{8}$
$10^{5}$   $10^{10}$
$10^{9}$   $10^{18}$

Let's calculate the time for the last instance.
Time taken by the linear one: $10^{9}/10^{10} = 0.1$ seconds.
Time taken by the quadratic one: $10^{18}/10^{10} = 10^{8}$ seconds.
Converting that into years, it is approximately 3.2 years!!!!!!
You see what a difference it makes: one takes less than a second to solve the problem whereas the other takes years. And this is just between linear and quadratic; imagine the scenario for cubic and beyond.
Of course, it is inevitable to use those when you can't find a linear-time algorithm, or when no such algorithm exists. But effort has to be made to reduce the complexity, and one way of doing better is using the appropriate algorithm design techniques.
"Science is what we understand well enough to explain it to a computer".The technology may advance even more tomorrow and there will be new programming languages formed,but independent of all this the essence of algorithms will always be the same.Thats what makes it the most important part computer science.Dijkstra was right when he said "Computer Science is no more about computers than astronomy is about telescope".
What do you think is the real foundation of computer science?:)

So what happened to our 6 friends? Well, they had the best of beginnings, and continued on through what became an amazing 2 months of learning.
Moral of the story: between a fast computer and a fast algorithm, always go for the latter. You know why :):)

Nikitha Shenoy

HOW SIMILAR ARE WE ;)



The evolution of life on this planet is based around DNA replication. Over 400 million years, glitches in DNA replication led to genetic mutations, leading in turn to a variety of life forms. Molecular evidence suggests that between 8 and 4 million years ago, first the gorillas and then the chimpanzees split off from the line leading to humans. So how similar are we and our closest pal, the chimpanzee? The answer must obviously lie in the DNA. Will it surprise you to be told that you can extract this answer from DNA strands using algorithms? ;) Good, then this post will let you know how.. :) The power of algorithms!! :)
Consider a DNA strand each from 2 different organisms. Let the strands be X=ACCGGTCGAGTGCGCGGAAGCCGGCCGAA and Y=GTCGTTCGGAATGCCGTTGCTCTGTAAA. As is our aim, let us try to find the similarity between X and Y. Similarity can be defined in different ways. If one strand is a substring of the other, then we can say that they are similar; as you can see, neither X nor Y is a substring of the other. Another way: if the number of changes required to turn one strand into the other is small, then the strands are similar. And yet another way, the one we are interested in, is to find a subsequence common to both whose length is the largest. Okay, now the major thought in most of your minds will be, "How on earth will we know what you mean by subsequence, forget about 'common'".. Right? :D So now let's get to learning what a subsequence is.



A subsequence of a sequence is nothing but a sequence obtained from it by deleting 0 or more elements. Confused!? As always, an example to the rescue.
Given a sequence X = [A,B,C,D,B], Z = [B,D,B] is one of its subsequences, with index sequence [2,4,5].
Given 2 sequences, a subsequence common to both is called a common subsequence.
Eg: Given X = [A,B,C,B,D,A,B] and Y = [B,D,C,A,B,A], some of the common subsequences are
1. X = [ A , $\bf{B}$ , C , B , D , $\bf{A}$ , $\bf{B}$ ] ,
Y = [ $\bf{B}$ , D , C , $\bf{A}$ , $\bf{B}$ , A ]
$\rightarrow$ [B,A,B]
2. X = [ A , $\bf{B}$ , $\bf{C}$ , B , D , $\bf{A}$ , $\bf{B}$ ] ,
Y = [ $\bf{B}$ , D , $\bf{C}$ , $\bf{A}$ , $\bf{B}$ , A ]
$\rightarrow$ [B,C,A,B]
and so on.
Of these common subsequences, the one with maximum length is called the LONGEST COMMON SUBSEQUENCE [LCS]. As you can notice, the LCS obtainable from the given X and Y is of length 4 and is not unique: [B,C,A,B] and [B,D,A,B] are both common subsequences of length 4, hence both are LCSs.
Now that we understand what an LCS is: finding a common subsequence of maximum length between X and Y shows us to what extent they are similar. So now our aim is: given 2 sequences X = [ $x_1$ , $x_2$ ,.., $x_m$ ] and Y = [ $y_1$ , $y_2$ ,.., $y_n$ ], find a maximum-length common subsequence of X and Y.
Before jumping into finding the LCS, we need to know one more thing. Given a sequence X = GMAIL,
$X_1$ = G
$X_2$ = GM
$X_3$ = GMA
$X_4$ = GMAI
$X_5$ = GMAIL
are also subsequences of X. They are called prefixes, denoted by the name of the sequence followed by a subscript indicating the number of characters the prefix contains. A prefix is the sequence with its end deleted. To be precise, we define the $i$th prefix of X, for i = 0,1,2,…,m, as $X_i$ = [ $x_1$ , $x_2$ ,.., $x_i$ ]; $X_0$ is the empty string. Another point worth mentioning: LCS(X,Y) denotes the actual sequence, whereas lcs(X,Y) denotes the maximum length of a common subsequence.
A straightforward method of finding LCS(X,Y) is to take all the subsequences of X, check which of them are subsequences of Y too, and keep the longest. If X is of length $m$, there are $2^m$ subsequences, so this method takes exponential time, which is a drawback when $m$ is large. So the brute-force method is ruled out. What other technique can we apply? Divide and conquer, eh? Okay, let's give it a shot.
Consider X = MAN456, Y = 456SIN. Using the divide and conquer technique, we would find the LCSs of MAN and 456, and of 456 and SIN, separately, and combine them. The LCS of each pair is the empty string, and combining them we get LCS(X,Y) as the empty string too, which is a wrong answer, because LCS(X,Y) = 456. We cannot simply split the given inputs X and Y into halves, find their LCSs, and concatenate the results of the first halves and second halves to get the right answer. So plain divide and conquer does not work here. [Hirschberg has given an algorithm based on this technique which reduces the space complexity of the LCS problem; a link to it is given towards the end.]
Instead of blindly trying our hands at different techniques, let us look into the properties of this LCS.
Suppose 2 sequences end with the same characters, e.g. NEWTON and NEUTRON; then the LCS will definitely contain the common characters coming at the end. So it is better to find the LCS of the prefixes obtained by deleting these common last characters, and then append the deleted characters to it.
1. In NEWTON and NEUTRON, go on removing the last common letters. We end up with NEWT and NEUTR.
2. Now find the LCS of NEWT and NEUTR, which by inspection is NET.
3. Then all we have to do is append the deleted common letters, i.e. ON.
4. So we get the LCS as NETON.

So what we basically did here is: we found the LCS of 'prefixes' and then appended the common ending elements. By doing so we are reducing the length of the sequences whose LCS is to be found. Now that's a plus point: finding the LCS by shortening the sequences.

The above property can be generalized as:
1. For X = [ $x_1$ , $x_2$ ,.., $x_m$ ] and Y = [ $y_1$ , $y_2$ ,.., $y_n$ ], if $x_m$ = $y_n$,
then LCS(X,Y) = LCS( $X_{m-1}$ , $Y_{n-1}$ ) followed by $x_m$.
Remember, $X_{m-1}$ and $Y_{n-1}$ are prefixes, and $x_m$ denotes the $m^{th}$ element of X.

2. Now what if X and Y do not end with the same character? Then LCS(X,Y) is the longer of LCS( $X_{m-1}$ , $Y$ ) and LCS( $X$ , $Y_{n-1}$ ).
Consider this example: X = PQRSTUV, Y = QRSVW. The LCS of the 2 sequences either ends with V or it does not.
Case 1: The LCS ends with V. Then W does not come into the LCS; in that case the LCS is nothing but LCS( $X_m$ , $Y_{n-1}$ ).

Case 2: The LCS does not end with V. Hence we can remove the last element from X, leading us to LCS(X , Y) = LCS( $X_{m-1}$ , $Y_n$ ).
Whatever the case, LCS(X , Y) should be the longest; hence the longer of LCS( $X_{m-1}$ , $Y_n$ ) and LCS( $X_m$ , $Y_{n-1}$ ) is the answer.
What if the length of either X or Y is 0? Then the LCS length is definitely 0. No doubt about that, right? :D

As we can infer from the above discussion, LCS(X , Y) can be found from the LCSs of smaller subproblems. The solution to the main problem depends on solutions to smaller subproblems, which in turn depend on smaller sub-subproblems; also, the subproblems overlap. Now, which technique can we apply to solve such problems? Dynamic programming would be the right answer…

Let's begin the solving process. What say?... Let c[i,j] contain the length of LCS( $X_i$ , $Y_j$ ), i.e. lcs( $X_i$ , $Y_j$ ). Using the observations made above, we can write:

c[i,j] = 0 $\Rightarrow$ if i = 0 or j = 0
c[i,j] = c[i-1,j-1] + 1 $\Rightarrow$ if i,j > 0 and $x_i$ = $y_j$
c[i,j] = max{c[i,j-1], c[i-1,j]} $\Rightarrow$ if i,j > 0 and $x_i$ != $y_j$

Notice that this recursive formula restricts which subproblems we consider: when $x_i$ = $y_j$ we solve the subproblem LCS( $X_{i-1}$ , $Y_{j-1}$ ), and the other subproblems under the other conditions… Using the bottom-up approach we can find the lcs of all prefix pairs of X and Y, from shortest to longest. We iterate through these prefix pairs, filling a rectangular table with the information needed to continue and to eventually reconstruct an LCS. Consider another table b[i,j] which points to the subproblem used to compute c[i,j]. Combining the two tables, we can find the solution.
Solving an example makes the matter clear once and for all.
Let X = [A, B, C, B, D, A, B] and Y = [B, D, C, A, B, A]. Writing X along the rows and Y along the columns, we calculate c[i,j] for all i,j along with b[i,j]. The first row of the table contains lcs( $X_0$ , $Y_j$ ), which turns out to be 0 everywhere. Filling the next row:
c[1,0] = 0
To find c[1,1]: here $x_1$ = B and $y_1$ = B; since $x_1$ = $y_1$,
c[1,1] = c[0,0] + 1 = 1 $\Rightarrow$ since c[0,0] was used to compute c[1,1], b[1,1] points towards the cell c[0,0].



Going on to c[1,2]: here $x_1$ = B and $y_2$ = D. Since $x_1$ != $y_2$, we take the max:
c[1,2] = max{c[0,2], c[1,1]} = 1 $\Rightarrow$ c[1,1] was used to compute c[1,2], hence b[1,2] points to the cell c[1,1].
What if c[i-1,j] and c[i,j-1] turn out to be equal? Then some follow the convention of making b[i,j] point to both cells c[i-1,j] and c[i,j-1]; here we will choose just one cell.
Proceeding this way, we get a table, as shown above, containing the lcs( $x_i$ , $y_j$ ) values and the pointers.

But how do we get the LCS itself? Simple!! Just trace back from the bottom-right cell following the direction of the arrows. Wherever $x_i$ = $y_j$, the arrow points $\nwarrow$; that means the element at the $i^{th}$ position of X and the $j^{th}$ position of Y is common, so it definitely forms part of the LCS. Hence print it out. Following the arrows and printing the elements wherever we find $\nwarrow$, we get the LCS in reverse order; reverse this to get the LCS we are looking for. From the table above, LCS(X,Y) turns out to be [B, C, B, A], of length 4. There, that's how we do it. Here is an algorithm for the same.

LCS-LENGTH(X, Y)
1. m $\leftarrow$ length[X]
2. n $\leftarrow$ length[Y]
3. for i $\leftarrow$ 1 to m
4. $\,$do c[i, 0] $\leftarrow$ 0
5. for j $\leftarrow$ 0 to n
6. $\,$do c[0, j] $\leftarrow$ 0
7. for i $\leftarrow$ 1 to m
8. $\,$do for j $\leftarrow$ 1 to n
9. $\,$$\,$do if $x_i$ = $y_j$
10. $\,$$\,$$\,$then c[i, j] $\leftarrow$ c[i − 1, j − 1] + 1
11. $\,$$\,$$\,$$\,$b[i, j] $\leftarrow$ “ $\nwarrow$ ”
12. $\,$$\,$$\,$else if c[i − 1, j] $\geq$ c[i, j − 1]
13. $\,$$\,$$\,$$\,$then c[i, j] $\leftarrow$ c[i − 1, j]
14. $\,$$\,$$\,$$\,$$\,$b[i, j] $\leftarrow$ “ $\uparrow$ ”
15. $\,$$\,$$\,$else c[i, j] $\leftarrow$ c[i, j − 1]
16. $\,$$\,$$\,$$\,$$\,$b[i, j] $\leftarrow$ “ $\leftarrow$ ”
17. return c and b
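For those who prefer running code to pseudocode, here is a compact Python rendering of the same dynamic program; it is a sketch that skips the explicit b table and re-derives the arrows during the trace-back.

def lcs(X, Y):
    m, n = len(X), len(Y)
    # c[i][j] = length of an LCS of the prefixes X_i and Y_j
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    # trace back from the bottom-right cell, mimicking the arrows of the b table
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:           # the diagonal-arrow case: part of the LCS
            out.append(X[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:   # the up-arrow case
            i -= 1
        else:                              # the left-arrow case
            j -= 1
    return list(reversed(out))

print(lcs('ABCBDAB', 'BDCABA'))  # ['B', 'C', 'B', 'A'], length 4, as in the table above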

If the sizes of the inputs are $m$ and $n$, then there are m $\times$ n entries in the table, and each entry takes O(1) time to compute. Printing out the LCS by retracing can be done in linear O(m + n) time, hence overall the running time of this algorithm is O(mn), and it requires O(mn) space. If the input sizes m and n are equal, the algorithm is quadratic. We can reduce the asymptotic space requirement of LCS-LENGTH: notice that the table is computed row by row, depending only on the row being computed and the row above, which reduces the space requirement to linear. But if we want to reconstruct the LCS, we would require the complete table, occupying quadratic space. Luckily, Hirschberg has provided a linear-space algorithm which combines the dynamic programming technique we used above with the classic divide and conquer approach. If you're interested to know more, here is a link for the same.

http://www.cs.zju.edu.cn/people/yedeshi/Alinearspace-p341-hirschberg.pdf

LCS can be applied in comparing 2 DNA sequences for similarity, 2 English words for spelling, 2 Java files for repeated code, gas chromatography, bird song analysis and so on. An equivalent problem to LCS is the "minimum edit distance" problem, where the legal operations are insert and delete: the minimum edit distance to transform X into Y is achieved by doing |X| − LCS(X,Y) deletes and |Y| − LCS(X,Y) inserts. It is the basis of the Unix "diff" command (a file comparison program that outputs the difference between 2 files), where X and Y are files, and the elements of X and Y are lines of text.



Remember our initial DNA strand problem, with X=ACCGGTCGAGTGCGCGGAAGCCGGCCGAA and Y=GTCGTTCGGAATGCCGTTGCTCTGTAAA? Their LCS turns out to be GTCGTCGGAAGCCGGCCGAA.


And what about our pal the chimpanzee?? How similar are we?..:) I know most of you are waiting for that answer. Well, here it is: human and chimpanzee DNA are found to be 96% similar... :) Now that's very close… So the next time someone shouts and calls you a chimpanzee or a gorilla in anger, they are almost right, so tell them, "Yup, that's right!! Do you know HOW SIMILAR WE ARE??"









Shobitha Shetty

FACE TO FACE!!!

Facebook. It needs no introduction. It has become the grand face of the social networking era. From a school-going kid to a retired man, everyone is facebooking. Its home page is one of the most frequently visited webpages on the entire internet. Many brands make their biggest impacts through this famous digital marketing tool. With its amazing applications, Facebook has acquired over 500 million users. So, what is the connection with 'Algorithinking'?

As soon as you log in, the news feed is the first page that you see. Ever wondered how the news feed shows you relevant content? Do you think it's some random event? If Facebook showed you all the content generated by your friends, well, you know what would happen. Especially when it comes to brands, Facebook ensures that specific brands are seen by the right people through news feeds. Facebook recently gave us some insight into the algorithm behind this 'News Feed Optimization': the EdgeRank algorithm.

There are two distinct parts to Facebook's news feed: 1. 'Top news' and 2. 'Most recent'. The news feed that you get to see is always, by default, the top news. According to Facebook, top news is "based on an algorithm [that] uses factors such as how many friends are commenting on a post to aggregate content that [a person] will find interesting. It displays stories based on their relevance rather than in chronological order."
At the shallowest level, there exists something called an object.
Facebook views all the inputs as "objects", and those "objects" may be status updates, video links, pictures, etc.
When you have an 'object' in your news feed, and some other user interacts with this object (say, he/she comments on a status update), they are creating what is known as an Edge.

The EdgeRank formula:
$\sum_{\text{edges } e} u_e \, w_e \, d_e$
where, for each edge $e$, $u_e$ is the affinity score, $w_e$ the weight of the edge type, and $d_e$ a time decay factor; these are the three components discussed below.
The image below shows the news feed optimization expression; it is a snapshot of the algorithm, posted by a researcher at MIT.


There are three main components related to an Edge: affinity, weight and time.
Affinity: You will have more affinity towards some friends on Facebook than others. The more often you like, tag, comment on, view and click on a friend's objects, the higher the so-called affinity score between you and your friend becomes. Affinity is always a one-way path: it is your own interactions that decide whose posts land in your news feed, so if you have liked or commented on a friend's posts, there is a high probability that your news feed will carry that friend's posts.
For example, I like The Beatles and I frequently visit their page and comment on their posts. So my news feed carries their content, because of the high affinity I have towards them on Facebook.

Weight: This measure deals with the value of an object's edges. The more edges an object has, the more is its weight, and so is its EdgeRank. Edges also differ from one another based on whether they were created by commenting, liking or tagging the post: it is obvious that more weight is given to an edge that is a comment on the object than to one that is a like.

Time: Your news feed is most likely to contain today's news rather than yesterday's, hence Facebook adds this criterion to its EdgeRank. Affinity and weight are not the only criteria Facebook looks into, because time is the factor that keeps your feed fresh. For example, if Karnataka's CM was replaced on Wednesday and it was the topmost news everyone was commenting on, say it received some 500 comments, it still won't be on your news feed after a week, despite its large number of comments. The 'time decay score' is the major factor playing its role here, and the reason our news feeds get the latest news.
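Since Facebook has never published the actual scores or the decay curve, the following Python sketch is only a toy rendering of the published formula; every number, field name and the decay function in it are made up for illustration.

import time

def edge_rank(edges, now=None):
    # Toy version of: sum over edges e of (affinity u_e) * (weight w_e) * (decay d_e)
    now = now or time.time()
    score = 0.0
    for e in edges:
        age_days = (now - e['created']) / 86400.0
        decay = 1.0 / (1.0 + age_days)          # invented decay curve: fades day by day
        score += e['affinity'] * e['weight'] * decay
    return score

post_edges = [
    {'affinity': 0.9, 'weight': 3.0, 'created': time.time() - 3600},     # a fresh comment
    {'affinity': 0.4, 'weight': 1.0, 'created': time.time() - 172800},   # a two-day-old like
]
print(edge_rank(post_edges))  # the fresh, heavy comment dominates the score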

Now, is this really important?

If it weren't showing you relevant information, the chances of your missing something really important would be high. Or you might simply grow bored of seeing all kinds of news all the time, both relevant and irrelevant, which could become one of the reasons for Facebook's breakdown!

When it comes to business and brands, a user's affinity towards your brand is your key ingredient for success. If you want your brand's content on almost everyone's news feed, then you have to create content that earns a high EdgeRank on all three components.
A white paper has been written explaining Facebook's EdgeRank algorithm; you can refer to it for more information.
http://forms.buddymedia.com/whitepaper-form_facebooks-edgerank.html

Happy facebooking :-)


Shruti Ranade

Monday, August 1, 2011

In Quest of the Best !!


Jerry, a cute-looking mouse, extremely intelligent (stupid at times!!), is eating cheese. He hears some crackling noise outside and goes out to check. He reads a note that says,
“Dear Jerry,
Please take care of Tuffy over the Thanksgiving holiday. P.S: he loves to eat!” :P

Jerry, being a little lazy, frowns at the idea of spending a nice holiday with Tuffy. A little later, "A nanny!!" he exclaims to himself, thinking, "I can always hire a nanny as long as I have enough cheese to pay her ;)". But Jerry is a perfectionist: he wants to hire 'The Best' nanny for his dear nephew, and is quite excited about the idea. But on what basis would he know who the best was? He had an answer for that too: "Whoever has scored the highest marks in 'Mouse-keeping' school will get the post." By the way, you have to know this: Jerry loves bossing around. Ah...ha…you guessed it right!! That is another reason why he wanted to 'hire' someone :P

Much to Jerry's delight, he saw a whole bunch of them waiting for the post. Soon he murmured to himself, "Dude, it is not going to be easy to choose the best." He planned to do the following:

(Are you wondering, "Is Jerry not smart enough to find the highest scorer among the lot?" Well, let me tell you: Jerry has no clue about anyone's marks. Only when a candidate enters the room for the interview does he get to know her score, and on the spot she has to be either hired or fired, with no other option left. Also assume that the 'Mouse-keeping' school doesn't give out who the highest scorer is.)

Ask them to maintain a queue and let everyone in, one by one, in the order A, B, C and so on.
Blindly hire 'A' and 'pay' some cheese.
Look at 'B'. If 'B' has a score better than the previously chosen one, i.e. 'A', then he fires 'A' (along with which the paid cheese is gone :P) and hires 'B'. Not to forget, he has to pay cheese to 'B' on hiring.

He plans to do this until he has interviewed everyone, in order to find the best among the lot.

“Oh dear God!! Won't I run out of cheese if I do that? What if I feel restless in the middle of the whole process??” Yes…finally...he realizes how much he loves cheese and how lazy he is!! So he contacts his good old mathematician friend Mr. X to help him out with this problem.
Mr. X thinks over the problem a little and speaks to Jerry. The conversation goes like this.

Mr. X : I can help you if you are ready to compromise a little.
Jerry : Oh ! I don’t mind as long as it finds me The Best nanny and I get a larger share of the cheese.
Mr.X : Ok I can’t promise you that I will find the best Nanny, but I promise you to save large amount of your cheese and get you a Nanny closest to the Best.
Jerry : Nice. Alright then, give out the solution.
Mr.X : Given $n$ members, find the best among the first $k$ by the hiring and firing process; let us call her $\beta$. Hire $\beta$. Then find the first person among $k+1$ to $n$ who beats $\beta$'s score, say $B_1$. There you are!! She is the nanny you want to hire.
Jerry (with a weird look) : Really?? You think this is gonna work?? And how do I know $k$?
Mr.X : Well, $k$ = number of people attending the interview ($n$) divided by the exponential constant ($e$).
Jerry : Oh great !! I will try that out.
Hurray!! He verifies that the method actually works!! He hires $B_1$ and she really does take very good care of Tuffy.

As usual, happy Ending of the story. :) :) :)

But wait a minute, what exactly happened behind the scenes? How was he so sure that this would work out? Any miracle? And by the way, who is this Mr. X?
Mr. X is none other than 'Probability'!!!!!

Let us see how probability finds us a solution for this!! Summarizing the method:

1. A $k$ is found. This $k$ is the point at which you can AFFORD to start feeling restless.
2. What is the use of this $k$?
Hire the best person $\beta$ in the range 1 to $k$ by the 'hiring and firing' process. Then find the first person in the range $k+1$ to $n$ who beats $\beta$; let us call her $B_1$. There is a high chance of $B_1$ being the actual best. If this works, you save a lot of cheese and time, and yet you get the best, or someone closest to the best. But how?? Perplexed??

The point to be noted is that for this to work, the actual best has to occur after position $k$, and most of the time she does!! So proceeding this way is worth a try. But the question is: how do we find a $k$ for which the method works?

$B_i$ : the probability of the best occurring in the $i$th position, i.e. $\frac{1}{n}$
$W_i$ : the probability of $\beta$ occurring in the right position, i.e. $\frac{k}{i-1}$
With the best at position $i$, she actually gets picked exactly when the best of the first $i-1$ candidates lies within the first $k$ positions, so that nobody between $k+1$ and $i-1$ beats $\beta$ first; this happens with probability $\frac{k}{i-1}$.

Hence, for the method to be a success, both these conditions need to be satisfied. As the two events are independent, the probabilities multiply; by the product rule, the probability of success with the best at position $i$ is $W_i B_i$.

$i$ has to occur after $k$, i.e. anywhere between $k+1$ and $n$, so we sum:

$\sum_{i=k+1}^{n} W_i B_i = \sum_{i=k+1}^{n}\frac{k}{i-1}\,\frac{1}{n} = \frac{k}{n}\sum_{i=k}^{n-1} \frac{1}{i}$


We know that the harmonic sum is approximately a logarithm:
$1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots + \frac{1}{n} \approx \ln n$

so the success probability is approximately
$f(k) = \frac{k}{n}\left(\ln n - \ln k\right)$



(Click on the image to get a clear view)

A tangent is drawn at the peak. As the slope is 0, f'(k) is equated to 0 in order to find the value of k.

Setting f'(k) = 0, where

$f'(k) = \frac{\ln n}{n} - \frac{1}{n}\left(k \cdot \frac{1}{k} + \ln k\right)$

we get:

$\frac{\ln n}{n} - \frac{1}{n} - \frac{\ln k}{n} = 0$
$\frac{1}{n}\left(\ln \frac{n}{k} - 1\right) = 0$
$\ln \frac{n}{k} - 1 = 0$
$\ln \frac{n}{k} = 1$
$\frac{n}{k} = e$
$k = \frac{n}{e}$

Wow, is that it?? At $k = \frac{n}{e}$ there is a high probability of the candidate picked this way being the 'Best', which is our aim. You can afford to stop the blind hire-and-fire at $k$ and still not compromise your principle of finding the 'Best'.
If there are $n$ candidates for interview, you can stop the process at $n/e$ and still find the best with high probability; in fact the success probability itself works out to about $1/e \approx 37\%$.
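Here is a small Monte Carlo experiment in Python that you can run to convince yourself; the random scoring scheme and the trial count are arbitrary choices of mine.

import math
import random

def simulate(n=100, trials=10000):
    # The n/e stopping rule: only observe the first k candidates,
    # then hire the first later candidate who beats the best of those k.
    k = int(n / math.e)
    hits = 0
    for _ in range(trials):
        scores = random.sample(range(1000 * n), n)   # n distinct random scores
        beta = max(scores[:k])                       # best seen in the watch phase
        chosen = scores[-1]                          # forced to take the last one otherwise
        for s in scores[k:]:
            if s > beta:
                chosen = s
                break
        hits += (chosen == max(scores))
    return hits / trials

print(simulate())  # roughly 0.37: about 1/e of the runs pick the actual best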


Not convinced?? Let us work out the problem given:
(A Python-generated output :P you can trust this :P)
Work it out on your own and tally the answers given; this way you will get better acquainted with the method.
Let us say there are 100 candidates for the interview (assuming there is only one 'Best'). Their marks are as follows (these marks are 'random'ly generated by Python).

Let's follow Jerry's method, i.e. the hiring and firing process... That way, the total number of hirings is 5. But once you reach '100' you don't know whether that is the best or not; you end up interviewing everyone in the hope of finding someone better than the one you already have.
The main aim of the proposed method is to avoid interviewing everyone and yet find the best one. Basically, what it means is that you can afford to get restless at $k$.
According to the method proposed:

1. k = n/e = 100/2.718 = 36 (represented by the red line)
2. Number of hirings before k = 4
3. The obtained best within k = $\beta$ = 98
4. The 'Best' chosen after k = 100!!!! Which is the actual best!!

*************************************************************************************

Consider this case, where you get a $B_1$ closest to the actual best:
1. k = n/e = 100/2.718 = 36 (represented by the red line)
2. Number of hirings before k = 6
3. The obtained best within k = $\beta$ = 91
4. The 'Best' chosen after k = 98!!!! Which is closest to the actual best (represented with yellow)!!

*************************************************************************************
Worst case: if the actual best occurs within the range 1 to k, then the method fails and you end up with someone far from the best.
1. k = n/e = 100/2.718 = 36 (represented by the red line)
2. Number of hirings before k = 6
3. The obtained best within k = $\beta$ = 100 (Disaster :P)
4. The 'Best' chosen after k = 17 (represented with blue)!!!! If none of the values beats $\beta$, then by default the last value is chosen, to show that you have traversed the complete list.

*************************************************************************************

Probability at its 'Best' ;) I hope you have your eyebrows raised ;) If not, you need to go through it once again ;)

FYI: the class of algorithms where the input is fed in a serial fashion and you don't have prior knowledge of the entire input at processing time is called Online Algorithms. Also, this problem is the same as 'The Secretary Problem'; you can Google it and take a look :) :)

Shruthi R Nayak



FOR THE SAKE OF PANCAKE!

It was 12 in the afternoon and I was badly hungry. "One more hour to the lunch break!" I thought. One of the greatest teachers I strongly admire started off with his lecture on some concepts of theoretical computer science. As expected, we were all drowned in the world of his logic and enjoying his lecture; it was as though my hypothalamus had ceased to function and I forgot I was hungry. After a while he presented before us a problem that reminded me of my hunger once again!

It was about the yummiest south Indian food, the dosa. The problem goes like this. A chef in some famous south Indian diner is preparing lots and lots of dosas, all of different sizes, and is piling them up. But before delivering them, he wants to rearrange them so that the smallest one is on the top of the stack and the largest at the bottom. He has only one spatula; he can insert it anywhere in the stack, lift the portion above it, flip that portion, and place it back on the stack. If there are $n$ dosas, what is the minimum number of flips the chef must perform to rearrange the stack? This is the Indian version of the famous pancake problem.



Jacob Goodman, writing under the name Harry Dweighter (read it aloud: "harried waiter"), proposed the pancake problem in the year 1975. He framed this scenario of a chef rearranging pancakes with one question in mind: the minimum number of flips needed to obtain a correctly arranged stack. One might wonder about the need to formulate such a problem; it has applications in the field of genetics, in sorting out genome sequences.


The simplest solution to this problem requires $2n-2$ flips for $n$ pancakes. This can be explained with an example. The chef can place his spatula below the largest pancake, flip the portion above it (bringing the largest to the top), and then flip the whole unsorted stack so that the largest ends up at the bottom. He can do the same for the second largest, and so on, for $n-1$ pancakes; the last one left will be the smallest, and it is evident that it will be on the top, so no flip is required for it. From the above argument you can see that each pancake is associated with two flips.


$2(n-1)+ 0= 2n-2$


The $0$ in the above equation indicates the number of flips when only one pancake remains ($n=1$); for the remaining $n-1$ pancakes, the number of flips is $2(n-1)$, i.e. $2n-2$ flips.


The above upper bound can be further reduced to $2n-3$. Suppose you have only 2 pancakes: at most one flip is required to rearrange them. So if there are $n$ pancakes, apply the procedure mentioned above to the $n-2$ largest, and this will fetch you:


$2(n-2)+1=2n-3$ flips.


Here the $+1$ is for the $n=2$ condition, and the remaining $n-2$ pancakes, each associated with 2 flips, amount to $2(n-2)$ flips, giving a total of $2n-3$ flips.
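Here is a small Python sketch of the chef's simple strategy just described; representing the stack with index 0 as the top is my own choice, made purely for illustration.

def pancake_sort(stack):
    # Repeatedly flip the largest unsorted pancake to the top,
    # then flip it down into its final place: at most 2n - 3 flips.
    # Assumes all pancake sizes are distinct, as in the story.
    stack = list(stack)        # index 0 is the top of the stack
    flips = 0
    for size in range(len(stack), 1, -1):
        i = stack.index(max(stack[:size]))   # where the largest unsorted pancake sits
        if i != size - 1:                    # already in place? then no flips needed
            if i != 0:
                stack[:i + 1] = reversed(stack[:i + 1])   # spatula below it: bring it to the top
                flips += 1
            stack[:size] = reversed(stack[:size])         # flip it down into place
            flips += 1
    return stack, flips

print(pancake_sort([3, 1, 4, 2]))  # ([1, 2, 3, 4], 4): sorted here with 4 flips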


It occurred to some that the above algorithm could be improved, and attempts were made to do so.


In the year 1979, an algorithm was proposed by the very famous Bill Gates of Microsoft fame (perhaps the only technical paper that Gates ever came up with!) and Christos H. Papadimitriou. Their sorting algorithm established an upper bound of $\frac{5n+5}{3}$ flips, together with a lower bound of $\frac{17n}{16}$ flips.


Terminology required to understand Gates and Papadimitriou's algorithm:



$\sigma$: the permutation of the numbers $1$ to $n$ you are provided with. The numbers in this permutation signify the sizes of the pancakes: the larger the number, the greater the size of the pancake.


For example, $\sigma$ = (5 1 4 2 3) denotes the arrangement of pancakes in which the largest, number 5, is on the top of the stack, the third largest is at the bottom, and so on. The aim is to rearrange it into the identity permutation, $i$ = (1 2 3 4 5).


$S_n$ is the set of all permutations of numbers ranging from $1$ to $n$.



Adjacency: for a given stack of pancakes, an adjacency is an adjacent pair of pancakes in the stack such that no pancake has a size intermediate between the two. If the largest pancake is at the bottom, this also counts as one extra adjacency.


For example, if $\sigma$ = (2 5 3 7 6 1 4), then (7,6) is an adjacency: 7 and 6 are adjacent to each other in the given $\sigma$, and there is no pancake whose size lies between 7 and 6.


Consecutive: if two pancakes are next to each other in the identity permutation, they are said to be consecutive.


For example, given $\sigma$ = (3 5 1 2 4): 1 is consecutive to 2, 2 is consecutive to 3, 3 is consecutive to 4, 4 is consecutive to 5 and, wrapping around, 5 is consecutive to 1.


A block: sticking with the definition in the paper, a block is a maximal-length sublist $x=\sigma_i \sigma_{i+1}\ldots\sigma_{j-1} \sigma_j=y$ such that there is an adjacency between $\sigma_p$ and $\sigma_{p+1}$ for all $i\leq p < j$.


End points: the first and last elements of a block are called its end points.


Singleton: an element that does not belong to any block is called a singleton.


For example, consider $\sigma$ = (3 4 5 1 6 7 2):


In the above permutation:




  1. (3 4 5) is a block, since each pair of neighbouring elements forms an adjacency, and so is (6 7), which is an adjacency as well as a block; indeed, it's evident that every adjacency is also a block.

  2. 1 and 2 are singletons, since they do not belong to any block. (A small helper for counting adjacencies is sketched right after this list.)
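Here is the small helper mentioned above, a Python sketch that counts adjacencies under these definitions; note that for a permutation of 1..n, "no pancake of intermediate size" simply means the two sizes differ by exactly 1.

def adjacencies(sigma):
    n = len(sigma)
    # neighbouring pairs whose sizes differ by exactly one
    count = sum(1 for a, b in zip(sigma, sigma[1:]) if abs(a - b) == 1)
    if sigma[-1] == n:         # largest pancake at the bottom: one extra adjacency
        count += 1
    return count

print(adjacencies((2, 5, 3, 7, 6, 1, 4)))  # 1 -- only the pair (7, 6)
print(adjacencies((3, 4, 5, 1, 6, 7, 2)))  # 3 -- (3,4), (4,5) and (6,7)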
A table of cases is given in the paper linked below; the case numbers used in the walkthrough that follows refer to that table.
http://etd.ohiolink.edu/send-pdf.cgi/Armstrong%20Alyssa.pdf?wuhonors1242223287

In the following, $B$ denotes the initial element of the permutation, i.e. the leftmost one, and the separation between blocks and singletons is marked by a dash. If the permutation begins with a block, we denote that block by $B\sim C$. $A$ denotes the element consecutive to $B$ (other than the one already next to it in the block), and $D$ denotes the element consecutive to the rightmost element of the initial block. All remaining elements, blocks and singletons alike, are denoted by dashes.

Example: let $\sigma=(3 4 1 5 2)$:




  • B=3

  • ( 3 4) is a block denoted by B$\sim$C.

  • A=2, since the element consecutive to 3 ( other than 4 which is already in the block) is 2.

  • D=5, since 4 is the rightmost element of the block and 5 is consecutive to 4 which is also a singleton.

  • So we denote this permutation as B$\sim$C$-$D$-$A.

Having these definitions in mind, we can proceed with the following example:


Let $\sigma \in S_7$, i.e., $\sigma$ = (2 3 4 7 6 1 5), and the identity permutation $i_7$ = (1 2 3 4 5 6 7):


According to Gates' algorithm:




  • The permutation begins with the block (2 3 4), and 2 is consecutive to 1, which is also a singleton. Hence, by case 4, we have
    B=2, A=1
    So, we reverse at 6 to get ( 6 7 4 3 2 1 5).

  • Now the permutation begins with the block (6 7), and 6 is consecutive to 5, which is a singleton. So, by case 4:
    B=6, A=5:
    We reverse at 1 to get ( 1 2 3 4 7 6 5).

  • Now, the initial block is ( 1 2 3 4) and 1 is consecutive to 7 of the block ( 7 6 5). By case 5 of the table,
    B=1, A=7
    So, we reverse at 4 to get ( 4 3 2 1 7 6 5).

  • Now we have got a single block; in such cases we go for the trivial algorithm:
    The largest is 7, so reverse at 7 to get ( 7 1 2 3 4 6 5) and further reverse at 5 to get ( 5 6 4 3 2 1 7).

  • Again we have got a single block, apply the trivial algorithm,
    Reverse at 6 to get ( 6 5 4 3 2 1 7) and at 1 to get (1 2 3 4 5 6 7).
Thus, we reach the identity permutation after 7 reversals.

Using this algorithm, Gates and Papadimitriou showed an upper bound of about $1.667n$ flips. In 1997, Mohammad H. Heydari and I. Hal Sudborough improved the lower bound to $\frac{15n}{14}$ and computed the pancake numbers up to $n=13$.



The above image shows the team of UT Dallas computer science students and their faculty adviser who improved upon this longstanding solution to the pancake problem.


An interesting variant of the problem, also proposed by Gates and Papadimitriou, concerns pancakes that are burnt on one side: here the pancakes must be sorted with the burnt side down.


Applications:



  1. The pancake problem is relevant to the construction of networks of parallel processors, in which pancake sorting can provide an effective routing algorithm between the processors.


  2. Analysis of genomes: in evolutionary processes, changes in DNA sequences can cause a new species to separate from an existing one, resulting in diversified life forms. If you want to reconstruct such changes, you can look at the genome sequences of related species and search for events where a certain sequence was deleted and replaced with its reverse. Analyzing the transformation from one species to the other is thus analogous to the problem of finding the shortest series of reversals.








KMP











Legendary computer scientist Donald Knuth, the 'K' of KMP.
"People think that computer science is the art of geniuses, but the actual reality is the opposite: just many people doing things that build on each other, like a wall of mini stones." - Knuth
I never really gave a thought to string matching until, of course, I opened the 1313-page Cormen ebook and pressed Ctrl+F to search for "String Matching" :P. When you discover things like these, you automatically start looking for them. So the next time I used the search option to look for a mail, or used the find-and-replace button, or wondered what PageRank would do if there had never been an algorithm for searching strings, KMP was always on my mind :-)
String matching: the name says it all. It is simply finding the occurrences of a pattern in a given text. Why do we need this? Besides the things already mentioned, string matching also finds application in the identification of patterns in a DNA sequence.
They say a problem well formulated is half solved. Keeping the solving part aside for a while, let's concentrate on the formulating part. We assume that the text is an array $T[1 .. n]$ of length $n$ and that the pattern is an array $P[1 .. m]$ of length $m$.
We say that the pattern P occurs with shift $s$ in the text T if

$P[1 .. m] = T[s+1 .. s+m]$

If P occurs with shift $s$ in T, we call $s$ a valid shift; otherwise we call $s$ an invalid shift. The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.
For most problems there is always a naive method of solving them; one can call it brute force. But why call it naive? It's just the easiest way out, and not necessarily (most of the time :P) the best. Using this method, to find the occurrences of the pattern we check all possible shifts: at each shift we compare characters one by one until a mismatch occurs, and if all characters match, we report the match.
Take the text T = "Google goggles" and the pattern P = "goggles". We can easily instruct the computer to do the following, sliding the pattern along one position at a time:

Shift 1 - Google goggles
          goggles
Shift 2 - Google goggles
           goggles
Shift 3 - Google goggles
            goggles
Shift 4 - Google goggles
             goggles
Shift 5 - Google goggles
              goggles
Shift 6 - Google goggles
               goggles
Shift 7 - Google goggles
                goggles
Shift 8 - Google goggles
                 goggles

Thus we say the pattern occurs at shift 8.
So what are we actually doing? Simply trying to match the pattern at every possible shift. The algorithm is simple enough.

NAIVE-STRING-MATCHER(T, P)
1. n $\leftarrow$ T.length // store the lengths
2. m $\leftarrow$ P.length
3. for s $\leftarrow$ 0 to n − m // all possible shifts
4. $\,$if P[1 .. m] == T[s+1 .. s+m]
5. $\,$$\,$print "Pattern occurs with shift" s
Just take a moment and think: what can be the worst case? If you can't think of one, take this example, modifying the first one a tiny bit:

T = Google google
P = Googles

In this case only the first shift changes. During the first shift you happily go on comparing as character after character matches, only to be disappointed at the end when the last character turns out to be a mismatch ('s' is matched against a space). *Damn!* We can thus conclude that the naive pattern-searching algorithm does not work well in cases where we see many matching characters followed by a mismatching character.
Hence, in the worst case we go through all the shifts, which is $n-m+1$, and at each shift we end up comparing all the characters, which is $m$. Hence the time complexity of this algorithm is $O((n-m+1)m)$.
In 1987, two computer scientists fresh from their Turing Awards, Michael Rabin and Richard Karp, looked at this problem and gave an altogether different approach. Their main idea was to represent the data in a different way, using powerful computer science techniques: preprocessing and hashing. They converted both the pattern and the substrings of the text into numbers, so the task of string matching became simply checking equality between numbers. The method matches the hash value of the pattern against the hash value of the current substring of the text, and only when the hash values match does it start matching individual characters. Though the problem seems simplified after the preprocessing, the main hinge is the time taken to generate the hash values for the substrings of the text; a rolling hash lets each new window's value be computed from the previous one in constant time. Nevertheless, its average-case time complexity is better, though the worst case turns out to be the same as the naive method's.
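Here is a Python sketch in the spirit of their method, the Rabin-Karp algorithm, using a simple rolling hash; the base and modulus below are illustrative choices of mine, not values prescribed by the paper.

def rabin_karp(text, pattern, base=256, mod=101):
    n, m = len(text), len(pattern)
    if m > n:
        return []
    h = pow(base, m - 1, mod)        # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):               # preprocessing: hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    shifts = []
    for s in range(n - m + 1):
        # compare characters only when the cheap hash comparison succeeds
        if p_hash == t_hash and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                # roll the window one character to the right
            t_hash = ((t_hash - ord(text[s]) * h) * base + ord(text[s + m])) % mod
    return shifts

print(rabin_karp("Google goggles", "goggles"))  # [7]: 0-indexed, i.e. shift 8 above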
How can we say there is scope for improvement over the naive method?
From our example we know that at shift 1 the first two characters "Go" match; can you not then say confidently that shift 2 won't work? And this was just a small example. When dealing with thousands of words to match a pattern against, you can imagine the information wasted every time a mismatch occurs using this method. The question that pops up now is whether it is possible to skip some shifts when we are confident enough, from the previous shifts, that they won't work.
You must have heard the proverb, "The only real mistake is the one from which we learn nothing"; in this case it should be, "The only real mistake is the one from which we learn nothing from a mismatch" :P.
It is this ingenious idea that led to what we call KMP.

It was first published by Donald E. Knuth, James H. Morris and Vaughan R. Pratt as "Fast Pattern Matching in Strings," in the SIAM Journal on Computing.
So how do we go about formulating this idea? What they did was just this: given that pattern characters P[1 .. q] match text characters T[s+1 .. s+q],

what is the least shift s' > s such that, for some k < q,
P[1 .. k] = T[s'+1 .. s'+k]?
You first take the pattern and calculate the value of $k$ for every case: when only the first character matches, when the first two characters match, when the first three match, and so on, up to when all the characters match. We need to store this information for reference; since we are not going to modify the data and access should be fast, we go for the array data structure. To make it appear more refined, let's call this array the prefix function. Come on, let's calculate the prefix function for the following.
P = goggles

A[1] = the case where only the first character has matched. We simply shift by one position:
$q-k=1$, $1-k=1$, so $k=0$.

A[2] = the first 2 characters "go" have matched. Now we are sure that shifting by one position won't work, as g would be matched with o:
$2-k=2$, so $k=0$.
A[3] = "gog" has matched. Though shifting by one won't work, shifting by 2 may, as it matches g with g:
$3-k=2$, so $k=1$.
Similarly,
A[4]=1
A[5]=0
A[6]=0
A[7]=0
Reworking the previous example (according to the algorithm below):
Shift 1 - Google goggles
          goggles
2 characters ("Go") match before the mismatch; A[2] = k = 0,
so shift by $q-k$ positions, where $q$ is the number of characters matched:
2 − 0 = 2 positions, skipping shift 2 entirely.
Shift 3 - no characters match; simply move to the next shift.
Shift 4 - 1 character ('g') matches, and A[1] = 0, so shift by 1 − 0 = 1.
Shift 5 - no characters match; move on.
Shift 6 - no characters match; move on.
Shift 7 - no characters match; move on.
Shift 8 - Google goggles
                 goggles
The whole pattern matches, and we are done :)
We can infer one thing: the best case is when $k$ turns out to be 0. Say $q$ characters have matched before the mismatch and $k$ turns out to be zero; then we simply skip the shifts s+1, ..., s+q−1. Look at shift 1 in the above example, for instance.
Given the trivial example above, one may argue that both methods are much the same; but imagine the larger scenario, where we skip all the wasteful shifts.
There are essentially two parts to the algorithm: one does the necessary preprocessing, the other the matching task.

KMP-MATCHER(T, P)
1. n $\leftarrow$ T.length
2. m $\leftarrow$ P.length
3. A $\leftarrow$ COMPUTE-PREFIX-FUNCTION(P) // returns the array containing the values of k
4. q $\leftarrow$ 0 // number of characters matched
5. for i $\leftarrow$ 1 to n // scan the text from left to right
6. $\,$while q > 0 and P[q+1] $\neq$ T[i]
7. $\,$$\,$q $\leftarrow$ A[q] // next character does not match
8. $\,$if P[q+1] == T[i]
9. $\,$$\,$q $\leftarrow$ q + 1 // next character matches
10. $\,$if q == m // is all of P matched?
11. $\,$$\,$print "Pattern occurs with shift" i − m
12. $\,$$\,$q $\leftarrow$ A[q] // look for the next match

COMPUTE-PREFIX-FUNCTION(P)
1. m $\leftarrow$ P.length
2. let A[1 .. m] be a new array
3. A[1] $\leftarrow$ 0
4. k $\leftarrow$ 0
5. for q $\leftarrow$ 2 to m
6. $\,$while k > 0 and P[k+1] $\neq$ P[q]
7. $\,$$\,$k $\leftarrow$ A[k]
8. $\,$if P[k+1] == P[q]
9. $\,$$\,$k $\leftarrow$ k + 1
10. $\,$A[q] $\leftarrow$ k
11. return A
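And here is a runnable Python rendering of the two procedures above, a sketch using 0-based indexing as Python prefers.

def compute_prefix_function(P):
    m = len(P)
    A = [0] * m               # A[q] = length of the longest proper prefix of P
    k = 0                     # that is also a suffix of P[:q+1]
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:
            k = A[k - 1]
        if P[k] == P[q]:
            k += 1
        A[q] = k
    return A

def kmp_matcher(T, P):
    A = compute_prefix_function(P)
    shifts, q = [], 0                  # q = number of pattern characters matched so far
    for i, c in enumerate(T):
        while q > 0 and P[q] != c:
            q = A[q - 1]               # mismatch: fall back, reusing what already matched
        if P[q] == c:
            q += 1                     # next character matches
        if q == len(P):                # all of P matched
            shifts.append(i - len(P) + 1)
            q = A[q - 1]               # keep looking for further occurrences
    return shifts

print(kmp_matcher("google goggles", "goggles"))  # [7]: 0-indexed, i.e. shift 8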

In the first part we take the pattern and compute the values needed; this is done in O(m) time.

Now give a thought to the second component. We learn from each mismatch and shift accordingly. What happens when the worst-case scenario of the naive method occurs, i.e. almost all but the last character of the pattern match? We shift by the ideal amount, without wasting time over useless shifts. Thus the matching can be done in at most O(n) time.
Total time complexity: O(m+n), compared to the worst case of the naive method, i.e. O(mn).
Reducing quadratic to linear time!!! Isn't that awesome :-)
A great idea, and the world follows; that's just what happened here. If you think even you would have thought that way, then go: there are plenty of research problems out there waiting to be solved, and if you succeed, who knows, I may end up blogging about your algorithm too one day :)


Nikitha Shenoy

Thursday, July 28, 2011

Shamir's Secret Sharing


Introduction:
In the year 1979, an article appeared in volume 22, issue 11 of the journal "Communications of the ACM". It was written by Adi Shamir of the Massachusetts Institute of Technology and was titled "How to Share a Secret". The article described a secret sharing scheme which divides a secret into parts and distributes these parts to the secret-keepers.
Over the years, this scheme has been the focal point of many cryptographic techniques, and many new schemes have been developed based on the idea. The scheme is particularly fascinating because, unlike schemes that rest on computational hardness problems (such as the RSA encryption scheme), its security does not depend on any hard problem at all.
The general idea of Shamir's secret sharing!
Data security is one of the most important and essential fields today. Why is data security necessary?
Consider this example.
There is a strong room in a bank, and it needs security. The bank has three managers, and the lock of the strong room must be designed so that if any two of the three managers come together, they are able to open the room. The idea here is that security is greater when two keys are required to open the lock rather than just one.
                                             
In how many different combinations can the managers open the room, or in other words, how many different locks are required for the strong room? There are 3 managers and any two can open the door, hence there are $\binom{3}{2} = 3$ possibilities. If the number of managers is small, this method can easily be implemented.
In the original article, the author uses the analogy of 11 scientists working on a secret project, faced with the problem of locking their findings in a cabinet. The scheme should be such that any 6 or more scientists can open the cabinet at any given time. It is not hard to show that the minimal solution uses 462 locks and 252 keys per scientist. These numbers are clearly impractical as far as implementation is concerned, and worse, they grow exponentially as the number of participants increases.
Secret sharing principle:
  • Let the number of people be n.
  • Let the number of people required to open the door be k.
  • The idea is to choose a polynomial whose degree is k-1.
  • Points on the polynomial (not its roots) are distributed as the keys.
  • The polynomial is reconstructed to open the door.
  • n keys are generated from the polynomial; each key is a point on the polynomial.
  • For example, consider a straight line y = ax + b. Two points are enough to find the equation of the straight line; yet any point on the line can be distributed as a key.
  • Generalizing the idea: for a polynomial of degree k-1, k points are sufficient to obtain the polynomial, but any n points on it can be distributed as keys.
  • Lagrange's interpolation method is used to reconstruct the polynomial.
Mathematical representation:
Formally, our goal is to divide D (the secret) into n pieces $D_1, D_2, \ldots, D_n$ in such a way that:
  1. Knowledge of any k or more $D_i$ pieces makes D easily computable.
  2. Knowledge of any (k-1) or fewer $D_i$ pieces leaves D completely undetermined.
This scheme is called a (k, n) threshold scheme.
The idea behind this secret sharing scheme is that 2 points are sufficient to define a straight line, 3 points are sufficient for a parabola, 4 points for a cubic curve, and so on: it takes k points to define a polynomial of degree (k-1). Say F is a finite field and $a_0, a_1, a_2, \ldots, a_{k-1}$ are coefficients:
$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{k-1} x^{k-1}$
Let us construct any n points from the function f(x): for instance, take i = 1, ..., n and retrieve the points (i, f(i)). Every participant is given one point (a pair of an input to the polynomial and the corresponding output). Given any subset of k of these points, the coefficients of the polynomial can be calculated using interpolation; the secret is the constant term $a_0$.
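To make this concrete, here is a minimal Python sketch of a (k, n) threshold scheme along these lines; the prime, names and API are my own illustrative choices, and a real deployment should use a vetted library.

import random

PRIME = 2 ** 127 - 1   # all arithmetic is done modulo a large prime (a finite field)

def make_shares(secret, k, n):
    # Hide the secret as the constant term a0 of a random degree-(k-1) polynomial,
    # then hand out n points on that polynomial as the shares.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def recover(shares):
    # Lagrange interpolation evaluated at x = 0 recovers the constant term a0.
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = make_shares(secret=123456789, k=3, n=5)
print(recover(shares[:3]))  # any 3 of the 5 shares reconstruct 123456789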

Shamir's secret sharing scheme is very secure, dynamic and extensible: participants can be added or deleted whenever necessary. This makes it one of the most robust and secure methods available for situations involving many participants and one secret!!