Wednesday, February 20, 2013

[leetcode] Implement strStr() | String matching algorithms

Problem:
Returns a pointer to the first occurrence of needle in haystack, or null if needle is not part of haystack.
e.g. haystack=>"mississippi", needle=>"issip".  return "issippi"

This is not a difficult problem. A first thought of brute force algorithm is easy to come up with. However, it gets very tricky if we want to improve the efficiency.

Knuth-Morris-Pratt(KMP) Algorithm
See details at wiki and here
public class StrStr {

	public String strKMP(String haystack, String needle){
		if(needle.length()==0)
			return haystack;
		//get next array
		int[] next = generateNext(needle.toCharArray());
		int len = 0;     // Number of characters matched            
        
        for( int i=0; i<haystack.length(); i++ ){            
        	while( len>0 && needle.charAt(len)!= haystack.charAt(i) )            
        		len = next[len];            
            if( needle.charAt(len) == haystack.charAt(i) )            
                len++;            
            if( len == needle.length() )            
                return haystack.substring(i-needle.length()+1);
        }      
        
        return null;
	}
	/**
	 * KeyPoint. Get next array.
	 * @param needle
	 * @return
	 */
	private int[] generateNext(char[] needle){
		//next array. next[i] means when match length = i, the common prefix and real suffix length = next[i] 
		int[] next = new int[needle.length+1];
		//the longest common length of  prefix and real suffix.
		int presufix = 0;
		//next[1] = 0, starting from 2. i = match length.
		for(int i=2;i<next.length;i++){
			while(presufix>0 && needle[presufix]!=needle[i-1])//trickiest part
				presufix=next[presufix];
			if(needle[presufix]==needle[i-1])
				presufix++;
			next[i]=presufix;
		}
		return next;
	}

}

Rolling Hash Algorithm
See details at wiki. An important application of rolling hash is Rabin-Karp Algorithm. While Rabin-Karp algorithm is a good choice for multiple pattern search , I borrowed the idea to solve this problem. Although it is slower than KMP algorithm in worst case in terms of time complexity, it is much easier to understand than KMP.
The hash function H is:


Where a is constant and c1..ck are input chars.
Here's a simple example of rolling hash.
Input: char array A[1...10]. Length k=5.
H[1..5] = A[1]*a^4 + A[2]*a^3 + A[3]*a^2 + A[4]*a^1 + A[5]*a^0
Then how can we compute H[2..6] ?
Instead of computing it from scratch, we can do as following based on H[1..5]
1. Remove first item of H[1..5] :  H[2..6] = H[1..5] - A[1]*a^4
2. Times constant a                  :  H[2..6] *= a
3. Add  last item A[6]              :  H[2..6] += A[6]

Based on this idea, I coded a very naive implementation to solve the problem. Note that to avoid manipulate huge H values, all maths are done by mod in reality.
public class StrStrRollingHash {
	public static final int A = 3;
	
	public String strStr(String haystack, String needle) {
		int n = haystack.length();
		int m = needle.length();
		if(m==0)
			return haystack;
        if(n<m)
            return null;
		
		char[] input = haystack.toCharArray();
		int needleHash = getHash(needle.toCharArray(),0,m-1);
		int haystackHash = getHash(input,0,m-1);
		
		for(int i=0;i<n-m;i++){
			if(haystackHash==needleHash){
				if(haystack.substring(i, i+m).equals(needle))
					return haystack.substring(i);
			}
			//rolling hash
			haystackHash-=(int)(input[i]*Math.pow(A, m-1));//remove i th item
			haystackHash*=A;//times A
			haystackHash+=input[i+m];//add i+m th item	
		}
		
		if(haystackHash==needleHash){
			if(haystack.substring(n-m, n).equals(needle))
				return haystack.substring(n-m);
		}
		
		return null;
        
    }
	
	private static int getHash(char[] array,int start,int end){
		int k=end-start;
		int sum=0;
		for(int i=start;i<=end;i++){
			int item = (int)(array[i]*Math.pow(A,k--));
			sum+=item;
		}
		return sum;
	}
}


You can try this problem on leetcode.
There's another related Boyer-Moore Algorithm, I will study it when get free.

Monday, February 18, 2013

交换使两个序列差最小| Two Equal-sized Subsets Partition Problem

题目: 有两个序列a,b,大小都为n,序列元素的值任意整数,无序;
要求:通过交换a,b中的元素,使[序列a元素的和]与[序列b元素的和]之间的差最小。


这是微软的一道面试题,也在July整理的面试100题之中。本来我想了一下,没有什么特别好的方案,于是百度之。结果百度第一页出来的答案,都一样就算了,还是错的。吐槽无力, 然后就有了这篇中文文章。

首先很容易想到,这问题的本质,就是将一个长度为2n的数组,分成长度为n的两个子数组,使它们和的差值最小。这是数组分隔问题的一个特例。见Partition Problem。在原问题中,原数组允许被分为两个任意长度的子数组(只要它们合起来为2n)。然而有了子数组长度必须为n这个条件后,这个问题难度并没有降低--它还是一个NP问题。证明见这里

网上搜出来的很多解法,其实不能说全错。但它们不能保证始终得到最优解。大多数方法用的都是贪婪算法的思想,然而贪婪算法是以局部最优解去揣测全局最优解--可能成功,也可能失败。贪婪算法获取全局最优解的典型例子比如Dijkstra最短路径算法以及求最小生成树的kruskal算法。 但其有效性需要严格的证明,当然,证明其失败的话,一个反例即可。网上的大多数算法,不能保证总是最优解,但有一些确实是优秀的渐近算法,能保证结果于最优解的差距在一定范围以内。关于这些算法可以参考这里

那么,如何得到最优解?
1. 穷举法。
天下武功,唯快不破。算法么,只要够暴力,总是能解的。- -!
问题本质是从2n个元素中取出n个元素,使其和最接近于Sum/2,其中Sum为2n个元素的总和。
所以只要枚举出2n个元素中n个元素的所有组合,迭代更新最接近的值,总和得到最好的解。当然,这个算法的时间复杂度就是阶乘级别的了,不过对于一个NP问题,这也不丢脸呵。

2. DP 算法。
详见编程之美209页。
public class PartitionArray {
 
 /**
  * @param array:
  *        assuming that the input array is the combination of two n-sized array.
  * @return int :
  *        minimal difference of the two n-sized subsets.
  */
 public static int minimalSum (int[] array){
  int total = sum(array);
  int sum = total/2;
  int n=array.length/2;
  
  //dp[i][v] = true means that there's i elements added up to value v.
  boolean[][] dp = new boolean[n+1][sum+1];
  dp[0][0]=true;
  
  for(int k=1;k<=2*n;k++){
   int val = array[k-1];
   for(int i=1;i<=k&&i<=n;i++){
    for(int v=1;v<=sum;v++){
     //try to take k as i th element
     if(v>=val && dp[i-1][v-val])
      dp[i][v]=true;
    }
   }
  }
  //find the answer. we need n-sized subset, so just check dp[n][*].
  for(int i=sum;i>0;i--){
   if(dp[n][i])
    /**
     * we find a subset with size=n and sum=i,another subset will have sum=total-i
     * so the difference will be (total-i)-i=total-2*i
     */
    return total-2*i;
  }
  return -1; 
 }
 
 private static int sum(int[] array){
  int sum=0;
  for(int i:array)
   sum+=i;
  return sum;
 }
}
当数组之和非常大的时候,DP算法的数组就会非常大,就不那么高效了。
对于一些极端例子,比如A=[ 12345,12332,14098], B=[ 32213,12321,23132]. DP还不如暴力破解。C(6,3) = 120。 总共才120种可能。

Tuesday, February 5, 2013



     

   Talk is cheap. Show me the code.


                                   --Linus Torvalds in Linux-kernel mail list                          

Monday, February 4, 2013

Notes on LCS, LIS, and related problems

LCS (Longest Common Subsequence) is a well-known dynamic programming(DP) problem. I learned it several times, but sadly I forgot it several times as well. These kinds of problems are not that straightforward for someone first encounters it. While I'm learning(to be more specific, re-learning) these related problems recently, I still haven't found any short path to master DP algorithms. Personally, I think completely understanding more and more problems will enlarge our knowledge to solve other similar problems in the future. And of course, coding is inevitable and essential.

Circus Tower Sorting Problem
There are many interesting problems related to LCS problem. For example:
A circus is designing a tower routine consisting of people standing atop one another’s shoulders. For practical and aesthetic reasons, each person must be both shorter and lighter than the person below him or her. Given the heights and weights of each person in the circus, write a method to compute the largest possible number of people in such a tower.
EXAMPLE:
Input (ht, wt): (65, 100) (70, 150) (56, 90) (75, 190) (60, 95) (68, 110)
Output: The longest tower is length 6 and includes from top to bottom: (56, 90) (60,95) (65,100) (68,110) (70,150) (75,190)

This problem can be easily converted to LCS problem and therefore solution is quite simple:
  1. Sort the data according to height. Denote it as D1. That is O(nlogn).
  2. Sort the data according to weight. Denote it as D2. That is O(nlogn).
  3. Find the length of the LCS of D1 and D2. Using dynamic programming, it’s O(n^2).
Given a good understanding of LCS problem, this problem can be solved quickly, though it's tricky.
  1. Sort the Sequence in increasing order. Denote it as S'. That is O(nlogn).
  2. Find the length of the LCS of original sequence S and S'. Using dynamic programming, it’s O(n^2).
This reveals the overall O(n^2) time complexity solution. Another  O(n^2) DP solution is shown here. There's exists O(nlogn) solution to solve it and that's not hard either. They can be found on wiki and a good example is shown on a Chinese blog.

The boxes in this particular problem can be rotated and multiple instance of same box is allowed. This problem can be solved based on the above LIS problem.

There's another box stacking problem which is slightly different from the above one in Cracking the Coding Interview 5th Edition Chapter 9.10 :

You have a stack of n boxes, with width w(i), depth d(i), and height h(i). The boxes cannot be rotated and can only be stacked on top of one another if each box in the stack is strictly larger than the box above it in width, height, and depth. Implement a method to build the tallest stack possible, where the height of a stack is the sum of the heights of each box.

My initial thought is to adapt a similar approach as the Circus Tower problem. However, sorting the boxes in width, depth and height separately and find their LCS is kind of 3-Dimension DP, which is much more complicated. Actually I don't if it will truly work.

While answer in the book is taking the similar idea of solving LIS problem, I don't understand it well, and I don't like the code either. So I just put my idea here:

1. Sort the boxes on increasing order of height.
2. Let:
H(i) = Maximum possible Stack Height with box i at bottom of stack
H(i) = { Max ( H(j) ) + height(i) }
           where j is in [1..i-1] and width(j) < width(i) and depth(j) < depth(i) and height(j) < height(i)
           If there is no such j then H(i) = height(i).
3. Scan H(1) to H(n). Find the max as the answer.

I may put the code later.