Wednesday, February 20, 2013

[leetcode] Implement strStr() | String matching algorithms

Problem:
Returns a pointer to the first occurrence of needle in haystack, or null if needle is not part of haystack.
e.g. haystack=>"mississippi", needle=>"issip".  return "issippi"

This is not a difficult problem. A first thought of brute force algorithm is easy to come up with. However, it gets very tricky if we want to improve the efficiency.

Knuth-Morris-Pratt(KMP) Algorithm
See details at wiki and here
public class StrStr {

	public String strKMP(String haystack, String needle){
		if(needle.length()==0)
			return haystack;
		//get next array
		int[] next = generateNext(needle.toCharArray());
		int len = 0;     // Number of characters matched            
        
        for( int i=0; i<haystack.length(); i++ ){            
        	while( len>0 && needle.charAt(len)!= haystack.charAt(i) )            
        		len = next[len];            
            if( needle.charAt(len) == haystack.charAt(i) )            
                len++;            
            if( len == needle.length() )            
                return haystack.substring(i-needle.length()+1);
        }      
        
        return null;
	}
	/**
	 * KeyPoint. Get next array.
	 * @param needle
	 * @return
	 */
	private int[] generateNext(char[] needle){
		//next array. next[i] means when match length = i, the common prefix and real suffix length = next[i] 
		int[] next = new int[needle.length+1];
		//the longest common length of  prefix and real suffix.
		int presufix = 0;
		//next[1] = 0, starting from 2. i = match length.
		for(int i=2;i<next.length;i++){
			while(presufix>0 && needle[presufix]!=needle[i-1])//trickiest part
				presufix=next[presufix];
			if(needle[presufix]==needle[i-1])
				presufix++;
			next[i]=presufix;
		}
		return next;
	}

}

Rolling Hash Algorithm
See details at wiki. An important application of rolling hash is Rabin-Karp Algorithm. While Rabin-Karp algorithm is a good choice for multiple pattern search , I borrowed the idea to solve this problem. Although it is slower than KMP algorithm in worst case in terms of time complexity, it is much easier to understand than KMP.
The hash function H is:


Where a is constant and c1..ck are input chars.
Here's a simple example of rolling hash.
Input: char array A[1...10]. Length k=5.
H[1..5] = A[1]*a^4 + A[2]*a^3 + A[3]*a^2 + A[4]*a^1 + A[5]*a^0
Then how can we compute H[2..6] ?
Instead of computing it from scratch, we can do as following based on H[1..5]
1. Remove first item of H[1..5] :  H[2..6] = H[1..5] - A[1]*a^4
2. Times constant a                  :  H[2..6] *= a
3. Add  last item A[6]              :  H[2..6] += A[6]

Based on this idea, I coded a very naive implementation to solve the problem. Note that to avoid manipulate huge H values, all maths are done by mod in reality.
public class StrStrRollingHash {
	public static final int A = 3;
	
	public String strStr(String haystack, String needle) {
		int n = haystack.length();
		int m = needle.length();
		if(m==0)
			return haystack;
        if(n<m)
            return null;
		
		char[] input = haystack.toCharArray();
		int needleHash = getHash(needle.toCharArray(),0,m-1);
		int haystackHash = getHash(input,0,m-1);
		
		for(int i=0;i<n-m;i++){
			if(haystackHash==needleHash){
				if(haystack.substring(i, i+m).equals(needle))
					return haystack.substring(i);
			}
			//rolling hash
			haystackHash-=(int)(input[i]*Math.pow(A, m-1));//remove i th item
			haystackHash*=A;//times A
			haystackHash+=input[i+m];//add i+m th item	
		}
		
		if(haystackHash==needleHash){
			if(haystack.substring(n-m, n).equals(needle))
				return haystack.substring(n-m);
		}
		
		return null;
        
    }
	
	private static int getHash(char[] array,int start,int end){
		int k=end-start;
		int sum=0;
		for(int i=start;i<=end;i++){
			int item = (int)(array[i]*Math.pow(A,k--));
			sum+=item;
		}
		return sum;
	}
}


You can try this problem on leetcode.
There's another related Boyer-Moore Algorithm, I will study it when get free.

No comments:

Post a Comment