Knuth-Morris-Pratt (KMP) Algorithm in Java: A Beginner’s Guide

Continuing our exploration of string pattern matching, we now delve into the Knuth-Morris-Pratt (KMP) algorithm. The KMP algorithm is a clever approach that efficiently searches for a pattern within a text by avoiding unnecessary comparisons. This blog post will explain the KMP algorithm’s logic and provide a step-by-step Java implementation.


Introduction to Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm is designed to efficiently search for occurrences of a “pattern” string within a “text” string. It avoids re-checking characters by pre-processing the pattern to create an auxiliary array, known as the Longest Prefix Suffix (LPS) array. This array helps skip unnecessary comparisons during the search.

How It Works

  1. LPS Array: Preprocess the pattern to create the LPS array that indicates the length of the longest proper prefix which is also a suffix.
  2. Pattern Matching: Use the LPS array to skip characters in the text, allowing the search to continue from a position that guarantees no redundant comparisons.

Step-by-Step Execution of KMP Algorithm

1. Build the LPS Array

The LPS array is the cornerstone of the KMP algorithm. It helps to avoid repeating comparisons by storing the lengths of the longest proper prefix which is also a suffix for each substring of the pattern.

2. Pattern Matching Using LPS

Use the LPS array to shift the pattern efficiently whenever a mismatch occurs. This allows skipping over sections of the text that have already been matched with parts of the pattern.


Java Implementation of KMP Algorithm

Here’s a complete Java program implementing the KMP algorithm:

import java.util.ArrayList;
import java.util.List;

public class KMP {
    public static List<Integer> kmpSearch(String text, String pattern) {
        List<Integer> result = new ArrayList<>();
        int[] lps = computeLPSArray(pattern);
        int i = 0, j = 0;
        int n = text.length();
        int m = pattern.length();

        while (i < n) {
            if (pattern.charAt(j) == text.charAt(i)) {
                i++;
                j++;
            }

            if (j == m) {
                result.add(i - j); // Record the starting index of the match
                j = lps[j - 1];
            } else if (i < n && pattern.charAt(j) != text.charAt(i)) {
                if (j != 0) {
                    j = lps[j - 1];
                } else {
                    i++;
                }
            }
        }
        return result;
    }

    // Compute the LPS array for the pattern
    private static int[] computeLPSArray(String pattern) {
        int m = pattern.length();
        int[] lps = new int[m];
        int length = 0;
        int i = 1;
        lps[0] = 0;

        while (i < m) {
            if (pattern.charAt(i) == pattern.charAt(length)) {
                length++;
                lps[i] = length;
                i++;
            } else {
                if (length != 0) {
                    length = lps[length - 1];
                } else {
                    lps[i] = 0;
                    i++;
                }
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        String text = "ababcabcabababd";
        String pattern = "ababd";
        List<Integer> positions = kmpSearch(text, pattern);

        System.out.println("Pattern found at positions: " + positions); // Output: [10]
    }
}

Detailed Explanation

1. computeLPSArray Function

This function pre-processes the pattern to create the LPS array, which indicates the longest proper prefix that is also a suffix.

private static int[] computeLPSArray(String pattern) {
    int m = pattern.length();
    int[] lps = new int[m];
    int length = 0;
    int i = 1;
    lps[0] = 0; // LPS for the first character is always 0

    while (i < m) {
        if (pattern.charAt(i) == pattern.charAt(length)) {
            length++;
            lps[i] = length;
            i++;
        } else {
            if (length != 0) {
                length = lps[length - 1];
            } else {
                lps[i] = 0;
                i++;
            }
        }
    }
    return lps;
}

Explanation

  • Initialization: Start with the first character, with the LPS value as 0.
  • Matching Characters: If characters match, increase the length and set LPS value.
  • Mismatch Handling: If characters mismatch, fall back to the previous LPS value or set it to 0 if no proper prefix is found.

2. kmpSearch Function

This function uses the LPS array to efficiently search for the pattern in the text.

public static List<Integer> kmpSearch(String text, String pattern) {
    List<Integer> result = new ArrayList<>();
    int[] lps = computeLPSArray(pattern);
    int i = 0, j = 0;
    int n = text.length();
    int m = pattern.length();

    while (i < n) {
        if (pattern.charAt(j) == text.charAt(i)) {
            i++;
            j++;
        }

        if (j == m) {
            result.add(i - j);
            j = lps[j - 1];
        } else if (i < n && pattern.charAt(j) != text.charAt(i)) {
            if (j != 0) {
                j = lps[j - 1];
            } else {
                i++;
            }
        }
    }
    return result;
}

Explanation

  • Matching Characters: If characters match, continue checking the next character.
  • Pattern Found: If the entire pattern matches, record the starting index and use the LPS array to continue searching.
  • Mismatch Handling: If characters mismatch, use the LPS array to skip unnecessary comparisons.

Performance Considerations

Time Complexity

  • Preprocessing Time: ( O(m) ) for constructing the LPS array.
  • Matching Time: ( O(n) ) for searching the pattern in the text.
  • Overall Time: ( O(n + m) ).

Space Complexity

  • Space Complexity: ( O(m) ) for storing the LPS array.

Use Cases

  • Text Editors: Efficient find-and-replace operations.
  • DNA Sequencing: Matching genetic patterns in sequences.
  • Log Analysis: Quickly locate specific patterns in large logs.

Conclusion

The Knuth-Morris-Pratt (KMP) algorithm provides a robust and efficient method for pattern matching by using the LPS array to avoid redundant comparisons. Its linear time complexity makes it suitable for various real-world applications.

Experiment with different patterns and texts to see the KMP algorithm in action, and enjoy the efficiency it brings to pattern matching!


Tags: #Java #Algorithm #KMP #PatternMatching #TextSearch

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top