A Novel Parallel Algorithm for Edit Distance Computation

The edit distance between two sequences is the minimum number of weighted transformation-operations required to transform one string into the other. The weighted transformation-operations are insert, remove, and substitute. A dynamic programming solution for edit distance exists, but it becomes computationally intensive when the strings are very long. This work presents a novel parallel algorithm for the edit distance problem of string matching. The algorithm resolves dependencies in the dynamic programming solution so that each row of the edit distance table can be computed in parallel. The complete table can therefore be computed in min(m,n) iterations for strings of size m and n, whereas the state-of-the-art parallel algorithm needs max(m,n) iterations. The proposed algorithm also increases the amount of parallelism in each iteration, can exploit spatial locality in its implementation, and works in a load-balanced way that further improves its performance. The algorithm is implemented for shared-memory multicore systems. An OpenMP implementation shows almost linear speedup and better execution time than the state-of-the-art parallel approach. Its measured efficiency is also better than that of its competitor.


INTRODUCTION
Comparison of two strings helps in solving problems from many domains, including bioinformatics (DNA analysis) [1], text processing (spell-checkers, plagiarism detection, and error correction), signal processing, information retrieval, speech recognition, and web mining. String matching or string comparison comes in different forms: finding whether a string is a substring of another string, identifying the longest common subsequence, and checking how similar or dissimilar two strings are [2]. All these forms of string matching have their own applications in different areas.
This work focuses on the problem of checking how similar two strings are, in other words, how closely two strings resemble each other. A well-defined measure exists for this purpose, called the Levenshtein distance. In simple words, the Levenshtein distance is the minimum number of transformation-operations (deletion, insertion, or substitution) required to transform one string into another. The Levenshtein distance is also referred to as the edit distance between two strings. Edit distance finds applications in natural language processing, where spell correction is its most common use, and in computational biology, where it is used for matching and aligning DNA sequences. It is also used for machine translation, information extraction, and speech recognition.
Dynamic programming solutions exist for computing the edit distance, but they become computationally intensive when the strings are very long. A parallel algorithm can therefore help in finding the solution in reasonable time. This study presents a novel parallel algorithm to compute the edit distance. The theoretical design is evaluated and compared with the state-of-the-art parallel approach. Further, the algorithm is implemented in OpenMP for multicore systems, where it shows improved results.
The rest of the paper is organized as follows: Section 2 explains the Levenshtein distance; Section 3 discusses parallel approaches to compute the Levenshtein distance and hence covers the related work; Section 4 presents the justification for our novel parallel approach and compares it theoretically with the state of the art; and Section 5 discusses the implementation and experimental results. Finally, Section 6 concludes the work and highlights future directions.

LEVENSHTEIN DISTANCE (EDIT DISTANCE)
This section defines the Levenshtein distance, or simply edit distance, and explains it through a simple example, its mathematical formulation, and its dynamic programming solution.

Definition
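Given two strings/sequences A = 〈a_1, a_2, …, a_m〉 and B = 〈b_1, b_2, …, b_n〉 of size m and n respectively, over a finite alphabet X = 〈X_1, X_2, …, X_k〉, the edit distance between A and B, represented by ED_AB, is the minimum number of weighted transformation-operations required to transform A into B. This work assumes that the weighted transformation-operations are insert, remove, and substitute, and that the weight of each operation is 1. If A = 〈Tuesday〉 and B = 〈Tuesday〉 then ED_AB = 0, because both strings are identical and no transformation-operation is required. If A = 〈Thursday〉 and B = 〈Tuesday〉 then ED_AB = 2, because one remove (remove 'h') and one substitution (replace 'r' with 'e') are required to transform A into B.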

Mathematical Formulation
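Given two strings/sequences A = 〈a_1, a_2, …, a_m〉 and B = 〈b_1, b_2, …, b_n〉, let D[i, j] denote the edit distance between the first i characters of A and the first j characters of B, with D[i, 0] = i and D[0, j] = j. With unit weights, the recurrence is

\[
D[i,j] =
\begin{cases}
D[i-1,j-1] & \text{if } a_i = b_j \\
1 + \min\bigl(D[i-1,j],\; D[i,j-1],\; D[i-1,j-1]\bigr) & \text{if } a_i \neq b_j
\end{cases}
\qquad (1)
\]

where 1 ≤ i ≤ m and 1 ≤ j ≤ n.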

Dynamic Programming Solution
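Given two strings of length m and n, a distance table D of size (m+1) × (n+1) is built, in which D[i, j] holds the edit distance between the first i characters of the first string and the first j characters of the second. In sequential computation, the table can be filled in row-major or column-major order.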

Parallel Algorithm Formulation
This section discusses how the distance table can be built in parallel. To do that, it is important to understand how the entries of the table are populated.
As mentioned in the previous section, in sequential computation the table can be filled in row-major or column-major order. To compute more than one entry simultaneously, however, some dependency analysis is required.
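The sequential row-major fill referred to above can be sketched as follows. This is a minimal illustration of the standard unit-weight dynamic programming solution, not the authors' code; the function names `edit_distance_table` and `edit_distance` are chosen here for illustration. Each interior cell reads its left, upper, and upper-left neighbours, and it is the left neighbour in the same row that prevents a row from being computed in parallel directly.

```python
def edit_distance_table(a: str, b: str) -> list[list[int]]:
    """Fill the (m+1) x (n+1) distance table in row-major order."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Row 0 and column 0: cost of transforming to or from the empty prefix.
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]           # match case
            else:
                d[i][j] = 1 + min(d[i - 1][j],      # remove
                                  d[i][j - 1],      # insert
                                  d[i - 1][j - 1])  # substitute
    return d


def edit_distance(a: str, b: str) -> int:
    """Edit distance is the bottom-right entry of the table."""
    return edit_distance_table(a, b)[len(a)][len(b)]


print(edit_distance("Thursday", "Tuesday"))  # 2
```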

RELATED WORK
To solve the edit distance problem in parallel, the major solutions are based on the bit-parallel [3] and diagonal-parallel approaches. Bit-parallel algorithms depend on the machine word size, and their performance decreases as the word size increases; hence they are not applicable to general processors [4].
Parallel algorithms based on the diagonal approach compute the distance table diagonally, i.e. one diagonal at a time, because dependence analysis shows that each diagonal depends only on the previous diagonal, as shown in Fig. 3. Most recent parallel algorithms to compute edit distance are based on the diagonal approach.
With this approach, for two strings of size m and n, at most min(m, n) cells of the distance table can be computed in parallel, as that is the size of the largest diagonal. Further, when m and n are almost the same, this largest amount of parallelism is attained only a few times. With a varying amount of parallelism at each step, it is also hard to maintain load balance in diagonal-based approaches [5][6][7].
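The diagonal scheme described above can be sketched as below. This is an illustrative, sequential stand-in assuming unit weights; `edit_distance_by_diagonals` is a hypothetical name, and the inner loop marks the cells that a parallel implementation would compute concurrently. The function also records how many cells each anti-diagonal holds, which makes the varying amount of parallelism visible.

```python
def edit_distance_by_diagonals(a: str, b: str) -> tuple[int, list[int]]:
    """Return the edit distance and the number of cells on each anti-diagonal."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    widths = []
    # Interior cells with i + j == s form one anti-diagonal. They read only
    # cells from earlier diagonals (s - 1 and s - 2), never each other.
    for s in range(2, m + n + 1):
        rows = range(max(1, s - n), min(m, s - 1) + 1)
        for i in rows:  # independent cells: the parallel step
            j = s - i
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
        widths.append(len(rows))
    return d[m][n], widths


distance, widths = edit_distance_by_diagonals("Thursday", "Tuesday")
print(distance, max(widths))  # 2 7
```

For these strings (m = 8, n = 7) the widest diagonal has min(m, n) = 7 cells, while the first and last diagonals hold a single cell each, which illustrates the load imbalance.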

FIG. 3. DIAGONAL APPROACH
The parallel algorithm for the edit distance problem in [8] also uses the diagonal-based approach, but it is specific to FPGAs. Niewiarowski et al. [9] used .NET Framework 4.0 with a thread implementation based on the System.Threading.Tasks namespace; it requires a specific number of threads to be executed in parallel and is not cost effective for other thread counts.

MAJOR CONTRIBUTION - NOVEL PARALLEL ALGORITHM
This section presents the novel parallel algorithm to compute edit distance.
Definition-1: If the ith character of the first string matches the jth character of the second string, then D[i, j] is called a match case.
Definition-2: If the ith character of the first string does not match the jth character of the second string, then D[i, j] is called a non-match case.
Considering a cell D[i, j-1] of the edit distance table with edit distance 'n', the following observations hold.
Observation-0: Assuming that the weight of each transformation-operation (insert, remove, and substitute) is 1, it follows from the recurrence of Equation (1) that the edit distances of two adjacent cells in a row or in a column do not differ by more than 1. Hence, from one column to the next, the edit distance in a row can only increase by 1, remain the same, or decrease by 1, which gives the following cases.
Observation-1: The edit distance in the ith row is increasing and the edit distance in the previous row is also increasing. Possible cases are depicted in Fig. 4.

Observation-2: The edit distance in the ith row is increasing and the edit distance in the previous row remains the same. Possible cases are depicted in Fig. 5.

FIG. 4. CASES FOR OBSERVATION-1
FIG. 5. CASES FOR OBSERVATION-2
Observation-3: The edit distance in the ith row is increasing and the edit distance in the previous row is decreasing. Possible cases are depicted in Fig. 6.
Observation-4: The edit distance in the ith row remains the same and the edit distance in the previous row is increasing. Possible cases are depicted in Fig. 7.
Observation-4(a): If case 'a' of Observation-4 continues for the next column, then there would definitely be a match case at D[i, j+1]. This situation is depicted in Fig. 8.

Observation-5: The edit distance in the ith row remains the same and the edit distance in the previous row also remains the same. Possible cases are depicted in Fig. 9.
Observation-6: The edit distance in the ith row remains the same and the edit distance in the previous row is decreasing. Possible cases are depicted in Fig. 10.

Observation-7: The edit distance in the ith row is decreasing and the edit distance in the previous row is increasing. Possible cases are depicted in Fig. 11.

FIG. 6. CASES FOR OBSERVATION-3
FIG. 7. CASES FOR OBSERVATION-4
FIG. 8. SITUATION FOR OBSERVATION-4(A)
FIG. 9. CASES FOR OBSERVATION-5
Observation-8: The edit distance in the ith row is decreasing and the edit distance in the previous row remains the same. Possible cases are depicted in Fig. 12.
Observation-9: The edit distance in the ith row is decreasing and the edit distance in the previous row is also decreasing. Possible cases are depicted in Fig. 13.
Fact-2: Under Observation-2 (Cases b and c) and Observation-3 (Case c), the edit distance in the ith row is increasing and that in the (i-1)th row is decreasing or stays the same from the last match case. In this case, the edit distance in the (i-1)th row will be less than the edit distance in the ith row. Hence, for D

Theorem-1: For a non-match case D[i, j+k] (k > 0) with last match case D[i, j], D[i, j+k] = min(D
Fact-3: Under Observation-4(a), it cannot continue.
Fact-4: Under Observation-5 (Case a) and Observation-6 (Case c), the edit distance in the ith row remains the same and that in the (i-1)th row is decreasing or stays the same from the last match case. In this case, the edit distance in the (i-1)th row will be less than the edit distance in the ith row.
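The closed form of Theorem-1 is cut off in this text. As a sketch only, the following derivation unrolls the standard unit-weight recurrence along row i, starting from the last match case D[i, j]; it is worked out here from that recurrence and Observation-0 and may differ in presentation from the authors' statement. It assumes that columns j+1, …, j+k are all non-match cases.

```latex
% Non-match recurrence for the target cell:
\[
D[i,j+k] = 1 + \min\bigl(D[i-1,j+k-1],\; D[i-1,j+k],\; D[i,j+k-1]\bigr)
\]
% Substituting the same recurrence for D[i,j+k-1], D[i,j+k-2], ... down to the match column j:
\[
D[i,j+k] = \min\Bigl(\,\min_{1 \le t \le k}\bigl[\,1 + \min\bigl(D[i-1,j+t-1],\, D[i-1,j+t]\bigr) + (k-t)\bigr],\;\; D[i,j] + k \Bigr)
\]
% Observation-0 gives D[i-1,j+k] \le D[i-1,j+t] + (k-t) and
% D[i-1,j+k-1] \le D[i-1,j+t-1] + (k-t), so the term with t = k is never larger
% than the terms with t < k. With D[i,j] = D[i-1,j-1] for the match case:
\[
D[i,j+k] = \min\bigl(D[i-1,j+k-1] + 1,\;\; D[i-1,j+k] + 1,\;\; D[i-1,j-1] + k\bigr)
\]
```

Every term on the right-hand side of the last line lies in row i-1, which is the property that allows a whole row to be computed at once.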
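According to Theorem-1, the dependency for the computation of edit distance is shifted to the previous row only, as depicted in Fig. 14. This makes it possible to compute a complete row of D in parallel, given that the information about the last match case is available. The cells of a row can be distributed among processing nodes. In this way, the computation of D can be completed in min(m,n) iterations, because it is always possible to arrange the table so that the longer string, of length max(m,n), runs along each row and only min(m,n) rows remain. In contrast, the state-of-the-art parallel algorithm based on the diagonal approach needs max(m,n) iterations. This difference will be dominant when the lengths of the two strings differ greatly, hence our algorithm is expected to perform significantly better than the diagonal-based algorithm in that setting.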
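The row-wise scheme can be sketched as follows. This is an illustrative reading and not the authors' OpenMP implementation: it assumes unit weights and uses a closed form obtained by unrolling the standard recurrence back to the last match column, so that each cell reads only the previous row. The names `edit_distance_row_parallel`, `last_match`, and `cell` are hypothetical, and the list comprehension stands in for the loop that a parallel-for would distribute over threads.

```python
def edit_distance_row_parallel(a: str, b: str) -> int:
    """Row-wise edit distance in which no cell reads its left neighbour."""
    # Put the shorter string on the rows so the outer loop runs min(m, n) times.
    if len(a) > len(b):
        a, b = b, a  # unit-weight edit distance is symmetric
    m, n = len(a), len(b)

    # last_match[c][j]: largest column t <= j with b[t-1] == c, or 0 if none.
    last_match = {}
    for c in set(a):
        cols = [0] * (n + 1)
        for j in range(1, n + 1):
            cols[j] = j if b[j - 1] == c else cols[j - 1]
        last_match[c] = cols

    prev = list(range(n + 1))  # row 0 of the table
    for i in range(1, m + 1):
        lm = last_match[a[i - 1]]

        def cell(j: int) -> int:
            if j == 0:
                return i                  # border column D[i][0]
            t = lm[j]
            if t == j:                    # match case
                return prev[j - 1]
            # Value at the last match column, or the border D[i][0] = i.
            anchor = prev[t - 1] if t > 0 else i
            return min(prev[j - 1] + 1, prev[j] + 1, anchor + (j - t))

        # Every cell depends only on `prev`, so all cells of the row
        # are independent of each other.
        prev = [cell(j) for j in range(n + 1)]
    return prev[n]


print(edit_distance_row_parallel("Thursday", "Tuesday"))  # 2
```

The `last_match` table depends only on the column string and the characters of the row string, so it can be built once before the row loop starts.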
Another facet of the algorithm is its load-balanced approach. Each iteration of the algorithm has the same amount of computation, so all the independent tasks in a single iteration can be distributed uniformly among the processing nodes. The diagonal-based parallel algorithm lacks this feature.
Our algorithm also has an added advantage in implementation. As the algorithm processes the table row by row, it can exploit spatial locality in the underlying memory system. The diagonal-based algorithm has to access cells from different rows and columns in each iteration, which tends to increase cache misses.

IMPLEMENTATION AND RESULTS
The algorithm is implemented for a shared-memory environment using OpenMP in conjunction with C++.
OpenMP has emerged as a standard for shared-memory programming. It is an API tailored to shared-memory multiprocessing, so it is a natural fit compared with other APIs.
The implementation is run on an Intel Core i3-2370M at 2.40 GHz with 2 cores and 4 logical processors, and the results are compared with the sequential algorithm and the diagonal parallel algorithm. The algorithm utilizes more than 90% of the CPU, so with more computing power, i.e. a larger number of processors, it is expected to perform even better.
Strings of equal size are generated randomly, and the results are averages over fifteen experiments.
Results for execution time, speedup, and efficiency are presented in Fig. 15(a-c). For strings of size m and n, the problem size is defined in terms of m+n. In the first scenario, m and n are equal. Execution time is measured in milliseconds. Speedup measures the performance gain of the parallel algorithm over the sequential algorithm. Efficiency measures the fraction of time for which a processing element is usefully employed; it is defined as the ratio of speedup to the number of processing elements. The results show that our algorithm outperforms the state-of-the-art parallel approach to the edit distance problem.
In particular, it achieves almost linear speedup, which is a result of the load-balanced nature of our parallel algorithm.
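The two metrics defined above can be computed as in the short sketch below. The timings are hypothetical values chosen for illustration and are not measurements from this work; the function names are likewise illustrative.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Ratio of sequential to parallel execution time."""
    return t_sequential / t_parallel


def efficiency(t_sequential: float, t_parallel: float, processors: int) -> float:
    """Speedup divided by the number of processing elements."""
    return speedup(t_sequential, t_parallel) / processors


# Hypothetical timings in milliseconds, not results from the paper.
t_seq, t_par, p = 1200.0, 400.0, 4
print(speedup(t_seq, t_par))        # 3.0
print(efficiency(t_seq, t_par, p))  # 0.75
```

Linear speedup corresponds to a speedup equal to the number of processing elements, i.e. an efficiency of 1.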
The experiments were also performed for another setting in which the lengths of the two strings are not equal, resulting in a rectangular edit distance table. In this setting, the experiments were performed for different proportions of m and n, assuming αm = n. The value of α was varied from 2 to 9000. For increasing values of α, the performance of the diagonal-based approach becomes closer to the
It is clear that our algorithm outperforms the diagonal-based approach in terms of execution time, speedup, and efficiency.

FIG. 10. CASES FOR OBSERVATION-6
FIG. 15. RESULTS AND COMPARISON WHEN M AND N ARE EQUAL
FIG. 16. RESULTS AND COMPARISON FOR α = 5000
Fact-1:
Hence, for D[i, j+k], D[i-1, (j+k)-1]+1 or D[i-1, j+k]+1 would serve. Based on the above facts, for any valid permutation of Observation-1 (Cases b and c), Observation-2 (Cases b and c), Observation-3 (Case c), Observation-4 (Case a), Observation-5 (Case a), and/or Observation-6 (Case c), D[i, j+k] = min(D