// common/diff.cc
// This file is part of Anyterm; see http://anyterm.org/
// (C) 2005 Philip Endecott

// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.


#include "diff.hh"

#include <algorithm>
#include <vector>

using namespace std;


namespace DiffAlgo {

// The algorithm used is the one described in
//   "An O(ND) Difference Algorithm and Its Variations"
// by Eugene W Myers.
// (Postscript can be found on Myers' web page.)

// O(ND) refers to N = the size of the input (the sum of the two)
// and D = the length of the resulting "edit script", i.e. the
// number of differences.  The paper notes that the expected
// performance is O(N+D^2), with O(ND) being a pathological case.

// This is the complexity for the first phase of the computation in
// which the "edit script size" is found.  There is a second phase
// during which the actual "edit script" is determined.  This runs
// in O(N) time, but has space complexity O(D^2) (or worse O(ND) if
// the implementation is naive).  The paper presents an alternative
// implementation for the second phase in section 4b which has O(N)
// space requirements.  This is NOT implemented here.  For
// non-trivial applications, space complexity is likely to
// be a concern.  Comparing sequences of only a few thousand items
// with significant differences, i.e.
// an "edit script length" of a
// few thousand will consume megabytes, yet execution time is only
// seconds.  (On the other hand, when the inputs are similar, space
// performance is good.)

// Myers and others have proposed improvements to this algorithm,
// including one in the following paper:
//   "An O(NP) Sequence Comparison Algorithm"
//   Sun Wu, Udi Manber, Gene Myers
// (Postscript ditto).
// In this case, P is the number of deletions in the "edit script",
// which is less than the size of the script D.  This has NOT been
// implemented here.


// Here is a quick overview of the algorithm:

// The two input strings are A and B.  Imagine a grid with A
// labelling the columns and B labelling the rows.  There are
// additional "zeroth" rows and columns.  Say A = abc and B = baba:

//      a b c
//    . . . .
//  b . . * .
//  a . * . .
//  b . . * .
//  a . * . .

// The points where the labels on the rows and columns match are
// marked specially.

// The aim is to find a path from the top-left to the bottom-right
// of this grid in the following way:
// - Rightward horizontal moves indicate taking an element from
//   sequence A.
// - Downward vertical moves indicate taking an element from
//   sequence B.
// - Down-Right diagonal moves indicate taking an element common to
//   A and B.
// Horizontal and vertical moves are always allowed.  Diagonal moves
// are only allowed in order to reach the (specially marked) match
// points.  The aim is to find a path obeying these constraints that
// has the fewest horizontal and vertical moves, and hence the most
// diagonal moves.

// In the example above, a possible solution is as follows:

//      a b c
//    + . . .
//  b | . * .
//  a .\* . .
//  b . .\*-.
//  a . * . |

// (i.e.: vertically to take b, diagonally twice to take a and b,
// horizontally to take c, and vertically to take a.)

// The algorithm searches in a greedy fashion, that is, it is
// breadth-first when it is doing badly, but then depth-first once
// it is "on to a good thing" (i.e.
// a diagonal).  It expands a
// "frontier" across the grid from the top-left towards the bottom
// right.  Once the frontier hits the bottom-right, the problem is
// solved.

// Giving each horizontal and vertical move a cost of one and each
// diagonal move a cost of zero, all points on the frontier during a
// particular iteration have the same cost.  The variable d is used
// to refer to the costs (d is the number of differences).

// Points on the grid can be referred to using (x,y) coordinates.
// Row/column 0 are necessarily "empty" and do not correspond to
// elements of A or B.  Care is needed with off-by-one errors, since
// the sequences A and B are indexed from 0.

// Points can also be referred to using one or other of x and y (x
// by convention) and the "diagonal index", k, defined by k=x-y.

// The point about the diagonal index is that the frontier will
// always expand in such a way that it cuts each diagonal exactly
// once:

//    k
//    2
//    1   /      .
//    0   \       .
//   -1    \       .
//   -2   /\/\/     .
//        /          .

// So we can record the position of the frontier by giving the
// x-coordinate for each value of k.  In the code, the vector V
// records these values.

// Thinking of "snakes and ladders", diagonals are named "snakes"
// (though surely, since they lead toward the goal, they should be
// ladders?).  A "snake" is a (possibly empty or singleton) sequence
// of diagonals.

// Once the frontier has reached the target, a second phase of the
// algorithm identifies the optimal path by studying saved copies of
// the V vector from each step of the expansion.  This is the
// space-hungry step mentioned above.

// To reduce the space-hungriness from O(ND) to O(D^2), something of
// a hack is used.  The first phase is run twice.  In the first run,
// nothing is stored (so memory use is moderate).  At the end of
// this run, the "edit script length" is known.  In the second run,
// the V vectors are stored but this knowledge is used to limit
// their size.
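
// As an illustration of the first phase described above, the following is a
// minimal, self-contained sketch that computes only the edit distance D.  It
// is not part of Anyterm: the name myers_edit_distance and the single v
// array (sized for the worst case D = N+M) are assumptions made for this
// example, and nothing is stored for the second phase.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of Anyterm): returns the number of
// horizontal plus vertical moves on a cheapest path through the grid,
// i.e. the length D of the shortest insert/delete edit script.
int myers_edit_distance(const std::string& a, const std::string& b) {
  const int n = static_cast<int>(a.size());
  const int m = static_cast<int>(b.size());
  const int max_d = n + m;
  const int offset = max_d + 1;          // lets k-1 and k+1 stay in range
  std::vector<int> v(2 * max_d + 3, 0);  // v[offset+k] = furthest x on diagonal k

  for (int d = 0; d <= max_d; ++d) {
    for (int k = -d; k <= d; k += 2) {
      int x;
      if (k == -d || (k != d && v[offset + k - 1] < v[offset + k + 1])) {
        x = v[offset + k + 1];      // vertical move: take an element of b
      } else {
        x = v[offset + k - 1] + 1;  // horizontal move: take an element of a
      }
      int y = x - k;
      assert(x >= 0 && y >= 0);     // the frontier never has negative coordinates

      // Follow the snake: free diagonal moves while elements match.
      while (x < n && y < m && a[x] == b[y]) {
        ++x;
        ++y;
      }
      v[offset + k] = x;

      if (x >= n && y >= m) {
        return d;                   // frontier has reached the bottom-right
      }
    }
  }
  return max_d;  // not reached: d = n+m always suffices
}
```

// For the example grid above, myers_edit_distance("abc", "baba") evaluates
// to 3, matching the three non-diagonal moves in the path that was drawn.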


// The Differ class is template-parameterised by the sequence type
// that it operates on.  This is normally string, but if you want to
// use a different type, you should just be able to create a Differ
// object specifying a different type.  vector<something> should
// work, as long as operator== is defined on 'something'.  See the
// end of the file for how Differ is used.

template <typename SEQ>
class Differ {

private:
  const SEQ& A;  // Input sequences
  const SEQ& B;
  const int N;   // length of A
  const int M;   // length of B
  const int max_D;
  bool store;

  // Output
  typename fragment_seq<SEQ>::Type& result;

  // Ideally V would be an array indexed from -(M+N) to (M+N)
  // inclusive, but we only have zero-indexed arrays.  So we use a
  // zero-indexed array and apply an offset.
  const int V_size;
  const int V_offset;
  typedef vector<int> V_impl_t;
  V_impl_t V_impl;
  int& V ( int k ) { return V_impl[V_offset+k]; }

  typedef vector<V_impl_t> stored_V_impls_t;
  stored_V_impls_t stored_V_impls;
  int stored_V ( int d, int k ) const { return (stored_V_impls[d])[V_offset+k]; }

  // This is filled in when solve() finishes.  If all that is wanted
  // is to know the distance between the two inputs, there is no
  // need to call find_trace() at all; just read this using
  // get_edit_distance().
  int edit_distance;


  // Append an item to the result, with a tag.
  // If the tag matches the tag of the current end of the result, it
  // is merged with it.

  void append_result ( fragment_tag tag, typename SEQ::value_type datum ) {
    if (!result.empty() && (result.back().first == tag)) {
      result.back().second.push_back(datum);
    } else {
      result.push_back(make_pair(tag,SEQ(1,datum)));
    }
  }


  // Follow any snake from (k,x) to its end, and return the x
  // coordinate at the end.

  int follow_snake ( int k, int x ) {
    int y = x - k;
    while ( (x>=0) && (x<N) && (y>=0) && (y<M) && (A[x]==B[y]) ) {
      ++x;
      ++y;
    }
    return x;
  }


  // [find_trace_r, the second-phase trace-back that find_trace()
  // calls as find_trace_r(edit_distance, N-M), belongs here; its
  // definition is not recoverable from this copy of the file.]


public:

  Differ ( const SEQ& a, const SEQ& b, typename fragment_seq<SEQ>::Type& r,
           bool s = true, int md = -1):
    A(a), B(b), N(A.size()), M(B.size()),
    max_D((md==-1)?(M+N):(min(md,M+N))),
    store(s),
    result(r),
    V_size(max(2*(max_D)+1,2)),  // 2 allows for d=0 V(1) special case
    V_offset(max_D),
    V_impl(V_size)
  {}


  class max_D_exceeded {};  // Exception thrown if solution has not
                            // been found after max_D
                            // frontier-expansion iterations.


  // Perform the first phase of the algorithm, expanding the
  // frontier.
  // This function is essentially what is described in Figure 2 of
  // Myers' paper.

  void solve ( void ) {

    // The normal operation is (H or V) then any diagonal then repeat.
    // But this is broken if the first diagonal starts from the origin
    // (e.g. for equal strings).
    // The following is a hack that works around that:
    V(1) = 0;
    // But this requires that V is large enough for this extra element.

    // Loop for increasing values of D until target reached
    int D = -1;
    bool done=false;
    while (!done) {
      ++D;
      if (D>max_D) {
        throw max_D_exceeded();
      }

      // Scan across the width of the frontier
      for ( int k = -D; k <= D; k += 2 ) {

        // Find a new x value for this point on the frontier.
        // Special cases for either end.
        // Otherwise, move horizontally or vertically from a neighbour.
        int x;
        if ( (k==-D) || ((k!=D) && (V(k-1)<V(k+1))) ) {
          x = V(k+1);
        } else {
          x = V(k-1)+1;
        }

        // Follow any snake from this point and record where it ends.
        x = follow_snake(k,x);
        V(k) = x;

        int y = x-k;
        if ( (x>=N) && (y>=M) ) {
          done = true;
          // We could probably leave the inner loop at this point,
          // but I'm not certain it's safe, and it certainly makes
          // debugging harder when only some of the points have been
          // updated, so don't bother.
          //break;
        }
      }

      // Save a copy of V for use during the second phase.
      if (store) {
        stored_V_impls.push_back(V_impl);
      }
    }

    edit_distance = D;
  }


  void find_trace ( void ) {
    find_trace_r ( edit_distance, N-M );
  }


  int get_edit_distance(void) {
    return edit_distance;
  }

};


void make_trivial_solution ( const string& A, const string& B, string_fragment_seq& result )
{
  result.push_back(make_pair(from_a,A));
  result.push_back(make_pair(from_b,B));
}


void string_diff ( const string& A, const string& B, string_fragment_seq& result )
{
  // Consider time efficiency.  Aim not to take more than this much
  // time (arbitrary units).  Return a sub-optimal solution if this
  // time is exceeded.
  const int max_time = 1000;

  // Consider space efficiency.  Aim not to use more than this much
  // memory (arbitrary units).  Take more time or return a
  // sub-optimal solution if this much memory is exceeded.
  const int max_mem = 10000000;

  // Consider changing the above settings if "top" shows that the
  // apache frontend request-handling processes are using more
  // memory or CPU time than you would like.
  // Reducing them means that Anyterm will give up looking for an
  // edit script and just send the complete new screen sooner.  So
  // making them too low will use more network bandwidth.  On the
  // other hand, for a fast local network, you might get a faster
  // response with a lower max_time setting.

  try {
    int sz = A.size() + B.size();

    // If input is small, i.e. N^2 is acceptable, we don't worry about
    // space complexity.  (This will take O(ND) space, but D could
    // equal N.)
    if ((sz*sz)<max_mem) {
      Differ<string> d1(A,B,result,true,max_time);
      d1.solve();
      d1.find_trace();
      return;
    }

    // If input is larger, do a first pass to find the edit distance:
    Differ<string> d2(A,B,result,false,max_time);
    d2.solve();

    // We could now solve this with space complexity O(ND), if that
    // were acceptable:
    if (sz*d2.get_edit_distance()<max_mem) {
      Differ<string> d3(A,B,result,true,d2.get_edit_distance());
      d3.solve();
      d3.find_trace();
      return;
    }

    // If even O(ND) is not acceptable, we give up and return a result
    // indicating no common subset:
    make_trivial_solution(A,B,result);
  }
  catch (Differ<string>::max_D_exceeded) {
    make_trivial_solution(A,B,result);
  }
}


void make_trivial_solution ( const ucs4_string& A, const ucs4_string& B, ucs4_string_fragment_seq& result )
{
  result.push_back(make_pair(from_a,A));
  result.push_back(make_pair(from_b,B));
}


void ucs4_string_diff ( const ucs4_string& A, const ucs4_string& B, ucs4_string_fragment_seq& result )
{
  // Consider time efficiency.  Aim not to take more than this much
  // time (arbitrary units).  Return a sub-optimal solution if this
  // time is exceeded.
  const int max_time = 300;

  // Consider space efficiency.  Aim not to use more than this much
  // memory (arbitrary units).  Take more time or return a
  // sub-optimal solution if this much memory is exceeded.
  const int max_mem = 10000000;

  // Consider changing the above settings if "top" shows that the
  // apache frontend request-handling processes are using more
  // memory or CPU time than you would like.
  // Reducing them means that Anyterm will give up looking for an
  // edit script and just send the complete new screen sooner.  So
  // making them too low will use more network bandwidth.  On the
  // other hand, for a fast local network, you might get a faster
  // response with a lower max_time setting.

  try {
    int sz = A.size() + B.size();

    // If input is small, i.e. N^2 is acceptable, we don't worry about
    // space complexity.  (This will take O(ND) space, but D could
    // equal N.)
    if ((sz*sz)<max_mem) {
      Differ<ucs4_string> d1(A,B,result,true,max_time);
      d1.solve();
      d1.find_trace();
      return;
    }

    // If input is larger, do a first pass to find the edit distance:
    Differ<ucs4_string> d2(A,B,result,false,max_time);
    d2.solve();

    // We could now solve this with space complexity O(ND), if that
    // were acceptable:
    if (sz*d2.get_edit_distance()<max_mem) {
      Differ<ucs4_string> d3(A,B,result,true,d2.get_edit_distance());
      d3.solve();
      d3.find_trace();
      return;
    }

    // If even O(ND) is not acceptable, we give up and return a result
    // indicating no common subset:
    make_trivial_solution(A,B,result);
  }
  catch (Differ<ucs4_string>::max_D_exceeded) {
    make_trivial_solution(A,B,result);
  }
}


};
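
// The three-tier choice made by string_diff and ucs4_string_diff can be
// summarised as a small decision function.  This is a sketch only: DiffPlan
// and choose_plan are hypothetical names, not part of Anyterm.  The products
// are computed in 64 bits on the assumption that int is 32 bits wide, in
// which case sz*sz would overflow for combined lengths above 46340.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names, used only for this illustration.
enum class DiffPlan {
  SinglePassStored,  // sz*sz fits the budget: solve once, storing every V
  TwoPassBounded,    // sz*D fits the budget: re-solve with max_D = D, storing V
  TrivialFallback    // neither fits: report "all of A, then all of B"
};

// sz      : combined length of the two inputs
// d       : edit distance found by a non-storing first pass
//           (only consulted when sz*sz does not fit the budget)
// max_mem : memory budget in the same arbitrary units as the code above
DiffPlan choose_plan(std::int64_t sz, std::int64_t d, std::int64_t max_mem) {
  assert(sz >= 0 && d >= 0 && max_mem >= 0);
  if (sz * sz < max_mem) {
    return DiffPlan::SinglePassStored;
  }
  if (sz * d < max_mem) {
    return DiffPlan::TwoPassBounded;
  }
  return DiffPlan::TrivialFallback;
}
```

// With max_mem = 10000000 as above, two inputs totalling 4000 items that
// differ in 100 places give choose_plan(4000, 100, 10000000) ==
// DiffPlan::TwoPassBounded, since 4000*4000 exceeds the budget but 4000*100
// does not.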