pandorafms/extras/anytermd/src/diff.cc

545 lines
17 KiB
C++

// common/diff.cc
// This file is part of Anyterm; see http://anyterm.org/
// (C) 2005 Philip Endecott
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
#include "diff.hh"
#include <string>
#include <vector>
using namespace std;
namespace DiffAlgo {
// The algorithm used is the one described in
// "An O(ND) Difference Algorithm and Its Variations"
// by Eugene W Myers.
// (Postscript can be found on Myers' web page.)
// O(ND) refers to N = the size of the input (the sum of the two)
// and D = the length of the resulting "edit script", i.e. the
// number of differences. The paper notes that the expected
// performance is O(N+D^2), with O(ND) being a pathological case.
// This is the complexity for the first phase of the computation in
// which the "edit script size" is found. There is a second phase
// during which the actual "edit script" is determined. This runs
// in O(N) time, but has space complexity O(D^2) (or worse O(ND) if
// the implementation is naive). The paper presents an alternative
// implementation for the second phase in section 4b which has O(N)
// space requirements. This is NOT implemented here. It is likely
// that for non-trivial applications, space complexity is likely to
// be a concern. Comparing sequences of only a few thousand items
// with signficant differences, i.e. an "edit script length" of a
// few thousand will consume megabytes, yet execution time is only
// seconds. (On the other hand, when the inputs are similar, space
// performance is good.)
// Myers and others have proposed improvements to this algorithm,
// including one in the following paper:
// "An O(NP) Sequence Comparison Algorithm"
// Sun Wu, Udi Manber, Gene Myers
// (Postscript ditto).
// In this case, P is the number of deletions in the "edit script",
// which is less than the size of the script D. This has NOT been
// implemented here.
// Here is a quick overview of the algorithm:
// The two input strings are A and B. Imagine a grid with A
// labelling the columns and B labelling the rows. There are
// additional "zeroth" rows and columns. Say A = abc and B = baba:
// a b c
// . . . .
// b . . * .
// a . * . .
// b . . * .
// a . * . .
// The points where the labels on the rows and columns match are
// marked specially.
// The aim is to find a path from the top-left to the bottom-right
// of this grid in the following way:
// - Rightward horizontal moves indicate taking an element from
// sequence A.
// - Downward vertical moves indicate taking an element from
// sequence B.
// - Down-Right diagonal moves indicate taking an element common to
// A and B.
// Horizontal and vertical moves are always allowed. Diagonal moves
// are only allowed in order to reach the (specially marked) match
// points. The aim is to find a path obeying these constraints that
// has the fewest horizontal and vertical moves, and hence the most
// diagonal moves.
// In the example above, a possible solution is as follows:
// a b c
// + . . .
// b | . * .
// a .\* . .
// b . .\*-.
// a . * . |
// (i.e.: vertically to take b, diagonally twice to take a and b,
// horizontally to take c, and vertically to take a.)
// The algorithm searches in a greedy fashion, that is, it is
// breadth-first when it is doing badly, but then depth-first once
// it is "on to a good thing" (i.e. a diagonal). It expands a
// "frontier" across the grid from the top-left towards the bottom
// right. Once the frontier hits the bottom-right, the problem is
// solved.
// Giving each horizontal and vertical move a cost of one and each
// diagonal move a cost of zero, all points on the frontier during a
// particular iteration have the same cost. The variable d is used
// to refer to the costs (d is the number of differences).
// Points on the grid can be referred to using (x,y) coordinates.
// Row/column 0 are necessarily "empty" and do not correspond to
// elements of A or B. Care is needed with off-by-one errors, since
// the sequences A and B are indexed from 0.
// Points can also be referred to using one or other of x and y (x
// by convention) and the "diagonal index", k, defined by k=x-y.
// The point about the diagonal index is that the frontier will
// always expand in such a way that it cuts each diagonal exactly
// once:
// k
// 2
// 1 / .
// 0 \ .
// -1 \ .
// -2 /\/\/ .
// / .
// So we can record the position of the frontier by giving the
// x-coordinate for each value of k. In the code, the vector V
// records these values.
// Thinking of "snakes and ladders", diagonals are named "snakes"
// (though surely, since they lead toward the goal, they should be
// ladders?). A "snake" is a (possibly empty or singleton) sequence
// of diagonals.
// Once the frontier has reached the target, a second phase of the
// algorithm identifies the optimal path by studying saved copies of
// the V vector from each step of the expansion. This is the
// space-hungry step mentioned above.
// To reduce the space-hungryness from O(ND) to O(D^2), something of
// a hack is used. The first phase is run twice. In the first run,
// nothing is stored (so memory use is moderate). At the end of
// this run, the "edit script length" is known. In the second run,
// the V vectors are stored but this knowledge is used to limit
// their size.
// The Differ class is template-parameterised by the sequence type
// that it operates on. This is normally string, but if you want to
// use a different type, you should just be able to create a Differ
// object specifying a different type. vector<something> should
// work, as long as operator= is defined on 'something'. See the
// end of the file for how Differ is used.
template <typename SEQ>
class Differ {
private:
const SEQ& A; // Input sequences
const SEQ& B;
const int N; // length of A
const int M; // length of B
const int max_D;
bool store;
// Output
typename fragment_seq<SEQ>::Type& result;
// Ideally V would be an array indexed from -(M+N) to (M+N)
// inclusive, but we only have zero-indexed arrays. So we use a
// zero-indexed array and apply an offset.
const int V_size;
const int V_offset;
typedef vector<int> V_impl_t;
V_impl_t V_impl;
int& V ( int k ) { return V_impl[V_offset+k]; }
typedef vector<V_impl_t> stored_V_impls_t;
stored_V_impls_t stored_V_impls;
int stored_V ( int d, int k ) const { return (stored_V_impls[d])[V_offset+k]; }
// This is filled in when solve() finishes. If all that is wanted
// is to know the distance between the two inputs, there is no
// need to call find_trace() at all; just read this using
// get_edit_distance().
int edit_distance;
// Append an item to the result, with a tag.
// If the tag matches the tag of the current end of the result, it
// is merged with it.
void append_result ( fragment_tag tag, typename SEQ::value_type datum )
{
if (!result.empty() && (result.back().first == tag)) {
result.back().second.push_back(datum);
} else {
result.push_back(make_pair(tag,SEQ(1,datum)));
}
}
// Follow any snake from (k,x) to its end, and return the x
// coordinate at the end.
int follow_snake ( int k, int x )
{
int y = x - k;
while ( (x>=0) && (x<N) && (y>=0) && (y<M) && (A[x]==B[y]) ) {
++x;
++y;
}
return x;
}
// Follow any snake from (k,x) to its end, recording the data in
// the result with tag "common".
void get_snake ( int k, int x )
{
int y = x - k;
while ( (x>=0) && (x<N) && (y>=0) && (y<M) && (A[x]==B[y]) ) {
append_result(common,A[x]);
++x;
++y;
}
}
// Find and record a trace from (0,0) to the point on diagonal k
// with cost d. (Recursive)
void find_trace_r ( int d, int k )
{
if (d==0) {
get_snake(0,0);
return;
} else {
// Look up x coordinate in saved V for cost=d in diagonal k.
int x = stored_V(d,k);
// How did we get to (k,x)?
// Either:
// - A vertical move from (k+1,something), possibly followed by a
// "snake" along diagonal k.
// - A horizontal move from (k-1,something), possibly
// followed by a "snake" along diagonal k.
// To find out which, we look up V(k) for d-1 in diagonals k+1
// and k-1, and see if snake-slides from either of those
// points would get to to (k,x). (One or other must do, so we
// now only do one check.)
// Notation: R = point before H or V move;
// S = point after H or V move;
// T = point after subsequent snake.
// Start by checking for a vertical move
int Rx = stored_V(d-1,k+1);
int Sx = Rx;
int Tx = follow_snake(k,Sx);
if (Tx == x) {
// OK, did a vertical move. Find how we got to that point.
find_trace_r ( d-1, k+1 );
// Find the data at the end of that move
int Ry = Rx - (k+1);
int Sy = Ry +1;
typename SEQ::value_type d = B[Sy-1];
// Record vertical move plus data.
append_result(from_b,d);
// Record any snake that followed it.
get_snake(k,Sx);
} else {
// It must have been a horizontal move.
int Rx = stored_V(d-1,k-1);
int Sx = Rx + 1;
// Find how we got to that point.
find_trace_r ( d-1, k-1 );
// Find the data at the end of that move
typename SEQ::value_type d = A[Sx-1];
// Record horizontal move plus data.
append_result(from_a,d);
// Record any snake that followed it.
get_snake(k,Sx);
}
}
}
public:
// Constructor, takes input sequences and reference to output.
// Optionally takes store flag and max_D.
Differ ( const SEQ& a, const SEQ& b, typename fragment_seq<SEQ>::Type& r,
bool s = true, int md = -1):
A(a),
B(b),
N(A.size()),
M(B.size()),
max_D((md==-1)?(M+N):(min(md,M+N))),
store(s),
result(r),
V_size(max(2*(max_D)+1,2)), // 2 allows for d=0 V(1) special case
V_offset(max_D),
V_impl(V_size)
{}
class max_D_exceeded {}; // Exception thrown if solution has not
// been found after max_D
// frontier-expansion iterations.
// Perform the first phase of the algorithm, expanding the
// frontier.
// This function is essentially what is described in Figure 2 of
// Myers' paper.
void solve ( void )
{
// The normal operation is (H or V) then any diagonal then repeat.
// But this is broken if the first diagonal starts from the origin (e.g. for equal strings).
// The following is a hack that works around that:
V(1) = 0;
// But this requires that V is large enough for this extra element.
// Loop for increasing values of D until target reached
int D = -1;
bool done=false;
while (!done) {
++D;
if (D>max_D) {
throw max_D_exceeded();
}
// Scan across the width of the frontier
for ( int k = -D; k <= D; k += 2 ) {
// Find a new x value for this point on the frontier.
// Special cases for either end.
// Otherwise, move horizontally or vertically from a neighbour.
int x;
if ( (k==-D) || ((k!=D) && (V(k-1)<V(k+1))) ) {
// vertical move
x = V(k+1);
} else {
// horizontal move
x = V(k-1)+1;
}
// Having made the horizontal or vertical move, follow any
// diagonal "snakes" from this point.
int y = x - k;
while ( (x<N) && (y<M) && (A[x]==B[y]) ) {
++x;
++y;
}
// Store the new x value for this point on the frontier.
V(k) = x;
// Test for reaching target
if ( (x>=N) && (y>=M) ) {
done = true;
// We could probably leave the inner loop at this point,
// but I'm not certain it's safe, and it certainly makes
// debugging harder when only some of the points have been
// updated, so don't bother.
//break;
}
}
// Save a copy of V for use during the second phase.
if (store) {
stored_V_impls.push_back(V_impl);
}
}
edit_distance = D;
}
void find_trace ( void )
{
find_trace_r ( edit_distance, N-M );
}
int get_edit_distance(void) { return edit_distance; }
};
void make_trivial_solution ( const string& A, const string& B, string_fragment_seq& result )
{
result.push_back(make_pair(from_a,A));
result.push_back(make_pair(from_b,B));
}
void string_diff ( const string& A, const string& B, string_fragment_seq& result )
{
// Consider time efficiency. Aim not to take more than this much
// time (arbitary units). Return a sub-optimal solution if this
// time is exceeded.
const int max_time = 1000;
// Consider space efficiency. Aim not to use more than this much
// memory (arbitary units). Take more time or return a
// sub-optimal solution if this much memory is exceeded.
const int max_mem = 10000000;
// Consider changing the above settings if "top" shows that the
// apache frontend request-handling processes are using more
// memory or CPU time than you would like.
// Reducing them means that Anyterm will give up looking for an
// edit script and just send the complete new screen sooner. So
// making them too low will use more network bandwidth. On the
// other hand, for a fast local network, you might get a faster
// response with a lower max_time setting.
try {
int sz = A.size() + B.size();
// If input is small, i.e. N^2 is acceptable, we don't worry about
// space complexity. (This will take O(ND) space, but D could
// equal N.)
if ((sz*sz)<max_mem) {
Differ<string> d1(A,B,result,true,max_time);
d1.solve();
d1.find_trace();
return;
}
// If input is larger, do a first pass to find the edit distance:
Differ<string> d2(A,B,result,false,max_time);
d2.solve();
// We could now solve this with space complexity O(ND), if that
// were acceptable:
if (sz*d2.get_edit_distance()<max_mem) {
Differ<string> d3(A,B,result,true,d2.get_edit_distance());
d3.solve();
d3.find_trace();
return;
}
// If even O(ND) is not acceptable, we give up and return a result
// indicating no common subset:
make_trivial_solution(A,B,result);
}
catch (Differ<string>::max_D_exceeded) {
make_trivial_solution(A,B,result);
}
}
void make_trivial_solution ( const ucs4_string& A, const ucs4_string& B, ucs4_string_fragment_seq& result )
{
result.push_back(make_pair(from_a,A));
result.push_back(make_pair(from_b,B));
}
void ucs4_string_diff ( const ucs4_string& A, const ucs4_string& B, ucs4_string_fragment_seq& result )
{
// Consider time efficiency. Aim not to take more than this much
// time (arbitary units). Return a sub-optimal solution if this
// time is exceeded.
const int max_time = 300;
// Consider space efficiency. Aim not to use more than this much
// memory (arbitary units). Take more time or return a
// sub-optimal solution if this much memory is exceeded.
const int max_mem = 10000000;
// Consider changing the above settings if "top" shows that the
// apache frontend request-handling processes are using more
// memory or CPU time than you would like.
// Reducing them means that Anyterm will give up looking for an
// edit script and just send the complete new screen sooner. So
// making them too low will use more network bandwidth. On the
// other hand, for a fast local network, you might get a faster
// response with a lower max_time setting.
try {
int sz = A.size() + B.size();
// If input is small, i.e. N^2 is acceptable, we don't worry about
// space complexity. (This will take O(ND) space, but D could
// equal N.)
if ((sz*sz)<max_mem) {
Differ<ucs4_string> d1(A,B,result,true,max_time);
d1.solve();
d1.find_trace();
return;
}
// If input is larger, do a first pass to find the edit distance:
Differ<ucs4_string> d2(A,B,result,false,max_time);
d2.solve();
// We could now solve this with space complexity O(ND), if that
// were acceptable:
if (sz*d2.get_edit_distance()<max_mem) {
Differ<ucs4_string> d3(A,B,result,true,d2.get_edit_distance());
d3.solve();
d3.find_trace();
return;
}
// If even O(ND) is not acceptable, we give up and return a result
// indicating no common subset:
make_trivial_solution(A,B,result);
}
catch (Differ<ucs4_string>::max_D_exceeded) {
make_trivial_solution(A,B,result);
}
}
};