Monday, April 12, 2010

Boyer-Moore Horspool in Delphi 2010

Today I found myself in need of a string search algorithm that must meet the following requirements:
1. fast as possible
2. work on ANSI and Unicode version of Delphi
3. must be able to search both case sensitive and insensitive
4. must be able to search from a given position in a string
5. it's code must be as easier to read as possible
Of course, the first thing I did(like anyone would do) was to search the Internet using Google Almighty for a Boyer-Moore implementation(preferably in Delphi) that would meet at least half of the requirements listed above, guess what? I did not found one that would NOT give a error.
Next step: „back to the drawing board¯ -- „If you want something done right, you must do it yourself¯ or something like that.
Here's with what I came up:
function FindString(const Value, Pattern: string;
  const CaseSensitive: Boolean = True;
  const StartPos: Integer = 1): integer;
var
  Index: Integer;
  jIndex: Integer;
  kIndex: Integer;
  LLenPattern: Integer;
  LLenValue: Integer;
  LSkipTable: array[Char] of Integer;
  LFound: Boolean;
  LChar: Char;

    function __SameChar: Boolean;
    begin
      if CaseSensitive then
        Result := (Value[Index] = Pattern[jIndex])
      else
        Result := (CompareText(Value[Index], Pattern[jIndex]) = 0);
    end; // function __SameChar: Boolean;

begin
  LFound := False;
  Result := 0;
  LLenPattern := Length(Pattern);
  if LLenPattern = 0 then begin
    Result := 1;
    LFound := True;
  end; // if LLenPattern = 0 then begin
  for LChar := Low(Char) to High(Char) do
    LSkipTable[LChar] := LLenPattern;
  if CaseSensitive then begin
    for kIndex := 1 to LLenPattern -1 do
      LSkipTable[Pattern[kIndex]] := LLenPattern -kIndex;
  end else begin
    for kIndex := 1 to LLenPattern -1 do
      LSkipTable[Windows.CharLower(@Pattern[kIndex])^] := LLenPattern -kIndex;
  end; // if CaseSensitive then begin
  kIndex := LLenPattern + (StartPos -1);
  LLenValue := Length(Value);
  while (NOT LFound) and (kIndex <= LLenValue) do begin
    Index := kIndex;
    jIndex := LLenPattern;
    while (jIndex >= 1) do begin
      if __SameChar then begin
        jIndex := jIndex -1;
        Index := Index -1;
      end else
        jIndex := -1;
      if jIndex = 0 then begin
        Result := Index +1;
        LFound := True;
      end; // if jIndex = 0 then begin
      kIndex := kIndex + LSkipTable[Value[kIndex]];
    end; // while (jIndex >= 1) do begin
  end; // while (NOT LFound) and (kIndex <= LLenValue) do begin
end;
If you find any problems(or improvements) in the algorithm please let me know.

14 comments:

  1. Dorin,

    Would you consider extending the TurboHashedStringList to duplicate the TStringList.Objects property?

    ReplyDelete
  2. Hi SportsGuy,
    until now, no one asked for that and I had no need for that yet.
    I'm pretty tied up for the moment and time is pretty limited, but you could implement that very easily, because each entry in the list hash also a pointer that can point to a object therefore extending the TurboHashedStringList class would be simple as 1, 2, 3.
    Bottom line I will extend it as soon as I get some free time.

    ReplyDelete
  3. Dorin,

    Here is what I added. I used the TStringList code in Classes.pas as my model.

    First, the private setter and getter:

    procedure PutObject(Index: Integer; AObject: TObject);
    function GetObject(Index: Integer): TObject;

    Then, the property declaration itself:

    property Objects[Index: Integer]:TObject read GetObject write PutObject;

    Finally, the setter and getter implementation:

    function TurboHashedStringList.GetObject(Index: Integer): TObject;
    begin
    if (Index < 0) or (Index >= self.Count) then result:=nil;
    Result:=PStringRec(FList[Index]).ObjectRef;
    end;

    procedure TurboHashedStringList.PutObject(Index: Integer; AObject: TObject);
    begin
    if (Index < 0) or (Index >= self.Count) then exit;
    PStringRec(FList[Index]).ObjectRef:=AObject;
    end;

    Does that look about right?

    ReplyDelete
  4. Thank you for investing a few mins on this, I will update the class immediately and create a new post for it.
    I thing tho'
    PStringRec(FList[Index]).ObjectRef:=AObject;
    should be replaced with
    PStringRec(FList[Index])^.ObjectRef:=AObject;
    It's better to dereference the pointer even tho' the compiler is smart enough to do it for you.

    ReplyDelete
  5. UpCase should not be used since it works *only* for ASCII characters!

    ReplyDelete
  6. @Anonymous yes... maybe LoCase should be used since it's a Windows routine function LoCase(ch: PWideChar): PWideChar; but this is just an example.

    ReplyDelete
  7. I was just looking for a b-m delphi version that handled unicode

    This looked promising, but you do know that it doesn't work ?

    Test app, 1 button, 1 memo
    Button click code here


    procedure TForm1.btn1Click(Sender: TObject);
    const
    NumItems = 150000;
    var
    x, y, z: Integer;
    t1: Int64;
    FindStr, DataStr: string; s:string;
    begin
    DataStr:='';
    z:= (2* (NumItems div 3) );
    // build a string like
    // "0 asdshfty","1 jgyrkdsl" etc
    for x := 0 to NumItems - 2 do
    begin
    s:= '"'+IntToStr(x)+ ' ';
    for y := 0 to 7 do
    s:=s+char( Ord('A')+ random(25) );
    DataStr :=datastr+ s+'",';
    if x = z then
    FindStr :=s
    end;
    s:= '"'+IntToStr(NumItems-1)+ ' ';
    for y := 0 to 7 do
    s:=s+char( Ord('A')+ random(25) );
    DataStr :=datastr + s +'"';

    x:= pos(FindStr,DataStr);
    memo1.Lines.Add(IntToStr(x));

    y:= FindString( DataStr ,FindStr,1,true);
    memo1.Lines.Add(IntToStr(x));

    if x=y then
    memo1.lines.add('OK')
    else memo1.lines.add('Fail');
    end;

    ReplyDelete
  8. did you test it? This don't working (D2010), always ret -1

    ReplyDelete
  9. I've rewrote the search function and done few test cases, unfortunately I do not have enough time to test more cases, so If you find bugs let me know.

    ReplyDelete
  10. Hi, very interesting article, I'm wondering if that could be modified to search a portion of a .bmp image inside a bigger one. Is that doable ?

    Thanks in advance

    ReplyDelete
  11. Yes it is, but be more clear on what you wish to do, do you wish for example to search for a portion of a bitmap inside other bitmap with exact color intensity? if so you can do it with scanline method from TBitmap class.
    But if you wish to do the search more abstract then you should be more specific on your expected results, tolerance, etc. let me know with precise specifications.

    ReplyDelete
  12. Sorry for the delay, I had lost my bookmarks :\

    Yes indeed, I wanted to search for a portion of a bitmap inside another one with exact color matching.

    That would be great

    Regards

    Mike

    ReplyDelete
  13. No worries, searching inside bitmaps is a bit tricky, because a bitmap can be flipped, mirrored, etc. therefore please give more details of what you wish to achieve exactly and if it can have impact on a larger scale(meaning that more people can benefit from the search algorithm) I will spend some time on this.

    ReplyDelete
  14. If you have some time, maybe you can add wildcard support. Actually I still having found that one that works with >D2009.

    ReplyDelete

Blogroll(General programming and Delphi feeds)