Searching

Searching Techniques

Developed by Hardeep Singh
Copyright	© Hardeep Singh, 2000-2009
EMail	[email protected]
Website	SeeingWithC.org
All rights reserved.
The code may not be used commercially without permission.
The code does not come with any warranties, explicit or implied.
The code cannot be distributed without this header.

Problem: Given a list of distinct numbers (LIST), and another number (A), find the index of A in LIST if it is present or indicate that it is not present.

Although we have narrated the problem for numbers, it can actually refer to any datatype that can be compared for equality and arranged in order, such as strings.

Note: In the text below, lg(n) means log of n to base 2.

Solution #1: The naive approach

The simplest way, of course, is to compare the given number to each of the numbers in the list. If a number in the list matches the given number A, we can return the index of that number. If we reach the end of the list, we can indicate that A does not exist in LIST. Here is an implementation of this simple algorithm:

#include <stdio.h>

#define MAXELT 50

int main()
{
        int i=-1,n=0,list[MAXELT],a;
        char t[10];
        int linear_search(int list[],int n,int a);

        do {                                    //read the list
                if (i!=-1)
                        list[i++]=n;
                else
                        i++;
                printf("\nEnter the numbers <End by #>");
                fgets(t,sizeof t,stdin);
                if (sscanf(t,"%d",&n)<1)
                        break;
        } while (1);

        printf("\nEnter the number to be searched for >");
                                                //read the number
        scanf("%d",&a);

        int pos=linear_search(list,i,a);        //do linear search

        if (pos==-1)                            //print the result
                printf("\nThe number does not exist in the list.\n");
        else
                printf("\nThe number exists at location %d.\n",pos+1);
        return(0);
}

int linear_search(int list[],int n,int a)
{
        int i;
        for (i=0;i<n;i++)                       //step through the list
                if (list[i]==a)                 //if found, return
                        return i;
        return -1;                              //else return not found
}

//Note that we did not even need to store the numbers in a list in this
//case. We could have kept comparing as we received. However, if multiple
//searches are to be done on a single list, this is not possible. In this
//case, we used this method to ensure conformity with other methods to be
//shown next.

Since this algorithm compares the given number to all N elements of the list, it is an O(n) algorithm. Due to this reason, this algorithm is known as Linear Search.

Better search algorithms do exist and we will come to that, however this one is quite flexible: it doesn't require any restriction on the order in which numbers appear in the input list. So it can be used even if the list is not sorted or if the medium on which the list is stored requires sequential access (eg. Tapes).

Solution #2: Binary Search

This algorithm goes as follows (it requires a sorted list in input, unlike the previous algorithm):

Find the middle element of the list.
Compare the middle element with A (the given element).
If both are equal, return the index of the middle element.
If they are not equal, then
1. if A is smaller than the middle element, then we must search for A in the upper half of the list (remember: the list is sorted)
2. otherwise, we must search in the lower half.
Now in the smaller list, find the middle element and repeat the process.

An implementation of the algorithm is given below:

(Here, we show only the changed function binary_search. The rest of the program is the same as in linear search, just change linear_search function with binary_search and change all references to linear_search.)

int binary_search(int list[],int n,int a)
{
        int first=0,last=n-1,mid;

        while (first<=last) {                   //iterate while first<=last
                mid=(first+last)/2;             //calculate mid=trunc((f+l)/2)
                if (list[mid]==a)               //found
                        return mid;
                else if (a<list[mid])           //not found - look in the
                        last=mid-1;             //upper half of list
                else
                        first=mid+1;            //look in lower half
        }
        return -1;                              //return "not found"
}

Some people also implement binary search in the following recursive way:

int binary_search_dash(int list[],int first,int last,int a)
{
        int mid;

        if (first<=last) {
                mid=(first+last)/2;
                if (list[mid]==a)
                        return mid;
                else if (a<list[mid])
                        return binary_search_dash(list,first,mid-1,a);
                else
                        return binary_search_dash(list,mid+1,last,a);
        }
        return -1;
}

Although this is a more direct implementation of the above description, it uses more stack space compared to the first implementation, and is much slower on most systems. Also, this form of recursion is called 'Tail Recursion', which is the worst form of recursion. Recursion is a powerful tool which must be used with care. Recursion is advised only in certain cases, where it does not impose unnecessary action, as in Quicksort.

Binary search is O(lg(n)) as it halves the list size in each step. It is a large improvement over linear search; for a list with 10 million entries, while linear search will need 10 million key comparisons, binary search will need just about 24. Although binary search is already very good, at times it can be slightly improved, lets see how.

Solution #3: Fibonacci Search

Who does not know about Fibonacci numbers? They crop up in so many diverse problems that can't even be listed! From estimation of the number of rabbits in successive generations, to the number of leaves on branches, fibonacci numbers have far-reaching uses. In future Seeing With C might feature a write-up on different ways of generating them, here we will see ways of using them! (Generation of Fibonacci numbers is another misuse of recursion.)

The fibonacci series has 1 as the first two terms and each successive term is the sum of two previous terms. So, it goes 1,1,2,3,5,8,13,21,44,...etc. Each number appearing in the fibonacci series is called fibonacci number.

Fibonacci search changes the binary search algorithm slightly: instead of halving the index for a search, a fibonacci number is subtracted from it. The fibonacci number to be subtracted decreases as the size of the list decreases.

Have a look at its implementation:

int fibonacci_search(int list[],int n,int a)
{
        int f1,f2,mid,first,index;

        f1=1; f2=0; mid=2;                      //initialise the first two fibonacci
                                                //numbers. F1 will be the main
        while (f1<n) {                          //set F1 to a number >= n
                f1=f1+f2;
                f2=f1-f2;
                mid++;
        }
        f2=f1-f2;                               //set F1 to the largest number <=n
        f1=f1-f2;
        mid--;
        first=0;

        while (mid>0) {                         //loop
               index=first+f1;
               printf("\nfirst:%d,index:%d,f1:%d,list:%d,mid:%d,a:%d",first,index,f1,list[index],mid,a);
               if (index>=n || a<list[index]) {
                      mid--;                    //if the number is bigger, move back
                      f2=f1-f2;                 //to a smaller fibonacci number
                      f1=f1-f2;
               }
               else if (a==list[index])         //found: return the index
                      return(index);
               else {
                      first=index;              //if the number is smaller, move to the
                      mid=mid-2;                //second part of the list and
                      f1=f1-f2;                 //reduce F1 back two F-numbers
                      f2=f2-f1;
               }
        }

        return(-1);                             //bad luck: not found
}

Whew, this one has the longest code so far. It is also O(lgn) algorithm.

In order to compare the three algorithms shows, we did a small test, on six lists with sizes varying from 15 to 1000. On the same list, we performed four searches each - a mix of existing and non existing numbers. The first list was not sorted - hence the result from binary and fibonacci searches were expected to be wrong. The number of comparisons carried out by each algorithm was tracked.
The following graphs show the comparison - the first one the left shows the result from a single individual search, and the second cumulative from all searches:

Its obvious that Binary search is the best of all (for sorted data, of course) - however the margin with Fibonacci search isn't huge.
So what is Fibonacci search useful for? One of the key advantages with Fibonacci search over Binary search is that comparison dispersion is low, so for example if Binary and Fibonacci searches just compared element 100, and Binary search now compares element 200, Fibonacci search will now compare element 150. This means that for mediums that are accessed sequentially, especially tape drives, Fibonacci search makes more sense.

ADDENDUM

In this part, I will express the linear and the binary search algorithms in 80386 assembly language. This, is towards the following goals:

For people who don't understand assembly/machine languages, its instructive to get some idea of what happens during the compilation of the code that they write. Especially notice the fact that what look like two distinct key comparisons in the main loop of binary search can be achieved with just one CoMPare instruction at the assembly level, therefore it is always counted as one.
For those who know a little assembly but work mainly in C/C++, this is one more example to work through.
For those who already know assembly, I want to provide them with ready made timesaver routines. Even while writing C/C++ software, these assembly routines can be used to speed things up, since all C/C++ compilers allow facilities to write inline assembly code.

Linear Search

Precondition: CX holds the number of entries, SI has been initialised to first array element, D=0.

AGAIN: LODSD            ; The instruction reads the next number from memory
                        ; into a CPU register(EAX)
       CMP EAX,EBX      ; Compare registers EAX and EBX. EBX already holds
                        ; the value to be found
       LOOPNE AGAIN     ; Loop if EAX and EBX Not Equal and all entries have
                        ; not been exhausted - to statement labeled AGAIN
       CMP EAX,EBX      ; Out of the loop - Two things are possible: EAX and
                        ; EBX were found equal, or all list entries exhausted
                        ; So, compare to make sure
       JNE LABEL1       ; Jump if Not Equal to LABEL1
       ...              ; Code to announce the fact that the entry has been
                        ; found at location marked by CX and exit
LABEL1:...              ; Code to announce that the number has not been
                        ; found and exit.

Binary Search

AGAIN: MOV AL,FIRST     ; Register AL <- Memory location containing FIRST
       CMP AL,LAST      ; CoMPare AL (FIRST) with LAST
       JA NOTFOUND      ; Jump if Above - to NOTFOUND since FIRST>LAST
       ADD AL,LAST      ; ADD LAST to AL. Al now contains FIRST+LAST
       SHR AL,1         ; SHift Right AL once, this will divide it by 2
       MOV MID,AL       ; Send current value of AL to memory location MID
       MOVZX AX,AL      ; AX <- AL. (This is called MOV Zero eXpanded)
       SHL AX,2         ; SH Left AX by 2. This multiplies it by 4. This is
                          done because each array value is of $ bytes.
       ADD AX,OFFSET ARRAY
                        ; Now AX contains the address of MID element
       MOV SI,AX        ; SI <- AX
       LODSD            ; Read value from memory location SI into EAX
       CMP EAX,N        ; CoMPare EAX with N, the number to be found
       JE FOUND         ; Jump if Equal to FOUND
       JL LESS          ; Jump if Less to LESS (to lower list). Note that
                        ; no comparison is done here, just a jump. Just one
                        ; comparison was done in CMP statement above. The flags
                        ; hold the result which is checked by the jump
                        ; instruction. Jump is equivalent to a simple goto
                        ; in high level language; it does not do any inherent
                        ; comparison
       MOV AL,MID       ; AL <- MID
       DEC AL           ; AL <- AL-1
       MOV LAST,AL      ; LAST <- AL. These three instructions accomplish
                        ; LAST=MID-1.
       JMP AGAIN        ; Iterate
LESS:  MOV AL,MID       ; AL <- MID
       INC AL           ; AL <- AL+1
       MOV FIRST,AL     ; FIRST <- AL. These three instructions accomplish
                        ; FIRST=MID+1
       JMP AGAIN        ; JuMP

FOUND: ...              ; Code to indicate "Found number" at location contained
                          by SI and exit
NOTFOUND: ...           ; Code to indicate "Not found"

That's all folks!