Floating Point Addition Question

The following code is actually C, but I am hoping someone can see where my error is. I am trying to add two floating-point numbers by extracting their bits, shifting the mantissas so the binary points line up, and adding them to form the answer. The problem is that I do not get the correct sum. I think the error lies in my final statement, where the result is reassembled: am I not passing the correct value back into it? Thank you for the help.

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <assert.h>

int isNegative (float f)
{
    unsigned int* iptr = (unsigned int*)&f;
    return ( ((*iptr) & 0x80000000) ? 1:0);
}

unsigned char getExponent (float f)  // Purpose is to return the 8 bit exponent of the floating point value
{
         unsigned int* iptr = (unsigned int*)&f;
         return (((*iptr >> 23) & 0xff)) ;
         
}

unsigned int getMantissa (float f) // Purpose to return the 24 bit mantissa of the floating point value.
{
         unsigned int* iptr = (unsigned int*)&f;
         if( *iptr == 0 ) return 0;
         return ((*iptr & 0xFFFFFF) | 0x800000 );
        
}

float sum (float left, float right) // Purpose to return the sum of the floating point values left & right .
{
      // Will obtain the exponents of the left & right and will obtain the mantissa.
      unsigned int littleMan;
      unsigned int bigMan;
      unsigned char littleExp;
      unsigned char bigExp;
      unsigned char lexp = getExponent(left);
      unsigned char rexp = getExponent(right);
      
     
   
if (lexp > rexp)
{
         bigExp = lexp;
         bigMan = getMantissa(left);
         littleExp = rexp;
         littleMan = getMantissa(right);
}
else
{
    bigExp = rexp;
    bigMan = getMantissa(right);
    littleExp = lexp;
    littleMan = getMantissa(left);
}


printf("little: %x %x\n", littleExp, littleMan);
printf("big:    %x %x\n", bigExp, bigMan);

// Purpose is to extract the difference in exponent values to determine how much to shift the mantissa
int expSub = (bigExp - littleExp);
printf("Subtraction of the Exp: %x\n", expSub);

// Purpose is to shift the smaller mantissa to align the binary points.
int shifta = (littleMan >> expSub);
printf("The value of the Exp after the shift:  %x\n", shifta);

// Purpose is to add mantissas
int addMantissa = (bigMan + shifta);
printf("The value of the two Mantissas added:  %x\n", addMantissa);  


// Purpose: if the summed mantissa is too big (it extends past the 24th bit), shift it right to fit, update bigExp to compensate for the shift, and strip the hidden bit.
if (addMantissa > 0xFFFFF)//0x7fffff)
{
    addMantissa = addMantissa >> 1;
    ++bigExp;
}


// Purpose is to reassemble the floating point number 

unsigned int result =  (expSub + 127)<<23 | (addMantissa & 0xfffff) ;
printf ("This is the addition:  %x\n", result);

float fresult = *(float*)&result;
return(fresult);

}

int main()
{
    const int SIZE = 256;
    char line[SIZE];
    
    while (1)
    {
          float f1;
          float f2;
          float left = f1;
          float right = f2;
          
          printf("Please enter the first float ( \"q\" to quit):");
          fgets(line,SIZE,stdin);
          
          if (toupper(line[0]) =='Q')
          break;
          
          f1 = atof(line);
          
          printf("Please enter the second float ( \"q\" to quit):");
          fgets(line,SIZE,stdin);
          
          if (toupper(line[0]) == 'Q')
          break;
          
          f2 = atof(line);
          
          if (isNegative(f1) || isNegative(f2))
          printf ("One of these is negative, but %g + %g == %g\n", f1,f2,sum(f1,f2));
          else
          printf("%g + %g == %g\n", f1,f2,sum(f1,f2));
}

return(EXIT_SUCCESS);
}

jsmith (5804)

I looked at your code, and I think there are a couple of errors, but in correcting them I still was not able to get the correct output.

You might find this link useful (this is what I was using to verify each step):

http://babbage.cs.qc.edu/IEEE-754/
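
For checking each step against a converter like the one linked above, it can help to split a float into its three fields first. The following is only a sketch: `FloatFields` and `split_float` are hypothetical names that do not appear in the thread, and it assumes `float` is a 32-bit IEEE-754 single, which the C standard does not guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three IEEE-754 single-precision fields. */
typedef struct {
    uint32_t sign;      /* bit 31 */
    uint32_t exponent;  /* bits 30..23, still biased by 127 */
    uint32_t fraction;  /* bits 22..0, hidden bit NOT included */
} FloatFields;

/* Assumption: float is a 32-bit IEEE-754 single. */
FloatFields split_float(float f)
{
    uint32_t bits;
    FloatFields out;

    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);       /* copy the bytes instead of casting pointers */

    out.sign     = bits >> 31;            /* top bit only */
    out.exponent = (bits >> 23) & 0xFFu;  /* after the shift, the low 8 bits are the exponent */
    out.fraction = bits & 0x7FFFFFu;      /* low 23 bits */
    return out;
}
```

For 1.5f (bit pattern 0x3FC00000) this yields sign 0, exponent 0x7F and fraction 0x400000, which can be compared field by field with the converter's output.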

exception (323)

You can get bit access via a union (they *do* have unions in C, don't they?). Have a look here:
http://www.cplusplus.com/forum/articles/3827/
That should be easier ;-) And without having looked at your code, are you sure you are on a little/big endian machine (whatever you assume)?
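
A minimal sketch of the union approach in C follows. The names `FloatBits`, `float_to_bits` and `bits_to_float` are made up for illustration, and the code assumes a 32-bit IEEE-754 `float`. As far as I know, reading a union member other than the one last written is permitted in C (the bytes are reinterpreted) but not in C++. Because the whole value is read as one integer and then shifted and masked, byte order should not matter as long as the float and the integer use the same byte order, which is the case on common platforms.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical union: a float and an integer sharing the same 4 bytes. */
typedef union {
    float    f;
    uint32_t u;
} FloatBits;

/* Return the raw bit pattern of f (assumes a 32-bit float). */
uint32_t float_to_bits(float f)
{
    FloatBits fb;
    assert(sizeof fb.f == sizeof fb.u);
    fb.f = f;
    return fb.u;
}

/* Build a float from a raw bit pattern. */
float bits_to_float(uint32_t u)
{
    FloatBits fb;
    fb.u = u;
    return fb.f;
}
```

For instance, `float_to_bits(1.0f)` gives 0x3F800000, and `bits_to_float(0x40000000u)` gives 2.0f.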

jsmith (5804)

Ok, I wrote my own version of this and got it working.

What I found with your code is the following:

1) getExponent is returning the sign bit also; you need to mask with 0x7F instead
2) The if statement on line 72, I think, needs to be a while loop instead, and the terminating condition is that bits 24-31 are all zero.
3) You need another loop after line 72. Basically you need to ensure that the mantissa has the 24th bit set to 1. Your if statement handles the overflow case, but you also need to handle the underflow case.
4) Your "result =" expression needs to mask off the sign bit, and also the 0xFFFFF needs to be 0x7FFFFF.
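
The working version mentioned above was not posted, so the following is only one possible sketch of the same steps, not that code. It assumes a 32-bit IEEE-754 `float` and handles only non-negative, normalized, finite inputs. Bits shifted out are truncated rather than rounded, so the last bit can differ from hardware addition. The names `to_bits`, `from_bits` and `add_positive` are hypothetical. In this sketch the exponent is extracted with `(bits >> 23) & 0xFF`, the overflow threshold is 0xFFFFFF, and the result is rebuilt from the larger exponent rather than from the exponent difference.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t to_bits(float f)
{
    uint32_t u;
    assert(sizeof u == sizeof f);
    memcpy(&u, &f, sizeof u);
    return u;
}

static float from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Sketch: add two non-negative, normalized, finite floats bit by bit.
   Not handled: negative values, denormals, infinities, NaN, rounding,
   and exponent overflow past 254. */
float add_positive(float left, float right)
{
    uint32_t lb = to_bits(left), rb = to_bits(right);
    uint32_t big_exp, big_man, little_exp, little_man, shift, man;

    if (lb == 0) return right;                 /* 0 + x == x */
    if (rb == 0) return left;

    if (((lb >> 23) & 0xFFu) > ((rb >> 23) & 0xFFu)) {
        uint32_t tmp = lb; lb = rb; rb = tmp;  /* make rb the larger exponent */
    }
    big_exp    = (rb >> 23) & 0xFFu;
    little_exp = (lb >> 23) & 0xFFu;
    big_man    = (rb & 0x7FFFFFu) | 0x800000u; /* restore the hidden bit */
    little_man = (lb & 0x7FFFFFu) | 0x800000u;

    /* Align binary points; shifting a 32-bit value by 32 or more is undefined. */
    shift = big_exp - little_exp;
    little_man = (shift < 32) ? (little_man >> shift) : 0;

    man = big_man + little_man;

    /* Overflow: shift right until bits 24..31 are all zero. */
    while (man > 0xFFFFFFu) { man >>= 1; ++big_exp; }
    /* Underflow: shift left until bit 23 is set. Not reached when adding
       positives, but needed once subtraction is supported. */
    while (man != 0 && !(man & 0x800000u)) { man <<= 1; --big_exp; }

    /* Reassemble: sign 0, exponent, and 23 fraction bits (hidden bit stripped). */
    return from_bits((big_exp << 23) | (man & 0x7FFFFFu));
}
```

With exactly representable sums the truncation does not matter: `add_positive(1.5f, 2.25f)` gives 3.75f and `add_positive(1.0f, 1.0f)` gives 2.0f.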

