Prevent overflow in Linear Regression algorithm

I am implementing a linear regression algorithm to generate market price predictions from price values the program reads from a CSV file. The vectors ordersX and ordersY are created in another class from a CSV file containing cryptocurrency prices. B0 is the value of Y when X = 0, and B1 is the regression coefficient (the change in the dependent variable per unit change in the independent variable). The code below fits a regression line that is used to generate predictions about future prices. How do I edit the algorithm to limit the estimated B0 and B1 values so that overflow doesn't happen? The values B0 and B1 sometimes go to infinity.

#include "LinearRegression.h"
#include <iostream>
#include <algorithm>
#include <cmath>     // std::abs(double)
#include <vector>

/* sorts based on absolute value of error */
bool LinearRegression::custom_sort(double a, double b)
{
    return std::abs(a) < std::abs(b);
}

void LinearRegression::gradientDescent(std::vector<OrderBookEntry>& ordersX, std::vector<OrderBookEntry>& ordersY)
{
    /*Intialization Phase*/
    double err; // for calculating error on each stage
    double b0 = 0; // represents the value of y when x = 0
    double b1 = 0; // represents the change in y based on the unit change in x
    double alpha = 0.01; // learning rate
    std::vector<double>error; // array to store all error values

    /*Training Phase*/
    for (int i = 0; i < 200; i++) // loop 200 times since there are 50 values and I want 4 epochs
    {
        int index = i % 50; // for accessing index after every epoch
        double p = b0 + b1 * ordersX[index].price; // calculate prediction
        err = p - ordersY[index].price; // calculate error
        b0 = b0 - alpha * err;
        b1 = b1 - alpha * err * ordersX[index].price;
        error.push_back(err);
    }

    std::sort(error.begin(), error.end(), &LinearRegression::custom_sort); // sorting based on error values
    std::cout << "Final Values are: " << "B0=" << b0 << " " << "B1=" << b1 << " " << "error=" << error[0] << std::endl;
}


sample data: https://wtools.io/paste-code/b2S8

output:

Product: BTC/USDT
Final Values are: B0=-nan(ind) B1=-nan(ind) error=-5352

Product: DOGE/BTC
Final Values are: B0=2.85961e-07 B1=9.47116e-14 error=-2.63192e-08

Product: DOGE/USDT
Final Values are: B0=0.00144125 B1=2.39817e-06 error=-0.00022323

Product: ETH/BTC
Final Values are: B0=0.0189528 B1=0.000414906 error=-0.00296171

Product: ETH/USDT
Final Values are: B0=-nan(ind) B1=-nan(ind) error=-117.329
I don't think that's normally how you compute linear regression. The problem I see is that each set of values depends on the previous set, so any errors are cumulative.

If you post some sample data and a program that can be compiled and run, we can probably help you better.
At a quick glance (and I am rusty): are you sure the data forms a linear cluster such that a good line can be found? I don't think the solution is to limit the variables; that will just break your output. I think you either have a data problem or an implementation problem, and I don't see the implementation problem if there is one.
That simply isn't linear regression.

Your slope (B1) should be (N*Σxy - Σx*Σy) / (N*Σxx - (Σx)^2)

There is no way that is going to run into trouble unless all the x values are the same as each other.


Where did you get your formulae from?



BTW Your data is very weird - it looks as if you reorder x and y values at several points.
I got the formulae from here: https://www.analyticsvidhya.com/blog/2020/04/machine-learning-using-c-linear-logistic-regression/

The data was given to us for the assignment.
Try searching for it without the machine learning part; that adds a level of complexity to it.
Or review just the math, then try to code it from scratch. A lot of the examples use C++'s std::inner_product; if you can't use it, you need to write one.
You can compare standard linear regression with your slightly odd version. I think I'll stick to the standard.

I've raised the coefficient alpha to 0.1 here, but it can blow up if you get that wrong.

#include <iostream>
#include <sstream>    // istringstream
#include <iomanip>
#include <vector>
using namespace std;

// Some data
istringstream str( "1  3\n"
                   "2  5\n"
                   "3  7\n"
                   "4  9\n"
                   "5 11\n" );


//======================================================================


struct Data{ double x, y; };


//======================================================================


vector<Data> getData( istream &in )
{
   vector<Data> result;
   for ( double x, y; in >> x >> y; ) result.push_back( { x, y } );
   return result;
}


//======================================================================


void regression( const vector<Data> &data, double &m, double &c )
{
   int N = data.size();
   double Sx = 0, Sy = 0, Sxx = 0, Sxy = 0, Syy = 0;
   for ( Data d : data )
   {
      double x = d.x, y = d.y;
      Sx += x;
      Sy += y;
      Sxx += x * x;
      Sxy += x * y;
      Syy += y * y;
   }
   m = ( N * Sxy - Sx * Sy ) / ( N * Sxx - Sx * Sx );  // slope
   c = ( Sy - m * Sx ) / N;                            // intercept
}


//======================================================================


void training( const vector<Data> &data, double &m, double &c, double alpha, int passes )
{
   m = c = 0.0;
   while( passes-- )
   {
      for ( Data d : data )
      {
         double error = m * d.x + c - d.y;
         c -= alpha * error;
         m -= alpha * error * d.x;
      }
   }
}


//======================================================================


void write( const vector<Data> &data, double m, double c )
{
   #define fmt << setw( 20 ) <<
   cout << "Regression line is y = " << m << "x + " << c << "\n\n";
   cout << fixed << setprecision( 6 );
   cout << "For comparison (x, y, ypred):\n";
   for ( Data d : data ) cout fmt d.x fmt d.y fmt m * d.x + c << '\n';
}


//======================================================================


int main()
{
   double m, c;                        // slope and intercept; y = mx+c
   vector<Data> data = getData( str );

   cout << "Read " << data.size() << " points\n\n";


   // STANDARD METHOD
   cout << "Regression (STANDARD METHOD)\n";
   regression( data, m, c );
   write( data, m, c );


   // GRADIENT-DESCENT METHOD
   int passes = 5;
   double alpha = 0.1;
   cout << "\n\nRegression (GRADIENT DESCENT)\n";
   training( data, m, c, alpha, passes );
   write( data, m, c );
}


Read 5 points

Regression (STANDARD METHOD)
Regression line is y = 2x + 1

For comparison (x, y, ypred):
            1.000000            3.000000            3.000000
            2.000000            5.000000            5.000000
            3.000000            7.000000            7.000000
            4.000000            9.000000            9.000000
            5.000000           11.000000           11.000000


Regression (GRADIENT DESCENT)
Regression line is y = 1.999508x + 1.002338

For comparison (x, y, ypred):
            1.000000            3.000000            3.001846
            2.000000            5.000000            5.001354
            3.000000            7.000000            7.000862
            4.000000            9.000000            9.000371
            5.000000           11.000000           10.999879
Topic archived. No new replies allowed.