min/max, mean, median.

I have a complex task and I am stuck on it.

7.84626,0.00121498,?,?,?,?,0.595974,1722821,40877036,73,601,130,45,73,1481,1479,2153,4922.35,116792,6.13849,7.86395,0.00847591,0.00737798
6.18782,0.000136137,?,?,?,?,0.595974,1722821,40877036,73,601,130,45,73,1481,1479,2153,4922.35,116792,6.13849,7.86395,0.00847591
7.86844,0.00187588,?,?,?,?,0.595974,1722821,40877036,73,601,130,45,73,1481,1479,2153,4922.35,116792,6.13849,7.86395,0.00847591,0.00737798
6.12701,0.0010252,?,?,?,?,0.595974,1722821,40877036,73,601,130,45,73,1481,1479,2153,4922.35,116792,6.13849,7.86395,0.00847591,0.00737798


This is a data file containing integers, floats, negative values and also the special character '?'. The code takes finput and scans the file line by line.
There are 23 values in a line (a1, a2, a3, ..., a23) and there are over 100 lines like this (I have listed only 4 for illustration).

I have to find the min and max for each value. For example, for a1 I have to find the min and max by scanning each line; similarly, for a2 I have to scan each line again, and so on. I need to write the results to the output file.
Similarly, I have to calculate the mean and median for every value, i.e. a1 ... a23, by scanning the 100+ lines again and again and writing the results to the output file.

Can anyone help me with this?
The flow is:

a) The user is prompted to supply the data file (I have written this code).
b) The code checks the data file and prompts the user for a response (I have written this code):
1 - For min/max values
2 - For mean of the values
3 - For median of the values
c) The user enters the response (1/2/3); the code creates an output file (or overwrites it if it exists) and performs the method chosen by the user.


I am stuck on step c, where we need to perform the operations.
Which specific aspect are you stuck with?

- reading the user's response?
- performing the min/max calculation?
- performing the mean calculation?
- performing the median calculation?
- writing the results to the output file?
Calculation of min/max and mean.

I am stuck on how to proceed with the calculations, since the values sit on a horizontal line and I have to calculate vertically.
How should I do that? That's the problem.
Using vectors? Or anything else that takes less time?
I am stuck on how to proceed with the calculations, since the values sit on a horizontal line and I have to calculate vertically.

Without seeing your code, we can't be sure what you've done, but it sounds as though you're really asking about the data types you should be reading the data into in step b.

If you have the same set of items on every line, then the most intuitive way to do it would be:

- define a structure to hold the data from one line
- define a vector containing elements of that structure type, so that each element contains the data for one line

That will be easier to manage than trying to keep several vectors synchronized at once.

Then, you can simply load all the data from the file into your vector once, and then perform whatever calculations you need to.

I have defined a struct to categorize the values as numbers (double) or strings. I think that will hold all values.

struct value
{
	bool isnumber;
	double nvalue;
	string svalue;
};


This is what I am doing when the user gives a response, i.e. 1/2/3. I don't have much experience working with vectors, so I am confused about how to start.

	while (!infile_txt.eof())
	{
		infile_txt.getline(c_line, 500);
		line = string(c_line);
		
		if (response == 1)			// for min/max
		{

		}
	
		if (response == 2)			//for Mean calculation
		{

		}

		if (response == 3)			//for Median Calculation
		{

		}
        }


Basically the assignment is pretty big, so I am not able to post the whole code; I am just posting the parts I need help with.
Could you state how the user enters the data?
First of all -

1. What does '?' imply?
2. Can you post the part where the user enters the values ?
These '?' are special characters, to be treated as missing values.

Below is the code showing how the user enters the response:

int main(int argc, char **argv)
{
	string finput, foutput;

	if (argc == 2)
	{
		finput = argv[1];
	}
	else if (argc == 3)
	{
		finput = argv[1];
		foutput = argv[2];
	}
	else
	{
		cerr << "No input file specified!" << endl;
		return 0;
	}

	int response = 0;
	char c_line[500];
	string line;

	cout << "----------------------------------------------------------------" << endl;
	cout << "1. Random Substitution Method (range between min and max values)" << endl;
	cout << "2. Mean Substitution Method " << endl;
	cout << "3. Median Substitution Method." << endl;
	cout << "----------------------------------------------------------------" << endl;
	cin >> response;
You have a 23*N table. 23 columns. The table is stored in a file, each row in separate line and the 23 values of a row separated by commas.

Does a specific column have the same data type (integer or double) on every row? Do you know the type of each column? (Ignore the special '?' when answering.)


You have 23 columns, but the min/max/mean calculation has the same options as the 1-column case has: you either

1. Read and store all values into memory first and then calculate the statistics.
OR
2. Accumulate the statistics as you read values without storing the values.

However, the median requires the first approach. Therefore, you have to use that.


You have a "sparse" table; some values are missing. There are 23 columns, but each column may have from 0 to N values. When you do encounter a ?, you will not add any value to corresponding column.


A "simple" solution is to have std::vector<std::vector<double>> foo( 23 ); where foo has one vector for each column. The downside is that every column now contains floating point values, even though the data has integers.

A more complex solution is Table foo;, where Table is a struct that has 23 vector members, one for each column.
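
A minimal sketch of the "simple" variant, assuming each line has already been split into text fields with no surrounding whitespace. Columns and append_row are hypothetical names chosen for this example. A '?' field adds nothing to its column, so the column vectors can end up with different lengths.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One vector of doubles per column (hypothetical alias for this sketch).
using Columns = std::vector<std::vector<double>>;

// Appends one already-split row to the per-column vectors.
// fields[i] is the text of column i; "?" marks a missing value and is skipped.
// Assumes every other field is a valid number (std::stod throws otherwise).
void append_row(Columns& columns, const std::vector<std::string>& fields) {
    for (std::size_t i = 0; i < columns.size() && i < fields.size(); ++i) {
        if (fields[i] == "?") continue;
        columns[i].push_back(std::stod(fields[i]));
    }
}
```

With Columns foo(23); declared once, calling append_row(foo, fields) for every line leaves foo[k] holding exactly the available values of column k.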
Hi Keskiverto,

Think about a more general perspective: we have an N*N table stored in a file, containing missing values, and data is float, integers, doubles, negative values.

There are N rows and N columns, and we have to find the min and max of every column, irrespective of the row count, and output that to another file.

How do I do that?

I haven't worked with vectors much, so I don't really understand what should be done in this case.
I am. I started with the 1*N table (just one column) special case and stated that each column is a separate case.

Why would you limit to a square table (N*N) if you already have M*N data (M=23, N=100+)?

data is float, integers, doubles, negative values.

The float, double, and int are all signed types, so "negative values" is not a separate case. Do you know that some columns have only unsigned integral values?

Do you have a format in your data; all values of one column are of same type (but some may be missing) and you do know the type of each column?

If you have no known format, then any column may contain some double values and thus all values in the table must be assumed to be double values.


Do you know your format?
I am generalizing to M*N because I have more data files to categorize later. This is the first one.

Data is float, integers, doubles. I think this covers everything, because integers cover negative values. So I have already defined a structure:
struct value
{
	bool isnumber;
	double nvalue;
	string svalue;
};


Some columns have unsigned integer values too. At the beginning of this thread I shared the first 4 rows of the data, showing what the data actually looks like.

The data is very well formatted: all the data in one "column" is the same type (but some may be missing).
Each row stores values from 23 columns, so a row has all data types, i.e. integers, floats, doubles.
We have a communication problem, because we seem to use conflicting terms.

A row is horizontal, like a line in a file. You say:
all the data in one "row" is same type

However, you did also say:
i have shared initial 4 rows

That seems somehow odd.

You say:
Each column stores values from 23 different fields

A column is vertical.

In your original data sample I see 4 rows and 23 columns. Each column has 4 values that seem to be of same type.


I have asked whether you know the datatype of each column. The answer to that question should be "no" or "yes" (preferably with a list of the type of each column). However, you repeat every time "floats, integers, doubles, etc". To me that sounds like "no".


Let's assume that each column has a type (like you sort of said). The number of rows is insignificant for your computations. If all files will have the same columns, then one program can handle them all.

If the number of columns and/or types differs in each file, a more complex program can handle them all.


If you know the type of each column in advance, then you can make a program for that case.
If the input file (or an auxiliary input file) contains a description of the type of each column, then that information can be used.
If no information about the types exists (which contradicts the "well formatted" claim), then the program has to first evaluate every value of a column in order to determine a type that can represent all the values of that column. Only after that can the processing commence.


Let's assume as a simple example that we have a two-column table with int,double and no missing values. We read input from cin and expect about 100 rows:
#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>

struct Table {
  std::vector<int> a1;
  std::vector<double> a2;

  Table( size_t estimate ) {
    a1.reserve( estimate );
    a2.reserve( estimate );
  }
};

int main() {
  Table foo(100);
  int v1;
  char comma;
  double v2;
  while ( std::cin >> v1 >> comma >> v2 ) {
    foo.a1.push_back( v1 );
    foo.a2.push_back( v2 );
  }


  // print result:
  if ( foo.a1.size() ) {
    // minmax_element returns a pair: first points to the min, second to the max
    auto result1 = std::minmax_element( foo.a1.begin(), foo.a1.end() );
    std::cout << "a1 min " << *result1.first;
    std::cout << " max "   << *result1.second << '\n';
  }
  if ( foo.a2.size() ) {
    auto result2 = std::minmax_element( foo.a2.begin(), foo.a2.end() );
    auto res2sum = std::accumulate( foo.a2.begin(), foo.a2.end(), static_cast<double>(0) );
    std::cout << "a2 min " << *result2.first;
    std::cout << " max "   << *result2.second;
    std::cout << " avg " << res2sum / foo.a2.size() << '\n';
  }
  return 0;
}
Thank you Keskiverto, the solution you have given has helped, but not too much.

Let's make it simple. Consider a dataset with 5 rows and 5 columns with only integer values, something like this:

1,2,3,4,5
5,4,3,2,4
2,4,4,4,1
5,4,3,2,2
3,5,5,1,3

We have to calculate the min/max, mean and median of every column.

Let's try this; afterwards we can generalize the solution to M rows and N columns and different data types.
Where is your code showing what you tried?
Let's try this; afterwards we can generalize

The "all of same type and none missing" is a special case that allows a simple implementation that is not exactly a good base for generalization.

Anyway, no struct Table is required. The main() could start:
int main() {
  constexpr size_t Columns {5};
  std::vector<std::vector<int>> foo( Columns ); // 5 empty column vectors
  for ( auto & col : foo ) { // range-based for-loop syntax, C++11
    col.reserve( 8 ); // look up what the vector::reserve does
  }
  // replaces v1 and v2; parentheses give 5 elements, braces would give one element with value 5
  std::vector<int> row( Columns );
  // nested loop to read one row at a time and if successful append to the column vectors
I have tried to explain the task in detail.

Consider this example dataset in a text file, which I want to store in a multidimensional vector for calculations. Here r represents rows and c represents columns. Some values are stored as '?', so we don't consider them in the min and max calculations; for the mean, we use mean = (sum of available values) / (number of available values).

	c1, c2, c3, c4, c5, c6, c7, c8,c9,c10,c11

r1	5.1,3.5,1.4,0.2,5.1,3.5,1.4,0.2,?,?,?
r2	4.9,3.0,1.4,0.2,4.9,3.0,1.4,0.2,?,?,?
r3	4.7,3.2,1.3,0.2,4.7,3.2,1.3,0.2,?,?,?
r4	4.6,3.1,1.5,0.2,4.6,3.1,1.5,0.2,?,?,?
r5	?,?,?,4.6,3.1,1.5,0.2,4.6,3.1,1.5,0.2
r6	?,?,?,4.7,3.2,1.3,0.2,4.7,3.2,1.3,0.2
r7	?,?,?,4.9,3.0,1.4,0.2,4.9,3.0,1.4,0.2
r8	?,?,?,5.1,3.5,1.4,0.2,5.1,3.5,1.4,0.2


I need to calculate the min/max, mean and median of every column and output them to another file like this:

output
	c1, c2, c3, c4, c5, c6, c7, c8,c9,c10,c11
min	-   -   -   -   -   -   -   -  -   -   -
max	-   -   -   -   -   -   -   -  -   -   -
mean	-   -   -   -   -   -   -   -  -   -   -


Can someone help me with this?
It will probably be easier to convert your information from row major order to column major order.

I'd start by reading the file into a vector<vector<string>> to preserve the question marks. Then I'd convert to column major order and convert the strings to doubles. After you do this conversion it'll be easier to do the min, max and mean calculations.

Hi jlb,

I am trying to do the same, but I am not very familiar with multidimensional vectors. I am fetching the file line by line (shown in the code below), but I don't know how to read every single character in a line while storing it in a vector and ignoring ',' and '?', or handling these special characters as required.

char c_line[500];
string line;

while(!infile.eof())
{
infile.getline(c_line,500);
line = string(c_line);
}
First, you should be using a std::string instead of that C-string, and don't use eof() to control your read loop; use the actual read operation.

I'd start by reading the file line by line, parsing each line as you go using a stringstream and getline() with the optional third parameter (the comma) to create a vector<vector<string>>. Each element of the outer vector contains a vector that contains each "number" as a string. At this point you should still have the question marks in the appropriate places.

At this point your vector<vector<string>> should look like:

5.1 3.5 1.4 0.2 5.1 3.5 1.4 0.2 ? ? ? 
4.9 3.0 1.4 0.2 4.9 3.0 1.4 0.2 ? ? ? 
4.7 3.2 1.3 0.2 4.7 3.2 1.3 0.2 ? ? ? 
4.6 3.1 1.5 0.2 4.6 3.1 1.5 0.2 ? ? ? 
? ? ? 4.6 3.1 1.5 0.2 4.6 3.1 1.5 0.2 
? ? ? 4.7 3.2 1.3 0.2 4.7 3.2 1.3 0.2 
? ? ? 4.9 3.0 1.4 0.2 4.9 3.0 1.4 0.2 
? ? ? 5.1 3.5 1.4 0.2 5.1 3.5 1.4 0.2 


Give this a try. If you have problems post your complete program and ask specific questions based on that code.
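
For reference, the stringstream and getline() approach described above could look roughly like this. It is a sketch, not a complete program; split_line and read_table are hypothetical names, and blank lines are assumed to carry no data.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Splits one line on commas; a '?' field is kept as the string "?".
std::vector<std::string> split_line(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream ss(line);
    std::string field;
    // The third argument makes getline stop at ',' instead of '\n'.
    while (std::getline(ss, field, ',')) {
        fields.push_back(field);
    }
    return fields;
}

// Reads the whole stream row by row. The read itself controls the loop,
// so no eof() test is needed.
std::vector<std::vector<std::string>> read_table(std::istream& in) {
    std::vector<std::vector<std::string>> table;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // skip blank lines
        table.push_back(split_line(line));
    }
    return table;
}
```

Passing an opened std::ifstream to read_table gives the row-major table of strings, with the question marks still in place for the later conversion step.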
