Greetings all hope you are staying low on the glycemic index....I'm trying to do some data cleaning and need a bit of advice. I have a .txt file that I need to analyze with two columns that contains a few missing values denoted by a period:
I would like to delete these instances (remove them as lines ideally) or impute an average value from the preceding and subsequent elements. Reading the text file is no problem but rather manipulating it once opened is proving more trouble than it's worth. Strangely, there seems to be little literature on the subject of data cleaning in C++ so if anyone has any suggestions then I would love to hear them. Thanks so much.
Chase CB
One would be to read in each value, then check the value of the second number.
or
Based on the data you provided, it should be simple to check the size of the line. If your wanting to add/average the values then this is not the best option but I thought it might be useful to me.
// Program to read a file and show the lines that have a certain string size.
#include <iomanip> // library to set up leading 0s.
#include <unistd.h> // library to let me delay
#include <fstream> // for ifstream
#include <iostream> // for cin, cout and cerr
#include <string> // for the string datatype
#include <cstdlib> // needed for the exit function
#include <sstream> //
#include <math.h> //
#include <stdio.h> //
#include <time.h> //
#include <conio.h> //
#include <dos.h> //
#include <stdlib.h> //
#include <time.h> //
#include <windows.h> //
#include <sys/stat.h> //
#include <cstring> //
#include <winsock.h> //
usingnamespace std; //
int main(int argc, char* argv[])
{
ifstream Afile ( "data.txt" );
string lineA;
char showmystringsize='n';
char showstringsize='s';
int mystringsize=0;
int linecount=0;
if(Afile.is_open())
{
cout << "Type (y) to set a string size, or (n) to show string size for all lines :";
cin >> showmystringsize;
if (showmystringsize=='y')
{
cout<< "What size do you want to work with? : ";
cin >> mystringsize;
cout<< "Type (s) to show lines with this size or (d) for don't show : ";
cin >> showstringsize;
}
// show string size = y
if (showmystringsize=='y')
{
if (showstringsize=='s')
{
cout << "Showing lines that match your string size." << endl;
while (!Afile.eof( ))
{
getline(Afile,lineA);
if (lineA.size()==mystringsize)
{
cout << lineA << endl;
linecount++;
}
}
}
elseif (showstringsize=='d')
{
cout << "Showing lines that DO NOT match your string size." << endl;
while (!Afile.eof( ))
{
getline(Afile,lineA);
if (lineA.size()!=mystringsize)
{
cout << lineA << endl;
linecount++;
}
}
}
else
{cout << "Invalid choice for showstringsize"<< endl;}
}
elseif (showmystringsize=='n')
{
cout << "Showing all lines with size (Sz)." << endl;
cout << "Size \tLine" << endl;
while (!Afile.eof( ))
{
getline(Afile,lineA);
cout <<lineA.size() << "\t" << lineA << endl;
linecount++;
}
}
else
{cout << "Invalid choice for showmystringsize"<< endl;}
}
else
{cout << "Unable to read file" <<endl;}
cout << "Total: " << linecount << endl;
Afile.close();
return 0;
}
Type (y) to set a string size, or (n) to show string size for all lines :y
What size do you want to work with? : 12
Type (s) to show lines with this size or (d) for don't show : d
Showing lines that DO NOT match your string size.
1957-02-21 43.48
1957-02-25 43.38
1957-02-26 43.45
Total: 3
@ OP: The reason that there is so little data on this subject is that both MS Excel and LO Calc will open files like this as white space or width delimited format and do the calculations.
If you want to code this you could read in a whole line at a time and use std::string.find() and std::string.substr() in order to split the data on white space:
#include <iostream>
#include <sstream>
#include <string>
int main()
{
std::istringstream file
(
"1957-02-21 43.48\n""1957-02-22 . \n""1957-02-25 43.38\n""1957-02-26 43.45"
);
std::string line ;
while( std::getline( file, line ) ) // for each line in file
{
std::istringstream stm(line) ; // make an input stringstream
std::string date ;
double value ;
if( stm >> date >> value ) // if we could read a date and a value
{
std::cout << "date: " << date << " value: " << value << '\n' ;
// use date and value
}
else // attempt to read failed
{
std::cout << "imcomplete data discarded: " << line << '\n' ;
// discard the line
}
}
}
To impute an impute an average value from the preceding and subsequent elements,
1. read the date and value into a vector of pairs (string,double), marking invalid values with a nan
2. sort the vector (default sort, would sort on date)
3. make another pass through the vector, correcting the nans to the average of previous valid value and next valid value.