Data Cleaning

Nov 1, 2013 at 10:29pm
Greetings all hope you are staying low on the glycemic index....I'm trying to do some data cleaning and need a bit of advice. I have a .txt file that I need to analyze with two columns that contains a few missing values denoted by a period:

1957-02-21 43.48
1957-02-22 .
1957-02-25 43.38
1957-02-26 43.45

I would like to delete these instances (remove them as lines ideally) or impute an average value from the preceding and subsequent elements. Reading the text file is no problem but rather manipulating it once opened is proving more trouble than it's worth. Strangely, there seems to be little literature on the subject of data cleaning in C++ so if anyone has any suggestions then I would love to hear them. Thanks so much.
Chase CB

Nov 4, 2013 at 4:53pm
There are at least 2 ways to do this.

One would be to read in each value, then check the value of the second number.

or

Based on the data you provided, it should be simple to check the size of the line. If your wanting to add/average the values then this is not the best option but I thought it might be useful to me.

This doesn't sound like homework, so here ya go.


Code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
// Program to read a file and show the lines that have a certain string size.

#include <iomanip>      // library to set up leading 0s.
#include <unistd.h>     // library to let me delay
#include <fstream>		// for ifstream
#include <iostream>		// for cin, cout and cerr
#include <string>		// for the string datatype
#include <cstdlib>		// needed for the exit function
#include <sstream>		//
#include <math.h>		//
#include <stdio.h>		//
#include <time.h>		//
#include <conio.h>		//
#include <dos.h>		//
#include <stdlib.h>		//
#include <time.h>		//
#include <windows.h>	//
#include <sys/stat.h>	//
#include <cstring>		//
#include <winsock.h>    //
using namespace std;	//

int main(int argc, char* argv[])
{
ifstream Afile ( "data.txt" );
string lineA;
char showmystringsize='n';
char showstringsize='s';
int mystringsize=0;
int linecount=0;

if(Afile.is_open())
{
cout << "Type (y) to set a string size, or (n) to show string size for all lines :";
cin >> showmystringsize;

if (showmystringsize=='y')
{
cout<< "What size do you want to work with? : "; 
cin >> mystringsize;
cout<< "Type (s) to show lines with this size or (d) for don't show : "; 
cin >> showstringsize;
}


// show string size = y
if (showmystringsize=='y')
{
	if (showstringsize=='s')
	{
	cout << "Showing lines that match your string size." << endl;
		while (!Afile.eof( ))
		{
		getline(Afile,lineA);
			if (lineA.size()==mystringsize)
			{
			cout << lineA << endl;
			linecount++;
			}
		}		
	}
	else if (showstringsize=='d')
	{
	cout << "Showing lines that DO NOT match your string size." << endl;
		while (!Afile.eof( ))
		{
		getline(Afile,lineA);
			if (lineA.size()!=mystringsize)
			{
			cout << lineA << endl;
			linecount++;
			}
		}		
	}
	else
	{cout << "Invalid choice for showstringsize"<< endl;}
}
else
	if (showmystringsize=='n')
	{
	cout << "Showing all lines with size (Sz)." << endl;		
	cout << "Size \tLine" << endl;
		while (!Afile.eof( ))
		{
			getline(Afile,lineA);
			cout <<lineA.size() << "\t" << lineA << endl;
			linecount++;
		}	
	}
else
	{cout << "Invalid choice for showmystringsize"<< endl;}	
}

else
	{cout << "Unable to read file" <<endl;}

cout << "Total: " << linecount << endl;
Afile.close();
return 0;
}



Output

Type (y) to set a string size, or (n) to show string size for all lines :y
What size do you want to work with? : 12
Type (s) to show lines with this size or (d) for don't show : d
Showing lines that DO NOT match your string size.
1957-02-21 43.48
1957-02-25 43.38
1957-02-26 43.45
Total: 3
Nov 4, 2013 at 5:11pm
@ OP: The reason that there is so little data on this subject is that both MS Excel and LO Calc will open files like this as white space or width delimited format and do the calculations.

If you want to code this you could read in a whole line at a time and use std::string.find() and std::string.substr() in order to split the data on white space:

- find(): http://www.cplusplus.com/reference/string/string/find/
- substr(): http://www.cplusplus.com/reference/string/string/substr/

EDIT: Why do you want an average for a date? Or did you just mean the second column?
Last edited on Nov 4, 2013 at 5:12pm
Nov 4, 2013 at 6:39pm
To discard lines with incomplete data, do something like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream file
    (
         "1957-02-21 43.48\n"
         "1957-02-22 . \n"
         "1957-02-25 43.38\n"
         "1957-02-26 43.45"
    );


    std::string line ;
    while( std::getline( file, line ) ) // for each line in file
    {
        std::istringstream stm(line) ; // make an input stringstream

        std::string date ;
        double value ;
        if( stm >> date >> value ) // if we could read a date and a value
        {
            std::cout << "date: " << date << "  value: " << value << '\n' ;
            // use date and value
        }
        else // attempt to read failed
        {
            std::cout << "imcomplete data discarded: " << line << '\n' ;
            // discard the line
        }
    }
}

http://coliru.stacked-crooked.com/a/23ecc4a13b085401

To impute an impute an average value from the preceding and subsequent elements,

1. read the date and value into a vector of pairs (string,double), marking invalid values with a nan
2. sort the vector (default sort, would sort on date)
3. make another pass through the vector, correcting the nans to the average of previous valid value and next valid value.


Nov 7, 2013 at 4:59pm
thanks so much
Topic archived. No new replies allowed.