Opinion on Coding Practices/Methology

Hi all,

I am a beginner to C++ who would greatly appreciate a general review of the following code- ie, ways to make it more legible, better implementation, etc. The code runs fine, but I want to make sure I get into a habit of utilizing professional best practices.

The point of this code is to read in a CSV file with about 70,000 rows. The first column contains dates, and some consecutive rows are duplicates. Thus, I´m trying to remove the duplicates and store the remaining contents in a vector. For instance, loading a file with

4-5-10,data1,data2,data3
4-6-10,data4,data5,data6
4-6-10,data4,data5,data6
4-7-10,data7,data8,data9

will give
4-5-10,data1,data2,data3
4-6-10,data4,data5,data6
4-6-10,data4,data5,data6
4-7-10,data7,data8,data9


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
using namespace std;

void DeleteDoubles(vector<string> &retValue);
//void CriarArquivo(vector<string> &retValue);

void DeleteDoubles(vector<string>& retVal)
{
    int commaPos=0;
    vector <string> testVector;
    char buffer1[100];
    char buffer2[100];
    string date;
    int date1;
    int date2;
    string line1;
    string line2;
    int x=1;
    string FileName;
    
    //cout<<"Enter the name of the file you wish to edit.";
    //cin>>FileName;
    
    //ifstream file(FileName.c_str( ));

    ifstream file( "blah.csv" ) ;
    string line ;
    
    while( getline( file, line ) )
    {
        if (line != "")  //exclude empty lines
        {
            if (x%2==0)
            {
                commaPos = line.find(",");  //find the date of one row
                date1=line.copy(buffer1,commaPos,0);
                buffer1[date1]='\0';
                line1=line;
            }

            else
            {
                commaPos = line.find(",");  //find the date of the next row
                date2=line.copy(buffer2,commaPos,0);
                buffer2[date2]='\0';
                line2=line;
            }

            string str_buffer2 = buffer2;
            string str_buffer1 = buffer1;

            if (str_buffer2.compare(str_buffer1) != 0)
            {
                if(x%2==0)
                {
                    testVector.push_back(line1);
                }
                else
                {
                    testVector.push_back(line2);
                }
             }

        x++;
        }
    }
    file.close();
    retVal.swap(testVector);
}

int main () 
{
    vector<string> testVector;
    DeleteDoubles(testVector);
    return 0;
}

When I see the function name "DeleteDoubles", I would expect that it removed doubles
from a container.

I am surprised by two things: first, the function takes a container of strings. Second,
it takes an empty container, then proceeds to populate it from file.

Why not just use a container that already removes duplicates... such as a set?
There should not really be any need to use buffer1 and buffer2 char arrays. Perhaps the std::string::substr() function could help you there? Or the std::string::assign()?

Or perhaps turn your line strings into input streams:

1
2
3
4
5
6
7
#include <sstream>

while(std::getline( file, line ) )
{
    // ... stuff
    std::istringstream iss(line); // turn your string into a istream
    std::getline(iss, date1, ','); // read up to comma from string line discarding the comma 


Also "using namespace std;" is frowned upon. Definitely not 'best practice'. Its acceptable for small noddy programs but for any serious work it should not be used.

Either prefix your std items like this: std::string line; of use explicit using declarations like this:

1
2
3
4
5
using std::string;
using std::ifstream;
using std::vector;

// ... 


However you should never do that in the global part of header files. Header files should never contaminate the global namespace.
@jsmith- How would you pass the container instead- as char values? Also (and pardon my ignorance), why are you surprised by me passing an empty container? I'd greatly appreciate it if you would mind clarifying.

@Disch- Given that I just started learning C++, I had not come across sets when I wrote this code. Having just looked up the topic in the reference guide, I couldn't help but think "Wow, this would have made my life so much easier." XD Thanks for the pointer! Can I take it then that your overall impression was that I made the code more tedious than it needed to be?

@Galik- As far as the char arrays go, you are completely right. When I originally wrote the code, I struggled with this issues with the string::copy member functions; hence, I declare a char array, and they convert the contents back to a string. /Sigh. Appreciate the tip. Also, what you said in regards to the "using namespace std" made sense (technically), but I was hoping you could clarify why it's definitely not best practice. Unfortunately this isn't the sort of thing that I've come across in my reading... =/
Zero One wrote:
I was hoping you could clarify why it's definitely not best practice.


Here is a good explanation:

http://www.parashift.com/c++-faq-lite/coding-standards.html#faq-27.5

That whole FAQ is well worth reading if you are interested in best practice.

It is actually not difficult to introduce subtle errors that are hard to detect if you put symbols into the global namespace willy-nilly. In particular when you say "using namespace std;" you take absolutely every symbol from the standard libraries and make them global. So, for example, if your program uses a function called 'count()', which is not such an unlikely name, you can end up calling std::count() when you think you are calling the one that you wrote.

Answer to your question to me:

Looking at nothing more than the function prototype:

 
void DeleteDoubles( std::vector<std::string>& );


It sounds like the function expects a non-empty container and then removes
Doubles from it (whatever Doubles are). It does not sound like the function
populates the container.

It's just the surprise factor. The name of the function should give me a good idea
what it does. In this case, my guess is totally wrong.

In addition to the other comments:

Zero One wrote:
The point of this code is to read in a CSV file with about 70,000 rows. The first column contains dates, and some consecutive rows are duplicates. Thus, I´m trying to remove the duplicates and store the remaining contents in a vector.


This is a simple and concise problem statement. The solution should be simple and concice. Your code, however, is not. What is all of the buffering and even vs. odd business?

4-5-10,data1,data2,data3
4-6-10,data4,data5,data6
4-6-10,data4,data5,data6
4-7-10,data7,data8,data9


What does each line in the data file represent? Why not encapsulate it in a user-defined type?

If you would like to also sort the data by some criterion (or at least don't mind if it is reordered), as Disch mentioned, consider a std::set.
What is all of the buffering and even vs. odd business?

Given my limited knowledge, I settled on using the getline function to extract individual lines from the stream. Since each line contained an entire row of the CSV file, I had to parse the line for the first column entry, and then store this value in the buffer. The even and off business was my way of checking consecutive lines for redundancy- ie, check 3 and 4, then 4 and 5, etc.

What does each line in the data file represent? Why not encapsulate it in a user-defined type?

Each line is data that is full of either strings or ints. This general method I employed was the only way I understood solving the problem, despite it being tedious. Hence my asking for a better way it could be done. Any light you could shed would be appreciated.
Each line is data that is full of either strings or ints.


I guess I should have been more clear. Consider a more object oriented solution. In terms of OOD, step back from implementation details for a minute. Describe the data in regular terms, at a high level. Just to give you an idea of what I mean, here's an example: "Each line is a purchase. Each purchase has a date, price, etc.." Then note the use of the keywords "is a" and "has a". Here's an [example] OOP translation:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
class Purchase
{
    // methods to handle a purchase
    bool read( const string & purchase ) {
        // method to parse out the data from a single line, string-representation and return success
    }
    void write( ostrem & out ) const {
        // method to write a string-representation of the purchase data to an ostream
    }
    // etc.
private:
    // purchase data
    string date;
    string price;
    string etc;
};

This might be advantagous, depending on how you plan to manipulate the data. To just omit blank lines and remove duplicates, this is probably overkill.

The even/odd checks to find duplicates is not only a little obscure but it also might not remove all real duplicates. The counterexample being if/when the duplicates are not adjacent in the file. Even if you sure that the data does not contain non-adjacent duplicates, it would be good practice to anticipate a different data in the future and code to support it. As an added bonus, the duplicates could be removed in a simpler, more flexible, and more concise manor.

Consider buffering the entire file contents through an intermediate container or alternatively, a set (once a proper comparison functor or less than operator is defined) will sort the data and not store duplicates.

Here are a few random comments, based on the first snippet:
if (line != "") //exclude empty lines
When it comes to STL containers, there is an empty method, which could possibly be slightly more efficient:
if( !line.empty() )

1
2
                commaPos = line.find(",");  //find the date of one row
                date1=line.copy(buffer1,commaPos,0);

When parsing out more than just one field, using std::string::find becomes a little tricky. You could use a stringstream as another intermediary and use getline with a specified delimiter to parse out the fields.

Just some things to think about.
I'm gonna go ahead and close this thread. Thank you to all who contributed and left feedback! All of your critiques were much appreciated! =]
Topic archived. No new replies allowed.