getline is stopping at quotations

For some reason, getline is not returning the whole line for me. Here are the first three lines of my input file, which is the training data from the Kaggle Titanic competition:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C


I think the input file uses \r\n for line endings. On the second line, immediately after getline has been called, the value of 'line' is "1,0,3,\", which means it stopped at the first quote even though I explicitly told it to stop at \r.

The first line does not have a quote in it, so I don't understand how the splitCSV method could have somehow caused this problem. For learning purposes I decided to write my own splitCSV method instead of using boost.

Can somebody please tell me what I'm doing wrong here?

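#include <fstream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

// Remove every double-quote character from s, in place.
void remQuotes(string& s) {
    string::size_type x; // size_type rather than int, so the npos comparison is exact
    while ((x = s.find("\"")) != string::npos) {
        s.erase(x,1);
    }
}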

std::vector<string> * splitCSV(string line,bool removeQuotes=true) {
    bool inquotes=false;
    std::vector<string> * v = new std::vector<string>();
    string::size_type start = 0;
    string::size_type pos = 0;
    while (pos < line.size()) {
        if (line[pos]=='"') {
            inquotes = !inquotes;
        } else if (!inquotes && line[pos] == ',') {
            string t = line.substr(start,pos-start);
            if (removeQuotes) {
                // must replace the quotes
                remQuotes(t);
            }
            v->push_back(t);
            start=pos+1;
        }
        pos++;
    }
    // the last field has no trailing comma, so it has to be added after the loop
    string last = line.substr(start);
    if (removeQuotes) {
        remQuotes(last);
    }
    v->push_back(last);
    return v;
}

Traveller * parseTraveller(std::vector<string>& v) {
    Traveller * t = new Traveller();
    short x;
    istringstream(v.at(0)) >> x;
    t->setId(short(x));
    
    
    return t;
}

/*
 * 
 */
int main(int argc, char** argv) {
    //string testpath="/home/mike/NetbeansProjects/Titanic/test.csv";
    string trainpath="/home/mike/NetBeansProjects/Titanic/train.csv";
    string line;
    ifstream myinput(trainpath);
    bool firstLine=true;
    while (getline(myinput,line,'\r')) {
        std::vector<string> v = *splitCSV(line);
        if (firstLine) {
            // nothing I need to do yet, will eventually pull out headers maybe.
            firstLine=false; // otherwise the else branch below is never reached
        } else {
            Traveller t = * parseTraveller(v);
            // must delete traveller
        }
    }
    myinput.close();
    return 0;
}


I omitted the definition for the Traveller class for brevity's sake.
Using a string stream to parse would make things easier:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector< std::string > split_csv_remove_quotes( const std::string& line )
{
    std::vector< std::string > result ;

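    std::istringstream stm( line ) ;
    std::string token ;
    bool more = true ; // true while another field is expected
    while( more )
    {
        token.clear() ;
        while( stm.peek() == ' ' ) stm.get() ; // skip spaces in front of the field

        if( stm.peek() == '"' ) // quoted field: take everything up to the closing quote
        {
            stm.get() ;
            std::getline( stm, token, '"' ) ;
            while( stm.peek() == ' ' ) stm.get() ;
            more = ( stm.get() == ',' ) ; // a comma here means another field follows
        }
        else // plain field: take everything up to the next comma
        {
            std::getline( stm, token, ',' ) ;
            more = !stm.eof() ; // eof means no comma was found: this was the last field
        }
        result.push_back(token) ;
    }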
    return result ;
}

int main()
{
    const std::string line = "1, 0, 3, \"Braund, Mr. Owen Harris\", male, 22, 1, 0, A / 5 21171, 7.25, , S";
    std::cout << line << '\n' ;

    for( const auto& tok : split_csv_remove_quotes(line) ) std::cout << tok << '\n' ;
}

http://coliru.stacked-crooked.com/a/3e0ea812243764bd

TODO: trim leading and trailing spaces in tokens
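
For the TODO, here is a minimal sketch of a trim helper. The name trim and the choice of which characters count as whitespace (space, tab, '\r', '\n') are assumptions for illustration, not part of the code above.

```cpp
#include <string>

// Hypothetical helper: returns a copy of s without leading and trailing
// whitespace. The whitespace set below is an assumption; adjust as needed.
std::string trim( const std::string& s )
{
    const char* const ws = " \t\r\n" ;
    const std::string::size_type first = s.find_first_not_of( ws ) ;
    if( first == std::string::npos ) return "" ; // empty or all whitespace
    const std::string::size_type last = s.find_last_not_of( ws ) ;
    return s.substr( first, last - first + 1 ) ;
}
```

It could be applied to each token before it is pushed into the result, e.g. result.push_back( trim(token) ). Whitespace inside a token, as in "A / 5 21171", is left alone.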