Use fscanf to read a variable number of integers

Hello all,
I have over 100,000 CSV files in the format below:

    1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,
    1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
    1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
    1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
    1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
    1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,
    1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
    1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
    1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
    1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,

All I need is field 10 and fields 17 onward. Field 10 is a counter that indicates how many
integers are stored starting at field 17, i.e. what I need is:

    6,13,4,7,8,18,20
    5,4,7,8,18,20
    5,4,7,8,18,20
    5,13,4,7,8,20
    5,13,4,7,8,20
    4,4,8,18,20
    5,4,7,8,18,20
    5,13,4,7,8,20
    5,13,4,7,8,20
    12,13,4,7,8,18,20,21,25,27,29,31,32

The maximum number of integers to read per line is 28. I can easily do this with getline in C++.
However, I need to handle over 100,000 such files, and each file may have 300,000~400,000 such lines.
From my previous experience, using getline to read the data and build a vector<vector<int>> may cause
serious performance problems. I tried to use fscanf instead:

    // skip fields 1-9 and read field 10 (the counter); stop when a line no longer matches
    while (fscanf(fstream, "%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%d", &MyCounter) == 1) {
        fscanf(fstream, ",%*d,%*d,%*d,%*d,%*d,%*d"); // skip fields 11-16
        for (int i = 0; i < MyCounter; i++) {
            fscanf(fstream, ",%d", &MyIntArr[i]);    // each value is preceded by a comma
        }
        fscanf(fstream, "%*[^\n]");                  // discard the rest of the line
    }

However, this calls fscanf several times per line and may also cause performance problems.
Is there any way to read a variable number of integers with one call to fscanf?
Or do I need to read the line into a string and then strsep/stoi it? Compared with fscanf, which
is better from a performance point of view? Thanks a lot.

Regds

LAM Chi-fung
What happens if you read the maximum possible number of values in one call and allocate the target array with enough space for them?

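I believe, though it's been a while, that fscanf will stop at the first field it cannot convert and leave the remaining entries unset, so they hold junk unless you initialise them. Its return value tells you how many values were actually read.
If you do it this way, I think you can read each line with a single call and make your code smart enough to handle the missing values. Test it and see what it does.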
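
A rough sketch of that idea, with one change that is my own assumption rather than part of the suggestion: the line is first read into a buffer (for example with fgets) and a single sscanf call is made on it, so that %d cannot skip a newline and run into the next line. The name parse_line_one_call is made up for illustration. The return value of sscanf shows how many values were actually present.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: parses one CSV line with a single sscanf call.
// Stores field 10 in *counter and up to 28 values from field 17 onward in v.
// Returns how many values were converted (0..28), or -1 if field 10 could
// not be read. Entries of v beyond the returned count are left untouched.
int parse_line_one_call(const char* line, int* counter, int v[28]) {
    assert(line != nullptr && counter != nullptr && v != nullptr);
    int n = std::sscanf(line,
        "%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%d"  // skip fields 1-9, keep field 10
        ",%*d,%*d,%*d,%*d,%*d,%*d"                // skip fields 11-16
        ",%d,%d,%d,%d,%d,%d,%d"                   // values 1-7
        ",%d,%d,%d,%d,%d,%d,%d"                   // values 8-14
        ",%d,%d,%d,%d,%d,%d,%d"                   // values 15-21
        ",%d,%d,%d,%d,%d,%d,%d",                  // values 22-28
        counter,
        &v[0],  &v[1],  &v[2],  &v[3],  &v[4],  &v[5],  &v[6],
        &v[7],  &v[8],  &v[9],  &v[10], &v[11], &v[12], &v[13],
        &v[14], &v[15], &v[16], &v[17], &v[18], &v[19], &v[20],
        &v[21], &v[22], &v[23], &v[24], &v[25], &v[26], &v[27]);
    // sscanf stops at the first empty field; n counts the counter plus the values
    return n >= 1 ? n - 1 : -1;
}
```

The caller can compare the returned count with the counter from field 10 to spot malformed lines.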

stoi / atoi probably end up in much the same conversion code that fscanf uses when it reads the text and converts it for you, so the speed is probably about the same.
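
For comparison, a minimal sketch of the read-into-a-string route using strtol directly. The helper name extract_fields and the layout of out (counter first, then the values) are assumptions for illustration. It expects a single line such as fgets returns, and it caps the counter at the 28 values mentioned in the question.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper; field numbers are 1-based as in the question.
// out[0] receives field 10 (the counter), out[1..] the values from field 17.
// Returns the number of ints stored in out (0 if the line has no field 10).
int extract_fields(const char* line, int out[29]) {
    assert(line != nullptr && out != nullptr);
    const char* p = line;
    int stored = 0;  // ints written to out so far
    int wanted = 1;  // becomes 1 + counter once field 10 has been read
    for (int field = 1; stored < wanted; ++field) {
        if (field == 10 || field >= 17) {
            char* end = nullptr;
            long value = std::strtol(p, &end, 10);
            if (end == p) break;  // empty or non-numeric field
            out[stored++] = static_cast<int>(value);
            if (field == 10) {
                // clamp the counter to the range 0..28
                int count = value < 0 ? 0 : (value > 28 ? 28 : static_cast<int>(value));
                wanted = 1 + count;
            }
            p = end;
        }
        // advance to the comma that ends this field
        while (*p != '\0' && *p != ',' && *p != '\n') ++p;
        if (*p != ',') break;  // end of line reached
        ++p;                   // step over the comma
    }
    return stored;
}
```

Whether this beats a scanf-style call on a given system is something only a measurement on real files can show.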

You should, as I hinted at before, either spawn multiple copies of your program or multi-thread it so you can process many files at once, up to whatever your disk system can bear.
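
A minimal sketch of the multi-threaded variant, assuming each file can be parsed independently of the others. process_files and the work callback are hypothetical names, and the right thread count depends on the disk, so it has to be measured.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical driver: calls work once per file name, spread over num_threads
// threads. work must be safe to call from several threads at the same time.
void process_files(const std::vector<std::string>& files,
                   unsigned num_threads,
                   const std::function<void(const std::string&)>& work) {
    assert(num_threads > 0);
    std::atomic<std::size_t> next{0};  // index of the next unclaimed file
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back([&] {
            for (;;) {
                std::size_t i = next.fetch_add(1);  // claim one file
                if (i >= files.size()) return;      // nothing left to do
                work(files[i]);
            }
        });
    }
    for (std::thread& th : pool) th.join();  // wait for all workers
}
```

Each worker claims the next unprocessed file, so a slow file does not hold up the others. On some toolchains this has to be built with -pthread.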
@jonnin
Below please find my piece of code:
    // may try RandomAccess/SequentialScan
    MemoryMapped MemFile(FilterBase.BaseFileName, MemoryMapped::WholeFile, MemoryMapped::RandomAccess);

    // point to start of memory file
    char* start = (char*)MemFile.getData();
    // receives the end pointer from strtol/strtof; a dummy in my case
    char* tmpBuffer = start;

    // looping counter
    uint64_t i = 0;

    // pre-allocate result vector
    MyVector.resize(300000);

    // line counter
    int LnCnt = 0;

    // no. of fields (const so it can be used as an array size)
    const int NumOfField = 43;
    // delimiter count: num of fields + 1, since the leading and trailing delimiters are virtual
    const int DelimCnt = NumOfField + 1;
    // Delimiter positions. May use new to allocate at run time
    // or even use a vector of integers.
    // This stores the delimiter positions of the current line.
    // The positions are relative to the start of the file, so if the file is extremely
    // large, may need to change from int to unsigned, long or even unsigned long long.
    // One spare slot so that a line with an extra comma cannot overrun the array.
    static int DelimPos[DelimCnt + 1];

    // Max number of fields to read. Usually equal to NumOfField, but can be smaller, e.g. in my case
    // I only need 4 fields from the first 15 fields, so 15 could be assigned to MaxFieldNeed
    int MaxFieldNeed = NumOfField;
    // keeps track of how many delimiters have been read in the current line
    int DelimCounter = 0;
    // define field and line separators
    char FieldDelim = ',';
    char LineSep = '\n';

    // 1st field, "virtual delimiter" position
    DelimPos[DelimCounter] = -1;
    DelimCounter++;

    // loop through the whole memory file, once and only once
    for (i = 0; i < MemFile.size(); i++)
    {
      // record the position of each delimiter in the line
      if ((MemFile[i] == FieldDelim) && (DelimCounter <= MaxFieldNeed)) {
        DelimPos[DelimCounter] = (int)i;
        DelimCounter++;
      }

      // grab all values when the end of the line is hit
      if (MemFile[i] == LineSep) {
        // no need to test if (DelimCounter==NumOfField), just assign anyway: wastes a little
        // memory in the integer array but gains performance
        DelimPos[DelimCounter] = (int)i;
        // grow the result vector if the file has more lines than pre-allocated
        if (LnCnt == (int)MyVector.size()) MyVector.resize(MyVector.size() * 2);
        // I know exactly what the format is and which field(s) I want.
        // A more general approach (as a CSV reader) may put all fields
        // into a vector of vector of string.
        // With *EFFORT* one may modify this piece of code so that it can parse
        // different formats at run time, e.g. similar to:
        // fscanf(fstream,"%d,%f....
        // Also, this piece of code cannot handle complex CSV, e.g.
        // Peter,28,157CM
        // John,26,167CM
        // "Mary,Brown",25,150CM
        // the string range is half-open, so the end is the delimiter itself
        MyVector[LnCnt].StrField = string(start + DelimPos[0] + 1, start + DelimPos[1]);
        MyVector[LnCnt].IntField = strtol(start + DelimPos[3] + 1, &tmpBuffer, 10);
        MyVector[LnCnt].IntField2 = strtol(start + DelimPos[8] + 1, &tmpBuffer, 10);
        MyVector[LnCnt].FloatField = strtof(start + DelimPos[14] + 1, &tmpBuffer);
        // reset the delimiter counter for each line
        DelimCounter = 0;
        // the previous line separator is treated as the first delimiter of the next line
        DelimPos[DelimCounter] = (int)i;
        DelimCounter++;
        LnCnt++;
      }
    }
    MyVector.resize(LnCnt);
    MyVector.shrink_to_fit();
    MemFile.close();
    };


I can code whatever I want inside this block:

    if (MemFile[i] == LineSep) {
    }

It even handles empty fields! I actually used this code to solve my previous issue, and it handled 2100 files (6.3 GB) in 57 seconds!
(I hard-coded the CSV format in it and only grab 4 values from each line.)
I will modify it later to handle this issue.
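
One possible way to adapt the single-pass idea to the variable-length lines in this thread, as a sketch only. It takes a plain buffer instead of the MemoryMapped object, the name scan_buffer is made up, and it assumes non-negative integers and that the unused trailing fields are empty, as in the sample data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical adaptation: one pass over a whole file held in memory.
// For each line, appends field 10 followed by every non-empty field from
// field 17 onward to out. Returns the number of lines seen.
std::size_t scan_buffer(const char* data, std::size_t size, std::vector<int>& out) {
    assert(data != nullptr || size == 0);
    std::size_t lines = 0;
    int field = 1;            // 1-based index of the field being read
    int value = 0;            // digits accumulated for the current field
    bool has_digits = false;  // false while the current field is empty
    for (std::size_t i = 0; i < size; ++i) {
        const char c = data[i];
        if (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            has_digits = true;
        } else if (c == ',' || c == '\n') {
            // a field just ended: keep it if it is one of the wanted fields
            if (has_digits && (field == 10 || field >= 17)) out.push_back(value);
            value = 0;
            has_digits = false;
            if (c == ',') {
                ++field;
            } else {
                field = 1;
                ++lines;
            }
        }
        // any other character (e.g. '\r') is ignored
    }
    // flush a final line that has no trailing newline
    if (has_digits && (field == 10 || field >= 17)) out.push_back(value);
    if (has_digits || field > 1) ++lines;
    return lines;
}
```

Since field 10 is stored first for each line, the caller can walk out record by record: read the counter, then that many values. With a memory-mapped file, the pointer from getData() and the value of size() could be passed as data and size.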