sscanf problem

I'm trying to parse a 13,000-line file line by line. Parsing it with std::string::find and std::string::substr takes about 15 seconds; with sscanf it takes about 1.3 seconds. However, I've run into an issue with sscanf. For example:

/*
Line format:
Column1[TAB]Column2[TAB]Column3[TAB]...
Example:
Value1	12	Value3	...
*/
std::string format;
format += "%[^'\t']";
format += "%u\t";
format += "%[^'\t']";
// ..

std::getline(file, line);
sscanf(line.c_str(), format.c_str(), value1, &value2, value3 /*, ..*/);


The problem is that some of the values can be empty. When that happens, sscanf skips all consecutive '\t' characters until it reaches the next non-whitespace character, so the columns are shifted and the values end up incorrect.
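
A minimal repro of the shift, assuming a hypothetical four-column line (text, text, number, text) whose second column is empty. The struct, the function name and the 32-byte buffers are made up for illustration; only the sscanf behaviour is the point:

```cpp
#include <cstdio>
#include <string>

// Hypothetical record: text, text (may be empty), number, text.
struct Parsed
{
    int converted = 0;   // return value of sscanf
    char col1[32] = "";
    char col2[32] = "";
    unsigned col3 = 0;
    char col4[32] = "";
};

Parsed parse_with_sscanf(const std::string& line)
{
    Parsed p;
    // A whitespace directive such as "\t" in the format matches any run of
    // whitespace, so two adjacent tabs are consumed as if they were one.
    p.converted = std::sscanf(line.c_str(),
                              "%31[^\t]\t%31[^\t]\t%u\t%31[^\t]",
                              p.col1, p.col2, &p.col3, p.col4);
    return p;
}
```

For the input "A\t\t12\tC" the second column receives "12" instead of an empty string, and the %u conversion then fails on "C", so sscanf reports only two converted fields.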
I can't comment on scanf; it's not something I've used much. Still, rather than std::string::find and std::string::substr, did you consider std::istringstream and std::getline() with a tab delimiter? There might be a difference in speed.
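
A rough sketch of that idea, assuming every column can be kept as a string and that empty columns should stay in place as empty strings. The helper name split_tabs is hypothetical, and no claim is made here about how fast it is:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one line on tabs, keeping empty fields as empty strings so that
// column positions never shift.
std::vector<std::string> split_tabs(const std::string& line)
{
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;

    while (std::getline(in, field, '\t'))
        fields.push_back(field);

    // std::getline does not report a final empty field after a trailing tab,
    // so add it back by hand.
    if (!line.empty() && line.back() == '\t')
        fields.emplace_back();

    return fields;
}
```

With this, split_tabs("Value1\t\tValue3") yields three fields, the middle one empty, so the index of each field always matches its column. An empty line yields no fields at all in this sketch.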

An example of a few lines from the file and which columns may be empty (any/all of them?) and whether you are only interested in specific columns, or you want all of them, might help.
You didn't give a whole lot of information to go on, but I would be inclined to use an approach more like the following:

#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

const unsigned fields = 5;

std::istringstream in(
    "10\t\t3\t30\t40\n"
    "\t\t1\t2\t3\n"
    "9\t8\t7\t6\t7\n"
    "\t1\t\t1\t\n"
    );

std::vector<std::vector<std::pair<unsigned, int>>> parse_stream(std::istream& is)
{
    std::vector<std::vector<std::pair<unsigned, int>>> values;
    std::string token;

    while (is && !is.eof())
    {
        std::vector<std::pair<unsigned, int>> row;

        for (unsigned i = 0; i < fields - 1; ++i)
        {
            if (getline(is, token, '\t') && !token.empty())
                row.push_back({ i, std::stoi(token) });
        }

        if (getline(is, token) && !token.empty())
            row.push_back({ fields - 1, std::stoi(token) });

        if (is)
            values.push_back(std::move(row));
    }

    return values;
}

int main()
{
    auto values = parse_stream(in);

    for (std::size_t i = 0; i < values.size(); ++i)
    {
        std::cout << "Line " << i + 1 << '\n';
        for (auto& pair : values[i])
            std::cout << "\tColumn " << pair.first + 1 << ": " << pair.second << '\n';
    }
}
@Chervil Hey, thanks for the reply. I've used std::istringstream with the delimiter, and that reduced the time to 8 seconds. Here's the header and a few lines:

// Header
TID	ItemName	Level	SuitableLevelMin	SuitableLevelMax	Class	Race	Gender	Type	SubType	ExtraType	BoundType	EventItem	PCBang Only	Villain Only	GroupID	DropRate	DropRank	font color	Spirit Type	NPC Price	PC Price	NPC Charisma Price	NPC Shrine Coin	PC Charisma Price	PC Shrine Coin	Stack	Hand	Rank	DUR	MinSocket	MaxSocket	MinOption	MaxOption	ATK Range	ATK Speed	Physical_Min_Damage	Physical_Max_Damage	Physical_Defense_Point	Physical_ATKRate	PDR	BlockRate	Magic_Defense_Point	Water_Defense_Point	Fire_Defense_Point	Earth_Defense_Point	Air_Defense_Point	CON_Bonus_Point	STR_Bonus_Point	DEX_Bonus_Point	INT_Bonus_Point	Wis_Bonus_Point	Apply Effect Count	Apply Effect Time	Use Interval	HP	SP	MP	Rune Attribute	Use skill id	Use skill level	Polymorph Id	Polymorph Dur	FirstCategory	FirstCategoryName	SecondCategory	SecondCategoryName	PhysicalRank	Hp_Buff	Mp_Buff	Attack_Buff	Defense_Buff	Run_Buff	Cash	Destination	RemainTime	ExpireTime	classify_id	CanStopUsingItem	CashItemUseType	EnableOnRide	OptionTID	PotionType2	Link_id	Skill_plus	Gambling	QuestItem	Gacha_Type_Numer	GachaRank	RemainPetStamina	GachaMinLv	GachaMaxLv	Item_section_num	Heroic_Min_Damage	Heroic_Max_Damage	Heroic_Defense_Point

// Entry
50405	Absolute Cap of Antonio	75			2	1					1							3517399		1	10000							3	40	2	2							15																													15														12000;5540														

// Entry
13035	Rare Pierrot: Superior <Archer - 14 days>	1			2	1		31																																																			898	1													103			20160	140	1	2	1															


Another thing I tested: with a release build, the time drops from seconds to a few hundred milliseconds (8 s -> 120 ms). But since I run the application in debug mode most of the time (and this file has to be loaded every time I run the application), 8 seconds is really annoying.

@cire Thanks for the reply. Some fields are strings, so I ended up calling std::getline(std::istringstream, field, '\t') manually for each field. Using std::istringstream cut the loading time by about 50%, but waiting 8 seconds every time I launch the application is still really annoying.
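
A minimal sketch of that per-field approach, limited to the first three header columns (TID, ItemName, Level). The Item struct and the read_field and parse_item helpers are hypothetical names, and treating an empty numeric field as 0 is an assumption:

```cpp
#include <sstream>
#include <string>

// Hypothetical subset of the columns in the header: TID, ItemName, Level.
struct Item
{
    unsigned tid = 0;
    std::string name;
    unsigned level = 0;
};

// Read one tab-terminated numeric field; an empty field leaves `out`
// at its default. std::stoul throws if the text is not a number.
bool read_field(std::istream& in, unsigned& out)
{
    std::string token;
    if (!std::getline(in, token, '\t'))
        return false;
    if (!token.empty())
        out = static_cast<unsigned>(std::stoul(token));
    return true;
}

// Read one tab-terminated text field; an empty field gives an empty string.
bool read_field(std::istream& in, std::string& out)
{
    return static_cast<bool>(std::getline(in, out, '\t'));
}

bool parse_item(const std::string& line, Item& item)
{
    std::istringstream in(line);
    return read_field(in, item.tid)
        && read_field(in, item.name)
        && read_field(in, item.level);
}
```

Because each call consumes exactly one tab, empty columns stay in position. The sketch relies on more columns following the ones it reads, as in the file above; an empty final column with no trailing tab would make the last std::getline fail.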