stringstream bad performance in multithread environment.

Jun 26, 2012 at 8:42am
I use stl ostringstream in the multithreading environment.
I found that the execution time of each thread increased linearly as the thread number increased. But I did't find any synchronization code. I replace ostringsstream with snprintf and the preformance increase largely.

os: redhat 6.1 64bit
g++: g++4.4
cpu :4 core 2.0Ghz.

the code list as below:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
#include <iostream>
#include <sstream>
using namespace std;

void * func(void * t)
{
        int n = *((int *) t);
        pthread_t pid = pthread_self();
        timeval t1, t2;
        gettimeofday(&t1, 0);
        for(int i = 0; i < n; i++)
        {
                ostringstream os;
                /*
                   char buf[255];
                   int ret = snprintf(buf, 30, "%d", 2000000);
                   buf[ret] = 0;
                 */
        }
        gettimeofday(&t2, 0);
#define DIFF(a, b) ((b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1000000.0)
        std::cout << pid << ":" << DIFF(t1, t2) << std::endl;
#undef DIFF

        return NULL;
}

int main(int argc, char *argv[])
{
        int m, n =0;
        m = atoi(argv[1]);
        n = atoi(argv[2]);
        pthread_t tid[m];
        for(int i = 0; i < m; i++)
                pthread_create(&tid[i], NULL, func, &n);
        for(int i = 0; i < m; i++)
                pthread_join(tid[i], NULL);
        return 0;
}

Jun 26, 2012 at 11:04am
I think it is because "C" print functions must have a simpler lock mechanism (so with less footprint, but less good). I had a multithread program with lots of puts() in multiple threads, and puts() was always deadlocking my program while waiting for access to the shared ressources for printing. Switching to the c++ streams removed the locks with non other code modification. Before i switched to c++, the only way i managed to keep using puts was to create function to add messages to be printed in a queue, and print the buffer from only one thread

Jun 26, 2012 at 11:11am
STL has no lock mechanism at all. You need to do it yourself!

While the c equivalents often have due to global variables they use.
Jun 26, 2012 at 11:32am
I played around with your code, here's my final version:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
#include <unistd.h>
#include <sys/time.h>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <ostream>
#include <string>
#include <vector>
#include <stdio.h>
#include <stdlib.h>

struct timestat
{
	pthread_t pid;
	int n;
	timeval start, stop;
	std::string str;

	timestat() : pid(0), n(0) { const timeval c = { 0, 0 }; start = stop = c; }
};

std::ostream& operator<<(std::ostream& os, const timeval &tv)
{
	using namespace std;

	os << tv.tv_sec << '.' << setw(6) << setfill('0') << tv.tv_usec;
	return os;
}

std::ostream& operator<<(std::ostream& os, const timestat &ti)
{
	os << ti.pid << ' ' << ti.n << ' ' << ti.start << ' ' << ti.stop;
	return os;
}

void* func(void *param)
{
	timestat* info = reinterpret_cast<timestat*>(param);

	gettimeofday(&info->start, 0);
	for (int i = 0; i < info->n; ++i)
	{
		std::ostringstream os;
		os << "hello " << i;
		info->str = os.str();
	}
	gettimeofday(&info->stop, 0);

	return param;
}

void* func2(void *param)
{
	timestat* info = reinterpret_cast<timestat*>(param);

	gettimeofday(&info->start, 0);
	for (int i = 0; i < info->n; ++i)
	{
		char buf[255];
		snprintf(buf, sizeof(buf), "hello %d", i);
		info->str = buf;
	}
	gettimeofday(&info->stop, 0);

	return param;
}

timeval diff(const timeval &a, const timeval &b)
{
	timeval c = { a.tv_sec - b.tv_sec, a.tv_usec - b.tv_usec };
	if (c.tv_usec < 0)
	{
		// carry
		--c.tv_sec;
		c.tv_usec += 1000000;
	}

	return c;
}

int main(int argc, char *argv[])
{
	int nthreads = atoi(argv[1]);
	int niters = atoi(argv[2]);
	std::vector<timestat> ti(nthreads);

	for (int i = 0; i < nthreads; ++i)
	{
		ti[i].n = niters;
		pthread_create(&ti[i].pid, NULL, (i % 2 ? func2 : func), &ti[i]);
	}
	for (int i = 0; i < nthreads; ++i)
	{
		pthread_join(ti[i].pid, NULL);
	}
	for (int i = 0; i < nthreads; ++i)
	{
		timeval dt = diff(ti[i].stop, ti[i].start);
		std::cout << "thread " << i << ": delay " << dt << std::endl;
	}

	return 0;
}


Even threads use ostringstream, odd threads use sprintf.

These are the results on 4 boxes I have here:
AMD64 3.0GHz Llano Quad Core: Linux
1
2
3
4
5
6
7
8
9
kbw@sid:/tmp$ ./x 8 1000000
thread 0: delay 5.978106
thread 1: delay 0.219955
thread 2: delay 6.037293
thread 3: delay 0.292901
thread 4: delay 5.978408
thread 5: delay 0.256369
thread 6: delay 6.026936
thread 7: delay 0.153445

PowerMac PPC64 2.5GHz 2xDual Core: OS X Leopard:
1
2
3
4
5
6
7
8
9
quad:tmp kbw$ ./x 8 1000000
thread 0: delay 131.418874
thread 1: delay 0.808437
thread 2: delay 131.362945
thread 3: delay 0.746946
thread 4: delay 131.425056
thread 5: delay 0.703399
thread 6: delay 131.423792
thread 7: delay 0.753131

MacBook 2.0GHz 64bit Intel, Dual Core: OS X Lion
1
2
3
4
5
6
7
8
9
wifiluther:tmp kbw$ ./x 8 1000000
thread 0: delay 51.431563
thread 1: delay 0.749011
thread 2: delay 53.054713
thread 3: delay 0.594348
thread 4: delay 53.116178
thread 5: delay 0.598230
thread 6: delay 53.062198
thread 7: delay 0.653320

AMD Athlon 2.2GHz 32bit, 1 Core: FreeBSD 7.2
1
2
3
4
5
6
7
8
9
> ./x 8 1000000
thread 0: delay 2.821509
thread 1: delay 0.639026
thread 2: delay 3.314056
thread 3: delay 0.651706
thread 4: delay 2.521705
thread 5: delay 0.397815
thread 6: delay 3.164929
thread 7: delay 0.944895


I find the result very surprising.
Last edited on Jun 26, 2012 at 12:30pm
Jun 26, 2012 at 11:41am
It looks like std::ostringstream is quite slow.

Changing
1
2
3
std::ostringstream os;
os << "hello " << i;
info->str = os.str();
to
info->str = "hello " + std::to_string(i);
seems to be a lot faster (still slower than snprintf).
Jun 26, 2012 at 1:19pm
stream construction takes a very long time (it doesn't have to be, but it is on pretty much every implementation)

Replacing the loop in kbw's program with

1
2
3
4
5
6
7
8
    std::ostringstream os;
    for (int i = 0; i < info->n; ++i)
    {
        os << "hello " << i;
        info->str = os.str();
        os.str("");
        os.clear();
    }


I get the much more reasonable results with GCC on linux:

$ ./test 8 1000000
thread 0: delay 0.665503
thread 1: delay 0.454910
thread 2: delay 0.403125
thread 3: delay 0.395954
thread 4: delay 0.672103
thread 5: delay 0.512586
thread 6: delay 0.477378
thread 7: delay 0.362957
Last edited on Jun 26, 2012 at 1:19pm
Jun 26, 2012 at 1:51pm
hi cubbi.

Do you try to test the program using stringstream in a single-thread mode.

I test that stringstream class has similary performance with sprintf in the single-thread mode.
In multithread mode, sprintf has almost the same execution time, but the execution time of each thread using string stream increased linearly as the thread number increased.

So I think stringstream must call something global which need lock between different thread.
So I want to know what classes will lock global variable in c++ STL.
And in multi-thread program, I need to avoid to use these classes.


Jun 26, 2012 at 2:21pm
So I think stringstream must call something global which need lock between different thread.

Yes, every stream constructor calls the default constructor of std::locale, which atomically increments reference counts on the facets of the current installed global locale. I am guessing that in your case increments are using locks rather than atomic instructions.

So I want to know what classes will lock global variable in c++ STL.

None. Streams were not part of the STL.
Last edited on Jun 26, 2012 at 2:34pm
Jun 26, 2012 at 9:03pm
The stringstream on GCC is around 18 times faster in a single threaded environment. So I think there's a fair bit of waiting around in the stringstream construction.
Jun 27, 2012 at 12:59am

I am guessing that in your case increments are using locks rather than atomic instructions.


I use redhat 6.1 64bit and g++ 4.4.
How can i find std::locale using locks or atomic instructions? Can i change this?
Thanks
Last edited on Jun 27, 2012 at 12:59am
Topic archived. No new replies allowed.