Counts words in a string

Pages: 12
Also there is a big difference between unoptimised code and flagrantly obtuse code.
closed account (EzwRko23)
Yes, D. Knuth mentioned premature optimisation. But all of what you are doing here is premature optimisation. And funny, your STL code is not really that faster than the naive Scala oneliner using regexp, arrays and wasting memory (regexp is much more general than specialised code you posted). Probably Perl would be also comparable. Actually I measured your code:

2 millions of invocations of the STL Duoas code on a 64 char string took 2.3 s.
2 millions of invocations of the unoptimised Scala split + regexp on the same string took about 3.4 s.

Oh, and did I mention C++ code uses one-byte chars, and mine code is UTF friendly...?

So, either write unreadable and fast code, or unoptimal but short and readable. But not both: unoptimal and unreadable.

Anyway never mind. This discussion is retarded. C++ coders rarely have a clue about performance (they usually don't measure, but believe that if they put inline on everything, or use template generated code, it will be fast - like the inline in the Duoua's code).


Last edited on
C++ coders rarely have a clue about performance...

That is a very poor attempt at stereotyping that people may find offensive. I'm not offended, but that's because I don't think that statement is limited to C++ programmers. (Any writing in an interpreted language can't be too concerned about performance since compiled code will run orders of magnitude faster).

But keeping with the "don't do premature optimization" theme, I'd venture a guess that most code written, however inefficient, does not require optimization anyway either because it isn't invoked enough times to see a difference or it is invoked with inputs that are sufficiently small. Goes directly back to "don't do premature optimization".


closed account (EzwRko23)

But keeping with the "don't do premature optimization" theme, I'd venture a guess that most code written, however inefficient, does not require optimization anyway either because it isn't invoked enough times to see a difference or it is invoked with inputs that are sufficiently small. Goes directly back to "don't do premature optimization".



Agreed.


That is a very poor attempt at stereotyping that people may find offensive. I'm not offended, but that's because I don't think that statement is limited to C++ programmers. (Any writing in an interpreted language can't be too concerned about performance since compiled code will run orders of magnitude faster).


Also agreed - most programmers don't have clue about performance, not only the C++ ones. However, many C and C++ programmers I meet are often obsessed about performance. They post things like "is += faster than +" or they lose time for similar stupid things. This thread also shows it clearly - I posted a short oneliner, but instead of trying to show how to write a similarly simple C++ code, very soon someone states that my code is resource wasteful, slow etc., even though actually it is of very similar performance to the C++ code claimed to be "fast".

Funny, that obsession about performance is often not complemented with actual knowledge of how to write fast software. The PHP or Python programmers usually don't talk so much "wow, look ma, what a super fast piece of code I wrote". They usually accept that most parts of code are not optimal and optimize only what is needed.
Last edited on
xorebxebx wrote:

I posted a short oneliner, but instead of trying to show how to write a similarly simple C++ code, very soon someone states that my code is resource wasteful, slow etc.


Actually Duoas did post simple C++ code. It was only mentioned in passing that your solution was resource wasteful (which it is).

xorebxebx wrote:

, even though actually it is of very similar performance to the C++ code claimed to be "fast".


No one has yet posted any C++ code that they claimed was 'fast'. Duoas said that his code was 'cheaper' than yours which is true.

Your code is much slower that C++ code. In your own tests is was slower:

xorebxebx wrote:

2 millions of invocations of the STL Duoas code on a 64 char string took 2.3 s.
2 millions of invocations of the unoptimised Scala split + regexp on the same string took about 3.4 s.


In the words of D Knuth:
"In established engineering disciplines a 12 % improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering"

And your solution takes, what 30% more time?

Also how many words did you count to get those figures? 64 char string? That is small, unlikely and tailored to suit your algorithm better. We all know that as the number of words grows the performance difference will increase.

Furthermore Duoas and I were simply attempting to provide a C++ solution in one line of code. We were not trying to make a fast solution.

This code should be fairly efficient. I would be interested to see how it compares to your array-creating solution when counting a thousand words per string (which is not an unlikely document size).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
size_t word_count(const std::string& str)
{
	size_t c = 0;
	std::string::const_iterator pos = str.begin();
	std::string::const_iterator end = str.end();

	while(pos != end)
	{
		while(pos != end && (*pos == ' ' || std::isspace(*pos))) ++pos;
		c += (pos != end);
		while(pos != end && (*pos != ' ' || !std::isspace(*pos))) ++pos;
	}
	return c;
}
3 lots of 1000000 iterations of a
768 word string, 3648 chars long:

Duoas: 768 in 155.09
Duoas: 768 in 155.84
Duoas: 768 in 155.81

Galik: 768 in 30
Galik: 768 in 29.65
Galik: 768 in 29.96
Last edited on
closed account (EzwRko23)

No one has yet posted any C++ code that they claimed was 'fast'. Duoas said that his code was 'cheaper' than yours which is true.
Your code is much slower that C++ code. In your own tests is was slower:



It was not a scientific test, just a fast test to check the order of magnitude. Both codes do not do the same - the C++ code takes unfair advantage of 1-byte wide chars, and the string was very short so it fit in the L1 cache. For a larger string, you would probably pay much more for additional passes over the whole string. But nevertheless they are very similar. 1.5x difference is really of no practical difference, this is a typical difference between code compiled by different compilers or with different flags. This is too little difference to be concerned in a non time-critical app. And if it is a problem, it still can be very easily optimised.


This code should be fairly efficient. I would be interested to see how it compares to your array-creating solution when counting a thousand words per string (which is not an unlikely document size).


Good joke. A thousand Java allocations is like... 5 microseconds on a modern hardware? Besides, I've already posted another, better, still very short and simple, oneliner that doesn't create any temporary objects except one single iterator, so no need to compare that old one.

Anyway - tell me - what is the point in comparing a long optimised version of code with short unoptimised (but much more readable) version? The purpose of the short version is to be short, simple readable, and fast to write, not fast to execute.

If I wish, I can translate this C++ code to Scala directly and there is no reason it cannot be exactly the same speed (yep, I can do it all manually with while loops, just as you can). In extreme case I can fall back to writing a small fragment in assembly.

On the other hand your code is wasteful in another way - it wastes programmer resources and you **cannot do anything about it.** (except switching the whole project to a HLL language). I can spend my saved programmer resources (e.g. time) on optimising the stuff that really needs it. Practice shows that it is usually less than 20% of code. But readability is needed usually by the whole code. So what is more important to have? A possibility to write short code or fast code? Or both?
Last edited on
xorebxebx wrote:
Both codes do not do the same - the C++ code takes unfair advantage of 1-byte wide chars

I have similar code for wide chars and it runs, if anything, slightly faster.
closed account (EzwRko23)
It very much depends on the architecture and size of the string. If the string is long and memory access slow, the memory may become a bottleneck and then wide strings may be slower.
Also agreed - most programmers don't have clue about performance, not only the C++ ones. However, many C and C++ programmers I meet are often obsessed about performance. They post things like "is += faster than +" or they lose time for similar stupid things. This thread also shows it clearly - I posted a short oneliner, but instead of trying to show how to write a similarly simple C++ code, very soon someone states that my code is resource wasteful, slow etc., even though actually it is of very similar performance to the C++ code claimed to be "fast".

Funny, that obsession about performance is often not complemented with actual knowledge of how to write fast software. The PHP or Python programmers usually don't talk so much "wow, look ma, what a super fast piece of code I wrote". They usually accept that most parts of code are not optimal and optimize only what is needed.


@xorebxebx: I agree that most programmers don't have a clue about performance. Fact is, if you want to have a clue about it, you have to know how your compiler works -- how it parses the input, how it generates code, how its optimizer works, and then you also have to know your hardware architecture -- pipelining, etc. etc. That is a lot of in depth knowledge, very little of which you're going to get in a college bachelors degree, perhaps even masters or phd as well.

That they post questions like "is += faster than +" I think is a good thing. It shows they care about performance, which is better than being completely ignorant of it and writing code that is 10x slower than it should be all for want of an ampersand to pass a vector to a function by const reference instead of by value (I see this all the time.) But C++ is about performance anyway: at least, on of the goals of the standards committee when changing the language is to not make programmers pay for what they don't use. For example, if your class/struct has no virtual functions, then you don't pay the memory cost for a vtable pointer. Yes, it may be true that C++ does not easily allow you to write the fastest code in all cases; perhaps it makes it impossible. That is a price that C++ programmers have to pay--it is the result of C++ being a general purpose language that can be used to solve essentially any programming problem from hello world to quantum simulations to launching rockets to word processors, etc, etc.

As for the responses you got, well, I think your past posting history is largely to blame. You've on many occasions attacked C++ on a C++ forum (if perhaps only to claim that Scala is better). The people that post here regularly are here because they enjoy C++ -- dare I say they love C++. It is only human nature for a person to defend something they take to heart when it is attacked, and what I saw in this thread was, I feel, largely a knee-jerk over-reaction, but I think that was because of the number of arguments with you in the past.


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <sstream>
#include <iterator>
using namespace std;


size_t count_words(const std::string& s)
{
     std::istringstream is(s);
     return distance( std::istream_iterator<std::string>(is),
                            std::istream_iterator<std::string>());
}



void main()
{
	string read;
	int threshold;
	map<string, int> mymap;

	ifstream myfile("test.txt");	//open text file
	ofstream myoutput1;
	ofstream myoutput2;

			cout << "Please enter threshold :";
			cin  >>threshold;

	if(!myfile) 
	{
		cout << "Cannot open input file.\n";
		exit(0);
	}
	else
	{
		while (myfile >> read) 
		{
			++mymap[read];	//Read in the words in text file and count the freq of the word appearing
		}
	}
	myoutput1.open("index.txt");
	myoutput2.open("common.txt");
	map<string, int>::iterator b = mymap.begin();
	map<string, int>::iterator e = mymap.end();   

		//ofstream myoutput;
		//myoutput.open ("example.txt");

	/*
		std::string s; // word count 
		unsigned int wordno = 0;
    
		while(getline(myfile, s))
			 {
				 ++wordno;
				 if(s.empty())++ wordno;
				 else wordno += count_words(s);
				 std::cout << s << '\n';
			 }
     */

		myoutput1<< "Number of words : " << wordno << '\n';
		myoutput1 << "\nWord" <<"       " <<" Occurrence" <<endl;
		myoutput2<< "Number of words : " << wordno << '\n';
		myoutput2 << "\nWord" <<"       " <<" Occurrence" <<endl;

	while ( b != mymap.end() )
	{
		
		if (b->second >= threshold)
		{
			myoutput2 <<  b->first << "\t\t" << b->second << endl;
		}
		else
		{
			myoutput1 <<  b->first << "\t\t" << b->second << endl;
		}
		b++;
	}

		   myfile.close();
	myoutput1.close();
	myoutput2.close();
}





i have difficulties implementing the count word. which i have command away. Please help
Topic archived. No new replies allowed.
Pages: 12