Reading and sorting words by 3rd letter in C++

Hi guys,

I have a problem to solve and wanted some advice on the choices that I can make here. The problem:
- Read 1 Mil words;
- Sort them by third letter;
- I assume the data format is a simple text file, \n delimited lines and \s delimted words.

- I have read stackoverflow on the specific topic [1], and this forum [2]. But I have some more questions and things to clear up. Things I decided to use: fgets for file reading.

Questions:
- Is fgets() the best choice for this problem?
- What data structure should I use to sort words by 3rd letter?
- I wanted to use std::sort() as to not reinvent the wheel, but doubt it will word for the 3rd letter sorting case.

References:
- [1]: https://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python?rq=1
- [2]: http://www.cplusplus.com/forum/general/197487/
Last edited on
IMO the best option is to read the words into a vector<string> with std::getline.
Once you have done this you can use std::sort with a comparer or lambda to sort them - see 2nd version.
http://www.cplusplus.com/reference/algorithm/sort/
But first focus on loading the file into the vector.
My guts tell me that the assignment is more like "implement this on your own rather than using standard C++ facilities". If your professor hasn't taught you yet about sorting algorithms, he's hoping that at least someone in your class comes up with a known (if inefficient) sorting algorithm on her own.

Having said that, I've used fstreams in all my assignments that require file input/output operations with no performance issues. I'd avoid using C functions for two reasons:
1) fgets returns an array of chars, not a std::string;
2) C and C++ <i>are not</i> the same language, even though most colleges tend to teach "C, but you can use cout".
With 1 million words I suspect it would be the sorting that was the time-limiting feature, not reading from file. Make sure that you turn optimisation right up.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;


bool sortMethod( string a, string b )
{
   if ( a.size() < 3 ) a += "   ";     // Be VERY careful
   if ( b.size() < 3 ) b += "   ";     //

   return a[2] < b[2];
}


int main()
{
   // Choose either filestreams or other input streams for testing
// ifstream in( "infile.txt" );
// ofstream out( "outfile.txt" );
   stringstream in( "Two streets, one of them leads to a world I love\n"
                    "and the other the street that leads me home\n" );
   ostream &out = cout;

   // Read words
   vector<string> words;
   string s;
   while ( in >> s ) words.push_back( s );

   // Sort according to predicate (http://www.cplusplus.com/reference/algorithm/sort/)
   sort( words.begin(), words.end(), sortMethod );

   // Output
   for ( string w : words ) out << w << '\n';
}



As an alternative to vector<string> you might also consider using set (no duplicates) or multiset (allow duplicates) as container: then it will automatically sort in any specified way on insertion.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <set>
#include <algorithm>
using namespace std;


bool sortMethod( string a, string b )
{
   if ( a.size() < 3 ) a += "   ";     // Be VERY careful
   if ( b.size() < 3 ) b += "   ";     //

   return a[2] < b[2];
}


int main()
{
   // Choose either filestreams or other input streams for testing
// ifstream in( "infile.txt" );
// ofstream out( "outfile.txt" );
   stringstream in( "Two streets one of them leads to a world I love\n"
                    "and the other the street that leads me home\n" );
   ostream &out = cout;

   // Read words
   multiset<string,bool(*)(string,string)> words( sortMethod );
   string s;
   while ( in >> s ) words.insert( s );

   // Output
   for ( string w : words ) out << w << '\n';
}

eh, a modern machine can do a million integers (which is what a single character is) in a second or so with anything not n squared time. Time shouldnt be a problem since its not string comparisons (yuck) but integers (hooray). The swapping is where you will pay the price, so be sure to swap POINTERS not DATA. That means you probably need to load a vector of strings to read all the data, and then make a second vector of pointers to each string, then sort the second one off the data pointed to.

this data can use a bucket sort!! You can sort this data in linear time with a simple algorithm!!
standard sort may not get this and may actually be slower, not sure.

-- fgets is C. Use c++, which is shown for you. ]
-- best data structure is probably a form of vectors / arrays

so let me preach on it just roughly...

vector<vector<string>> sorted(256);

read data.
sorted[data[3]].push_back(data); //hopefully I did that right... I don't do a lot of 2d coding.. the idea is that you have a vector of 256 (ascii) letters of strings. So if 'a' is 65, you go to the 65th location in the structure and put your string there. Its now in the 'a' bucket.

now when you print/ handle/whatever just iterate. Ascii is already in sorted order, so
for(0..255)
for(sorted[i].size())
cout << sorted[i][j] << endl; // so you go to a and print all the a strings, then b, then c...

does this make sense? You don't have to sort it, you don't need a sort comparison operator, you have sorted the data as you read it and its done!

Now, I do recommend understanding HOW to write the std sort comparison and do it that way, but the above is probably the best approach for THIS problem.

Is this what you meant @Jonnin?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;


int main()
{
   // Choose either filestreams or other input streams for testing
// ifstream in( "infile.txt" );
// ofstream out( "outfile.txt" );
   stringstream in( "Two streets, one of them leads to a world I love\n"
                    "and the other the street that leads me home\n" );
   ostream &out = cout;

   // Read words
   vector< vector<string> > bucket( 256, vector<string>() );
   string s;
   while ( in >> s ) 
   {
      if ( s.size() < 3 ) continue;
      int pos = s[2] - '\0';
      bucket[pos].push_back( s );
   }

   // Output
   for ( int i = 0; i < 256; i++ )
   {
       for ( string s : bucket[i] ) out << s << '\n';
   }
}

The real time consumer here is the input. I vote for reading into a multiset, which sorts input as it's inserted.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <iostream>
#include <string>
#include <set>
using namespace std;

struct String2Cmp {
    bool operator()(const string& a, const string& b) const { 
        return a.size() < 3 ? b.size() > 2 : a[2] < b[2];
    }
};

int main() {
   multiset<string, String2Cmp> st;
   for (string s; cin >> s; )
       st.insert(s);
   for (auto const& s: st)
       cout << s << '\n';
}

Yes, that looks like what I said.
I dunno, experts, is TPB's better? Or identical approach / different tool?

Regardless ... someone is going to impress their professor :)

Just be sure you understand it if you turn something like these last 2 in... they are not things a new student would normally do, but they are also not really difficult concepts.

Are you guys sure all these data structure aren't a bit of an overkill?
After all, the program requires to only do insertions, which can be done effortlessly with an array or less efficiently with a doubly-linked list.
A vector is just a thin wrapper over an array that allows dynamic resizing. A doubly-linked list sounds more complicated than a vector or set.
You have to store the data in something to sort it.
A map or double array / vector lets you sort it in O(N)

single arrays and linked lists do not let you sort it in O(N), so for the questionable savings of using a slightly simpler data structure (though I would argue that a list is more complicated than a double array or map) you lose your rapid sorting algorithm which is going to make handling a million rows sluggish.

Overkill? Sure, its probably overkill for what appears to be homework. But school is training for the real world. Wouldn't you want the next guy you hire to know how to do it right?
Last edited on
Agreed. Maybe being lectured by a C programmer is giving me bad coding habits.
You could do the bucket sorting exactly as I said in C via a char** or unsigned char**. It would be troublesome but you could also do a hash map that sorted it similar to buckets in C as well, but given the extra framework I would absolutely just go ** if I had to do it in C. The point is the algorithm, the container itself is what it is for your language.



@jonnin, The bucket sort is a great idea. I think it's the way to go.
Topic archived. No new replies allowed.