Potential Interview question

So if you are having a online chat with someone.. write a program that will look at all your words and pick out the most used 5 words.
Interesting. All the test interview questions I had to do in my career development course where allegedly from real life interview questions, but they were designed to let me showcase my problem solving skills and seldom had anything to do with writing anything outside of pseudocode. Would they want you to write a program that requires that much thought and time to write it up?
Not sure.. heard this question from a friend and wanted to know how to do it..
One possibility is to scan the text and store the words in a
map[word] -> frequency.

Then iterate through the map and choose the 5 words with highest frequency.

The time complexity shall be NlogN (where N is the number of words you wrote)
Space complexity would be N.

Last edited on
> Would they want you to write a program that requires that much thought and time to write it up?

It is a reasonable interview question to ask someone who has just studied C++.
For a programmer familiar with standard C++, writing the code would not require either much thought or much time.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
#include <iostream>
#include <string>
#include <locale>
#include <algorithm>
#include <map>
#include <functional>
#include <fstream>
#include <iterator>

// word: sequence of characters without embedded whitespace
//       and without leading or trailing non-alpha characters
//       with all the characters converted to lower case
std::istream& getword( std::istream& stm, std::string& word )
{
    const auto& loc = stm.getloc() ;
    const auto alpha = [&loc]( char c ) { return std::isalpha( c, loc ) ; } ;

    while( stm >> word )
    {
        const auto begin = std::find_if( word.begin(), word.end(), alpha ) ;

        if( begin != word.end() )
        {
            word = { begin, std::find_if( word.rbegin(), word.rend(), alpha ).base() } ;
            for( char& c : word ) c = std::tolower( c, loc ) ;
            return stm ;
        }
    }

    return stm ;
}

std::map< std::string, int > make_word_count( std::istream& stm )
{
    std::map< std::string, int > wc ;

    std::string word ;
    while( getword( stm, word ) )  ++wc[word] ;

    return wc ;
}

template < typename OUTPUT_ITERATOR >
OUTPUT_ITERATOR copy_most_used( const std::map< std::string, int >& wc,
                                OUTPUT_ITERATOR dest, std::size_t n = 5 )
{
    std::multimap< int, std::reference_wrapper<const std::string> > inverted_map ;
    for( const auto& pair : wc )
        inverted_map.emplace( pair.second, std::cref(pair.first) ) ;

    n = std::min( wc.size(), n ) ;
    auto end = inverted_map.rbegin() ;
    std::advance( end, n ) ;
    for( auto iter = inverted_map.rbegin() ; iter != end ; ++iter, ++dest )
        *dest = iter->second ;
    return dest ;
}

int main()
{
    std::ifstream file( __FILE__ ) ;
    const auto word_count = make_word_count(file) ;

    std::cout << "most used words\n--------------\n" ;
    copy_most_used( word_count, std::ostream_iterator<std::string>( std::cout, "\n" ) ) ;
}

http://coliru.stacked-crooked.com/a/1a86a0af0b27e7da
I had a nearly identical question in an interview once. I liked it so much that I use it myself when interviewing candidates. I'm not looking for code, but rather:
- Can they figure out which data structures will solve it.
- Do they consider performance?

I actually need to do something similar at work every week or so. In my case it's "what are the most frequent errors in a file that lists one error per line?" It's easy in UNIX:
sort | uniq -c | sort -n | tail -5
@JLBorges
I couldn't understand this line , can you elaborate a little here ?
word = { begin, std::find_if( word.rbegin(), word.rend(), alpha ).base() } ;
How does that work ?

And from what I can decipher from the function getword(), you find the first 'character' in word (pointed by begin) and
the last 'character' (say end) in the word and somehow the initializer list magicaly works as assigning each 'character' in the range [begin,end] to word.
Further , it wont work in a word like this foo4$2bar , i.e words with non-alpha characters in the middle of the word .

P.S I'm not that of a C++(11) expert , as you may find out reading my not-so-senseful post :p
Last edited on
How does that work ?
Quite simple. Lets separate it into basic parts:
1
2
3
auto last_char_riterator = std::find_if( word.rbegin(), word.rend(), alpha );
auto end = last_char_riterator.base();
word = { begin, end };
First line is simple call to find_if, searching from the end of the line (using reverse iterators) and using our predicate.
last_char_riterator is a reverse iterator pointing to the last character in word.
Second line reverts reverse iterator pointing to last character in a word to normal iterator pointing past that character (as normal for end iterators)
Third line invokes temporary std::string construction from 2 iterators ( { begin, end }; ) and its move assigment to the word.

Further , it wont work in a word like this foo4$2bar , i.e words with non-alpha characters in the middle of the word .
It does. And if you need other definition of the word, you just need to modify getword() function. For example replace isalpha with isalnum to process "words" like 4ever. Of course this function has room for improvement: you cannot be perfect when dealing with arbitrary text, but for proper and correct english text it is working nearly perfectly.
Last edited on
whats the time complexity for JLBorges code?
> and somehow the initializer list magicaly works as assigning each 'character' in the range [begin,end] to word.

It uses C++11's uniform initialization syntax. http://www.stroustrup.com/C++11FAQ.html#uniform-init

list initialization (details): http://en.cppreference.com/w/cpp/language/list_initialization


> it wont work in a word like this foo4$2bar

As per the definition of 'word' used by getword(), foo4$2bar is a single word.

To treat it as two separate words foo and bar, perhaps the simplest way would be to use a ctype facet with a custom character classification table (which treats all non-alpha characters as white space).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
#include <iostream>
#include <string>
#include <map>
#include <functional>
#include <fstream>
#include <iterator>
#include <vector>
#include <locale>
#include <memory>

namespace
{
    std::map< std::string, int > do_make_word_count( std::istream& stm )
    {
        std::map< std::string, int > wc ;
        auto locale = stm.getloc() ;

        std::string word ;
        while( stm >> word )
        {
            for( char& c : word ) c = std::tolower( c, locale ) ;
            ++wc[word] ;
        }

        return wc ;
    }
}

std::map< std::string, int > make_word_count( std::istream& stm )
{
    struct replace_ctype // raii shim
    {
            // This ctype facet classifies all non-alpha as whitespace
            struct non_alpha_is_ws : std::ctype<char>
            {
                static const mask* classification_table()
                {
                    // start with the classic table ( C locale's table )
                    static std::vector<mask> table( classic_table(),  classic_table() + table_size ) ;

                    // all non-alphas are to be treated as whitespace
                    for( std::size_t i = 0 ; i < table_size ; ++i )
                        if( !std::isalpha(i) ) table[i] = space ;

                    return std::addressof( table.front() ) ;
                }

                // do not delete table, initial reference count == 0
                non_alpha_is_ws() : std::ctype<char>( classification_table() ) {}
            };

        replace_ctype( std::istream& stm ) : stm(stm),
            old_locale( stm.imbue( std::locale( stm.getloc(), new non_alpha_is_ws ) ) ) {}
        ~replace_ctype() { stm.imbue(old_locale) ; }

        replace_ctype( const replace_ctype& ) = delete ;
        replace_ctype& operator=( const replace_ctype& ) = delete ;

        std::istream& stm ;
        const std::locale old_locale ;
    };

    replace_ctype temp(stm) ;

    return do_make_word_count(stm) ;
}

template < typename OUTPUT_ITERATOR >
OUTPUT_ITERATOR copy_most_used( const std::map< std::string, int >& wc,
                                OUTPUT_ITERATOR dest, std::size_t n = 5 )
{
    std::multimap< int, std::reference_wrapper<const std::string> > inverted_map ;
    for( const auto& pair : wc )
        inverted_map.emplace( pair.second, std::cref(pair.first) ) ;

    n = std::min( wc.size(), n ) ;
    auto end = inverted_map.rbegin() ;
    std::advance( end, n ) ;
    for( auto iter = inverted_map.rbegin() ; iter != end ; ++iter, ++dest )
        *dest = iter->second ;
    return dest ;
}


int main()
{
    // zebra Zebra zebra+giraffe Giraffe+ZEBRA zebra! zebra-foal zebra-crossing zebra.cadabra
    std::ifstream file( __FILE__ ) ;
    // ZEBRA ZEBRA ZEBRA+GIRAFFE GIRAFFE+ZEBRA ZEBRA! ZEBRA-FOAL ZEBRA-CROSSING ZEBRA.CADABRA

    const auto word_count = make_word_count(file) ;

    std::cout << "12 most used words\n--------------\n" ;
    copy_most_used( word_count, std::ostream_iterator<std::string>( std::cout, "\n" ), 12 ) ;
}

http://coliru.stacked-crooked.com/a/47b8d1aaa1098e49



> whats the time complexity for JLBorges code?

O( log N ) where N is the total number of words.

Space complexity is O(M) where M is the number of unique words ignoring case.

Note: The code can be made faster. For instance, it is easy to avoid std::advance( end, n ) ;
Last edited on
MiiNiPaa wrote:
Third line invokes temporary std::string construction from 2 iterators ( { begin, end }; ) and its move assigment to the word.

That explains it.I didn't knew that initializer lists can be used that way.
Thanks.

@JLBorges,
I appreciate your effort.
Thank you very much.
That explains it.I didn't knew that initializer lists can be used that way.
It is possible because std::string has assigment operator overload:
Standard wrote:
basic_string& operator=(initializer_list<charT> il);
28 Effects: *this = basic_string(il).
29 Returns: *this.

Disreagard that, I am wrong. Read JLBorges answer
Last edited on
> I didn't knew that initializer lists can be used that way.

It is a simple braced-init-list.
1
2
struct A { int i ; double d ; void* p } ;
A a = { 22, 33.44, nullptr } ; // braced-init-list 

In this particular context, the braced-init-list { begin, end } is not treated as a std::initializer_list<>.

The only constructor of std::string that takes a std::initializer_list<> is the one that takes std::initializer_list<char>. That does not match { begin, end }
> ^ last post ,
Thanks again :)
I should've read those referenced links first.
Last edited on
I never seen an init list like that either. cool.
Topic archived. No new replies allowed.