Soundex

I'm trying to find an up-to-date Soundex algorithm that works well in C++. It needs to accept a string of up to 20 characters and convert it into the 4-character Soundex code.

Does anyone know where I can find a working implementation?
#include <iostream>
#include <string>
#include <map>
#include <cstdlib>
#include <cctype>
using namespace std;


string soundex( string word )
{
   string::size_type i;                 // unsigned index, so comparisons with result.size() are like-for-like
   string AEIOUY = "AEIOUY", HW = "HW";
   map<char,char> cmap = { {'B','1'}, {'F','1'}, {'P','1'}, {'V','1'},
                           {'C','2'}, {'G','2'}, {'J','2'}, {'K','2'}, {'Q','2'}, {'S','2'}, {'X','2'}, {'Z','2'},
                           {'D','3'}, {'T','3'},
                           {'L','4'},
                           {'M','5'}, {'N','5'},
                           {'R','6'} };

   string result = word;

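   // Remove non-alphabetic characters and put in upper case
   // (the <cctype> functions need an unsigned char value; a negative plain char is undefined behaviour)
   i = 0;
   while ( i < result.size() )
   {
      if ( !isalpha( static_cast<unsigned char>( result[i] ) ) )
      {
         result.erase( i, 1 );
      }
      else
      { 
         result[i] = toupper( static_cast<unsigned char>( result[i] ) );
         i++;
      }
   }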

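   // Nothing to encode if the input contained no letters
   if ( result.empty() ) return "";

   // Store first character (taken from the cleaned string, so it is always an upper-case letter)
   char firstLetter = result[0];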

   // Remove all occurrences of H and W except first letter
   i = 1;
   while ( i < result.size() )
   {
      if ( HW.find( result[i] ) != string::npos ) result.erase( i, 1 );
      else                                        i++;
   }

   // Map consonants to digits
   for ( char &c : result )
   {
      if ( cmap.count( c ) ) c = cmap[c];
   }

   // Replace all adjacent same digits with one digit
   i = 1;
   while ( i < result.size() )
   {
      if ( isdigit( result[i] ) && result[i] == result[i-1] ) result.erase( i, 1 );
      else                                                    i++;
   }

   // Remove all occurrences of AEIOUY except first letter
   i = 1;
   while ( i < result.size() )
   {
      if ( AEIOUY.find( result[i] ) != string::npos ) result.erase( i, 1 );
      else                                            i++;
   }

   // Replace first letter
   result[0] = firstLetter;

   // Get correct length
   result += "000";
   result = result.substr( 0, 4 );

   return result;
}


int main()
{
   string word;
   while ( true )
   {
      cout << "Enter a word (empty to end): ";   getline( cin, word );
      if ( word == "" ) exit( 0 );
      cout << "Soundex representation is " << soundex( word ) << endl;
   }
}

Enter a word (empty to end): Trump
Soundex representation is T651
Enter a word (empty to end): Putin
Soundex representation is P350
Enter a word (empty to end): May
Soundex representation is M000
Enter a word (empty to end): Merkel
Soundex representation is M624
Enter a word (empty to end): Jinping
Soundex representation is J515
Enter a word (empty to end): KimJongUn
Soundex representation is K525
Enter a word (empty to end): 

Thank you very much. Is this something you created specially for me? Would I be free to use it as part of my school project?
Codingboy12365,

Actually, I'd never heard of Soundex before your post! I was intrigued, so coded it for amusement from the Wikipedia article. I can't vouch for its accuracy as phonetics definitely isn't my field.

If it's a school project you should write your own code, but you can use this for comparison/ideas.

Actually, I'm not happy with the quality of the code: repeatedly calling erase() on individual characters is appallingly inefficient, since every call shifts the rest of the string down by one. However, alternatives using things like remove_if ended up being a tad messy. There are plenty of ways of achieving each step of the algorithm, so have a look for better alternatives.
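
For what it's worth, the erase-remove idiom can be kept fairly tidy with a lambda. This is only a minimal sketch, assuming C++11; stripFrom is a hypothetical helper name, not something from the code above.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: drop every character at or after position 'pos'
// that appears in 'unwanted', leaving the characters before 'pos' untouched.
std::string stripFrom( std::string s, std::string::size_type pos, const std::string &unwanted )
{
   if ( pos >= s.size() ) return s;     // nothing to scan

   // remove_if shuffles the kept characters forward in one pass and
   // returns an iterator to the start of the leftover tail
   auto newEnd = std::remove_if( s.begin() + pos, s.end(),
                                 [&unwanted]( char c ) { return unwanted.find( c ) != std::string::npos; } );

   s.erase( newEnd, s.end() );          // a single erase for the whole tail
   return s;
}
```

Calling stripFrom( "WHITEHEAD", 1, "HW" ) gives "WITEEAD": the leading W is kept and both H's are dropped in one pass over the string.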

It's possible that the Soundex definition may vary between sources - I went for the "American Soundex" in the Wikipedia article. Linguistics isn't my area, but I can see reasons based on pronunciation why it might plausibly be different in different areas of the world. Amusing how 'Trump' ends up translating to something like a Russian tank, whilst Theresa May wouldn't be happy with the phonetic translation of her surname.
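
To make the rules concrete, here is a rough single-pass sketch of American Soundex as I read the Wikipedia description: letters with the same code collapse when they are adjacent or separated only by H or W, whereas a vowel in between lets the code repeat. The name soundexOnePass and the lookup-table layout are my own assumptions, and it only handles plain A-Z.

```cpp
#include <cctype>
#include <string>

// Sketch only: returns the 4-character code, or an empty string if the input has no letters.
std::string soundexOnePass( const std::string &word )
{
   // Code for each letter A..Z: '0' = vowel or Y (dropped, but separates equal codes),
   // '-' = H or W (dropped, does not separate equal codes)
   //                             ABCDEFGHIJKLMNOPQRSTUVWXYZ
   static const char codes[] = "0123012-02245501262301-202";

   std::string result;
   char last = '0';                                   // code of the previous significant letter

   for ( char raw : word )
   {
      unsigned char uc = static_cast<unsigned char>( raw );
      if ( !std::isalpha( uc ) ) continue;            // ignore digits, spaces, punctuation
      char c = static_cast<char>( std::toupper( uc ) );
      if ( c < 'A' || c > 'Z' ) continue;             // ignore locale-specific letters

      char code = codes[c - 'A'];

      if ( result.empty() )                           // first letter is kept as a letter
      {
         result += c;
         last = code;
         continue;
      }

      if ( code == '-' ) continue;                    // H and W: skip, leave 'last' alone
      if ( code != '0' && code != last ) result += code;
      last = code;                                    // a vowel resets 'last' to '0'

      if ( result.size() == 4 ) break;                // already long enough
   }

   if ( result.empty() ) return result;
   result.append( 4 - result.size(), '0' );           // pad short codes with zeros
   return result;
}
```

With these rules Robert and Rupert both give R163, Rubin gives R150, Ashcraft gives A261, Tymczak gives T522 and Pfister gives P236, which are the values quoted in the Wikipedia article, so they make handy test cases for any other implementation.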

It's a nice C++ challenge though!
Here's a version that doesn't overuse the erase() function.
#include <iostream>
#include <cstdlib>
#include <cctype>
#include <algorithm>
#include <string>
#include <map>
using namespace std;

string target;

string removeChars( string s, int pos, bool f( char c ) )
{
   if ( pos >= s.size() ) return s;                                  // safety if starts beyond end of string
   string::iterator end = remove_if( s.begin() + pos, s.end(), f );  // 'removed' chars still in s; end points to them or end of string
   return s.substr( 0, end - s.begin() );                            // just return the non-removed characters
}

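// The <cctype> functions need an unsigned char value; a negative plain char is undefined behaviour
bool isIn( char c ) { return target.find( static_cast<char>( toupper( static_cast<unsigned char>( c ) ) ) ) != string::npos; }

bool notAlpha( char c ) { return !isalpha( static_cast<unsigned char>( c ) ); }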

string soundex( string word )
{
   map<char,char> cmap = { {'B','1'}, {'F','1'}, {'P','1'}, {'V','1'},
                           {'C','2'}, {'G','2'}, {'J','2'}, {'K','2'}, {'Q','2'}, {'S','2'}, {'X','2'}, {'Z','2'},
                           {'D','3'}, {'T','3'},
                           {'L','4'},
                           {'M','5'}, {'N','5'},
                           {'R','6'} };


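   // Remove non-alphabetic characters
   string result = removeChars( word, 0, notAlpha );

   // Nothing to encode if the input contained no letters
   if ( result.empty() ) return "";

   // Put in upper case
   for ( char &c : result ) c = toupper( static_cast<unsigned char>( c ) );

   // Store first character
   char firstLetter = result[0];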

   // Remove all occurrences of H and W except first letter
   target = "HW";   result = removeChars( result, 1, isIn );

   // Map consonants to digits
   for ( char &c : result ) { if ( cmap.count( c ) ) c = cmap[c]; }

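   // Replace all adjacent same digits with one digit: working backwards, mark each digit that repeats
   // its left-hand neighbour, so the first of every run survives (including position 0, e.g. Pfister -> P236)
   for ( int i = (int)result.size() - 1; i >= 1; i-- ) if ( isdigit( static_cast<unsigned char>( result[i] ) ) && result[i] == result[i-1] ) result[i] = '*';
   target = "*";   result = removeChars( result, 1, isIn );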

   // Remove all occurrences of AEIOUY except first letter
   target = "AEIOUY";   result = removeChars( result, 1, isIn );

   // Replace first letter
   result[0] = firstLetter;

   // Get correct length and return
   result += "000";
   return result.substr( 0, 4 );
}


int main()
{
   string word;
   while ( true )
   {
      cout << "Enter a word (empty to end): ";   getline( cin, word );
      if ( word == "" ) exit( 0 );
      cout << "Soundex representation is " << soundex( word ) << endl;
   }
}
