getword

Hello. On page 136, K&R says that our version of getword does not properly handle underscores, string constants, comments, or preprocessor control lines. I don't understand what that means. Could someone explain it? Here is the getword function:


/* getword: get next word or character from input */
int getword(char *word, int lim)
{
	int c, getch(void);
	void ungetch(int);
	char *w = word;

	while (isspace(c = getch()))
		;
	if (c != EOF)
		*w++ = c;
	if (!isalpha(c)) {
		*w = '\0';
		return c;
	}

	for ( ; --lim > 0; w++)
		if (!isalnum(*w = getch())) {
			ungetch(*w);
			break;
		}
	*w = '\0';
	return word[0];
}

The function you posted looks for an identifier. Let's go through the exact rules it applies:

while (isspace(c = getch()))
    ;


This loop runs as long as it reads whitespace from the input. Since the loop body is empty, its only effect is to skip all leading whitespace.

if(c != EOF)
    *w++ = c;
if(!isalpha(c))
{
    *w = '\0';
    return c;
}


The first check tests for end-of-file (no more characters to read). If the first non-whitespace character is not EOF, it is stored as the first character of the buffer word (through the pointer w).
The next check tests whether that character is non-alphabetic. If so, the function terminates the string and returns the character.

for( ; --lim > 0; w++)
    if ( !isalnum(*w = getch())) {
        ungetch(*w);
        break;
    }


Next comes the main part of this function, the part that reads the rest of the word. This loop continues until it reaches the maximum number of characters (--lim > 0) or until it encounters a non-alphanumeric character (the if-conditional), which it pushes back onto the input with ungetch.

From this code, we know it parses one alphabetic character followed by any number of alphanumeric characters. An underscore is neither alphabetic nor alphanumeric, so it stops the parser, and so do the characters that introduce the other constructs (", / and #). An example that accepts underscores too could be:

 int getword(char *word, int lim)

 {
 	int c, getch(void);
 	void ungetch(int);
 	char *w = word;
	
	while (isspace(c = getch()))
 		;
 	if (c != EOF)
 		*w++ = c;
 	if (!isalpha(c) && c != '_') { //Check for underscore
 		*w = '\0';
 		return c;
 	}
	 
	for ( ; --lim > 0; w++) 
        {
            *w = getch();
 	    if (!isalnum(*w) && *w != '_') { //Check for underscore
 	        ungetch(*w);
 	        break;
 	     }
        }
	*w = '\0';
 	return word[0];
  }


You can modify this function to accept string constants (a leading and a trailing quotation mark), comments (a leading /* and a trailing */), or preprocessor control lines (a leading #), all of which can contain spaces. That is left as an exercise for the reader.
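As a starting point for the comment and preprocessor parts, here is a minimal sketch, not a tested solution. Everything in it is an assumption for illustration: set_input, skip_comment, skip_directive and next_significant are made-up names, getch/ungetch read from a string with a single character of pushback instead of using the K&R versions, and any # is treated as the start of a control line without checking that it is the first thing on the line. It also does not know about string constants, so a /* inside a string would be misread.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the K&R getch/ungetch: read from a string. */
static const char *input = "";
static int pushed;
static int has_pushed = 0;

void set_input(const char *s) { input = s; has_pushed = 0; }

int getch(void)
{
    if (has_pushed) {
        has_pushed = 0;
        return pushed;
    }
    return *input ? (unsigned char)*input++ : EOF;
}

void ungetch(int c)
{
    assert(!has_pushed);    /* this sketch allows one pushed-back character */
    pushed = c;
    has_pushed = 1;
}

/* Call after a '/' has been read. If a '*' follows, consume everything up to
   and including the end of the comment and return 1. Otherwise push the
   character back and return 0. */
int skip_comment(void)
{
    int c, prev = 0;

    if ((c = getch()) != '*') {
        ungetch(c);
        return 0;
    }
    while ((c = getch()) != EOF) {
        if (prev == '*' && c == '/')
            return 1;
        prev = c;
    }
    return 1;               /* unterminated comment: input is exhausted */
}

/* Call after a '#' has been read. Consume the rest of the line, treating a
   backslash directly before the newline as a continuation. */
void skip_directive(void)
{
    int c, prev = 0;

    while ((c = getch()) != EOF) {
        if (c == '\n' && prev != '\\')
            break;
        prev = c;
    }
}

/* Return the next character that is not part of a comment or control line. */
int next_significant(void)
{
    int c;

    for (;;) {
        c = getch();
        if (c == '/' && skip_comment())
            continue;
        if (c == '#') {
            skip_directive();
            continue;
        }
        return c;
    }
}
```

One way to use this would be to call next_significant() instead of getch() in the whitespace-skipping loop of getword, so that comments and control lines are dropped before a word starts.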
Hi,

An underscore isn't alphabetic or alphanumeric and will thus stop the parser, and so will any of the other types


Is there a way I can test whether the original function accepts underscores, string constants, etc.? I think this is what's confusing me right now: I don't know what the problem with the original function is.
You could try building a little test application for it and writing out the words. Try something like this:

#include <stdio.h>

int getword(char *word, int limit); // Declaration of our function

int main(void)
{
    char wordbuf[112];

    // getword returns EOF once the input is exhausted
    while (getword(wordbuf, 112) != EOF)
    {
        printf("Found: %s\n", wordbuf);
    }
    return 0;
}


You can now enter any input you like and see how the function splits it into words.
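If you would rather not type the input by hand, here is a rough sketch of a repeatable version of the same test. The names set_input and tokenize are made up for this example, and getch/ungetch are simplified stand-ins that read from a string rather than the K&R versions. The getword below follows the same rules as the original, except that it keeps the character in an int so that EOF survives being pushed back.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the K&R getch/ungetch: read from a string. */
static const char *input = "";
static int pushed;
static int has_pushed = 0;

void set_input(const char *s) { input = s; has_pushed = 0; }

int getch(void)
{
    if (has_pushed) {
        has_pushed = 0;
        return pushed;
    }
    return *input ? (unsigned char)*input++ : EOF;
}

void ungetch(int c)
{
    assert(!has_pushed);    /* this sketch allows one pushed-back character */
    pushed = c;
    has_pushed = 1;
}

/* Same rules as the original getword: a letter followed by letters/digits. */
int getword(char *word, int lim)
{
    int c;
    char *w = word;

    while (isspace(c = getch()))
        ;
    if (c != EOF)
        *w++ = c;
    if (!isalpha(c)) {
        *w = '\0';
        return c;
    }
    for ( ; --lim > 0; w++) {
        c = getch();
        if (!isalnum(c)) {
            ungetch(c);
            break;
        }
        *w = c;
    }
    *w = '\0';
    return word[0];
}

/* Join every token into out, separated by '|', so the splits are visible. */
void tokenize(const char *src, char *out, size_t outsize)
{
    char word[100];

    assert(outsize > 0);
    out[0] = '\0';
    set_input(src);
    while (getword(word, 100) != EOF) {
        if (out[0] != '\0')
            strncat(out, "|", outsize - strlen(out) - 1);
        strncat(out, word, outsize - strlen(out) - 1);
    }
}
```

Calling tokenize on the input my_var "hi" fills out with my|_|var|"|hi|". The identifier is cut in two at the underscore, and the quotation marks come back as separate one-character tokens instead of as part of a string constant.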
I tried to modify the function to accept string constants, but I don't think it's working correctly. I hope that my indent style won't horrify you.


int getword(char *word, int lim)

{
	int c, getch(void);
	void ungetch(int);
	char *w = word;
	
	int string = 0;
	int comment = 0;
	
	while (isspace(c = getch()))
		;
	if (c != EOF)
		*w++ = c;
	if (!isalpha(c)) {
		*w = '\0';
		return c;
	}
	
	if (c == '\"') {
	   for (*w++ = c; (c = getch()) != '\"';)
	      ;
	   *w = '\0';
	   string = 1;
	   } 
	
	 for ( ; --lim > 0; w++) 
	      if (!isalnum(*w = getch())) {
	        ungetch(*w);
	        break;
	      }  else if (string) {
		 w++;
		 continue;
	      }
		 
		*w = '\0';
	    return word[0];
 }



Some results:

"string"
Found: "
Found: string
Found: "
You should move your check for a " character to before the isalpha test. In your current function, isalpha returns false for a " character, so the if-conditional executes. Since that branch returns from the function immediately, your check for a " character never runs. An example of how to change this:

int getword(char *word, int lim)
{
	int c, getch(void);
	void ungetch(int);
	char *w = word;

	while (isspace(c = getch()))
		;
	if (c != EOF)
		*w++ = c;
	if (c == '\"') {
		// Read until the closing quote, EOF, or the buffer is nearly full
		while (w - word < lim - 1 && (c = getch()) != EOF) {
			*w++ = c; // Store the character, including the closing quote
			if (c == '\"')
				break;
		}
		*w = '\0'; // And the NUL character to end the string
		return word[0]; // Done reading, return from function
	}
	if (!isalpha(c)) {
		*w = '\0';
		return c;
	}

	for ( ; --lim > 0; w++)
		if (!isalnum(*w = getch())) {
			ungetch(*w);
			break;
		}
	*w = '\0';
	return word[0];
}


Note that I also changed the loop that reads the string so it saves all the characters of the string (not just the first and second, as in your original function). I also put a return after the string-reading loop so the function won't read the word after the string as well. This should read string literals too (note that I have not tested this function yet, so I could have made a mistake).
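One case the string handling above still gets wrong is an escaped quote such as \" inside the literal, which would end the string too early. Here is a minimal sketch of one way to deal with that, under some assumptions: read_string and set_input are made-up names, getch is a stand-in that reads from a string rather than the K&R version, and read_string expects to be called after the opening quote has already been read.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the K&R getch: read from a string. */
static const char *input = "";

void set_input(const char *s) { input = s; }

int getch(void)
{
    return *input ? (unsigned char)*input++ : EOF;
}

/* Call after the opening quote has been read. Copies the literal, including
   both quotes, into word, writing at most lim bytes with the terminating NUL.
   Returns 1 if the closing quote was found, 0 if the input ended or the
   buffer filled up first. */
int read_string(char *word, int lim)
{
    char *w = word;
    int c, escaped = 0;

    assert(lim >= 2);
    *w++ = '"';
    while (w - word < lim - 1 && (c = getch()) != EOF) {
        *w++ = c;
        if (escaped)
            escaped = 0;    /* previous character was a backslash: take this one literally */
        else if (c == '\\')
            escaped = 1;
        else if (c == '"') {
            *w = '\0';
            return 1;
        }
    }
    *w = '\0';
    return 0;
}
```

With the input a\"b" rest, read_string stores "a\"b" in word, returns 1, and leaves the space before rest as the next character to be read. A quote that follows a backslash does not end the literal, and a doubled backslash does not escape the quote after it.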