Finding the distinct number of words in a binary tree

Apr 30, 2020 at 6:50am
In my latest assignment, we are tasked with creating a binary tree, sorting a file filled with different words, and looking for the number of distinct words within the binary tree.

I've written the code for a binary tree using in-order traversal, so everything is sorted.

I was thinking of using a vector, to hold all the values and then go through the vector and count the unique values, but I feel like that's not really what my professor is looking for.

I was thinking that I could have two strings - one called "found" and the other called "temp" and every time the tree was traversed, temp would hold the current node value, and if the value in the current node wasn't equal to the value in found, then found would update with the new string and the counter variable would be increased.

However, looking at my code, whenever I outputted the function, it always returned 1, despite the file filled with more than hundreds of words. I'd appreciate any help.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
  int numDistinctWords(string unique, node *t){
  int counter = 0;
  string found, temp = " ";

  if (t == NULL){
    counter = 0;
  }

    else{
    if (unique == t->word){
      temp = unique;
    }
      if(found != temp){
        counter++;
        found = temp;
      }
    }
      return counter;
}


Here is my function. If my full code is needed, I'll be happy to post that as well.
Last edited on Apr 30, 2020 at 6:51am
Apr 30, 2020 at 10:21am
The most simple way to do that is using std::set since it has already that unique feature:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
  void numDistinctWords(std::set<string>& unique, node *t){

  if (t != NULL){
    unique.insert(t->word);
    numDistinctWords(unique, t->left);
    numDistinctWords(unique, t->right);
  }
}

  int numDistinctWords(node *t)
{
  std::set<string> unique;
  numDistinctWords(unique, t);
  return unique.size();
}
Apr 30, 2020 at 12:53pm
how is it sorted? Usually, sorted data makes duplicate finding trivial : adjacent items are equal, you have duplicate, else not. But strings have case to deal with, so if you inserted from a book and got "The" and "the" and someone yelling "THE" then those would not be picked up as repeats from this idea.

Apr 30, 2020 at 3:43pm
@coder777 I appreciate your help, however my course has not yet taught set, so I won't be allowed to use that snippet of code.

@jonnin It's sorted using in-order traversal, so everything is in alphabetical order right now. So I should use 'tolower' for all the strings while reading it into the tree and then go from there?
Apr 30, 2020 at 3:44pm
Here is my full code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
#include <iostream>
#include <fstream>
#include <vector>

using namespace std;

class node {
public:
  string word;
  node *left, *right;
  node(string iword)
  {
    word = iword;
    left = right = NULL;
  }
  void insertLeft(node *p)
  {
    left = p;
  }

  void insertRight(node *p)
  {
    right = p;
  }
};


void inOrder(node *r)
{
  if(r != NULL)
  {
    inOrder(r -> left);
    cout << r->word << " ";
    inOrder(r->right);
  }
}



//--------------------------------------------------------------------------------------------------------------
void bInsert(string c, node **t)
{
  if (*t == NULL)
     *t = new node (c);

  else if((*t)->word <= c)
  bInsert(c, &((*t)->right));
  else
    bInsert(c, &((*t)->left));
    return;
}

//-------------------------------------------------------------------------------------------------------------------
bool member(string c, node *t)
{
  if(t == NULL)
    return false;
  else
    if (c == t->word)
      return true;

    else
      if( c < t->word)
      return member(c, t->left);
      else
      return member(c, t->right);
}


int numDistinctWords(string unique, node *t){
  int counter = 0;
  string found, temp = " ";

  if (t == NULL){
    counter = 0;
}
    else{
    if (unique == t->word){
      temp = unique;
    }
      if(found != temp){
        counter++;
        found = temp;
      }
}
      return counter;
}



int main()
{
  node *T;
  ifstream fin;
  string c;

  fin.open("C:\\Users\\owner\\Documents\\mytest.txt");
  if(fin.fail())
  {
    cout << "Could not find your file. Shutting down.";
    exit(1);
  }

  else
  fin >> c;
  T = NULL;

  while(!fin.eof()){
    if (c != " ")
    bInsert (c, &T);
    fin >> c;
  }



  cout << "In-order\n";
  inOrder(T); cout << endl;
  cout << numDistinctWords(c, T) << endl;


}

Last edited on Apr 30, 2020 at 3:44pm
Apr 30, 2020 at 5:38pm
You could NOT add duplicates to the tree and then count the number of nodes in the tree.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
#include <iostream>
#include <fstream>
#include <cctype>
using namespace std;

struct Node
{
    string word;
    Node *left, *right;

    Node(string w) : word(w), left(), right() {}
};

void treeInsert(Node*& t, string word)
{
    if (!t)
        t = new Node(word);
    else if (t->word < word)
        treeInsert(t->right, word);
    else if (t->word > word)
        treeInsert(t->left, word);
}

int treeSize(Node* t)
{
    return t ? 1 + treeSize(t->left) + treeSize(t->right) : 0;
}

void lowercase(string& s)
{
    for (auto& ch: s) ch = tolower(ch);
}

int main()
{
    ifstream fin("C:/Users/owner/Documents/mytest.txt");
    if (!fin)
    {
        cerr << "Could not find your file. Shutting down.\n";
        exit(1);
    }

    Node* t = nullptr;
    string word;
    while (fin >> word)
    {
        lowercase(word);
        treeInsert(t, word);
    }

    cout << "Distinct words: " << treeSize(t) << '\n';
}

Apr 30, 2020 at 7:07pm
I need the duplicates for other parts of the assignment (e.g most frequent word in the input) so I'm not sure if that's an option.

Why is the original function that I posted wrong? Why does it only output 1?
Apr 30, 2020 at 8:29pm
I assume numDistinctWords is supposed to be a recursive function, but it never calls itself.
Apr 30, 2020 at 9:28pm
I was able to do it with the duplicates in the tree by using a starter function to set up a couple of extra variables that are passed by reference to the actual recursive function. The extra variables are a string to hold the previously seen string and a count variable that is incremented anytime the current string doesn't match the previous one.

1
2
3
4
5
6
7
8
9
10
11
12
void countUnique(Node* root, string& prev, int& count)
{
    // use an inorder traversal here
}

void countUnique(Node* root)
{
    string prev;
    int count = 0;
    countUnique(root, prev, count);
    return count;
}

Apr 30, 2020 at 10:07pm
Your tree should store the word and the number of times it occurs. To process the input file, find (or insert) the word in the tree. Then increment the count. To print the unique words, traverse the tree and print the entries where count == 1.

bInsert would be cleaner if it took a reference to the pointer:
1
2
3
4
5
6
7
8
9
10
11
void
bInsert(string c, node * &t)
{
    if (t == NULL)
        t = new node(c);

    else if (t->word <= c)
        bInsert(c, t->right);
    else
        bInsert(c, t->left);
}


But as you'll see below, you actually don't need it.

I always find it helpful in binary trees to write this function:
1
2
3
4
5
6
7
// Find the node that contains c, and return a reference to the pointer
// to it. If c doesn't exist in the tree, return a reference to the pointer
// where it would go.
node * &find(string c, node * &t)
{
...
}


Then for this assignment, you want a function that finds the word in the tree, or inserts it if it isn't there already:
1
2
3
4
5
6
7
8
node *findOrAdd(string c, node * &t)
{
    node * &p = find(c, t);
    if (p == nullptr) {
        p = new node(c);
    }
    return p;
}


Now counting the number of times a word occurs is easy. This code assumes that the input comes from cin:
1
2
3
4
5
6
7
    node *T = nullptr;
    string word;

    while (cin >> word) {
        node *p = findOrAdd(word, T);
        p->count++;
    }


Recall that I said you should include the count in node. Be sure to initialize the count to zero in the constructor!

Last of all, the number of distinct words in a tree is:

the number of distinct words in the right branch,
plus the number in the left branch,
plus 1, if the current node represents a word occurred just once.
May 1, 2020 at 12:57pm
yes use to upper or to lower when you insert them so they match or it won't work right if the input is mixed case.

you can take dutch's idea one step more and track the highest frequency word(s) (what if there are ties, 2 words with 20 instances each?) as you insert so you don't have to look for it after inserting, you just have it. Eg as you insert, look at the count of the word after you updated it / created it. If it is bigger than the last biggest count you found, change the highest count to that one. If its equal, add it to a vector (word, count) thingys. Find a bigger one, clear the vector and start anew. ... at the end of insertion, you have the answer. Its 'a' way to do it. There are several ways to make it easy too. You can put a the pointer to each node into a vector as you build the tree, and sort the vector on *location.count and peel off the winners from that without traversing the tree :)
Its probably an exercise where the professor wants you to write a tree traversal, so you may be stuck doing that, but if it were open solution / real code you can avoid that in these 2 ways and probably several other ways with a little hand waving.
Last edited on May 1, 2020 at 1:00pm
May 3, 2020 at 6:47pm
@dutch I'm a bit confused on what you mean by 'use in-order traversal here'. Is it that I throw in the code I used for my in-order traversal as well as using my old numDistinctWords function?

@dhayden I've been trying to write the find function, but I'm stuck on how to return the reference to the pointer it would go if c doesn't exist. This is what I have so far:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
node *&find(string c, node* &t){
  if (t == NULL ){
    return t; 
  }  
  else { 
    if (c == t->word){
      return t; 
    } 

//I wrote a function that tests if c is a member of the tree, so I used that for this. 
  if(member(c,t) == false){
    if(c < t->word){
      return t->left; 
    }
    else {
      return t->right; 
    }
  }
  }
}
May 3, 2020 at 7:18pm
1
2
3
4
5
6
7
8
9
10
11
node *&find(string c, node* &t){
  if (t == NULL ){
    return t; 
  }  else if (c == t->word){
      return t; 
  } else if c < t->word) {
      return find(c, t->right);
  } else {
      return find(c, t->left);
  }
}

May 4, 2020 at 1:16am
Hello!

I'm sorry, I don't think I'm getting it right.


the number of distinct words in the right branch,
plus the number in the left branch,
plus 1, if the current node represents a word occurred just once.


I'm attempting to do that, but when I do, it doesn't work.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
int main()
{
  node *T;
  ifstream fin;
  string c;
int counter;

  fin.open("C:\\Users\\owner\\Documents\\mytest.txt");
  /*if(fin.fail())
  {
    cout << "Could not find your file. Shutting down.";
    exit(1);
  }*/
  T = nullptr;

  while (!fin.eof()) {
      node *p = findOrAdd(c, T);
      p->count++;
      fin >> c;
      counter = p->count + p->left->count + p->right->count;
  }






  cout << "In-order\n";

  inOrder(T); cout << endl;
  cout << endl;
  cout << "Numbers: " << counter;


}


However, when I remove counter = p->count + p->left->count + p->right->count; and only do counter = p->count; My code only prints out 4 - which isn't the right answer but I'm assuming that's like only like half of the tree. I'm sorry for the constant updates, I would appreciate the help. Thank you!
May 4, 2020 at 5:06am
I got it! :)
Topic archived. No new replies allowed.