Finding the distinct number of words in

Forum

Forum
Beginners
Finding the distinct number of words in

Finding the distinct number of words in a binary tree

Apr 30, 2020 at 6:50am

In my latest assignment, we are tasked with creating a binary tree, sorting a file filled with different words, and looking for the number of distinct words within the binary tree.

I've written the code for a binary tree using in-order traversal, so everything is sorted.

I was thinking of using a vector, to hold all the values and then go through the vector and count the unique values, but I feel like that's not really what my professor is looking for.

I was thinking that I could have two strings - one called "found" and the other called "temp" and every time the tree was traversed, temp would hold the current node value, and if the value in the current node wasn't equal to the value in found, then found would update with the new string and the counter variable would be increased.

However, looking at my code, whenever I outputted the function, it always returned 1, despite the file filled with more than hundreds of words. I'd appreciate any help.

  int numDistinctWords(string unique, node *t){
  int counter = 0;
  string found, temp = " ";

  if (t == NULL){
    counter = 0;
  }

    else{
    if (unique == t->word){
      temp = unique;
    }
      if(found != temp){
        counter++;
        found = temp;
      }
    }
      return counter;
}

Here is my function. If my full code is needed, I'll be happy to post that as well.

Last edited on Apr 30, 2020 at 6:51am

Apr 30, 2020 at 10:21am

coder777 (8449)

The most simple way to do that is using std::set since it has already that unique feature:

  void numDistinctWords(std::set<string>& unique, node *t){

  if (t != NULL){
    unique.insert(t->word);
    numDistinctWords(unique, t->left);
    numDistinctWords(unique, t->right);
  }
}

  int numDistinctWords(node *t)
{
  std::set<string> unique;
  numDistinctWords(unique, t);
  return unique.size();
}

Apr 30, 2020 at 12:53pm

jonnin (11492)

how is it sorted? Usually, sorted data makes duplicate finding trivial : adjacent items are equal, you have duplicate, else not. But strings have case to deal with, so if you inserted from a book and got "The" and "the" and someone yelling "THE" then those would not be picked up as repeats from this idea.

Apr 30, 2020 at 3:43pm

KittyIchigo1 (55)

@coder777 I appreciate your help, however my course has not yet taught set, so I won't be allowed to use that snippet of code.

@jonnin It's sorted using in-order traversal, so everything is in alphabetical order right now. So I should use 'tolower' for all the strings while reading it into the tree and then go from there?

Apr 30, 2020 at 3:44pm

KittyIchigo1 (55)

Here is my full code:

#include <iostream>
#include <fstream>
#include <vector>

using namespace std;

class node {
public:
  string word;
  node *left, *right;
  node(string iword)
  {
    word = iword;
    left = right = NULL;
  }
  void insertLeft(node *p)
  {
    left = p;
  }

  void insertRight(node *p)
  {
    right = p;
  }
};


void inOrder(node *r)
{
  if(r != NULL)
  {
    inOrder(r -> left);
    cout << r->word << " ";
    inOrder(r->right);
  }
}



//--------------------------------------------------------------------------------------------------------------
void bInsert(string c, node **t)
{
  if (*t == NULL)
     *t = new node (c);

  else if((*t)->word <= c)
  bInsert(c, &((*t)->right));
  else
    bInsert(c, &((*t)->left));
    return;
}

//-------------------------------------------------------------------------------------------------------------------
bool member(string c, node *t)
{
  if(t == NULL)
    return false;
  else
    if (c == t->word)
      return true;

    else
      if( c < t->word)
      return member(c, t->left);
      else
      return member(c, t->right);
}


int numDistinctWords(string unique, node *t){
  int counter = 0;
  string found, temp = " ";

  if (t == NULL){
    counter = 0;
}
    else{
    if (unique == t->word){
      temp = unique;
    }
      if(found != temp){
        counter++;
        found = temp;
      }
}
      return counter;
}



int main()
{
  node *T;
  ifstream fin;
  string c;

  fin.open("C:\\Users\\owner\\Documents\\mytest.txt");
  if(fin.fail())
  {
    cout << "Could not find your file. Shutting down.";
    exit(1);
  }

  else
  fin >> c;
  T = NULL;

  while(!fin.eof()){
    if (c != " ")
    bInsert (c, &T);
    fin >> c;
  }



  cout << "In-order\n";
  inOrder(T); cout << endl;
  cout << numDistinctWords(c, T) << endl;


}

Edit & run on cpp.sh

Last edited on Apr 30, 2020 at 3:44pm

Apr 30, 2020 at 5:38pm

dutch (2548)

You could NOT add duplicates to the tree and then count the number of nodes in the tree.

#include <iostream>
#include <fstream>
#include <cctype>
using namespace std;

struct Node
{
    string word;
    Node *left, *right;

    Node(string w) : word(w), left(), right() {}
};

void treeInsert(Node*& t, string word)
{
    if (!t)
        t = new Node(word);
    else if (t->word < word)
        treeInsert(t->right, word);
    else if (t->word > word)
        treeInsert(t->left, word);
}

int treeSize(Node* t)
{
    return t ? 1 + treeSize(t->left) + treeSize(t->right) : 0;
}

void lowercase(string& s)
{
    for (auto& ch: s) ch = tolower(ch);
}

int main()
{
    ifstream fin("C:/Users/owner/Documents/mytest.txt");
    if (!fin)
    {
        cerr << "Could not find your file. Shutting down.\n";
        exit(1);
    }

    Node* t = nullptr;
    string word;
    while (fin >> word)
    {
        lowercase(word);
        treeInsert(t, word);
    }

    cout << "Distinct words: " << treeSize(t) << '\n';
}

Edit & run on cpp.sh

Apr 30, 2020 at 7:07pm

KittyIchigo1 (55)

I need the duplicates for other parts of the assignment (e.g most frequent word in the input) so I'm not sure if that's an option.

Why is the original function that I posted wrong? Why does it only output 1?

Apr 30, 2020 at 8:29pm

dutch (2548)

I assume numDistinctWords is supposed to be a recursive function, but it never calls itself.

Apr 30, 2020 at 9:28pm

dutch (2548)

I was able to do it with the duplicates in the tree by using a starter function to set up a couple of extra variables that are passed by reference to the actual recursive function. The extra variables are a string to hold the previously seen string and a count variable that is incremented anytime the current string doesn't match the previous one.

void countUnique(Node* root, string& prev, int& count)
{
    // use an inorder traversal here
}

void countUnique(Node* root)
{
    string prev;
    int count = 0;
    countUnique(root, prev, count);
    return count;
}

Apr 30, 2020 at 10:07pm

dhayden (5799)

Your tree should store the word and the number of times it occurs. To process the input file, find (or insert) the word in the tree. Then increment the count. To print the unique words, traverse the tree and print the entries where count == 1.

bInsert would be cleaner if it took a reference to the pointer:

void
bInsert(string c, node * &t)
{
    if (t == NULL)
        t = new node(c);

    else if (t->word <= c)
        bInsert(c, t->right);
    else
        bInsert(c, t->left);
}

But as you'll see below, you actually don't need it.

I always find it helpful in binary trees to write this function:

// Find the node that contains c, and return a reference to the pointer
// to it. If c doesn't exist in the tree, return a reference to the pointer
// where it would go.
node * &find(string c, node * &t)
{
...
}

Then for this assignment, you want a function that finds the word in the tree, or inserts it if it isn't there already:

node *findOrAdd(string c, node * &t)
{
    node * &p = find(c, t);
    if (p == nullptr) {
        p = new node(c);
    }
    return p;
}

Now counting the number of times a word occurs is easy. This code assumes that the input comes from cin:

    node *T = nullptr;
    string word;

    while (cin >> word) {
        node *p = findOrAdd(word, T);
        p->count++;
    }

Recall that I said you should include the count in node. Be sure to initialize the count to zero in the constructor!

Last of all, the number of distinct words in a tree is:

the number of distinct words in the right branch,
plus the number in the left branch,
plus 1, if the current node represents a word occurred just once.

May 1, 2020 at 12:57pm

jonnin (11492)

yes use to upper or to lower when you insert them so they match or it won't work right if the input is mixed case.

you can take dutch's idea one step more and track the highest frequency word(s) (what if there are ties, 2 words with 20 instances each?) as you insert so you don't have to look for it after inserting, you just have it. Eg as you insert, look at the count of the word after you updated it / created it. If it is bigger than the last biggest count you found, change the highest count to that one. If its equal, add it to a vector (word, count) thingys. Find a bigger one, clear the vector and start anew. ... at the end of insertion, you have the answer. Its 'a' way to do it. There are several ways to make it easy too. You can put a the pointer to each node into a vector as you build the tree, and sort the vector on *location.count and peel off the winners from that without traversing the tree :)
Its probably an exercise where the professor wants you to write a tree traversal, so you may be stuck doing that, but if it were open solution / real code you can avoid that in these 2 ways and probably several other ways with a little hand waving.

Last edited on May 1, 2020 at 1:00pm

May 3, 2020 at 6:47pm

KittyIchigo1 (55)

@dutch I'm a bit confused on what you mean by 'use in-order traversal here'. Is it that I throw in the code I used for my in-order traversal as well as using my old numDistinctWords function?

@dhayden I've been trying to write the find function, but I'm stuck on how to return the reference to the pointer it would go if c doesn't exist. This is what I have so far:

node *&find(string c, node* &t){
  if (t == NULL ){
    return t; 
  }  
  else { 
    if (c == t->word){
      return t; 
    } 

//I wrote a function that tests if c is a member of the tree, so I used that for this. 
  if(member(c,t) == false){
    if(c < t->word){
      return t->left; 
    }
    else {
      return t->right; 
    }
  }
  }
}

May 3, 2020 at 7:18pm

dhayden (5799)

node *&find(string c, node* &t){
  if (t == NULL ){
    return t; 
  }  else if (c == t->word){
      return t; 
  } else if c < t->word) {
      return find(c, t->right);
  } else {
      return find(c, t->left);
  }
}

May 4, 2020 at 1:16am

KittyIchigo1 (55)

Hello!

I'm sorry, I don't think I'm getting it right.

the number of distinct words in the right branch,
plus the number in the left branch,
plus 1, if the current node represents a word occurred just once.

I'm attempting to do that, but when I do, it doesn't work.

int main()
{
  node *T;
  ifstream fin;
  string c;
int counter;

  fin.open("C:\\Users\\owner\\Documents\\mytest.txt");
  /*if(fin.fail())
  {
    cout << "Could not find your file. Shutting down.";
    exit(1);
  }*/
  T = nullptr;

  while (!fin.eof()) {
      node *p = findOrAdd(c, T);
      p->count++;
      fin >> c;
      counter = p->count + p->left->count + p->right->count;
  }






  cout << "In-order\n";

  inOrder(T); cout << endl;
  cout << endl;
  cout << "Numbers: " << counter;


}

However, when I remove counter = p->count + p->left->count + p->right->count; and only do counter = p->count; My code only prints out 4 - which isn't the right answer but I'm assuming that's like only like half of the tree. I'm sorry for the constant updates, I would appreciate the help. Thank you!

May 4, 2020 at 5:06am

KittyIchigo1 (55)

I got it! :)

Topic archived. No new replies allowed.

C++

Forum

Finding the distinct number of words in a binary tree