Removing duplicate words in a string

I am trying to fix my code so that it removes repeated words and keeps a count next to each word of how many times it occurs. Currently my code sets the count of every word in the paragraph to 1. I need to remove duplicate words (by shifting up) and keep only one entry for each word in the array. If a word occurs 10 times in the original text, then the ‘count’ field of that word in the array should be 10.

#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
using namespace std;

const int SIZE = 268;


struct  WordCount
{
	string name;
	int count = 0;
};

void bubbleSort(string arr[], int size)
{
	for (int stopMarker = 1; stopMarker < size; stopMarker++)
	{
		for (int j = size - 1; j >= stopMarker; j--)
		{
			if (arr[j] < arr[j - 1])
			{
				swap(arr[j], arr[j - 1]);
			}
		}
	}
}

int main()
{
	string wordArray[SIZE];
	WordCount arr[SIZE];
	WordCount item;

	ifstream input;

	input.open("F:\\gdp.txt");



	for (int i = 0; i < SIZE; i++)
	{
		input >> wordArray[i];
	}

	input.close();

	bubbleSort(wordArray, SIZE);

	for (int i = 0; i < SIZE; i++)
	{
		item.name = wordArray[i];
		item.count = 1;
		arr[i] = item;
	}

	cout << "Array content..." << endl << endl;

	for (int i = 0; i < SIZE; i++)
	{
		cout << setw(13) << left << arr[i].name << setw(10) << "Count = " << arr[i].count << endl;
	}

	return 0;
}



This is the text/paragraph I am using:

GDP is commonly used as an indicator of the economic health of a country, as well as a gauge of a countrys standard of living. Since the mode of measuring GDP is uniform from country to country, GDP can be used to compare the productivity of various countries with a high degree of accuracy. Adjusting for inflation from year to year allows for the seamless comparison of current GDP measurements with measurements from previous years or quarters. In this way, a nations GDP from any period can be measured as a percentage relative to previous years or quarters. When measured in this way, GDP can be tracked over long spans of time and used in measuring a nation’s economic growth or decline, as well as in determining if an economy is in recession. GDPs popularity as an economic indicator in part stems from its measuring of value added through economic processes. For example, when a ship is built, GDP does not reflect the total value of the completed ship, but rather the difference in values of the completed ship and of the materials used in its construction. Measuring total value instead of value added would greatly reduce GDPs functionality as an indicator of progress or decline, specifically within individual industries and sectors. Proponents of the use of GDP as an economic measure tout its ability to be broken down in this way and thereby serve as an indicator of the failure or success of economic policy as well. For example, from 2004 to 2014 Frances GDP increased by 53.1%, while Japans increased by 6.9% during the same period.

When you read the words, each word you read is

EITHER already seen => increase its count
OR new => add it with count 1

How would you make and search a list of words that you have already seen?
So I actually figured out how to get the count of each word and display each word only once. However, I am trying to figure out how to treat commas as spaces. Currently the program thinks "GDP" and "GDP," are different words. The new code is posted below:

#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
#include <algorithm>
using namespace std;

const int SIZE = 268;


struct  WordCount
{
	string name;
	int count;
};

void bubbleSort(string arr[], int size)
{
	for (int stopMarker = 1; stopMarker < size; stopMarker++)
	{
		for (int j = size - 1; j >= stopMarker; j--)
		{
			if (arr[j] < arr[j - 1])
			{
				swap(arr[j], arr[j - 1]);
			}
		}
	}
}

int main()
{
	string wordArray[SIZE];
	string temp;
	WordCount arr[SIZE];
	WordCount item;

	ifstream input;

	input.open("F:\\gdp.txt");


	for (int i = 0; i < SIZE; i++)
	{
		input >> wordArray[i];
		temp.erase(std::remove_if(temp.begin(), temp.end(), [](char c)
		{
			return c == ',';
		}),
			temp.end());
	}

	input.close();

	bubbleSort(wordArray, SIZE);

	int j = 0;
	for (int i = 0; j < SIZE; i++)
	{
		if (item.name != wordArray[j] && j < SIZE)
		{
			item.name = wordArray[j];
			item.count = 1;
			arr[i] = item;
			j = j + 1;
		}
		while (j < SIZE && item.name == wordArray[j])
		{
			item.count++;
			arr[i] = item;
			j = j + 1;
		}
	}

	cout << "Array content..." << endl << endl;

	for (int i = 0; i < 135; i++)
	{
		cout << setw(13) << left << arr[i].name << setw(10) << "Count = " << arr[i].count << endl;
	}

	return 0;
}


#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
#include <sstream>
#include <vector>
#include <cctype>
using namespace std;

const int SIZE = 8000;

struct WordCount
{
	string name;
	int count;
};

void bubbleSort(WordCount arr[], int size)
{
	int i, j;
	WordCount temp;
	for(i = 0; i < size; i++)
	for(j = 0; j < size - 1; j++)
	{
		if(arr[j].name > arr[j + 1].name)
		{
			temp = arr[j], arr[j] = arr[j + 1], arr[j + 1] = temp;
		}
	}
}

void removeSpecificWord(WordCount arr[], int idx, int &size)
{
	int i;
	for(i = idx; i < size - 1; i++) arr[i] = arr[i + 1];
	size--;
}

void removeDuplicateWord(WordCount arr[], int &size)
{
	int i, j;
	string target;
	for(i = 0; i < size; i++)
	{
		arr[i].count++;
		target = arr[i].name;
		for(j = i + 1; j < size; j++)
		{
			if(target == arr[j].name) removeSpecificWord(arr, j, size), arr[i].count++, j--;
		}
	}
}

int main()
{
	int i;
	int arraySize = 0;
	WordCount wordArray[SIZE];

	string fileName("gdp.txt");
	ifstream inFile(fileName.c_str());

	while(!inFile.is_open())
	{
		cout << "The file \"" << fileName << "\" not found. Enter another file : "; getline(cin, fileName);
		inFile.clear(); inFile.open(fileName.c_str()); cout << endl;
	}

	cout << "The file \"" << fileName << "\" has been opened successfully.\n\n";

	string word;
	string fileContent;
	while(inFile >> word) fileContent += word + ' ';
	inFile.close();

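	// Cast to unsigned char: passing a negative char to ispunct is undefined behaviour.
	for(i = 0; i < static_cast<int>(fileContent.size()); i++) if(ispunct(static_cast<unsigned char>(fileContent[i]))) fileContent[i] = ' ';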

	stringstream ss;
	ss << fileContent;

	// Stop at SIZE words so the array is never overrun.
	while(arraySize < SIZE && ss >> fileContent) wordArray[arraySize].name = fileContent, wordArray[arraySize].count = 0, arraySize++;

	removeDuplicateWord(wordArray, arraySize);
	bubbleSort(wordArray, arraySize);

	cout << "Array content..." << endl << endl;

	for (i = 0; i < arraySize; i++)
	{
		cout << setw(13) << left << wordArray[i].name << setw(10) << "Count = " << wordArray[i].count << endl;
	}

	return 0;
}


I am trying to solve a problem with my code that essentially deletes or removes repeated words and takes a count next to each word, of how many there are of those. Currently my code sets every word in the paragraph as 1. So essentially I need to remove (by shifting up) duplicate words, keep only one entry for each word in the array. If the word occurs 10 times in the original text, then the "count" field of this word in the array should be incremented to 10.


The file "gdp.txt" has been opened successfully.

Array content...

1            Count =   1
10           Count =   2
Currently    Count =   1
I            Count =   2
If           Count =   1
So           Count =   1
a            Count =   2
am           Count =   1
and          Count =   1
are          Count =   1
array        Count =   2
as           Count =   1
be           Count =   1
by           Count =   1
code         Count =   2
count        Count =   2
deletes      Count =   1
duplicate    Count =   1
each         Count =   2
entry        Count =   1
essentially  Count =   2
every        Count =   1
field        Count =   1
for          Count =   1
how          Count =   1
in           Count =   4
incremented  Count =   1
keep         Count =   1
many         Count =   1
my           Count =   2
need         Count =   1
next         Count =   1
occurs       Count =   1
of           Count =   3
one          Count =   1
only         Count =   1
or           Count =   1
original     Count =   1
paragraph    Count =   1
problem      Count =   1
remove       Count =   1
removes      Count =   1
repeated     Count =   1
sets         Count =   1
shifting     Count =   1
should       Count =   1
solve        Count =   1
takes        Count =   1
text         Count =   1
that         Count =   1
the          Count =   6
then         Count =   1
there        Count =   1
this         Count =   1
those        Count =   1
times        Count =   1
to           Count =   4
trying       Count =   1
up           Count =   1
with         Count =   1
word         Count =   5
words        Count =   2
To the OP: on line 46 of the code in your last post, you apply the erase-remove idiom to the string temp, which is always empty. Perhaps you should apply it to wordArray[i] instead. Sanitizing each string you read is a good idea; you just need to follow through and sanitize the right string. ;)