Discard the stop word

closed account (iN8poG1T)
Hello, does anyone know how I can discard the stop words from a text file?



#include <iostream>
#include <sstream>
#include <fstream>
#include <map>

using namespace std;

int main()
{
    const string path = "1.txt";
    ifstream input( path.c_str() );

    if ( !input )
    {
        cout << "Error opening file." << endl;
        return 0;
    }

    multimap< string, int, less<string> >  words;
    int line;
    string word;

    // For each line of text
    for ( line = 1; input; line++ )
    {
        char buf[ 255 ];
        input.getline( buf, 128 );

        // Discard all punctuation characters, leaving only words
        for ( char *p = buf;
              *p != '\0';
              p++ )
        {
            if ( !isalpha( *p ) )
                *p = ' ';
        }

        istringstream i( buf );

        while ( i )
        {
            i >> word;
            if ( word != "" )
            {
                words.insert( pair<const string,int>( word, line ) );
            }
        }
    }

    input.close();

    // Output results
    multimap< string, int, less<string> >::iterator it1;
    multimap< string, int, less<string> >::iterator it2;

    for ( it1 = words.begin(); it1 != words.end(); )
    {
        it2 = words.upper_bound( (*it1).first );

        cout << (*it1).first << " : ";

        for ( ; it1 != it2; it1++ )
        {
            cout << (*it1).second << " ";
        }
        cout << endl;
    }

    return 0;
}


Below is my stop-word file, stopword.txt:


and
any
are
a
as
at
be
because
been
before
can't
cannot
could
couldn't
did
didn't
do
does
during
each
few
for
from
further
had
hadn't
has
hasn't
have
he'd
he'll
he's
her
here
here's
hers
herself
him
himself
his
how
how's
is
if
its
itself
let's
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan't
she
she'd
she'll
she's
should
shouldn't
so
some
such
whom
why
yourself
yourselves

And this is my text file, 1.txt:

A keystone prey species in the Southern Ocean is retreating towards
the Antarctic because of climate change.

Krill are small, shrimp-like creatures that swarm in vast numbers and
form a major part of the diets of whales, penguins, seabirds, seals
and fish.

Scientists say warming conditions in recent decades have led to the
krill contracting poleward.

If the shift is maintained, it will have negative ecosystem impacts,
they warn.

Already there is some evidence that macaroni penguins and fur seals
may be finding it harder to get enough of the krill to support their
populations.


Side note: I need to discard the stop words and output only the remaining words.
I don't understand your question. But you aren't testing for end-of-file properly with getline (the read itself should be the loop condition), and you are doing some things in a more complicated way than necessary (not using auto; not using -> for iterators; etc.).

#include <cctype>
#include <iostream>
#include <sstream>
#include <fstream>
#include <map>
#include <string>

using namespace std;

int main()
{
    const string path = "1.txt";
    ifstream input( path );

    if ( !input )
    {
        cerr << "Error opening file.\n";
        return 0;
    }

    multimap< string, int >  words;

    char buf[ 255 ];
    for ( int line = 1; input.getline( buf, sizeof buf ); line++ )
    {
        for ( char *p = buf; *p; ++p )
            if ( !isalpha( *p ) )
                *p = ' ';

        string word;
        for ( istringstream iss( buf ); iss >> word; )
            words.insert( make_pair( word, line ) );
    }

    for ( auto it1 = words.begin(); it1 != words.end(); )
    {
        auto it2 = words.upper_bound( it1->first );
        cout << it1->first << " : ";
        for ( ; it1 != it2; it1++ )
            cout << it1->second << ' ';
        cout << '\n';
    }
}

Read the words in stopword.txt into a set. Then insert words from 1.txt into "words" only if they don't match the words in the set.

Note that your stopword.txt contains words with punctuation. They will never match because you always remove the punctuation.

#include <cctype>
#include <iostream>
#include <sstream>
#include <fstream>
#include <map>
#include <set>
#include <string>
using namespace std;

// Read words from stream "is" and return a set containing them.
set<string>
wordsToSet(istream &is)
{
    // INSERT YOUR CODE HERE
}


int main()
{
    const string path = "1.txt";
    ifstream input( path );

    // Read the stopwords
    ifstream stopWordStream("stopword.txt");
    set<string> stopWords = wordsToSet(stopWordStream);
    stopWordStream.close();
    
    if ( !input )
    {
        cerr << "Error opening file.\n";
        return 0;
    }

    multimap< string, int >  words;

    string buf;
    int line = 0;
    while (getline(input, buf)) {
        ++line;
        for ( auto &ch : buf) {
            if ( !isalpha( ch ) )
                ch = ' ';
        }

        string word;
        for ( istringstream iss( buf ); iss >> word; ) {
            if (stopWords.find(word) == stopWords.end()) {
                words.insert( make_pair( word, line ) );
            }
        }
    }

    for ( auto it1 = words.begin(); it1 != words.end(); )
    {
        auto it2 = words.upper_bound( it1->first );
        cout << it1->first << " : ";
        for ( ; it1 != it2; it1++ )
            cout << it1->second << ' ';
        cout << '\n';
    }
}
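
To illustrate the note about punctuation: the program turns every non-letter into a space before splitting, so a stop word such as "can't" can never appear as a single token in the text. The text also contains capitalized words ("A", "If") while stopword.txt is all lower case, so those would not match either unless both sides are lower-cased. The sketch below shows one possible normalization step; `splitWords` is a hypothetical helper name, not something from the code above, and lower-casing is an assumption about what the original poster wants.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the thread's code): lower-case every
// letter, replace every other character with a space, then split the
// result on whitespace.
std::vector<std::string> splitWords(std::string text)
{
    for (char &ch : text) {
        // Cast to unsigned char first; passing a negative char to
        // isalpha/tolower is undefined behaviour.
        unsigned char uc = static_cast<unsigned char>(ch);
        ch = std::isalpha(uc) ? static_cast<char>(std::tolower(uc)) : ' ';
    }

    std::vector<std::string> words;
    std::istringstream iss(text);
    for (std::string word; iss >> word; )
        words.push_back(word);
    return words;
}
```

With this helper, `splitWords("can't")` yields the two tokens `can` and `t`, which is why an entry spelled "can't" in the stop-word set is never found. Running the stop words through the same function before storing them would make both sides comparable.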

closed account (iN8poG1T)
Hi, I don't understand this part. Can you elaborate? What should I do inside the function? Do I just need to read the stop-word file?

// Read words from stream "is" and return a set containing them.
set<string>
wordsToSet(istream &is)
{
    // INSERT YOUR CODE HERE
}
You should insert code to read the stream called "is" and return a set containing the words within it.
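
One way the body could look, as a minimal sketch rather than the only answer: it assumes the stop words are separated by whitespace (one per line, as in stopword.txt) and stores each one exactly as it is read.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>
using namespace std;

// Read whitespace-separated words from stream "is" and return a set
// containing them. Words are stored unchanged, so an entry such as
// "can't" keeps its apostrophe.
set<string>
wordsToSet(istream &is)
{
    set<string> result;
    string word;
    while (is >> word)          // stops at end of file or on a read error
        result.insert(word);    // duplicates are ignored by the set
    return result;
}
```

Because the function takes an `istream &`, it works with the `ifstream` opened on stopword.txt in `main` and equally with an `istringstream` for quick testing. If the stream failed to open, the loop simply never runs and the set comes back empty.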