TFIDF

Here's a link to the documentation of fprintf https://www.cplusplus.com/reference/cstdio/fprintf/
\t is an escape sequence that means a tab character.
%d is a conversion specifier that means to insert the next argument in the list, which must be an int, as a decimal number.
Hello lugopen,

While I look over your code: you have a program that reads an input file. PLEASE provide the input file, or at least a good sample, so that everyone will be using the same information.

I have to ask: are you intending this to be a C program or a C++ program?

Although it does work, mixing C and C++ code is not the best idea.

Andy
Here's the doc link for fscanf as well since you were asking about input https://www.cplusplus.com/reference/cstdio/fscanf/
Hello lugopen,

I debug a program the same way it should have been written, in small steps or parts.

Since the first function call in "main" is to load the file I started there.

You have tfidf.load(stdin);. What is the point of sending "stdin" to the function when it is never used?

In the "load" function:
void TfIdf::load(FILE* file)
{
  file = fopen ("myfile.txt","w+");

This function should not take any parameters because none should be sent.

Opening the file as "w+" will create the file if it does not exist, but it will also truncate or delete any contents in the file. So if the file did not exist in the 1st place it will be created giving a false impression until you try to read an empty file and encounter a problem.

Your function should be more like this:
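void TfIdf::load(/*FILE* file*/)
{
    FILE* file = fopen("myfile.txt", "r");

    if (!file)
    {
        perror("File opening failed");

        return;  // <--- Have nothing to read past this point.
    }

    if (fscanf(file, "%d%d", &docCount, &wordCount) != 2)
    {
        fclose(file);

        return;  // <--- The header is missing or is not two integers.
    }

    tf = new Map[docCount];   // <--- Looks like you are creating an array of "Map" objects, one per document.
    df = new int[wordCount]{};  // <--- The "{}" zero-initializes the counts.

    int docId{}, wordId{};  // <--- ALWAYS initialize all your variables.

    // <--- Compare against 2, the number of items expected. With "!= EOF" a non-numeric token would loop forever.
    while (fscanf(file, "%d%d", &docId, &wordId) == 2)
    {
        addWord(docId - 1, wordId - 1);
    }

    fclose(file);
}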


I need an idea of the input file to know how to proceed.

Andy
L113 should be:

 
tfidf.load(file);


Note that the file has to start with two integers - docCount and wordCount

and that the file consists of pairs of integers - docId and wordId. It's not a file containing character text.

to compare a sentence of the user to a text contained in a text file and to have the sentence that most resembles or that corresponds to the words of the user


That's not what this code does.
Hello lugopen,

As seeplus has noted, your code does not match what you are trying to do.

It is time to provide complete instructions of what needs to be done along with the input file, or a good sample to work with, so that everyone will know what you have to do and not have to guess.

Here is a guess based on your 1st code and changing it to C++:
#include <iostream>
#include <string>
//#include <vector>   // <--- If you have "std::map", what is the vector for?

#include <algorithm>  // <--- Not sure what this is for?
#include <fstream>
#include <map>

struct TfIdf
{
    std::map<std::string, int> words;
    std::map<int, int>::iterator Iter;

    TfIdf() : docCount(0), wordCount(0)/*, tf(NULL), df(NULL)*/ {}
    //~TfIdf() {}  // <--- Since this is empty the compiler will provide a default dtor.

    int load();
    //int save(FILE* file, int minDf, int maxDf) const;
    void addWord(int docId, int wordId);

    int docCount;       // the number of documents
    int wordCount;      // the number of words
    //words* tf;            // term frequency
    //int* df;            // document frequency
};

int TfIdf::load()
{
    std::ifstream inFile( "myfile.txt");

    if (!inFile)
    {
        return std::cout << "\n     File opening failed\n", 1;
    }

    inFile >> docCount >> wordCount;

    //tf = new words[docCount];  // <--- Looks like you are creating an array of "maps". 
    //df = new int[wordCount];

    int docId{}, wordId{};  // <--- ALWAYS initialize all your variables.

    while (inFile >> docId >> wordId)
    {
        //addWord(docId - 1, wordId - 1);
    }

    return 0;
}

int main()
{
    int response{};

    TfIdf tfidf;

    if ((response = tfidf.load()) != 0)  // <--- The extra parentheses and "!= 0" show the assignment is intended.
    {
        return response;
    }

    //tfidf.save(stdout, 5, tfidf.docCount * 0.1);

    // <--- Keeps console window open when running in debug mode on Visual Studio. Or a good way to pause the program.
    // The next line may not be needed. If you have to press enter to see the prompt it is not needed.
    //std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');  // <--- Requires header file <limits>.
    std::cout << "\n\n Press Enter to continue: ";
    std::cin.get();

    return 0;  // <--- Not required, but makes a good break point.
}

I know this is not 100% right, but it will give you an idea for a start.

The "load" function, which I think might work better as "LoadMap", opens the file in the function and when the function ends the "ifstream" dtor will close the file stream.

Or you could define the file stream in "main" and pass it to the function by reference, along with the file name if needed. The function would then use "inFile.open()" to open the file and "inFile.close()" to close it before returning, because a stream that lives in "main" is not closed automatically when the function ends. That leaves the file stream variable defined in "main" available to reuse in another function.

Andy
IMO that code is not even a starting point for what you want.

to compare a sentence of the user to a text contained in a text file and to have the sentence that most resembles or that corresponds to the words of the user.


On the surface, this is quite simple:


Obtain user sentence
Split user sentence into words.
Repeat {
    Read a sentence from the file.
    For each word in sentence {
        compare to user words (using soundex?)
        if word matches
             increment match count
    }
    save sentence and match count
} while not file eof
sort result by match count
display sentence with highest match count


However, the devil is in the detail:
1) what constitutes a 'sentence'
2) what constitutes a 'word'
3) what constitutes a word 'match' - what about plurals, apostrophe etc etc

Much more info regarding these details is required.
Can you post a sample of myfile.txt please.
to compare a sentence of the user to a text contained in a text file and to have the sentence that most resembles or that corresponds to the words of the user


For a different take on this, consider the following as a starter, in C++17:

//to compare a sentence of the user to a text contained in a text file
//and to have the sentence that most resembles or that corresponds to the words of the user

#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <iomanip>
#include <iterator>	// For std::istream_iterator
#include <cctype>
#include <cstring>
#include <algorithm>

// Compare word to list.
// This is just a simple comparison. Change if a more sophisticated comparison is required
auto compWord(const std::vector<std::string>& tofind, const std::string& word) {
	return std::find(tofind.cbegin(), tofind.cend(), word);
}

// Obtain list of words to compare
auto getWords() {
	std::vector<std::string> tofind;

	std::cout << "Enter the word(s) that the sentence must contain to match. 0 to terminate\n";

	for (std::string word; (std::cin >> word) && word != "0"; ) {
		std::string tomat;

		for (const auto& ch : word)
			if (std::isalpha(static_cast<unsigned char>(ch)))
				tomat += static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));

		tofind.emplace_back(tomat);
	}
	return tofind;
}

int main()
{
	constexpr const char* senterm {"!?."};
	std::ifstream textf("myfile.txt");

	if (!textf)
		return (std::cout << "Cannot open input file\n"), 1;

	textf >> std::noskipws;	// Need to not skip whitespace for the file iteration

	const std::string text((std::istream_iterator<char>(textf)), std::istream_iterator<char>());	// File text
	const auto tofind {getWords()};			// Words to find
	std::vector<unsigned> wrdmatch(tofind.size());	// Count of each word found per sentence
	std::string word;				// Current word found in sentence
	const char *ststrt {}, *stend {};		// begin/end of current sentence
	const char *beststrt {}, *bestend {};		// begin/end of sentence with highest match

	bool atend {};		// Set when end of text
	bool gotsent {};	// Set when have a sentence
	unsigned most {};	// Most matched words
	unsigned match {};	// Number matched in a sentence

	for (auto chp {text.c_str()}; !atend; atend = (*chp++) == 0) {
		if (*chp && std::isalpha(static_cast<unsigned char>(*chp))) {
			// Got a character. Add to current word
			word += static_cast<char>(std::tolower(static_cast<unsigned char>(*chp)));
			gotsent = true;

			if (ststrt == nullptr)
				ststrt = chp;

		} else {
			if (*chp == 0 || std::isspace(static_cast<unsigned char>(*chp)) || std::strchr(senterm, *chp) != NULL)
				// End of word
				if (!word.empty()) {
					// Do a word compare. This is simply an equality test here, but could be soundex etc etc
					if (const auto itr {compWord(tofind, word)}; itr != tofind.cend()) {
						++match;
						++wrdmatch[itr - tofind.cbegin()];
					}

					word.clear();
				}

			if (*chp == 0 || std::strchr(senterm, *chp) != NULL) {
				// End of sentence
				stend = chp;

				if (gotsent && stend != ststrt) {
					if (match > most && std::all_of(wrdmatch.begin(), wrdmatch.end(), [](auto no) {return no > 0; })) {
						most = match;
						beststrt = ststrt;
						bestend = stend;
					}

					std::fill(wrdmatch.begin(), wrdmatch.end(), 0);
					match = 0;
					ststrt = nullptr;
					gotsent = false;
				}
			}
		}
	}

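	// Nothing to show if no sentence contained every word (or no words were entered)
	if (beststrt == nullptr)
		return (std::cout << "No sentence contains all of the words\n"), 0;

	std::cout << "A best sentence match with " << most << " total matches is:\n";
	std::cout << std::string(beststrt, bestend) << '\n';
}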


This will ask for the words to be found. It will display the sentence from the file that contains the overall greatest number of all of these words and contains at least one of each word. If more than one sentence meets these criteria, then the first will be shown.

Note that it ignores all punctuation - so doesn't becomes doesnt for the purpose of word comparison. Also ignore and ignores (and plurals etc) are considered different words. For a better comparison replace the compWord() function.