Paul5 (9)
Task1: Construct a program to load all unique words of a specific file in an array and display them.
Task2: Construct a program to load and display all words of a specific file with
their occurrences as 2D array.
Task3: Design and develop a search engine that will take a word from user as
input and will suggest top 5 most relevant documents in specified system
directory. For every search operation:
Array will be a two dimensional array that will maintain filename along with
the frequency/occurrences of provided word in each document.
After loading, your system should be able to display the top five most
relevant documents on the basis of the word's frequency/occurrence in
the documents.
What logic should be used for these programs, and how do we handle the 2D array in them? Note that these programs deal with words, not numbers.
|
this is fairly involved.
for words, the main consideration is to compare case-insensitively and to ignore punctuation.
for example, 'hello', 'Hello' and 'hello,' are all the same word... but case differences and punctuation glued to a word will break the comparison: == on strings is an exact match, not a 'near' match.
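a minimal sketch of that cleanup step, assuming plain ASCII text. normalize is a made-up helper name, not something from the assignment. it lowercases a token and throws away anything that is not a letter or digit, so all three spellings above come out identical:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lowercase a token and drop punctuation so that
// "Hello", "hello" and "hello," all compare equal afterwards.
// Assumes ASCII input; non-ASCII text would need a different approach.
std::string normalize(const std::string& token) {
    std::string out;
    for (unsigned char ch : token) {      // unsigned char keeps the <cctype> calls well defined
        if (std::isalnum(ch)) {
            out += static_cast<char>(std::tolower(ch));
        }
    }
    return out;                           // may be empty if the token was pure punctuation
}
```

run every token through it before storing or comparing, and skip tokens that come back empty. note that it also strips apostrophes and hyphens inside words ("don't" becomes "dont"), which is probably fine for counting but is a choice you are making.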
2d array sounds like a school problem where the worst possible design is handed to you.
as a first cut, maybe make the element type like this (string here is std::string)..
struct entry {string word; int count{1};};
and the 2-d array like this (numdocuments and maxwords have to be compile-time constants):
entry derp[numdocuments][maxwords];
such that derp[0][0] is the first document's first word, derp[0][1] is the first document's second word, derp[1][3] is the second document's fourth word....
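a rough sketch of filling one row of that array (one document). add_word is a hypothetical name, and using an empty word to mark the first unused slot is my own convention for the sketch, not something the assignment specifies:

```cpp
#include <cstddef>
#include <string>

struct entry { std::string word; int count{1}; };

// Records one occurrence of `word` in a row with `capacity` slots.
// Assumption: an empty `word` field marks the first unused slot.
// Returns false if `word` is empty, or if the row is full and the
// word is not already stored in it.
bool add_word(entry row[], std::size_t capacity, const std::string& word) {
    if (word.empty()) return false;
    for (std::size_t i = 0; i < capacity; ++i) {
        if (row[i].word.empty()) {        // reached the free area: store a new word
            row[i].word = word;
            row[i].count = 1;
            return true;
        }
        if (row[i].word == word) {        // seen before: bump its count
            ++row[i].count;
            return true;
        }
    }
    return false;                         // row is full
}
```

for each file you would read tokens with something like `file >> token`, clean them up, and call add_word(derp[d], maxwords, token). the linear search is slow for big files, but it is what a plain array gives you.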
but now you have a problem: you didn't save the file name for each document, because it's a 2d array and not a sensible design. if you add it to the struct, you have a redundant filename copied numwords times, wasting space. if you keep it on the side, it's clunky. you can fix it here: why not set the first entry per file to the filename, with count == -1 or perhaps 0? the words then start at index 1. this will work, and makes efficient use of the design you have been forced into.
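here is one way the filename-in-slot-0 trick could feed the top-5 search from task 3. treat it as a sketch under assumptions: kMaxWords, set_filename, count_of and top_documents are all names I made up, an empty word again marks the end of the used part of a row, and documents that don't contain the word at all are simply left out of the result:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct entry { std::string word; int count{1}; };

constexpr std::size_t kMaxWords = 64;   // assumed row width, including the filename slot

// Marks slot 0 of a row as the filename record (count == -1).
void set_filename(entry row[], const std::string& filename) {
    row[0].word = filename;
    row[0].count = -1;
}

// Looks up `word` in one row, skipping the filename slot. Returns 0 if absent.
int count_of(const entry row[], const std::string& word) {
    for (std::size_t i = 1; i < kMaxWords && !row[i].word.empty(); ++i) {
        if (row[i].word == word) return row[i].count;
    }
    return 0;
}

// Returns up to `limit` (filename, count) pairs, highest count first.
// Documents that do not contain the word are omitted.
std::vector<std::pair<std::string, int>>
top_documents(const entry table[][kMaxWords], std::size_t numdocs,
              const std::string& word, std::size_t limit = 5) {
    std::vector<std::pair<std::string, int>> hits;
    for (std::size_t d = 0; d < numdocs; ++d) {
        int n = count_of(table[d], word);
        if (n > 0) hits.emplace_back(table[d][0].word, n);
    }
    // stable_sort keeps folder order among documents with equal counts
    std::stable_sort(hits.begin(), hits.end(),
                     [](const std::pair<std::string, int>& a,
                        const std::pair<std::string, int>& b) {
                         return a.second > b.second;
                     });
    if (hits.size() > limit) hits.resize(limit);
    return hits;
}
```

the result vector is itself the "filename plus frequency" table the assignment asks for per search, so if your instructor insists on a raw 2d array you can copy the pairs into one before printing.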
a real solution needs to know whether files changed since the last time you ran, so you can keep what you already know and re-process only the files that were changed, added or removed since then. I think the problem is glossing over all this and just wants one run over a static folder. you may want to thread it so that you process many files at once, if the number of files is large.