[Problem] Extracting lines of text from file

Misenna (128)

Deleted

Last edited on

Enoizat (1343)

What is the meaning of "\\b" in there?

Suppose you are searching for the string "and": it makes the regex equal to "\band\b".
"\b" is interpreted as a word boundary, so it should help find whole words, but it fails on acronyms like C.I.A.

And icase? Meaning it is case insensitive, no matter the input? Fundamental/fundamental?

Yes, or at least that's what I've understood from the documentation. To be honest, it's the first time I've used C++ regex. Since I'm incredibly lazy, I was waiting for a prompt to take a look at it, and your post turned up at the right moment :-)

Chervil (7320)

"I feel that I introduced lots of redundancy" - you mean like posting the same thing ten or twenty times?

Don't know what happened, but maybe some of those redundant posts in this thread could be deleted - or at least edited to a single line.

Misenna (128)

Chervil, there were posting problems for the better part of the morning and nothing came through ... I reported the problem, and hope the excess posts will be removed. Or maybe it would help if you told me how to merge posts myself?

-

Enoizat, thank you! But, if it is case insensitive, then there is something wrong with the result. 'the', as in your post, it would be 23 not 22. Just saying. But I trust that in a few months I'll be ready to use it in one of my own programs. :-)

Chervil (7320)

Well, I think you can delete your own post only when it is the very last one in a thread. But there should be an EDIT button beneath each post, so you could leave the post in the thread, but remove its contents - just put something like "duplicate removed" or whatever makes sense.

So far I've not attempted to reply to any of the posts as I don't know if they are truly identical or if there are some differences.

Misenna (128)

The posts have indeed been different. Now only one remains; if you would still like to comment on it or leave a critique, you are very welcome to do so. :)

Enoizat (1343)

But, if it is case insensitive, then there is something wrong with the result. 'the', as in your post, it would be 23 not 22.

That's because my code was wrong :-(
Could you please give the following one a chance and give me any feedback in case you find other errors? Thanks a lot.

#include <fstream>
#include <iostream>
#include <limits>
#include <regex>
#include <string>
#include <utility>
#include <vector>


std::pair<int, std::vector<std::string>>
    fillWithMatches(std::ifstream& source, 
                    std::vector<std::string>& matches,
                    const std::string& searched);
void waitForEnter();


int main()
{
    bool again {false};
    do {
        std::cout << "Please give me the word to be found (no spaces!): ";
        std::string tobefound;
        std::cin >> tobefound;
        std::string filename("short.txt");
        std::ifstream infile(filename);
        std::vector<std::string> matches;
        auto result = fillWithMatches(infile, matches, tobefound);
        std::cout << "\nFound " << result.first << " matches in "
                  << matches.size() << " lines.\nDetails:\n";
        for(const auto& s : result.second) { std::cout << "--> " << s << '\n'; }
        infile.close();
        std::cout << "\nDo you want to perform another check [y, n]? ";
        char answer {'n'};
        std::cin >> answer;
        std::cin.ignore(1);
        if('y' == answer) { again = true; }
        else              { again = false; }
    } while(again);
    waitForEnter();
    return 0;
}


std::pair<int, std::vector<std::string>>
    fillWithMatches(std::ifstream& source, 
                    std::vector<std::string>& matches,
                    const std::string& searched)
{
    std::pair<int, std::vector<std::string>> result;
    std::string line;
    std::regex reg("\\b" + searched + "\\b", 
              std::regex_constants::ECMAScript | std::regex_constants::icase);
    while(std::getline(source, line)) {
        auto sbegin = std::sregex_iterator(line.begin(), line.end(), reg);
        auto send = std::sregex_iterator();
        if(0 < std::distance(sbegin, send)) {
            result.first += std::distance(sbegin, send);
            matches.push_back(line);
            for(std::sregex_iterator i{sbegin}; i!=send; ++i) {
                std::smatch sm = *i;
                std::cout << sm.prefix() << " --> " << sm.str() << '\n';
            }
        }
    }
    result.second = matches;
    return result;
}

void waitForEnter()
{
    std::cout << "\nPress ENTER to continue...\n";
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
}

Misenna (128)

Enoizat, I tried, and it seems to work perfectly. I tested it with two files, the one I offered originally, and a longer version of the same text. Both delivered correct results.

All I got is this one warning:
1>c:\startingout\jump\jump\sentencefilter.cpp(67): warning C4244: '+=': conversion from '__int64' to 'int', possible loss of data

std::pair<int, std::vector<std::string>> result;

(With the third text I tested, your program is off by 1 when searching for 'the'. That has absolutely nothing to do with your program ... I got the same result with my version. It must be something in the text itself that doesn't get caught by either of our programs.) If you wish to give it a shot ->

https://www.poetryfoundation.org/poems-and-poets/poems/detail/43997 << Part I

And here is the website I used while testing to confirm the results were indeed correct. http://www.writewords.org.uk/word_count.asp

(Ignoring cases like game -> game's -> games, which are counted separately by the website. :-)

Enoizat (1343)

warning C4244: '+=': conversion from '__int64' to 'int', possible loss of data

I can’t deny your (Microsoft) compiler has a point: std::distance is not guaranteed to return an int. It returns the iterator's difference_type, which for std::sregex_iterator is std::ptrdiff_t (reported as '__int64' in your build). So, when I add its value to the first element of my std::pair<int, std::vector<std::string>>, the compiler warns that the source type is wider than the destination, which means some data could possibly get lost.

I can’t reproduce your warning because my version of g++ doesn’t complain about that, but I can suggest a workaround, which is substituting a larger type for the int in std::pair<int, std::vector<std::string>>.
I think you should use std::ptrdiff_t, which is defined in <cstddef>.

I’ll be grateful if you let me know whether it leads to a zero errors, zero warnings program - which doesn’t mean it’s correct, of course :)

Misenna (128)

I did, and it works without any ~~errors~~ warnings now. :-) And, since I was unable to implement your suggested solution, what I did was change the following lines of code:


/* __int64 would also work; I tested it as well. What I do not know is whether
__int64 and int64_t are really platform independent or compiler specific. (For GCC I
 found that #include <inttypes.h> would be needed for either of the two to work.) */

10: std::pair<int64_t, std::vector<std::string>>
43: std::pair<int64_t, std::vector<std::string>>
48: std::pair<int64_t, std::vector<std::string>> result;

VS is no longer complaining and the results are 100% accurate. (Except for the one text I already mentioned ... This is still off by one above the expected result for 'the' (46 instead of 45). It must be something in the text, after all the tests I performed.) :-)
