What's a good way, performance-wise, to read in words from a text file while ignoring punctuation?

Hello, I have a text file which contains words and punctuation marks like ,.:"... I want to get each word from the text file and work on it. Which approach would perform fastest, given that I will be working on a big text file?

What I'm doing right now is defining a locale that treats all punctuation as whitespace, calling imbue on my ifstream, and then reading with the usual:
#include <fstream>
#include <locale>
#include <iomanip>
#include <string>
using namespace std;

//define my locale here

int main()
{
    string word;
    ifstream fin;
    //imbue here
    
    fin.open("text.txt");
    while(fin >> word)
    {
        //do stuffs with the word here
    }
    fin.close();
    return 0;
}


This solution does work (at least I haven't encountered any trouble yet), but is there a more efficient way to do it?
Thank you for your help
Read the entire line from file. File I/O is the slowest part of the operation.

Then strip out punctuation. You can collect words and strip punctuation at the same time, if you want.

Make sure to pre-allocate the resulting string (once, at the beginning of the culling) to avoid continual heap allocate-free cycles for every word.

int main()
{
    string line, word;
    ifstream fin;
    string wordchars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                       "abcdefghijklmnopqrstuvwxyz";

    ...

    word.reserve( 50 );
    while (getline( fin, line ))
    {
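        size_t n1, n2 = 0;
        while ((n1 = line.find_first_of( wordchars, n2 )) != string::npos)
        {
            // n2 is string::npos when the word runs to the end of the line
            n2 = line.find_first_not_of( wordchars, n1 );
            // assign() copies into word's existing buffer; word = line.substr(...)
            // would build a temporary string for every word
            word.assign( line, n1, n2-n1 );

            // do stuff with the word here
        }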
    }
}

Is there a time issue with your locale-imbued version?
How does it stack against the non-locale-imbued version here?

(It should have similar performance characteristics, methinks...)
Thank you mate, I did a test with my text file (roughly 500,000 words, I think), and the performance of the solution you suggested does seem similar to my own. But there is something I'm wondering about: getline stops at a newline character, so will it cause performance problems when working with paragraphs that are very long?
And are there any better ways, since I need my program to run at a decent speed on my 10-year-old computer D:
No. Strings don't release memory unless you tell them to, so the string you pass to getline keeps its buffer from one line to the next. Unless your paragraphs are carefully designed to cause a reallocation every time (typically if they double in size each time), it shouldn't cause significant problems there.
> What I'm doing right now is defining a locale that treats all punctuations as white space and use imbue ...

Yes, that is the right idea.

Memory mapping the file, and then using a (deprecated) std::istrstream to get the desired std::istream interface over the mapped bytes in memory would typically improve performance.

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <iostream>
#include <fstream>
#include <strstream> // deprecated
#include <cctype>
#include <locale>
#include <memory> // std::addressof
#include <vector>
#include <string>
#include <ctime>

int main()
{
    const char* const path = "/tmp/test.txt" ;
    
    {
        // create a big file with lots of punctuation (html)
        std::ofstream file(path) ;
        for( int i = 0 ; i < 25 ; ++i ) file << std::ifstream( "/usr/share/doc/bc/bc.html" ).rdbuf() ;
    }
    
        
    const auto start = std::clock() ; // *** start timer

    using namespace boost::interprocess ;
    file_mapping mapping( path, read_only ) ;
    mapped_region region( mapping, read_only) ;
    const std::size_t nbytes  = region.get_size() ;
    const char* const address = static_cast<  const char* >( region.get_address() ) ;
    
    std::istrstream stm( address, nbytes ) ; // deprecated
    
    // This ctype facet classifies all punctuations too as whitespace
    struct punct_too_is_ws : std::ctype<char>
    {
        static const mask* classification_table()
        {
            // start with the classic table ( C locale's table )
            static std::vector<mask> table( classic_table(),  classic_table() + table_size ) ;

            // all punctuation is to be treated as whitespace
            for( std::size_t i = 0 ; i < table_size ; ++i ) if( std::ispunct(i) ) table[i] = space ;

            return std::addressof( table.front() ) ;
        }

        // do not delete table, initial reference count == 0
        punct_too_is_ws() : std::ctype<char>( classification_table() ) {}
    };
    
    stm.imbue( std::locale( stm.getloc(), new punct_too_is_ws ) ) ;
    
    std::string str ;
    std::size_t cnt = 0 ;
    while( stm >> str ) ++cnt ;

    const auto end = std::clock() ; // *** end timer
    
    std::cout << cnt << " words were read in " << double(end-start)*1000 / CLOCKS_PER_SEC << " milliseconds.\n" ;
}

g++-4.9 -std=c++11 -O3 -Wall -Wextra -pedantic-errors -Wno-deprecated main.cpp && ./a.out
340075 words were read in 40 milliseconds.

http://coliru.stacked-crooked.com/a/1855b264cf4c4318
Wow, thanks a bunch, JLBorges. But I'm still a student, so I can't use boost for my projects/assignments right now. Is there an alternative using only standard C++, or maybe a Windows-specific one that works with Microsoft Visual Studio 2010?
#include <iostream>
#include <fstream>
#include <strstream> // deprecated
#include <cctype>
#include <locale>
#include <vector>
#include <string>
#include <ctime>
#include <windows.h>
#include <cassert>
#include <memory> // std::addressof

int main()
{
    const char* const path = "test.txt" ;

    {
        std::ofstream file(path) ;
        for( int i = 0 ; i < 1000 ; ++i ) file << std::ifstream( __FILE__ ).rdbuf() ;
    }


    const auto start = std::clock() ; // *** start timer

    // using namespace boost::interprocess ;
    // file_mapping mapping( path, read_only ) ;
    // http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx
    HANDLE file = CreateFile( path, GENERIC_READ, 0, 0, OPEN_EXISTING,
                              FILE_FLAG_SEQUENTIAL_SCAN | FILE_FLAG_DELETE_ON_CLOSE, 0 ) ;
    assert( file != INVALID_HANDLE_VALUE ) ;

    // mapped_region region( mapping, read_only) ;
    // http://msdn.microsoft.com/en-us/library/windows/desktop/aa366537(v=vs.85).aspx
    HANDLE mapping = CreateFileMapping( file, 0, PAGE_READONLY, 0, 0, 0 ) ;
    assert(mapping) ;
    // http://msdn.microsoft.com/en-us/library/windows/desktop/aa366761(v=vs.85).aspx
    const char* const address = static_cast<  const char* >( MapViewOfFile( mapping, FILE_MAP_READ, 0, 0, 0 ) ) ;
    assert(address) ;

    // const std::size_t nbytes  = region.get_size() ;
    // http://msdn.microsoft.com/en-us/library/windows/desktop/aa364955(v=vs.85).aspx
    const DWORD nbytes = GetFileSize( file, 0 ) ;

    std::istrstream stm( address, nbytes ) ; // deprecated

    // This ctype facet classifies all punctuations too as whitespace
    struct punct_too_is_ws : std::ctype<char>
    {
        static const mask* classification_table()
        {
            // start with the classic table ( C locale's table )
            static std::vector<mask> table( classic_table(),  classic_table() + table_size ) ;

            // all punctuation is to be treated as whitespace
            for( std::size_t i = 0 ; i < table_size ; ++i ) if( std::ispunct(i) ) table[i] = space ;

            return std::addressof( table.front() ) ;
        }

        // do not delete table, initial reference count == 0
        punct_too_is_ws() : std::ctype<char>( classification_table() ) {}
    };

    stm.imbue( std::locale( stm.getloc(), new punct_too_is_ws ) ) ;

    std::string str ;
    std::size_t cnt = 0 ;
    while( stm >> str ) ++cnt ;

    const auto end = std::clock() ; // *** end timer

    std::cout << cnt << " words were read in " << double(end-start)*1000 / CLOCKS_PER_SEC << " milliseconds.\n" ;
    
    // 340000 words were read in 109 milliseconds.

    // UnmapViewOfFile, CloseHandle, CloseHandle ...
}

http://rextester.com/ILF60290
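If neither boost nor the Windows API is allowed, a rough standard-library-only sketch could read the whole file into a std::string in one go and scan it by hand. Note that this swaps the memory mapping and the imbued stream for a plain buffer and a character loop. read_whole_file, is_separator and for_each_word are hypothetical names, not from any library, and this has not been timed against the versions above.

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: slurp the whole file into memory with a single read.
// Returns an empty string if the file cannot be opened.
std::string read_whole_file( const char* path )
{
    std::ifstream file( path, std::ios::binary ) ;
    std::string text ;
    if( !file ) return text ;

    file.seekg( 0, std::ios::end ) ;
    const std::streamoff size = file.tellg() ;
    if( size <= 0 ) return text ;

    file.seekg( 0, std::ios::beg ) ;
    text.resize( static_cast<std::size_t>(size) ) ;
    file.read( &text[0], size ) ;
    text.resize( static_cast<std::size_t>( file.gcount() ) ) ; // keep only what was read
    return text ;
}

// Whitespace and punctuation both separate words (same rule as the facet above).
inline bool is_separator( char c )
{
    const unsigned char uc = static_cast<unsigned char>(c) ;
    return std::isspace(uc) || std::ispunct(uc) ;
}

// Hypothetical helper: calls fn(word) for every maximal run of non-separator
// characters in text and returns the number of words found.
template < typename FN >
std::size_t for_each_word( const std::string& text, FN fn )
{
    std::size_t cnt = 0 ;
    std::string word ;
    word.reserve(64) ; // one allocation up front; assign() reuses the buffer

    for( std::size_t i = 0 ; i < text.size() ; )
    {
        while( i < text.size() && is_separator( text[i] ) ) ++i ;  // skip separators
        const std::size_t begin = i ;
        while( i < text.size() && !is_separator( text[i] ) ) ++i ; // scan one word

        if( i > begin )
        {
            word.assign( text, begin, i - begin ) ;
            fn(word) ;
            ++cnt ;
        }
    }
    return cnt ;
}
```

A call would look like for_each_word( read_whole_file( "test.txt" ), some_functor ), where some_functor does the per-word work. Whether this beats the mapped-file version on a given machine is something to measure rather than assume.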
Thank you very much, you not only helped me with my problem but also taught me something new.
Still, there is something I'm not quite clear about regarding the CreateFile call at line 28. When I copied your code into Visual Studio 2010, it raised an error on the first argument saying "argument of type const char* is incompatible with parameter of type LPCWSTR". Surprisingly, if I use a std::wstring wsPath and pass wsPath.c_str() to the function, everything works fine. How could this happen? I also tried passing a std::string and the result of std::string::c_str(), but neither worked.
The Windows API (most of it) is designed to support strings encoded either in the current 8-bit ("ANSI") code page or in UTF-16; so functions with null-terminated strings as parameters are available in pairs.

CreateFileA() is the narrow ("ANSI") version of the function (takes the path as const char*)
and CreateFileW() is its UTF-16 counterpart (takes the path as const wchar_t*)

CreateFile is #define d to be either CreateFileA or CreateFileW based on the presence of a manifest preprocessor constant (UNICODE; its companion _UNICODE does the same job for the C runtime's <tchar.h> names).

New projects generated by Visual Studio (2010 included) define UNICODE and _UNICODE by default; so CreateFile() in the code became CreateFileW(), which is why it rejected a const char* and accepted wsPath.c_str().
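The macro switch can be mimicked in a few portable lines. This is only a sketch of the mechanism: open_file_A, open_file_W and open_file are hypothetical stand-ins for CreateFileA, CreateFileW and CreateFile, and they merely report which variant was picked.

```cpp
#include <string>

// Hypothetical stand-ins for a pair such as CreateFileA / CreateFileW;
// they only report which variant was selected.
inline std::string open_file_A( const char* )    { return "narrow (A) variant" ; }
inline std::string open_file_W( const wchar_t* ) { return "wide (W) variant" ; }

// Same idea as in <windows.h>: one macro name, two possible targets,
// chosen by whether UNICODE is defined when the header is processed.
#ifdef UNICODE
    #define open_file open_file_W
#else
    #define open_file open_file_A
#endif
```

Built without UNICODE, open_file( "test.txt" ) resolves to the narrow variant. Built with UNICODE, the same call no longer compiles because a const char* is handed to a function expecting const wchar_t*, which is the error reported above. In the real program, calling CreateFileA( path, ... ) explicitly (or passing a wide string to CreateFileW) avoids depending on the macro.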