Get All Hyperlinks

Thank you very much. I am receiving an error for the following statement:
apply_filter( std::begin(hlinks), std::end(hlinks), begins_with_forum ) ;



My goal is then to manipulate the specific URLs that were pulled, taking the end of each URL and concatenating it onto another URL:

http://www.cplusplus.com/forum/beginner,

becomes...

http://www.cplusplus.com/tutorial/beginner,

and then plug each new URL back into a function to pull data from those webpages - specifically, the data is sports statistics.

Will I need to use libcurl to get the statistics from the different tables within the pages, or is there a Windows function for this as well?
> I am receiving an error for the following statement:

Are you using Visual Studio 2015?
What is the text of the error diagnostic?
If you have made modifications to the sample code, post the code too.


> http://www.cplusplus.com/forum/beginner,
> becomes...
> http://www.cplusplus.com/tutorial/beginner,

#include <iostream>
#include <string>

// replace the segment just before the last one,
// e.g. ".../forum/beginner" becomes ".../tutorial/beginner"
std::string replace_segment_before_last( std::string url, std::string new_segment )
{
    if( url.size() < 2 ) return url ;

    // locate the position of the last '/' (except a possible ending '/')
    const auto last = url.rfind( '/', url.size() - 2 ) ;
    if( last == std::string::npos || last == 0 ) return url ;

    // locate the '/' that begins the segment before it
    const auto prev = url.rfind( '/', last - 1 ) ;
    if( prev == std::string::npos ) return url ;

    return url.substr( 0, prev ) + new_segment + url.substr(last) ;
}

int main()
{
    const std::string url = "http://www.cplusplus.com/forum/beginner" ;
    const std::string new_segment = "/tutorial" ;

    std::cout << url << '\n' << replace_segment_before_last( url, new_segment ) << '\n' ;
}

http://rextester.com/SJSP78833


> Will I need to use libcurl then to get the statistics from different tables within the pages
> or is there a windows function for this as well?

libcurl is not required. download_file will download the web page for any valid URL.
#include <string>
#include <urlmon.h> // standard windows header
#pragma comment( lib, "urlmon" ) // link with urlmon.lib

// URLDownloadToFileA - WinAPI function
bool download_file( std::string url, std::string path )
{ return URLDownloadToFileA( nullptr, url.c_str(), path.c_str(), 0, nullptr ) == S_OK ; }
I am using Visual Studio 2013 - the free version for students.
The version of the C++ compiler shipped with Visual Studio 2013 lacks some of the C++ features used in the snippet. Consider upgrading to Visual Studio 2015.
(The Community edition is free; explicitly select C++ during the installation.)
http://www.microsoft.com/en-us/download/details.aspx?id=48146

If my memory serves me right, with these modifications, the code should compile cleanly with Visual Studio 2013.
(Not tested with Visual Studio 2013).

#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <set>
#include <regex>
#include <urlmon.h> // standard windows header 
#pragma comment( lib, "urlmon" ) // standard windows library

// URLDownloadToFileA - WinAPI function
bool download_file( std::string url, std::string path )
{ return URLDownloadToFileA( nullptr, url.c_str(), path.c_str(), 0, nullptr ) == S_OK ; }

std::string file_to_string( std::string file_name )
{
    std::ifstream file(file_name) ;

    std::istreambuf_iterator<char> begin(file) ;
    std::istreambuf_iterator<char> end ;
    return std::string( begin, end ) ;
}

std::set<std::string> extract_hyperlinks( std::string html_file_name )
{
    static const std::regex hl_regex( "<a href=\"(.*?)\">", std::regex_constants::icase ) ;

    const std::string text = file_to_string(html_file_name) ;

    std::sregex_token_iterator begin( text.begin(), text.end(), hl_regex, 1 );
    std::sregex_token_iterator end ;
    return std::set<std::string>( begin, end ) ;
}

// explicit return type (C++14 return type deduction is not available in Visual Studio 2013)
template < typename ITERATOR, typename CALLABLE >
std::set< typename std::iterator_traits<ITERATOR>::value_type >
apply_filter( ITERATOR begin, ITERATOR end, CALLABLE filter )
{
    std::set< typename std::iterator_traits<ITERATOR>::value_type > result ;
    for( ; begin != end ; ++begin ) if( filter( *begin ) ) result.insert( *begin ) ;
    return result ;
}


int main()
{
    const std::string url = "http://www.cplusplus.com" ; // adjust as required 
    const std::string path = "cplusplus.com.html" ; // adjust as required
    const auto begins_with_forum = [] ( std::string str ) { return str.find( "/forum/" ) == 0 ; }; // adjust as required

    if( download_file( url, path ) )
    {
        const auto hlinks = extract_hyperlinks( path ) ;
        const auto filtered_hlinks = apply_filter( hlinks.begin(), hlinks.end(), begins_with_forum ) ;
        for( auto iter = filtered_hlinks.begin() ; iter != filtered_hlinks.end() ; ++iter ) std::cout << url + *iter << '\n' ;
    }
}

http://rextester.com/SQKXU38855