Get All Hyperlinks

Hello,

Is it possible to get the Hyperlinks from a specific webpage?
Sure it is possible.
Under Windows you can use the InternetReadFile function to download the file and try to use regular expressions to find the links.
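For example, a minimal WinINet sketch (error handling omitted; fetch_url is just an illustrative name):

#include <windows.h>
#include <wininet.h>
#include <string>
#pragma comment( lib, "wininet" ) // WinINet import library

// download the document at url into a string with InternetReadFile
std::string fetch_url( const std::string& url )
{
    std::string result ;
    const HINTERNET session = InternetOpenA( "fetch_url", INTERNET_OPEN_TYPE_PRECONFIG, nullptr, nullptr, 0 ) ;
    if( session )
    {
        const HINTERNET request = InternetOpenUrlA( session, url.c_str(), nullptr, 0, INTERNET_FLAG_RELOAD, 0 ) ;
        if( request )
        {
            char buffer[4096] ;
            DWORD bytes_read = 0 ;
            while( InternetReadFile( request, buffer, sizeof(buffer), &bytes_read ) && bytes_read > 0 )
                result.append( buffer, bytes_read ) ;
            InternetCloseHandle( request ) ;
        }
        InternetCloseHandle( session ) ;
    }
    return result ;
}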
With C++/CLI you can use the WebBrowser control.
I am using C++. Is the WebBrowser control under a special library?
You'd need to use C++/CLI. If you use Visual Studio, you won't need anything else.
Additional definitions needed to make the question answerable:

What does "get" mean?
In what form does your program have the source webpage?
Sure! What I mean by "get" is to extract all the hyperlinks (around 90) from a webpage and put them in an array. Then I will trim them down and concatenate each of them with another URL to load 90 pages of statistics, putting each of those pages into another array.

I have done this already with VBA and it works great, but it is slow. I'm hoping to replicate what I did in C++, but it doesn't seem as straightforward; am I right in saying that?

I am still in the learning phase of C++, and this question was geared toward finding out what I need to learn.

Additionally, I could combine what I have done in Excel with C++, if you think that might be an easier option?

I have taken a look at libcurl and it seems complex.
So you have a URL, for example:

http://www.cplusplus.com/forum/general/

and you want to feed that to a program, and the program will then fetch that webpage, examine it, identify anything in that webpage that is a hyperlink (for example, http://www.cplusplus.com/articles/ ), and give you all those hyperlinks in a big text array of strings?
Yes exactly.
Yes, libcurl is complex and probably not meant for beginners.

What are you going to do now?
What do you recommend? How long would it take to learn libcurl? How expensive is it to hire someone to do this?
> How long would it take to learn libcurl?

Downloading an HTML page using libcurl is easy; learning that would take a few minutes.
http://stackoverflow.com/a/21573625
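
In outline, the linked answer amounts to something like this (a sketch using libcurl's easy interface; write_chunk and curl_fetch are my own names):

#include <curl/curl.h>
#include <string>

// libcurl hands the body to us in chunks; append each chunk to a std::string
static size_t write_chunk( char* data, size_t size, size_t nmemb, void* userp )
{
    static_cast<std::string*>(userp)->append( data, size * nmemb ) ;
    return size * nmemb ;
}

std::string curl_fetch( const std::string& url )
{
    // in a real program, call curl_global_init() once at startup
    std::string body ;
    CURL* curl = curl_easy_init() ;
    if( curl )
    {
        curl_easy_setopt( curl, CURLOPT_URL, url.c_str() ) ;
        curl_easy_setopt( curl, CURLOPT_FOLLOWLOCATION, 1L ) ; // follow redirects
        curl_easy_setopt( curl, CURLOPT_WRITEFUNCTION, write_chunk ) ;
        curl_easy_setopt( curl, CURLOPT_WRITEDATA, &body ) ;
        curl_easy_perform( curl ) ;
        curl_easy_cleanup( curl ) ;
    }
    return body ;
}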

Once the file is downloaded, parse it to extract the hyperlinks.
#include <fstream>
#include <string>
#include <iterator>
#include <set>
#include <regex>

// read the entire file into a single string
std::string file_to_string( std::string file_name )
{
    std::ifstream file(file_name) ;
    return { std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>{} } ;
}

// return every captured href value; a std::set keeps them sorted and de-duplicated
std::set<std::string> extract_hyperlinks( std::string html_file_name )
{
    static const std::regex hl_regex( "<a href=\"(.*?)\">", std::regex_constants::icase  ) ;

    const std::string text = file_to_string(html_file_name) ;

    // iterate over sub-match 1 (the capture group) of each match in the text
    return { std::sregex_token_iterator( text.begin(), text.end(), hl_regex, 1 ),
             std::sregex_token_iterator{} } ;
}

http://coliru.stacked-crooked.com/a/6a29eb0b75116d22
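One caveat: that pattern only matches tags of the exact form <a href="...">, so an anchor with additional attributes (say, a title) captures too much; you can see this in the sample output further down (the entry /" title="cplusplus.com). A slightly more tolerant pattern, still a heuristic rather than a real HTML parser, might be:

// capture the quoted href value itself, even when other attributes follow
static const std::regex hl_regex( "<a\\s[^>]*href=\"([^\"]*)\"", std::regex_constants::icase ) ;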
Thanks for your help and for the code. I tried installing libcurl but got stuck once the folder was downloaded. Could you point me in the right direction on how to install it into Visual Studio? I spent about an hour searching for step-by-step instructions on the web and YouTube, but nothing is seamless.
Visual Studio has everything you need - no need for anything else.
https://msdn.microsoft.com/en-us/library/ms775123%28v=vs.85%29.aspx
More accurately, it's the Windows API that has everything you need, rather than Visual Studio itself. You can use this function without having Visual Studio. I recall it comes as part of the standard Windows SDK, which is installed alongside VS but is certainly available independently.
Microsoft C++:
#include <urlmon.h>
#pragma comment( lib, "urlmon" )

bool download_file( std::string url, std::string path )
{ return URLDownloadToFileA( nullptr, url.c_str(), path.c_str(), 0, nullptr ) == S_OK ; }
Guys, are these snippets of code able to be copied and run? Do I need to add any special headers (urlmon) for it to run?
I know we are cplusplus.com, but I might suggest using Python and BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# parse only the <a> tags, then print each href attribute
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])


http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup
> are these snippets of code able to be copied and run?

Yes. No third-party library is required. The standard C++ library and a standard Windows library are all that is needed.

A complete program:

//Microsoft (R) C/C++ Optimizing Compiler Version 19.00.23026 for x86

#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <set>
#include <regex>
#include <urlmon.h> // standard windows header 
#pragma comment( lib, "urlmon" ) // standard windows library

// URLDownloadToFileA - WinAPI function
bool download_file( std::string url, std::string path )
{ return URLDownloadToFileA( nullptr, url.c_str(), path.c_str(), 0, nullptr ) == S_OK ; }

std::string file_to_string( std::string file_name )
{
    std::ifstream file(file_name) ;
    return { std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>{} } ;
}

std::set<std::string> extract_hyperlinks( std::string html_file_name )
{
    static const std::regex hl_regex( "<a href=\"(.*?)\">", std::regex_constants::icase  ) ;

    const std::string text = file_to_string(html_file_name) ;

    return { std::sregex_token_iterator( text.begin(), text.end(), hl_regex, 1 ),
             std::sregex_token_iterator{} } ;
}

int main()
{
    const std::string url = "http://www.cplusplus.com/" ; // adjust as required
    const std::string path = "cplusplus.com.html" ; // adjust as required
    if( download_file( url, path ) )
       for( std::string hlink : extract_hyperlinks(path) ) std::cout << hlink << '\n' ;
}

/
/" title="cplusplus.com
/articles/
/articles/algorithms/
/articles/cpp11/
/articles/standard_library/
/articles/winapi/
/contact.do?referrer=www.cplusplus.com%2F
/doc/
/doc/tutorial/
/doc/tutorial/classes/
/doc/tutorial/functions/
/doc/tutorial/pointers/
/doc/tutorial/templates/
/forum/
/forum/beginner/
/forum/general/
/forum/unices/
/forum/windows/
/info/
/info/description/
/info/faq/
/info/history/
/privacy.do
/reference/
/reference/clibrary/
/reference/iostream/
/reference/stl/
/reference/string/
/search.do
http://fb.com/cplusplus.com
http://google.com/+cplusplus
http://twitter.com/cpluspluscom

http://rextester.com/TCVU34588
Thank you, that's outstanding. I thought I would need libcurl.

If I just wanted the ones that started with /forum/... Is there a way to filter?


Also, is there a reason why you don't just do using namespace std;?
> If I just wanted the ones that started with /forum/... Is there a way to filter?

One way would be to modify the regular expression to "<a href=\"(/forum/.*?)\">"
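That is, only the regex line inside extract_hyperlinks changes:

// keep only hrefs that begin with /forum/
static const std::regex hl_regex( "<a href=\"(/forum/.*?)\">", std::regex_constants::icase ) ;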
Perhaps it would be more flexible to write a function which applies a filter to a sequence of values.

#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <set>
#include <regex>
#include <urlmon.h> // standard windows header 
#pragma comment( lib, "urlmon" ) // standard windows library

// URLDownloadToFileA - WinAPI function
bool download_file( std::string url, std::string path )
{ return URLDownloadToFileA( nullptr, url.c_str(), path.c_str(), 0, nullptr ) == S_OK ; }

std::string file_to_string( std::string file_name )
{
    std::ifstream file(file_name) ;
    return { std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>{} } ;
}

std::set<std::string> extract_hyperlinks( std::string html_file_name )
{
    static const std::regex hl_regex( "<a href=\"(.*?)\">", std::regex_constants::icase  ) ;

    const std::string text = file_to_string(html_file_name) ;

    return { std::sregex_token_iterator( text.begin(), text.end(), hl_regex, 1 ),
             std::sregex_token_iterator{} } ;
}

template < typename ITERATOR, typename CALLABLE > auto apply_filter( ITERATOR begin, ITERATOR end, CALLABLE filter )
{
    std::set< typename std::iterator_traits<ITERATOR>::value_type > result ;
    for( ; begin != end ; ++begin ) if( filter(*begin) ) result.insert(*begin) ;
    return result ; 
}
    

int main()
{
    const std::string url = "http://www.cplusplus.com" ; // adjust as required 
    const std::string path = "cplusplus.com.html" ; // adjust as required
    const auto begins_with_forum = [] ( std::string str ) { return str.find( "/forum/" ) == 0 ; }; // adjust as required
    
    if( download_file( url, path ) )
    {
        const auto hlinks = extract_hyperlinks(path) ;
        const auto filtered_hlinks = apply_filter( std::begin(hlinks), std::end(hlinks), begins_with_forum ) ;
        for( std::string hlink : filtered_hlinks ) std::cout << url + hlink << '\n' ; 
    }
}


http://rextester.com/VDT63860
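
And to tie this back to your original goal (loading each of the ~90 statistics pages into an array), a rough follow-on sketch reusing download_file() and file_to_string() from the program above (fetch_pages and the file-naming scheme are my own):

#include <vector>

// download every filtered link and keep each page's text in a vector
std::vector<std::string> fetch_pages( const std::set<std::string>& hlinks, const std::string& base_url )
{
    std::vector<std::string> pages ;
    int n = 0 ;
    for( const std::string& hlink : hlinks )
    {
        const std::string path = "page_" + std::to_string(n++) + ".html" ;
        if( download_file( base_url + hlink, path ) ) pages.push_back( file_to_string(path) ) ;
    }
    return pages ;
}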


> is there a reason why you do not just do using namespace std;?

See http://www.cplusplus.com/forum/general/72248/#msg385442