cURL and c_str() weirdness

Oct 13, 2013 at 6:30am
Hello all, please help out a Biologist:

I am trying to parse the following URL using the cURL library:

www.ncbi.nlm.nih.gov/nucleotide/? term = Anthoxanthum[organism] AND 2003/7/25:2005/12/27[Publication Date]&format=text

but cURL returns XML (the default), not the text format I asked for.

I'm using this code line:
curl_easy_setopt(curl, CURLOPT_URL, URL.c_str());

Here's the weird thing: when I replace the "URL.c_str()" above with the actual text of the web search I want to do, it works fine. Also, if I paste in the URL from fout<<URL, that works fine in a browser.

Seems to me it's a c_str() problem, maybe the "&" or "="? but I can't figure it out, so I turn to the cplusplus forum for their usual wisdom.
Oct 13, 2013 at 6:37am
closed account (Dy7SLyTq)
are you sure that URL holds the correct value?
Oct 13, 2013 at 6:49am
The URL works fine (pasted into browser), also works fine if I output the string to a text file and paste that.

The returned cURL data is in the desired text form when I explicitly define the URL:
curl_easy_setopt(curl, CURLOPT_URL, "www.ncbi.nlm.nih.gov/nucleotide/? term = Anthoxanthum[organism] AND 2003/7/25:2005/12/27[Publication Date]&format=text")

but if I say:

URL = "www.ncbi.nlm.nih.gov/nucleotide/? term = Anthoxanthum[organism] AND 2003/7/25:2005/12/27[Publication Date]&format=text";
curl_easy_setopt(curl, CURLOPT_URL, URL.c_str());

it doesn't.
Oct 13, 2013 at 8:01am
Oct 13, 2013 at 9:48am
Is the string URL still in scope when you call curl_easy_perform (or whatever)??

(A string literal is stored in the const segment of your exe, so it will never be deallocated. But a string will be destroyed as soon as it goes out of scope, invalidating the (const) char* returned by c_str().)
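
The lifetime rule described above can be illustrated without libcurl. In this minimal sketch, FakeHandle, fake_set_url and Request are hypothetical names, standing in for any C-style API that stores the pointer it is given instead of copying the characters. Keeping the std::string as a member of the object that owns the handle means the pointer stays valid for as long as the handle is in use. (As far as I know, libcurl 7.17.0 and later copy string options such as CURLOPT_URL, so this concern would mainly apply to older versions.)

```cpp
#include <cstring>
#include <string>
#include <utility>

// Hypothetical stand-in for a C-style handle that keeps the caller's pointer.
struct FakeHandle {
    const char* url = nullptr;
};

// Hypothetical setter: stores the pointer, does NOT copy the characters.
void fake_set_url(FakeHandle& handle, const char* url) {
    handle.url = url;
}

// Unsafe pattern (left commented out on purpose): the local string is
// destroyed when the function returns, so the stored pointer would dangle.
//
//   void configure(FakeHandle& handle) {
//       std::string url = "http://example.com/";
//       fake_set_url(handle, url.c_str());
//   }   // url is destroyed here

// Safer pattern: the string lives exactly as long as the handle that uses it.
class Request {
public:
    explicit Request(std::string url) : url_(std::move(url)) {
        fake_set_url(handle_, url_.c_str());
    }

    // Copying or moving would leave handle_ pointing at another object's buffer.
    Request(const Request&) = delete;
    Request& operator=(const Request&) = delete;

    const char* stored_url() const { return handle_.url; }

private:
    std::string url_;    // declared before handle_, so it is constructed first
    FakeHandle handle_;
};
```

Constructing a Request with "http://example.com/" and reading stored_url() at any later point while the object is alive yields the same text, because the buffer and the pointer share one owner.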

Andy

PS Not directly related to the c_str()/char* problem, but the documentation for CURLOPT_URL does say you should specify the scheme (e.g. http://, ftp://, ldap://, ...) as part of the URL.

CURLOPT_URL

Pass in a pointer to the actual URL to deal with. The parameter should be a char * to a zero terminated string which must be URL-encoded in the following format:

scheme://host:port/path

http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPTURL
Last edited on Oct 13, 2013 at 10:04am
Oct 13, 2013 at 12:35pm
That was it, naraku933!!

I played around with your answers and it looks like it's only the spaces (%20) that matter; I can leave the brackets in as [] not %5B... %5D.

Thanks for your help, I'd never have found that on my own.
Oct 13, 2013 at 4:08pm
The recommended way is to call curl_easy_escape() on the initial string, so that libcurl does the encoding for you correctly.
http://curl.haxx.se/libcurl/c/curl_easy_escape.html
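
Two details about escaping are easy to miss. curl_easy_escape() does not modify its input: it returns a newly allocated encoded copy that has to be released with curl_free(). It is also meant for a single component, such as the search term, because applying it to a whole URL would encode the "?", "=" and "&" that give the query its structure. The sketch below mimics that behaviour with a hypothetical standalone helper, percent_encode, so that it runs without libcurl; build_search_url is likewise an illustrative name, and the http:// scheme is an assumption based on the CURLOPT_URL note earlier in the thread.

```cpp
#include <string>

// Hypothetical helper: percent-encode one URL component (e.g. a query value).
// Unreserved characters (letters, digits, '-', '.', '_', '~') pass through;
// everything else, including spaces and brackets, becomes %XX.
std::string percent_encode(const std::string& component) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(component.size());
    for (unsigned char c : component) {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Builds the search URL from the first post. Only the term value is encoded,
// so the '?', '=' and '&' that structure the query are left intact.
std::string build_search_url(const std::string& term) {
    return "http://www.ncbi.nlm.nih.gov/nucleotide/?term=" +
           percent_encode(term) + "&format=text";
}
```

The result of build_search_url(...) is what would be passed, via c_str(), to CURLOPT_URL. Encoding the brackets as well as the spaces is stricter than the server appeared to require, but it keeps the helper general.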
Oct 13, 2013 at 11:57pm
modoran,

I tried your suggestion, but can't get it to work in a similar application.

Neither does the suggestion to "hard encode" the characters with %hex format.

Here's what I have:
for (int j=0; j<number_of_ids; ++j)
    {
        working = "www.ncbi.nlm.nih.gov/nuccore/" + genbank_id[j] +"?report=fasta&format=text";
        URL.push_back(working);
        fout<<URL[j]<<endl;
    }
    cout<<"URL vector populated"<<endl;

    //obtain url as FASTA
    for (int j = 0; j<(int)URL.size(); ++j)
    {
        CURL *curl;
        CURLcode res;
        string readBuffer;
        curl = curl_easy_init();
        if(curl)
        {
            curl_easy_escape(curl, URL[j].c_str(),0);
            fout<<URL[j].c_str()<<endl;
            curl_easy_setopt(curl, CURLOPT_URL, URL[j].c_str());
            curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L); //follow redirection
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);

            // Perform the request, res will get the return code
            res = curl_easy_perform(curl);

            // Check for errors
            if(res != CURLE_OK)
            {
                fprintf(stderr, "curl_easy_perform() failed: %s\n",
                curl_easy_strerror(res));
            }
            curl_easy_cleanup(curl);
            //curl_free (curl);
        }
        //Populate FASTA vector from readBuffer
        FASTA.push_back(readBuffer);
        DNA.push_back(readBuffer);
    }
    cout<<"FASTA strings populated into FASTA and DNA vector"<<endl;


I know the URL is good because when I paste this into a browser, I get my data (where FJ817486 is the first GenBank ID):
http://www.ncbi.nlm.nih.gov/nuccore/FJ817486?report=fasta&format=text

Any suggestions?
Oct 14, 2013 at 2:51am
Are you handling JavaScript somehow? I get nothing back but the main page, with a warning that the site requires JavaScript:
<strong>Warning:</strong>
The NCBI web site requires JavaScript to function.
<a href="http://www.ncbi.nlm.nih.gov/corehtml/query/static/unsupported-browser.html#enablejs" title="Learn how to enable JavaScript" target="_blank">more...</a>


http://curl.haxx.se/docs/faq.html#Does_curl_support_Javascript_or
Oct 14, 2013 at 3:24am
norm,

The webpage I'm trying to access is just straight-up old-fashioned text. I don't think it's a JavaScript issue, because I got the first request to work, but this one won't work even if I explicitly encode the URL.

Do you know how I can check the URL "sent" by libcurl?
Oct 14, 2013 at 3:26am
Let me clarify: this is working (sorta) because I get xml format back. The part that isn't working is the "&format=text" bit.
Oct 14, 2013 at 4:05am
Not being a biologist, I'm probably using incorrect terminology but the data that you want is the genome sequence(?) as shown by this link, right?: http://www.ncbi.nlm.nih.gov/nuccore/FJ817486?report=fasta&format=text

Have you inspected the response that you got back? I used your code and got xml back but that sequence data(?) is not included. I could be wrong, but the page that you want is probably a dynamic web page generated by javascript and thus not possible to retrieve with libcurl.

Edit: typo
Last edited on Oct 14, 2013 at 4:09am
Oct 14, 2013 at 4:17am
norm,

The weblink you post is the sequence I'm trying for, and you're also right in that the xml doesn't seem to include that sequence, or I would just carve it up and get what I wanted.

If I "Inspect Element" on the page you linked, I see this:
<script type="text/javascript" src="/portal/js/portal.js?v3%2E5%2E1%2Er392364%3A+Mon%2C+Mar+25+2013+15%3A07%3A09"></script>

which supports your JavaScript idea.

So this is just a case of "you can't get there from here"?

Thanks for your help in any case.
Oct 14, 2013 at 4:22am
So this is just a case of "you can't get there from here"?

Seems that way.

Do they not provide an API?

EDIT:
Here you go: http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Downloading_Full_Records

curl_easy_setopt(curl, CURLOPT_URL, "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=FJ817486&rettype=fasta&retmode=text");
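
For the loop over GenBank IDs shown earlier in the thread, the same efetch URL pattern can be generated per accession. This is only a sketch: efetch_fasta_url and efetch_fasta_urls are hypothetical helper names, and it assumes the accessions contain nothing that needs URL-encoding (plain letters, digits and dots).

```cpp
#include <string>
#include <vector>

// Hypothetical helper: build an E-utilities efetch URL for one accession,
// following the pattern of the URL quoted above.
std::string efetch_fasta_url(const std::string& accession) {
    return "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
           "?db=nuccore&id=" + accession + "&rettype=fasta&retmode=text";
}

// Hypothetical helper: one URL per accession, in the same order as the input.
std::vector<std::string> efetch_fasta_urls(const std::vector<std::string>& ids) {
    std::vector<std::string> urls;
    urls.reserve(ids.size());
    for (const std::string& id : ids) {
        urls.push_back(efetch_fasta_url(id));
    }
    return urls;
}
```

Each resulting string can then be handed to CURLOPT_URL through c_str() inside the existing request loop, in place of the www.ncbi.nlm.nih.gov/nuccore/ URLs.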
Last edited on Oct 14, 2013 at 4:45am