How much experience do you have with fstream? HTML files may (that's a big may, I'm not sure) be stored in some formatted form, in which case you would have to read in binary mode and interpret the individual bits and bytes yourself.
If it isn't formatted, your task is simply to learn enough HTML to know which parts to parse out and what to look for. In other words, how to separate the tags from the actual data.
@tummychow, fstream is for both reading and writing files, right? Yeah, I have used it before. Does reading in binary mode help? I have done that before, but only for binary files (.dat extension).
Right now I don't know whether I should find a function that strips the HTML tags and then read the result, or find some way to identify the ID and its values on the finance page. How can I write a program to find the different values when I can't find anything proper to identify them by? If anyone here has written programs that read from web pages, please help.
What I would do, although I'm no HTML expert so I can't say that this would work, is something like this:
Create a parser that takes a std::string. Set up my fstream to read the HTML file one line at a time in text mode (assuming it's plain text, which I'm reasonably sure it is), and pass each line to the parser. Strip out all the HTML tags and formatting markers, leaving behind only the displayed text. Then analyze it.
If you're doing this for your own use, you're probably going to be in for the long haul; I can't honestly say that I know how to follow the method I just gave.
I believe doing this in binary mode would be a pain in the ass. I have done this before in Java using regular expressions, but I haven't tried it in C++ yet.
Maybe you could find a library to make things easier, but I suggest using regular expressions.
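For what it's worth, C++11 has std::regex in the standard library, so the regular expression route could look roughly like this. The function names and the id "last_price" in the usage note are hypothetical, not taken from any real finance page, and the id passed in is assumed to contain no regex special characters. Regular expressions only handle simple, well-behaved markup, so treat this as a sketch.

```cpp
#include <regex>
#include <string>

// Removes anything that looks like a tag: '<', any run of non-'>' characters, '>'.
std::string stripTagsRegex(const std::string& html) {
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(html, tag, "");
}

// Returns the text that directly follows the opening tag carrying id="...",
// up to the next '<'. Returns an empty string when nothing matches.
// Assumption: the attribute is written with double quotes and no spaces around '='.
std::string textForId(const std::string& html, const std::string& id) {
    const std::regex pattern("id=\"" + id + "\"[^>]*>([^<]*)<");
    std::smatch match;
    if (std::regex_search(html, match, pattern)) {
        return match[1].str();
    }
    return "";
}
```

For example, textForId(page, "last_price") on a page containing <span id="last_price" class="v">101.25</span> would give back 101.25.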
The < and > characters mark where each tag begins and ends. It's easier to just make a function that puts the entire HTML file into a single line in a std::string object, then learn to analyze the data from there. Since HTML describes the organization of the data and not its appearance, you can collapse repeated whitespace before passing the string to a parser, which makes the parser easier to write. I also suggest you remove all unnecessary spaces within elements (i.e. <div style = "..." > to <div style="...">) to make element parsing easier.
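Something along these lines would do the "whole file in one string" part. readAll and collapseWhitespace are names I picked for the sketch. Note that collapsing whitespace also changes the contents of <pre> blocks, which this sketch makes no attempt to protect.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Reads the whole stream into one std::string.
std::string readAll(std::istream& in) {
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Replaces every run of whitespace (spaces, tabs, newlines) with a single
// space and drops leading and trailing whitespace.
std::string collapseWhitespace(const std::string& s) {
    std::string out;
    bool pendingSpace = false;
    for (unsigned char c : s) {
        if (std::isspace(c)) {
            pendingSpace = !out.empty();   // never emit a leading space
        } else {
            if (pendingSpace) {
                out += ' ';
            }
            pendingSpace = false;
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

Calling collapseWhitespace(readAll(file)) on an open std::ifstream gives you the single-line string to feed to the parser.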
In my opinion, learning to read an XML file would be more useful. It's very similar to HTML but doesn't have many pre-defined keywords, so the overall project wouldn't be as large. If you end up needing to parse HTML, you can simply edit your XML parser or set up a system for it to sort out specific tags.
The parser should probably spit out an array of Element objects (Element being a class, but the name is up to you) that contain the element type, its attributes, and the data it contains (including other elements). If you ever want to make a system like the DOM, it will be easier to accomplish because the data is already organized into a tree.
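As a sketch of what such a class might look like (the member names and the findById helper are my own, and a std::vector of the still-incomplete Element type is only guaranteed to work from C++17 on):

```cpp
#include <map>
#include <string>
#include <vector>

// One node of the parsed document.
struct Element {
    std::string type;                                // tag name, e.g. "div"
    std::map<std::string, std::string> attributes;   // attribute name -> value
    std::string text;                                // text directly inside the element
    std::vector<Element> children;                   // nested elements, in document order
};

// Depth-first search for the first element whose "id" attribute equals id.
// Returns nullptr when no element matches.
const Element* findById(const Element& root, const std::string& id) {
    const auto it = root.attributes.find("id");
    if (it != root.attributes.end() && it->second == id) {
        return &root;
    }
    for (const Element& child : root.children) {
        if (const Element* hit = findById(child, id)) {
            return hit;
        }
    }
    return nullptr;
}
```

Because each Element owns its children, the parser's output is already a tree, and looking up a value by its ID becomes a short recursive walk.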