Reading HTML files

Jan 22, 2010 at 3:58pm
I dunno why I can't read from the HTML file.
Last edited on Jan 27, 2010 at 7:06am
Jan 22, 2010 at 4:14pm
How much experience do you have with fstream? HTML files may (that's a big may, I'm not sure) be stored in some encoded format, in which case you would have to read in binary mode and interpret the individual bytes yourself.
If the file is plain text, your task is simply to learn enough HTML to know which parts to parse out and what to look for; in other words, how to separate the tags from the actual data.
Jan 22, 2010 at 10:45pm
I have written a program that reads data from this HTML file, but for some reason it doesn't work properly.

Why not show us some of your code?
Jan 23, 2010 at 2:11am
Done
Last edited on Jan 27, 2010 at 7:06am
Jan 23, 2010 at 2:22am
@tummychow, fstream is for both reading and writing files, right? Yeah, I have used it before. Does reading in binary mode help? I have done that before, but only for binary files (.dat extension).

Right now I don't know whether I should find a function that strips the HTML tags and then read the text, or find some way to identify the ID and its values on the finance page. How can I write a program to find the different values when I can't find anything reliable to identify them by? If anyone here has written programs that read from web pages, please help.

Regards
Last edited on Jan 23, 2010 at 2:23am
Jan 23, 2010 at 2:27am
What I would do, although I'm no HTML expert so I can't say that this would work, is something like this:
Create a parser that takes a std::string. Set up my fstream to read the HTML file one line at a time in text mode (assuming it's plain text, which I'm reasonably sure it is), and pass each line to the parser. Strip out all the HTML tags and formatting markers, leaving behind only the displayed text. Then analyze it.
If you're doing this for your own use, you're probably in for a long haul; I can't honestly say that I know how to implement the method I just gave.
Jan 23, 2010 at 2:56am
I believe doing this in binary mode would be a pain. I have done this before in Java using regular expressions, but I haven't tried it in C++ yet.

Maybe you could find a library to make things easier, but I suggest using regular expressions.
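
For what the regular-expression route might look like in C++, here is a rough sketch using std::regex from C++11. The function names and the <td> example are assumptions for illustration only; the real finance page may be laid out quite differently, and regular expressions only cope with simple, regular markup:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: deletes anything that looks like a tag, i.e. a '<',
// then any characters other than '>', then a '>'.
std::string stripTagsRegex(const std::string& html) {
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(html, tag, "");
}

// Hypothetical helper: returns the text of the first <td>...</td> pair
// that contains no nested tags, or an empty string if there is none.
std::string firstCell(const std::string& html) {
    static const std::regex cell(R"(<td\b[^>]*>([^<]*)</td>)");
    std::smatch match;
    if (std::regex_search(html, match, cell)) {
        return match[1].str();  // capture group 1 is the cell's text
    }
    return "";
}
```

The capture group is what makes this useful for picking out one value rather than just deleting tags: the pattern describes the markup around the value, and the parentheses mark the part to keep.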
Jan 23, 2010 at 3:42am
The < and > markers delimit the tags, and the tags are what divide the sections (a tag ending in /> is a self-closing one). It's easier to just make a function that puts the entire HTML file into a single line in a std::string object, then learn to analyze the data from there. Since runs of whitespace in HTML generally don't affect the data (outside elements such as <pre>), you can collapse duplicate spacing before passing the string to a parser, which makes the parser easier to code. I also suggest you remove all unnecessary spaces within elements (e.g. <div style = "..." > to <div style="...">) to make element parsing easier.

In my opinion, learning to read an XML file would be more useful. It's very similar to HTML but doesn't have many pre-defined keywords, so the overall project wouldn't be as large. If you end up needing to parse HTML, you can simply edit your XML parser or set up a system for it to sort out specific tags.

The parser should probably spit out an array of Element objects (Element being a class, but the name is up to you) that contain the element type, its attributes, and the data it contains (including other elements). If you ever want to make a system like the DOM, it will be easier to accomplish because the data is already organized into a tree.
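
One possible shape for such an Element class, as a sketch only. The member names and the findById helper are assumptions, not anything a particular library provides, and the recursive std::vector<Element> member assumes a reasonably modern (C++17) compiler:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical node type for the parser's output tree.
struct Element {
    std::string type;                               // tag name, e.g. "div"
    std::map<std::string, std::string> attributes;  // e.g. "id" -> "price"
    std::string text;                               // text directly inside the element
    std::vector<Element> children;                  // nested elements, in document order
};

// Hypothetical helper: depth-first search for the first element whose "id"
// attribute equals the given value. Returns nullptr if nothing matches.
const Element* findById(const Element& root, const std::string& id) {
    const auto it = root.attributes.find("id");
    if (it != root.attributes.end() && it->second == id) {
        return &root;
    }
    for (const Element& child : root.children) {
        if (const Element* hit = findById(child, id)) {
            return hit;
        }
    }
    return nullptr;
}
```

Because the children are stored inside their parent, looking up a value by its ID becomes a plain tree walk instead of another pass over the raw text.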