HTML file analysis

OK, so firstly, I've seen a few archived topics on this subject, but I haven't found a proper answer to my question. I have two questions:

1. My loop for counting the number of links and comments is returning 0, when it should be returning 1 for each.

2. I'm having trouble figuring out how to count the characters between "<" and ">", so I can tell how many characters are inside HTML tags.

This is the relevant code for the tag comment and link counters:

    while (file)
    {
        getline(file, text);
        cout << text << "\n";
        lines++;

        for (int i = 0; i < text.size(); i++)
        {
            if (i == tag)
            {
                tags++;

                if (i+1 == comm)
                {
                    comms++;
                }
                else if (i+1 == link || i+1 == link2)
                {
                    links++;
                }
            }

            chars++;
        }
    }
> if (i == tag)
You're comparing an index with what?

Surely you mean something like
if (text[i] == tag)

Your while loop isn't correct. The file stream only indicates failure after the attempt to use getline(). The loop should be like:

while (getline(file, text))
{
    cout << text << "\n";

Hello PacificAtlantic,

As I look at the while loop it is completely out of context. There is no way to know what you did to lead up to this while loop.

What are the variables "tag", "comm", "link" and "link2"? How are they defined, and what values do they hold? These are good questions that will drag this out until there is an answer.

It is best to provide enough of a program that it will compile and can be run and tested. It also helps to post the input file you are using, so everyone can work from the same information and not have to guess.

I was also thinking that the variables "tag", "comm", "link" and "link2" may work better as constant variables because their values should not be changed.

Given the code:
if (i == tag)
{
    tags++;

One missing "s" on line 3 and you increment "tag" instead of "tags", so the value of "tag" will change and the if statement may never be true again. Just a suggestion: "tagCount" may be a better name, with less chance of being confused with the variable "tag".

Andy
salem, that small change actually fixed the first bug. Appreciate that.

Also, here is the entire code that I'm working with, for context. Sorry, didn't think of that sooner.

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
    string text;        // file text
    int lineCount = 0;      // number of lines
    int tagCount = 0;       // number of tags
    int linkCount = 0;      // number of links
    int commCount = 0;      // number of comments
    int charCount = 0;      // number of characters
    int tag_charCount = 0;  // number of characters in tags
    double ratio = 0;   // ratio of characters in tags vs. out of tags
    string filename;

    const char tag = '<';
    const char link = 'a';
    const char link2 = 'A';
    const char comm = '!';
    const char tag_end = '>';

    cout << "HTML File Stat Viewer" << endl;
    cout << "\nEnter a valid filename.";
    cout << "(Should have no blanks, and include the extension)" << endl;
    cout << "> ";
    cin >> filename;

    ifstream file;
    file.open (filename.c_str());

    while (!file)       // Error-checking for file name
    {
        file.clear();
        cout << "\nInvalid file name. Please enter a valid file name.\n" << endl;
        cout << "> ";
        cin >> filename;

        file.open (filename.c_str());
    }

    cout << "\nFile Text:" << endl;
    cout << "-------------------------------------------------------------------------\n" << endl;
    while (file)
    {
        getline(file, text);
        cout << text << "\n";
        lineCount++;

        for (int i = 0; i < text.size(); i++)
        {
            if (text[i] == tag)
            {
                tagCount++;

                if (text[i+1] == comm)
                {
                    commCount++;
                }
                else if (text[i+1] == link || text[i+1] == link2)
                {
                    linkCount++;
                }
            }

            charCount++;
        }
    }

    cout << "\nHTML Analysis:" << endl;
    cout << "-------------------------------------------------------------------------" << endl;
    cout << "Number of lines: " << lineCount - 1 << endl;
    cout << "Number of tags: " << tagCount << endl;
    cout << "Number of comments: " << commCount << endl;
    cout << "Number of links: " << linkCount << endl;
    cout << "Number of chars in file: " << charCount << endl;
    cout << "Number of chars in tags: " << tag_charCount << endl;
    cout << "Percentage of chars in tags: " << ratio << endl;

    file.close();
    return 0;
}


The last thing I'm missing is getting the number of characters inside the "<" and ">" brackets, so I can get the total number of tag characters.

Anything helps, thanks.
Hello PacificAtlantic,

Some little things to get you started:
#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main()
{
    string text;            // file text
    int lineCount{};        // number of lines
    int tagCount = 0;       // number of tags
    int linkCount = 0;      // number of links
    int commCount = 0;      // number of comments
    int charCount = 0;      // number of characters
    int tag_charCount = 0;  // number of characters in tags
    double ratio = 0;       // ratio of characters in tags vs. out of tags
    string filename{ "Test HTML.html" }; // <--- Used for testing. Comment or remove when finished. That is everything including the {}s.

    const char TAG = '<';
    const char LINK = 'a';
    const char LINK2 = 'A';
    const char COMM = '!';
    constexpr char COMM2{ '-' };
    const char TAG_END = '>';

    cout <<
        '\n' <<
        std::string(23, ' ') << "HTML File Stat Viewer\n" << std::string(70, '-') << '\n' <<
        " Enter a valid filename.\n"
        " (Should have no blanks, and include the extension) > ";
    cout << filename << "\n"; // <--- Used for testing. Comment or remove when finished.
    //cin >> filename; // <--- Used for testing. Uncomment when finished.  Should be changed to use "std::getline()".

    ifstream file(filename);
    //file.open(filename);

    while (!file)       // Error-checking for file name
    {
        file.clear();

        cout <<
            "\n     Invalid file name. Please enter a valid file name.\n\n"
            "Enter a valid filename.\n"
            " (Should have no blanks, and include the extension) > ";
        std::getline(std::cin, filename);  // <--- Changed.

        file.open(filename);
    }

    cout << "\n File Text:\n";
    cout << std::string(70,'-')  << '\n';

    //while (file)
    while(getline(file, text))
    {
        cout << text << "\n";

        lineCount++;

        for (int i = 0; i < text.size(); i++)
        {
            if (text[i] == TAG)

The comments should explain most of the changes.

Look at lines 10, 17 and 23. If they give an error when you compile the code, you may need to set your IDE/compiler to use at least the C++11 standard, or you may need to upgrade. The C++17 standard is considered the current standard to compile to.

The {}s in lines 10, 17 and 23, known as the uniform initializer, are available from C++11 on.
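
A small sketch of what the {}s do, in case it helps. The struct "Counters" is only a hypothetical wrapper for this illustration; the variable names are taken from the program above.

```cpp
#include <string>

// Hypothetical wrapper, used only to show the initializers side by side.
struct Counters
{
    int lineCount{};                            // empty braces value-initialize the int to 0
    int tagCount = 0;                           // the older form, same result
    std::string filename{ "Test HTML.html" };   // braces with a value
};

// Braces also work for constants.
constexpr char COMM2{ '-' };

// Braces reject narrowing conversions, so this line would not compile:
// int n{ 3.5 };
```

Either form gives the same starting values here; the braces mainly add the narrowing check.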

Although not mandatory, constant variables are usually given names in capital letters. This makes it clear that these variables are constants that cannot be changed. It also sets them apart from regular variables.

Line 34 is the simple way to define a file stream variable and open the file at the same time. Inside the while you will still need the ".open()" because the variable is already defined.

The second while loop, as seeplus showed you, is the best way to read a file of unknown length. With your code:
while (file)
{
    getline(file, text);

    cout << text << "\n";

    lineCount++;

With the read inside the while loop, when the read fails and sets the "eof" bit on the stream, you still do a "cout" to the screen of whatever is in "text", which may be undetermined, and then you add 1 to "lineCount" when you do not need to.

When you get to the line cout << "Number of lines: " << lineCount - 1 << endl;, the "- 1" should be telling you that there is a problem, because you should not need it.

Done correctly the while loop will end before you add the extra 1 to "lineCount".
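
To see the difference in numbers, here is a minimal sketch of both loops side by side. The function names "countLines" and "countLinesTestFirst" are made up for this example, and both take any std::istream so they can be tried on a string stream instead of a real file.

```cpp
#include <istream>
#include <sstream>
#include <string>

// The read itself controls the loop, so the body runs only
// when getline() actually produced a line.
int countLines(std::istream& in)
{
    std::string text;
    int lineCount = 0;

    while (std::getline(in, text))
    {
        ++lineCount;
    }

    return lineCount;
}

// The original pattern, kept for comparison: the stream is tested
// before the read, so the body runs one extra time after the last line.
int countLinesTestFirst(std::istream& in)
{
    std::string text;
    int lineCount = 0;

    while (in)
    {
        std::getline(in, text);   // the final call fails, but the count still goes up
        ++lineCount;
    }

    return lineCount;
}
```

Fed a std::istringstream holding three lines, the first function returns 3 and the second returns 4, which is where the "- 1" came from.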

Now you have a bigger problem than just counting the characters between "< >". According to your program, anything that starts with "<" is a tag. In the test file that I am using I have these 2 lines of code:
<td class="Gr"> <150 </td

<td class = "None"> < 80 </td>

The "<150" and "< 80" parts are what is printed in the table cell. Neither would be considered a proper tag, but the program counts them as tags when it should not.

The next problem I found is the comments. The first line of code in my test file is: <!DOCTYPE html>. This is not a comment, but is counted as one. This may not be completely accurate, but I would consider this more as a directive to the browser to tell it what is coming and how to process it.

https://www.w3schools.com/tags/tag_doctype.asp

As I remember it, a true comment starts with 4 characters: <!--. To count something as a comment you will need to check all 4 characters for a match. This part I am not sure of yet:
<!--[if IE]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->

Is it considered a comment or something special?

When opening the "html" file in MSVS 2017, it changes the colour of the type to green to show it is a comment when surrounded by <!-- -->, so this may be considered a true comment.
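
One possible way to check all 4 characters is sketched below. The helper names "isCommentStart" and "countComments" are hypothetical, and the sketch assumes that anything starting with <!-- counts as one comment, which includes the [if IE] block above.

```cpp
#include <cstddef>
#include <string>

// True only when the four characters starting at pos are "<!--".
// compare() clamps the length at the end of the string, so a short
// tail such as "<!-" simply fails to match.
bool isCommentStart(const std::string& text, std::size_t pos)
{
    return pos < text.size() && text.compare(pos, 4, "<!--") == 0;
}

// Counts how many times "<!--" starts in one line of text.
int countComments(const std::string& text)
{
    int commCount = 0;

    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (isCommentStart(text, i))
        {
            ++commCount;
        }
    }

    return commCount;
}
```

With this check <!DOCTYPE html> is no longer counted, because only its first 2 characters match.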

First you need to figure out what is a proper tag and how you will deal with closing tags like </p>. Will they count as a tag or should they be counted as a "closingTagCount"? Saying that "<" makes a tag is not working.
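
As a starting point only, here is a sketch of one way to separate opening tags, closing tags and a stray "<". It rests on an assumption that is mine, not a full HTML rule: a tag name begins with a letter directly after "<", or directly after "</" for a closing tag. The names "TagKind" and "classifyTag" are made up for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class TagKind { None, Opening, Closing };

// Classifies whatever starts at text[pos].
// Assumption: a tag name begins with a letter right after '<'
// (or right after "</" for a closing tag). Anything else is not a tag.
TagKind classifyTag(const std::string& text, std::size_t pos)
{
    if (pos >= text.size() || text[pos] != '<')
        return TagKind::None;

    std::size_t next = pos + 1;
    bool closing = false;

    if (next < text.size() && text[next] == '/')
    {
        closing = true;
        ++next;
    }

    if (next < text.size() && std::isalpha(static_cast<unsigned char>(text[next])))
        return closing ? TagKind::Closing : TagKind::Opening;

    return TagKind::None;
}
```

Under that assumption "<td" is an opening tag, "</td" is a closing tag, and both "<150" and "< 80" come back as None. Comments and <!DOCTYPE html> also come back as None, so they would need their own check.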

The if and if/else if statements in the while loop need more work to better determine what is a tag and what is a comment. Checking for a link with the <a> tag is working. My test file has 2 <a> tags and it counted them just fine.

Until you figure out what a proper tag is, counting the characters between the "< >" is the least of your problems.
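
For when you get there, the counting itself could look something like the sketch below. It deliberately uses the naive rule that every "<" opens a tag and the next ">" closes it, so it will still be fooled by lines like "<150". The struct "CharStats" and the function "countTagChars" are hypothetical names; the brackets themselves are not counted as tag characters.

```cpp
#include <istream>
#include <sstream>
#include <string>

struct CharStats
{
    int charCount = 0;      // all characters read (newlines excluded)
    int tagCharCount = 0;   // characters strictly between '<' and '>'
};

// Naive assumption: every '<' opens a tag and the next '>' closes it.
CharStats countTagChars(std::istream& in)
{
    CharStats stats;
    std::string text;
    bool insideTag = false;     // kept across lines, so a tag may span lines

    while (std::getline(in, text))
    {
        for (char c : text)
        {
            ++stats.charCount;

            if (c == '<')
                insideTag = true;
            else if (c == '>')
                insideTag = false;
            else if (insideTag)
                ++stats.tagCharCount;
        }
    }

    return stats;
}
```

The percentage would then be 100.0 * tagCharCount / charCount, with a check that charCount is not 0 before dividing.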

Andy