Find missing lines in file


Suppose we have given two files A and B
and we have to find lines that are present in file A
but not present in file B and print result to file C

Loading files into memory and
solving it in vector or list may not solve the problem
because we can have insufficient memory to load the file

How can i write code for this problem
Can you assume that every line of B can be found from A?
You could do something like this:

-> open A
-> read first line from A
-> open B
-> read first line from B
-> open C
-> Run a loop comparing the current line from A to each of the lines in B. If by reaching the end of B you haven't found a similar line, write current line to C. If you find a similar line, read next line from A, go back to beginning of B.

I don't think you would have any memory issue by reading a text file line by line since you'd only be dealing with two strings at once.

Hope this helps
@hoogo:
What if:

A:
1
2
3
4
5
a
b
a
c
a


B:
1
2
c
a


Your result would be:
b

That may, or may not be the expected output.

You are right though that the problem is most likely is solvable by looking at streams of lines rather than entire files.
That may, or may not be the expected output.

As far ad I understand OP's issue it's what he's looking for, or is it?
and we have to find lines that are present in file A
but not present in file B and print result to file C

I was thinking in terms of text files as an example. Considering a "line" is any string ending with '\n', so it's highly unlikely that you get two identical lines.

In which way would you think it might not be the expected output? :)
Last edited on
I could claim that A was created by prepending/inserting
1
2
3
a
b
a

into the B.

It is a design decision, whether a line is a line when they happen to have same content.
@keskiverto

I must apologise but I'm not getting what you're saying at all about the relationship between A and B. Aren't they two separate text files, which generation is independent from one another?
You assume separate. I assume one evolving from the other. We can both be wrong.

The point is that we should know the relationship, etc in order to select "best" solution for it.
For example, if we know that each file is less than megabyte, or that some are larger than terabyte, that knowledge will guide our decisions.
Oh. Yes, I see what you mean. Indeed it's very relevant to know the source, size and relationship between, the files. I hadn't taken that into consideration.

I guess it's up to OP to narrow down what files he wants to deal with..
How can i write code for this problem


write a function that searches a file for a given text string and returns a bool.
bool FileContains(const std::string& searchTerm);

get that working.

then write another function to read each line from a file, in the loop call FileContains(thisLine) to see if the line exists, and if not, write it to file C.


Edit: After reading keskiverto's responses i feel the need to add, this code frag
1
2
3
4
5
6
7
for each line in A
    if (fileC.FileContains(thisLine))
        // already found, ignore
    else if (fileB.FileContains(thisLine))
        fileC.write(thisLine)
    endif
endfor


As keskiverto says, inside knowledge will lead to better algorithms and performance. My example above is just a brute force method.

for example, if you know that all of the items in the list are alphabetic order you can tweak the FileContains() algorithm to bug out sooner, if its looking for "dog" and gets to "house" you know "dog" cannot exist. but we know nothing of your data so brute force would seem the best suggestion atm.

Last edited on
I'd say less "want" and more "Teacher, I need vital details!"
@keskiverto

True enough..

I tried brute force in C
but i have got segmentation Fault

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
#include<stdio.h>
#include<string.h>
#include<limits.h>
#include<conio.h>

int main()
{
    FILE *A;
    FILE *B;
    char* linia1;
    char* linia2;
    int dlugosc;
    int znaleziono;

    if ((A = fopen("plikA.txt", "rt"))
       == NULL)
   {
      fprintf(stderr, "Cannot open input file.\n");
      return 1;
   }
   if ((B = fopen("plikB.txt", "rt"))
       == NULL)
   {
      fprintf(stderr, "Cannot open input file.\n");
      return 1;
   }
   while(!feof(A))
    {
        znaleziono=0;
        fgets(linia1,INT_MAX,A);
        dlugosc=strlen(linia1)-strlen(strstr(linia1,"("));
        fseek(B,0L,SEEK_SET);
       while(!feof(B))
       {
           fgets(linia2,INT_MAX,B);
           if(!strncmp(linia2,linia1,dlugosc)) znaleziono=1;
       }
        if(!znaleziono) printf("%s\n",linia1);
       puts(linia1);
   }


   fclose(A);
   fclose(B);



    return 0;
}


What is wrong in this code ?

Yes we will get better algorithm if we reduce this problem to sorting

One problem is on line 30 fgets(linia1,INT_MAX,A);
linia1 points just somewhere since it's not initialized.
You could read a batch at a time from the first file read into an unordered_map and go through the second file comparing to the map.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
#include <iostream>
#include <fstream>
#include <string>
#include <unordered_map>
using namespace std;

const int LINES_PER_BATCH = 1000000;

int main() {
    ifstream finA("fileA");  if (!finA) { cerr<<"fileA\n"; return 0; }
    ifstream finB("fileB");  if (!finB) { cerr<<"fileB\n"; return 0; }

    while (true) {
        string line;

        // read a batch of fileA lines
        unordered_map<string,bool> batch; // bool true means "found in fileB"
        for (int i = 0; i < LINES_PER_BATCH && getline(finA, line); ++i)    
            batch[line] = false;
        if (batch.size() == 0)
            break;

        // process all of fileB with this batch of fileA's lines.
        while (getline(finB, line)) {
            auto it = batch.find(line);
            if (it != batch.end())
                it->second = true;
        }
        finB.seekg(0); // reset fileB to the beginning

        // print lines from this batch of fileA not found in fileB        
        for (const auto& x: batch)
            if (!x.second)
                cout << x.first << '\n';
    }
}

Last edited on
Topic archived. No new replies allowed.