Help with merging files

Oct 29, 2011 at 9:55am
I am new to this so please go easy on me.
I have to merge sorted files in a directory.The files contains two strings per line and I have to merge on the first string.

At present I am able to open the directory and read the files one by one.Can anybody please help me how to do this?

I am not asking the code but only the logic.
Oct 29, 2011 at 11:33am
The classic merge is using a priority queue.

Open all the files and read the first line, use the first string as a key for the priority queue. You will need to store the
output line and file handle as the data so maybe group these in a struct or something.

Take the top item from the queue and output it, read another record from this file and put it in the queue, repeat until no more data.

This may be overkill in that you probably don't have that many files and the process will be IO bound anyway.

So another option is to just use an array of the top items from each file. Repeatedly linear search the array to find the top string,
output this line and replace that array record with the next item in that file. This may be just as fast as the priority queue in this case.

The files will end at different times so you will have to mark the array records of files that have run out.
Oct 29, 2011 at 1:25pm
Hi ur2cdanger ,
What code do you have . . ? please let us know so that we can suggest you some thing in it .
Oct 29, 2011 at 2:53pm
This is what I have done so far.I have merged all the files without sorting.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93

/*
 * mergetokens.h
 *
 *  Created on: Oct 29, 2011
 *      Author: surya
 */

#ifndef MERGETOKENS_H_
#define MERGETOKENS_H_



#include<iostream>
#include<fstream>
#include<sstream>
#include<string>
#include<set>
#include<algorithm>
#include<vector>
#include<map>
#include<cstring>

using namespace std;


void merge()
{
	DIR *pDIR;
	struct dirent *entry;
	ofstream outFile;
	fstream inFile;
	outFile.open("tokens.txt");
	if(!outFile)
		cout<<"File Not created"<<endl;
	else
		cout<<"File created"<<endl;

	vector<string> files;
	string read_dir_path1 = "/home/surya/workspace1/Neptune/tokens/";
	if( pDIR=opendir(read_dir_path1.c_str() ) )
	{
			cout<<"\nDirectory Opened\n";
			while(entry = readdir(pDIR))
			{
				if( strcmp(entry->d_name, ".") != 0 && strcmp(entry->d_name, "..") != 0 )
				{

					//tokenize( entry->d_name, read_dir_path1 );
					cout<<entry->d_name<<endl;
					files.push_back(entry->d_name);
					cout<<files.size()<<endl;


				}

			}
	}
	else
		cout<<"Directory did not open"<<endl;
	for(int i=0;i<files.size();i++)
	{
		string temp1;
		strcpy(temp1,files.pop_back());
		cout<<temp1<<endl;
		inFile.open(temp1);
		if(!inFile.is_open())
				cout<<"Not opened"<<endl;
		else
		{
				cout<<"opened"<<endl;
				string temp;
				while(!inFile.eof())
				{
						string line;
						getline(inFile,line);
						outFile<<line;
						outFile<<endl;
				}
		}
		inFile.close();

	}



				outFile.close();

}



#endif /* MERGETOKENS_H_ */ 
Oct 30, 2011 at 4:05pm
here's what I have

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
#include <queue>
#include <vector>
#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>
#include <climits>

using namespace std;

struct lineandfile
{
  ifstream* file;
  string sortkey,lineremainder;
  lineandfile(ifstream* setfile,string currentline);
  // Priority queue extracts top (highest) element, cheat by reversing inequality, there is a neater way
  int operator<(const lineandfile& other) const
    { return(sortkey > other.sortkey); }
};

// Assuming a single space between line components
lineandfile::lineandfile(ifstream* setfile,string currentline)
  : file(setfile)
{
  int keyend = currentline.find(' ');
  sortkey = currentline.substr(0,keyend);
  lineremainder = currentline.substr(keyend+1,INT_MAX);
}


int main()
{
  vector<string> filenames;
  filenames.push_back("test1.txt");
  filenames.push_back("test2.txt");

  priority_queue<lineandfile> fileheads;
  char buffer[1000];
  int i;

  // Initialise queue with first line of each file
  for(i=0;i<filenames.size();++i)
  {
    ifstream* file = new ifstream();
    file->open(filenames[i].c_str());
    if(!file->good())
    { // bail out
      exit(EXIT_FAILURE);
    }
    file->getline(buffer,1000);
    if(file->good() && buffer[0] != '\0')
    {
      fileheads.push(lineandfile(file,buffer));
    }
  }

  // Extract top item from queue and output
  // Push on next item from that file if it exists
  // Output is to stdout
  while(!fileheads.empty())
  {
    lineandfile current(fileheads.top());
    cout << current.sortkey << ' ' << current.lineremainder << endl;
    fileheads.pop();
    current.file->getline(buffer,1000);
    if(current.file->good() && buffer[0] != '\0')
    {
      fileheads.push(lineandfile(current.file,buffer));
    }
    else
    {
      current.file->close();
      delete current.file;
    }
  }
}


test1.txt
apple pear
blackberry strawberry
cherry banana

test2.txt
aardvark pangolin
bear panda
coyote wolf

output
aardvark pangolin
apple pear
bear panda
blackberry strawberry
cherry banana
coyote wolf

The main advantage of file merging is when you files are huge (gigabytes) you can still merge them.
If you files are not so big then you can read them into RAM and merge there.

Topic archived. No new replies allowed.