how to fit range coordinates?

Hello,

I try to make correspondance between intervals, I m' explain myself:

- first file (file1.txt) is a list of intervals as format :

id1 3455 3463
id2 3535 3544
id3 3601 3608
id4 3622 3630
id5 3631 3913
id6 3631 3759
id7 3631 5899
id8 3760 3913
id9 3760 5630
id10 3996 4276
id11 4486 4605
id12 4706 5095
id13 5174 5326
.../...

- second file (file2.txt) another list of intervals as format :

acc_2419732 3268 3285
acc_3041065 3565 3583
acc_362358 3640 3656
acc_3279485 3793 3813
acc_3091017 3794 3811
acc_2807380 3832 3848
acc_3105138 3832 3848
acc_3105139 3832 3848
acc_3105140 3832 3848
acc_3116450 3832 3848
acc_86708 3832 3848
acc_1987802 3922 3938
acc_1679660 4113 4129
acc_891489 4113 4129
acc_2829973 4299 4318
acc_298381 4603 4619
.../...

the question is the following one:
which are the intervals of my file 2 (file2.txt) which are found included among those of file 1 (file1.txt)?
by writing a simple script in awk (syntax rather close to C):

#!/usr/bin/awk -f
#usage: myprog.awk file2.txt

BEGIN {
while((getline < "file1.txt") > 0){
cpt++
descr[cpt]=$1
start[cpt]=$2
end[cpt]=$3
}
close("test.txt")
}
{
j=1
while(start[j]<=$2 && j<=cpt){
if(end[j]>=$3){print $1,$2,$3,descr[j],start[j],end[j];j++}
else{j++}
}
}

we find the results:

acc_362358 3640 3656 id5 3631 3913
acc_362358 3640 3656 id6 3631 3759
acc_362358 3640 3656 id7 3631 5899
acc_3279485 3793 3813 id5 3631 3913
acc_3279485 3793 3813 id7 3631 5899
acc_3279485 3793 3813 id8 3760 3913
acc_3279485 3793 3813 id9 3760 5630
acc_3091017 3794 3811 id5 3631 3913
acc_3091017 3794 3811 id7 3631 5899
acc_3091017 3794 3811 id8 3760 3913
acc_3091017 3794 3811 id9 3760 5630
acc_2807380 3832 3848 id5 3631 3913
acc_2807380 3832 3848 id7 3631 5899
acc_2807380 3832 3848 id8 3760 3913
acc_2807380 3832 3848 id9 3760 5630
acc_3105138 3832 3848 id5 3631 3913
acc_3105138 3832 3848 id7 3631 5899
acc_3105138 3832 3848 id8 3760 3913
acc_3105138 3832 3848 id9 3760 5630
acc_3105139 3832 3848 id5 3631 3913
acc_3105139 3832 3848 id7 3631 5899
acc_3105139 3832 3848 id8 3760 3913
acc_3105139 3832 3848 id9 3760 5630
acc_3105140 3832 3848 id5 3631 3913
acc_3105140 3832 3848 id7 3631 5899
acc_3105140 3832 3848 id8 3760 3913
acc_3105140 3832 3848 id9 3760 5630
acc_3116450 3832 3848 id5 3631 3913
acc_3116450 3832 3848 id7 3631 5899
acc_3116450 3832 3848 id8 3760 3913
acc_3116450 3832 3848 id9 3760 5630
acc_86708 3832 3848 id5 3631 3913
acc_86708 3832 3848 id7 3631 5899
acc_86708 3832 3848 id8 3760 3913
acc_86708 3832 3848 id9 3760 5630
acc_1987802 3922 3938 id7 3631 5899
acc_1987802 3922 3938 id9 3760 5630
acc_1679660 4113 4129 id7 3631 5899
acc_1679660 4113 4129 id9 3760 5630
acc_1679660 4113 4129 id10 3996 4276
acc_891489 4113 4129 id7 3631 5899
acc_891489 4113 4129 id9 3760 5630à
acc_891489 4113 4129 id10 3996 4276
acc_2829973 4299 4318 id7 3631 5899
acc_2829973 4299 4318 id9 3760 5630
acc_298381 4603 4619 id7 3631 5899.
acc_298381 4603 4619 id9 3760 5630

great it's an excellent result.
Except that when my two files exceed the 100000 lines the computing time grows in an exponential way, and i need more than 24 hours to have the result.
I point out that the two files are sorted according to the second column in ascending order.
Y' would not have it a more elegant method so as to obtain a linear and non quadratic comparison like is the case here?

Thank you.
Topic archived. No new replies allowed.