Need to remove rows with duplicate values in single column and older datestamp

Pages: 12
Hi

I need to remove the rows with duplicate values in single column and older timestamp.

I have an unsorted set of data that is stored in CSV file. I need to use time stamp to get proper data. I need to eliminate duplicate entries from the set of data in CSV file. There are more than 10,000 set of data in CSV file, in the below format.

Any idea how can I use STL vector or map or set to achieve this?
Below is the data.

order_number activity timestamp
990112678909876001235486;register order;2014-03-26 10:32:56.000
990112678909875001235486;register order;2014-03-26 10:25:56.000
990112678909876300123546;register order;2014-03-22 10:31:56.000
990112678909873001235436;register order;2014-01-26 11:42:56.000
990112678909872001235426;register order;2014-02-26 10:32:46.000
990112678909872002235216;check stock;2014-03-26 10:32:46.000
990112678909872002235216;ship order;2014-01-26 11:22:16.000
990112678909872002235216;handle payment;2014-03-26 10:32:46.000
Last edited on
Can it be done this way?

sort( vec.begin(), vec.end() );
vec.erase( unique( vec.begin(), vec.end() ), vec.end() );


OR use the below.

std::vector<Samples> sample; // full of Samples

// rearranges the vector, moving all the unique elements to the front
// and duplicates to the back
// last is an iterator to beginning of the duplicates
auto last = std::unique(samples.begin(), samples.end());

// erase all duplicates
samples.erase(last, samples.end());
Last edited on
I want to delete all the rows with duplicate timestamp older than many days from this specific set. Is there any way to do it ?
Last edited on
Your question is VERY unclear, but the following will eliminate duplicate order_numbers, keeping the one with the latest datestamp. (Your actual data is a bit ambiguous in that respect, as two co-coded items appear to have exactly the same date and time.)

Another possibility would be to use a std::map, associating the order number with an activity/datestamp pair. That's for another day.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <set>
#include <algorithm>
using namespace std;

// ifstream input( "data.txt" );
istringstream input(
"990112678909876001235486;register order;2014-03-26 10:32:56.000 \n"
"990112678909875001235486;register order;2014-03-26 10:25:56.000 \n"
"990112678909876300123546;register order;2014-03-22 10:31:56.000 \n"
"990112678909873001235436;register order;2014-01-26 11:42:56.000 \n"
"990112678909872001235426;register order;2014-02-26 10:32:46.000 \n"
"990112678909872002235216;check stock;2014-03-26 10:32:46.000    \n"
"990112678909872002235216;ship order;2014-01-26 11:22:16.000     \n"
"990112678909872002235216;handle payment;2014-03-26 10:32:46.000 \n" );


struct Order
{
   string order_number;
   string activity;
   string date;

   friend istream & operator >> ( istream &in, Order & order )
   {
      if ( getline( in, order.order_number, ';' ) )
      {
         getline( in, order.activity, ';'  );
         getline( in, order.date    , '\n' );
      }
      return in;
   }
};


//======================================================================


struct Event
{
   int year, month, day, hour, minute;
   double second;
   Event( const string &A )
   {
      stringstream ss( A );
      char c;
      ss >> year >> c >> month >> c >> day >> hour >> c >> minute >> c >> second;
   }
};

//-------------------------------------

bool operator < ( const Event &a, const Event &b )
{
   if ( a.year   != b.year   ) return a.year   < b.year  ;
   if ( a.month  != b.month  ) return a.month  < b.month ;
   if ( a.day    != b.day    ) return a.day    < b.day   ;
   if ( a.hour   != b.hour   ) return a.hour   < b.hour  ;
   if ( a.minute != b.minute ) return a.minute < b.minute;
   return a.second < b.second;
}

//======================================================================


int main()
{
   vector<Order> orders;
   set<string> codes;

   for ( Order item; input >> item; )
   {
      if ( !codes.insert( item.order_number ).second )
      {
         int pos = find_if( orders.begin(), orders.end(), [&item]( const Order &e ){ return e.order_number == item.order_number; } ) - orders.begin();
         if ( Event( orders[pos].date ) < Event( item.date ) ) orders[pos] = item;
      }
      else
      {
         orders.push_back( item );
      }
   }

   for ( Order e : orders ) cout << e.order_number << ';' << e.activity << ';' << e.date << '\n';
}


990112678909876001235486;register order;2014-03-26 10:32:56.000 
990112678909875001235486;register order;2014-03-26 10:25:56.000 
990112678909876300123546;register order;2014-03-22 10:31:56.000 
990112678909873001235436;register order;2014-01-26 11:42:56.000 
990112678909872001235426;register order;2014-02-26 10:32:46.000 
990112678909872002235216;check stock;2014-03-26 10:32:46.000  
Last edited on
I am sorry for the confusion in the input data. Let me clarify you here.
I guess the code that you provided here will handle for only Case 2 mentioned here.

Case 1:
Few of the data has duplicate entries for both columns (order_number, activity) here, but has different timestamp. For example, row1 and row2 has the same order_number 990112678909876001235486, as seen below.

Case2:
Only order_number column is duplicate, all other columns are different.
For example, row 7 and row 8 has the same order_number column 990112678909872002235216, but both activity column and timestamp column are different here.

Case3:
Both activity and timestamp columns will be duplicate, but order_number will be different. For example,row9 and row10 has the same activity (handle payment) and timestamp, but order_number column is different here.

Case4:
Both order_number column and activity column are duplicated, but in timestamp column only year:mm:dd are same and hh:mm:ss:ms are different.
For example, refer to the row 11, row 12 and row 13 here.

So how can I handle all these cases?

order_number;activity;timestamp
990112678909876001235486;register order;2014-03-26 10:32:56.000
990112678909876001235486;register order;2014-03-26 12:26:16.000
990112678909875001235486;register order;2014-03-26 10:25:56.000
990112678909876300123546;register order;2014-03-22 10:31:56.000
990112678909873001235436;register order;2014-01-26 11:42:56.000
990112678909872001235426;register order;2014-02-26 10:32:46.000
990112678909872002235216;check stock;2014-03-26 10:32:46.000
990112678909872002235216;ship order;2014-01-26 11:22:16.000
990112678909872002235216;handle payment;2014-03-26 10:32:46.000
990112678909871002435208;handle payment;2014-03-26 10:32:46.000
990112678909872002235216;ship order;2014-01-26 10:00:16.000
990112678909872002235216;ship order;2014-01-26 10:15:26.000
990112678909872002235216;ship order;2014-01-26 11:22:46.000
Last edited on
My code handles the case where the order_number is duplicated (by tracking use of that in a set). If you want it to do anything else ... you will still have to explain more clearly.

In your cases above, exactly what do you want to KEEP?
Last edited on
I want to remove all duplicate entries from the above set. It needs to handle atleast first 3 cases. Here case 2 is already handled,so we need to focus on case 1 and case 3. Is it possible? This is an unsorted set of data. So I was thinking to first sort it first based on timestamp (in this case it would be the last column) and then remove the duplicates. do you have any idea?
Last edited on
We can delete duplicate from vector by iterate vector for each element. Below is the code, just for an example. But I am not sure how to do based on timestamp by sorting it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#include <iostream>
#include <vector>
#include <algorithm>
 
void remove(std::vector<int> &j)
{
    auto end = j.end();
    for (auto itr = j.begin(); itr != end; ++itr) {
        end = std::remove(itr + 1, end, *itr);
    }
 
    j.erase(end, j.end());
}
 
int main()
{
    std::vector<int> j = { 5, 2, 1, 3, 3, 2, 2, 4, 5, 4, 6 };
 
    remove(j);
 
    for (auto itr = j.cbegin(); itr != j.cend(); ++itr)
        std::cout << *itr << ' ';
 
    return 0;
}


O/P: 5 2 1 3 4 6
Last edited on
In your last example:
- row 1 and 2 have the same order number and activity, but different timestamp; so you want to keep row 2 but delete row 1 - is that correct? (my code would do that)

- row 7 and row 8 have the same order number; you want to keep row 7 (later month) - is that correct? (my code would do that )

- row 9 and 10 have different order numbers; you want to keep both - is that correct?

- row 11, 12, 13 have the same orde number and activity but different timestamp; you want to delete rows 11 and 12, but keep row 13; - is that correct?


If I've got any of these wrong then I have no idea what you want. Maybe somebody else can follow what you are saying.
Below are my comments:

- row 1 and 2 have the same order number and activity, but different timestamp; so you want to keep row 2 but delete row 1 - is that correct? (my code would do that)
Yes, keep only latest timestamp and remove older one. So delete row 1 and keep row2 here.

- row 7 and row 8 have the same order number; you want to keep row 7 (later month) - is that correct? (my code would do that )
Since only order number is duplicated here, I want to keep both here. I mean to keep both row 7 and 8. Only if more than 2 columns are duplicated in same row, then delete it, which is not the same in this case. Only if the order number and timestamp is duplicated, then delete older and keep latest, which is not applicable in this case.[Is this good way to do, when there are many duplicates in a larger set?]

- row 9 and 10 have different order numbers; you want to keep both - is that correct?
It has both activity and timestamp duplicated here, so want to delete it, as more than one column appears the same here. I mean to keep latest order number and delete old order number, as both timestamp and activity are same in this case.
- row 11, 12, 13 have the same order number and activity but different timestamp; you want to delete rows 11 and 12, but keep row 13; - is that correct?
Yes, delete both rows 11 and 12 (with older timestamp) and keep only row 13 here, which is latest timestamp.
In rows 9 and 10 why do you want to delete two things with different order numbers? That's not the way to run an online shop! And what do you mean by "latest" order number?

In a case where order number and timestamp are the same, how do you decide which is the "older" to delete?

If this is an assignment please post the original assignment (not your paraphrasing of it).
Last edited on
Ok, understood.
In that case, lets keep row 9 and 10, if both activity and timestamp are duplicates. In short we will keep different order numbers for tracking purpose.
I just had a quick run of your sample code and got the below output.

For row 11, 12, 13- where have the same order number and activity. Is it possible to delete both rows 11 and 12 (with older timestamp) and keep only row 13 here, which is latest timestamp?

BEFORE RUN:

order_number;activity;timestamp
990112678909876001235486;register order;2014-03-26 10:32:56.000
990112678909876001235486;register order;2014-03-26 12:26:16.000
990112678909875001235486;register order;2014-03-26 10:25:56.000
990112678909876300123546;register order;2014-03-22 10:31:56.000
990112678909873001235436;register order;2014-01-26 11:42:56.000
990112678909872001235426;register order;2014-02-26 10:32:46.000
990112678909872002235216;check stock;2014-03-26 10:32:46.000
990112678909872002235216;ship order;2014-01-26 11:22:16.000
990112678909872002235216;handle payment;2014-03-26 10:32:46.000
990112678909871002435208;handle payment;2014-03-26 10:32:46.000
990112678909872002235216;ship order;2014-01-26 10:00:16.000
990112678909872002235216;ship order;2014-01-26 10:15:26.000
990112678909872002235216;ship order;2014-01-26 11:22:46.000


AFTER RUN:
order_number;activity;timestamp
990112678909876001235486;register order;2014-03-26 12:26:16.000
990112678909875001235486;register order;2014-03-26 10:25:56.000
990112678909876300123546;register order;2014-03-22 10:31:56.000
990112678909873001235436;register order;2014-01-26 11:42:56.000
990112678909872001235426;register order;2014-02-26 10:32:46.000
990112678909872002235216;check stock;2014-03-26 10:32:46.000
990112678909871002435208;handle payment;2014-03-26 10:32:46.000
Last edited on
Is there any way to sort the data BEFORE or AFTER RUN?
@denver2020,
Please post your original assignment. Your attempted explanations of it are making no sense.


As far as I can see the only logical way to de-duplicate is to remove repetitions of
{order_number,activity}
pairs, keeping that with the latest datestamp, and the only logical way to sort is by order_number, then datestamp.

Like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
#include <iostream>
#include <iomanip>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <set>
#include <algorithm>
using namespace std;

// ifstream input( "data.txt" );
istringstream input(
"990112678909876001235486;register order;2014-03-26 10:32:56.000\n"
"990112678909876001235486;register order;2014-03-26 12:26:16.000\n"
"990112678909875001235486;register order;2014-03-26 10:25:56.000\n"
"990112678909876300123546;register order;2014-03-22 10:31:56.000\n"
"990112678909873001235436;register order;2014-01-26 11:42:56.000\n"
"990112678909872001235426;register order;2014-02-26 10:32:46.000\n"
"990112678909872002235216;check stock;2014-03-26 10:32:46.000\n"
"990112678909872002235216;ship order;2014-01-26 11:22:16.000\n"
"990112678909872002235216;handle payment;2014-03-26 10:32:46.000\n"
"990112678909871002435208;handle payment;2014-03-26 10:32:46.000\n"
"990112678909872002235216;ship order;2014-01-26 10:00:16.000\n"
"990112678909872002235216;ship order;2014-01-26 10:15:26.000\n"
"990112678909872002235216;ship order;2014-01-26 11:22:46.000\n" );


//======================================================================


string trim( const string &s, string junk = " " )
{
   int i = s.find_first_not_of( junk );
   if ( i == string::npos ) return "";

   int j = s.find_last_not_of( junk );
   return s.substr( i, j - i + 1 );
}


//======================================================================


struct Order
{
   string order_number;
   string activity;
   string date;

   friend istream & operator >> ( istream &in, Order & order )
   {
      if ( getline( in, order.order_number, ';' ) )
      {
         getline( in, order.activity, ';'  );
         getline( in, order.date    , '\n' );
         trim( order.order_number );
         trim( order.activity     );
         trim( order.date         );
      }
      return in;
   }
};


//======================================================================


struct Event
{
   int year, month, day, hour, minute;
   double second;
   Event( const string &A )
   {
      stringstream ss( A );
      char c;
      ss >> year >> c >> month >> c >> day >> hour >> c >> minute >> c >> second;
   }
};

//-------------------------------------

bool operator < ( const Event &a, const Event &b )
{
   if ( a.year   != b.year   ) return a.year   < b.year  ;
   if ( a.month  != b.month  ) return a.month  < b.month ;
   if ( a.day    != b.day    ) return a.day    < b.day   ;
   if ( a.hour   != b.hour   ) return a.hour   < b.hour  ;
   if ( a.minute != b.minute ) return a.minute < b.minute;
   return a.second < b.second;
}


//======================================================================


bool sortOrders( const Order &a, const Order &b )
{
   if ( a.order_number != b.order_number ) return a.order_number < b.order_number;
   return Event( a.date ) < Event( b.date );
}


//======================================================================


int main()
{
   vector<Order> orders;
   set< pair<string,string> > codes;             // A set to keep track of duplicate order-number/activity pairs

   for ( Order item; input >> item; )                                                // Read each line
   {
      if ( !codes.insert( { item.order_number, item.activity } ).second )            // If this pair exists ...
      {
         int pos = find_if( orders.begin(), orders.end(),                            // ... find its position ...
               [&item]( const Order &e ){ return e.order_number == item.order_number
                                              && e.activity     == item.activity     ; } ) - orders.begin();
         if ( Event( orders[pos].date ) < Event( item.date ) ) orders[pos] = item;   // ... if newer, then replace
      }
      else
      {
         orders.push_back( item );                                                   // otherwise, just add
      }
   }

   sort( orders.begin(), orders.end(), sortOrders );                                 // Sort (order number, then date)

   string previous;
   for ( Order e : orders )
   {
      if ( previous != "" && e.order_number != previous ) cout << '\n';
      cout << setw( 30 ) << e.order_number << ' '
           << setw( 20 ) << e.activity     << ' '
           << setw( 30 ) << e.date         << '\n';
      previous = e.order_number;
   }
}


      990112678909871002435208       handle payment        2014-03-26 10:32:46.000

      990112678909872001235426       register order        2014-02-26 10:32:46.000

      990112678909872002235216           ship order        2014-01-26 11:22:46.000
      990112678909872002235216          check stock        2014-03-26 10:32:46.000
      990112678909872002235216       handle payment        2014-03-26 10:32:46.000

      990112678909873001235436       register order        2014-01-26 11:42:56.000

      990112678909875001235486       register order        2014-03-26 10:25:56.000

      990112678909876001235486       register order        2014-03-26 12:26:16.000

      990112678909876300123546       register order        2014-03-22 10:31:56.000
Last edited on
Thanks for your quick response.
Actually the data here is unsorted one. So I was just wondering if we need to first remove duplicate and then sort it based on timestamp column. I was thinking of introducing a function like below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
void searchAndSubstituteAll(std::string& data, std::string Searchto, std::string Strsubstitute)
{
    // Get the first occurrence
    size_t pos = data.find(Searchto);

    // Repeat till end is reached
    while (pos != std::string::npos)
    {
        // Substitute this occurrence of Sub String
        data.substitute(pos, Searchto.size(), Strsubstitute);
        // Get the next occurrence from the current position
        pos = data.find(Searchto, pos + Strsubstitute.size());
    }
}
Last edited on
Based on your latest code, I have tested and below is the output.
I can see that AFTER_RUN, has single space which got introduced after each line. Few lines do not have single space , if there is a pair which has both order_number and timestamp are same. Is there any way to avoid the new single space that got introduced here after each line for which no pair is identfied? Please advise me.

BEFORE RUN:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
order_number;activity;timestamp
990112678909876001235486;register order;2014-03-26 10:32:56.000
990112678909876001235486;register order;2014-03-26 12:26:16.000
990112678909875001235486;register order;2014-03-26 10:25:56.000
990112678909876300123546;register order;2014-03-22 10:31:56.000
990112678909873001235436;register order;2014-01-26 11:42:56.000
990112678909872001235426;register order;2014-02-26 10:32:46.000
990112678909872002235216;check stock;2014-03-26 10:32:46.000
990112678909872002235216;ship order;2014-01-26 11:22:16.000
990112678909872002235216;handle payment;2014-03-26 10:32:46.000
990112678909871002435208;handle payment;2014-03-26 10:32:46.000
990112678909872002235216;ship order;2014-01-26 10:00:16.000
990112678909872002235216;ship order;2014-01-26 10:15:26.000
990112678909872002235216;ship order;2014-01-26 11:22:46.000



AFTER RUN:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
      990112678909871002435208       handle payment    2014-03-26 10:32:46.000

      990112678909872001235426       register order       2014-02-26 10:32:46.000

      990112678909872002235216           ship order        2014-01-26 11:22:46.000
      990112678909872002235216          check stock       2014-03-26 10:32:46.000
      990112678909872002235216       handle payment    2014-03-26 10:32:46.000

      990112678909873001235436       register order        2014-01-26 11:42:56.000

      990112678909875001235486       register order        2014-03-26 10:25:56.000

      990112678909876001235486       register order        2014-03-26 12:26:16.000

      990112678909876300123546       register order        2014-03-22 10:31:56.000

                  order_number                  activity                      timestamp
Last edited on
May be some function like this to eliminate empty space between lines? Please comment.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
bool forEmptySpace(const std::string &orders)
	{
		if (orders.length() == 0)
			return true;
		else
		{
			for (std::size_t index = 0; index < orders.length(); index++)
			{
				if (!std::isspace(orders[index]))
					return false;
			}

			return true;
		}
	}

Last edited on
denver2020

I only put the extra lines in so that you could distinguish groups with the same order number.
If you don't like it then just take it out:
1
2
3
4
5
6
   for ( Order e : orders )
   {
      cout << setw( 30 ) << e.order_number << ' '
           << setw( 20 ) << e.activity     << ' '
           << setw( 30 ) << e.date         << '\n';
   }


You are writing snippets of code that have no bearing on the problem.
> You are writing snippets of code that have no bearing on the problem.
https://cboard.cprogramming.com/cplusplus-programming/180158-need-help-sort-remove-duplicate-entries.html
That's all they've ever managed in a decade of "programming".
Instead of order_number sort, I used only datestamp for sort. I changed the below lines in the code and got the below output.

1
2
 if ( a.date != b.date) return a.date < b.date;
   return Event( a.date ) < Event( b.date );


AFTER RUN:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
      990112678909872002235216           ship order        2014-01-26 11:22:46.000

      990112678909873001235436       register order        2014-01-26 11:42:56.000

      990112678909872001235426       register order        2014-02-26 10:32:46.000

      990112678909876300123546       register order        2014-03-22 10:31:56.000

      990112678909875001235486       register order        2014-03-26 10:25:56.000

      990112678909872002235216          check stock        2014-03-26 10:32:46.000
      990112678909872002235216       handle payment        2014-03-26 10:32:46.000

      990112678909871002435208       handle payment        2014-03-26 10:32:46.000

      990112678909876001235486       register order        2014-03-26 12:26:16.000
 

Last edited on
Pages: 12