I am hoping someone can help me with this problem. I have a large data set (over a million records) with about 100 variables holding the results for each person at various times.
I am looking for an algorithm that will tell me which rows, within each subgroup, match all the values in another row. Every row that matches row 1 is given a tag of 1 and is not evaluated further. Every remaining row (one that did not match row 1) is then checked against row 2; those that match are given a tag of 2 and are not evaluated further, and so on. The additional rule is that a missing result is treated as a wild card that can match either a 0 or a 1 (the only two possible outcomes). However, once a wild card has been matched to a 0 or a 1, it has to match that value in any subsequent processing. The data set must be sorted by time.
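
To make the rules concrete, here is a minimal Python sketch of one possible reading of them, not a definitive implementation. It assumes each record is a `(group, time, values)` tuple in which `values` is a list of `0`, `1`, or `None` (missing), and that a wild card becomes fixed on the reference row of its tag the first time a matching row supplies a value. The names `compatible` and `tag_rows` and the record layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def compatible(pattern, values):
    """True if every position agrees, treating None as a wild card on either side."""
    return all(p is None or v is None or p == v for p, v in zip(pattern, values))

def tag_rows(rows):
    """Return one tag per input row, in the original input order.

    rows: list of (group, time, values), values being a list of 0, 1 or None.
    Assumption: tags restart at 1 inside each group.
    """
    patterns = defaultdict(list)   # group -> reference patterns; tag = index + 1
    tags = [None] * len(rows)
    # Process rows in time order within each group.
    order = sorted(range(len(rows)), key=lambda i: (rows[i][0], rows[i][1]))
    for i in order:
        group, _time, values = rows[i]
        for tag, pattern in enumerate(patterns[group], start=1):
            if compatible(pattern, values):
                # Fix any wild card in the pattern to the value just matched.
                for k, v in enumerate(values):
                    if pattern[k] is None:
                        pattern[k] = v
                tags[i] = tag
                break
        else:
            # No existing pattern matched: this row starts the next tag.
            patterns[group].append(list(values))
            tags[i] = len(patterns[group])
    return tags

example = [
    ("A", 1, [1, None, 0]),
    ("A", 2, [1, 1, 0]),
    ("A", 3, [1, 0, 0]),
    ("A", 4, [None, 0, 0]),
    ("B", 1, [0, 0, 0]),
]
print(tag_rows(example))  # [1, 1, 2, 2, 1]
```

In the example, the second row fixes the wild card in the first reference row to 1, so the third row (which has a 0 in that position) no longer matches tag 1 and starts tag 2. Processing rows one at a time against the existing patterns gives the same tags as making one full pass per reference row, because each row is still tried against tags in order and only sees updates from earlier rows. The worst case grows with rows × tags × variables, so for a million records it may be worth encoding each row as a pair of bitmasks (known positions and their values) if this turns out to be too slow.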