Degree of sortedness.

Hello,

This might not be the best place to ask it, but I figure I'd give it a shot here: I'm looking for an algorithm to judge the "sortedness" of a list of integers. I know of such algorithms for Strings, but they don't take into account the "distance" of two items.

Specifically, I'm looking for a measure that gives 'approximately correct' lists (e.g. 5, 6, 9, 8, 10, 9, 9, 10, 11) a very high score, but a single 'highly out of place' item can drag the score way down (e.g. 6, 8, 9, 9, 9, 5, 10, 10, 11).

Is there an existing algorithm for it, or can anyone think of a clever way? Mind that the result should also be easy to interpret.

P.S.: I'm not too concerned with the efficiency. I'm using it on large arrays (few billion entries), but only once or twice and only during calibration.
Use a standard sorting algorithm that needs fewer steps on an "approximately correct" list, and just run it, counting the iterations. Fewer iterations, better score.
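A minimal sketch of that idea, using an insertion sort whose element-shift count serves as the score (lower = more sorted; the function name is just for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count the element shifts an insertion sort performs on a copy of the
// input; a nearly sorted list needs few shifts, a reversed one ~n^2/2.
long long sortEffort(std::vector<int> a)   // by value: we sort a copy
{
    long long shifts = 0;
    for (std::size_t i = 1; i < a.size(); ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];    // shift the larger element right
            --j;
            ++shifts;
        }
        a[j] = key;
    }
    return shifts;
}
```

Note that this illustrates the objection below as well: {1, 2, 3, 4, 10, 5, 6} and {1, 2, 3, 4, 10000, 5, 6} both take exactly two shifts, so they get the same score.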
But a (comparison-based) sorting algorithm wouldn't see any difference in the following lists:
List 1: {1, 2, 3, 4, 10, 5, 6}
List 2: {1, 2, 3, 4, 10000, 5, 6}

I'm thinking of something along the lines of summing up max(0, a(i) - a(i+1)) over all adjacent pairs and dividing by the total sum of items, but I fear the resulting range would be too narrow to be useful.
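A sketch of that normalized descent sum, under the assumption that "wrong order" means a(i) > a(i+1) in an ascending list (function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sum all "descents" a[i] - a[i+1] where a pair is out of ascending
// order, normalized by the total sum of items.  0.0 = perfectly
// sorted; larger values = bigger, value-weighted disorder.
double descentScore(const std::vector<int>& a)
{
    std::int64_t descents = 0, total = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        total += a[i];
        if (i + 1 < a.size() && a[i] > a[i + 1])
            descents += a[i] - a[i + 1];
    }
    return total > 0 ? static_cast<double>(descents) / total : 0.0;
}
```

On the two example lists this does separate the cases: {1, 2, 3, 4, 10, 5, 6} scores about 0.16 (descent of 5 over a total of 31), while {1, 2, 3, 4, 10000, 5, 6} scores about 0.997, since the single huge descent dominates the total.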
Bump?
Use Spearman's correlation coefficient between each value and its rank: 1.0 means a perfectly sorted list, 0.0 unordered, -1.0 reversed order.
Again, Spearman's Rho doesn't take into account the values of the numbers, just the ranks.

I want to add larger penalties for "obviously wrong" numbers.
There are many, many algorithms you could knock together for this. Here's one candidate:

Sort the array. Then create the following sum:

for (int i = 0; i < n; i++)
{
    sum += abs(originalArray[i] - sortedArray[i]);
}


If you can just explain carefully exactly how you want the scoring system to work, you will have explained to yourself your own solution.

I want to add larger penalties for "obviously wrong" numbers.


Obviously wrong numbers will have obviously wrong ranks.
@rapidcoder: You're not listening to what I'm asking. If it places 101 before 100, then that's okay, because 100 ~= 101. If it places 1000 before 100, then that's not okay, because 1000 isn't anywhere near 100. Check my example above:

List 1: {1, 2, 3, 4, 10, 5, 6}
List 2: {1, 2, 3, 4, 10000, 5, 6}

Any type of rank correlation measure (or 'sort-effort' algorithm) would value those two the same, because the 5th element should be the last in both cases. I want to punish the second list much harder, because 10000 obviously doesn't belong before 5, while 10 isn't that far off.

@Moschops: I have a few ideas, but it's difficult to predict whether they'll be useful/meaningful.
The easiest would be summing up all positive "a(i) - a(i+1)" (= wrong order) and comparing that to the total sum. I might square the differences to punish "obviously wrong" numbers even harder. The problem is that the resulting number isn't easy to interpret.

More advanced: I'm thinking of taking the average of the elements from a(i-t) to a(i+t) and measuring how much deleting element 'i' changes the average of that window. Of course, the parameter 't' is a problem, and even a perfectly sorted list can get a very odd result depending on the numbers.
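A rough sketch of that windowed idea (the function name and the interpretation of the window bounds are assumptions, not something settled in the thread):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// For each element, compare the mean of the window a[i-t..i+t] with the
// mean of the same window with a[i] removed; an element far from its
// neighbours moves the mean a lot.  't' is a tuning parameter.
double windowOutlierScore(const std::vector<int>& a, int t)
{
    double score = 0.0;
    const int n = static_cast<int>(a.size());
    for (int i = 0; i < n; ++i) {
        int lo = std::max(0, i - t), hi = std::min(n - 1, i + t);
        int count = hi - lo + 1;
        if (count < 2) continue;               // nothing to compare against
        double sum = 0.0;
        for (int j = lo; j <= hi; ++j) sum += a[j];
        double withMean    = sum / count;
        double withoutMean = (sum - a[i]) / (count - 1);
        score += std::abs(withMean - withoutMean);
    }
    return score;
}
```

As the post notes, this doesn't give perfectly sorted lists a score of 0, but it does punish {1, 2, 3, 4, 10000, 5, 6} far more heavily than {1, 2, 3, 4, 10, 5, 6} for any small t.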
Oh right, well then I'll just use the standard C++ function "calculateLevelOfObviousWrongness"... oh, hang on a minute! There's NO SUCH FUNCTION!

The following is obviously wrong:

{ 1, 3, 2 }

As is the following:

{ 1, 1000, 2}

They're both "obviously" wrong. Which is "more obviously" wrong? Are they both the same amount of "wrongness"? Do you see what I'm getting at?

Should the scoring be based on how far out of order they are, in which case these have the same score? Or should it be a function of the value alone? Shall we give the second one a score of abs(1000 - 2) + abs(2 - 1000), because there is a 1000 where the 2 should be, and a 2 where the 1000 should be? That would give the second one a score of 1996, and the first one a score of 2. Yikes! Are we really saying that the second one is almost a thousand times more unsorted than the first?

These are the questions I'm getting at. If the OP can numerically quantify what he means when he says "sortedness", then he will have answered his own question.
I see what you're getting at; that's the exact reason I posted this topic: for input on how to quantify/measure this. If I knew which measure I wanted to use, I would have done so rather than posting this.

I'm not asking for "What's the solution to this?", because there isn't one; I'm asking for opinions and insights, asking what you would do. The problem is simple enough for anyone to give his two cents. The example clearly expresses which is "good" and which is "bad". I'm asking how you would express this mathematically/statistically to get a single output to evaluate and compare a "roughly sorting" algorithm.

I've repeated several times that I'm not interested in rank, but in precision. I've given several simple possible methods; I'm looking for thoughts on how to translate them into a meaningful measure. The problem really isn't that specific.
How about:

for each value that is in the wrong place, the absolute difference between it and what should be there, multiplied by the distance it is displaced from the correct position? That would give more weight to big differences, and would also allow for distance from the correct place.

That might give too much weight to numerical differences

i.e.

{1,1000,3} - badness 997+997 = 1994

which arguably shouldn't be rated that much less wrong than

{1, 5000, 3} - badness 4997 + 4997 = 9994

A better way might be
log10 (absoluteNumericalDifference) * distanceFromCorrectPosition

which would give these scores:

{1,1000,3} - badness 5.99
{1, 5000, 3} - badness 7.40
{5000, 1, 3} - badness 15.10

Under the second scheme, {1, 5000, 3} is 25% more wrong than {1, 1000, 3}, and {5000, 1, 3} comes out as about twice as wrong as {1, 5000, 3}.
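A sketch of that log-weighted scoring rule, under one plausible reading: compare each position's value with the value the sorted array has there, and weight by how far the element is displaced from its sorted position. This reading reproduces the 5.99 and 7.40 above; the third score depends on exactly how displacement is counted, so treat the details below as assumptions:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// badness = sum over misplaced positions of
//   log10(|value - sortedValueAtThatPosition|) * |displacement of value|
double badness(const std::vector<int>& a)
{
    std::vector<int> sorted = a;
    std::sort(sorted.begin(), sorted.end());
    double score = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        int diff = std::abs(a[i] - sorted[i]);
        if (diff == 0) continue;               // in place (or a duplicate)
        // where a[i] belongs in the sorted array (assumes distinct values)
        std::size_t target =
            std::lower_bound(sorted.begin(), sorted.end(), a[i]) - sorted.begin();
        double displacement =
            std::abs(static_cast<double>(target) - static_cast<double>(i));
        score += std::log10(diff) * displacement;
    }
    return score;
}
```

A perfectly sorted list scores 0, {1, 1000, 3} about 6.0, and {1, 5000, 3} about 7.4 under this reading.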
Very good suggestion! It solves the problem of evaluating lists with several sorted segments, because each element is only judged by its own location. Additionally, it gives correctly sorted lists a score of 0, making it easy to evaluate.

I'd drop the multiply-by-displacement factor, though. Partly because I believe the displacement is going to be completely meaningless in my applications, as well as for precision problems (I'm testing on a limited range of values, but the list itself can become very big).

Translating it into a meaningful number could be a problem, as well as handling lists with big "gaps" between the numbers.
Topic archived. No new replies allowed.