is it preferable to have simple values for clustering, or something more complex like, say, a histogram, is also feasible? |
The main problem here is finding "good" relations - e.g., a strict weak ordering is required for sorting (and thus speeding up the process considerably from O(n
2) to O(n log n)). If you manage to define such relations in a meaningful manner, then complex attributes aren't a problem. (However, how would you define such a relation for histograms? A lexicographical comparison doesn't make sense, what you most likely will end up with is something like the average brightness level, disguised in a complicated formula).
Also bear in mind that for k-means you require a (semi-)metric, but while you can interpret histograms as K-vectors they do build metric space, again this metric most likely doesn't make sense for your application (and the metric is the most important thing for you: after all, what you want to do is measure the "distance" of two images: the distance is 0 if and only if the compared images are the same; distances are not negative; the distance d(i1, i2) = d(i2,i1), i.e. it is symmetric; and d(i1,i2)+d(i2,i3) ≥ d(i1,i3). Thus your similarity function in the end *is* a metric, and therefore you want a metric feature space, in which the distance has the same meaning.)
[...] It's not perfect, but I think it's not a bad idea. What do you think? |
I think you should try it. Don't be disappointed if it doesn't give you the expected results - image analysis is at this time, as I understand it, as much a mathematical based science as it is experimentation. The Sobel operator, for example, has mathematical properties which make it "good" (it eliminates noise along the edge by a Gaussian filter while differentiating orthogonal to the edge), but it wouldn't have been found if experiments with Prewitt's filters wouldn't have shown such bad results in real-life images where noise actually is a major factor (even if not visible).
So what I would do is to start with something like you have described, and then try to figure out in which situations it works and it which it doesn't. Then you can try to build a mathematical model based on that, and to optimize your approach within the model.