Suggestions for a program

I was almost done typing this post, when I accidentally hit ctrl+r instead of ctrl+t and got the whole thing wiped clean. That's the last time I type anything in a browser.

So I'm writing a set of programs to track down duplicate files in general, and duplicate images in particular (image boards tend to make these suckers pile up in your file system).

The first program, which I have dubbed imgcmp, searches a list of directories taken from a configuration file for files matching a list of patterns (e.g. "*.bmp", "*.jp*g") and then, for each file:
1. Computes a CRC64 of the file.
2. Loads the image and resizes it to a 32x32x24 thumbnail using nearest neighbor, ignoring the x:y ratio. This thumbnail will be used for comparison.
3. Converts the pixels in the thumbnail to HSV.
4. Indexes the thumbnail in a map using the CRC.
After all thumbnails are loaded, which takes on average ~10 minutes for every ~10k images, I do a pixel-by-pixel comparison of each possible pair of images (n(n-1)/2 pairs). It's really nothing fancy, just a subtraction of channel values (I have to do something a bit more complex to compare hues, though). If the sum of all differences for a pair is lower than a threshold, the result is added to a vector (I used to add every pair to an array regardless, but then a friend told me about mutual exclusion; good thing, because the memory usage was ridiculous). Pairs that are in fact the same file are given a negative result so that they later appear at the top of the vector.
When all pairs have been compared, the vector is sorted from most alike to least alike. I found that a likeness below roughly 95-98% is probably a false positive.
Finally, the hashmap is saved to a binary file to be reloaded in future runs.
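
Roughly, the comparison step looks like this (a simplified sketch, not the actual imgcmp code: the real program keys thumbnails by CRC in a map, and the Thumb/Match types, the hue wrap-around handling and the threshold here are only illustrative):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Thumb {
    std::uint64_t crc;                         // CRC64 of the original file
    float h[32 * 32], s[32 * 32], v[32 * 32];  // 32x32 HSV thumbnail
};

struct Match {
    std::size_t a, b;   // indices of the two thumbnails
    double diff;        // total difference; negative = identical file
};

double hueDiff(float a, float b) {
    // Hues are circular (0..360), so the distance wraps around.
    double d = std::fabs(a - b);
    return std::min(d, 360.0 - d);
}

std::vector<Match> comparePairs(const std::vector<Thumb>& t, double threshold) {
    std::vector<Match> matches;
    for (std::size_t i = 0; i < t.size(); ++i) {
        for (std::size_t j = i + 1; j < t.size(); ++j) {
            if (t[i].crc == t[j].crc) {          // same file: skip the pixel loop
                matches.push_back({i, j, -1.0});
                continue;
            }
            double sum = 0.0;
            for (int p = 0; p < 32 * 32; ++p) {
                sum += hueDiff(t[i].h[p], t[j].h[p])
                     + std::fabs(t[i].s[p] - t[j].s[p])
                     + std::fabs(t[i].v[p] - t[j].v[p]);
            }
            if (sum < threshold)
                matches.push_back({i, j, sum});
        }
    }
    // Most alike (smallest difference) first; exact duplicates sort to the top.
    std::sort(matches.begin(), matches.end(),
              [](const Match& x, const Match& y) { return x.diff < y.diff; });
    return matches;
}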

The second program hashes all files in all file systems and saves the CRC64-to-file-name mapping to a text file (the file names encoded as UTF-8, of course). As a side note, it runs at an average speed of 60 MiB/s, with peaks of 80 MiB/s on the larger and less fragmented files.
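
In outline, it does something like this (a minimal sketch, not the real program: directory traversal and error handling are omitted, paths are read from standard input, and the bit-by-bit CRC-64/ECMA-182 update would be table-driven in practice):

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Bit-by-bit CRC-64/ECMA-182 style update (slow but simple).
std::uint64_t crc64(std::uint64_t crc, const char* buf, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<std::uint64_t>(static_cast<unsigned char>(buf[i])) << 56;
        for (int b = 0; b < 8; ++b)
            crc = (crc & 0x8000000000000000ULL)
                ? (crc << 1) ^ 0x42F0E1EBA9EA3693ULL   // ECMA-182 polynomial
                : (crc << 1);
    }
    return crc;
}

std::uint64_t hashFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1 << 20);            // 1 MiB read buffer
    std::uint64_t crc = 0;
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0)
        crc = crc64(crc, buf.data(), static_cast<std::size_t>(in.gcount()));
    return crc;
}

int main() {
    std::ofstream out("crc64_index.txt");       // one "crc<TAB>path" line per file
    std::string path;
    while (std::getline(std::cin, path))        // file paths fed on stdin
        out << std::hex << hashFile(path) << '\t' << path << '\n';
}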

The third program should monitor a set of directories for new files and automatically delete those that appear in the database produced by the second program*, and report their likeness to images stored in the database produced by imgcmp.

I'd like to hear suggestions and ideas for improvement (I'd like to keep it as a console program, though). I'm especially worried about the possibility of CRC collisions; I don't mind increasing the size of the digest, but since I'll have to process a lot of files, I'd prefer not to increase hashing (digestion?) time too much.

* Ideally, this database should be updated to reflect deletions and moves. I'm still thinking of a way to monitor these, similar to what Sysinternals' FileMon does. Suggestions in this area are particularly welcome.
I'd look at a SHA algorithm instead of CRC. My understanding is that CRC is not as unique as you might think.
All hash algorithms have significant collisions. CRC was not designed for file comparisons --only integrity testing. Unless the file is very carefully damaged, the CRC will likely reflect the change. However, that doesn't mean that it can't happen, or that other files won't exhibit the same CRC.

Likewise for SHA, MD5, etc. The difference between algorithms is really just the robustness of the integrity check. Not file comparisons.

Lastly, CRC, SHA, etc. are premised upon an exacting input. Images may appear very similar, but have very distinct data. The variation that occurs just from resizing and/or resampling an image obviates the utility of a hash algorithm.

There exist image normalization techniques to minimize these variations, but there is no perfect method. You will have to consider that your matching algorithm will need some statistical analysis --matching positive for a certain probability, rather than a 100% match.

Alas, sorry...
I only hash the file to reduce comparison time. If two files are known to be the same, there is no point in comparing each of their pixels.
As for likeness, I knew all along that it was best to return a probability that two files were the same, rather than a certainty. That has been in the algorithm from the beginning.

Does anyone have any ideas on how to monitor file system operations?
What do you want to do, determine *equal* or *similar* images? For the first, use a hash from the SHA family (SHS) of appropriate size (SHA-1 will be a good choice for most cases; it is readily implemented and will most likely not produce any collisions - there are more possible values than particles in the known universe, and they are pretty well distributed...). Then *sort* the resulting list/vector/whatever in O(n log n). Comparisons can then be performed in O(n), giving an overall O(n log n) time complexity (which is optimal for the problem).
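
As a sketch of the sort-then-scan idea, assuming the hasher writes one "digest path" pair per line (the input format and the trimming are just assumptions; the digest is treated as an opaque string, whether SHA-1 hex or CRC64 hex):

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<std::string, std::string>> files;  // (digest, path)
    std::string digest, path;
    while (std::cin >> digest && std::getline(std::cin, path)) {
        path.erase(0, path.find_first_not_of(" \t"));         // drop the separator
        files.emplace_back(digest, path);
    }

    std::sort(files.begin(), files.end());                    // O(n log n)

    for (std::size_t i = 1; i < files.size(); ++i)            // O(n) scan
        if (files[i].first == files[i - 1].first)
            std::cout << files[i - 1].second << " == " << files[i].second << '\n';
}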

For the second, there are quite a few approaches to defining a "distance" between images (when I did 3D computer vision, the actually useful difference functions turned out not to be actual distance functions; some weren't even transitive, so don't be fooled by the term). A search on the topic will provide a lot of information (stereo vision has to deal with this a lot). Given these amounts of images, you perhaps also want to perform a clustering (something better than k-means...). Just google it, you will find plenty of algorithms. (Pattern recognition techniques might help, too. I had some pretty neat results for texture patterns with the feature space described by M. Unser, I don't know where to get his paper, though.)
I need to find similar images. The algorithm is already done and it works adequately, considering how simple it is.
The problem with fancy algorithms is that in this case, n is VERY large. Just a few hours ago, I compared 34,700 images, which amounts to 602,027,650 comparisons at an average speed of 128,693 cmp/s. And that's using both CPUs.
Sure, the algorithm isn't perfect, and mostly-gray images, such as drawings, have a surprisingly high likeness, but I'd prefer not to spend a whole month looking for duplicates.
Your approach, as I understand it, has a fairly large constant embedded in the comparison of images, which you perform pair-wise, i.e. in O(n^2). Computing a feature-space representation of the images which allows for sorting leads to an O(n log n) algorithm, and the constant is embedded in an O(n) step (namely, computing the feature-space representation of each image). Thus, even if the constant is considerably larger, the approach will be faster for large numbers of images.
Furthermore, what does make images "similar" for your application? In most cases, actual color/gray values aren't very accurate (two equal images exposed at a difference of 1/3 of an exposure value will be rated completely different, while a human who doesn't see the images directly next to one another won't know the difference). A better approach might be comparison in a frequency space (a Fourier, discrete cosine, or wavelet transform might be suitable), or of the images convolved with the Sobel kernel, allowing the comparison of edges alone (well, gradients).
Note that this adds preprocessing work on the order of n (one transform per image), which is negligible next to an O(n^2) algorithm.
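
A minimal sketch of what a frequency-space comparison could look like, assuming 32x32 grayscale thumbnails and a naive 2D DCT-II with only the 8x8 low-frequency block kept (all of these sizes are illustrative; the DC term is skipped so that overall brightness/exposure doesn't dominate the distance):

#include <array>
#include <cmath>

constexpr int N = 32;   // thumbnail side
constexpr int K = 8;    // low-frequency block kept for comparison

using Thumb = std::array<float, N * N>;      // grayscale values, row-major
using Spectrum = std::array<float, K * K>;   // low-frequency DCT coefficients

Spectrum lowFreqDCT(const Thumb& t) {
    const double pi = 3.14159265358979323846;
    Spectrum s{};
    for (int u = 0; u < K; ++u)
        for (int v = 0; v < K; ++v) {
            double sum = 0.0;
            for (int y = 0; y < N; ++y)
                for (int x = 0; x < N; ++x)
                    sum += t[y * N + x]
                         * std::cos((2 * x + 1) * u * pi / (2.0 * N))
                         * std::cos((2 * y + 1) * v * pi / (2.0 * N));
            s[u * K + v] = static_cast<float>(sum);
        }
    return s;
}

// Smaller result = more similar in the low-frequency sense.
float spectralDistance(const Spectrum& a, const Spectrum& b) {
    double d = 0.0;
    for (int i = 1; i < K * K; ++i)          // skip [0]: the DC term (mean brightness)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return static_cast<float>(std::sqrt(d));
}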
Sorry, but I don't understand what you mean. I've pieced together my limited understanding of complexity analysis from various sources, none of them formal, so please bear with me.
My algorithm takes O(n^2) time because it has to do all those comparisons, correct? How could a different algorithm reduce the number of comparisons?

If you must know, "likeness", as defined by my algorithm, is an average of the HSV differences (simple subtraction is of course not enough to calculate the difference between two hues) between two bitmaps of the same size, with lower results indicating higher likeness. These bitmaps are reduced versions of the originals (usually 32x32 or 16x16, regardless of original dimensions), both to cut memory usage and calculation time and to get around problems such as the same image appearing twice at different sizes or resamplings, which I knew was likely to happen.
Suppose you wanted to determine whether, in a given (multi-)set of integers, two are "similar", as in |a-b| < x. Then you don't have to test every pair a, b from your set. Instead, you can sort the elements of the set. Then, if |a1-a2| >= x, |a1-ak| < x cannot be true for any k > 2. Therefore, you only need time complexity O(n log n) for sorting and O(n) for comparisons (you have to loop over the sorted sequence once), i.e. O(n log n) overall complexity.
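
For numbers, the idea looks like this (just a sketch):

#include <algorithm>
#include <cstddef>
#include <vector>

// Returns true if some pair of values differs by less than x.
// After sorting, the closest pair is always adjacent, so one linear pass suffices.
bool hasSimilarPair(std::vector<long> v, long x) {
    std::sort(v.begin(), v.end());              // O(n log n)
    for (std::size_t i = 1; i < v.size(); ++i)  // O(n)
        if (v[i] - v[i - 1] < x)
            return true;
    return false;
}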

My suggestion was that you define a function φ: I->Q from all images to the set of rational numbers, where φ is a "homomorphism" according to the equality relation (of course, a relation isn't an operation, so homomorphism doesn't really make sense, but I think you get the idea), basically reducing your problem to the one above.

PS
With the question about similarity, I didn't mean the formula you are using, but *your* definition of similar images. What kind of images are they? E.g., is a photograph of a page of text similar to the photograph of a different page of text? Is a portrait similar to any other portrait, or only to ones of the same person, or only to ones of the same person taken at the same distance with the same camera settings etc., with a different noise pattern? The definition of image similarity varies widely. You should know what you want (and define it clearly), otherwise you won't succeed. Perhaps you should sort some images "by hand", trying to figure out what makes them "similar".
Oh, sorry about that. Well, similar would be, for instance, the same image rescaled to a different size or with slightly different colors. For example, two photos of La persistencia de la memoria with the only difference being the resolution (in reality, it's likely that picture A was taken at high resolution and then shrunk to make picture B). This is detected correctly.
Often, the same wallpaper will come in different resolutions, with its elements redistributed to better fit the size. Imagine that at 1024x768 you have Superman in the middle, but at 1366x768 Superman is the exact same size and still in the middle. These cases are not detected well, but I would like them to be.
Another case that is poorly detected, and that I would also ideally classify as similar, is when picture A is a close-up of picture B.

It'd be great if you could mention a few algorithms. This whole clustering stuff is completely new to me, so I don't know where to start looking.
There are different matching strategies for size-invariant, rotation-invariant, color-invariant etc. comparisons. Unfortunately, there isn't an algorithm which does all at the same time.

You cannot (efficiently) find exact matches. Instead, determine features of your image (e.g., the average green value of the pixels in the center 10x10 px box, or whatever else you can think of that gives good results).
One of the semantically most important parts of an image are the edges; use an edge detection (e.g. the Canny operator), or at least a gradient image (Sobel, Prewitt, Roberts), and compute features of the resulting images. Then create a vector of all the different features you computed (use more features at first; in the end you can determine which are actually important for your images and eliminate the rest, e.g. by using the eigenvectors of the covariance matrix as a new coordinate system for your feature vectors).
The result is a feature vector for each image. Use the metric induced either by the usual vector norm or by the minimum norm to calculate the distance between the images in feature space (or do both and report both as "possible matches"). If you have chosen the right features, similar images will map to similar feature vectors, on which you can then perform a clustering (if you haven't done anything with clustering at all, try out k-means first. It is very simple and you can play around with it. When you figure out in which situations it doesn't work, look for a clustering method explicitly addressing this problem).

From what you said, I would try as features:
- Average value of each color channel, for a grid of overlapping rectangles (e.g. [(0,0),(20,20)],[(10,10),(30,30)] etc.)
- Compute the Sobel-filtered image. Go from the center of the image to the right until the gray value is > x. Do the same for left, bottom & top. The values (right-left, top-bottom) describe the size of the bounding box of what might be the "most significant object" in the middle of the image (you might want to do the same for the golden ratio or other significant points of the "average" image).
- The average gray value of the Sobel filtered image
- If you perform edge detection: the sum of the length of all edges, average edge length, maximal edge length

If images overlap in parts, the average color values of the rectangles will match (well, some of them), so you will detect images embedded in other images using the minimum norm.
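
A rough sketch of such a feature extraction (the grayscale Image type, the 20x20 boxes with a 10 px stride, and the plain Euclidean distance are illustrative assumptions; for the distance to be meaningful, the images should first be scaled to a common size so that the vectors have equal length):

#include <cmath>
#include <cstddef>
#include <vector>

struct Image {                       // grayscale for simplicity; assumes w, h >= 3
    int w, h;
    std::vector<float> px;           // row-major, w*h values in [0,1]
    float at(int x, int y) const { return px[y * w + x]; }
};

// Average value inside each box of a grid of overlapping 20x20 boxes (stride 10).
std::vector<float> boxAverages(const Image& img) {
    std::vector<float> f;
    for (int y = 0; y + 20 <= img.h; y += 10)
        for (int x = 0; x + 20 <= img.w; x += 10) {
            float sum = 0.f;
            for (int dy = 0; dy < 20; ++dy)
                for (int dx = 0; dx < 20; ++dx)
                    sum += img.at(x + dx, y + dy);
            f.push_back(sum / (20.f * 20.f));
        }
    return f;
}

// Mean Sobel gradient magnitude -- a single "how edgy is this image" number.
float meanSobel(const Image& img) {
    double sum = 0.0;
    for (int y = 1; y + 1 < img.h; ++y)
        for (int x = 1; x + 1 < img.w; ++x) {
            float gx = img.at(x + 1, y - 1) + 2 * img.at(x + 1, y) + img.at(x + 1, y + 1)
                     - img.at(x - 1, y - 1) - 2 * img.at(x - 1, y) - img.at(x - 1, y + 1);
            float gy = img.at(x - 1, y + 1) + 2 * img.at(x, y + 1) + img.at(x + 1, y + 1)
                     - img.at(x - 1, y - 1) - 2 * img.at(x, y - 1) - img.at(x + 1, y - 1);
            sum += std::sqrt(gx * gx + gy * gy);
        }
    return static_cast<float>(sum / ((img.w - 2) * (img.h - 2)));
}

// Euclidean distance between two feature vectors of equal length.
float featureDistance(const std::vector<float>& a, const std::vector<float>& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return static_cast<float>(std::sqrt(d));
}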
Okay, I think I (mostly) understand what you said. One question, though: is it preferable to have simple values for clustering, or is something more complex, say a histogram, also feasible?

Now that you bring up edge detection, I had an idea last night.
I found an algorithm for corner detection, FAST, that gives good, stable results even for rotated images. So I thought I could find all the corners, build a graph by connecting each point to its neighbors within a radius of, say, 50 px (failing that, to the nearest one), and use the relative distances between the points to do comparisons. The good part is that it's rotation-resistant and could be used to find one image inside another (e.g. feed it a piece of text and find images that contain something that looks like it; it's purely theoretical, but it should be possible). It's probably very expensive, but it should be pretty accurate.
Another option would be to partition the image into squares and search for each square. That would probably be faster and could produce fuzzier results. For example, if one image is basically another with its elements rearranged, it could find A is almost exactly like B, but B is not exactly like A (because it contains more elements or something).
Yet another possibility, since points tend to conglomerate in certain places, is to build a second graph with these clusters as nodes.

It's not perfect, but I think it's not a bad idea. What do you think?
is it preferable to have simple values for clustering, or is something more complex, say a histogram, also feasible?

The main problem here is finding "good" relations - e.g., a strict weak ordering is required for sorting (and thus for speeding up the process considerably, from O(n^2) to O(n log n)). If you manage to define such relations in a meaningful manner, then complex attributes aren't a problem. (However, how would you define such a relation for histograms? A lexicographical comparison doesn't make sense; what you will most likely end up with is something like the average brightness level, disguised in a complicated formula.)
Also bear in mind that for k-means you require a (semi-)metric. While you can interpret histograms as K-dimensional vectors, and these do form a metric space, that metric most likely doesn't make sense for your application (and the metric is the most important thing for you: after all, what you want to do is measure the "distance" between two images: the distance is 0 if and only if the compared images are the same; distances are not negative; d(i1,i2) = d(i2,i1), i.e. it is symmetric; and d(i1,i2) + d(i2,i3) ≥ d(i1,i3). Thus your similarity function in the end *is* a metric, and therefore you want a metric feature space in which the distance has the same meaning.)

[...] It's not perfect, but I think it's not a bad idea. What do you think?

I think you should try it. Don't be disappointed if it doesn't give you the expected results - image analysis is, as I understand it, as much a mathematically based science as it is experimentation. The Sobel operator, for example, has mathematical properties which make it "good" (it eliminates noise along the edge with a Gaussian filter while differentiating orthogonally to the edge), but it wouldn't have been found if experiments with Prewitt's filters hadn't shown such bad results on real-life images, where noise actually is a major factor (even if not visible).
So what I would do is start with something like what you have described, and then try to figure out in which situations it works and in which it doesn't. Then you can try to build a mathematical model based on that, and optimize your approach within the model.