Suppose you have a big collection of image files and you want to find which ones "look like" one particular file. There are basically three ways to define what "looks like" means, each with its own strengths and weaknesses, but the basic idea behind all of them is the same: a function that, given an operand image and a database, returns a set of images along with the probability that each one is a duplicate of the operand.
1. Binary approach.
This involves treating the image as a file and searching for other files that are binary copies of it. Some methods consider the file as a monolithic entity; other, smarter ones use only the actual pixel information, allowing them to recognize as duplicates, for example, two identical JPEGs with modified metadata fields.
The advantage of this approach is that it's very efficient to both index an image (pass the file through a hash function, such as MD5 or SHA-1) and to query a database for matching entries; the obvious disadvantage is that it will miss near duplicates.
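For illustration, here's a minimal sketch of the pixel-level variant in Python, assuming Pillow is available to decode the image; the file names and the library choice are just examples, not what my prototype uses.

```python
# Minimal sketch of the binary approach: hash the decoded pixel data rather
# than the raw file, so identical images with different metadata still match.
# Assumes Pillow (PIL) is installed; file paths are placeholders.
import hashlib
from PIL import Image

def pixel_hash(path):
    """Return a SHA-1 digest of the decoded pixel data of an image."""
    with Image.open(path) as img:
        # Normalize the mode so the same pixels always produce the same bytes.
        data = img.convert("RGB").tobytes()
    return hashlib.sha1(data).hexdigest()

# Two JPEGs that differ only in metadata fields hash to the same value:
# pixel_hash("photo.jpg") == pixel_hash("photo_with_edited_metadata.jpg")
```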
2. Histogram-based approach.
The basic idea behind this approach is to let the program recognize duplicates when one of the operands has gone through minor alterations that don't significantly change the appearance of the image, such as rescaling, minor color changes (e.g. what happens when you re-compress a JPEG stream), etc. There's a lot of room for creativity in this category, but one possible method (sketched in code after the list) could be:
* Rescale the image to a standard size. The resampling function has to be carefully chosen to avoid losing too much information in this step, or to avoid creating information.
* Reduce the colors to a standard bit depth. Again, the function that maps the old color space to the new one must be chosen carefully.
* Get a histogram of the image.
* Hash the histogram.
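Here's a rough sketch of those four steps in Python, assuming Pillow; the standard size, the reduced bit depth and the final hash function are arbitrary example choices.

```python
# Rough sketch of the histogram method above, using Pillow. The standard size,
# the reduced bit depth and the final hash are arbitrary example choices.
import hashlib
from PIL import Image

def histogram_hash(path, size=(256, 256), bits_per_channel=4):
    with Image.open(path) as img:
        # 1. Rescale to a standard size with a high-quality resampling filter.
        img = img.convert("RGB").resize(size, Image.LANCZOS)
        # 2. Reduce the colors to a standard bit depth (here 4 bits per channel).
        shift = 8 - bits_per_channel
        img = img.point(lambda v: (v >> shift) << shift)
        # 3. Get the histogram of the image (256 counts per channel).
        hist = img.histogram()
    # 4. Hash the histogram so it can be stored and compared as a short key.
    return hashlib.sha1(",".join(map(str, hist)).encode()).hexdigest()
```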
The disadvantages are that it can take a lot of time and memory to process a large collection. Also, a poorly designed method can fail disastrously with monochrome images.
3. Feature-based approach.
Feature recognition is a technique in computer vision for finding interest points in an image and encoding their surroundings, so that two images can be compared. There are several algorithms capable of taking an image and returning a set of representative points. Some of them return points that are robust (sometimes called "invariant") against radical modifications, such as rotations. Using this, it's possible to construct a database of points able to recognize duplicate images even under hard modifications such as cropping (because interest points are independent of each other), rotations, viewpoint changes (when two photos of the same thing are taken from different positions; not all algorithms can handle this, though), color changes (exactly which types of changes depends on the algorithm), etc. The problems are several, and I'll tell you all about them.
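Before getting to those problems, here's a rough sketch of what extracting interest points looks like in practice. My prototype uses SURF (via OpenSURF), but the snippet below uses OpenCV's ORB detector simply because it ships with the default OpenCV build; treat it as an illustration, not as what my code does.

```python
# Sketch of extracting interest points with OpenCV. ORB is used here only as
# a stand-in for SURF, which requires OpenCV's non-free contrib modules.
import cv2

def interest_points(path, max_points=500):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    detector = cv2.ORB_create(nfeatures=max_points)
    # keypoints: positions in the 2D space of the image.
    # descriptors: one vector per keypoint in the high-dimensional space.
    keypoints, descriptors = detector.detectAndCompute(img, None)
    return keypoints, descriptors
```

Comparing two images then amounts to a nearest-neighbour search between their descriptor sets (e.g. with cv2.BFMatcher), which is exactly where the problems below start.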
1. Feature recognition is a very expensive operation. My Core 2 Duo could only process up to 0.5 images per second, which is much slower than any other approach. Luckily, these kinds of algorithms are very parallelizable, so GPU-based implementations are very fast, although you're limited by VRAM.
2. These algorithms return interest points in two spaces. One is the 2D space of the image; the other is typically a high-dimensional space (over 50 dimensions, usually). A vector in the latter describes an interest point in such a way that two points from the same region (e.g. someone's left iris) in two images, where one is a slightly modified version of the other, will very likely be close to each other. Due to something called the "curse of dimensionality", it's remarkably hard to find which vectors in a "small" set of high-dimensional vectors (the definition of "small" grows exponentially with the dimensionality of the space) are close to a given vector. For example, to find the exact nearest neighbor of a vector in R^40 using a BSP tree, the search is very likely to degrade into an overly complex linear search, where every element in the tree ends up being evaluated. The same problem is shared by all space partitioning methods.
There is a technique, which I frankly don't fully understand, called locality-sensitive hashing, which solves the approximate nearest neighbor problem by making it likely that points that are close to each other (usually, I've read, under the L1 metric) will end up in the same hash bucket. I won't go into the details, because I'm almost sure I misunderstood them, but I will tell you what I did. If you have a 64-dimensional cube bounded by -1 and 1 in every dimension, it's possible to slice it into 2^128 pieces (each dimension is sliced into 4 regions, and 4^64 = 2^128). This is easy to store in a database, and it means close points will be either in the same region or in an adjacent one (note that each region has 3^64 - 1 adjacent regions; but if you're willing to accept approximate results, the search can be restricted to 128 regions or fewer, and even tuned to test only the most likely regions).
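As a concrete illustration, here's a sketch of that slicing in Python, assuming each descriptor is a 64-dimensional vector with components in [-1, 1]; the clamping and the exact region boundaries are my example choices.

```python
# Sketch of the 128-bit spatial hash described above: each of the 64
# dimensions of a descriptor (assumed to lie in [-1, 1]) is quantized into
# 4 regions (2 bits), giving one of 4^64 = 2^128 bucket identifiers.
def spatial_hash(descriptor):
    """Map a 64-dimensional vector in [-1, 1]^64 to a 128-bit integer."""
    assert len(descriptor) == 64
    h = 0
    for value in descriptor:
        # Clamp, then map [-1, 1] onto the regions 0, 1, 2, 3.
        region = min(3, int((max(-1.0, min(1.0, value)) + 1.0) * 2))
        h = (h << 2) | region
    return h  # fits in 128 bits
```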
Which leads me to the point of the post:
3. Storing these values in a database for efficient query and insertion is very hard for large collections of images.
Some facts: a) The nature of the algorithm and its input turns it into a PRNG of sorts. b) On average, the uncompressed information on interest points requires about as much storage as the collection that generated it.
If points are added unsorted into a table, similar points will end up spread all over the file. For a large database (over 50 GiB in my case), this means that both when inserting into and querying an indexed table, a lot of time is spent seeking across different parts of the table. Sooner or later, seek time dominates insertion and retrieval times.
I haven't found any embedded database engine capable of handling this situation well. PostgreSQL starts out much slower than embedded databases, but scales much better. I managed to process my entire collection of 137k images (217M points) in less than a day with it. However, retrieval times aren't great, taking between one and six minutes per image (note that the time grows with the number of points the image generates), although I'm sure there are things that can be done to improve it. I haven't tested MySQL, but I doubt there would be much of an improvement. I've also considered using a custom file format that would work by hashing points into fewer and larger buckets (2^18 instead of 2^128). It should reduce the amount of seeking needed, while increasing the amount of data that is read unnecessarily. It's not clear how its size would compare to a B-tree's.
I never would have thought that this would be the hard part of the project.
I have a feeling that you are simply ignoring the project's requirements. What is the purpose of this? What is the likelihood that you'll have a copy of an image, rotated 1 rad and with a slightly different palette? And even if you do, do you really need to know that it is the same image?
The purpose is implementing a hybrid image search that will use all of the above techniques.
What is the likelihood that you'll have a copy of an image, rotated 1 rad and with a slightly different palette?
Just because it's unlikely doesn't mean that it should be ignored. However, a much more likely situation is having two images, where one of them is a cropped version of the other. It's possible to find the original from a 5% crop.
Actually, I've been interested in how computers do this for a while; interesting (frustrating for you, though). Is this a proprietary project? (i.e. can I see the source code? :3)
It's a prototype for an OSS project, but it's still too experimental to be worth sharing the source. If you want to play around, go get OpenSURF. You'll also need to download OpenCV.
Sounds very interesting. Just out of curiosity is your algorithm supposed to recognize when people cut or dye their hair, have plastic surgery, are wearing different clothes and are in different surroundings too? I assume that some modified version of the 'feature based approach' or possibly even a 4th different approach would be necessary. Both seem very complex and very deep into AI.
Although most of your post was above my head, it sounds very intriguing. I remember reading something about cryptography where some mathematicians devised an almost-brute-force approach by generating a list of probable keys (or probable encryption methods, I don't know which), which made cryptanalysis much more practical. It sounds like your problem might be related, i.e. you are brute-forcing, so to speak, all the possible interest points when it might suffice to just analyze a portion of the most likely candidates. Just me brainstorming; I don't really understand what you mean by 'interest point' or how you could devise a way to determine probable candidates, but I hope I could at least help you think of new ideas.
Just out of curiosity is your algorithm supposed to recognize when people cut or dye their hair, have plastic surgery, are wearing different clothes and are in different surroundings too?
Well, technically this is no longer considered duplicate image detection; it's a problem in biometrics. I presume face recognition algorithms rely on feature points with specialized descriptors.
I solved my access time problem by doing this: take the spatial 128-bit hash and pass it through SHA-1. This gives back five 32-bit integers. Take the first one and keep only the 18 least significant bits, giving an 18-bit hash with the property of distributing the spatial hashes uniformly among the 262144 buckets (the uniform distribution is a property of SHA-1 and cryptographic hashes in general). With so few buckets, it's easy and space-cheap to construct a hash table that indexes a flat binary file. This gives a point index capable of O(1) access times.
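In code, the bucketing step looks roughly like this; how the 128-bit hash is serialized before SHA-1, and which byte order counts as the "first" 32-bit integer, are implementation details I'm glossing over here.

```python
# Sketch of the bucketing step: SHA-1 the 128-bit spatial hash, take the
# first 32-bit word, and keep its 18 least significant bits, yielding one of
# 2^18 = 262144 uniformly distributed buckets.
import hashlib
import struct

def bucket_index(spatial_hash_128):
    """Map a 128-bit spatial hash (as an int) to an 18-bit bucket number."""
    digest = hashlib.sha1(spatial_hash_128.to_bytes(16, "big")).digest()
    first_word = struct.unpack(">I", digest[:4])[0]  # first 32-bit integer
    return first_word & ((1 << 18) - 1)              # keep the low 18 bits
```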
Previous configuration:
* PostgreSQL
* All data in one table, arranged by points.
* Point data stored as SQL arrays of 64 floats.
* Indexing by B-tree on 128-bit hashes (at 217M points, averaging slightly more than 1 element per bucket).
* Database size: ~80 GiB
* Typical search times: 1-6 minutes.
Current configuration:
* SQLite
* All data in one table, arranged by images.
* Point data stored as DEFLATEd arrays of 64 bytes (points belonging to the same image are gathered and compressed as one stream).
* Indexing by custom 18-bit hash function (at 220M points, averaging ~800 elements per bucket).
* Database size: 14.3 GiB (note: this is the size for an ungrowable index; it takes 15 minutes to rebuild the index every time more images are added to the database. With a growable index, the database needs 15.4 GiB.)
* Typical search times: 5 seconds to 2 minutes.
I'm quite satisfied with the result. The best part is that the index (5 GiB) can be stored separately from the database (e.g. on solid state). A server could even conceivably keep it in RAM.
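For reference, here's a rough sketch of the storage side of that layout in Python: roughly one row per image, with that image's 64-byte descriptors concatenated and compressed with zlib's DEFLATE. The table and column names are made up for illustration.

```python
# Rough sketch of the current storage layout: one SQLite row per image, with
# all of that image's 64-byte descriptors gathered and DEFLATEd as a single
# stream. Table, column and file names are made up for illustration.
import sqlite3
import zlib

db = sqlite3.connect("points.db")
db.execute("CREATE TABLE IF NOT EXISTS images (id INTEGER PRIMARY KEY, points BLOB)")

def store_image(image_id, descriptors):
    """descriptors: iterable of 64-byte descriptor strings for one image."""
    blob = zlib.compress(b"".join(descriptors))
    db.execute("INSERT INTO images (id, points) VALUES (?, ?)", (image_id, blob))

def load_descriptors(image_id):
    row = db.execute("SELECT points FROM images WHERE id = ?", (image_id,)).fetchone()
    raw = zlib.decompress(row[0])
    return [raw[i:i + 64] for i in range(0, len(raw), 64)]
```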
I can't wait for the day that my personal projects sound this complex (or are this useful). Do post a link to your work when you are satisfied with it :)