May 19, 2025
Technical Deep Dive
Perceptual Hashing in Node.js with sharp-phash
At Brand.dev, we're building an API that helps you fetch company brand data—like name, address, logos, colors, and more—from any domain with a single API call.
Because we process tons of images every day, finding the best way to automatically de-dupe logos has been a key part of our journey. In this post, we’re diving deep into one of the most effective image deduplication techniques we’ve tested (and now use in production): perceptual hashing. Here’s how it works, why we chose it, and how you can use it too.
If you'd like to try out our API completely free, click here.
Enjoy 😊
Perceptual hashing is like a "fuzzy fingerprint" for media files – especially images. It produces a hash value that stays similar for images that look similar, rather than changing completely when a single pixel changes (as a cryptographic hash would). In other words, perceptual hashes are designed to change as little as possible for similar inputs, the exact opposite of cryptographic hashes which exhibit an avalanche effect.
This makes perceptual hashing incredibly useful for tasks like finding duplicate or near-duplicate images. In this blog post, we'll dive into what perceptual hashing is, why it's useful, and how you can implement it in Node.js using the sharp-phash library. We'll walk through generating and comparing perceptual hashes, explore real-world use cases (from catching meme reposts to cleaning up photo libraries), design an image deduplication pipeline, and discuss limitations and best practices. Grab your favorite beverage, and let's get hashing!
What is Perceptual Hashing (pHash)?
Traditional hashing algorithms (like MD5 or SHA-256) are built to generate completely unpredictable, unique outputs – even a tiny change in input (say, flipping one pixel) will radically change the hash. This property is great for security and data integrity, but not so great if you want similar inputs to have similar outputs.
Perceptual hashing flips that script. A perceptual hash is a type of locality-sensitive hash designed so that if two images are visually similar, their hashes will also be similar. For example, two copies of a meme with different compression levels or a slight color filter might end up with almost the same perceptual hash. This allows us to compare images based on content appearance rather than exact binary data.
Under the hood, perceptual image hashing algorithms condense the important visual information of an image into a compact fingerprint. There are a few different algorithms (average hash, difference hash, etc.), but one popular technique is the pHash algorithm (short for perceptual hash).
The pHash algorithm, as implemented by sharp-phash, uses a bit of math and signal processing magic. In simple terms, it might resize and normalize the image (often converting to grayscale and scaling down), then perform a Discrete Cosine Transform (DCT) to analyze the image's frequencies (similar to how JPEG compression works). The algorithm then picks out the most significant frequency components and turns them into a binary pattern – a string of bits. The result is typically a 64-bit hash (often represented as a 64-character string of 0s and 1s) that encapsulates the image's looks. Images that look alike will produce hashes that differ by only a few bits, whereas very different images will have hashes that vary widely.
Why is this useful? Because it gives us a way to quantitatively measure image similarity. By comparing two images' perceptual hashes, we can compute a distance – usually the Hamming distance, which is simply the count of bit positions where the two hashes differ. If the distance is small (e.g. just a few bits differ), the images are likely very similar or nearly identical. If the distance is large, the images are probably different. This ability to hash content instead of exact data opens up all kinds of possibilities for detecting duplicates, finding similar images, and more.
Setting Up sharp-phash in Node.js
Now that we know what perceptual hashing is, let's get hands-on and use it in Node. One convenient library for image pHash in Node is sharp-phash. It's built on the popular Sharp image processing library, so under the hood it handles all the image decoding and pixel crunching for us.
Installation: To get started, you need to install both sharp and sharp-phash (since sharp-phash depends on Sharp for image processing). You can add them to your project via npm or Yarn:
npm install sharp sharp-phash
or yarn add sharp sharp-phash
Make sure you've installed the prerequisites for Sharp (on some systems, Sharp may require certain binaries or libraries like libvips – consult the Sharp docs if you hit any install issues). Once installed, you're ready to generate some hashes!
Generating and Comparing Perceptual Hashes in Node.js
Let's write a simple Node.js snippet to generate perceptual hashes for images and compare them. We will use sharp-phash to compute the hash and a helper function it provides to measure the Hamming distance between two hashes.
A few things to note in this code:
- We use fs.readFileSync to read the image files into memory. You could also use asynchronous reads or even pipe a Sharp stream, but a simple synchronous read is fine for demo purposes.
- The phash function (imported from sharp-phash) returns a Promise that resolves to the hash string. We await it to get the result.
- The resulting hash1 and hash2 are strings consisting of 0 and 1 characters. This is the binary fingerprint of the image. By default, sharp-phash gives a 64-bit hash (hence 64 characters) for each image.
- We then use sharp-phash/distance (the library's distance function) to get the Hamming distance between the two hash strings. The Hamming distance is just the count of positions at which the two hash strings differ (i.e., the number of bits that are different).
Now, what do we do with this distance? This number is a measure of how different the images are visually. A distance of 0 means the hashes are identical (the images are likely identical or extremely similar). A small distance (say 1, 2, up to a few bits) indicates the images are very close in content. As the distance grows, the images share fewer similarities. In practice, you'll choose a threshold distance to decide if images count as "the same" for your use case. For example, using pHash on standard images, a Hamming distance ≤ 5 or so might indicate near-duplicates. In our code above, we would see the distance printed out – if image1.jpg and image2.jpg are just re-encoded versions of the same picture or have minor edits, the distance will likely be quite low (under 5). If they are completely different images, the distance will be larger (often dozens of bits different out of 64).
Example: Suppose image1.jpg is a meme image and image2.jpg is the same meme converted to PNG or with a slight caption change. The perceptual hashes might look like 101001100110... vs 101001100010... – very close. The Hamming distance might be something like 2 or 3. On the other hand, if image2.jpg was a totally different picture (say a cat photo vs a dog photo), their hashes would have many bits different and the distance could be 20, 30, or more.
With sharp-phash, comparing is straightforward: generate hashes and then use the distance function. You could also compute distance manually by XORing the bit strings and counting 1s (that's essentially what the library does internally). But having distance() handy is convenient, and it saves us from writing our own bit-counting loop (though for learning purposes, writing a quick Hamming distance function in JavaScript is easy too).
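For the curious, such a function really is just a short loop (this sketch assumes equal-length bit strings, like the ones sharp-phash produces):

```javascript
// Count the positions at which two equal-length bit strings differ
function hammingDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error("Hashes must be the same length");
  }
  let dist = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) dist++;
  }
  return dist;
}

console.log(hammingDistance("10100110", "10100010")); // → 1
```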
Real-World Use Cases for Perceptual Hashing
So, what can we do with these fuzzy image fingerprints? It turns out, a lot! Perceptual hashing is used in many real-world systems where identifying similar images is important. Let's go through some use cases that highlight why pHash is so useful:
1. Detecting Duplicate or Near-Duplicate Memes in Social Apps
If you've ever scrolled through a social media app and thought, "I've seen this meme 10 times today," you're not alone. Memes get screenshotted, reposted, slightly edited (maybe a new caption or a slight crop), and circulated ad nauseam. Using perceptual hashes, a social app can automatically flag or group these near-duplicate memes. For example, if user A posts a popular meme image and user B posts the same meme with a minor edit (like a different text font or a small sticker added), the perceptual hashes of those images will be very close. By comparing the hash of the newly uploaded image to hashes of known memes, the app can detect the repost. This could be used to avoid showing the user duplicate content, to auto-tag the new post as a "known meme", or even just to track how a meme evolves as it spreads.
Consider that a common strategy for de-duplicating images is to compute perceptual hashes for each image and compare those hashes with each other. These hashes are tiny compared to the images themselves (often just 32-128 bits), and if the hashing function is good then hashes that differ by only a few bits likely correspond to similar images. In our meme example, two screenshots of the same meme template will yield hashes that differ by very few bits. A system can easily spot that and treat them as duplicates or siblings. This is way more reliable than naive methods like comparing file names or file sizes, and far more flexible than exact hashing (MD5/SHA) which would consider two images different if anything is changed (even re-saving the JPEG with different compression would break an exact hash match).
2. Content Moderation and Flagging Reposts
Perceptual hashing isn't just for fun; it's also used in serious applications like content moderation. A classic example is the fight against inappropriate or banned images. Organizations and platforms often maintain databases of hashes of illegal or banned content (for instance, child abuse images or extremist propaganda). By using perceptual hashes, they can automatically flag uploads of known harmful images even if they've been resized, cropped, recolored, or lightly edited. A notorious image could be converted to grayscale or have a logo added, and a cryptographic hash would no longer match the original. But a perceptual hash can still catch it, because the core visual content remains the same.
A well-known system in this domain is Microsoft's PhotoDNA , which is essentially a perceptual hashing technique for identifying known illegal images. It was developed in 2009 to combat child exploitation images and is provided to many tech companies and law enforcement agencies. Likewise, platforms like social networks use perceptual hashes to prevent banned images from being re-uploaded after minor edits. Even beyond illegal content, moderation teams might use it to spot things like people trying to evade an image ban (e.g., if an offensive meme is removed, and someone tries to upload it again with a tiny change, a perceptual hash system could catch that).
Another moderation angle is spam or repost detection. If someone is flooding a forum with the same image, or if a bot is scraping and re-uploading popular images, perceptual hashes can identify those reposts. Many content-sharing sites use this to suggest "Hey, we think this image was posted already – here’s the link," or to auto-collapse duplicate posts.
3. Organizing Large Photo Libraries
If you've ever managed a large collection of photos (personal or enterprise), you know duplicates and similar shots are a pain. You might have burst shots on your phone that are almost identical, or edited 100 versions of the same photo, or the same picture stored in different formats. Perceptual hashing can help organize and de-clutter photo libraries by finding groups of similar photos.
For instance, suppose you have a folder with 10,000 pictures. Some are exact duplicates (copies), and others are near-duplicates (you edited a photo or exported it in a different resolution). Running perceptual hashing on all of them and then comparing hashes can cluster these images. You might discover that out of those 10,000 files, only 8,000 are unique scenes and 2,000 are duplicates/variants. Photo-management software could use this to prompt you: "We found 5 sets of duplicate photos, do you want to eliminate extras?" or automatically consolidate storage. The nice thing is the hash sees past minor differences – maybe you have one photo in color and one in black & white, but if they are the same scene, a pHash (which usually ignores color) will consider them similar. This way, you can group them together.
In professional settings, think of stock photo libraries or media asset management: perceptual hashes could help ensure you don't store 15 copies of the same high-res image under different names, saving storage and making searches more efficient.
4. Avoiding Redundant Uploads in Cloud Storage
This use case is like a mix of the above, but specifically for cloud services or any system where users upload images. Suppose you're building a cloud storage service, or an image hosting platform. Users might unknowingly upload the same image that already exists in the system (or even two users uploading the same popular image). With perceptual hashing, you can detect these cases and avoid storing duplicate data. Instead of saving two copies, you might just keep one and use references, or at least alert the user "Hey, this image looks similar to one you've already uploaded."
Another scenario: imagine a website generator or CMS where users might upload a company logo multiple times into different sections. The system could detect it's the same image and just reuse the one stored copy to optimize bandwidth and storage.
For cloud storage efficiency, perceptual hashes can be part of a deduplication pipeline where every incoming image is hashed and checked against existing hashes in a database. If a match or near-match is found (within a threshold), the system can decide to skip storing a new file. This saves space and also helps with consistency (everyone referencing the same image asset). It's like having a smart filter that says "we've seen this picture (or something very close to it) before."
Building an Image Deduplication Pipeline with pHash
Let's put some of those ideas into practice conceptually. How would you build a system to automatically detect and handle duplicate images using perceptual hashing? Here's a high-level guide:
1. Compute and store hashes for incoming images: Whenever a new image is added to your system (uploaded by a user, added to a library, etc.), immediately compute its perceptual hash (using sharp-phash or similar). This should be done as part of your image processing pipeline. For example, in an upload endpoint, after receiving the file you might do const newHash = await phash(fileBuffer); and store this hash along with the image's record (e.g., in your database or an in-memory index). The storage is tiny (64 bits per image), so even millions of hashes won't take much space – and checking hashes is much faster than comparing full images.
2. Compare to existing hashes: Before finalizing the upload or saving a second copy of an image, compare the new image's hash against hashes of images you already have. The naive approach is to linearly scan through all existing hashes and compute Hamming distances. This can actually work fine if you don't have a massive number of images (a few thousand or even a hundred thousand can be manageable with a proper index). As an example, you could loop through your stored hashes and use distance(newHash, oldHash) for each to find the smallest distance. If any distance is below a chosen threshold (meaning the new image is very similar to something already stored), you've found a duplicate (or at least a suspected duplicate).
3. Use a threshold to decide on duplicates: The threshold will depend on how strict you want to be. If you set threshold = 0, you're only catching exact hash matches (which will catch images that produce identical hashes). If you set threshold a bit higher (like ≤ 5 bits difference), you'll catch images that are not pixel-for-pixel identical but still essentially the same content (resized, minor edits, etc.). For instance, if you found a distance of 2 between newHash and some oldHash in your DB, that's a strong indicator they're near-duplicates. At that point, your pipeline can decide to flag it. You might reject the upload as a duplicate ("Looks like you already uploaded this image earlier") or just log it for manual review, or perhaps link the user to the existing image rather than storing another copy.
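Steps 1–3 can be sketched as a tiny in-memory check. Everything here (the hamming helper, the storedHashes map, the THRESHOLD value) is illustrative, not library API; in a real pipeline you'd likely use sharp-phash/distance and a database:

```javascript
const THRESHOLD = 5; // max differing bits to still treat two images as duplicates

// Same metric as sharp-phash/distance: count of differing bit positions
function hamming(a, b) {
  let d = 0;
  for (let i = 0; i < a.length; i++) if (a[i] !== b[i]) d++;
  return d;
}

const storedHashes = new Map(); // imageId -> 64-bit hash string

// Returns the id of a stored near-duplicate, or null if none is close enough
function findDuplicate(newHash) {
  for (const [imageId, oldHash] of storedHashes) {
    if (hamming(newHash, oldHash) <= THRESHOLD) return imageId;
  }
  return null;
}

// Simulated pipeline: store one hash, then check an almost-identical upload
storedHashes.set("logo-1", "1010".repeat(16)); // 64-bit stand-in hash
const incoming = "1010".repeat(15) + "1011";   // differs by only 1 bit
console.log(findDuplicate(incoming)); // → "logo-1"
```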
4. (Optional) Use a faster search structure for many images: If your image collection is very large (say millions of images), scanning through each hash for every upload could become a bottleneck. This is where clever data structures like a BK-tree come in. A BK-tree (Burkhard-Keller tree) is specifically designed for distance-based lookups (like Hamming distance). It can drastically speed up finding "close" hashes without checking every single one.
The idea is you build a tree of hashes where each node branches by distance, allowing you to query "find any hashes within X distance of this new hash" efficiently.
There are Node.js implementations (for example, the bktree-fast package) that can help with this. However, if performance isn't an issue yet, you can keep it simple with a loop or using a Set/Map for exact matches.
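As a rough illustration of the idea (this is not bktree-fast's API), a minimal BK-tree over bit-string hashes might look like:

```javascript
// Hamming distance between two equal-length bit strings
function hamming(a, b) {
  let d = 0;
  for (let i = 0; i < a.length; i++) if (a[i] !== b[i]) d++;
  return d;
}

class BKTree {
  constructor() { this.root = null; }

  add(hash) {
    if (!this.root) { this.root = { hash, children: {} }; return; }
    let node = this.root;
    for (;;) {
      const d = hamming(hash, node.hash);
      if (d === 0) return; // already stored
      if (!node.children[d]) { node.children[d] = { hash, children: {} }; return; }
      node = node.children[d];
    }
  }

  // Find all stored hashes within maxDist of the query hash
  search(hash, maxDist) {
    const results = [];
    const stack = this.root ? [this.root] : [];
    while (stack.length) {
      const node = stack.pop();
      const d = hamming(hash, node.hash);
      if (d <= maxDist) results.push(node.hash);
      // Triangle inequality: only branches with edge distance in [d - maxDist, d + maxDist]
      for (let i = d - maxDist; i <= d + maxDist; i++) {
        if (node.children[i]) stack.push(node.children[i]);
      }
    }
    return results;
  }
}

const tree = new BKTree();
tree.add("10101010");
tree.add("10101011");
tree.add("01010101");
console.log(tree.search("10101010", 1)); // finds only the two near matches
```

The pruning step is what makes this faster than a linear scan: whole subtrees are skipped whenever the triangle inequality proves they can't contain a close-enough hash.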
5. Handle the duplicates: Once you detect a duplicate or near-duplicate, decide what to do. Possible actions:
- Skip storing the new image and simply reference the existing one (saving space).
- Store it but mark it as a duplicate of X (which could be used to show "hey, this image already exists in the system").
- Alert the user (if it's something like a submission that should be unique).
- For a photo library, maybe automatically group the new image with the existing one (so the user sees one entry with two versions).
6. Periodic maintenance (optional): Over time, you might want to periodically re-hash images (if you change your hash algorithm or parameters) or scan through all hashes to find duplicate clusters that slipped through. But generally, once stored, the hashes can be reused indefinitely.
By following a pipeline like this, you effectively create a deduplication system. Every image gets a fingerprint on arrival, and that fingerprint is used to catch duplicates quickly. Remember, comparing two 64-bit hashes is extremely fast (just an XOR and a bit count, which computers do in nanoseconds). On modern CPUs the comparison boils down to a couple of instructions (an XOR followed by a popcount). So even scanning thousands of hashes is usually fine.
One real-life takeaway: similar images will have similar hashes. So if you see a hash in your database that is exactly the same as the new one, it's a dead giveaway of a duplicate (or perhaps the exact same image file uploaded again). If the hash is only a few bits off, it's likely the same image with minor changes. This strategy is widely used because perceptual hashes are small and efficient to compare, making them ideal for deduping tasks.
Limitations and Considerations
Perceptual hashing is powerful, but it's not magic. As a developer implementing pHash in Node, keep these considerations in mind:
Not a cryptographic identifier: Do not use perceptual hashes for security-critical checks or as unique IDs for content. Different images can sometimes produce the same perceptual hash (a collision), though it's rare in practice. The design goal is similarity, not uniqueness. If you need to guarantee two files are exactly the same, use a cryptographic hash (like SHA-256). In fact, if your goal is to detect byte-for-byte duplicates (exact copies), perceptual hashing is overkill – a cryptographic hash will do that perfectly. Use pHash when you care about visual similarity, not exact matching.
False positives/negatives: Because perceptual hashes deliberately ignore certain differences (like tiny color changes or minor crops), you might get false positives – images that are actually different but happen to have similar pHashes. Conversely, it's possible (though not common) for two images that look alike to have a larger hash distance than expected due to algorithm quirks. Tuning your Hamming distance threshold is important. You might start with a conservative threshold (like 5 or 10 for a 64-bit hash) and adjust based on what you observe in your dataset.
Invariance (or lack thereof) to transformations: Most basic pHash implementations (including sharp-phash) are robust to things like compression artifacts, resizing, blurring, and color shifts. They are less robust to geometric transforms. If someone rotates an image 90 degrees or flips it horizontally, the hash will likely change a lot (since the pixels move around). Cropping or adding borders can also throw off the hash because the image's overall composition changes. So if your use case involves rotated or heavily cropped duplicates, you may need to handle those separately (e.g., by rotating images to a canonical orientation before hashing, or using more advanced algorithms that are rotation-invariant). In general, pHash assumes the images are roughly aligned the same way.
Grayscale and color: Perceptual hashing often works on grayscale images (to focus on structure and ignore color). This means it treats a color image and its grayscale version as essentially the same. That’s usually what we want for duplicate detection. But be aware: two images that differ only in color (say one is in full color and one has a blue tint) will have very similar or identical pHashes. If color differences matter to you, basic pHash won't capture that. On the flip side, this is a feature if you want to catch an image that's been color-filtered or has altered brightness – the hash stays stable despite those edits.
Performance and scaling: Computing a pHash is more CPU-intensive than computing a quick MD5. It involves reading image data, resizing, possibly a DCT, etc. Libraries like sharp are written in C++ and are pretty fast, but if you're hashing thousands of images a second, you'll need adequate CPU resources or to offload the work. Also, if you plan to compare a new image against millions of existing images, you'll want to implement an efficient search (as mentioned, BK-trees or other indexing methods for high-dimensional data) rather than brute force. However, for moderate scales, perceptual hashing is quite feasible. And remember, the hash comparison itself is extremely fast; it's reading and decoding the image that usually dominates the time.
Hash storage: Storing a 64-bit hash for each image is trivial in modern databases (8 bytes, or a 16-character hex string, etc.). The fun part is using those hashes. You can store them as binary blobs or as strings. If using SQL, you might store them as BINARY(8) or CHAR(16) (if hex) or CHAR(64) (if storing the literal bit string). Some developers even store an integer representation. Just be consistent in how you store and compare them. And if you use a specialized data structure (like a BK-tree or an Elasticsearch plugin for hash similarity), that will dictate storage.
Collision risk: While unlikely, it's theoretically possible for two completely different-looking images to yield the exact same 64-bit pHash (a collision). Good perceptual hash algorithms minimize this chance (and it's far less likely than with something like a naive 8-bit average hash). Still, it's not zero. In practice, collisions are not a big issue for most use cases because the space of 64-bit values is huge (2^64 possibilities) and the algorithm tends to distribute hashes according to image content. If you're worried, you could increase the hash size (some libraries let you generate larger hashes, e.g., 128-bit), at the cost of more computation and storage. But again, this is usually overkill unless you have millions of images and are extremely unlucky or dealing with adversarial cases.
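If you go the hex route, converting between the 64-character bit string that sharp-phash returns and a compact 16-character hex column is straightforward with BigInt (a sketch; the helper names here are our own):

```javascript
// Convert a 64-char bit string to a 16-char hex string, and back
function bitsToHex(bits) {
  return BigInt("0b" + bits).toString(16).padStart(bits.length / 4, "0");
}

function hexToBits(hex, bitLength = 64) {
  return BigInt("0x" + hex).toString(2).padStart(bitLength, "0");
}

const hash = "1010".repeat(16); // 64-bit stand-in hash
const hex = bitsToHex(hash);
console.log(hex);                     // → "aaaaaaaaaaaaaaaa"
console.log(hexToBits(hex) === hash); // → true
```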
Deliberate attacks: If your system's adversary knows you're using perceptual hashing, could they craft an image that looks like one thing but has the same hash as another? It's an interesting question (and there has been research into hash spoofing). Generally, basic pHash isn't designed to be secure against a determined attacker – it's for benign similarity detection. If someone deliberately wants to fool a perceptual hash (to, say, sneak an image past a filter by causing a hash collision with an innocuous image), it might be possible with sophisticated techniques. For most applications, this isn't a concern, but for high-stakes content moderation (e.g., bad actors trying to evade CSAM detection), more robust, proprietary solutions like PhotoDNA are used because they are harder to fool. In our everyday developer scenarios, this usually isn't a worry, but it's good to know the limits.
In summary, perceptual hashing is a fantastic tool but use it appropriately. If you just want to know if two files are exactly the same, use a cryptographic hash. If you want to find out if two images look the same, use a perceptual hash. Often, systems use both: for example, first deduplicate exact file copies by MD5 (to save processing), then use perceptual hashes to find the remaining similar pairs.
Useful Links
Amazing blog post: https://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html
Introduction to Perceptual Hashes: Measuring Similarity - Apiumhub https://apiumhub.com/tech-blog-barcelona/introduction-perceptual-hashes-measuring-similarity/
Perceptual hashing - Wikipedia https://en.wikipedia.org/wiki/Perceptual_hashing
sharp-phash - npm https://www.npmjs.com/package/sharp-phash
bktree-fast - npm https://www.npmjs.com/package/bktree-fast
pHash in NodeJS | SSOJet https://ssojet.com/hashing/phash-in-nodejs/