How to Scrape Logos from Websites [For Developers]

May 21, 2025

Developer Deep Dive

Web scraping company logos from websites at scale can be a challenge, but with the right approach you can automate it. This guide covers techniques for extracting logos using Node.js and TypeScript, including DOM traversal tricks, filtering out non-logo images, handling different image formats, dealing with dynamic content, avoiding scraper blockers, and scaling up to thousands of domains. Code examples in TypeScript are provided along the way.


At Brand.dev, we're building an API that helps you fetch company brand data (like name, address, logos, colors, and more) from any domain with a single API call.

Because we scrape hundreds of thousands of logos every day, we've gained some valuable insights that could help you on your own scraping journey. We've organized them in this blog post in a non-linear order, so feel free to skip ahead to whichever section is most useful to you.

If you'd like to try out our API completely free, click here.

Enjoy 😊



A brief word on an alternative CV-based approach


This blog post is primarily focused on DOM-parsing techniques, notes, and learned insights. However, when DOM parsing fails, you can also treat the entire page as an image and apply vision algorithms. Based on our testing, a mixture of these approaches produces the best results. We will write up a separate blog post on CV-based approaches and link it here; in the meantime, here's a brief blurb describing what that will look like:


A common CV-based approach is to take a full-page screenshot (using a headless browser) and run an object detection model to locate the logo. Models like YOLO or Detectron can be fine-tuned for logo objects – for example, the YOLOv5 architecture has been demonstrated for logo detection. Alternatively, you can combine OCR and layout heuristics: run an OCR engine (e.g. Tesseract) on the screenshot to find text, then flag any text matching the company name or known brand words as a potential logo.


For instance, Microsoft’s video-indexer service handles “textual logos” by detecting occurrences of a brand name via OCR. After extracting candidate logo patches, it’s useful to validate them with an image-quality model. You could compute CLIP embeddings or use a learned “aesthetic” scorer: CLIP is known to capture composition and style attributes (like color, framing, and lighting), and models like NIMA output a 1–10 quality score for an image. A NIMA-like model assigns high scores to sharp, well-framed images, so you can filter out blurry or off-center captures.


These AI-driven checks attempt to ensure the final scraped logo is indeed clean and centered, which is especially valuable when no structured logo tag is available in the page HTML.



Identifying Logo Elements in the DOM


The first step is to locate the logo in a webpage’s HTML. Most websites place their main logo in the header or navigation area, typically as an <img> tag or an SVG element. Here are common patterns to find logos in the DOM:


  • HTML Attributes: Look for <img>, <picture>, or <object> tags with id or class attributes containing keywords like "logo", "brand", "header-logo", "site-logo", etc. For example, <img class="site-logo"...> or <img id="logo"...>. Many sites use semantic class names for logos. You should expand your search beyond what's listed here; there are a lot of ways a logo can show up on a page.

  • Alt Text: Check the alt attribute of images for the word "logo" or the website’s name. E.g. <img src="logo.png" alt="Acme Corp Logo">.

  • Container text: Check the text content of the image's container for words that indicate whether the image inside is a logo.

  • Filename/Path: The image src URL often contains "logo" (e.g. /images/logo.png). This isn’t foolproof, but can be a clue.

  • Logo Link: The logo is often wrapped in a link to the homepage. For example, <a href="/" ...><img ...></a> or <a href="index.html"><img ...></a>. Finding an <a> tag linking to the root of the site with an image inside is a strong indicator.

  • Header Container: Logos usually reside in the header or nav section. If the site has a <header> or <nav> element, searching within it for an <img> can narrow the scope.


Using these patterns, you can traverse the DOM to find likely logo images. Here’s a TypeScript example using Cheerio (a fast HTML parsing library for Node) to fetch a page and locate a logo image by common selectors:


import axios from 'axios';
import * as cheerio from 'cheerio';

async function scrapeLogoImage(url: string): Promise<string | null> {
  const res = await axios.get(url, { timeout: 10000 });
  const $ = cheerio.load(res.data);

  // Look for img elements that likely represent the logo
  const logoImg = $('img').filter((_, el) => {
    const id = $(el).attr('id')?.toLowerCase() || "";
    const className = $(el).attr('class')?.toLowerCase() || "";
    const alt = $(el).attr('alt')?.toLowerCase() || "";
    // Check for keywords in id, class, or alt text
    return /logo|brand|header|logo-img/.test(id + " " + className + " " + alt);
  }).first();

  if (logoImg.length > 0) {
    // Resolve the absolute URL of the logo src
    let src = logoImg.attr('src') || "";
    if (src && !src.startsWith('http')) {
      // Convert relative URL to absolute
      const base = new URL(url);
      src = new URL(src, base).href;
    }
    return src;
  }
  return null;
}



In this snippet, we load the HTML and filter all <img> tags by checking their id, class, and alt attributes for common logo keywords. We take the first match as the primary logo. We also convert relative URLs to absolute. This simple approach catches a large number of cases, but not all. Some sites might not include "logo" in their HTML attributes, so additional tactics are needed (like looking for an <img> inside a header link as mentioned, or checking for image dimensions as we’ll discuss next).


Tip: Cheerio supports jQuery-like selectors, so you can also directly use selectors like $('img[class*="logo"], img[alt*="logo"]') to find images with “logo” in their attributes, or $('header img') to target images in the header. Use these selectors to narrow down candidates quickly.


For cases where the logo is embedded as an inline SVG (e.g. using <svg> code for the logo instead of an <img>), you can search for an <svg> element in the header or with an id/class of "logo". If found, you might serialize that SVG or convert it to an image (we’ll cover SVG handling later).
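
For example, here's a minimal sketch with Cheerio (assuming the page HTML has already been fetched) that finds an inline SVG logo and serializes its markup:

import * as cheerio from 'cheerio';

// Sketch: find an inline <svg> logo and serialize its markup
function extractInlineSvgLogo(html: string): string | null {
  const $ = cheerio.load(html);
  const svg = $('header svg, nav svg, svg[id*="logo"], svg[class*="logo"]').first();
  // $.html(el) serializes the element (and its children) back to markup
  return svg.length > 0 ? $.html(svg) : null;
}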


Use the Company's Social Media Accounts!


Official brand logos can often be harvested from a company’s own social profiles. For example, many websites link to their official Twitter, LinkedIn, Facebook or Instagram accounts in the footer or metadata; once you identify a company’s handles, you can fetch the profile image directly.

In practice this works well because most businesses rely on their logo to fill in the profile photo, so the profile avatar is often the exact logo used on the site. LinkedIn and Facebook pages similarly have a profile picture: a LinkedIn company page's HTML typically embeds the logo in an <img> tag (often with a class like "hero-img"), which can be scraped from the page source. Instagram business accounts likewise provide the brand icon as the profile picture. Since these official channels are maintained by the company, the scraped images tend to be clean and high-resolution. Some logo-detection pipelines even crawl official Instagram feeds (e.g. a brand's @Nike account), since those accounts typically feature the logo across most of their images and offer high-quality visuals. Gathering logos this way is a reliable fallback that improves coverage when DOM-based methods or guesswork would miss a brand's icon.
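
As a starting point, you can harvest these profile links with Cheerio. Fetching the actual avatar is platform-specific (it may require each platform's API or a separate scrape), so this sketch only collects the candidate profile URLs:

import * as cheerio from 'cheerio';

// Sketch: collect a site's official social profile URLs as logo fallbacks
function findSocialProfileLinks(html: string): string[] {
  const $ = cheerio.load(html);
  const socialHosts = /(?:twitter|facebook|linkedin|instagram)\.com/i;
  const links = new Set<string>();
  $('a[href]').each((_, el) => {
    const href = $(el).attr('href') || '';
    if (socialHosts.test(href)) links.add(href);
  });
  return [...links];
}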


Filtering Out Non-Logo Images


Once you’ve extracted candidate images, you need to ensure they are actually the real logo and not some other graphic. Websites often have many images (banners, icons, etc.), and scraping by a simple keyword might pick up false positives. Here are strategies to distinguish the true logo from generic images:


  • Size and Aspect Ratio: Logos are usually of moderate size – neither tiny icons nor huge full-screen images. If you accidentally grabbed a large hero/banner image, that’s likely not a logo. Conversely, very small images (e.g. 16x16px) are probably icons or favicons, not logos. You can filter images by dimensions or file size. For example, skip images below, say, 30px in both width and height (common social icons are around 16–24px, whereas logos are often larger than 30px in one dimension). Also, check aspect ratio – many logos are wider than tall (like a horizontal wordmark) or vice versa, whereas perfectly square images might be generic icons (not always, but it’s a hint).

  • File Path Keywords: Look at the image filename or URL path. If it contains keywords like icon, icons, social, or names of social networks (facebook.png, twitter.svg), it's likely not the main logo. By contrast, filenames containing the company name or "logo" are likely the right image.

  • HTML Element Context: Examine the DOM context. If an image is inside a <div class="carousel"> or <section class="banner">, it's probably a content image. The main logo is often in the top navigation bar or a footer (for secondary logos). Many sites put the logo inside a <header> tag or a <div id="logo"> container. Limiting your search to likely regions (header/nav) can avoid picking up random images in the page body.

  • Anchor Link Destination: As mentioned, the logo is commonly wrapped in a link to the homepage. If an image’s parent <a> tag points to "/" or the site’s main URL, that image is very likely the main logo. In code, you could check $('a[href="/"] img') or $('a[href*="yourdomain.com"] img') for candidates.


By applying these filters, you can usually zero in on the real logo. For example, you might collect all <img> candidates and then choose the one with the largest pixel area within a reasonable range (to exclude huge banners). Or prioritize images whose filename includes the site’s name.


Another approach to confirm the logo is to use the alt text if it contains the company/site name. If the alt text of an image matches the domain or site title (e.g. alt="AcmeCorp"), that’s a strong sign it’s the main logo.
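
Putting these heuristics together, one practical option is a simple scoring function over the candidates, then picking the highest scorer. The weights and thresholds below are illustrative starting points, not tuned values:

interface ImgCandidate {
  src: string;
  alt: string;
  className: string;
  parentHref: string; // href of the wrapping <a>, if any
  inHeader: boolean;  // whether the img sits inside <header>/<nav>
  width?: number;     // intrinsic or rendered size, if known
  height?: number;
}

// Sketch: score a candidate image using the heuristics above
function scoreLogoCandidate(c: ImgCandidate, siteName: string): number {
  let score = 0;
  const text = `${c.src} ${c.alt} ${c.className}`.toLowerCase();
  if (/logo|brand/.test(text)) score += 3;
  if (text.includes(siteName.toLowerCase())) score += 3;
  if (c.parentHref === '/' || c.parentHref.endsWith('index.html')) score += 2;
  if (c.inHeader) score += 2;
  if (c.width && c.height) {
    if (c.width < 30 && c.height < 30) score -= 4;    // icon-sized
    if (c.width > 1200 || c.height > 800) score -= 3; // likely hero/banner
  }
  if (/icon|facebook|twitter|linkedin|instagram|banner|carousel/.test(text)) score -= 4;
  return score;
}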



Dealing with Partner and “Trusted By” Logos


One tricky scenario is when a page has a section like "Trusted by these companies" or "Our Partners" displaying multiple logos of other companies. These are logos, but not the logo of the website you are scraping — they are extraneous in the context of grabbing the site’s own brand logo. You may want to avoid capturing these, or handle them separately if your goal is to collect them.


Characteristics of "trusted by/partner" sections:

  • They usually contain multiple logos grouped together, often in grayscale style. The presence of a cluster of several <img> tags in a single section (especially all about the same size) is a hint that these are partner/client logos rather than the site’s main logo.

  • The section might have a heading or text like "Trusted by", "Our clients", "Partners", etc. If you see an image within a container that also contains those phrases in text, you might choose to skip those images. For example, you could check if an image is inside a section whose text content includes "Trusted by".

  • The alt text or filenames of those images often contain other company names. If the alt attribute of an image is "BigCorp Inc." and your target domain is something else, that image is obviously an external company’s logo.

  • Size-wise, partner logos might be smaller and uniformly sized to fit a grid. The main site logo might be uniquely sized and usually appears only once.


Strategy: If you detect multiple logo-sized images in a group, you can either exclude them all by default, or implement a toggle to collect them separately. One approach is: find the primary logo first (likely in header), and treat any additional logos found in the body as partner logos. If the goal is strictly one logo per site (the site’s own logo), you’d ignore images that have alt text not matching the site’s name or that appear in a list of many logos. If the goal is to also gather those partner logos, you could store them under a different field (e.g. siteLogo vs partnerLogos).


Using Cheerio or DOM parsing, you can do something like:

// Skip (or separately collect) logos inside trusted-by sections
const partnerLogoSrcs = new Set<string>();
$('section, div').each((_, section) => {
  const text = $(section).text().toLowerCase();
  if (text.includes("trusted by") || text.includes("our clients") || text.includes("partners")) {
    // Flag this section's images as partner logos to ignore or handle separately
    $(section).find('img').each((_, img) => {
      const src = $(img).attr('src');
      if (src) partnerLogoSrcs.add(src);
    });
  }
});


This scans sections for keywords and then flags the images within them.


In practice, you might not catch all cases with text cues alone (some designs might not literally have the words "trusted by"). In such cases, fall back on detecting clusters of multiple images as a heuristic. For example, if after extracting the main logo you find five other <img> tags on the page, all of similar dimensions, that’s likely a client logo section.
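
Here's a rough sketch of that clustering heuristic, assuming the candidate images have already been downloaded as Buffers (Sharp reads their dimensions):

import sharp from 'sharp';

// Sketch: flag groups of similarly sized images as likely partner logos
async function findPartnerLogoClusters(images: { url: string; data: Buffer }[]): Promise<string[][]> {
  const buckets = new Map<string, string[]>();
  for (const img of images) {
    const meta = await sharp(img.data).metadata();
    if (!meta.width || !meta.height) continue;
    // Bucket by dimensions rounded to the nearest ~20px
    const key = `${Math.round(meta.width / 20)}x${Math.round(meta.height / 20)}`;
    buckets.set(key, [...(buckets.get(key) ?? []), img.url]);
  }
  // Three or more images of near-identical size usually means a client/partner grid
  return [...buckets.values()].filter((group) => group.length >= 3);
}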


The key is to differentiate the single primary logo of the site from other logos present. Focus on the one usually at the top of the page or in the header as the primary.



Avoiding Social Media Icons and Favicons


Almost every website also has tiny icons for social media (Facebook, Twitter, LinkedIn, etc.), and you definitely want to avoid misclassifying those as logos. These icons typically appear in the header or footer as clickable links to the company’s social profiles, or as share buttons. Here’s how to avoid them:


  • Size Filter: Social icons are small (often 16x16 up to 32x32 pixels). If you have the ability to get image dimensions (for example, by downloading the image or perhaps by using a headless browser to get the rendered size), you can drop anything below a certain size threshold as mentioned earlier. A legitimate logo is rarely that tiny.

  • Filename/URL: The src of social icons might contain the name of the social network or generic terms. e.g. facebook.svg, twitter.png, or paths like /icons/facebook.png. If the image URL contains known social platform names or is hosted on those domains (like a link to facebook.com/images/...), skip it.

  • Alt/Title attributes: Often these icons have alt text like "Facebook" or a title attribute like "Follow us on Twitter". If you see that in the element, it’s not the site’s logo but a social icon.

  • CSS classes: The presence of classes like "social-icons", "icon-facebook", "fa-facebook" (for FontAwesome icons, though those might not be <img> at all), can flag an element as a social media icon. If using a DOM parser, you can explicitly exclude any <img> that has a class or parent with "social" in its class name.

  • Link target: Social icons are usually inside <a> tags that link out to external domains (facebook.com, twitter.com, etc.). So if an <img> is inside an <a href> pointing to a different domain than the site’s domain, it’s likely a social icon. You can programmatically check the parent anchor’s href and skip those.


By combining these checks, your scraper can reliably ignore social media icons. For example, if using Cheerio:


$('img').filter((_, el) => {
  const src = $(el).attr('src') || "";
  const alt = $(el).attr('alt') || "";
  const parentLink = $(el).parent('a').attr('href') || "";
  // Condition to identify a social icon
  if (/facebook|twitter|linkedin|instagram/i.test(src + alt + parentLink)) {
    return false; // exclude this from logo candidates
  }
  // ... other checks for size or class
  return true;
});


This filters out images associated with known social sites. A similar logic can be applied for favicons (the small icon in browser tabs) – those are usually linked via <link rel="icon" ...> tags, not inline images, so they shouldn’t interfere if you’re only looking at <img> tags. Just be careful not to mistake a favicon for a logo; since favicons are typically 16x16 or 32x32 and referenced in <head>, they’re easy to distinguish.



Deduplicating Logos with Image Analysis


When scraping at scale, you might encounter multiple instances of what is essentially the same logo. For example, a site might serve a colored logo and a monochrome version (for dark mode), or the same logo at different resolutions (standard and retina). If you store all of them, you’d have duplicates. Deduplication ensures you end up with a unique set of logos per site.


There are a few techniques to deduplicate images:

  • Exact Duplicate Check (Hashing): First, use a quick hash like MD5 or SHA-256 on the image data to catch exact byte-for-byte duplicates. This will flag if you, say, downloaded the same PNG twice from two different pages.

  • Perceptual Hashing (pHash): Perceptual hashing is the key for near-duplicates. A perceptual hash generates a fingerprint of an image based on its visual appearance (ignoring minor differences). Images that look alike produce similar or even identical pHashes. This is very useful for logos which might have small differences (different background color, slight scaling) but are essentially the same graphic. By comparing perceptual hashes, you can cluster images that are the same logo. For example, the team at Brand.dev noted that de-duplicating logos was a crucial part of handling many images and found perceptual hashing to be one of the most effective techniques.

  • Color Histograms: Another approach is comparing color distribution. If you have two images with the same shape but one is a white version and another is black, a color histogram might flag them as different (since one is mostly light pixels, one dark). However, if you convert images to grayscale first, their histograms or pHashes might converge. Generally, pHash or difference hash algorithms inherently handle color differences by focusing on structure, but you can explicitly grayscale images before hashing to be safe when color inversions are a factor.

  • Feature Matching (SIFT/ORB): For a more advanced computer vision approach, you could use algorithms like SIFT or ORB (via OpenCV) to detect key points in the images and see if they align. This can catch duplicates even if one image is a scaled or rotated version of another. However, this is computationally heavier and usually overkill for logo scraping, given pHash is much faster and works well for this use-case.


Implementing Perceptual Hashing in Node: There are libraries like sharp-phash (built on the Sharp image processing library) that can generate perceptual hashes in Node.js. We talk about this in great detail in this blog post.


For example:


import * as fs from 'fs';
import phash from 'sharp-phash';
import distance from 'sharp-phash/distance';

const img1 = fs.readFileSync('logo1.png');
const img2 = fs.readFileSync('logo2.png');

// Compute perceptual hashes (64-character bit strings by default)
const hash1 = await phash(img1);
const hash2 = await phash(img2);

// Compute the Hamming distance between the hashes
const dist = distance(hash1, hash2);
console.log(`Hash1: ${hash1}\nHash2: ${hash2}\nHamming Distance: ${dist}`);


If the Hamming distance (the number of differing bits) is below a certain threshold, you can consider the images to be essentially the same. A distance of 0 means identical pHash (very likely the exact same image or a visually indistinguishable variant), while a small number (say <= 5 for a 64-bit hash) indicates very similar images. You can tune the threshold based on experimentation. In production at scale, you might generate a pHash for each new logo image and compare it against a database of existing hashes to decide if it’s new or a duplicate. This is far more efficient than pixel-by-pixel comparisons.


Often, a two-step de-dup is ideal: first filter out exact duplicates by an MD5 checksum (to avoid extra image processing work), then use perceptual hashing on the rest to catch near-duplicates. As a practical example, Brand.dev uses perceptual hashing in their pipeline to automatically de-dupe logos in their massive image set.
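
Here's a sketch of that two-step pipeline (in-memory here; at scale the seen-hash structures would live in a database):

import { createHash } from 'crypto';
import phash from 'sharp-phash';
import distance from 'sharp-phash/distance';

const seenMd5 = new Set<string>();
const seenPhashes: string[] = [];

// Sketch: cheap exact-duplicate check first, perceptual check second
async function isDuplicate(image: Buffer, threshold = 5): Promise<boolean> {
  const md5 = createHash('md5').update(image).digest('hex');
  if (seenMd5.has(md5)) return true; // exact byte-for-byte duplicate
  seenMd5.add(md5);

  const hash = await phash(image);
  for (const existing of seenPhashes) {
    if (distance(hash, existing) <= threshold) return true; // near-duplicate
  }
  seenPhashes.push(hash);
  return false;
}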


Handling Variations: Logos might have slight variants – e.g., one version with text, one without, or updated design. Perceptual hashing will flag only very similar images. If a logo changed significantly, it should produce a different hash (which is correct – it’s a different logo). So pHash helps group truly identical logos. For color variants (like a white vs colored logo), a grayscale pHash will usually match them, which is what you want. But be mindful: if the logo has transparent background versus colored background, that can affect the hash. It may be worth adding a preprocessing step: trim whitespace or transparent padding around images before hashing to avoid differences due to padding.
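
A sketch of such a preprocessing step, using Sharp's trim() to strip uniform borders and grayscaling to neutralize color-inverted variants before hashing:

import sharp from 'sharp';
import phash from 'sharp-phash';

// Sketch: normalize an image before perceptual hashing
async function normalizedPhash(image: Buffer): Promise<string> {
  const normalized = await sharp(image)
    .trim()       // remove uniform-color or transparent padding
    .greyscale()  // make white/colored logo variants hash alike
    .png()
    .toBuffer();
  return phash(normalized);
}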



Handling Different Image Formats (SVG, PNG, JPEG, etc.)


Websites use various formats for logos:

  • SVG (Scalable Vector Graphics): Many modern sites use SVG for logos because it’s resolution-independent and crisp on all screens. SVG logos might be embedded via an <img src="logo.svg">, an <object> or <embed> tag, or inline as <svg>...</svg> in the HTML. If you extract an SVG, you’ll get either the SVG file or markup. The advantage is you have a high-quality vector. However, you might want to convert it to a raster format (PNG) for consistency in processing (for example, if you plan to compute hashes or display them in certain contexts). You can use Sharp in Node to read an SVG and output a PNG without launching a browser. For instance:

    import * as fs from 'fs';
    import sharp from 'sharp';

    const svgBuffer = fs.readFileSync('logo.svg');
    // Rasterize at a higher density so the PNG stays crisp;
    // transparency is preserved by default in SVG-to-PNG conversion
    const pngBuffer = await sharp(svgBuffer, { density: 300 })
      .png()
      .toBuffer();
    fs.writeFileSync('logo.png', pngBuffer);

    This converts an SVG file to PNG. Using Sharp is fast and avoids the heavy overhead of a headless browser just for image conversion. If the SVG has text elements or references external fonts, note that conversion might slightly alter appearance unless fonts are embedded.

  • PNG: A very common format for logos (supports transparency, which is often used to place the logo on different backgrounds). PNG is lossless, which is great for graphics like logos. When scraping, if you get a PNG, you can usually use it as-is. Ensure you preserve the alpha channel if you need to place it on a background later.

  • JPEG: Not as common for logos (because JPEG is lossy and doesn’t support transparency), but some older sites or certain designs might use it. If you encounter a JPEG logo, it might have a solid background. You might consider converting it to PNG and possibly removing the background (via manual editing or an algorithm) if transparency is desired. But that’s post-processing.

  • GIF: Rare for logos, except maybe an older site or an animated logo. If it’s animated, decide if you need the animation or a single frame.

  • WebP/AVIF: Some sites might serve logos in modern formats for web performance. If the <img> has a .webp or .avif source, ensure your scraping setup can download those. You might convert them to PNG/JPEG for compatibility, using a library (Sharp supports WebP as well).


When scraping, it’s good practice to standardize the output format or at least the handling. You might, for example, choose to store everything as PNG in your database for consistency (converting SVG or WebP to PNG on the fly). Keep the original format if you need high fidelity or vector data, but for tasks like image hashing, you can hash the rasterized version of an SVG.



Extracting text from SVG: Some SVG logos contain textual elements (e.g., the company name rendered as text in the SVG). If you need to extract or verify the brand name, you can parse the SVG XML. Simply searching for <text> elements in the SVG XML and retrieving their text content can give you any embedded text. For example:

import * as fs from 'fs';

const svgContent = fs.readFileSync('logo.svg', 'utf-8');
const textMatch = svgContent.match(/<text[^>]*>([^<]+)<\/text>/i);
if (textMatch) {
  console.log("SVG text content:", textMatch[1]);
}


This will capture "CompanyName" if the SVG has something like <text x="0" y="0">CompanyName</text>. Be aware that many logo SVGs convert text to paths for styling reasons, in which case there won’t be any <text> tags (the text is essentially vector shapes then). Still, extracting text can be a quick way to double-check a logo’s identity (e.g., some logos are just stylized company names).



Scraping Logos from Dynamic Sites (SPA/CSR) with Headless Browsers

Not all websites deliver the logo image URL in the initial HTML. Many modern sites (SPA – Single Page Applications built with React, Angular, etc.) might render the layout via JavaScript. In such cases, a simple HTTP GET and Cheerio parse might return an empty shell or missing image tags. For example, the HTML might just contain <div id="root"></div> and the actual DOM (with the logo <img>) is populated by client-side JS. To handle this, you need to use a headless browser to execute the JavaScript and then extract the logo.

Puppeteer (for Chrome) or Playwright (for Chrome/Firefox/WebKit) are popular choices. Here’s how you can use Puppeteer in TypeScript to get a logo from a page that requires rendering:

import puppeteer from 'puppeteer';

async function scrapeLogoWithPuppeteer(url: string): Promise<string | null> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Optional: set a realistic User-Agent to avoid detection
  await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');

  await page.goto(url, { waitUntil: 'networkidle0', timeout: 30000 });
  // Wait for an <img> to appear in the DOM (in case the logo loads late or is lazy-loaded)
  await page.waitForSelector('img', { timeout: 10000 }).catch(() => { /* handle if not found */ });

  // Evaluate in the page context to find the logo src
  const logoSrc = await page.evaluate(() => {
    // Look for likely logo elements
    const img = document.querySelector('img[alt*="logo"], img[src*="logo"], img[class*="logo"]');
    return img ? (img as HTMLImageElement).src : null;
  });

  await browser.close();
  return logoSrc;
}

This script launches a headless Chrome, navigates to the URL, and waits for network activity to finish (networkidle0 waits until 0 network connections for a moment, which helps when sites load additional resources). We then optionally wait for any <img> to ensure images are loaded, and use page.evaluate to run a snippet in the page that finds an <img> matching typical logo selectors. We gather its src URL and return it.

Using a headless browser like this lets the site's own JavaScript run, so if the logo is added to the DOM dynamically (or if the <img> tag was present but needed JavaScript to set its src), it will now be available. It also handles cases where the logo is loaded via a lazy-loader (some sites do not set the src immediately, but use data attributes and a script to load it).

Performance considerations: Running a full browser for every site is heavy. If you have thousands of domains to scrape, launching Puppeteer for each sequentially will be slow. You can mitigate this by using concurrency (running multiple headless browsers in parallel) and by reusing browser instances if possible. We’ll talk more about scaling in a later section, but note that you might use a library like puppeteer-cluster to manage multiple pages.

Also, for purely static sites or those where a simple HTTP fetch yields the logo, use the lightweight Cheerio approach. Reserve Puppeteer for cases where you truly need it (detect this by checking if your initial Cheerio parse found a logo or not – if not, then try Puppeteer as a fallback).
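
Here's a sketch of that fallback flow, reusing the scrapeLogoImage and scrapeLogoWithPuppeteer functions from the earlier snippets:

// Sketch: cheap HTML parse first, headless browser only as a fallback
async function scrapeLogo(url: string): Promise<string | null> {
  try {
    const fromHtml = await scrapeLogoImage(url);
    if (fromHtml) return fromHtml;
  } catch {
    // network error or non-HTML response; fall through to the browser
  }
  return scrapeLogoWithPuppeteer(url);
}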

Client-side rendered (CSR) apps: If the site is an SPA, often the logo is actually included as a static asset reference in the initial HTML (for performance reasons) – e.g., some SPAs might still have a basic <img src="/static/media/logo.png" ...> in the HTML. But if not, Puppeteer is the way. Another trick: some sites use an API to fetch branding info; for example, an internal endpoint might serve the logo URL. If you reverse-engineer that, you might not need a full browser. However, this is very site-specific and generally not scalable to implement for each site, so headless scraping is the universal solution for dynamic content.


Preventing and Mitigating Scrape Blocking


When scraping many websites, you’ll inevitably run into anti-scraping measures. Websites may block rapid or suspicious requests, especially when you’re fetching images or using headless browsers in bulk. To ensure your scraper can run at scale without getting IP-banned or served CAPTCHAs, consider these best practices:


  • Rotating Proxies/IPs: Use a pool of proxy servers or IP addresses and rotate them between requests. This distributes the traffic and avoids making hundreds of requests from a single IP that could get flagged. Services exist that provide residential or datacenter proxy networks. You can also use libraries like proxy-chain in Node to automatically rotate proxies with Puppeteer. The idea is to not hit all sites from one machine identity.

  • User-Agent Rotation: Vary the User-Agent header in your requests (both in axios/fetch and in headless browser). Bots often have default or no user-agent which is a red flag. Use a list of common browser UA strings and randomize them. For Puppeteer, as shown, you can use page.setUserAgent() to a realistic value. There are NPM libraries like user-agents that can generate believable user-agent strings in code.

  • Headless Evasion: By default, headless Chrome has indicators (like the navigator.webdriver property) that some anti-bot scripts detect. You can use puppeteer-extra with the stealth plugin, or Playwright’s stealth mode, to mask those. These plugins tweak the browser environment to look more like a real user. Additionally, enabling headful mode (non-headless) can sometimes bypass simple headless detection, though running headful Chrome is heavier.

  • Throttle Request Rate: Do not bombard a single website with too many requests in a short time. If you’re scraping thousands of domains, it’s best to put a small delay or concurrency limit per target domain (e.g., if you need to get multiple pages from one site). For just one logo per site, you’re mostly doing one request per site, which is usually fine. But if you try to do all thousand simultaneously, your machine or network may trigger alarms. A staggered or batched approach is safer.

  • Avoiding Common Patterns: Some anti-scraping systems key off common bot behaviors. For instance, making requests with no referrer, or not loading any other resources (a real browser would load images, CSS, etc.). When using HTTP libraries, you might set a Referer header or accept cookies to appear more like a normal browser. When using Puppeteer, loading the whole page (including images) actually helps appear legitimate, albeit at cost of bandwidth.

  • CAPTCHA and Cloudflare bypass: In rare cases, a site might present a CAPTCHA or a JavaScript challenge (like Cloudflare IUAM page) before letting you in. Solving these at scale is non-trivial – it may involve using services or injecting solution scripts. If you encounter these, you might consider using a scraping API service that handles it, or skipping those domains if logos aren’t absolutely critical from them. There are also headless browser automation services that have solutions for Cloudflare (e.g., Playwright has some built-in handling for certain challenges, and tools like ScrapingAnt or ScraperAPI can offload this).

  • Proxy Failover: If a proxy/IP gets blocked (you start receiving 403/429 errors), switch to a new proxy. Monitor the HTTP status codes. Implement retry logic with a backoff – e.g., if a request fails or is blocked, wait a bit and try with a different IP.


Overall, a combination of rotating IPs and varied user agents goes a long way. As one scraping expert noted, sophisticated anti-bot measures have made advanced proxy rotation paramount for successful scraping. By distributing your requests across many identities and mimicking real browser traffic, you reduce the chance of detection.



In code, rotating proxies with Puppeteer might look like:

// Launch puppeteer with a proxy
const proxyUrl = getNextProxy(); // your function to fetch a proxy from pool
const browser = await puppeteer.launch({
  headless: true,
  args: [`--proxy-server=${proxyUrl}`]
});
const page = await browser.newPage();
await page.setUserAgent(getRandomUserAgent());



For HTTP requests (axios), you can configure a different proxy for each request (or use a proxy rotation service’s API endpoint). Keep in mind that some proxies require authentication, so you might need to embed credentials in the URL or use an HTTP agent.
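
Here's a sketch of per-request rotation with retry and backoff, assuming the https-proxy-agent package and a hypothetical getNextProxy() helper that returns a URL like http://user:pass@host:port from your pool:

import axios from 'axios';
import { HttpsProxyAgent } from 'https-proxy-agent';

declare function getNextProxy(): string; // hypothetical pool helper

// Sketch: rotate proxies per attempt with exponential backoff
async function fetchWithRotation(url: string, retries = 3): Promise<string> {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const res = await axios.get(url, {
        httpsAgent: new HttpsProxyAgent(getNextProxy()),
        proxy: false, // let the agent handle proxying (including auth)
        timeout: 10000,
      });
      return res.data;
    } catch {
      // Blocked or failed: back off, then retry with a fresh IP
      await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt));
    }
  }
  throw new Error(`All ${retries} attempts failed for ${url}`);
}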



Scaling the Scraping Process Across Thousands of Domains


Scraping one site is easy; scraping thousands requires careful planning for efficiency and stability. Here are considerations and techniques to scale up:

  • Parallelism: Use concurrency to scrape multiple sites at the same time. Node.js is single-threaded for JS execution, but it can handle many concurrent I/O operations. Leverage asynchronous calls to fetch multiple pages in parallel. If you have a list of domains, you can use Promise.all in batches, or a controlled concurrency library (like p-limit or the async library) to limit how many are in-flight at once (see the p-limit sketch after this list). For headless browser tasks, consider using multiple browser instances or contexts in parallel.

  • Puppeteer Cluster: If using Puppeteer heavily, consider the puppeteer-cluster library which manages a pool of Chromium instances and distributes tasks to them. This helps optimize resource usage by reusing browsers and controlling concurrency. For example, you might run 5–10 headless browsers simultaneously, each handling one domain at a time, which speeds up the throughput significantly while avoiding the overhead of launching a new browser for each domain. Puppeteer Cluster also has built-in error handling and retries, which are useful at scale (if one browser crashes, it can retry the job in a new one).

  • Job Queue: If the list of websites is extremely large, you may want to use a queuing system. For instance, put all domain jobs into a queue (like RabbitMQ, Redis queues using Bull, etc.) and have worker processes pulling from it. This way, you can distribute the work across multiple machines or processes. Each worker runs the scraping logic for one domain at a time and pushes results to a database.

  • Resource Management: Be mindful of CPU and memory. Processing images (especially computing hashes or conversions) can be CPU-intensive. Node’s event loop can handle a lot of I/O, but CPU-bound tasks (like image decoding) might block it. Offload heavy image processing to worker threads or do it in smaller chunks. Sharp is pretty efficient in C++ under the hood, but if you hash dozens of images concurrently, that’s CPU work. Monitor your system and perhaps throttle image processing concurrency.

  • Timeouts and Failures: At scale, some websites will be down or extremely slow. Implement timeouts for requests (as shown with axios) and for page loads in Puppeteer (goto has a timeout option). If a site fails to respond in a reasonable time, log it and move on – you might retry it later, but don’t let it stall the whole pipeline.

  • Data Storage and Management: You’ll be collecting a lot of image files. Plan where to store them – local disk, cloud storage (S3, etc.), or a database as BLOBs. Storing images on a filesystem with thousands of files might require organizing into subfolders or using a naming scheme to avoid huge directories. If using a cloud bucket, you can upload as you go. Also store metadata: which domain corresponds to which image file, any flags (e.g., dark vs light version).

  • Scaling Out: If thousands become tens of thousands or more, you might run your scraping distributed. For example, use a cloud function or container per domain (though that could be overkill and more overhead than needed). More traditionally, you’d run multiple instances of your Node script on different servers, each handling a portion of the domain list.
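
For the lightweight Cheerio path, a p-limit sketch might look like this (scrapeLogoImage is the function from the first snippet; the concurrency cap of 20 is an arbitrary starting point):

import pLimit from 'p-limit';

// Sketch: cap in-flight fetches for the HTTP-only scraping path
const limit = pLimit(20);

async function scrapeAll(domains: string[]) {
  return Promise.all(
    domains.map((domain) =>
      limit(async () => ({
        domain,
        logo: await scrapeLogoImage(`https://${domain}`).catch(() => null),
      }))
    )
  );
}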

Example: Using puppeteer-cluster for concurrency. The snippet below conceptualizes using a cluster to scrape multiple domains in parallel:

import { Cluster } from 'puppeteer-cluster';

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 5,  // launch 5 headless Chrome contexts
    timeout: 30000,
    puppeteerOptions: { headless: true }
  });

  await cluster.task(async ({ page, data: site }) => {
    await page.goto(site.url, { waitUntil: 'domcontentloaded', timeout: 15000 });
    const logoSrc = await page.evaluate(() => {
      const img = document.querySelector('img[alt*="logo"], img[class*="logo"]');
      return img ? (img as HTMLImageElement).src : null;
    });
    site.logo = logoSrc;
    // ... (maybe download the image or save the result)
  });

  // Queue a bunch of domains (listOfDomains: string[] is assumed to be defined elsewhere)
  for (const url of listOfDomains) {
    cluster.queue({ url: `https://${url}` });
  }

  await cluster.idle();
  await cluster.close();
})();

In this code, up to 5 browsers will run concurrently, each grabbing a logo. The cluster ensures efficient reuse of browsers and avoids spawning thousands of processes. As the Webshare tutorial notes, a cluster optimizes resource use and prevents overload by controlling how many tasks run in parallel. You can scale maxConcurrency up based on your CPU/RAM and the aggressiveness of scraping you need.

Finally, keep in mind the crawl politeness considerations. Hitting thousands of domains is usually fine if each is just one request, but make sure you aren’t violating any usage policies. If you’re doing this internally (e.g., for a company that has a list of websites to get logos for), you’re probably okay. If distributing a product that scrapes, be mindful of legal and ethical concerns per site.



Leveraging an Automated Solution: brand.dev

Building and maintaining a logo scraping pipeline with all the above capabilities – identifying logos, filtering out the noise, handling different formats, deduplicating variations, and avoiding blocks – is a substantial effort. If your goal is simply to get company logos (and related brand info) without re-inventing the wheel, an alternative is to use a service like brand.dev.

Brand.dev provides an API that automatically fetches high-quality logos for any domain, along with other branding data, in a single call. Under the hood, such a service has already implemented logo detection, extraction, deduplication, and even categorization. For example, brand.dev’s API returns not just the logo image URL(s) for a domain, but also the primary brand colors, company description, industry category, and more. This means if you query something like api.brand.dev/logo?domain=example.com, you get back the logos (often multiple versions like dark/light mode) without needing to scrape the site yourself.
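
As an illustrative sketch only (the endpoint below mirrors the example above, but the exact path, auth scheme, and response shape are assumptions; check the brand.dev docs for the real contract):

// Hypothetical sketch of a single-call logo lookup
async function fetchBrandLogo(domain: string, apiKey: string) {
  const res = await fetch(`https://api.brand.dev/logo?domain=${encodeURIComponent(domain)}`, {
    headers: { Authorization: `Bearer ${apiKey}` }, // auth scheme assumed
  });
  if (!res.ok) throw new Error(`brand.dev API error: ${res.status}`);
  return res.json(); // logos, colors, description, etc. (shape assumed)
}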

The advantages of using brand.dev or similar services are:

  • Accuracy: They often maintain a curated dataset or use advanced algorithms to ensure the logo is correct. This reduces the chance of grabbing a wrong image.

  • Deduplication & Variants: Brand.dev groups logos by variant – for instance, you might get the colored logo and a monochrome version, but they’ll be identified as the same logo group. This solves the duplicate issue for you.

  • Updates & Maintenance: If a company changes their logo, the service might catch that (since they periodically refresh or have user contributions), whereas a static scraping approach might have to be rerun to notice changes.

  • Categorization: As a bonus, brand.dev provides categorization like the company’s industry or other metadata alongside the logo, which can be useful if you’re building, say, a directory or doing analytics.

For developers who don’t want to implement the whole scraping pipeline, using such an API is a huge time saver. As an example, Clearbit and Brandfetch are other well-known logo APIs, but brand.dev is positioned as a comprehensive brand data API (with logos, colors, etc.) and can be integrated with just a simple request. By offloading the heavy lifting to an API, you also avoid issues of getting blocked, since the API provider handles the scraping on their end and delivers you clean results.

When to build vs buy: If you require full control, customization, or have a very niche internal use-case, building your own scraper as described in this guide gives you flexibility. You can tailor how you pick logos, what you consider a duplicate, etc. However, for most applications where the end goal is just to get the logos, an API like brand.dev can provide results instantly and reliably. It’s often worth considering the cost of developer time and infrastructure versus the cost of the API.

In summary, scraping logos at scale involves a gamut of technical challenges – DOM parsing, image analysis, anti-bot avoidance, concurrency – but with the techniques outlined above, you can build a robust system to do it. Start simple with HTML parsing and gradually add layers (headless browser, better filtering, hashing) as needed to improve accuracy. And remember, if it fits your scenario, services like brand.dev exist to give you a turnkey solution for logo and brand data. Happy scraping!
