Jun 11, 2025
Developer Deep Dive
At Brand.dev, we're building an API that helps you fetch company brand data, like name, address, logos, colors, and more from any domain with a single API call.
Because we scrape hundreds of thousands of websites every day, we've gained some valuable insights that could be of help to you on your scraping journey. We've put our insights into this blog post in non-linear order, so please skip ahead if you see a specific section that is of use to you.
When building applications that need a company’s contact information, automating the retrieval of a brand’s address can save time and ensure accuracy. In this guide, we’ll explore how to programmatically scrape a brand's address from various online sources using Node.js and TypeScript. We’ll cover extracting address data from the brand’s official website, scraping social media pages (Facebook, LinkedIn, Instagram), merging fragments from multiple sources into a complete address, and handling one-time versus recurring scraping jobs.
1. Extracting the Address from the Official Website
The first and best source for a brand’s address is usually its official website. Companies often list their location on a Contact Us page or in the footer of the site. Our goal is to fetch the site’s HTML and parse out the address. Two key techniques are useful here:
HTML Parsing: Scanning the raw HTML for address information (e.g. within `<address>` tags or specific DOM elements).
Structured Data (schema.org): Many sites embed structured metadata (like JSON-LD or Microdata) containing addresses. For example, websites often include a schema.org `Organization` or `LocalBusiness` object with a `PostalAddress` field, which provides a structured breakdown of the address (street, city, postal code, etc.). Leveraging this data can make extraction much easier.
Steps to Scrape the Official Site Address:
Fetch the Website HTML: Use an HTTP client (e.g. `axios` or `node-fetch`) to retrieve the page. If you know the address is on a specific page (like a “Contact” or “About” page), target that URL. Otherwise, start with the homepage or footer.
Load into Cheerio: Cheerio is a fast HTML parsing library that emulates jQuery’s selector API in Node. This allows us to easily query the DOM for specific elements.
Search for Structured Data: Check for `<script type="application/ld+json">` tags containing JSON-LD. These often contain an `"address"` field if the site is using SEO-friendly markup. You can parse the JSON and look for an object of type `"PostalAddress"`, which includes properties like `streetAddress`, `addressLocality`, `addressRegion`, `postalCode`, etc.
Fallback to HTML Elements: If no JSON-LD is found, search for clues in the HTML text:
Look for an `<address>` tag or any element with keywords like “Address”, or common address patterns (number and street, city names, zip codes).
Many sites have the address in the footer or a contact section. Sometimes it’s inside a specific class or ID (e.g. `<div class="address">`).
You can use Cheerio selectors or even regex as a last resort to find something that looks like an address.
Extract and Clean: Once found, extract the text and trim it. Remove any unwanted whitespace or line breaks. If the address is split across multiple elements (street in one, city in another), concatenate them appropriately (in the correct order).
Below is a TypeScript code snippet demonstrating this process. It tries to find structured data first and falls back to searching for an `<address>` tag:
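This is a minimal sketch, assuming `axios` and `cheerio` are installed; the helper name `scrapeWebsiteAddress` and the regex fallback are illustrative, so tune the selectors for the site you are targeting:

```ts
import axios from "axios";
import * as cheerio from "cheerio";

interface PostalAddress {
  streetAddress?: string;
  addressLocality?: string;
  addressRegion?: string;
  postalCode?: string;
  addressCountry?: string | { name?: string };
}

async function scrapeWebsiteAddress(url: string): Promise<string | null> {
  const { data: html } = await axios.get<string>(url);
  const $ = cheerio.load(html);

  // 1. Structured data: look through JSON-LD scripts for an "address" object.
  for (const el of $('script[type="application/ld+json"]').toArray()) {
    try {
      const json = JSON.parse($(el).contents().text());
      const nodes = Array.isArray(json) ? json : [json];
      for (const node of nodes) {
        const addr: PostalAddress | undefined = node?.address;
        if (addr && (addr.streetAddress || addr.addressLocality)) {
          const country =
            typeof addr.addressCountry === "object"
              ? addr.addressCountry?.name
              : addr.addressCountry;
          return [addr.streetAddress, addr.addressLocality, addr.addressRegion, addr.postalCode, country]
            .filter(Boolean)
            .join(", ");
        }
      }
    } catch {
      // Ignore malformed JSON-LD blocks and keep looking.
    }
  }

  // 2. Fall back to an <address> tag, if the site uses one.
  const addressTag = $("address").first().text().trim();
  if (addressTag) return addressTag.replace(/\s+/g, " ");

  // 3. Last resort: a rough regex for US-style "123 Main St, Springfield, IL 62704".
  const match = html.match(/\d{1,5}[\w .-]+,\s*[\w .-]+,\s*[A-Z]{2}\s*\d{5}/);
  return match ? match[0] : null;
}
```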
In this code, we use Cheerio to find any JSON-LD scripts. If an address is present in the structured data, we assemble it from its parts (street, city, region, etc.). If not, we try an `<address>` tag, and then a regex pattern as a last attempt. You can expand these heuristics as needed (for instance, looking for elements with specific classes or labels that indicate an address).
Tips for Website Scraping: Always consider the structure of the specific site:
View the page source or use browser dev tools to identify where the address lives in the DOM. Adjust your Cheerio selectors accordingly.
Some sites might load content dynamically (via JavaScript) – in such cases, a simple HTTP GET + Cheerio (which only sees initial HTML) might not capture the address. If you suspect this (e.g., the address isn’t in the HTML source), you may need to use a headless browser (Puppeteer) to render the page (we’ll discuss Puppeteer shortly).
Respect robots.txt and the site’s terms of service. Scrape gently and not too frequently.
2. Scraping Addresses from Social Media Platforms
Brands often maintain social media pages that contain public address information (especially for local businesses or offices). We’ll examine how to gather address data from Facebook, LinkedIn, and Instagram. Each platform has different structures and restrictions, so our approach will vary for each:
2.1 Facebook Pages
Facebook pages for businesses commonly list the address (if provided by the page owner) in the “About” section. However, scraping Facebook has challenges:
The content is loaded dynamically and often requires a logged-in session to view fully.
Class names in the HTML are auto-generated and can change (making CSS selectors brittle).
Facebook offers a Graph API for pages, which can provide location data in a structured way, but it requires an access token and specific permissions (e.g., Page Public Content Access to read public page data).
Approach 1: Graph API (If Feasible) – If you have a Facebook Developer app and the appropriate token, you can query the Graph API for a page’s address. For example, using Facebook’s Node.js SDK or a simple HTTP GET:
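A hedged sketch of the HTTP route using axios (the Graph API version, the `fields` list, and the `fetchFacebookPageLocation` helper are assumptions to adapt to your own app's setup):

```ts
import axios from "axios";

async function fetchFacebookPageLocation(pageId: string, accessToken: string) {
  // Ask the Graph API for the page's name and location fields.
  const { data } = await axios.get(`https://graph.facebook.com/v19.0/${pageId}`, {
    params: { fields: "name,location", access_token: accessToken },
  });
  // If the page has an address set, data.location typically includes
  // street, city, state, zip, and country.
  return data.location ?? null;
}
```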
This would return a JSON object with the page’s location details (if the page has one). Keep in mind the API may require approval for the permissions if used broadly.
Approach 2: HTML Scraping with Puppeteer – For a general solution without API access, we can use Puppeteer, a headless Chromium browser, to load the Facebook page and scrape the address. The headless browser can execute the necessary JavaScript and bypass some anti-scraping measures. Facebook pages load content via JavaScript, so Puppeteer is well-suited here.
Steps:
Launch a Puppeteer browser and open the business page URL (e.g. `https://www.facebook.com/YourBrandName/` for the brand’s page).
Wait for the page to load the content (you might use `waitUntil: 'domcontentloaded'` or a short manual delay).
Extract the address from the page’s DOM. You can either use Puppeteer’s page functions (like `page.$eval` to run a querySelector in the browser context) or get the page’s HTML and parse it with Cheerio.
Facebook’s DOM is complex, but by inspecting a sample business page, you might find the address inside a specific container. For instance, in one recent Facebook layout, the address was found within a `<ul>` list in a `div` with class name `x1heor9g` (this class name is an example that Facebook generated). We can use such a selector to retrieve the text. Keep in mind these classes can change, so you might need to adjust your scraper if Facebook updates their markup.
Example: Scraping a Facebook Page’s Address with Puppeteer
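A hedged Puppeteer sketch, assuming the address lives in a `div` with the auto-generated class `x1heor9g` seen in one layout; the selector, delay, and helper name are placeholders you should adapt after inspecting the page yourself:

```ts
import puppeteer from "puppeteer";

async function scrapeFacebookAddress(pageUrl: string): Promise<string | null> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(pageUrl, { waitUntil: "domcontentloaded" });

    // Give Facebook's client-side rendering a few seconds to fill in the About info.
    await new Promise((resolve) => setTimeout(resolve, 3000));

    // Pull the text out of the container that held the address in one observed layout.
    // This class name is auto-generated and will likely change over time.
    const address = await page.evaluate(() => {
      const container = document.querySelector("div.x1heor9g");
      return container ? container.textContent : null;
    });

    return address ? address.trim() : null;
  } finally {
    await browser.close();
  }
}
```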
In this snippet, we navigate to the Facebook page and then search for a `<div>` with class `x1heor9g` inside the content. In one observed layout, that was the container holding the address. We then extract and trim the text. In practice, you should inspect the particular page you are scraping; the classes or structure might differ. Sometimes the address might be inside an `<a>` tag (if it’s a map link), or have a label like “Address:” preceding it. Adjust your selector logic accordingly.
Notes & Tips for Facebook:
Authentication: If the page information isn’t public, or if Facebook’s public view is limited, you may need to log in. Puppeteer can automate a login sequence (filling the username/password), but be very careful with Facebook’s terms of service. Many scrapers avoid logging in to not violate terms.
API vs Scraping: Using the Graph API is more stable and structured (no HTML parsing needed) but requires setup and permission. Direct scraping can break if Facebook changes their UI. Weigh these options based on your use case.
Performance: Puppeteer is heavier than a simple HTTP request. Use it only when necessary (e.g., for pages that require JS rendering). For example, a static official website might not need Puppeteer at all, but Facebook and other SPA-like pages do.
Rate limiting: Facebook may detect and block scrapers, especially if making many requests quickly. Use delays, rotate IPs if possible, and scrape responsibly.
2.2 LinkedIn Company Profiles
LinkedIn company pages often list a company’s headquarters or address in the “About” section. For example, a LinkedIn page might show “Headquarters: Springfield, IL” or a full address if provided. However, LinkedIn is notably strict with scraping:
Much of the content is behind a login. If you visit a company profile without being logged in, you’ll typically see only a very limited snippet before a login prompt.
LinkedIn has strong anti-scraping measures (bot detection, rate limiting, IP blocking).
Possible Approaches:
Official API: LinkedIn does have an API, but it’s not openly available to all developers (it often requires partnerships or approved use cases, and it doesn’t let you pull arbitrary company profiles). For most, this is not an option.
Headless Browser with Login: You can use Puppeteer to automate a login to LinkedIn and then navigate to the company page. This is complex (handling 2FA, cookies, etc.) and against LinkedIn’s terms if not done carefully. Alternatively, some scraping APIs or services handle LinkedIn by using real browsers and rotating proxies.
Public Page Scraping: In some cases, certain basic info might be visible without login. For example, using Google’s web cache or a bot that mimics a crawler might get some data. Generally though, to get the address you likely need to authenticate.
Assuming we go with a Puppeteer approach, here’s an outline:
Launch Puppeteer, navigate to LinkedIn’s login page. Provide credentials (could be stored in env variables).
After login, go to the company’s page (e.g. `https://www.linkedin.com/company/YourBrand/about/`).
Wait for the content to load. The “About” subpage typically has the company overview including location.
Scrape the text of the address. It might be in a section labeled “Headquarters” or listed alongside other info like “Founded”, “Employees”, etc.
Clean up and output the address.
Because LinkedIn often lists just the city and state of the headquarters rather than a full street address, you might only retrieve partial information. Still, this can be useful to merge with other sources.
Important: LinkedIn scraping often requires rotating IPs and user-agents to avoid blocks. Using a scraping service or proxy network can help (try brand.dev). Also note that company profiles are considered public in the sense that if you can authenticate, you can view them, but you should not scrape at a rate that impacts their servers or violates usage policies.
(We won’t provide a full LinkedIn Puppeteer code example here due to its complexity and LinkedIn’s policies, but it would be similar in spirit to the Facebook Puppeteer example – automating login if needed, then querying the DOM for the address.)
2.3 Instagram Business Profiles
Instagram is primarily visual, but business or creator profiles can list contact info:
They often have a bio section where some businesses might put an address or location.
Business accounts can add an address through Instagram’s tools, which then shows up as a clickable “Directions” button or a small line of text with the location. On the web profile, this might not be directly visible without clicking.
Challenges for Instagram:
The web profile may not show the address unless you’re logged in or if you click an element. Instagram also tries to block web scraping by requiring login after a few page loads.
There is an official Instagram Basic Display API, but it’s limited to your own account’s data (or users who authorize your app) – not useful for arbitrary brands.
Like LinkedIn, Instagram employs anti-scraping measures. You might encounter bot detection if making many requests.
Approach:
Use Puppeteer to load the Instagram profile page. You might need to log in to see the full profile details if the account is private or if Instagram blocks anonymous access after a certain point.
Check the profile’s page HTML. Instagram often embeds the profile info in a script tag as JSON (for initial render). You may find a snippet of JSON containing `address_json` or similar if the profile has a declared business address (see the sketch after this list).
Alternatively, if the address is in the bio text, you can parse it from there (though it might just be a city name or area).
Another trick: Instagram’s web internal API (undocumented) can sometimes be queried. For example, appending `?__a=1` to a profile URL used to return JSON data for the profile. This has been restricted in recent years (often requiring an authenticated session cookie). If it works, it would provide structured data including location.
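Here is a hedged sketch of the script-tag approach above. It assumes the profile HTML still embeds an `address_json` field and that the field names inside it (`street_address`, `city_name`, `zip_code`) match what Instagram has used in some past payloads; all of this changes frequently, so treat it purely as a starting point:

```ts
import puppeteer from "puppeteer";

async function scrapeInstagramAddress(profileUrl: string): Promise<string | null> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(profileUrl, { waitUntil: "networkidle2" });
    const html = await page.content();

    // Look for an embedded, escaped JSON string such as:
    //   "address_json":"{\"street_address\":\"123 Main St\", ...}"
    const match = html.match(/"address_json":"((?:\\.|[^"\\])*)"/);
    if (!match) return null;

    // Unescape the inner string, then parse it as JSON.
    const addressInfo = JSON.parse(JSON.parse(`"${match[1]}"`));
    // These field names are assumptions based on past Instagram payloads.
    return [addressInfo.street_address, addressInfo.city_name, addressInfo.zip_code]
      .filter(Boolean)
      .join(", ");
  } finally {
    await browser.close();
  }
}
```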
Tips for Instagram:
Space out your requests, and consider using proxies for multiple profiles.
If using Puppeteer, you might emulate a mobile device (Instagram might serve data differently on mobile web).
As with other platforms, ensure you handle errors (like if a profile doesn’t have an address or if the account is not found).
Comparing Approaches for Social Media
Each social media platform requires a tailored strategy. Below is a summary comparison of methods to obtain address data from these sources:
| Source | Data Availability | Access Requirements | Recommended Method |
| --- | --- | --- | --- |
| Official Website | Full address often listed on site (HTML or JSON-LD). | Public (usually no login needed). | HTTP fetch + Cheerio parsing (fast). |
| Facebook Page | Address in “About” (if set by page). Graph API provides structured location. | Public page content, but JS-rendered. API requires token and permission. | Puppeteer to scrape dynamic HTML; or Graph API (with proper access). |
| LinkedIn Profile | Headquarters address or location in About section. | Login typically required; anti-bot measures in place. | Puppeteer with login (simulate a real user); rotate proxies. |
| Instagram Profile | Address possibly in bio or as location link for business profiles. | Often requires login after a few requests; API limited. | Puppeteer (possibly mobile emulation) to scrape profile JSON or bio. |
3. Merging and Normalizing Address Information
After gathering address snippets from multiple sources, the next challenge is merging them into one complete, structured address. Different sources might give partial data:
The official site might have the full street address but not explicitly mention the country (assuming local context).
The Facebook page might list just city and state.
LinkedIn might only give city and state, or sometimes the street if it’s a specific office location.
There could be discrepancies or formatting differences (e.g., “St.” vs “Street”, abbreviations, etc.).
Heuristics for Merging:
Identify Unique Parts: Break each address into components. For example, you can attempt a simple split by commas or newlines, since addresses are often written as “Street, City, State Zip, Country”. This isn’t foolproof (addresses can have commas within them), but it’s a start.
Normalize Formats: Convert all pieces to a standard format (e.g., state names to their abbreviation or vice versa, country names to full names, etc., depending on desired output).
Combine Data: Use the most authoritative or complete source as the base. Then fill missing pieces from other sources. For instance:
If the website gave “123 Main St, Springfield”, and LinkedIn gave “Springfield, IL”, you can deduce the combined address is “123 Main St, Springfield, IL”.
If one source lists a country and another doesn’t, add the country.
Be cautious of conflicts (e.g., one source says “New York, NY” and another says “New York, NJ” – you’d need a way to verify which is correct, possibly via a third source).
Validation: If possible, validate the merged address. This could be done by a quick query to a geocoding API or an address validation service, or using known patterns (ZIP code should match the state, etc.).
NLP Tools for Address Parsing and Normalization:
For a more robust solution, you can leverage libraries or APIs that specialize in address parsing:
Libpostal: An open-source library (with Node.js bindings via `node-postal`) that uses NLP to parse and normalize addresses worldwide. This can split an address string into components (house number, road, city, state, country, etc.) and even expand abbreviations. For example, using `postal.parser.parse_address("781 Franklin Ave Crown Heights Brooklyn NY 11238")` would return structured components. You can feed each source’s address into libpostal, get structured parts, and then intelligently merge those.
Other Node address parsers: e.g., `parse-address` or `addressit` (which use regex and rules, often for US addresses). These can be easier to use but might be locale-specific.
Heuristic scripts: If you know your data is mostly from one country or format, you can write simple rules. For example, if you expect a US address, you might regex for a state abbreviation (two capital letters) and ZIP (5 digits) to isolate that part, etc.
Example – Simple Merge: Suppose we got the following:
From official site: “123 Main Street, Springfield”
From Facebook: “Springfield, IL”
From LinkedIn: “United States” (just country in this hypothetical case)
A simple merge script might do:
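Here is a naive sketch; `findStreet`, `findCity`, `findState`, and `findCountry` are rough, US-centric stand-ins for real parsing logic:

```ts
interface AddressParts {
  street?: string;
  city?: string;
  state?: string;
  country?: string;
}

// Very rough, US-centric helpers, for illustration only.
const findStreet = (s: string) => s.match(/\d+\s[\w .]+?(?=,|$)/)?.[0];
const findState = (s: string) => s.match(/\b[A-Z]{2}\b/)?.[0];
const findCountry = (s: string) =>
  /united states|usa/i.test(s) ? "United States" : undefined;
const findCity = (s: string) =>
  s
    .split(",")
    .map((p) => p.trim())
    .find((p) => /^[A-Za-z ]+$/.test(p) && !findState(p) && !findCountry(p));

function mergeAddresses(sources: string[]): string {
  const merged: AddressParts = {};
  for (const raw of sources) {
    // Keep the first value found for each component across sources.
    merged.street ||= findStreet(raw);
    merged.city ||= findCity(raw);
    merged.state ||= findState(raw);
    merged.country ||= findCountry(raw);
  }
  return [merged.street, merged.city, merged.state, merged.country]
    .filter(Boolean)
    .join(", ");
}

console.log(
  mergeAddresses(["123 Main Street, Springfield", "Springfield, IL", "United States"])
);
// -> "123 Main Street, Springfield, IL, United States"
```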
Of course, implementing `findStreet`, `findCity`, etc. requires some logic or pattern matching. This is where an NLP library shines, because it can do this parsing for you. If using libpostal, it might return something like:
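Roughly the shape `node-postal` gives back for that example (libpostal lowercases values, and the exact components vary by input):

```ts
const parsed = [
  { component: "house_number", value: "781" },
  { component: "road", value: "franklin ave" },
  { component: "suburb", value: "crown heights" },
  { component: "city_district", value: "brooklyn" },
  { component: "state", value: "ny" },
  { component: "postcode", value: "11238" },
];
```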
From which you can easily pick out city or state.
Cleaning & Normalization Tips:
Trim and Case: Remove surrounding whitespace and use consistent casing (e.g., title case for street/city, uppercase for state codes, etc.).
Expand or Abbreviate: Decide if you want “Street” or “St.”, and apply consistently. Libraries like libpostal can expand abbreviations for you (e.g., "St" -> "Street").
Remove Duplicates: If one part is repeated (sometimes one source might include city and state together, and another gives them separately), ensure you don’t repeat in the final string.
Use Delimiters: Join with commas or newlines based on your needs. Commas are common in one-line addresses; newlines are common in postal format (street on one line, city/state/ZIP on next, country on final line for international).
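A tiny helper tying a few of these tips together might look like this (a US-centric sketch with a hand-rolled abbreviation map; libpostal handles this far more robustly):

```ts
const ABBREVIATIONS: Record<string, string> = {
  St: "Street",
  Ave: "Avenue",
  Rd: "Road",
};

function normalizeAddressLine(raw: string): string {
  return raw
    .split(",")
    .map((part) => part.trim())
    // Drop empty pieces and exact duplicates (e.g. a city repeated by two sources).
    .filter((part, i, parts) => part.length > 0 && parts.indexOf(part) === i)
    // Expand a few common street abbreviations.
    .map((part) =>
      part.replace(/\b(St|Ave|Rd)\b\.?/g, (m) => ABBREVIATIONS[m.replace(".", "")])
    )
    .join(", ");
}

// normalizeAddressLine("123 Main St.,  Springfield, Springfield, IL")
// -> "123 Main Street, Springfield, IL"
```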
Merging address data is about being smart with string manipulation or using a dedicated address parsing tool to avoid re-inventing the wheel. By cross-referencing multiple sources, you increase confidence in the accuracy of the final address. For example, if the official site and the Facebook page both independently mention “Springfield, IL”, you can be pretty sure that’s correct for the city and state.
4. Scheduling Scraping: One-time vs Recurring
Depending on your use case, you might need to run this scraping process just once (one-off script) or at regular intervals to keep the address updated. Here’s how to handle both scenarios:
One-Time Scraping: This is straightforward – you can run your Node.js script manually or as part of a build/deploy process. For example, if you are populating a database of brand addresses, you might run the scraper once to get initial data. Just be sure to log or store the result because subsequent runs may be unnecessary unless data changes.
Recurring Scraping: If you want to monitor changes or ensure you always have the latest address (addresses don’t change often, but it could happen if a business moves), you can schedule the scraper to run periodically (daily, weekly, monthly).
Options for Scheduling in Node.js:
Cron Jobs (System-level): On a Unix-like system, you can use crontab to schedule the execution of your Node script at a set interval. For example, add a cron entry to run `node scrapeAddresses.js` nightly. This is external to your app’s code.
Node Cron Libraries: There are packages like `node-cron` or `node-schedule` that allow you to schedule jobs within your Node application. This is useful if you want the scheduling to be part of your program logic (say, you have a long-running service that should scrape every hour). For instance, using node-cron (see the sketch after this list), the cron expression `'0 0 * * *'` means midnight (00:00) every day. You can adjust expressions for different schedules (cron format is very flexible).
Background Workers / Queues: In a larger application, you might have a worker process that handles scraping jobs. Tools like BullMQ (Redis-based queue) or Agenda can schedule jobs too. These allow more complex job management (retries, failure handling, staggered schedules, etc.). For example, Agenda is a job scheduler that works with MongoDB, and BullMQ uses Redis – both can be set to repeat jobs on a schedule.
Cloud Schedulers / Serverless: If your infrastructure is cloud-based, consider using services like AWS Lambda with CloudWatch scheduled events, or Heroku Scheduler, etc., to trigger your scraping script on a schedule. This offloads the scheduling to the platform.
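Here is the node-cron sketch referenced above, assuming `node-cron` is installed; `scrapeAllAddresses` is a hypothetical entry point wrapping the scrapers from earlier sections:

```ts
import cron from "node-cron";

// Hypothetical wrapper around the scrapers from earlier sections.
async function scrapeAllAddresses(): Promise<void> {
  // ...call scrapeWebsiteAddress, scrapeFacebookAddress, etc. and persist the results
}

// '0 0 * * *' = minute 0, hour 0, i.e. every day at midnight.
cron.schedule("0 0 * * *", async () => {
  console.log("Running scheduled address scrape...");
  try {
    await scrapeAllAddresses();
  } catch (err) {
    console.error("Scheduled scrape failed:", err);
  }
});
```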
When doing recurring scrapes, also consider:
Logging: Keep logs of when scrapes happen and if they succeeded. This helps in monitoring and debugging.
Differences Only: You might not need to store the address every time if it hasn’t changed. You could compare the newly scraped address with the last known address and only update your database if there’s a change, to avoid clutter.
Rate and Etiquette: Even if automated, don’t scrape too frequently. For a company address, checking once a day or even once a week is usually sufficient. Hitting the source website or social pages too often could be unnecessary load and may get your scraper blocked.
5. Handling Common Challenges and Best Practices
Scraping web data can be tricky. Here are some common challenges you might face when scraping addresses, and tips to handle them:
Anti-Bot Protections: As discussed, platforms like LinkedIn and Instagram are aggressive in blocking bots. To mitigate this:
Use rotating IP proxies to spread out requests and avoid IP-based blocking.
Throttle your scraping speed – introduce random delays between page requests.
Randomize your User-Agent header so you don’t always appear as the same program. There are libraries and lists of user-agent strings you can cycle through.
Use headless browsers like Puppeteer or Playwright to simulate real user interactions (some sites detect typical HTTP libraries but will serve content to a real browser). You can even automate scrolling or clicking if content loads on user interaction.
Some sites use CAPTCHAs or login walls – at that point, you might consider services or APIs that handle these, or ultimately, rely on official data sources instead of scraping.
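As a small illustration of the throttling and User-Agent rotation tips above, here is a hedged sketch (the user-agent strings and delay bounds are arbitrary examples):

```ts
import axios from "axios";

// A few example user-agent strings to cycle through (swap in your own list).
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
];

const randomDelay = (minMs: number, maxMs: number) =>
  new Promise((resolve) => setTimeout(resolve, minMs + Math.random() * (maxMs - minMs)));

async function politeFetch(url: string): Promise<string> {
  // Wait 2 to 5 seconds before each request to avoid hammering the site.
  await randomDelay(2000, 5000);
  const userAgent = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
  const { data } = await axios.get<string>(url, { headers: { "User-Agent": userAgent } });
  return data;
}
```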
Malformed or Inconsistent Data: Addresses might not always be in the exact format you expect:
Some might be missing commas, have different ordering (e.g., “Paris, 75001, FR” vs “75001 Paris, FR”). Try to account for variations, perhaps by allowing regex patterns rather than fixed parsing rules.
International addresses can have different formats (not all are “street, city, state, zip”). If you expect global data, a library like libpostal is very useful for normalization.
Remove extraneous text. Sometimes addresses come with directions or descriptions (e.g., “123 Main St, Springfield (next to Central Park)”). You may need to strip out the parenthetical part if it’s not needed.
Error Handling: Always code defensively:
Network requests can fail – use try/catch around axios/puppeteer operations. Implement retries with backoff for transient errors.
If a selector isn’t found (e.g., Facebook changed their DOM), handle that gracefully (maybe log a warning and move on, or try an alternative strategy).
When parsing JSON from a script tag, wrap it in try/catch as shown above – bad or unexpected JSON shouldn’t crash your script.
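For the retry-with-backoff idea above, a minimal generic helper might look like this (the attempt count and base delay are arbitrary defaults):

```ts
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: wait 1s, 2s, 4s, ... between attempts.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}

// Usage (hypothetical): const html = await withRetries(() => axios.get(url).then((r) => r.data));
```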
Legal Considerations: Remember that scraping should respect the terms of service of the websites. Public data is generally accessible, but using it at scale or for certain purposes might violate terms. Be especially careful with logged-in scraping (you don’t want to get a LinkedIn or Instagram account banned). In some cases, using an official API (even if limited) might be safer long-term than HTML scraping.
Use of Caches: If you have to scrape the same sources repeatedly, consider caching the page data to reduce hits. For example, if you check daily and the address rarely changes, you could save the last fetched HTML and only re-fetch if a certain time has elapsed or if you detect a change via an API.
Scaling: If you need to scrape addresses for many brands, do it in batches and possibly in parallel (but not too many at once to avoid getting blocked). Use asynchronous operations in Node to handle multiple scrapes concurrently, and manage concurrency with limits (e.g., process 5 at a time).
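To keep concurrency in check when scraping many brands, a simple batching sketch (a batch size of 5, as mentioned above) could look like this:

```ts
async function scrapeInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  batchSize = 5
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Run up to `batchSize` scrapes in parallel, then move on to the next batch.
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}

// Usage (hypothetical): await scrapeInBatches(domains, scrapeWebsiteAddress, 5);
```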
If you're building products that rely on accurate brand data, scraping addresses manually is just the start. With tools like Cheerio and Puppeteer, you can get pretty far, but stitching together reliable, structured data across the web takes time. That’s exactly why we built brand.dev: to give developers instant access to logos, color palettes, fonts, social links, and yes, even structured address data, all from a single API.
If you're tired of building and maintaining brittle scraping logic, give it a try. It'll save you hours and make your product feel a whole lot smarter.