How to Find All Current and Archived URLs on a Website

There are plenty of reasons you might need to find all of the URLs on a website, and your exact goal will determine what you’re looking for. For instance, you might want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and difficult to extract data from.

In this post, I’ll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site’s size.

Old sitemaps and crawl exports
If you’re looking for URLs that disappeared from the live site recently, there’s a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven’t already, check for these files; they can often provide what you need. But if you’re reading this, you probably didn’t get so lucky.

Archive.org
Archive.org is a valuable tool for SEO tasks, funded by donations. If you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which may be insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn’t a built-in way to export the list.
To work around the missing export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn’t reveal whether Google indexed a URL, but if Archive.org found it, there’s a good chance Google did, too.
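
If you prefer to skip the scraping plugin, Archive.org also exposes its index through the Wayback Machine CDX API, which you can query directly. Below is a minimal Python sketch that pulls archived URLs for a domain; example.com, the 10,000-row limit, and the output handling are placeholders to adapt to your own site.

```python
# Minimal sketch: pull archived URLs for a domain from the Wayback Machine CDX API.
# Assumes the public endpoint at web.archive.org/cdx/search/cdx; "example.com"
# is a placeholder domain.
import requests

def fetch_wayback_urls(domain: str, limit: int = 10000) -> list[str]:
    params = {
        "url": f"{domain}/*",   # match every path on the domain
        "output": "json",       # JSON array-of-arrays response
        "fl": "original",       # only return the original URL column
        "collapse": "urlkey",   # collapse duplicate captures of the same URL
        "limit": limit,
    }
    resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params, timeout=60)
    resp.raise_for_status()
    rows = resp.json()
    # The first row is the header (["original"]); the rest are one-column rows.
    return [row[0] for row in rows[1:]]

if __name__ == "__main__":
    urls = fetch_wayback_urls("example.com")
    print(f"Fetched {len(urls)} archived URLs")
```

The response will still include malformed and resource-file URLs, so plan to filter the list afterwards.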

Moz Pro
Although you’d typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re working with a large website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.

It’s important to note that Moz Pro doesn’t confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.

Google Search Console
Google Search Console offers several valuable tools for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section offers exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t apply to the export, you may need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
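
As a rough illustration of pulling a larger page list than the UI export allows, here is a minimal sketch using the Search Console API via the google-api-python-client library. The property URL, credentials file, and date range are placeholder assumptions; you’ll need your own verified property and a service account (or OAuth) set up with access to it.

```python
# Minimal sketch: page URLs with search impressions via the Search Console API.
# SITE_URL and KEY_FILE are placeholders for your own property and credentials.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SITE_URL = "https://www.example.com/"   # placeholder: your verified property
KEY_FILE = "service-account.json"       # placeholder: your credentials file

creds = service_account.Credentials.from_service_account_file(
    KEY_FILE, scopes=["https://www.googleapis.com/auth/webmasters.readonly"]
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    body = {
        "startDate": "2024-01-01",
        "endDate": "2024-03-31",
        "dimensions": ["page"],
        "rowLimit": 25000,      # API maximum per request
        "startRow": start_row,  # paginate until no rows come back
    }
    resp = service.searchanalytics().query(siteUrl=SITE_URL, body=body).execute()
    rows = resp.get("rows", [])
    if not rows:
        break
    pages.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(f"Collected {len(pages)} pages with impressions")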

Indexing → Pages report:


This section provides exports filtered by issue type, though these too are limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click “Create a new segment.”

Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
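
If you’d rather pull this list programmatically than export it from the UI, the GA4 Data API can return page paths directly. The sketch below, using the google-analytics-data Python client, mirrors the /blog/ filter from the steps above; the property ID, date range, and credentials setup (via GOOGLE_APPLICATION_CREDENTIALS) are placeholder assumptions.

```python
# Minimal sketch: page paths from the GA4 Data API, filtered to /blog/.
# PROPERTY_ID is a placeholder; credentials come from the environment.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

PROPERTY_ID = "123456789"  # placeholder GA4 property ID

client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property=f"properties/{PROPERTY_ID}",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    # Equivalent of the /blog/ segment: only keep matching page paths.
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                value="/blog/",
                match_type=Filter.StringFilter.MatchType.CONTAINS,
            ),
        )
    ),
    limit=100000,
)
response = client.run_report(request)
blog_paths = [row.dimension_values[0].value for row in response.rows]
print(f"Found {len(blog_paths)} blog page paths")
```

Swap the filter value for other patterns to build separate lists per section of the site.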

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be huge, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process (a minimal parsing sketch follows below).
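
As an example of how simple the extraction itself can be, here is a minimal Python sketch that pulls unique request paths from an Apache/Nginx combined-format access log. The log format, the file name, and the decision to strip query strings are assumptions to adjust for your own setup.

```python
# Minimal sketch: extract unique request paths from a combined-format access log.
# Assumes the standard Apache/Nginx "combined" layout; CDN logs need a different pattern.
import re

# The quoted request field looks like: "GET /some/path?x=1 HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[\d.]+"')

def paths_from_log(log_file: str) -> set[str]:
    paths = set()
    with open(log_file, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = REQUEST_RE.search(line)
            if match:
                # Drop query strings so /page?a=1 and /page?a=2 collapse together.
                paths.add(match.group("path").split("?")[0])
    return paths

if __name__ == "__main__":
    unique_paths = paths_from_log("access.log")  # placeholder file name
    print(f"{len(unique_paths)} unique paths requested")
```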
Combine, and good luck
Once you’ve gathered URLs from all of these sources, it’s time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
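
For the Jupyter Notebook route, a minimal pandas sketch might look like the following. The source file names, the site origin used to expand bare paths from log files, and the normalization rules are all assumptions; adapt them to however you exported each list.

```python
# Minimal sketch: combine URL lists from the sources above and deduplicate them.
# Assumes one plain-text file of URLs per source; file names and ORIGIN are placeholders.
import pandas as pd

SOURCE_FILES = ["archive_org.txt", "gsc.txt", "ga4.txt", "logs.txt"]  # placeholders
ORIGIN = "https://www.example.com"  # placeholder site origin

def normalize(url: str) -> str:
    url = url.strip()
    if url.startswith("/"):      # bare path from a log file or GA4 export
        url = ORIGIN + url
    url = url.split("#")[0]      # drop fragments
    return url.rstrip("/")       # treat /page and /page/ as the same URL

frames = [pd.read_csv(path, header=None, names=["url"]) for path in SOURCE_FILES]
combined = pd.concat(frames, ignore_index=True)
combined["url"] = combined["url"].map(normalize)
deduped = combined.drop_duplicates(subset="url").sort_values("url")
deduped.to_csv("all_urls.csv", index=False)
print(f"{len(deduped)} unique URLs written to all_urls.csv")
```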

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
