How to Find All Current and Archived URLs on a Website

There are many reasons you might need to find all of the URLs on a website, but your exact goal will determine what you're searching for. For example, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Uncover all 404 URLs to recover from post-migration errors
In each case, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your website's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky. If you do turn up an old sitemap, though, extracting its URLs takes only a few lines, as shown in the sketch below.
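
This is a minimal sketch for pulling every URL out of a standard XML sitemap; the filename is a placeholder for whatever export you found:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace defined by sitemaps.org
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Placeholder filename: point this at the sitemap you recovered
tree = ET.parse("old-sitemap.xml")

# Each <url><loc> element holds one page URL
urls = [
    loc.text.strip()
    for loc in tree.getroot().findall("sm:url/sm:loc", NS)
    if loc.text
]

with open("sitemap-urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Recovered {len(urls)} URLs")
```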

Archive.org
Archive.org is an invaluable, donation-funded tool for SEO tasks. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few constraints:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
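
You can also skip the web interface entirely: Archive.org exposes its index through the Wayback Machine's CDX API, which isn't subject to the 10,000-URL display limit. Here's a minimal sketch (the domain is a placeholder; see the CDX API documentation for the full parameter list):

```python
import requests

# Wayback Machine CDX API; example.com is a placeholder domain
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",  # match every path on the domain
        "output": "json",
        "fl": "original",        # return only the original URL field
        "collapse": "urlkey",    # deduplicate repeated captures
    },
    timeout=60,
)
rows = resp.json()

# The first row is a header; the rest are single-element rows of URLs
urls = [row[0] for row in rows[1:]]
print(f"Found {len(urls)} archived URLs")
```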

Moz Pro
Though you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API (sketched below) to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
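
If you go the API route, the request looks roughly like the sketch below. This assumes the Moz Links API v2 `links` endpoint with credential-based Basic auth; treat the endpoint and payload fields as assumptions and confirm them against Moz's current API documentation before use.

```python
import requests

# Assumed Moz Links API v2 endpoint; verify against Moz's docs
API_URL = "https://lsapi.seomoz.com/v2/links"

# Placeholder credentials from your Moz account
AUTH = ("YOUR_ACCESS_ID", "YOUR_SECRET_KEY")

# Assumed payload shape: links pointing anywhere on the root domain
payload = {
    "target": "example.com/",
    "target_scope": "root_domain",
    "limit": 50,
}

resp = requests.post(API_URL, json=payload, auth=AUTH, timeout=60)
resp.raise_for_status()

# Collect the URLs on your own site that the inbound links point at
target_urls = {link["target"] for link in resp.json().get("results", [])}
print(f"Collected {len(target_urls)} target URLs")
```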

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets, as sketched below. There are also free Google Sheets plugins that simplify pulling more extensive data.
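
For reference, a minimal sketch using the official google-api-python-client, assuming you've already completed an OAuth flow (the property URL, date range, and token filename are placeholders):

```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Placeholder token file produced by a prior OAuth flow
creds = Credentials.from_authorized_user_file("token.json")
service = build("searchconsole", "v1", credentials=creds)

response = service.searchanalytics().query(
    siteUrl="https://example.com/",  # placeholder property
    body={
        "startDate": "2024-01-01",   # placeholder date range
        "endDate": "2024-03-31",
        "dimensions": ["page"],
        "rowLimit": 25000,           # API maximum per request
    },
).execute()

pages = [row["keys"][0] for row in response.get("rows", [])]
print(f"Retrieved {len(pages)} pages with impressions")
```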

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create distinct URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click “Create a new segment.”


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights. If you'd rather pull this data programmatically, see the sketch below.
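
Here's a minimal sketch using the google-analytics-data client library, assuming you've configured service-account credentials and know your numeric GA4 property ID (the property ID and date range are placeholders):

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account key
client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",  # placeholder property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    limit=100000,
)

response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(f"Pulled {len(paths)} page paths from GA4")
```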

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path queried by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be massive, so many sites only keep the last two months of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process; a starting point is sketched after this list.
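
As a starting point, a short script can extract the requested paths yourself. This is a minimal sketch assuming the standard Apache/Nginx combined log format (the filename is a placeholder):

```python
import re

# Matches the request line in Apache/Nginx combined log format:
# ... "GET /some/path HTTP/1.1" ...
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log") as f:  # placeholder filename
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1))

print(f"Found {len(paths)} unique requested paths")
```
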
Combine, and good luck
Once you've gathered URLs from these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
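
In a Jupyter Notebook, the normalization and dedupe step might look like the sketch below, which assumes you've saved each source's URLs to its own text file, one URL per line (the filenames are placeholders):

```python
import pandas as pd
from urllib.parse import urlsplit, urlunsplit

# Placeholder filenames: one URL per line, one file per source
sources = ["sitemap-urls.txt", "archive-urls.txt", "gsc-urls.txt", "log-paths.txt"]

def normalize(url: str) -> str:
    """Lowercase the scheme and host, drop fragments and trailing slashes."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path.rstrip("/"), parts.query, "")
    )

frames = []
for path in sources:
    with open(path) as f:
        frames.append(
            pd.DataFrame({"url": [line.strip() for line in f if line.strip()]})
        )

combined = pd.concat(frames, ignore_index=True)
combined["url"] = combined["url"].map(normalize)

deduped = combined.drop_duplicates().sort_values("url")
deduped.to_csv("all-urls.csv", index=False)
print(f"{len(deduped)} unique URLs")
```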

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
