How to Find All Current and Archived URLs on a Website

There are many reasons you might want to find all the URLs on a website, but your exact goal will determine what you’re looking for. For instance, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors

In each scenario, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and hard to extract data from.

In this post, I’ll walk you through some tools to build your URL list and then deduplicate the data using a spreadsheet or Jupyter Notebook, depending on your site’s size.

Old sitemaps and crawl exports
If you’re looking for URLs that disappeared from the live site recently, there’s a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven’t already, check for these files; they can often provide what you need. But if you’re reading this, you probably didn’t get so lucky.

Archive.org
Archive.org is a useful tool for SEO tasks, funded by donations. If you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn’t a built-in way to export the list.

To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn’t indicate whether Google indexed a URL, but if Archive.org found it, there’s a good chance Google did, too.
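If you’d rather not rely on a scraping plugin, Archive.org also exposes its index programmatically through the Wayback Machine CDX API. Here’s a minimal sketch in Python using the requests library (example.com is a placeholder for your own domain):

import requests

# Query the Wayback Machine's CDX API for archived URLs under a domain.
# "example.com" is a placeholder; swap in your own domain.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",   # prefix match for everything under the domain
        "output": "json",
        "fl": "original",         # return only the original URL field
        "collapse": "urlkey",     # one row per unique URL, not per capture
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # the first row is a header
print(len(urls), "archived URLs found")

Collapsing on urlkey is what keeps the list to one row per unique URL rather than one row per snapshot.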

Moz Pro
While you would typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re managing a massive website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.
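For reference, a request to Moz’s v2 Links API might look something like the sketch below. The endpoint, parameter names, and response fields here are assumptions based on Moz’s documentation, so verify them against the current API docs before building on this:

import requests

# Hypothetical sketch of a Moz Links API (v2) request; the endpoint and
# field names are assumptions, so check Moz's current documentation.
ACCESS_ID = "your-access-id"    # placeholder credentials
SECRET_KEY = "your-secret-key"

resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",  # assumed v2 endpoint
    auth=(ACCESS_ID, SECRET_KEY),         # HTTP Basic auth
    json={
        "target": "example.com/",
        "target_scope": "root_domain",    # assumed parameter name
        "limit": 50,
    },
    timeout=60,
)
for link in resp.json().get("results", []):
    # "target" is assumed to hold the URL on your site being linked to
    print(link.get("target"))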

It’s important to note that Moz Pro doesn’t confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. However, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
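As a sketch, here’s what pulling pages from the Search Console API can look like with the google-api-python-client library. It assumes you’ve already created OAuth credentials (the creds object) and verified the property:

from googleapiclient.discovery import build

# Assumes `creds` holds OAuth credentials authorized for Search Console.
service = build("searchconsole", "v1", credentials=creds)

response = service.searchanalytics().query(
    siteUrl="https://example.com/",  # your verified property
    body={
        "startDate": "2024-01-01",
        "endDate": "2024-03-31",
        "dimensions": ["page"],
        "rowLimit": 25000,  # API maximum per request; paginate with startRow
    },
).execute()

pages = [row["keys"][0] for row in response.get("rows", [])]
print(len(pages), "pages with impressions")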

Indexing → Pages report:


This section offers exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click “Create a new segment.”


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they provide valuable insights.
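You can also script the same filtered pull with the GA4 Data API. Below is a minimal sketch using the google-analytics-data client library; the property ID is a placeholder, and authentication is assumed to come from application default credentials:

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # assumes application default credentials

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",  # mirrors the segment from Step 3
            ),
        )
    ),
    limit=100000,  # adjust or paginate for larger result sets
)
for row in client.run_report(request).rows:
    print(row.dimension_values[0].value)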

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be huge, so many sites only keep the last two months of data.
Complexity: Analyzing log files can be challenging, but numerous tools are available to simplify the process.
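As a starting point, even a few lines of Python can extract the unique paths from a common- or combined-format access log. This sketch assumes a file named access.log and a standard request line; adjust the regex to match your server’s log format:

import re

# Matches the request line of a common/combined log format entry, e.g.
# ... "GET /blog/post-1 HTTP/1.1" 200 ...
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1))

print(len(paths), "unique paths requested")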
Combine, and good luck
Once you’ve gathered URLs from these sources, it’s time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
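In a Jupyter Notebook, the formatting and deduplication step might look like this sketch, assuming you’ve concatenated all your exports into a single all_urls.txt file (one URL per line):

from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase scheme/host, drop fragments, and strip trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

# Read, normalize, and deduplicate in one pass using a set.
with open("all_urls.txt", encoding="utf-8") as fh:
    urls = {normalize(line) for line in fh if line.strip()}

with open("deduped_urls.txt", "w", encoding="utf-8") as fh:
    fh.write("\n".join(sorted(urls)))

How aggressively you normalize (trailing slashes, query strings, www vs. non-www) depends on how your site treats those variants, so tune the normalize function to your URL structure.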

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
