Interim Linkscape Update for January

Jan 11

Posted by randfish

If you've been following my posts on Linkscape's index, you know that we've been aiming for fresher, better and larger indices over the past few months, but have run into some very tough challenges. It turns out that indexing the web, canonicalizing millions of pages and calculating a link graph with quality metrics is super-hard; who knew? :-)

As part of those efforts, we've been working toward an experimental index that leverages a more search-engine-style crawler, one that crawls fresher pages/sites more often and less fresh stuff less frequently. That index, however, is taking its sweet time (and we're doing a lot of babysitting and monitoring to make sure it goes smoothly). Our tentative plan is to have that index launched in the next 2 weeks, but we felt that since our last index was at the very end of November, a new one with fresher data was warranted. Hence, last night, we launched an interim index with the following metrics:

  • 36,660,519,013 (36 billion) URLs
  • 427,626,242 (427 million) Subdomains
  • 128,149,029 (128 million) Root Domains
  • 387,656,119,262 (387 billion) Links
  • Followed vs. Nofollowed

    • 2.05% of all links found were nofollowed
    • 55.00% of nofollowed links are internal, 45.00% are external
  • Rel Canonical – 10.57% of all pages now employ a rel=canonical tag
  • The average page has 69.12 links on it (a negligible change from last index)

    • 57.76 internal links on average
    • 11.36 external links on average
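
As a rough sanity check on the figures in the list above, the percentages can be turned back into approximate link counts, and the two per-page averages can be summed to confirm they match the stated total. This is only an illustrative sketch: the variable and function names are made up for the example, and the counts are estimates because the published percentages are rounded.

```python
# Figures copied from the interim index list above.
total_links = 387_656_119_262
nofollow_share = 0.0205          # 2.05% of all links were nofollowed
nofollow_internal_share = 0.55   # 55.00% of nofollowed links are internal

avg_internal = 57.76             # internal links per page, on average
avg_external = 11.36             # external links per page, on average


def nofollow_breakdown(total, share, internal_share):
    """Return approximate (nofollowed, internal, external) link counts."""
    nofollowed = total * share
    internal = nofollowed * internal_share
    external = nofollowed - internal
    return nofollowed, internal, external


nofollowed, nf_internal, nf_external = nofollow_breakdown(
    total_links, nofollow_share, nofollow_internal_share
)
avg_total = round(avg_internal + avg_external, 2)

print(f"Nofollowed links: ~{nofollowed / 1e9:.2f} billion")
print(f"  internal: ~{nf_internal / 1e9:.2f} billion")
print(f"  external: ~{nf_external / 1e9:.2f} billion")
print(f"Average links per page: {avg_total}")
```

The internal and external averages add up to the 69.12 links per page quoted above, and roughly 8 billion of the index's links carry nofollow.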

This index is smaller than our last few, but the numbers look reasonably solid and the data's from the first few weeks of December, so it should be helpful to all you link builders and analyzers. Do be aware, though, that this update is likely to last only a couple of weeks before we replace it with our new version, for which we have high expectations (but don't want to promise the moon just yet).

Also noteworthy – last night, when the index first launched, we experienced some wackiness with Page and Domain Authority scores. Those should have largely settled down to normalcy now, but if you see anything odd, please let us know.

SEOmoz Daily SEO Blog

2nd November Index Update: Our Broadest Index Yet, and New PA/DA Scores are Live

Dec 5

Posted by randfish

Hey gang – it's that magical time again when Linkscape's web index has updated with brand new data (for the second time this month). Open Site Explorer, the Mozbar and the PRO Web App all have new links and scores to check out. This index also features the updated Page Authority and Domain Authority models covered by Matt last week on the blog.

Here are the current index's metrics:

  • 38,295,116,929 (38 billion) URLs
  • 466,742,600 (466 million) Subdomains
  • 125,007,049 (125 million) Root Domains
  • 387,379,700,299 (387 billion) Links
  • Followed vs. Nofollowed

    • 2.03% of all links found were nofollowed
    • 55.57% of nofollowed links are internal, 44.43% are external
  • Rel Canonical – 10.34% of all pages now employ a rel=canonical tag
  • The average page has 70.61 links on it (down 6.67 from last index; we're likely biasing to a different set of webpages with the broader vs. deeper focus of this release)

    • 59.02 internal links on average
    • 11.59 external links on average

As you can see, we're crawling a LOT more root domains – we expect to have data for an extremely high percentage of all the domains that you might find active on the web. However, because of this broader crawl, we're not reaching as deeply into some large domains (some of that is us weeding out crap, including many more millions of binary files, error-producing webpages and other web "junk"). You can see below a chart of the root domains we've crawled in the last 6 months vs. the total URLs in each index.

[Chart: November Linkscape Update, Root Domains vs. URLs]

We work toward a few key metrics to judge our progress on the index:

  • Correlations with Google rankings (not only of PA/DA, but of link counts, linking root domains, mozRank, etc.)
  • Percent of successful API requests (meaning a request, from any source, for link data on a URL for which we had link data)
  • Raw size and freshness (total # of root domains and URLs in the index, though, as Danny Sullivan has pointed out, this may not be a great metric on which to judge a web corpus)
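
To make the second metric concrete, here is a minimal sketch of how a "percent of successful requests" figure could be computed from a log of requested URLs. Everything in it is hypothetical: `success_rate`, `requested_urls` and `indexed_urls` are names invented for the example, and this is not a description of how Linkscape itself measures it.

```python
def success_rate(requested_urls, indexed_urls):
    """Percent of requests whose URL has link data in the index.

    requested_urls: iterable of URLs that were asked about (hypothetical log).
    indexed_urls:   set of URLs the index holds link data for (hypothetical).
    """
    requests = list(requested_urls)
    if not requests:
        return 0.0  # no requests, so nothing to measure
    hits = sum(1 for url in requests if url in indexed_urls)
    return 100.0 * hits / len(requests)


# Toy example: three of the four requested URLs are in the index.
indexed = {"example.com/", "example.com/a", "example.org/"}
requested = ["example.com/", "example.com/a", "example.net/", "example.org/"]
rate = success_rate(requested, indexed)
print(f"{rate:.1f}% of requests had link data")
```

A broader crawl (more root domains) would tend to raise this number even when the raw URL count stays flat, since more requested URLs land on a domain the index knows about.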

We've gotten better with most of these recently – PA/DA have better correlations, more of your requests (via Open Site Explorer, the Mozbar or any third-party application) now have link data, and we're slowly improving freshness (this index was actually completed last week, but didn't launch due to the Thanksgiving holiday). However, we are not improving as much on raw index size (root domains, yes, which we've seen correlate with other metrics, but raw URL count, no). This will continue to be a focus for us in the months to come, and we're still targeting 100 billion+ URLs as a goal (though we're not willing to sacrifice quality, accuracy or freshness to get there).

As always, if you've got feedback on the new scores, on the link data or anything related to the index, please do let us know. We love to hear from you!
