Using Site Crawl Data


Reading Site Crawl’s Analysis

When showing the results after crawling a website, Site Crawl displays its data using a color coding system. These colors help you quickly see how a website is doing in terms of usability and SEO.

Site Crawl uses 3 colors to help you understand your data: 

  • Red: These are high-priority issues that are seriously impacting a website’s SEO and/or user friendliness. These issues should be at the top of any to-do list for fixing a website.
  • Orange: These are issues that are important, but not critical to user experience or SEO. Issues highlighted in orange should definitely be fixed but not before those marked in red.
  • Blue: Data displayed in blue are informational notes. This analysis shows things that don’t necessarily impact a site’s ability to rank in search results, or highlights issues that do impact SEO but might be done on purpose (such as disallowing a page via robots.txt).

Reading Site Crawl’s Data

Site Crawl’s website data is divided into 4 sections:

  1. On-Page
  2. HTTP Status
  3. Indexing
  4. Canonical

Here’s the data that each section collects and what that data means for your website:

On-Page

The On-Page section of Site Crawl contains data about elements that are important for on-page SEO and/or user experience:

  • Title tags
  • Meta descriptions
  • H1 tags
  • Body content

When the crawler encounters one of the issues described below, it will display the following information:

  • A description of the issue
  • A character count of the tag’s text and length in pixels
  • The actual text found by the crawler
  • The URL of the page or pages where the element was found

Title tags

The title tags area of Site Crawl’s On-Page section highlights pages whose title tags may not be as well optimized for search engines as they could be. In this area you’ll find title tags that are:

  • Duplicate: If Site Crawl finds the same title on 2 or more pages, it will list that title along with all the pages that use it. Having pages with identical titles can cause duplicate content issues for a website.
  • Too short/long: Google dedicates 600 pixels (around 65 characters) to showing a page’s title in its search results. Title tag content that is longer or shorter than this is technically valid, but could perhaps be reworked to make better use of the space it has in search results.
  • Missing: Pages marked as “missing” don’t contain the HTML code to define a title tag at all. Pages with missing title tags are missing out on a very important signal that Google uses to determine a page’s topic and relevance.
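
The title checks above can be sketched in a few lines of Python. This is a rough illustration, not Site Crawl’s actual algorithm: the 30- and 65-character thresholds are assumed stand-ins for the real pixel-based limits, which depend on the widths of the actual characters.

```python
# Rough sketch of a title tag check. The character thresholds are
# assumptions approximating Google's ~600px display limit.

def check_title_length(title, min_chars=30, max_chars=65):
    """Classify a title tag's text as 'missing', 'too short', 'too long', or 'ok'."""
    if not title or not title.strip():
        return "missing"
    length = len(title.strip())
    if length < min_chars:
        return "too short"
    if length > max_chars:
        return "too long"
    return "ok"
```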

Meta descriptions

The meta descriptions area of the On-Page section highlights potential issues discovered with your website’s meta descriptions. In this section you will find meta descriptions that are:

  • Duplicates: Meta descriptions that are used by 2 or more pages are listed in this area. Like title tags, duplicate meta descriptions can cause your site to experience issues with duplicate content.
  • Too short/too long: Meta descriptions appear in Google search results underneath a page’s title and URL. The amount of space Google dedicates to meta descriptions fluctuates but is usually between 650 and 900 pixels (around 120-150 characters). If a page’s meta description is too short or too long, Google could simply scrape the page’s content to show what it thinks is the most relevant part. Crafting relevant, enticing meta descriptions can help encourage people to click on your site in search results.
  • Missing: These pages don’t contain the HTML tag for a meta description. As with descriptions that are too long or too short, Google will use a snippet from the page’s text when showing it in search results if the page doesn’t contain a description. This gives up some of the control you have over how your site appears in search results.
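
Duplicate detection of this kind can be sketched by grouping pages by their description text and keeping any group with two or more members. The URLs and descriptions below are made-up examples.

```python
# Sketch of duplicate meta description detection: group page URLs by
# their description and report any description shared by 2+ pages.

from collections import defaultdict

def find_duplicates(pages):
    """pages: dict mapping URL -> meta description text."""
    by_description = defaultdict(list)
    for url, description in pages.items():
        by_description[description.strip()].append(url)
    return {desc: urls for desc, urls in by_description.items() if len(urls) > 1}
```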

HTML headers

The H1 tags area of the On-Page section lists pages found to have either multiple or missing H1 HTML header tags.

  • Multiple H1: Google has said multiple times that using multiple H1 tags on a page to structure content in a logical, user-friendly way can improve usability and readability. However, stuffing a page with H1 tags is also an old spam tactic for improving search rankings. Pages with multiple H1 tags are listed here so you can be sure that any pages using them are doing so legitimately.
  • Missing: Pages that don’t contain the HTML code defining H1 content are identified as “Missing” in this area. Leaving out H1 content on a page makes it harder for Google to read and understand what the page’s content is about. 
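
A check for multiple or missing H1 tags can be sketched with Python’s built-in HTML parser. This illustrates the idea, not the crawler’s actual code: a count of zero would be flagged as “Missing”, and two or more as “Multiple H1”.

```python
# Sketch of an H1 check: count <h1> start tags in a page's HTML.

from html.parser import HTMLParser

class H1Counter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.count += 1

def count_h1_tags(html):
    counter = H1Counter()
    counter.feed(html)
    return counter.count
```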

Body content

Body content is the content that appears between the <body> and </body> HTML tags on a page. When we talk about “content” in terms of a web page and Google we are referring to this content. The body content area of the On-Page section of Site Crawl identifies pages with body content that is:

  • Thin content: Search engines favor in-depth, authoritative content that answers a user’s question, so pages with brief or thin content won’t rank as well. There’s no exact word count that defines “thin” content, but Site Crawl will highlight pages with content of 250 characters or fewer for your attention.
  • Duplicates: Pages that are copies of each other can confuse search engines as to which is the original version, and these pages can struggle to rank in search results. Too many duplicate pages can also make a website look low quality, hurting the ability of other, original pages to rank.
  • Blank: As mentioned earlier, body content is the main content of a page. Blank pages don’t have any content in the body HTML tags. It’s highly possible search engines won’t even bother to index these blank pages as they have no content to recommend to users. 
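
The blank and thin checks can be sketched using the 250-character threshold mentioned above, applied to a page’s extracted body text:

```python
# Sketch of a body content check: blank pages have no text at all,
# and pages at or under the 250-character threshold are flagged as thin.

def classify_body_content(text, thin_threshold=250):
    text = text.strip()
    if not text:
        return "blank"
    if len(text) <= thin_threshold:
        return "thin"
    return "ok"
```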

HTTP Status

HTTP status codes tell you what’s happening when a browser or crawler (including search engine crawlers) tries to load a page on your website. They’re incredibly important for understanding a website’s search engine optimization and usability.

With Site Crawl you can follow the links on a website to analyze HTTP status code data and uncover potential issues with a website’s overall health.

When Site Crawl encounters one of the HTTP status codes listed below, it will collect and display the following data:

  • HTTP status code: This is the code returned by the site’s server when Site Crawl tries to visit the specified page.
  • Error page URL: This is the page that returns the specified status code or error when Site Crawl tries to access it.
  • Source page URL: This is the page that links to the broken, missing, redirected, or otherwise inaccessible page.

5xx errors

HTTP status codes in the 500 range indicate the server encountered an internal error while trying to send the requested page to the user’s browser. Essentially, something went wrong with the server somewhere such as it taking too long to respond, the site being overwhelmed by traffic or encountering errors in code.

4xx errors

A server returns a status code in the 400 range when it is unable to find a page at the requested URL. 404 errors are the most well-known of these, but there are other codes specifying other errors encountered when trying to access the requested page.

Redirects

Status codes in the 300 range tell a browser that the requested page has moved from one URL to another. This move can be permanent (301) or temporary (302). The server also tells the browser where to go instead, which is why these status codes are called “redirects”.
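
The 500, 400 and 300 ranges described above can be summarized with a small classifier. The groupings follow the standard HTTP status code classes.

```python
# Classify an HTTP status code by its range, as described above.

def classify_status(code):
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"   # e.g. 404 Not Found
    if 500 <= code < 600:
        return "server error"   # e.g. 500 Internal Server Error
    return "other"
```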

Redirects themselves usually aren’t a problem. But there are things that can go wrong that impact SEO and user experience for a website. Site Crawl will list important data it uncovers about a website’s redirects:

  • Broken redirects: As the name suggests, these are redirects that point users and crawlers to pages that don’t load or don’t exist.
  • Redirect chains: Redirect chains happen when a redirect points to a page that also redirects to a third page. 
  • Redirect loops: Loops occur when one page redirects to a second page which in turn redirects back to the first page. This causes the user to get stuck in an infinite back and forth between two pages that don’t load. Most search engines these days (including Google) consider redirect loops to be broken redirects and will ignore any they come across.
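
Chain and loop detection can be sketched by following a redirect map (URL to the URL it redirects to) from a starting page: the walk either ends at a page with no further redirect, or revisits a URL, which means a loop. The map below is a made-up example, not live crawl data.

```python
# Sketch of redirect chain/loop detection over a redirect map.

def follow_redirects(redirects, start):
    """Return (path, kind) where kind is 'ok', 'chain', or 'loop'."""
    path = [start]
    seen = {start}
    current = start
    while current in redirects:
        current = redirects[current]
        if current in seen:
            return path + [current], "loop"  # revisited a URL: loop
        seen.add(current)
        path.append(current)
    # More than one hop before reaching a final page means a chain.
    kind = "chain" if len(path) > 2 else "ok"
    return path, kind
```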

When Site Crawl encounters one of these errors, it will show the following data:

  • Error type - Broken redirect, redirect loop, redirect chain or a normal redirect.
  • Code - The HTTP status code of the page the redirect points to.
  • Redirect page URL - The URL of the page in the original link discovered by Site Crawl and each URL encountered as it follows the redirect. So this will include all the URLs in the redirect chain or loop.
  • Source page URL - The page on which Site Crawl found the link to the redirected page.


HTTP within HTTPS

The HTTP within HTTPS area of the HTTP Status section lists pages that are hosted on HTTPS URLs but contain files (images, videos, scripts, etc.) that use regular HTTP URLs. It lists the data it finds about secure pages hosting unsecured files:

  • Number of assets: This is the number of files on the page that use unsecured URLs.
  • HTTP asset type: This is the type of file that is using an HTTP URL. This could be images, JavaScript, CSS or other types of files.
  • HTTP asset URL: The unsecured URL for the detected file.
  • HTTPS source page URL: This is the secure page that contains the unsecured image, script or other file.

HTTPS refers to a method of sending and receiving data over the web. It is a secure, encrypted extension of the original HTTP protocol. 

Securing your website with an SSL certificate and using HTTPS in your URLs is important for SEO — search engines rank secure websites higher than less secure websites all else being equal. Using SSL for your website also helps protect your users and users’ information when they access your website, which is vital in building trust with your customers. 

Using assets without HTTPS URLs on pages that have HTTPS URLs not only makes your site less secure, but many browsers won’t allow users to access these pages. Instead, they will show a red warning page informing the user that the page they want to access isn’t fully secure.
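
Detecting this kind of mixed content can be sketched with Python’s built-in HTML parser: collect the src/href URLs on a page and flag any that use plain http://. This is an illustration of the idea; the example page markup is made up.

```python
# Sketch of mixed-content detection: find assets loaded over plain HTTP.

from html.parser import HTMLParser

class AssetCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value and value.startswith("http://"):
                self.insecure.append((tag, value))

def find_insecure_assets(html):
    collector = AssetCollector()
    collector.feed(html)
    return collector.insecure
```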

Crawl Errors

Crawl errors occur when Site Crawl is unable to connect with a website because it doesn’t respond or takes too long to respond when it requests a page.

Indexing

The Indexing section lists all of the pages Site Crawl finds that search engines will struggle to find, crawl and index, or won’t be able to crawl at all. Therefore, these pages likely won’t appear in search results.

Setting pages up so that they can’t be indexed by search engines is often done on purpose, like during a site migration/redesign or to avoid duplicate content. However, this can also be done in error so it’s worth continuously monitoring a website to ensure optimal indexability.

Note: The pages listed in the Indexing section of Site Crawl are marked as informational only because it’s possible a website owner wants a page not to be indexed by search engines. However, a non-indexable page that appears in the website’s sitemap should be considered an error, as listing a non-indexable page in a sitemap can confuse search engines as to which directive they should follow.

Non-indexable pages

Pages listed in the non-indexable pages area of this section can’t be indexed by search engines for one of these reasons:

  • It’s been disallowed by the website’s robots.txt file
  • The page contains a canonical tag pointing to another page
  • It has a noindex meta robots tag
  • The page has a nofollow meta robots tag or HTTP header

When Site Crawl finds a page that can’t be indexed for one of these reasons, it will collect the following data for you:

  • The URL of the non-indexable page
  • The URL of the page linking to the non-indexable page
  • The reason search engines can’t index the page (one of the reasons listed above)
  • Whether or not the non-indexable page is listed in the website’s XML sitemap
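
Two of the reasons above come from robots directives in a page’s meta robots tag (or an X-Robots-Tag HTTP header, which carries the same values). Reading those directives can be sketched as a simple split of the content string:

```python
# Sketch of parsing directives from a meta robots content attribute,
# e.g. <meta name="robots" content="noindex, nofollow">.

def parse_robots_directives(content):
    """Return the set of directives in a meta robots content string."""
    return {token.strip().lower() for token in content.split(",") if token.strip()}
```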

Disallowed pages

This area identifies pages on a website that have been disallowed by that site’s robots.txt file but are being linked to from somewhere on that site.

When Site Crawl finds a disallowed page while following the links on a website it will list:

  • The URL of the disallowed page
  • The URL of the source (linking) page
  • Whether or not the disallowed page appears in the site’s XML sitemap
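
A disallow check like this can be sketched with the standard library’s robots.txt parser. The robots.txt rules here are a small made-up example fed to the parser directly rather than fetched from a live site.

```python
# Sketch of a robots.txt disallow check using urllib.robotparser.

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def is_disallowed(path):
    """True if the path is blocked for all crawlers by the rules above."""
    return not parser.can_fetch("*", path)
```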

Nofollowed pages

The Nofollowed pages area of the Indexing section lists pages that Site Crawl finds containing a meta robots tag or an x-robots-tag HTTP header set to nofollow. Pages that contain the nofollow tag or HTTP header prevent search engines from following any of the links on those pages.

While these elements can help an advanced website owner control how their site is crawled and indexed, errors with these methods can prevent a site’s pages from appearing in search results.

When Site Crawl encounters a page using the nofollow meta robots tag or x-robots HTTP header it will present the following data:

  • The URL containing the nofollow tag or header
  • The page linking to the nofollowed page
  • Whether the page is nofollowed by the meta robots tag or HTTP header method

Deep pages

“Deep pages” are pages that require a user or search engine to click six or more links to reach them from the homepage. Deep pages are less likely to be found by search engines and therefore are less likely to appear in search results for users. Deep pages that are discovered by search engines are seen as less important and are less likely to rank as well as other pages.

When Site Crawl comes across deep pages it lists:

  • Depth: The number of clicks it takes to reach the deep page from the homepage
  • Source URL: The URL of the page linking to the deep page
  • Deep Page URL: The URL of the deep page that requires at least six clicks to find from the homepage

Having a lot of deep pages is a sign of either a very large website or a smaller website with a poor internal linking structure. When depth reaches around 50 clicks or more, that is often a sign of a bug in the website’s CMS causing it to automatically generate unique URLs when loading a page.
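
Click depth can be sketched as a breadth-first search over a site’s link graph (each URL mapped to the URLs it links to). Pages reached at depth six or more would be flagged as deep. The link graph below is a made-up example.

```python
# Sketch of computing click depth from the homepage with BFS.

from collections import deque

def page_depths(links, homepage):
    """Return {url: clicks from homepage} for every reachable page."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```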

Canonical

The Canonical section of Site Crawl lists pages that contain canonical tags and/or hreflang tags. These tags tell search engines where they can find the original version of a page or versions of a page in alternate languages, respectively.

Canonical tags

The Canonical area lists all of the important data Site Crawl finds about a website’s discovered canonical tags:

  • Self-referencing/Non self-referencing: This tells you whether a page’s canonical tag links to the page on which it appears (self-referencing) or to a different page (non self-referencing)
  • Conflicting canonicals: The URL listed in the canonical tag cannot be accessed
  • Sitemap mismatch: The URL contained in the canonical tag cannot be found in the site’s XML sitemap
  • Multiple canonicals: Pages marked with multiple canonicals contain more than one canonical tag. This often happens when a canonical URL is added in both the page’s HTTP header and its HTML code.
  • Open Graph mismatch: The URL listed in the page’s canonical tag does not match the page’s Open Graph URL.

Site Crawl will then list the URL of the page containing the analyzed canonical tag as well as the page linking to this page.
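
Three of the checks above can be sketched as simple comparisons over made-up values for a single page: the page’s own URL, the URL in its canonical tag, the set of sitemap entries, and its Open Graph URL. The accessibility check for conflicting canonicals is left out, since it requires a live request.

```python
# Sketch of canonical tag checks: self-reference, sitemap presence,
# and Open Graph agreement. Inputs are illustrative, not real crawl data.

def check_canonical(page_url, canonical_url, sitemap_urls, og_url):
    issues = []
    if canonical_url != page_url:
        issues.append("non self-referencing")
    if canonical_url not in sitemap_urls:
        issues.append("sitemap mismatch")
    if og_url != canonical_url:
        issues.append("open graph mismatch")
    return issues
```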

Hreflang

The Hreflang area lists Site Crawl’s analysis of hreflang tags that contain links to pages that it is unable to properly access or load. The Site Crawl results will list the broken link contained in the hreflang tag as well as the page on which the tag appears.

Extra Resources

For further information on getting the most out of Site Crawl’s data and results, check out these resources:
