How Google Crawls Your Website’s JavaScript and AJAX

A huge step forward in how Google crawls AJAX content! Google is ahead of every other search engine at crawling content within JavaScript and AJAX, but it is still perfecting the process. Do you find it’s a battle getting your site crawled?

Google has changed its way of handling AJAX calls to web content. John Mueller of Google said JavaScript crawling can be “on and off,” so there are extra SEO tactics search experts can use to let Google know which content you intend to have crawled. Going forward, it is best not to rely on the way it used to work. Manually maintaining your website’s robots.txt file is still a good idea if you have specific content you do or don’t want crawled.

GoogleBot is the arm of Google’s search engine that crawls your web pages and builds the index. It’s also known as a spider. GoogleBot uses machine learning to crawl every page you allow it to access, adding each one to Google’s index, where it can be retrieved and returned to match users’ search queries. Your efforts to indicate clearly which pages on your website you want crawled and which you do not can feel like combat, too. Most Internet surfers never realize the steps you have taken to improve your site’s crawlability and indexation.

Getting your website crawled and indexed correctly by Google is pivotal to Internet marketing success. It’s the entire starting point: without crawling and the ability to be indexed, you cannot succeed. Without a sitemap uploaded to your website’s root folder, crawling can take a long time, sometimes 24 hours or more, to index a new blog post or a deep page on your site.

Two important concepts you need to understand for Google to crawl your website are:

1. If you want your site crawled and indexed, then search engine spiders need to be able to view your site correctly.

2. There is a lot you can do to ensure that your website is crawled correctly by Google’s spiders.

New Insights on Syndicated Content and How Google Crawls AJAX

Google can now index AJAX calls, and it is important to understand what that means for Google Search results.

When John Mueller was asked in last Friday’s English Google Webmaster Central office-hours hangout how syndicated content and an AJAX call are handled, his response was: “In the past we have essentially ignored that. What could be done is using JS.” I find it fascinating both what has changed and what we can now expect. You can indicate that you want a page rendered for indexing while leaving a dynamically injected title out. If you want to exclude part of a page, you can use the robots.txt file to indicate that wish.

For example, consider a product details page with a description, reviews, buttons, and questions and answers (Q&A). The syndicated part of that call could be hidden; Mueller suggests moving the aggregated content into a separate directory on the site so that it can be blocked in your robots.txt file. This avoids making it look like you are auto-generating content that doesn’t actually exist on the site. A sketch of such a rule follows.
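If the aggregated reviews and Q&A lived in their own directory (the /syndicated/ path here is purely hypothetical), the robots.txt entry could look like this minimal sketch:

```
# Keep aggregated, syndicated content out of the crawl
User-agent: Googlebot
Disallow: /syndicated/

# Apply the same rule to other crawlers
User-agent: *
Disallow: /syndicated/
```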

“What we try to do is render the page like it would look in a browser. Look at the final results and use those results in search,” added Mueller. If you only want part of a page left out of consideration, webmasters can block that portion with robots.txt. Interestingly, even Mueller hinted that it can be tricky to indicate which AJAX content you don’t want parsed within a given page while the rest of the page is parsed.

Google will not always crawl all of your JavaScript all the time. “It is still on and off but headed in the direction of more and more,” stated Mueller. It seems clear that Google’s developers are running tests on how Google crawls AJAX and want to index JavaScript-injected titles properly and more consistently. It is not about “sneaking content to the user that isn’t being indexed.”

Watch the full Google Webmaster Central Hangout here.

You can use the Fetch as Google tool to see how your site looks when crawled by Google. From there, site owners can take advantage of more granular options and select how content is indexed on a page-by-page basis. One example is the ability to say how your pages appear with or without a snippet, or in a cached version, which is an alternate copy stored on Google’s servers in case the live page is not viewable at the moment.

QUESTION: What is GoogleBot? How Does it Crawl Your Site?

ANSWER: “Googlebot is the search bot software used by Google, which collects documents from the web to build a searchable index for the Google Search engine,” according to Wikipedia.

Whether you are seeking to learn how the Google crawler affects earned (organic) search or paid search, SEOs can improve their tactics with a correct understanding of GoogleBot.

An Overview of How Google Crawls Websites

1. The first thing to know is that your website is always being crawled. Google has indicated that “Googlebot shouldn’t access your site more than once every few seconds on average.” In other words, your site is always being crawled, provided it is correctly set up to be available to crawlers. Google’s “crawl rate” means the speed of Googlebot’s requests; it is not about how often your website is crawled. Typically, businesses want more visibility, and the more freshness, relevant authoritative backlinks, social shares and mentions your site has, the more likely it is to appear in search results. Given how many pages Googlebot must crawl across the web, it is not always feasible or necessary for it to crawl every page on your site all the time.

2. Google’s routine is to access a site’s robots.txt file first. From it, Googlebot learns what the site owner has specified about which content Google is permitted to crawl and index on the site. Any web pages that are marked “disallowed” will not be indexed.

As is true for SEO work in general, keeping your robots.txt file up to date is important; it is not a one-time task. Knowing how to manage crawling with the robots.txt file is a skilled job. Your technical website audit should check the coverage and syntax of your robots.txt file and tell you how to fix any existing issues. A quick way to check what your current rules allow is sketched below.
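As a minimal sketch using Python’s standard library (the domain and paths are placeholders, and this parser only approximates Googlebot’s own rule handling), you can confirm what your live robots.txt actually permits:

```python
from urllib.robotparser import RobotFileParser

# Load the live robots.txt (hypothetical domain; substitute your own site).
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask whether Googlebot may fetch specific URLs under the current rules.
print(rp.can_fetch("Googlebot", "https://www.example.com/syndicated/reviews.html"))
print(rp.can_fetch("Googlebot", "https://www.example.com/blog/new-post/"))
```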

3. Google reads the sitemap.xml next. While search engines don’t need a sitemap to discover every area of a site to be crawled and indexed, it still has practical use. Because websites are constructed and optimized so differently, web crawlers may not automatically crawl every page or section. Some content benefits more from a professional, well-constructed sitemap, such as dynamic content, lower-ranked pages, expansive content archives, and PDF files with little internal linking. Sitemaps also help GoogleBot quickly understand the metadata within categories like news articles, video, images, PDFs, and mobile pages. A minimal sitemap entry is sketched after this list.

4. Search engines crawl sites that have already established trust more frequently. If your web pages have gained significant PageRank, we have seen Googlebot award the site a larger “crawl budget.” The greater the trust and niche authority your business site has earned, the more crawl budget you can anticipate benefiting from.
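For reference, a minimal sitemap.xml entry looks like the sketch below (the URL and date are placeholders); news, video, and image extensions add the category metadata mentioned in point 3:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/new-post/</loc>
    <lastmod>2016-09-07</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```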

Why Site Linking Structure May Impact Crawl Rate and Domain Trust

Once you understand how the Google crawler works, it becomes easier to respond to new updates, whether Google has lifted or applied a search filter, released a new patch, or your domain’s link structure has changed. Benchmark your site’s SERP ranking changes against your competition to see whether everyone gains a spike in converting traffic at the same time. This will help you rule out an isolated occurrence.

Act ethically and earn domain trust. Rather than attempting to keep a web server secret, simply follow Google’s search best practices from the beginning. “As soon as someone follows a link from your ‘secret’ server to another web server, your ‘secret’ URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. Similarly, the web has many outdated and broken links”, states the search giant. Whenever someone publishes an incorrect link to your website or fails to update links to reflect changes on your server, the result is that GoogleBot ends up attempting to download an incorrect URL from your site.

How Marking Up Your Content Aids Google Crawler

When an SEO expert correctly implements Google structured data to mark up web content, Google can better understand your context for display in Search. This means you can achieve better distribution of your web pages to Google Search users. It is accomplished by marking up content properties and enabling schema actions where pertinent, which makes your content eligible for inclusion in Google Now cards, answer boxes, and featured rich snippets.

Steps to Mark-up Web Content Properties For GoogleBot

1. Pinpoint the best data type from the table schema.org provides.

Find what best fits your content, then consult the markup reference guide for that type to find the required and recommended properties. It is permissible to add markup for multiple content types to a single HTML or AMP HTML page to aid your next Google crawl. We find that users favor news articles that contain video content, which creates a perfect opportunity to add markup for both types and make your page eligible for inclusion in Top Stories within the news carousel or in rich results for video.

2. Craft a section of markup containing your key products and services.

Make it as easy as possible for your site to be crawled by including the required structured data properties for the visual presentation you want to gain in SERPs. SEOs now have an extensive data-type reference to draw from, with many examples of customizable markup. A minimal sketch follows.
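As a minimal, hypothetical sketch (an invented product, assuming schema.org’s Product type fits your content), JSON-LD markup for a product details page might look like this:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "description": "A placeholder product used only to illustrate the markup.",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.4",
    "reviewCount": "89"
  },
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "29.99",
    "availability": "https://schema.org/InStock"
  }
}
</script>
```

Google’s Structured Data Testing Tool can validate markup like this before GoogleBot next crawls the page.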

What is a Google Crawl Budget?

“The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we’ll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we’ll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline,” as Matt Cutts explained in an interview with Eric Enge of Stone Temple.

Make sure a prospective consultant fully understands the essentials before discussing crawl optimization with them. Crawl budget is a term some are unfamiliar with; it is essentially the time, or number of pages, that Google allocates to crawling your site.

Matt Cutts of Google gave SEOs a sense of what to keep top of mind regarding the number of pages crawled. In 2010 he stated that “there isn’t really such thing as an indexation cap. A lot of people were thinking that a domain would only get a certain number of pages indexed, and that’s not really the way that it works. There is also not a hard limit on our crawl.”

We find it helps to view it with a focus on the number of pages crawled in proportion to your PageRank and domain trust. He added, “So if you have a lot of incoming links on your root page, we’ll definitely crawl that.”

Academy of Internet Marketing Google Q&A: Crawling and Rendering

Today, 9.7.2016, a panel on Web Promo covered several lingering questions about how Google crawls websites.

John Mueller explained that if Google has to render a page and then sees a redirect, that causes a delay. When asked, “Is there any kind of schedule to when pages get crawled?” he answered, “It is scientific.”

When asked whether content with structured data that includes information such as price, or items that may be out of stock, gets crawled more often so the data stays accurate, the response was, “It is a complicated technical field.” John Mueller added, “I think structured data is something you can give us in different ways. Use the sitemap to let us know. Just because there is some pricing information out there doesn’t mean that data will update quickly.”

Existing Problems the Google Crawler May Still Face

1. Sites with a complicated URL structure, which is most often due to URL parameter issues. Mixing things like session IDs into the path can result in a very large number of URLs being crawled. In practice, Google doesn’t really get stuck there, but it can waste a lot of resources that your site needs to use more wisely.

2. GoogleBot may slow down crawling when it finds the same path sections repeated over and over again.

3. If the Google crawler cannot pull out page content immediately, it renders the page to see what comes up. If there are elements on the page where a user has to click or do something to see the content, that content may be missed. GoogleBot is not going to click around to see what might come up. John Mueller said, “I don’t think we try any of the clicking stuff. It is not like we scroll on and on.”

From what I understood of the conversation, it helps to differentiate between content that is not loaded until the user takes an action and content that GoogleBot cannot see without scrolling down.

Rather than spending extensive time configuring JavaScript to manage what displays on a page, look for what provides the end user the most relevant and complete content. Consider tweaking your website’s pagination, JavaScript, and other techniques that give users a better experience. “For the 3rd and last time, look at AMP,” reiterated Andrey Lipattsev at the close of the event.

We strongly recommend that every site prepare adequately for the rise of Google mobile search. GoogleBot may also time out while trying to fetch embedded content, which can make accessibility harder for users.

The most important information you want crawled is:

* Web URLs — Your pages, posts, and key document web URL addresses.

* Page Title tags — Page Title tags indicate the name of the web page, blog post or news article.

* Metadata — This can encompass many things like your page’s description, structured data markup and prevalent keywords.

This is the main information GoogleBot retrieves when it crawls your site, and it is also most likely what you will see indexed. That’s the basic concept. For a site that is advancing, there is much more complexity to how it may be crawled and how its search results are returned, organized, and given a chance to show up in rich snippets. A bare-bones example of where the title and metadata live is sketched below.
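As a minimal sketch (placeholder text only), the page title tag and meta description sit in the page’s head, where GoogleBot picks them up on each crawl:

```html
<head>
  <title>How Google Crawls Your Website’s JavaScript and AJAX</title>
  <meta name="description" content="A plain-language look at how GoogleBot discovers, renders, and indexes web pages.">
  <!-- Structured data markup, like the JSON-LD example earlier, can also be placed here. -->
</head>
```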

How Google Crawls the New Domain Extensions

Google announced on July 7th, 2015 how they plan to handle the ranking of new domains like .news, .social, .ninja, .doctor, .insurance, .shopping, and .video. In summary: they’ll be ranked exactly the same as .com and .net. Seeing a live demonstration of how Google experiences your site during a crawl shows how premium SEO delivers genuine, tangible content for daily searches on the Internet. With new domain extensions here and expanding, if you use them, you can be confident that your site will be crawled and its content delivered quite naturally.

Google offered glimpses into how the Google crawler will handle these domains in search results, hoping to sidestep possible misconceptions about how the latest domain extension options will be processed. When asked if a .BRAND TLD may gain more or less weight than a .com, Google responded: “No. Those TLDs will be treated the same as other gTLDs. They will require the same geotargeting settings and configuration, and they won’t have more weight or influence in the way we crawl, index, or rank URLs.”

For webmasters wondering how the newer gTLDs impact search, we learned that Google will crawl and treat new gTLDs like other gTLDs (for example, .com, .net, and .org). From our interpretation of the post, keywords in a TLD will not grant a site any particular advantage or disadvantage in SERP rankings.

How to Make AJAX Web Applications Crawlable

For webmasters who use an AJAX application with content intended to appear in search results, Google announced a process that, when implemented, can help Google (and potentially other major search engines) crawl and index that content. In the past, AJAX web applications posed challenges for search engines because of the dynamic processing that AJAX content can entail.

Most website owners have more important tasks at hand than setting up restrictions for crawling, indexing, or serving up web pages. It takes someone deep into SEO to specify which pages, or which sections of a page, are eligible to appear in search results. For the most part, if your web content is well optimized, your pages should get indexed without extra measures. For the more granular approach that large shopping carts often need, many options are available for indicating how the site owner permits Google to crawl and index the site. Most of this work is executed through Google Search Console and a file called “robots.txt”.

John Mueller invited comments from webmasters about crawling AJAX. As this develops further, Google is becoming more forthcoming about how well, and just how, GoogleBot parses JavaScript and AJAX. It is best to stay tuned to ongoing threads of communication on the topic before acting on too many opinions. For the time being, we recommend not consigning your important site elements or web content to AJAX/JavaScript alone.

More Advanced Means of Helping Google Crawl a Site

Within Google Search Console, formerly known as Google Webmaster Tools, it is possible to set up URL parameters. For a simple website this is typically not needed; Google even forewarns users that they should have developed expertise in this SEO tactic before using it. Whether or not your site faces an issue with duplicate content may be one deciding factor. Crawl problems can be caused by dynamic URLs, which in turn could mean you have challenges with how URL parameters are indexed. The URL Parameters section lets webmasters configure how Google crawls and indexes the site’s URL parameters. By default, web pages are crawled exactly as GoogleBot determines to crawl them.

Fresh content helps you win more frequent Google crawls, so the more often you post on your blog, the more frequently you can expect to be crawled. Google Search Console only stores historical crawl data for up to 90 days. Some SEOs have requested that this span be increased; for now, however, it is your best way to discover Google’s crawling habits as they relate to your site.

Preparing for Mobile Web Performance and Faster Google Crawls
Do You Know How Well Your Mobile Site is Crawled?

Google’s Accelerated Mobile Pages (AMP) may well help website owners improve their search rankings and crawlability for the mobile-first world. Switching to Google AMP and learning how it will impact your site’s crawlability typically requires someone experienced at the helm. For those watching site visibility and positioning, we know that load speed matters. If your web page is similar to a competitor’s in all other characteristics except speed, expect GoogleBot to favor the faster site that is easy to crawl and that users find compelling, and to rank it higher in SERPs.

If you need help updating to AMP web pages and then testing how your mobile site is crawled, read here for solutions. Sites may load differently on various mobile devices, which impacts load performance. Test to see whether Google’s caching servers load your pages faster on slower connections.

Quickly Fix Issues with Server Connectivity that Hurt Web Crawls

Too often, business owners are unaware of the quality of their hosting plan and the server they are on. That brings up one very important point: if your website has connectivity errors, Google may not be able to access the site when it tries to, because your site or its servers are down. Especially if you are running an AdWords campaign that links to a landing page that cannot load due to server issues, the results can be very destructive. You may get a warning in your Google AdWords console, and with too many of them the ad can be cancelled.

But you have much to weigh beyond that. If this continues unheeded, Google may even stop coming to your site, your site’s health will be negatively impacted, your page rankings may plunge, and as a result your traffic could decline significantly. It is pure logic: if Google can’t access your site for a long period of time, they, like any of us, need to move on to tasks that are doable. Set up an alert and keep a sharp eye on your server connectivity and crawl errors.


How GoogleBot Checks Web Page Resources

Most of your web pages use CSS and/or JavaScript to load. How your site is built and how many of these resources it uses affect your load times. Typically, both CSS and JavaScript are loaded as external files that are linked from your HTML. Google must have access to these resources in order to fully understand your web pages. Often, someone unfamiliar with SEO and with how Google crawls your website will block these files in the robots.txt file; a sketch of the kind of rule to watch for follows.
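As a minimal sketch (directory paths are placeholders), if a broad block is unavoidable, more specific Allow rules can keep the rendering resources reachable for Googlebot:

```
User-agent: Googlebot
# If a broad rule such as "Disallow: /assets/" is needed, keep the
# rendering resources explicitly crawlable with more specific Allow rules:
Disallow: /assets/
Allow: /assets/css/
Allow: /assets/js/
```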

You can check to determine if your website is adhering correctly to this guideline.

Take advantage of the Google guidelines tool to learn which files (if any) are blocked from Googlebot. It stands to reason that if web crawlers cannot understand your site’s contents, they cannot rank you. Google needs to be able to crawl your web pages in order to understand them fully and match your content to relevant search queries. Put your page through the SEO tool to get a better idea of how Google sees your site, or ask us to perform this vital task for you. Then we can go over the results together so that you address any issues correctly.

Google Crawls Sites that Follow their Webmaster Guidelines

The checks you need to confirm that your site correctly follows the Google webmaster guidelines for being crawled:

* Page headers are present when accessed by Googlebot.

* Well-formed static links are discovered.

* The number of on-page links is not excessive.

* Pages avoid common accessibility issues.

* A robots.txt file is found and is correctly formed.

* All images have alt text to help GoogleBot understand them.

* All CSS and JavaScript files test as visible to Googlebot.

* Sitemaps for both search engines and users are available.

* No page speed issues.

NOTE: Additionally, you’ll want to confirm that your web server correctly supports the If-Modified-Since HTTP header. This lets your web server tell GoogleBot whether your content has changed since its last crawl, which saves your website’s bandwidth and overhead. A quick way to test it is sketched below.
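As a minimal sketch using Python’s standard library (the host, path, and date are placeholders), send a conditional request and check the status code:

```python
import http.client

# Hypothetical host and path; substitute a real page from your site.
conn = http.client.HTTPSConnection("www.example.com")
conn.request(
    "GET",
    "/blog/new-post/",
    headers={"If-Modified-Since": "Wed, 07 Sep 2016 00:00:00 GMT"},
)
response = conn.getresponse()

# 304 Not Modified means the server honors the header; a 200 with the full
# body for unchanged content suggests If-Modified-Since is being ignored.
print(response.status, response.reason)
```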

“Google essentially gathers the pages during the crawl process and then creates an index, so we know exactly how to look things up. Much like the index in the back of a book, the Google index includes information about words and their locations. When you search, at the most basic level, our algorithms look up your search terms in the index to find the appropriate pages.” – Matt Cutts of Google

“The web spider crawls to a website, indexes its information, crawls on to the next website, indexes it, and keeps crawling wherever the Internet’s chain of links leads it. Thus, the mighty index is formed.” – Crazy Egg

“Search engines crawl your site to get the contents into their index. The bigger your site gets, the longer this crawl takes. An important concept while talking about crawling is the concept of crawl depth. Say you had 1 link, from 1 site to 1 page on your site. This page linked to another, to another, to another, etc. Googlebot will keep crawling for a while. At some point though, it’ll decide it’s no longer necessary to keep crawling.” – Yoast on Crawl Efficiency

“We strongly encourage you to pay very close attention to the Quality Guidelines below, which outline some of the illicit practices that may lead to a site being removed entirely from the Google index or otherwise affected by an algorithmic or manual spam action. If a site has been affected by a spam action, it may no longer show up in results on Google.com or on any of Google’s partner sites.” – Google Webmaster Guidelines

Summary

With so many tasks involved in digital marketing today, and in improving site performance with current SEO best practices, many small businesses feel challenged to give sufficient time and effort to Google crawl optimization. If you fall into this bucket, it is quite possible you are missing a significant amount of traffic. Crawl optimization should be a high priority for any large website seeking to improve its SEO efforts. By implementing tracking, monitoring your Google Analytics SEO reports, and directing GoogleBot to your key web content, you can gain an advantage over your competition.

In order to be indexed and returned in search engine results, your website must first be easy to crawl. If you think your business website is poorly indexed or poorly returned, it is important to determine whether your site is correctly crawled. Start with a full website SEO audit, implement improvements, and then see the benefit you gain in increased Internet traffic and site views.

Remember, reaching your goal of having your website indexed by Google is only the first step in successful digital marketing. To improve your website beyond being crawled and indexed, make sure you’re following basic SEO principles, creating high-value content users want, and integrating with paid search.

Hill Web Creations can offer you new ideas on how to “encourage” Google to re-crawl your website, or select web pages that have been recently updated. Call 661-206-2410 and ask for Jeannie.

Or you can start by checking out our Types of Website Audits Available.
