Why is log analysis useful for SEO?

The recipe for a good SEO Hero includes a very large number of ingredients, some of which stir up the community more than others. Log files have been in the spotlight for some time, since they offer a very useful set of data for SEO, and we wanted to come back to the subject.

What about logs?

A log file is a list of all the requests made to a site’s server. When a user loads a page, for example, a line is added to the log file with several pieces of information (IP address, date of the request, URL visited, …). From an SEO perspective, it becomes very interesting to follow one visitor in particular: Google. Its robot leaves a “trace” as it crawls the web, and locating that trace in the logs reveals all of its movements.
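
As a minimal sketch of this idea, assuming an access log in the common Apache/Nginx "combined" format and a file named access.log (both are assumptions, not details from the article), the Googlebot lines can be isolated like this:

```python
import re

# One line of the "combined" log format:
# IP - user [date] "METHOD /url HTTP/x.x" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>[^"]*)"'
)

googlebot_hits = []
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LOG_PATTERN.match(line)
        # The user agent can be spoofed; a reverse DNS lookup on the IP is the
        # usual way to confirm a request really comes from Google.
        if match and "Googlebot" in match.group("agent"):
            googlebot_hits.append(match.groupdict())

print(f"{len(googlebot_hits)} Googlebot requests found")
for hit in googlebot_hits[:5]:
    print(hit["date"], hit["url"], hit["status"])
```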

Crawling, indispensable to SEO

A visit from Google’s robot is, logically, a prerequisite for a site to be ranked. But the quality of the crawl also influences the quality of the SEO. Indeed, when a site is well structured and loads quickly, the robot navigates it more quickly and easily, allowing it to index more pages in a given time, to pick up updates sooner and to come back more frequently. So the best-optimized sites are the ones that appeal most to Google, because they require fewer resources to browse. This explains, in part, why the notion of crawl is an integral part of SEO work.
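
To check whether the robot really does "come back more frequently", the hits isolated in the previous sketch can be grouped by day (this snippet reuses the hypothetical googlebot_hits list from that sketch):

```python
from collections import Counter
from datetime import datetime

# The date field looks like "10/Oct/2000:13:55:36 -0700"; keep only the day.
hits_per_day = Counter(
    datetime.strptime(hit["date"].split(":", 1)[0], "%d/%b/%Y").date()
    for hit in googlebot_hits
)

# A steady or rising daily count suggests Google finds the site easy to crawl.
for day in sorted(hits_per_day):
    print(day, hits_per_day[day])
```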

A real challenge for Google

If it could, the search engine would crawl every web page every day. The problem is that it is currently unable to do so: Google cannot send its robot over all the indexed pages at short intervals, simply because they are too numerous. “Google has indexed 30,000,000,000,000 different URLs”; to crawl all of these pages every 3 months, the robot would have to go through 700 million pages per second. That is a problem in itself, because the crawl is Google’s quality assurance: it allows the engine to keep a fresh view of the indexed sites and thus avoid presenting bad results in its SERPs. Thanks to the crawl, for example, the engine can quickly remove from the first results a hacked site whose content has been degraded, avoiding serving its users a harmful result. But this reactivity depends entirely on the frequency of the crawl.

The search engine does not currently have the technological resources needed for a regular full crawl, nor the financial means to sustain such a rhythm (hardware installation, server maintenance, …), which is why it encourages sites to be “easier” to crawl by integrating performance and web architecture into its ranking criteria. And even if Google were to reach this stage, the intensity of the crawl would pose too great a risk for the sites: to keep pace, the robot would have to crawl millions of pages every second, generating thousands of simultaneous requests on each site that could cause server crashes. As a result, Google prioritizes, allocating more crawl resources to the sites in the most competitive sectors.

Identify crawled pages

By following the Google robot’s itinerary, it is easy to know how many pages are crawled, which makes it possible to detect potential problems and study how to solve them:

Total pages on the site: 30,000
Total crawled pages: 20,500
Total active pages (generating SEO traffic): 12,000
Total non-crawled pages: 8,000

Obtaining these figures leads to two analyses, both illustrated in the short calculation sketch after this list:

  • The ratio of crawled pages to active pages. A gap between the two values indicates that some pages crawled by Googlebot bring in no organic traffic, which is problematic, since it reflects poor positioning of the pages visited by the engine’s robots. It is also proof that crawling is not the only factor in good positioning, otherwise the two values would be close, if not identical. It is then necessary to work on the crawled-but-inactive pages to improve their SEO potential (markup, content, popularity, internal linking);
  • The ratio of total pages to non-crawled pages. It is normal for Google not to visit every page of a site, and the fact that some pages are not crawled is not a problem in itself. The problem arises when pages that deserve to be crawled are not, which reveals a serious performance issue or excessive depth. It is therefore necessary to sort through the non-crawled pages, single out those that should have been crawled, and act on them.
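
As a minimal sketch of both analyses, using the figures from the table above (the variable names are ours, not the article's):

```python
total_pages = 30_000
crawled_pages = 20_500
active_pages = 12_000       # pages generating SEO traffic
non_crawled_pages = 8_000

# Analysis 1: how many crawled pages actually bring in organic traffic?
active_ratio = active_pages / crawled_pages
crawled_but_inactive = crawled_pages - active_pages
print(f"Active / crawled ratio: {active_ratio:.0%}")                       # ~59%
print(f"Crawled but inactive pages to work on: {crawled_but_inactive:,}")  # 8,500

# Analysis 2: what share of the site does Googlebot not visit at all?
non_crawled_ratio = non_crawled_pages / total_pages
print(f"Non-crawled share of the site: {non_crawled_ratio:.0%}")           # ~27%
```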

What crawl optimization brings in terms of SEO

Analyzing the logs and working on how Google’s robot navigates the site is an important part of a site’s SEO optimization. Indeed, by indicating to the engine that a given page deserves to be crawled, or by optimizing the inactive ones, it is possible to position a site on more queries. In our example, assuming that 500 of the 1,500 non-crawled pages deserve to be crawled because they have real SEO potential, getting them “read” again by the robot will improve their SEO (because it is impossible to rank well without being crawled, according to our tests). If the right actions are taken, 500 additional pages will be able to climb in the SERPs and generate hundreds or even thousands of additional visits to the site.

How to prevent pages from going uncrawled?

The depth of a page influences its crawl frequency: the deeper the content sits in a site, the less often the robot will crawl it. Sometimes, however, relevant pages created at the launch of a site end up “buried” under all the newer content without having lost any of their quality. They become less visible to Google, which may choose to leave them aside, and that hurts their SEO. It is therefore necessary to shorten the path to the interesting pages as much as possible, for example via links from the site’s home page or the page suggestions often found at the end of blog articles.

But reducing the depth of the targeted pages is not enough. We also need to make sure that various links point to them, and on a regular basis, so that Google and its robots understand that they are relevant and therefore worth visiting. The legal notices page, for example, is rarely crawled even though it generally sits in the footer of a site, therefore on every page and at a very low depth; because this type of page receives no other links, the engine understands that it is not a major page. The very position of the links pointing to the targeted pages also impacts the crawl: the robot gives more weight to a link placed in the header than to one placed in the footer (like the legal notices). And of course, even if the pages are crawled every day and sit at the heart of an active linking environment, ranking them will be impossible without quality content, solid markup, maximized performance and strong popularity; in short, without all the other elements that define effective SEO!
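
To illustrate the notion of depth, here is a minimal sketch that measures each page's click depth from the home page with a breadth-first search over the internal link graph; the graph below is a made-up example, in practice it would come from a crawl of the site:

```python
from collections import deque

# Hypothetical internal link graph: page -> pages it links to.
links = {
    "/": ["/blog", "/products", "/legal"],
    "/blog": ["/blog/post-42", "/blog/post-41"],
    "/blog/post-42": ["/blog/post-1"],   # an old but still relevant article
    "/blog/post-41": [],
    "/blog/post-1": [],
    "/products": ["/products/widget"],
    "/products/widget": [],
    "/legal": [],
}

def click_depths(graph, start="/"):
    """Breadth-first search: number of clicks needed to reach each page from the home page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

depths = click_depths(links)
for page, depth in sorted(depths.items(), key=lambda item: item[1]):
    print(depth, page)

# Pages deeper than 2 clicks are good candidates for a link from the home page
# or from related articles, so the robot reaches them more easily.
buried_pages = [page for page, depth in depths.items() if depth > 2]
print("Buried pages:", buried_pages)
```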