Configure how web crawlers will index your Magento website

Robots.txt is a standard that provides instructions to the web crawlers that index your website. In this post we'll see how Magento lets us manage robots instructions, and how we can extend this mechanism.

Robots.txt, a quick overview

What is robots.txt?

The W3C explains that the robots exclusion standard, together with the ROBOTS meta tag embedded in HTML, is the standard way to tell search engine bots how you want them to crawl and index your website:

  • Do you want to allow them to index the page content?
  • Do you want to allow them to follow internal links to discover new pages to index?

All these actions can be controlled by the following instructions:

  • index / noindex: state whether the content of the current page should be indexed
  • follow / nofollow: state whether bots may follow the links found in the current page

Hmm, interesting. But how can we set up these values?

Where do I set up robots.txt?

There are two ways to set up robots information:

Robots.txt file

By default, web crawlers look in your document root directory for a file named robots.txt. In this file, we define our global crawling policy for the whole site.

If this file does not exist, crawlers fall back to the instructions found in the pages themselves.
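For example, a minimal robots.txt could look like this (the disallowed paths are purely illustrative):

User-agent: *
Disallow: /checkout/
Disallow: /customer/account/

Note that this file uses its own directives (User-agent, Disallow, Allow) and controls which URLs may be crawled, while the per-page index / follow preferences are expressed with the meta tag described below.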

With the ROBOTS meta header

We can also manage the robots information directly in the page headers.

In this case, we send in the page head a meta tag named "ROBOTS" which carries our indexation preferences. Here's an example of such a tag:

<meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />

What's the benefit of setting up a good robots policy?

The benefit of these methods is that you can control how indexers work:

  • You probably want your catalog pages to be indexed
  • Do you really need your checkout pages to be indexed?
  • Do the links on the cart page really need to be followed?

Ok, I'm convinced. But doesn't Magento provide a mechanism to set up these robots values?

How does Magento manage robots?

Magento provides a mechanism to configure the robots values sent in the page header. The setting is available in the administrative panel under System > Configuration > Design > HTML Head, as the "Default Robots" option.

As you can see, we can define the index / follow property, but only in a global context: all pages will carry the same meta ROBOTS value, so for now we are unable to provide different robots instructions per page. Furthermore, if you do not set a value in the administrative panel, the default applied is *; in this case, it's the robot itself that decides whether to index your content!

Ok, but how can we set up different values for our pages?

Setting different robots values in Magento

There is a solution: we will use the layout structure of our pages. In these layouts, we can reference the block Mage_Page_Block_Html_Head, which is responsible for rendering the robots instructions in our pages. If we check the block class, it extends Varien_Object (so it's a classic Magento object) and has a getRobots method.
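Here's a sketch of that method, paraphrased from the Magento 1.x source (the exact code may vary between versions):

public function getRobots()
{
    // Fall back to the global "Default Robots" configuration value
    // when no per-page value has been set on the block
    if (empty($this->_data['robots'])) {
        $this->_data['robots'] = Mage::getStoreConfig('design/head/default_robots');
    }
    return $this->_data['robots'];
}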

As you can see, the robots values are stored in the _data property, under the 'robots' index. So we just have to call the setRobots method on the block, and it will update the related _data entry, and therefore the robots value of our page.
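For example, from a controller action we could set the value programmatically (a sketch, assuming the standard "head" block alias):

$this->loadLayout();
// setRobots() is a magic setter inherited from Varien_Object:
// it simply writes into $_data['robots']
$this->getLayout()->getBlock('head')->setRobots('NOINDEX,NOFOLLOW');
$this->renderLayout();

But the cleaner way is to do it declaratively in the layout.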

Here's an example of a layout robots definition:
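(A sketch for local.xml; the checkout_onepage_index handle and the "head" block name follow the usual Magento 1 conventions.)

<checkout_onepage_index>
    <reference name="head">
        <!-- Send NOINDEX,NOFOLLOW on the one-page checkout -->
        <action method="setRobots"><value>NOINDEX,NOFOLLOW</value></action>
    </reference>
</checkout_onepage_index>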

Conclusion

With this method, we do not have to write any class override: the Magento layout allows us to define our robots preferences easily. We are now able to restrict indexation on the checkout, prevent link following in the customer account, and so on.

If you embed these instructions in a reusable module, you can set up your common robots policy in five minutes on each of your projects.

This configuration will only work for crawlers that respect the robots instructions. Some of them ignore these rules and index everything; for those, we will have to find another way 🙂