Wednesday, 12 December 2012

Indexing with robots.txt, meta tags and canonical


Why influence how search engines index?

There are various reasons to control indexing and thus to dictate how a search engine should deal with pages and links:
  • Allow or disallow following links
  • Prevent indexing of irrelevant pages
  • Index duplicate content under only one URL
The goal, of course, is to deliver only relevant HTML pages to the search engine. But this doesn’t always work out. Duplicate content quickly arises from technical problems or from the ubiquitous ‘human factor’, which is all too common. But there are ways to counteract this and keep an index clean.

Which methods work?

I will cover three methods for influencing how your site is indexed: what they are and how they can be used.

/robots.txt protocol

The /robots.txt file is like a ‘bouncer’ for search engine crawlers. It states which crawlers may crawl which pages or sections of a domain. Most crawlers follow the /robots.txt file, but it is a suggestion rather than an order.
The /robots.txt file primarily uses two instructions:
User-agent: – determines which crawler the following instructions apply to
Allow/Disallow: – determines the file or directory concerned
An empty line closes the record.
A /robots.txt file generally looks like this:
# robots.txt for http://www.example.de/
User-agent: ROBOTNAME
Disallow: /pictures/

User-agent: *
Disallow: /confidentialData/
Disallow: /allPasswords.html
If you want to address all crawlers, use the following expression: User-agent: *
Be careful: with Disallow: / you block all robots from the entire domain. If a site receives no organic traffic, this could be the reason.
As long as work is being done in a test environment and the data should not yet be found, it is useful to exclude complete directories from indexing.
Crawlers from dubious providers usually ignore /robots.txt, but established search engines do observe the instructions.
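To see how a well-behaved crawler might read these rules, here is a minimal sketch using Python's standard urllib.robotparser module. It feeds in the example file from above and asks which URLs may be fetched. The crawler name OtherBot and the sample paths are made up for illustration; real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The example file from this article, with an empty line between the records.
ROBOTS_TXT = """\
# robots.txt for http://www.example.de/
User-agent: ROBOTNAME
Disallow: /pictures/

User-agent: *
Disallow: /confidentialData/
Disallow: /allPasswords.html
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
base = "http://www.example.de"

# ROBOTNAME has its own record, so only /pictures/ is closed to it.
print(rules.can_fetch("ROBOTNAME", base + "/pictures/logo.png"))   # False
# "OtherBot" (hypothetical) falls under the "*" record.
print(rules.can_fetch("OtherBot", base + "/pictures/logo.png"))    # True
print(rules.can_fetch("OtherBot", base + "/allPasswords.html"))    # False

# "Disallow: /" for all user agents closes the entire domain.
block_all = build_parser("User-agent: *\nDisallow: /")
print(block_all.can_fetch("OtherBot", base + "/index.html"))       # False
```

A check like this can be handy before deploying a changed /robots.txt, for example to make sure a leftover Disallow: / from the test environment does not go live.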
But why should I deny crawlers access to parts of my domain?
It’s very simple: not all web server content should appear in a search engine index. The instructions ask the crawler not to crawl certain paths. This could be the case, for instance, when there are test pages on the web server that are not yet ready for the public. Or it could be that not all pictures in a folder should be indexed.
/robots.txt is particularly suited to keeping non-relevant HTML pages from being crawled. However, page URLs can still end up in the index, for example if the pages are linked externally. In that case, no snippet is displayed in the SERPs. If individual URLs should be excluded from the index, the following methods are suitable.

Meta tags

Two values of the robots meta tag are useful for controlling crawlers: one governs whether the HTML page is indexed, the other whether its links are followed. Both can be set individually for every HTML page.
The meta instruction sits in the head area in the form <meta name="robots" content="...">, addresses the crawler on each HTML page individually, and can give it the following instructions:
Parameter – Meaning
content="index,follow" – index the HTML page, follow its links
content="noindex,follow" – do not index the HTML page, follow its links
content="index,nofollow" – index the HTML page, do not follow its links
content="noindex,nofollow" – do not index the HTML page, do not follow its links
This tells the crawler whether it may add the HTML page to the index and whether it may follow the links on the page. Links on “nofollow” HTML pages do not pass PageRank or other forms of link equity, so “nofollow” can be used deliberately to devalue the links on an HTML page.
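As a rough illustration of how a crawler could read these values, the following sketch uses Python's standard html.parser module to pull the robots meta tag out of a page and turn it into two yes/no decisions. The class name, function name and sample page are invented for this example, and it only handles the four combinations listed above. It assumes that a missing value means "index" and "follow".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the values of <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        for value in (attributes.get("content") or "").split(","):
            value = value.strip().lower()
            if value:
                self.directives.add(value)


def robots_directives(html_text):
    """Return (may_index, may_follow) for a page's HTML source."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    # Assumption: without an explicit "noindex"/"nofollow" the page
    # may be indexed and its links may be followed.
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


# Hypothetical page that should stay out of the index but pass on its links.
SAMPLE_PAGE = """<html><head>
<title>Internal search results</title>
<meta name="robots" content="noindex,follow">
</head><body><a href="/products/">Products</a></body></html>"""

print(robots_directives(SAMPLE_PAGE))  # (False, True)
```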
When dealing with documents that have no head area, the X-Robots-Tag can help. It is sent as an HTTP response header and allows the indexing of non-HTML documents, such as pictures or PDF files, to be restricted in the same way.
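How the header gets attached depends entirely on the web server or framework in use. As a server-neutral sketch, the helper below only decides which extra response header a given URL path should receive. The function name and the list of file types are assumptions made for this example.

```python
from pathlib import PurePosixPath

# Hypothetical policy: file types that should stay out of the index.
NOINDEX_SUFFIXES = {".pdf", ".png", ".jpg"}


def x_robots_headers(request_path):
    """Return the extra HTTP response headers for the given URL path."""
    suffix = PurePosixPath(request_path).suffix.lower()
    if suffix in NOINDEX_SUFFIXES:
        return [("X-Robots-Tag", "noindex")]
    return []


print(x_robots_headers("/downloads/pricelist.pdf"))  # [('X-Robots-Tag', 'noindex')]
print(x_robots_headers("/index.html"))               # []
```

The returned pairs would then be added to the response by whatever serves the files.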
Meta tags are best used to prohibit following links or indexing individual HTML pages.

Canonicals

The canonical tag is primarily an aid to prevent duplicate content in the index. Canonicals tell search engines that, instead of the page they found, the original (more relevant) page should be in the index.
The canonical tag belongs in the head area of an HTML page and takes the form <link rel="canonical" href="URL of the original page">.
Duplicate content occurs as follows, for instance:
  • URLs can be reached with and without www.
  • Session IDs are used in the URLs
  • Several HTML pages have similar content
    • the same product can be offered in several categories
It is useful to give every HTML page a canonical tag with its own URL so that it refers to itself. In this way, tracking parameters appended to the URL don’t cause duplicate content.
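A self-referencing canonical tag could be generated along the following lines. This is only a sketch using Python's standard urllib.parse module: the preferred host, the list of parameters to drop and both function names are assumptions for the example, not a fixed rule, and each site has to decide which parameters actually change the content.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumptions for this example: the www. variant is the preferred one,
# and these query parameters never change what the page shows.
PREFERRED_HOST = "www.example.de"
BARE_HOST = "example.de"
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def canonical_url(url):
    """Map the URL variants of one page to a single preferred URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = PREFERRED_HOST
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    # Drop the fragment as well; it is not part of the page's identity.
    return urlunsplit((parts.scheme, host, parts.path or "/", urlencode(kept), ""))


def canonical_tag(url):
    """Render the <link> element for the head area of the page."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url(url), quote=True)


print(canonical_tag(
    "http://example.de/shoes?utm_source=newsletter&color=red&sessionid=abc123"
))
# <link rel="canonical" href="http://www.example.de/shoes?color=red">
```

Because every variant of the URL produces the same tag, the page always points at one preferred address, no matter how a visitor or crawler reached it.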