Google announced this week that it is open-sourcing its production robots.txt parser, and the move comes with other changes. The company now says that, starting Sept. 1, it will no longer support the robots.txt noindex rule or any other “code that handles unsupported and unpublished rules.”
Robots.txt noindex is unofficial
Google made the announcement about the noindex rule on its Webmaster Central Blog. The company said it has been collecting questions from webmasters and developers about its decision to open-source its robots.txt parser. One question many raised after the internet draft of the Robots Exclusion Protocol was published was this: “Why isn’t a code handler for other rules like crawl-delay included in the code?”
The draft does outline an architecture for non-standard rules, which means crawlers can support their own lines if developers want them to. The company then analyzed how robots.txt rules are used, especially rules that aren’t supported by the internet draft, including noindex, crawl-delay and nofollow.
“Since these rules were never documented by Google, naturally, their usage in relation to Googlebot is very low,” Google said. “Digging further, we saw their usage was contradicted by other rules in all but 0.001% of all robots.txt files on the internet. These mistakes hurt websites’ presence in Google’s search results in ways we don’t think webmasters intended.”
Thus, the company decided to retire the robots.txt noindex code and other unsupported code “in the interest of maintaining a healthy ecosystem and preparing for potential future open source releases.”
Alternatives to the Robots.txt noindex command
Google did offer some alternatives for webmasters and developers who currently rely on the robots.txt noindex rule. For example, the company suggested including noindex in robots meta tags. Google described the noindex directive, which is supported in both HTTP response headers and HTML, as “the most effective way to remove URLs from the index when crawling is allowed.”
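As a rough illustration of the two noindex placements mentioned above, the Python sketch below builds the HTML robots meta tag and adds the equivalent X-Robots-Tag response header. The function names are hypothetical and chosen only for this example; how the tag or header is actually emitted depends on the site's own server or framework.

```python
# Minimal sketch: two ways to send a noindex directive for a page.
# Function names are hypothetical, for illustration only.

def noindex_meta_tag() -> str:
    """Return the robots meta tag to place inside the page's <head>."""
    return '<meta name="robots" content="noindex">'


def with_noindex_header(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the response headers with X-Robots-Tag: noindex added."""
    updated = dict(headers)  # copy so the caller's dict is not mutated
    updated["X-Robots-Tag"] = "noindex"
    return updated


if __name__ == "__main__":
    print(noindex_meta_tag())
    print(with_noindex_header({"Content-Type": "application/pdf"}))
```

The header form is useful for non-HTML files such as PDFs, where there is no head section to hold a meta tag. In either case the page must remain crawlable, since the crawler has to fetch it to see the directive.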
The company also suggested that webmasters use 404 and 410 HTTP status codes, both of which tell the crawler that the page doesn’t exist. URLs that return these codes are dropped from Google’s index once they’ve been crawled and processed. Google added that placing a page behind a login will remove it from the index unless markup is included to designate it as subscription content.
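One way to picture the status-code approach is a small lookup that decides what a server should answer for a given path. This is only a sketch: the path sets and the function name are hypothetical, and a real site would wire the same decision into its own routing.

```python
from http import HTTPStatus

# Hypothetical example data: pages that still exist and pages removed on purpose.
LIVE_PATHS = {"/", "/about"}
REMOVED_PATHS = {"/old-promo"}


def status_for(path: str) -> HTTPStatus:
    """Pick the HTTP status a server could return for the given path."""
    if path in LIVE_PATHS:
        return HTTPStatus.OK          # 200: page exists
    if path in REMOVED_PATHS:
        return HTTPStatus.GONE        # 410: page was deliberately removed
    return HTTPStatus.NOT_FOUND       # 404: no such page


if __name__ == "__main__":
    for path in ("/about", "/old-promo", "/typo"):
        status = status_for(path)
        print(path, status.value, status.phrase)
```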
The company also suggested using disallow in robots.txt. Search engines can’t index pages they aren’t aware of, so in most cases, blocking them from crawling a page will mean that the content will not be indexed. Another suggestion is to use Google’s Search Console Remove URL tool, which will remove content from the company’s index temporarily.
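To check how a disallow rule affects a given URL, the sketch below uses Python's standard-library urllib.robotparser. Note that this is Python's own parser, not the Google parser being open-sourced, so edge cases may be interpreted differently. The robots.txt content, the example.com URLs and the helper name are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that blocks one directory for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from memory, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/private/report.html",
                "https://example.com/blog/post.html"):
        print(url, is_crawl_allowed(ROBOTS_TXT, "Googlebot", url))
```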