  1. #1
    Registered User
    Join Date
    Aug 2020
    Posts
    59

    What is a Robots.txt File?

    A robots.txt file is the practical implementation of the Robots Exclusion Protocol. The protocol delineates the guidelines that every legitimate robot, including Google's bots, is expected to follow. Illegitimate robots, such as malware, spyware, and the like, by definition operate outside these rules.

    You can take a peek behind the curtain of any website by typing in its domain and adding /robots.txt to the end of the URL.

    For example, here’s POD Digital’s version:

    (Image: POD Digital's robots.txt file, showing its user-agent directives)
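    The original screenshot isn't reproduced here, but for a small site the file often amounts to little more than the sketch below. This is purely illustrative and not POD Digital's exact contents; the /wp-admin/ path is the same one discussed later in this post.

    # Illustrative sketch of a small site's robots.txt
    User-agent: *
    Disallow: /wp-admin/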
    As you can see, it is not necessary to have an all-singing, all-dancing file as we are a relatively small website.

    Where to Locate the Robots.txt File
    Your robots.txt file is stored in the root directory of your site. To locate it, open cPanel's File Manager (or connect over FTP) and you'll find the file in your public_html directory.

    (Image: the cPanel File Manager)
    There is very little to these files, so they won't be hefty – probably only a few hundred bytes, if that.

    Once you open the file in your text editor, you will be greeted with something that looks a little like this:

    (Image: a basic robots.txt file open in a text editor)
    If you aren’t able to find a file in your site’s inner workings, then you will have to create your own.

    How to Put Together a Robots.txt File
    Robots.txt is a very basic text file, so it is straightforward to create. All you need is a simple text editor like Notepad. Open a blank document and save it as 'robots.txt'.
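    If you would rather the new file not be completely empty, a harmless starting point is a rule set that applies to every bot and blocks nothing – a sketch, not a requirement:

    # Applies to all bots; an empty Disallow means nothing is blocked
    User-agent: *
    Disallow: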

    Now log in to your cPanel and locate the public_html folder to access the site's root directory. Once that is open, drag your file into it.

    Finally, you must ensure that you have set the correct permissions for the file. As the owner, you need to be able to read and write the file, while everyone else should only be able to read it.

    The file should display a “0644” permission code.

    (Image: the 'Change file attributes' pop-up in cPanel)
    If it doesn't, you will need to change this, so click on the file and select 'File Permissions'.

    Voila! You have a Robots.txt file.

    Robots.txt Syntax
    A robots.txt file is made up of multiple sections of ‘directives’, each beginning with a specified user-agent. The user agent is the name of the specific crawl bot that the code is speaking to.

    There are two options available:

    You can use a wildcard to address all search engines at once.
    You can address specific search engines individually.
    When a bot is deployed to crawl a website, it will be drawn to the block that is calling to it.

    Here is an example:

    (Image: an example robots.txt file with multiple directive blocks)
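    As a rough sketch of the layout (the paths here are placeholders, not recommendations), a file with one wildcard block and one bot-specific block might look like this:

    # Block for every bot that finds no closer match
    User-agent: *
    Disallow: /example-private-folder/

    # Block that applies only to Googlebot
    User-agent: Googlebot
    Disallow: /example-not-for-google/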
    User-Agent Directive
    The first line in each block is the 'user-agent', which pinpoints a specific bot. The user-agent is matched against the bot's name, so, for example:

    So, if you want to tell Googlebot what to do, for example, start the block with:

    User-agent: Googlebot

    Search engines always try to pinpoint specific directives that relate most closely to them.

    Say, for example, you have two sets of directives: one for Googlebot-Video and one for Bingbot. A bot identifying itself as 'Bingbot' will follow the Bingbot instructions, whereas the Googlebot-Video bot will pass over them and go in search of the block that matches it most specifically.
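    Sketched out in robots.txt terms (the paths are placeholders), that situation looks something like this:

    # Bingbot follows this block
    User-agent: Bingbot
    Disallow: /example-blocked-for-bing/

    # Googlebot-Video skips the block above and looks for the closest match to its own name
    User-agent: Googlebot-Video
    Disallow: /example-blocked-for-video/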

    Most search engines operate a few different bots; each engine documents its own list of crawlers.

    Host Directive
    The host directive is supported only by Yandex at the moment, even though there has been speculation that Google may support it. This directive allows you to decide whether to show the www. before your domain, using a block like this:

    Host: poddigital.co.uk

    Since Yandex is the only confirmed supporter of the directive, it is not advisable to rely on it. Instead, 301 redirect the hostnames you don't want to the ones that you do.

    Disallow Directive
    We will cover this in a more specific way a little later on.

    The second line in a block of directives is Disallow. You can use this to specify which sections of the site shouldn’t be accessed by bots. An empty disallow means it is a free-for-all, and the bots can please themselves as to where they do and don’t visit.
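    The difference between an empty disallow and a blanket one is easiest to see side by side – a quick sketch:

    # Empty Disallow: every page may be crawled
    User-agent: *
    Disallow:

    # Disallow with a slash: the whole site is off limits
    User-agent: *
    Disallow: /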

    Sitemap Directive (XML Sitemaps)
    Using the sitemap directive tells search engines where to find your XML sitemap.
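    The directive is simply the full URL of your sitemap on its own line; the address below is a placeholder:

    Sitemap: https://www.example.com/sitemap.xml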

    However, probably the most useful thing to do is to submit your sitemap to each search engine's own webmaster tools, because each of them provides a lot of valuable information about your website.

    However, if you are short on time, the sitemap directive is a viable alternative.

    Crawl-Delay Directive
    Yahoo, Bing, and Yandex can be a little trigger happy when it comes to crawling, but they do respond to the crawl-delay directive, which keeps them at bay for a while.

    Applying this line to your block:

    Crawl-delay: 10

    means asking those search engines to wait ten seconds before crawling the site, or ten seconds before re-accessing it after a crawl – essentially the same thing, interpreted slightly differently depending on the search engine.
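    The directive sits inside the block for the bot you want to slow down; for example, a sketch aimed at Bingbot:

    User-agent: Bingbot
    # Ask Bingbot to wait ten seconds between requests
    Crawl-delay: 10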

    Why Use Robots.txt
    Now that you know about the basics and how to use a few directives, you can put together your file. However, this next step will come down to the kind of content on your site.

    Robots.txt is not an essential element to a successful website; in fact, your site can still function correctly and rank well without one.

    However, there are several key benefits you must be aware of before you dismiss it:
    Point Bots Away From Private Folders: Preventing bots from checking out your private folders will make them much harder to find and index.

    Keep Resources Under Control: Each time a bot crawls your site, it sucks up bandwidth and other server resources. On sites with tons of content and lots of pages – e-commerce sites, for example, can have thousands of them – these resources can be drained very quickly. You can use robots.txt to make it difficult for bots to access individual scripts and images, which retains valuable resources for real visitors.

    Specify the Location of Your Sitemap: This is quite an important point: you want to let crawlers know where your sitemap is located so they can scan through it.

    Keep Duplicated Content Away From SERPs: By adding a rule to your robots.txt file, you can prevent crawlers from indexing pages that contain duplicated content.

    You will naturally want search engines to find their way to the most important pages on your website. By politely cordoning off specific pages, you can control which pages are put in front of searchers (be sure to never completely block search engines from seeing certain pages, though).
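    Pulling those benefits together, a file along the lines of the sketch below (every path and the sitemap URL are placeholders) keeps bots out of a private folder, keeps a duplicated version of the content out of the crawl, and points crawlers at the sitemap:

    User-agent: *
    # Keep bots away from a private folder
    Disallow: /example-private/
    # Keep a duplicated, print-friendly copy of the pages out of the crawl
    Disallow: /example-print/

    # Tell crawlers where the sitemap lives
    Sitemap: https://www.example.com/sitemap.xml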

    (Image: the disallow rule in POD Digital's robots.txt file)
    For example, if we look back at the POD Digital robots file, we see that this URL:

    poddigital.co.uk/wp-admin has been disallowed.

    Since that page is made just for us to log in to the control panel, it makes no sense to allow bots to waste their time and energy crawling it.

    Noindex
    In July 2019, Google announced that it would stop supporting the noindex directive in robots.txt, as well as many other unsupported and unpublished rules that many of us had previously relied on.

    Many of us decided to look for alternative ways to apply the noindex directive, and below you can see a few options you might decide to go for instead:

    Noindex Tag / Noindex HTTP Response Header: This can be implemented in two ways: as an HTTP response header with an X-Robots-Tag, or as a <meta> tag placed within the <head> section of the page.

    Your <meta> tag should look like the example below:

    <meta name="robots" content="noindex">
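    If you go down the header route instead, the server's response for the page carries the equivalent instruction as a header. How you configure your server to send it will vary; the header itself looks like this:

    X-Robots-Tag: noindex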

    TIP: Bear in mind that if the page has been blocked by the robots.txt file, the crawler will never see your noindex tag, and there is still a chance that the page will be presented within SERPs.

    Password Protection: Google states that, in most cases, if you hide a page behind a login, it should be removed from Google's index. The only exception is if you use schema markup to indicate that the page holds subscription or paywalled content.

    404 & 410 HTTP Status Code: 404 & 410 status codes represent the pages that no longer exist. Once a page with 404/410 status is crawled and fully processed, it should be dropped automatically from Google’s index.

    You should crawl your website regularly to catch 404 and 410 errors and, where needed, use 301 redirects to send traffic to an existing page.

    Disallow Rule in robots.txt: By adding a page-specific disallow rule to your robots.txt file, you will prevent search engines from crawling the page. In most cases, the page and its content won't be indexed. You should, however, keep in mind that search engines are still able to index the page based on information and links from other pages.

    Search Console Remove URL Tool: This alternative route does not solve the indexing issue in full, as the Search Console Remove URL Tool only removes the page from the SERPs for a limited time.

    However, this might give you enough time to prepare further robots rules and tags to remove pages in full from SERPs.

    You can find the Remove URL Tool on the left-hand side of the main navigation on Google Search Console.

    (Image: the Removals tool in Google Search Console)
    Noindex vs. Disallow
    Many of you probably wonder whether it is better to use the noindex tag or the disallow rule in your robots.txt file. We have already covered in the previous section why the noindex rule is no longer supported in robots.txt, along with the different alternatives.

    If you want to ensure that one of your pages is not indexed by search engines, you should definitely look at the noindex meta tag. It allows the bots to access the page, but the tag will let robots know that this page should not be indexed and should not appear in the SERPs.

    The disallow rule might not be as effective as the noindex tag in general. Of course, by adding it to robots.txt, you are blocking the bots from crawling your page, but if that page is linked from other pages via internal and external links, bots might still index it based on information provided by those other pages and websites.

    You should remember that if you disallow a page and also add the noindex tag, the robots will never see the noindex tag, which can still cause the page to appear in the SERPs.

    Using Regular Expressions & Wildcards
    OK, so now we know what a robots.txt file is and how to use it, but you might think, “I have a big eCommerce website, and I would like to disallow all the pages which contain question marks (?) in their URLs.”

    This is where we would like to introduce you to wildcards, which can be implemented within robots.txt. Currently, you have two types of wildcards to choose from.

    * Wildcard - the * character will match any sequence of characters you wish. This type of wildcard is a great solution for URLs that follow the same pattern. For example, you might wish to disallow crawling of all filter pages that include a question mark (?) in their URLs, as shown below.

    (Image: a disallow rule using the * wildcard)
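    Written out, a rule of that kind might look like the sketch below; adapt the pattern to whatever your own filter URLs share:

    User-agent: *
    # Block any URL that contains a question mark
    Disallow: /*?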
    $ Wildcard - the $ character will match the end of your URL. For example, if you want to ensure that your robots file disallows bots from accessing all PDF files, you might add a rule like the one presented below:

    (Image: a disallow rule using the $ wildcard)
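    Written out, such a rule might look like the following sketch:

    # Every bot may crawl the site, except URLs ending in .pdf
    User-agent: *
    Disallow: /*.pdf$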
    Let’s quickly break down the example above. Your robots.txt allows any user-agent to crawl your website, but it disallows access to any URL that ends in .pdf.

    Mistakes to Avoid
    We have talked a little bit about the things you could do and the different ways you can operate your robots.txt. We are going to delve a little deeper into each point in this section and explain how each may turn into an SEO disaster if not utilized properly.

  2. #2
    Senior Member
    Join Date
    Dec 2019
    Posts
    1,837
    The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned.

  3. #3
    Senior Member
    Join Date
    Aug 2020
    Posts
    493
    A robots.txt file tells search engine crawlers which pages or files the crawler can or can't request from your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google.

  4. #4
    Senior Member
    Join Date
    May 2020
    Location
    Spain
    Posts
    982
    Thanks for providing good information to the community. Be well, buddy! Looking forward to seeing more posts from you!

  7. #7
    Registered User
    Join Date
    Feb 2014
    Location
    delhi
    Posts
    572
    In networking, a protocol is a format for providing instructions or commands. Robots.txt files use a couple of different protocols. The main protocol is called the Robots Exclusion Protocol. This is a way to tell bots which webpages and resources to avoid. Instructions formatted for this protocol are included in the robots.txt file.

    The other protocol used for robots.txt files is the Sitemaps protocol. This can be considered a robots inclusion protocol. Sitemaps show a web crawler which pages they can crawl. This helps ensure that a crawler bot won't miss any important pages.

  8. #8
    Member
    Join Date
    Jul 2020
    Posts
    68
    A robots.txt file is a set of instructions for bots. This file is included in the source files of most websites. Robots.txt files are mostly intended for managing the activities of good bots like web crawlers, since bad bots aren't likely to follow the instructions.

  10. #10
    Registered User
    Join Date
    Aug 2018
    Posts
    1,141

    Think of a robots.txt file as being like a "Code of Conduct" sign posted on the wall at a gym, a bar, or a community center: The sign itself has no power to enforce the listed rules, but "good" patrons will follow the rules, while "bad" ones are likely to break them and get themselves banned.

    A bot is an automated computer program that interacts with websites and applications. There are good bots and bad bots, and one type of good bot is called a web crawler bot. These bots "crawl" webpages and index the content so that it can show up in search engine results. A robots.txt file helps manage the activities of these web crawlers so that they don't overtax the web server hosting the website, or index pages that aren't meant for public view.





  14. #14
    Registered User
    Join Date
    Dec 2019
    Location
    United Arab Emirates
    Posts
    152
    A robots.txt file is a text file for restricting bots (robots, search engine crawlers) from a website or from certain pages on it. Using a robots.txt file with a disallow directive, we can keep bots or search engine crawling programs away from a website, or away from certain folders and files.
