
How to use robots.txt

June 1st, 2009 Informer

Search engines use programs called robots (also known as spiders) that automatically visit web pages on the Internet and collect information from them.

You can create a plain text file named robots.txt on your site and declare in it the parts of the site that you do not want robots to visit. That way, some or all of the site’s content can be kept out of search engines, or a designated search engine can be limited to the content you specify. The robots.txt file should be placed in the root directory of the site.

When a search robot (some call it a search spider) visits a site, it first checks whether robots.txt exists in the site’s root directory. If it exists, the robot determines the scope of its visit from the contents of the file; if the file does not exist, the robot simply crawls along the links.
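The lookup described above can be sketched in a few lines of Python. This is only an illustration: the function names robots_txt_url and crawl_plan are made up for this example, and a real robot does far more than this.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # robots.txt sits at the root of the site, whatever the page path is.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def crawl_plan(robots_txt):
    """Describe what a robot does next; None stands for a missing file."""
    if robots_txt is None:
        return "no robots.txt: crawl along the links"
    return "robots.txt found: apply its rules before crawling"


print(robots_txt_url("http://www.example.com/a/b.html?x=1"))
# http://www.example.com/robots.txt
```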

robots.txt file:

The “robots.txt” file contains one or more records separated by blank lines (lines are terminated by CR, CR/NL, or NL). Each line of a record has the following format:

“<field>:<optionalspace><value><optionalspace>”.

In this file, you can use # for comments, following the same convention as in UNIX. A record usually begins with one or more User-agent lines, followed by a number of Disallow lines, as follows:

User-agent:

The value of this field is the name of the search engine robot that the record applies to. If the “robots.txt” file contains more than one User-agent record, more than one robot is subject to its restrictions. The file must contain at least one User-agent record. If the value is set to *, the record applies to any robot, and the “robots.txt” file may contain only one “User-agent: *” record.

Disallow:

The value of this field describes a URL that should not be visited. It can be a complete path or a partial one; any URL that begins with the Disallow value will not be accessed by the robot. For example, “Disallow: /help” blocks search engines from both /help.html and /help/index.html, whereas “Disallow: /help/” allows robots to visit /help.html but not /help/index.html. An empty Disallow record means that all parts of the site may be visited. The “/robots.txt” file must contain at least one Disallow record. If “/robots.txt” is an empty file, the site is open to all search engine robots.
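The prefix rule for Disallow is easy to get wrong, so here is a minimal Python sketch of it. The helper name is_disallowed is hypothetical, and the sketch covers only the plain prefix matching described above, with no wildcards.

```python
def is_disallowed(path, disallow_values):
    """Return True if path begins with any non-empty Disallow value."""
    for value in disallow_values:
        # An empty Disallow value blocks nothing.
        if value and path.startswith(value):
            return True
    return False


# "Disallow: /help" blocks both paths.
print(is_disallowed("/help.html", ["/help"]))         # True
print(is_disallowed("/help/index.html", ["/help"]))   # True
# "Disallow: /help/" blocks only paths inside the directory.
print(is_disallowed("/help.html", ["/help/"]))        # False
print(is_disallowed("/help/index.html", ["/help/"]))  # True
```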

Examples of robots.txt usage:

Example 1. Block all search engines from visiting any part of the site:

User-agent: *
Disallow: /

Example 2. Allow all robots to visit everything (alternatively, create an empty “/robots.txt” file):

User-agent: *
Disallow:

Example 3. Block one particular search engine:

User-agent: BadBot
Disallow: /

Example 4. Allow only one particular search engine:

User-agent: baiduspider
Disallow:

User-agent: *
Disallow: /

Example 5. A simple example. In this case, the site restricts search engine access to three directories, so that search engines will not visit them. Note that each directory must be declared on its own line; they cannot be combined as “Disallow: /cgi-bin/ /tmp/”. Also, the * in User-agent has a special meaning, “any robot”; it is not a general wildcard, so the file cannot contain records such as “Disallow: /tmp/*” or “Disallow: *.gif”.

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
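To make the record structure concrete, here is a rough Python sketch that splits a robots.txt text into records of User-agent and Disallow values. The function parse_records is a hypothetical helper written for this article; it handles only the two fields described above, # comments, and blank-line separators.

```python
def parse_records(text):
    """Split robots.txt text into (user_agents, disallow_values) records."""
    records = []
    agents, disallows = [], []
    for raw in text.splitlines() + [""]:      # the extra "" flushes the last record
        stripped = raw.strip()
        if stripped.startswith("#"):
            continue                          # comment-only lines do not end a record
        line = stripped.split("#", 1)[0].strip()
        if not line:                          # a blank line ends the current record
            if agents:
                records.append((agents, disallows))
            agents, disallows = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
    return records


EXAMPLE_4 = "User-agent: baiduspider\nDisallow:\n\nUser-agent: *\nDisallow: /\n"
print(parse_records(EXAMPLE_4))
# [(['baiduspider'], ['']), (['*'], ['/'])]
```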

Robot-specific parameters:

1. Google

Allow Googlebot:

If you want to block all robots other than Googlebot from accessing your pages, you can use the following syntax:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

Googlebot follows the lines addressed to it by name, rather than the lines addressed to all robots.
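You can try this behaviour with Python’s standard urllib.robotparser module, which also prefers a record that names the robot over the * record. This is a sketch for experimenting, not a statement about how Googlebot itself is implemented; the bot name SomeOtherBot and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""


def can_fetch(user_agent, url):
    """Check a URL against ROBOTS_TXT for the given robot name."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_fetch("Googlebot", "http://www.example.com/page.html"))     # True
print(can_fetch("SomeOtherBot", "http://www.example.com/page.html"))  # False
```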

“Allow” extension:

Googlebot recognizes an extension to the robots.txt standard called “Allow”. Other search engine robots may not recognize this extension, so check with the other search engines you are interested in. The “Allow” line works exactly like the “Disallow” line: simply list the directory or page you want to allow.

You can also use “Disallow” and “Allow” together. For example, to block every page in a subdirectory except one, you can use the following entries:

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

These entries block every page inside the folder1 directory except myfile.html.
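How a robot resolves a conflict between Allow and Disallow depends on the robot. The Python sketch below assumes the rule that Google documents, where the most specific (longest) matching path wins and Allow wins a tie; other robots may simply apply the first matching line. The function is_allowed is a hypothetical helper, not part of any library.

```python
def is_allowed(path, rules):
    """rules is a list of (directive, prefix) pairs such as ("disallow", "/folder1/").

    Assumption: the longest matching prefix wins, and Allow wins a tie.
    """
    best_len, allowed = -1, True   # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


RULES = [("disallow", "/folder1/"), ("allow", "/folder1/myfile.html")]
print(is_allowed("/folder1/myfile.html", RULES))  # True
print(is_allowed("/folder1/other.html", RULES))   # False
```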

If you want to block Googlebot but allow another Google robot (such as Googlebot-Mobile), you can use an “Allow” rule to grant that robot access. For example:

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Allow: /

Using * to match a sequence of characters:

You can use an asterisk (*) to match any sequence of characters. For example, to block access to all subdirectories that begin with “private”, use the following entries:

User-agent: Googlebot
Disallow: /private*/

To block access to all URLs that contain a question mark (?), you can use the following entries:

User-agent: *
Disallow: /*?*

Using $ to match the end of the URL:

You can use the $ character to require a match at the end of the URL. For example, to block URLs that end with .asp, you can use the following entries:

User-agent: Googlebot
Disallow: /*.asp$
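Python’s built-in urllib.robotparser does not understand these two extensions, so to experiment with them you need a small matcher of your own. The sketch below translates a pattern into a regular expression; the names pattern_to_regex and matches are made up for this example, and the translation is a plausible reading of the rules above rather than Google’s actual code.

```python
import re


def pattern_to_regex(pattern):
    """Compile a robots.txt path pattern that may use * and a trailing $."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then let every * stand for "any characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start, so a rule always applies from the path's beginning.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/private*/", "/private-files/a.html"))  # True
print(matches("/*?*", "/page.html?id=3"))              # True
print(matches("/*.asp$", "/dir/page.asp"))             # True
print(matches("/*.asp$", "/dir/page.asp?x=1"))         # False
```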

You can use this pattern matching in combination with the Allow directive. For example, if a ? indicates a session ID, you may want to exclude all URLs that contain one to ensure that Googlebot does not crawl duplicate pages. However, URLs that end with a ? may be the version of the page that you do want included. In this case, the robots.txt file can be set as follows:

User-agent: *
Allow: /*?$
Disallow: /*?

The Disallow: /*? line blocks any URL that contains a ? (specifically, it blocks any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ line allows any URL that ends with a ? (specifically, it allows any URL that begins with your domain name, followed by any string, followed by a question mark, with no characters after the question mark).

Sitemap:

A newer way to support sitemaps is to include a direct link to the sitemap file in robots.txt.

Like this:

Sitemap: http://www.eastsem.com/sitemap.xml

The search engine companies that currently support this are Google, Yahoo, Ask and MSN.

However, I would still suggest submitting the sitemap to Google Sitemaps as well, since it has many features that let you analyze the state of your links.
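If you want to confirm that the Sitemap line in your file can be read, Python’s standard urllib.robotparser module exposes it through site_maps() (available in Python 3.8 and later). The sketch below reuses the sitemap URL shown above; the rest of the file content is just a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.eastsem.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
sitemaps = parser.site_maps()   # a list of URLs, or None if there is no Sitemap line
print(sitemaps)
# ['http://www.eastsem.com/sitemap.xml']
```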

Robots.txt benefits:

1. Almost all search engine spiders follow the crawl rules in robots.txt. By convention, a spider’s entry point into a website is the site’s robots.txt file, provided of course that the file exists. For a site with no robots.txt, the spider is redirected to a 404 error page. Related studies suggest that if the site uses a custom 404 error page, the spider may treat that page as the robots.txt, even though it is not a plain text file. This can cause serious problems when the spider indexes the site and can affect how the search engine includes the site’s pages.

2. robots.txt can stop unnecessary robots from consuming valuable server bandwidth. Examples are email retrievers, which are meaningless to most sites, and image strippers, which have little value for most non-graphics sites but consume a considerable amount of bandwidth.

3. robots.txt can stop search engines from crawling and indexing non-public pages, such as the site’s back-end programs and administration pages. In fact, for sites that generate temporary pages during operation, search engines may index even those temporary files if robots.txt is not configured.

4. For sites with rich content and many pages, configuring robots.txt matters even more, because search engine spiders often put tremendous pressure on such sites: spider visits arrive like a flood and, if left unchecked, can even affect normal access to the site.

5. Similarly, if a site contains duplicate content, using robots.txt to keep some pages from being indexed and included by search engines can avoid duplicate-content penalties and help ensure that the site’s ranking is not affected.
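For the custom 404 problem in point 1, a site owner can check what the server actually returns for /robots.txt. The sketch below is only a rough self-check under the assumption that a healthy response has status 200 and a text/plain content type; looks_like_robots_txt is a hypothetical helper, and real spiders may judge the response differently.

```python
def looks_like_robots_txt(status, content_type):
    """Return True if a /robots.txt response looks like a real robots.txt file."""
    # Ignore parameters such as "; charset=utf-8" when comparing the media type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return status == 200 and media_type == "text/plain"


print(looks_like_robots_txt(200, "text/plain; charset=utf-8"))  # True
print(looks_like_robots_txt(200, "text/html"))                  # False, likely a custom error page
print(looks_like_robots_txt(404, "text/html"))                  # False, no robots.txt at all
```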

Risks associated with robots.txt and their solutions:

1. As with everything, robots.txt also brings a certain degree of risk: it points out the site’s directory structure and the location of private data to attackers. Provided the web server’s security is configured properly this is not a serious problem, but it does lower the difficulty of an attack for those with ill intent.

For example, if the site’s private data is accessed at www.yourdomain.com/private/index.html, the settings in robots.txt may be as follows:

User-agent: *
Disallow: /private/

In this way, an attacker only needs to look at robots.txt to learn where the content you want to hide is located, and can then enter www.yourdomain.com/private/ in a browser to reach content you did not want made public. The usual countermeasures are as follows:

Set access permissions on /private/ and password-protect its content, so that attackers cannot get in.

Another approach is to rename the directory’s default document from index.html to something else, for example abc-protect.html, so that the content’s address becomes www.yourdomain.com/private/abc-protect.html. At the same time, create a new index.html file whose content says something like “You do not have permission to access this page”. An attacker who does not know the actual file name then cannot reach the private content.
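Before publishing a robots.txt file, it can help to list exactly what it reveals to anyone who reads it. The short Python sketch below does that; disclosed_paths is a hypothetical helper name, and it simply collects the non-empty Disallow values.

```python
def disclosed_paths(robots_txt):
    """List every non-empty Disallow path that the file makes public."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0]           # ignore comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


print(disclosed_paths("User-agent: *\nDisallow: /private/\n"))
# ['/private/']
```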

2. If the settings are wrong, search engines will delete all of the site’s indexed data.

User-agent: *
Disallow: /

The code above bans all search engines from indexing any of the site’s data.
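A quick check before uploading the file can catch this mistake. The sketch below uses Python’s standard urllib.robotparser to test whether a robot could fetch the site root; blocks_whole_site and the bot name ExampleBot are placeholders invented for this example.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_txt, user_agent="ExampleBot"):
    """Return True if the file stops user_agent from fetching the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Only a blanket rule such as "Disallow: /" can match the root path "/".
    return not parser.can_fetch(user_agent, "/")


print(blocks_whole_site("User-agent: *\nDisallow: /\n"))          # True
print(blocks_whole_site("User-agent: *\nDisallow: /private/\n"))  # False
```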

Currently, the vast majority of search engine robots comply with the rules in robots.txt. Support for the Robots META tag is less widespread, but it is gradually increasing. The well-known search engine Google supports it fully, and Google has also added an “archive” directive that controls whether Google keeps a snapshot of the page. For example:

<META NAME="googlebot" CONTENT="index,follow,noarchive">

This tells Googlebot to index the page and follow the links on it, but not to keep a Google snapshot of the page.
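To check which directives a page actually carries, you can read the tag with Python’s standard html.parser module. This is a small sketch: the class RobotsMetaParser and the function meta_directives are names made up for this example, and they look only at META tags named “robots” or the given robot name.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from META tags named "robots" or a given robot."""

    def __init__(self, robot_name="googlebot"):
        super().__init__()
        self.names = {"robots", robot_name.lower()}
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)  # tag and attribute names arrive lowercased
        if (attr.get("name") or "").lower() in self.names:
            content = attr.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def meta_directives(html, robot_name="googlebot"):
    parser = RobotsMetaParser(robot_name)
    parser.feed(html)
    return parser.directives


print(meta_directives('<META NAME="googlebot" CONTENT="index,follow,noarchive">'))
# ['index', 'follow', 'noarchive']
```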
