Gallery2:How to keep robots off CPU intensive pages - Gallery Codex

Slideshow

Pages such as the slideshow are very CPU intensive. To an indexing robot they are also useless, since the information they provide is redundant. The administrator therefore has every reason to keep robots away from such pages.

With the URL Rewrite module enabled, the default slideshow URL has the following form: "/v/my_album/my_sub_album/my_photo.jpg/slideshow.html". The problem is that robots.txt syntax offers no way to exclude that sort of URL. To make the URL excludable, some URL rewriting is required.

There is no need to fiddle with mod_rewrite directly, as the nifty rewrite module can handle the details itself. By default the "View Slideshow" rewrite target is "v/%path%/slideshow.html". The constant slideshow marker ("/slideshow.html") sits to the right of the variable path ("%path%"), which is why the slideshow ban cannot be expressed in robots.txt syntax. Reversing this order gives us an excludable URL.

So change the rewrite target for "View Slideshow" from "v/%path%/slideshow.html" to "v/slideshow/%path%".

Then add "Disallow: /v/slideshow/" to your robots.txt. If you use the PATH_INFO mode of the URL Rewrite module, the line is "Disallow: /main.php/v/slideshow/" instead.
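To see the effect of the new rule, here is a minimal sketch using Python's standard urllib.robotparser. The "User-agent: *" line and the photo page URL are assumptions added for illustration; the slideshow paths are the example ones used above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the User-agent line is an assumption,
# the Disallow line is the one suggested above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /v/slideshow/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path):
    """Return True if a generic robot may fetch the given path."""
    return parser.can_fetch("*", path)


# Default rewrite target: the constant part is at the end of the path.
old_slideshow = "/v/my_album/my_sub_album/my_photo.jpg/slideshow.html"
# Changed rewrite target: the constant part is at the beginning.
new_slideshow = "/v/slideshow/my_album/my_sub_album/my_photo.jpg"
# An ordinary photo page (assumed URL form), which should stay crawlable.
photo_page = "/v/my_album/my_sub_album/my_photo.jpg.html"

print(allowed(old_slideshow))  # True: the rule does not cover the old form
print(allowed(new_slideshow))  # False: the new form is excluded
print(allowed(photo_page))     # True
```

Only the reordered URL is caught by the rule, which is why the rewrite target has to change before the robots.txt line has any effect.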

And that's it: no more spiders hogging your precious resources in vain!

Other Pages

You can also prevent robots from visiting other pages that are of no use to them. Some examples are:

Advanced Search
/main.php?g2_view=search.SearchScan&g2_form%5BuseDefaultSettings%5D=1&g2_return=%2Fgallery%2Fmain.php%3F
Login
/main.php?g2_view=core.UserAdmin&g2_subView=core.UserLogin&g2_return=%2Fgallery%2Fmain.php%3F
Add Comment
/c/add/
Shutterfly
/main.php?g2_view=shutterfly.PrintPhotos&g2_itemId=80511&g2_returnUrl=http%3A%2F%2Fexample.com%2Fgallery%2Fmain.php%3Fg2_path%3Dexamplepath%2Fphoto.jpg.html&g2_authToken=6c4286fe85d0
Ecard
/main.php?g2_view=ecard.SendEcard&g2_itemId=80511&g2_return=%2Fgallery%2Fv%2Fexamplepath%2Fphoto.jpg.html

You can do this by adding the following lines to your robots.txt file:

Disallow: /main.php?g2_view=search.SearchScan&
Disallow: /main.php?g2_view=core.UserAdmin&
Disallow: /c/add/
Disallow: /main.php?g2_view=shutterfly.PrintPhotos&
Disallow: /main.php?g2_view=ecard.SendEcard&

Note: The comments link and its robots.txt line assume you've enabled the rewrite rule for comments (much as you did above for the slideshow).

Note: If you've installed Gallery into a sub-directory such as http://www.example.com/gallery/, then your robots.txt lines need to be prefixed with that same path (e.g. Disallow: /gallery/c/add/).
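The prefixing described in the note above can be sketched as a small helper. The function name disallow_lines and the way the install path is passed in are hypothetical, chosen only for this illustration.

```python
def disallow_lines(install_path, rules):
    """Build Disallow lines, prefixing each rule with the install path.

    install_path: where Gallery lives, e.g. "/gallery/" or "" for the
    web root (hypothetical parameter for this sketch).
    rules: rule paths relative to the Gallery root, each starting with "/".
    """
    trimmed = install_path.strip("/")
    base = "/" + trimmed if trimmed else ""
    return ["Disallow: " + base + rule for rule in rules]


rules = [
    "/v/slideshow/",
    "/c/add/",
    "/main.php?g2_view=core.UserAdmin&",
]

for line in disallow_lines("/gallery/", rules):
    print(line)
# Disallow: /gallery/v/slideshow/
# Disallow: /gallery/c/add/
# Disallow: /gallery/main.php?g2_view=core.UserAdmin&
```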

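Note: Keep in mind that the way a robots.txt file is parsed is very simple. Each Disallow value is compared against the beginning of the URL path (a prefix match). The original robots.txt convention allows no globbing or regular expressions; some crawlers accept wildcard extensions such as "*", but you cannot rely on every robot supporting them.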
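As a rough illustration of that matching rule, the sketch below applies a plain prefix comparison to the rules listed earlier. The function is_disallowed is a hypothetical helper, not part of Gallery or of any crawler, and the reordered login URL is an invented example used only to show the limits of prefix matching.

```python
def is_disallowed(url, disallow_prefixes):
    """Plain prefix matching: a URL is excluded if it starts with any rule."""
    return any(url.startswith(prefix) for prefix in disallow_prefixes if prefix)


RULES = [
    "/main.php?g2_view=search.SearchScan&",
    "/main.php?g2_view=core.UserAdmin&",
    "/c/add/",
    "/main.php?g2_view=shutterfly.PrintPhotos&",
    "/main.php?g2_view=ecard.SendEcard&",
]

# The login URL from the examples above starts with the second rule.
login = ("/main.php?g2_view=core.UserAdmin&g2_subView=core.UserLogin"
         "&g2_return=%2Fgallery%2Fmain.php%3F")
# Same parameters in a different order (invented): the prefix no longer matches.
reordered = "/main.php?g2_subView=core.UserLogin&g2_view=core.UserAdmin"

print(is_disallowed(login, RULES))      # True
print(is_disallowed(reordered, RULES))  # False

# Under plain prefix matching a "*" is just a literal character,
# so a glob-style rule does not catch the default slideshow URL.
print(is_disallowed("/v/my_album/my_photo.jpg/slideshow.html",
                    ["/v/*/slideshow.html"]))  # False
```

This is why each rule above ends right after the g2_view value: the rule only works when the constant part of the URL comes first.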

You can test and confirm that your robots.txt lines are doing what you want by using Google's Webmaster Tools. Add your site and verify it; then, in the Dashboard under "Site Configuration" ... "Crawler Access", you can test URLs from your site.