#35380 Do we need robots.txt for security reasons?

Posted in ‘Admin Tools for Joomla! 4 & 5’

Latest post by nicholas on Tuesday, 15 June 2021 07:38 CDT

jjst135

Hi! Joomla comes with a default robots.txt file. As I understand it, the robots.txt file is used to 'advise' search engines not to index some parts of the website. So if you want part of your site not indexed, you can use robots.txt to achieve this.

I believe this is not a security thing. Right? It just tells search engines what to index or not. And maybe indexers won't even obey the rules.

My question is: If we want all of our site to be indexed (we want that for most of our sites) why would we even have a robots.txt file? Why not just remove it?

For example: in the default robots.txt there is a rule: 'Disallow: /administrator/'. But how would Google ever index this folder? Google cannot log in and could at most index the login page. Right? And when we protect this page with a URL parameter, Google won't even find that login page. So why add it to the robots.txt?

And for other folders: Google won't index physical folders, only web pages (URLs) that are linked to from somewhere (on the site). So maybe a link to a stylesheet or js file would cause Google to index this file, but would Google ever show this file in the search results to anyone? And would it not be better to have a nofollow link on those css and js includes to prevent Google from indexing those included files, instead of using a robots.txt?

Also cache, tmp or logs: What's the harm in not having those in a robots.txt file? They are publicly accessible even when they are included in the robots.txt. Would Google index those folders? Why would they? And for the logs: If there are no links to the logs folder, Google won't be able to find that folder.

So I'm just trying to figure out what the robots.txt is for in terms of site security. Would removing the robots.txt file make our sites less secure?

Hope you can clarify this for me a bit! Thanks.

nicholas
Akeeba Staff
Manager

The robots.txt file has nothing to do with security. See http://www.robotstxt.org/robotstxt.html for details. It's basically a set of hints which crawlers should use when indexing your site.

The utility of this file is to provide a hint so that crawlers don't waste your server's resources trying to index URLs which are protected behind a password (e.g. all /administrator/ URLs, any URLs to your site's client area etc) or would otherwise result in a 403 or 404 error (all the folders listed in Joomla's default robots.txt). In short, it's there to ask crawlers to please not waste their and your processing power indexing URLs which are never meant to be public.
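As a rough illustration of how a well-behaved crawler treats these hints, here is a minimal Python sketch using the standard library's urllib.robotparser. The rules are only a subset of Joomla's defaults picked for illustration, and the site address and file names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a shortened rule set in the spirit of Joomla's default robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /logs/
Disallow: /tmp/
"""

SITE = "https://www.example.com"  # hypothetical site

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler asks this question before requesting each URL.
for path in ("/index.php", "/administrator/index.php", "/tmp/upload.zip"):
    verdict = "crawl" if parser.can_fetch("Googlebot", SITE + path) else "skip"
    print(f"{path}: {verdict}")
```

Nothing here stops a client from requesting a disallowed URL anyway; the check is voluntary, which is why the file cannot serve as a security measure.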

URLs can end up in a search index in different ways, which come down to incompetence or malice. A site map generator can be misconfigured (or otherwise misbehave), listing private URLs. Some idiot developers may use private directories for public content; I'll come back to that in a minute. Attackers may explicitly add links to Joomla files, logs etc. on a site they own (or have already hacked...). The latter is a relatively simple way to perform reconnaissance without exposing yourself, more so when they know that site owners won't block search engines' crawlers for repeated security infractions. This means that they can test a large number of URLs in case they hit the jackpot. Even if they don't, they'll know the site is well protected (like when using Admin Tools!) and won't bother with it; there are easier targets.

Now, let's go back to something utterly wrong you said:

Also cache, tmp or logs: What's the harm in not having those in a robots.txt file? They are publicly accessible even when they are included in the robots.txt. Would Google index those folders? Why would they? And for the logs: If there are no links to the logs folder, Google won't be able to find that folder.

THIS IS ABSOLUTELY WRONG. NONE OF THE CACHE, TMP, ADMINISTRATOR/CACHE AND ADMINISTRATOR/LOGS FOLDERS WAS EVER MEANT TO BE WEB ACCESSIBLE, EVER!

The cache folders contain (partially or fully) rendered pages for guests and logged in users and database query results. Exposing that information publicly would be a security nightmare. Their contents are NEVER, EVER meant to be hot linked in the HTML output. They are meant to be read server-side, by the Joomla PHP application, when constructing an HTML response. There are idiot developers who might abuse them for public content but they are WRONG. Joomla has had the media folder since Joomla 1.5.0, released in 2007, for this kind of generated content. I can't believe that we're still talking about this nearly fourteen years later...

The tmp folder contains temporary files which need to be summarily removed as soon as possible. This includes temporary files created when uploading something to Joomla, installation packages when you are installing or upgrading extensions, temporary files used when processing larger pieces of content (e.g. images and video) and so on. None of this is meant for public consumption and exposing that publicly is a security nightmare. The same note about idiot developers applies here.

If you are using idiot developers' software which uses the cache or tmp folders for publicly accessible content, please stop using that software until they fix it. It's been fourteen years since Joomla provided the media folder for generated content. There is no excuse whatsoever for developers not using it. Sure, in the first few years, when developers needed to provide backwards compatibility with Joomla 1.0, they couldn't. This stopped being an issue circa 2010 at the latest. This means that anyone who has still been using the cache or tmp folder instead of the media folder these past 11 years needs to fix their software a.s.a.p.

The logs folders, of course, contain logs of what your site is doing and MUST remain private. Normally, logs are stored as .php files so even if you access them you get nothing. It's possible that older versions of software use a plain text format or have to use a non-PHP format for whatever reason, sometimes a very valid one, e.g. business workflows which require periodically sending the log to an external log ingestion service for centralised management. Links to the log files can appear publicly either by accident or, more usually, by malice (remember what I explained above about reconnaissance?). Having legitimate crawlers ignore such attempts to index logs is good practice.

As far as the security of these folders goes, using Admin Tools' .htaccess Maker, NginX Conf Maker or Web.Config Maker with the Frontend Protection and Backend Protection features enabled makes sure that the files in these folders DO remain private. Having these folders listed in the robots.txt file is simply a hint to the crawler: these URLs will result in an error, so please don't bother trying to index them, and please don't tell me they threw an error if you do try to index them and inevitably fail.

In any case, focusing on the robots.txt file so much is a waste of your time. This file will NOT make or break your SEO (unless you do something silly, like disallowing indexing of the entire production site — this only makes sense for and should be used with development and staging sites). The speed of your site is a far more important factor for your search engine ranking, by several orders of magnitude. Removing the robots.txt will have no impact on your SEO but it MAY result in Google Web Master Tools (or whatever they are rebranded to this week...) telling you that you have a lot of URLs throwing errors, only for you to discover these are URLs in these inaccessible folders which were listed in the robots.txt you removed for no reason.
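To make the "something silly" case concrete, the sketch below shows what a blanket rule of the kind suited to a development or staging site does to every URL. The host name, the paths and the helper blocked_paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A blanket rule: fine for a staging site, harmful on a production site.
STAGING_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def blocked_paths(robots_txt, paths, agent="Googlebot",
                  site="https://staging.example.com"):
    """Return the paths a compliant crawler would refuse to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, site + p)]

# Hypothetical public and private paths.
PATHS = ["/", "/index.php", "/blog/hello-world", "/administrator/"]
print(blocked_paths(STAGING_ROBOTS_TXT, PATHS))
```

Every path comes back as blocked, public pages included, which is why this rule only belongs on sites that should stay out of search results entirely.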

Nicholas K. Dionysopoulos

Lead Developer and Director

🇬🇷Greek: native 🇬🇧English: excellent 🇫🇷French: basic • 🕐 My time zone is Europe / Athens
Please keep in mind my timezone and cultural differences when reading my replies. Thank you!

jjst135

Hi Nicholas, thanks for your extensive reply and explanation.

You are of course right about the cache, tmp and administrator folders. I think I actually meant to refer to folders that are publicly available, like the log and media folders.

What I meant to point out was that when there are no links (href) to files in these folders, Google will never find those folders because of the way crawlers operate. They cannot just list some folders in your public_html and then index the files inside. When there are no links on your site, Google will not be able to index them.

And also the cache, log or administrator folders are protected and not accessible to a Google crawler. So why bother putting them in a robots.txt file? This would maybe only do some good (other than, of course, fixing this...) when the site is not set up correctly. Right?

But using a robots.txt might indeed have some effect on the server load because you would have fewer crawler hits and indeed also fewer errors in the webmaster tools. So that would be a good reason just to keep using it. And having all the default folders in it does not do any harm. I also do not have any issues getting sites indexed with the default robots.txt, so why change this ;-)

We did have some issues with JCH Optimize on our sites because the tmp files (js and css files used on the site) were stored in some subfolder in the cache folder. But this has been changed to the media folder. So that issue has been solved. I don't think we use any other extensions that

So it's not a huge issue, I was just wondering what's what with this robots.txt. I've heard a Google guy say using a robots.txt is only helpful when you need to have a (public) part of your website outside the search results. That made me wonder about the robots.txt file in Joomla. 

Kind regards,
Jip

 

nicholas
Akeeba Staff
Manager

And also the cache, log or administrator folders are protected and not accessible to a Google crawler. So why bother putting them in a robots.txt file? This would maybe only do some good (other than, of course, fixing this...) when the site is not set up correctly. Right?

As I explained, you might get links to these inaccessible files from sites outside your control. Having the forbidden folders in the robots.txt prevents search engines from telling you there was an error trying to index these inaccessible files.
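A small sketch of that effect, assuming a handful of made-up links that some third-party site points at yours: the URLs a compliant crawler skips never get requested, so they never show up as errors, while anything not covered by a rule still gets fetched. The function name and all URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a shortened rule set in the spirit of Joomla's default robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /logs/
Disallow: /tmp/
"""

def split_by_robots(robots_txt, urls, agent="Googlebot"):
    """Split URLs into (skipped, requested) from a compliant crawler's view."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    skipped, requested = [], []
    for url in urls:
        (requested if parser.can_fetch(agent, url) else skipped).append(url)
    return skipped, requested

# Hypothetical links planted on a site outside your control.
EXTERNAL_LINKS = [
    "https://www.example.com/logs/error.php",
    "https://www.example.com/tmp/install_1234/manifest.xml",
    "https://www.example.com/some-unlisted-folder/file.txt",
]

skipped, requested = split_by_robots(ROBOTS_TXT, EXTERNAL_LINKS)
print("never requested:", skipped)
print("still requested:", requested)
```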

We did have some issues with JCH Optimize on our sites because the tmp files (js and css files used on the site) were stored in some subfolder in the cache folder. But this has been changed to the media folder.

Yeah, this was an issue that went on for a decade or so. The other repeat offender was WidgetKit but I believe they finally fixed it a couple years ago. These are the ones we see time after time. There are a few other extensions but they're nowhere near as popular. 

I've heard a Google guy say using a robots.txt is only helpful when you need to have a (public) part of your website outside the search results. That made me wonder about the robots.txt file in Joomla. 

Let me guess, the audience was marketers and WordPress site owners? Because that kind of oversimplification would work with that audience. The robots.txt file is also useful for removing the errors Google's webmaster tools report for inaccessible folders, preventing indexing of your transient media files (everything in the media folder), preventing indexing of static media you'd rather people not easily find and hotlink (e.g. PDFs in your images folder) etc. Of course, Google is in the business of selling ads, which is contingent on their ability to index as much of your site as they can so they can more accurately classify web visits coming to your site. So of course they will oversimplify and mislead you about a technology which gets in their way of doing that.
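For the extra uses mentioned above, a minimal sketch of extending a base rule set with more folders could look like this. The helper add_disallow_rules, the images/documents folder and the file names are assumptions for illustration; the sketch sticks to plain folder prefixes because Python's urllib.robotparser matches rules by path prefix only.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a shortened base rule set with a single "User-agent: *" group.
BASE_ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /tmp/
"""

def add_disallow_rules(robots_txt, folders):
    """Append one Disallow line per folder, skipping rules already present."""
    lines = robots_txt.rstrip("\n").splitlines()
    for folder in folders:
        rule = f"Disallow: /{folder.strip('/')}/"
        if rule not in lines:
            lines.append(rule)
    return "\n".join(lines) + "\n"

# Hypothetical folders: transient media and a folder of PDFs under images.
extended = add_disallow_rules(BASE_ROBOTS_TXT, ["media", "images/documents"])
print(extended)

parser = RobotFileParser()
parser.parse(extended.splitlines())
SITE = "https://www.example.com"  # hypothetical site
print(parser.can_fetch("Googlebot", SITE + "/images/documents/price-list.pdf"))
print(parser.can_fetch("Googlebot", SITE + "/images/logo.png"))
```

Keeping the PDFs in their own subfolder lets the rule target them without hiding the rest of the images folder.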

jjst135

Thanks Nicholas.

I'll try to sum it up...

  • The robots.txt is useful to prevent some parts of your website from being indexed by search engines. Incorrect use of the robots.txt might cause SEO issues.
  • The robots.txt is useful to prevent indexing of publicly accessible folders when you don't want certain files (like images, PDFs or static media files) to be indexed.
  • The robots.txt is useful to prevent crawling of parts of the website that are (mostly) already inaccessible for security reasons, to prevent errors in Webmaster Tools and to prevent unnecessary server load from serving content to crawlers.
  • The robots.txt is a guideline for search engines. It does not enforce indexing behaviour. Some search engine crawlers may ignore it.
  • The default Joomla robots.txt will work fine for most sites. No need to remove it or to worry too much about its usefulness ;-)
  • The default Joomla robots.txt will not cause any SEO issues. It will also not improve the SEO of the site.

Thanks for your insights.

Kind regards,
Jip

nicholas
Akeeba Staff
Manager

Correct!
