All Grav sites broken - InvalidArgumentException cache directory not writable, but it is

My site ran flawlessly for months - and all of a sudden stopped working.

The error message says that a cache/doctrine directory is not writable, but it is. I checked the setup of the site:

  • The user rights are correct.
  • The server quota is not exceeded.
  • The configuration (FPM, PHP 7.3) is unchanged.
  • All Grav sites are affected, all Non-Grav sites are not.

The only thing that changed a few days ago was an update of the PHP packages. This is currently my only suspect, though it is not a very likely one, as all other PHP sites are working.

Any idea how I could get this fixed?

Thank you very much in advance!

@Vince42, Did you go through the Troubleshooting docs?

Yes, I did. I also tried reapplying permissions (although they are all correctly set), but the problem persists.

In the Apache error log it says in addition “broken pipe” and “connection reset by peer” with “AH01075: Error dispatching request to : (passing brigade to output filters)”. But that did not help me either.

The error message on the web site shown to the world is “cache directory not writable” - but it is set to be writable and I did not change any configuration for months on that site.
The fact that all Grav sites are broken while all other PHP sites are running makes me assume that there might be some incompatibility between the installed versions of Apache and PHP and the current Grav version.

What could I scrutinize next?

Did you try clearing cache completely so that the cache folder is empty and with the proper permissions?

I just cleared the cache via CLI and reset all permissions to all files and folders: The site is up again.

This still leaves me puzzled: How could that mismatch between the current state and the obviously required state regarding cache and permissions happen? Nothing has been changed manually. Did the permissions change in recent releases?

I am not entirely sure what exactly happened, but my suspicion is a cache file that got corrupted during a Grav update, probably between 1.6 and 1.7, where the Doctrine library was upgraded along with the internal cache logic and so on.

To be honest, I haven’t seen this before and it is certainly weird, but the cache is a very volatile and dynamic set of files that you should not get too attached to. So whenever odd things happen, I would always try clearing the cache completely first.

Okay, that is the kind of possible explanation I have been looking for. Thank you very much!

And after one / two days of proper operation all sites are broken again.

How can I keep the doctrine cache from crashing my sites? What should I scrutinize?

As a side note: All sites are set to clear the cache daily - why isn’t it cleared?
Of course I could set up cron jobs doing it - but Grav is supposed to do that by itself automatically, right? And it worked for months - now something weird is happening.

@Vince42, I have no idea what is going on, but when searching for ‘grav doctrine’, Google gave me the following post with a similar error as yours:

Permission error, cache/doctrine is not writable

Permissions seem alright at first sight, but apparently not for SELinux when setenforced.

I also came across that post. Unfortunately the command does not work (“chcon: can’t apply partial context to unlabeled file”). I am also not running SELinux; the server runs Ubuntu 16.04 LTS, and I do not remember ever having set up anything security-related, let alone changed anything.

I will continue to try to find the root cause for this strange behaviour, as I dislike the idea of creating cron jobs for each and every domain that runs Grav.

@Vince42, I have no knowledge of infra, but if I understand correctly, SELinux is a security module that can be enabled on different Linux distros.

I will continue to try to find the root cause for this strange behaviour, as I dislike the idea of creating cron jobs for each and every domain that runs Grav.

That sounds healthy… I wouldn’t consider it a solution to inflate my car’s tires every morning because they are all flat every few days…

I have no idea what is happening, but these questions come to mind:

  • Do all Grav sites crash?
  • And none other?
  • Do all sites run the same Grav version?
  • Do all sites have one or more plugins in common?
    Do any of these touch the cache or any file in the ‘/user/**’ folder?
  • Do all sites crash at approx. the same time?
  • Do all show the same Grav and Apache error?
  • Do they all have a similar total number and diskspace of cache files?
  • Are there cron jobs running just before the crashes?
  • Are there cron jobs running that touch any file in ‘/user/**’ folder?
  • Do you know what the hoster has changed recently?
  • Could a single site be tested if it will crash when caching is switched off?

Yes.

Correct. All WordPress, MediaWiki and some others work fine.

Most 1.7.5, two 1.7.3.

The simplest site that crashes does only have the standard plugins and contains only one word on the homepage.

The “simple site” should not touch anything and I am quite sure that the others do not do that as well.

Yes, I think so. I reset them all at the same time, and they were all unavailable the next time I checked their health status 24 to 48 hours later.

Absolutely identical “Crikey!” errors on every site.

I did not check that the last time, but will do it during the next crash in … hmmm … probably twelve hours. :slight_smile:

Nothing unusual / nothing special. I just put up an Uptime Robot monitor for all sites - hopefully I will get more detailed data about the timeline of crashes during the upcoming week.

Definitely not.

I am the hoster myself. The only change that could possibly be related to Apache is the update of the PHP packages. Those were my first suspect, but as all other PHP sites are working pretty well, I now doubt that they could have an effect on the cache/doctrine error.

Hmmm … interesting point! I have now updated two sites to 1.7.6. I will wait until they crash, clear the cache, and then disable caching on one of them.

I really appreciate your thoughts on my weird problem! :hugs:

@Vince42,

  • Considering your answers, I assume you’re not using a multi-site setup.
  • Could you reverse the upgrade of the suspicious PHP packages?
  • Could you try setting the following in ‘/user/config/system.yaml’ for one site:
    cache:
      enabled: true
      check:
        method: none

    And then clear cache manually.
    I believe this cache setting will create the cache once, but does not check if pages have been changed and need re-caching.

    If automatic re-caching of changed pages is not critical to you (or if your site is rather large), then setting this value to none will speed up a production environment even more. You will just need to manually clear the cache after changes are made. This is intended as a Production-only setting.

Since you haven’t shared the stacktrace, I presume the exception is thrown by ‘vendor\doctrine\cache\lib\Doctrine\Common\Cache\FileCache.php’ at line 90 (please check this against the stacktrace of the exception):

if (! is_writable($directory)) {
   throw new InvalidArgumentException(sprintf(
      'The directory "%s" is not writable.',
      $directory
   ));
}

When googling for “is_writable false”, you will find quite a few posts claiming that the permissions are set correctly but is_writable returns false anyway. You might consider reading a few of these.
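A check like is_writable is always evaluated for the user the process runs as, so a directory can look fine in a listing and still be unwritable for the web server. As a rough illustration (a Python sketch, not Grav or Doctrine code; the function name `explain_writability` is made up for this example), the following reports which permission class applies to the current process and whether that class has its write bit set. It would need to be run as the same user that PHP runs as to say anything about the site.

```python
import os
import stat


def explain_writability(directory):
    """Report whether the current process may write to `directory`, and why.

    Hypothetical helper for illustration; it only approximates what a
    check such as PHP's is_writable() sees for the same user.
    """
    info = os.stat(directory)
    mode = info.st_mode
    euid = os.geteuid()
    groups = set(os.getgroups()) | {os.getegid()}

    # Decide which permission class applies: owner, group, or other.
    if euid == info.st_uid:
        applies, write_bit = "owner", stat.S_IWUSR
    elif info.st_gid in groups:
        applies, write_bit = "group", stat.S_IWGRP
    else:
        applies, write_bit = "other", stat.S_IWOTH

    return {
        "path": directory,
        "mode": stat.filemode(mode),
        "owner_uid": info.st_uid,
        "group_gid": info.st_gid,
        "process_euid": euid,
        "applies": applies,
        "write_bit_set": bool(mode & write_bit),
        # The kernel's own answer; note that root is always allowed.
        "writable": os.access(directory, os.W_OK),
    }
```

If `applies` comes back as "group" and `write_bit_set` is false, the listing can look perfectly sane for the owner while the check still fails for the process that actually serves the site.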

What’s bothering me is this: if, as you say, no configs or pages are being touched (not even their last_modified date), why would Grav (Doctrine) try to write to ‘/cache/doctrine’?

I have multiple sites with Grav installed, and each site has its own installation. Nothing is shared between the sites, which would presumably be the case with a multi-site setup (I think I once read something about it). The sites are all created through an admin panel (Virtualmin).

Many domains run on the server and a rollback of packages is quite a PITA, so I am unfortunately not able to roll back the packages.

I will try your tip on next occasion.

In order to show you the full stack trace, please visit http://all24.net - you will see the full beauty of it. :slight_smile: (Side note: is it wise to show the full stack trace to the world? I know other systems that simply log the stack trace and just show an “Oops” page.)

Your assumption about the error-causing file was correct. I will try to find the posts and read a bit about them. What makes me wonder is, that the sites crash after a certain amount of time. If is_writable is returning false all the time, the site should crash all the time, shouldn’t it?

I totally agree with your last point: I simply don’t get it. As always, the root cause is probably something small and stupid; but Grav ran like a charm for a long while on all sites, and now (for a few weeks) all sites have shown weird Doctrine behaviour.

The good news (for me for the moment) is, that four of five sites are still available. I am not quite sure, whether it is the disabling of the cache feature (at least I hope so) or something with the different versions. Next time I clear the cache, I will upgrade them all to the same version and try to fiddle around with the cache settings.

I will keep the all24.net in broken state until you have found the time to see the whole stacktrace online. Afterwards I will clear its cache again and stick to the plan outlined above.

Addendum
I just ran my “fix permissions” script - and afterwards the site is available again.
This little new insight leaves me again a bit puzzled.

The only files and folders that might have been added to the site are the doctrine directories and files, I assume.

Before running the script, I checked the cache and doctrine directory:

drwxr-sr-x  6 all24.net all24.net 4096 Feb 20 17:59 compiled
drwxr-sr-x  7 all24.net all24.net 4096 Feb 20 18:03 doctrine
-rw-rw-r--  1 all24.net all24.net    0 Jan 21 23:16 .gitkeep
-rw-r--r--  1 all24.net all24.net 2804 Feb 20 18:15 problem-check-g-f04999f6.json
drwxr-sr-x 13 all24.net all24.net 4096 Feb 20 18:03 twig

drwxr-sr-x 174 all24.net all24.net 4096 Feb 20 18:05 0778f736
drwxr-sr-x 174 all24.net all24.net 4096 Feb 20 18:04 278c70e3
drwxr-sr-x  59 all24.net all24.net 4096 Feb 20 18:05 82c519a8
drwxr-sr-x  10 all24.net all24.net 4096 Feb 20 18:00 eb5cf0a4
drwxr-sr-x   8 all24.net all24.net 4096 Feb 20 18:15 f04999f6

After fixing the permissions, they look as follows:

drwsrwsr-x  6 all24.net all24.net 4096 Feb 20 17:59 compiled
drwsrwsr-x  9 all24.net all24.net 4096 Feb 27 00:25 doctrine
-rw-rw-r--  1 all24.net all24.net    0 Jan 21 23:16 .gitkeep
-rw-rw-r--  1 all24.net all24.net 2804 Feb 27 00:29 problem-check-g-283d1621.json
drwsrwsr-x 13 all24.net all24.net 4096 Feb 20 18:03 twig

drwsrwsr-x 174 all24.net all24.net 4096 Feb 20 18:05 0778f736
drwsrwsr-x 177 all24.net all24.net 4096 Feb 27 00:29 278c70e3
drwsrwsr-x  90 all24.net all24.net 4096 Feb 27 00:33 283d1621
drwsrwsr-x  59 all24.net all24.net 4096 Feb 20 18:05 82c519a8
drwsrwsr-x   6 all24.net all24.net 4096 Feb 27 00:25 938f22fa
drwsrwsr-x  10 all24.net all24.net 4096 Feb 20 18:00 eb5cf0a4
drwsrwsr-x   8 all24.net all24.net 4096 Feb 20 18:15 f04999f6

I do not see anything suspicious that might have caused the cache error. :thinking:
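For what it is worth, the two listings above do differ in their mode strings: the directories go from `drwxr-sr-x` to `drwsrwsr-x`. A small Python sketch (the helper names `symbolic_to_octal` and `changed_bits` are made up for this illustration) converts such `ls -l` strings to octal and names the bits that changed. Whether the changed bits explain the error depends on which user PHP runs as, which the listings alone do not show.

```python
def symbolic_to_octal(mode_string):
    """Convert an `ls -l` mode string such as 'drwxr-sr-x' to an integer."""
    perms = mode_string[1:10]  # drop the leading file-type character
    if len(perms) != 9:
        raise ValueError(f"unexpected mode string: {mode_string!r}")
    value = 0
    special_bits = (0o4000, 0o2000, 0o1000)  # setuid, setgid, sticky
    for i in range(3):
        r, w, x = perms[3 * i:3 * i + 3]
        shift = 6 - 3 * i
        if r == "r":
            value |= 0o4 << shift
        if w == "w":
            value |= 0o2 << shift
        if x in ("x", "s", "t"):  # lower case: execute bit is also set
            value |= 0o1 << shift
        if x in ("s", "S", "t", "T"):  # any case: special bit is set
            value |= special_bits[i]
    return value


BIT_NAMES = {
    0o4000: "setuid", 0o2000: "setgid", 0o1000: "sticky",
    0o400: "owner read", 0o200: "owner write", 0o100: "owner execute",
    0o040: "group read", 0o020: "group write", 0o010: "group execute",
    0o004: "other read", 0o002: "other write", 0o001: "other execute",
}


def changed_bits(before, after):
    """Name the permission bits that differ between two mode strings."""
    diff = symbolic_to_octal(before) ^ symbolic_to_octal(after)
    return [name for bit, name in BIT_NAMES.items() if diff & bit]


print(oct(symbolic_to_octal("drwxr-sr-x")))      # 0o2755
print(oct(symbolic_to_octal("drwsrwsr-x")))      # 0o6775
print(changed_bits("drwxr-sr-x", "drwsrwsr-x"))  # ['setuid', 'group write']
```

So before the fix the directories were not group-writable, and the fix script added group write plus the setuid bit.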

Hi @Vince42, Interesting puzzle… :wink:

In order to show you the full stack trace, please visit http://all24.net

What’s wrong with copying the stacktrace, removing sensitive data and pasting it here with a bit of formatting to make it more readable?

Should all24.net show an error right now? It is working fine and no error is thrown…


What makes me wonder is, that the sites crash after a certain amount of time. If is_writable is returning false all the time, the site should crash all the time, shouldn’t it?

AFAIK, is_writable is only being called when Grav needs to write to the cache. If the system is completely stable (no changes to configs/pages), Grav wouldn’t need to. So, it seems like something is causing a change.

The good news (for me for the moment) is, that four of five sites are still available. I am not quite sure, whether it is the disabling of the cache feature (at least I hope so)

If you disabled cache for 1 site only, as suggested, and only that site remained alive, you would have a clear indication.

Question:

  • Why are your permissions using setuid? Yes, I had to look it up, since I have no idea what it is or does…

    The following are the permissions set by my hoster.
    drwxrwxr-x  6 <account> <account> 4096 Feb 23 19:01 compiled 
    drwxrwxr-x  3 <account> <account> 4096 Feb 23 19:01 doctrine 
    drwxrwxr-x  4 <account> <account> 4096 Feb 23 19:01 gpm 
    drwxr-xr-x 14 <account> <account> 4096 Feb 23 19:01 twig
    
    NB. <account> is my accountId.

  • Have you been able to figure out if all sites fail in a certain time window?
  • I don’t think Grav changes any permissions, so how do they get changed over time?
    • Are you sure there is no cron running?
    • Or could you have been infected?
      • Do the sites have Admin?
      • What if you remove it from all24.net?

Nothing wrong with that, will do that the next time.

all24.net is currently working.

That is my understanding of caching as well. Interesting side note: After I set up all24.net—and you can see that the whole site has only one word of content—it blew up until it exceeded the virtual server quota. Also a little miracle.

I have five sites in production with Grav. I disabled the cache on two sites and kept the other three with cache enabled. Only all24.net crashed.

I took the permissions from the permissions section of the Grav troubleshooting docs.

At the moment this is totally unclear. I will update all sites to the most recent version of Grav and then start watching them.

I am sure that no cron jobs are running. In fact, I would be happy if the cache clearing worked. :wink: I am also quite sure that my system is not infected in any way, as the Grav sites are the only ones that have been crashing; everything else works like a charm. The sites are running in the user context of the virtual server.

What should I remove from all24.net?

So, what are my next steps? I will

  • update all sites to the most recent version and
  • try to check the sites on a regular basis, until I see at least one site crashing, and
  • provide the stack trace.

Apart from that, I will consider removing the setuid flag from the permissions script.

Thank you for being my travelling companion on this interesting journey. :hugs:

Update
After switching the site from FCGId to FPM (all other sites are using FPM) I got “Crikey!” again.

All sites are now 1.7.7 and all plugins are updated.

I will do an ls -laR for the cache directory, fix permissions like I did the last time (omitting suid flag though) and make an ls -laR again. In case the site should be working afterwards, I hope that the diff will show some hint.
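The before/after comparison described here can also be done on structured data instead of on `ls -laR` text. A minimal Python sketch under that assumption (the names `snapshot` and `diff_snapshots` are made up for this example): it records mode, owner and group for every directory and file below a root, and reports only the entries that differ between two snapshots.

```python
import os
import stat


def snapshot(root):
    """Map each path below `root` (relative) to (mode string, uid, gid)."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in paths:
            info = os.lstat(path)  # do not follow symlinks
            rel = os.path.relpath(path, root)
            result[rel] = (stat.filemode(info.st_mode),
                           info.st_uid, info.st_gid)
    return result


def diff_snapshots(before, after):
    """Return (path, old entry, new entry) for every entry that differs.

    An entry of None means the path did not exist in that snapshot.
    """
    changes = []
    for path in sorted(set(before) | set(after)):
        old, new = before.get(path), after.get(path)
        if old != new:
            changes.append((path, old, new))
    return changes
```

Taking one snapshot of the cache directory while the site is broken and one after the fix, `diff_snapshots(before, after)` lists only the paths whose mode or ownership changed, plus paths that appeared or disappeared. Timestamps are deliberately left out so that they do not drown the result.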

Update 2
Here comes a new phenomenon (sorry): the ownership of the files and folders for all24.net is www-data:all24.net instead of all24.net:all24.net. All other sites’ files are properly owned. Geez …

After running find . -type d -exec chmod 775 {} \; the site was working again. I also removed all setuid flags - no problem.

Diffing the listings from before and after setting the directory permissions reveals that many directories were indeed not writable.
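Both findings from this update (a wrong owner and directories that are not group-writable) can be searched for directly. A rough Python sketch, assuming the owner of the top-level directory is the correct one unless another uid is passed in; `find_permission_outliers` is a made-up name for this example and the chosen checks are just the ones that came up in this thread.

```python
import os
import stat


def find_permission_outliers(root, expected_uid=None):
    """Yield (path, reason) for entries that deviate from the expected setup.

    Assumption: if no uid is given, the owner of `root` is taken as correct.
    """
    if expected_uid is None:
        expected_uid = os.lstat(root).st_uid
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in paths:
            info = os.lstat(path)
            if info.st_uid != expected_uid:
                yield path, f"owner uid {info.st_uid}, expected {expected_uid}"
            if stat.S_ISDIR(info.st_mode):
                if not info.st_mode & stat.S_IWGRP:
                    yield path, "directory not group-writable"
                if info.st_mode & stat.S_ISUID:
                    yield path, "setuid bit set on directory"
```

Running it over a site root before touching anything gives a list of outliers that can be kept for later comparison, rather than overwriting the evidence with a blanket chmod.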

Okay, let’s think about something new: As I dislike fixing things manually and producing new inconsistencies (like the directory permissions for the twig directory), I will now remove Grav and re-install it. At least all permissions should be fine then and the ownership should also be fine (unless this specific domain should have some other weird configuration setting).

@Vince42, No questions popping up right now… I guess things have to sink in first.

Will update when questions pop up…

Three weeks gone … time for an update.

Back then, I fixed some Apache inconsistencies, where some servers listened to * and others to the IP of the server, and made a fresh install of all24.

The good: All sites are still operating. I wonder, whether the Apache inconsistencies had a bad side effect on caching. At the moment I lack the imagination to think of a scenario, where this could happen and how. But well, it works.

The odd: The famous all24, which only consists of one page, has grown again and recently hit the virtual server quota of 2 GiB. I really don’t get it, why it is growing. Somehow the daily cache clearance is not working - while it seems to work on all other sites.

I will do a manual cache clearance now and by that win another three weeks to think about that problem. Maybe I should start adding content to the site … maybe Grav thinks, that a one page / one word site is not worth cache-clearing. :joy:

I really don’t get it: The site consists of one simple naked page and the cache keeps growing day by day. This time the 2 GB quota was exceeded already after five days. :frowning:
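One way to narrow this down would be to record which part of the cache is growing. A minimal Python sketch, assuming the cache root contains subdirectories like the ones listed earlier in this thread (compiled, doctrine, twig); `directory_sizes` is a made-up name. It sums apparent file sizes, which can differ somewhat from what a quota counts.

```python
import os


def directory_sizes(cache_root):
    """Return {immediate subdirectory name: (total bytes, file count)}."""
    totals = {}
    with os.scandir(cache_root) as entries:
        subdirs = [e.path for e in entries if e.is_dir(follow_symlinks=False)]
    for sub in subdirs:
        size = count = 0
        for dirpath, _dirnames, filenames in os.walk(sub):
            for name in filenames:
                try:
                    size += os.lstat(os.path.join(dirpath, name)).st_size
                    count += 1
                except FileNotFoundError:
                    pass  # cache files can vanish while we scan
        totals[os.path.basename(sub)] = (size, count)
    return totals
```

Logging the result once a day, for example with `for name, (size, count) in sorted(directory_sizes("cache").items()): print(name, size, count)`, would show whether the growth sits in doctrine, twig or somewhere else, and whether it comes from many small files or a few large ones.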

I wish I were able to track that problem down. I have been recommending Grav to my customers, as it is so easy and fail-safe, and at the moment I am speechless.