Automating website error checks

Not a Grav-specific question, but I thought I’d ask here since there are some smart people about :slight_smile:

Do any of you use a script or tool to check websites for error pages? A client recently reported some erroring pages, and to my embarrassment they had been broken for a few months. (I had done an incomplete theme refactor, so a partial template was missing.)

It got me thinking that it’s about time I had some kind of automated check. It probably doesn’t require continual monitoring; I can just run it as part of the deployment pipeline.

I know I could get there using wget or curl in a shell or PHP script, but thought I’d see if that’s already been done. I’d probably like to be notified of anything non-200. Searching isn’t turning much up for me.
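Something along these lines is what I have in mind, just checking status codes with curl:

```bash
# Ask curl for the HTTP status only; anything other than 200 gets flagged.
status=$(curl -s -o /dev/null -w '%{http_code}' https://example.com/some/page)
[ "$status" = "200" ] || echo "Got $status for /some/page"
```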

There’s the awesome Uptime Kuma, and I have an app for checking that sites are up (it just hits the homepage). I need something a little more dynamic. Can anyone point me in the right direction?

I use Linkinator as part of a deployment pipeline, to check for dead or incorrect links. I like it because it’s a JS script, so it can be run in various contexts, both locally and against production sites.

In my experience there are a lot of these out there, but almost all require some involved setup to run in a pipeline. I settled on Linkinator because I get a simple list of status codes out of it and can easily filter it in any editor to see which links are my responsibility and which are externally controlled.
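A minimal invocation, roughly how I run it (the flags are from memory, so check `npx linkinator --help` against your installed version):

```bash
# Crawl the whole site and write every link with its status code to a CSV report.
npx linkinator https://example.com --recurse --format csv > link-report.csv

# Quick filter for anything that did not come back 200 (column layout may vary by version).
grep -v ',200,' link-report.csv
```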

@hughbris, I’m not aware of any tool, but some approaches come to mind…

  1. Cronjob scanning Apache/Grav log files:
    Running a cronjob on the server that:
    • checks Apache’s log and Grav’s log file, and
    • sends you an email if issues have occurred (see the first sketch after this list).
  2. Looping through pages:
    • Premise:
      • You might not know all pages in advance (e.g. if the company has its own authors)
      • Not all pages may be reachable through internal links, which makes link crawling unreliable.
    • A plugin might be needed that:
      • is fired on a hidden route (and maybe requires an API key)
      • creates a list of URLs for all published pages and their language variants
      • returns a JSON document containing those URLs.
      • Since this might be expensive, it might need to run off-hours and cache its results for later retrieval.
    • A wget/curl loop to access all URLs returned by the plugin (see the second sketch after this list)
      • This might run on the server, or locally on a dev’s computer.
      • It could be triggered by the above plugin once the URLs have been collected.
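For the first approach, a minimal sketch of the cronjob’s script; the log paths, the log format and the mail command are all assumptions, so adjust for your server:

```bash
#!/usr/bin/env bash
# Nightly error scan, e.g. run from cron with: 0 6 * * * /usr/local/bin/check-logs.sh
# Paths and the mail command are assumptions; any access log with a status field works.

ACCESS_LOG=/var/log/apache2/access.log   # or your nginx/other access log
GRAV_LOG=/var/www/site/logs/grav.log     # Grav's own log, path assumed

REPORT=$(mktemp)

# In the common/combined log format the status code is field 9 and the path is field 7.
awk '$9 >= 400 {print $9, $7}' "$ACCESS_LOG" | sort | uniq -c | sort -rn > "$REPORT"

# Append anything Grav itself logged as an error.
grep -iE 'error|critical' "$GRAV_LOG" >> "$REPORT"

# Only send mail when something was found.
if [ -s "$REPORT" ]; then
  mail -s "Site errors found" you@example.com < "$REPORT"
fi
rm -f "$REPORT"
```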
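And for the second approach, a sketch of the wget/curl loop. The /api/page-index route, the key parameter and the response shape ({"urls": [...]}) are made up for illustration, and jq is assumed to be available:

```bash
#!/usr/bin/env bash
# Fetch the URL list exposed by the (hypothetical) plugin and flag anything non-200.

INDEX_URL="https://example.com/api/page-index?key=SECRET"   # hypothetical route and key

curl -fsS "$INDEX_URL" | jq -r '.urls[]' | while read -r url; do
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  if [ "$status" != "200" ]; then
    echo "$status $url"
  fi
done
```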

This is the benefit of asking someone else: of course (duh!), this is functionally very close to a link checker, it’s something many link checkers will provide, and why not check those links anyway? So now I have a better search term. Linkinator looks like a great start. One problem might be the odd deliberately orphaned page. I can only think of one of those that I’m responsible for. I’m planning to talk about that when I respond to @pamtbaau’s thoughts.

Thank you!

In the scenario where there might be orphan pages, it would be fairly straightforward to add those to the mix by reading in what pages should exist, given a coherent page structure like the one Grav or any SSG provides. For dynamic routes, i.e. those created at run time, I have sometimes opted to expose where and when they will exist through a unified interface, a simple JSON file for example. That way I can check what exists in a production-like environment, what should exist given the content, and what may exist from generated routes.
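As a rough illustration of that comparison, with placeholder file names (one sorted list of routes derived from the content, one list of routes the checker actually reached):

```bash
# Both files are placeholders: expected-routes.txt comes from the content tree or an
# exported JSON of routes, checked-routes.txt from whatever the crawler/checker reached.
sort -u expected-routes.txt > expected.sorted
sort -u checked-routes.txt  > checked.sorted

# Routes that should exist but were never reached (e.g. deliberate orphans a crawler misses):
comm -23 expected.sorted checked.sorted

# Routes that were reached but are not in the content (e.g. dynamically generated routes):
comm -13 expected.sorted checked.sorted
```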

This very much goes to pamtbaau’s point 2, wherein testing a remote system would warrant an exposed index or API. There are several examples of those for Grav. I do prefer the “look for and test” premise though, for which wget and curl are good options.

I point out external links because they often do exist, but they don’t always resolve to a 200 OK. That has a lot to do with server handling and user-agent emulation.
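For example, something along these lines with curl often gets a truer answer out of such servers (the agent string is just a placeholder):

```bash
# Follow redirects (-L) and send a browser-ish User-Agent (-A); some servers refuse
# or misreport requests that do neither. Print the final status and resolved URL.
curl -sL -A "Mozilla/5.0 (compatible; link-check)" -o /dev/null \
     -w '%{http_code} %{url_effective}\n' https://example.org/some/external/link
```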

Yeah, the ideal for internal links would use Grav’s API (or similar). The Sitemap plugin is a good reference point (although there have been a few issues with its multilanguage route discovery).
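In the meantime, if the Sitemap plugin is enabled, a quick way to turn its output into a URL list for the same kind of status loop might be (assuming the default /sitemap.xml route and GNU grep’s -P for the lookbehind):

```bash
# Pull every <loc> entry out of the sitemap into a plain list of URLs to check.
curl -s https://example.com/sitemap.xml | grep -oP '(?<=<loc>)[^<]+' > urls-to-check.txt
```

That list can then be fed into the same non-200 loop as above.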

I still haven’t had a chance to give @pamtbaau’s thorough analysis the response it deserves!

I wasn’t lacking ideas (I can think of many pieces to put together too); I just wanted to save some time and avoid some traps. I have found these scripts often contain at least one part that is way trickier than expected. That said, your outline is a welcome expression of ideas …

I hadn’t considered this; it’s interesting. (I don’t run Apache, BTW, but there are logs.) The problem here, I think, is that users need to hit the errors before they appear in the logs, unless we initiate a crawl just for this purpose.

I think Ole’s suggestion to use a link checker that allows other statuses to be picked out is ideal, except for the edge case of orphaned pages. Running that in JS as part of the pipeline indeed removes the alerting requirement.

The thorough option is to make a PHP script that uses Grav’s API. Sitemap already does this, so there is a basis there for the API calls. The script could be a useful addition to Grav’s bin directory of CLI scripts. It wouldn’t pick up broken internal links, but that wasn’t my aim here, and there are good tools for that.

I like the idea of building that in future, but for now I will try Linkinator.

Thanks for your contribution, it’s helped me think it through :slight_smile: