Social media sites refuse to scrape any metadata

Please keep in mind that I’m rather green when it comes to maintaining a website.

Stupefying. That’s the closest term I could come up with, in order to explain my current predicament.

Here’s the skinny of it. I have a very simple personal blog called Watch The Future, using the Purity theme, with theme inheritance set up and working correctly. In fact, the whole website performed flawlessly over the first two weeks of its operation… and then it didn’t. After a few hours of fruitless attempts to fix it, in all my unfathomable wisdom, I decided to employ the CMS version of “turn it off and on again”: I deleted the entire site and did a fresh install of Grav… to no avail.

Here’s the problem… Metadata scraping worked perfectly up until last week (at least), when I successfully shared my latest publication on Facebook and Twitter. Now, no matter how many tags I add to the headers of new pages, social media sites just refuse to scrape any of them. EDIT: In the process of writing this post, I set up one of the old posts and added enough metadata that Twitter actually validated the share card, but Facebook still refuses to scrape any of the Open Graph tags. Here are said post’s metadata tags:

title: 'БездNEWS #0001 - 2018-02-09'
media_order: bezdnews_ftr_img.jpg
published: true
date: '2018-02-09 20:00'
metadata:
    author: 'Пламен Добрев'
    description: 'Успешен полет за Falcon Heavy, мистериозен глобален катаклизъм покосил планетата ни преди 13,000 години, първите екзопланети открити в друга галактика, прогреса в работата по телескопа Джеймс Уеб, един изгубен сателит живеещ втори живот и други.'
    'og:type': article
    'og:title': 'БездNEWS #0001 - 2018-02-09'
    'og:description': 'Успешен полет за Falcon Heavy, мистериозен глобален катаклизъм покосил планетата ни преди 13,000 години, първите екзопланети открити в друга галактика, прогреса в работата по телескопа Джеймс Уеб, един изгубен сателит живеещ втори живот и други.'
    'og:image': 'http://wtf.controlplusd.com/user/pages/01.blog/02.bezdnews0001/bezdnews_ftr_img.jpg'
    'og:site_name': 'Watch The Future'
    'twitter:card': summary_large_image
taxonomy:
    category:
        - аудио
        - бездnews
    tag:
        - бездnews
author: 'Пламен Добрев'

and the result in Facebook’s URL debugger:

My limited pool of knowledge has been quickly depleted. I need advice on how to proceed. If not a solution to the problem, I’ll take notes on further diagnostic methods which might reveal more about the issue at hand.

Hi,
I just took a quick look, and it seems like your website lacks a proper <html> tag.
It has one, but it is wrapped inside an IE conditional comment (if), so it is only output for IE8.

Fixing this could help you troubleshoot the next steps, and your page should be properly parsed.
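
For anyone who wants to check this without reading the page source by hand, here is a minimal Python sketch. It assumes that a non-IE client treats everything between <!-- and the first --> as a comment, so it strips comments that way and then looks for an <html> tag in what is left. The function name has_visible_html_tag is made up for illustration; it is not part of Grav or of the theme.

```python
import re

# Assumption: a non-IE parser ends a comment at the first "-->" it finds,
# so a "hidden" conditional block is swallowed whole as one comment.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
HTML_TAG = re.compile(r"<html\b", re.IGNORECASE)


def has_visible_html_tag(markup):
    """Return True if an <html> tag survives once comments are stripped."""
    without_comments = COMMENT.sub("", markup)
    return HTML_TAG.search(without_comments) is not None


# <html> exists only inside an IE8 conditional comment: other clients never see it.
hidden_only = '<!DOCTYPE html>\n<!--[if IE 8]><html class="ie8"><![endif]-->\n<head></head>'

# Same markup plus a branch that non-IE clients can read.
with_fallback = (
    '<!--[if IE 8]><html class="ie8"><![endif]-->\n'
    '<!--[if gt IE 8]><!--><html lang="en"><!--<![endif]-->'
)

print(has_visible_html_tag(hidden_only))
print(has_visible_html_tag(with_fallback))
```

This is only a rough check on the raw markup, not a full HTML parser, so treat a negative result as a hint to look at the template rather than as proof.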

I made a plugin to manage metadata if you want: https://github.com/paulmassen/grav-plugin-seo

It could help you write valid metadata.

Let me know if adding the <html> tag fixes your problem.

Paul

I followed your advice, @paul, and removed all the conditionals, which the theme developer had added specifically to support IE. Alas, that did not help either. I was half expecting that, since this issue only arose after two weeks of proverbial smooth sailing. Still, thank you for pointing out the <html> tag issue.

As for your plugin, I already had it installed in the original Grav setup, and it was the last measure I tried before performing a fresh install. I tried overriding the necessary metadata with it, in the hope that it would take precedence over everything else and just “sweep under the rug” any defects caused by other elements of the site. EDIT: It did not work.

By the way, I am using Grav mainly through the Admin module.

EDIT 2: Twitter’s metadata scraper now works consistently.

EDIT 3: I wonder… Is it possible that these Twig statements aren’t outputting the HTML tags and values fast enough, so Facebook’s scraper sees no data because it has not been written into the page yet?

EDIT 4: The W3C validator outputs an IO Error on both the home URL and the blog post URLs, “the most likely cause of which is that it can’t access the necessary files on the web server”… why that would be the case if neither I nor anyone else seems to have a problem accessing my site, I do not know.

EDIT 5: Upon further investigation, I discovered that the description and og:description tags, which I have set manually, just get overwritten. Whatever content I give them, the CMS auto-generates a string from the available body content and outputs that instead of the content I set.
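
A rough way to confirm this kind of overwrite is to compare the values set in the page header with the meta tags that actually end up in the rendered HTML. The Python sketch below does that with the standard library only. MetaCollector, extract_meta and diff_meta are hypothetical names chosen for this example, and the sample strings are placeholders, not output from the real site.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta> tags keyed by their name= or property= attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        if key and "content" in attributes:
            self.meta[key] = attributes["content"]


def extract_meta(html):
    """Return a dict of meta key -> content found in the given HTML."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.meta


def diff_meta(expected, rendered):
    """Return {key: (expected, rendered)} for every value that differs."""
    return {
        key: (value, rendered.get(key))
        for key, value in expected.items()
        if rendered.get(key) != value
    }


# Placeholder data: what was set by hand versus what the page contains.
expected = {"description": "My text", "og:title": "Post"}
rendered_html = (
    '<html><head><meta name="description" content="Auto text">'
    '<meta property="og:title" content="Post" /></head></html>'
)
print(diff_meta(expected, extract_meta(rendered_html)))
```

Feeding it the saved source of a real page instead of rendered_html would show which of the manually set tags did not survive rendering.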

Success! I ran one of the problematic URLs through the Facebook Open Graph debugger (NOT the sharing debugger), and something caught my eye. One of the errors referenced the content encoding and explained that only deflate and gzip are recognised.

I went back to the Admin panel and under the System tab in Configuration, enabled gzip compression. It immediately resolved all my issues.
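
In case it helps someone diagnose the same thing from the command line, here is a small Python sketch that looks at the Content-Encoding header a server returns. The set of accepted codings comes straight from the debugger message quoted above (gzip and deflate); everything else is an assumption for illustration, including the function names, the treatment of a missing header as "nothing to complain about", and the facebookexternalhit/1.1 user agent string used to roughly imitate the scraper.

```python
import urllib.request

# Per the debugger message: only these codings are recognised.
SUPPORTED_CODINGS = {"gzip", "deflate"}


def unsupported_codings(content_encoding):
    """Return the codings in a Content-Encoding value that are not supported.

    A missing or empty header yields an empty list (assumption: no coding
    was applied, so there is nothing to reject).
    """
    if not content_encoding:
        return []
    codings = [part.strip().lower() for part in content_encoding.split(",")]
    return [c for c in codings if c and c not in SUPPORTED_CODINGS]


def fetch_content_encoding(url, timeout=10):
    """Request the URL roughly like a scraper would and return the
    Content-Encoding response header, or None if it is absent."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept-Encoding": "gzip, deflate",
            "User-Agent": "facebookexternalhit/1.1",  # assumed, for illustration
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Encoding")
```

Calling unsupported_codings(fetch_content_encoding(url)) on a page URL before and after toggling gzip in the Admin panel should show whether the header changed. It does not reproduce everything Facebook's scraper does, so the Open Graph debugger remains the real test.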

Well done for sharing the solution.