Alright, this is far from an ideal solution, and it is fairly specific to the HTML structure of your Blogger blog. What it does is scrape the posts based on the sitemap, which I downloaded manually rather than fetching it interactively. Further, it is a PHP solution which uses Composer for dependencies, so we want this structure at our root (for example /scraper/):
posts (to hold posts)
vendor (to hold dependencies)
composer.json (see below)
composer.lock
index.php (the actual script)
sitemap.xml (from Blogger)
composer.json looks like this:
{
    "require": {
        "fabpot/goutte": "^3.1",
        "pixel418/markdownify": "^2.1"
    }
}
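After saving composer.json, run composer install from the root to pull in Goutte and Markdownify. Note that SimpleLargeXMLParser is not among the Composer dependencies; the script below includes it manually with require_once, so it is assumed you have placed the class file in vendor/simplelargexmlparser/ yourself.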
It iterates over the URLs found in sitemap.xml using SimpleLargeXMLParser, and retrieves the data from each one using the Goutte HTML parser. The result is then written to /posts/DATE-TITLE/post.md, including simple FrontMatter and the content converted to Markdown.
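For reference, the sitemap.xml that Blogger provides is a standard sitemap; the script only reads the <loc> element of each <url> entry. It looks roughly like this (the URL is a placeholder):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://yourblog.blogspot.com/2016/03/some-post.html</loc>
        <lastmod>2016-03-21T09:00:00Z</lastmod>
    </url>
</urlset>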
It is sensitive to the connection and to execution time, since we naturally have to open each page to retrieve its contents. For a small number of pages (74 in this case), it only takes about a minute to execute fully. Because of the messy structure of the content, weird HTML and images are ignored, so all pages should be reviewed afterwards to check whether the content is as expected.
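(The script is meant to be opened in a browser, hence the <pre> wrapper around its output; running php index.php from the command line works just as well, you will simply see the raw HTML tags.)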
index.php:
<?php
require __DIR__ . '/vendor/autoload.php';
require_once('vendor/simplelargexmlparser/SimpleLargeXMLParser.class.php');
use Goutte\Client;
/* http://cubiq.org/the-perfect-php-clean-url-generator */
function toAscii($str, $replace=array(), $delimiter='-') {
    if( !empty($replace) ) {
        $str = str_replace((array)$replace, ' ', $str);
    }
    $clean = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
    $clean = preg_replace("/[^a-zA-Z0-9\/_|+ -]/", '', $clean);
    $clean = strtolower(trim($clean, '-'));
    $clean = preg_replace("/[\/_|+ -]+/", $delimiter, $clean);
    return $clean;
}
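// Example (hypothetical inputs): toAscii('Hé & Ho!') returns 'he-ho',
// and toAscii('My First Post!') returns 'my-first-post', which is used
// below to build the folder name.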
$xml = dirname(__FILE__)."/sitemap.xml";
$parser = new SimpleLargeXMLParser();
$parser->loadXML($xml);
$array = $parser->parseXML();
echo "<pre>";
echo 'Start...<br />';
foreach ($array as $item) {
    $url = $item["loc"][0];

    $post = array();
    $client = new Client();
    $crawler = $client->request('GET', $url);

    $crawler->filter('h3.post-title')->each(function ($node) use (&$post) {
        $post['title'] = $node->text();
    });
    $crawler->filter('h2.date-header')->each(function ($node) use (&$post) {
        $post['date'] = $node->text();
    });
    $crawler->filter('div.post-body')->each(function ($node) use (&$post) {
        $post['post'] = $node->html();
    });
    // Collect all labels into an array; a plain assignment would only keep the last one.
    $crawler->filter('span.post-labels > a')->each(function ($node) use (&$post) {
        $post['tags'][] = $node->text();
    });
    $folder = date("Y-m-d", strtotime($post['date'])).'-'.toAscii($post['title']);

    $content = '';
    $content .= '---'."\n";
    $content .= 'title: '.trim($post['title'])."\n";
    $content .= 'date: '.date("d-m-Y", strtotime($post['date']))."\n";
    if (isset($post['tags'])) {
        $content .= 'taxonomy:'."\n";
        $content .= '    tags: '.implode(', ', $post['tags'])."\n";
    }
    $content .= '---'."\n";

    $markdown = new Markdownify\Converter;
    $content .= $markdown->parseString($post['post']);
    $file = 'post.md';
    if (!file_exists('posts/'.$folder)) {
        mkdir('posts/'.$folder, 0777, true);
    }
    file_put_contents('posts/'.$folder.'/'.$file, strip_tags($content));
    echo 'Wrote '.$folder.'<br />';
}
echo 'Stop...';
echo "</pre>";
?>
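A resulting posts/DATE-TITLE/post.md then looks something like this (a hypothetical post; the values obviously depend on your content):

---
title: My First Post
date: 21-03-2016
taxonomy:
    tags: travel, food
---
The post body, converted to Markdown...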
I’ll send you the resulting posts-folder.