Alright, this is far from an ideal solution, and it is fairly specific to the HTML structure of your Blogger blog. What it does is scrape the posts based on the sitemap, which I downloaded manually rather than fetching it interactively. Further, it is a PHP solution which uses Composer for dependencies, so we want this structure at our root (for example /scraper/):
posts (to hold posts)
vendor (to hold dependencies)
composer.json (see below)
composer.lock
index.php (the actual script)
sitemap.xml (from Blogger)
composer.json looks like this:
{
    "require": {
        "fabpot/goutte": "^3.1",
        "pixel418/markdownify": "^2.1"
    }
}
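After saving composer.json, run composer install from the root to pull in Goutte and Markdownify. Note that SimpleLargeXMLParser is not among the Composer dependencies; the script below includes it manually with require_once, so it is assumed you have placed the class file in vendor/simplelargexmlparser/ yourself.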
It iterates over the URLs found in sitemap.xml using SimpleLargeXMLParser, and retrieves the data from each one using the Goutte HTML parser. The result is then written to /posts/DATE-TITLE/post.md, including simple FrontMatter and the content converted to Markdown.
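For reference, the sitemap.xml that Blogger provides is a standard sitemap; the script only reads the <loc> element of each <url> entry. It looks roughly like this (the URL is a placeholder):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://yourblog.blogspot.com/2016/03/some-post.html</loc>
        <lastmod>2016-03-21T09:00:00Z</lastmod>
    </url>
</urlset>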
It is sensitive to the connection and to execution time, since we naturally have to open each page to retrieve its contents. For a small number of pages (74 in this case), it only takes about a minute to execute fully. Because of the messy structure of the content, weird HTML and images are ignored, so all pages should be reviewed afterwards to check whether the content is as expected.
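(The script is meant to be opened in a browser, hence the <pre> wrapper around its output; running php index.php from the command line works just as well, you will simply see the raw HTML tags.)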
index.php:
<?php
require __DIR__ . '/vendor/autoload.php';
require_once('vendor/simplelargexmlparser/SimpleLargeXMLParser.class.php');
use Goutte\Client;
/* http://cubiq.org/the-perfect-php-clean-url-generator */
function toAscii($str, $replace=array(), $delimiter='-') {
    if( !empty($replace) ) {
        $str = str_replace((array)$replace, ' ', $str);
    }
    $clean = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
    $clean = preg_replace("/[^a-zA-Z0-9\/_|+ -]/", '', $clean);
    $clean = strtolower(trim($clean, '-'));
    $clean = preg_replace("/[\/_|+ -]+/", $delimiter, $clean);
    return $clean;
}
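// Example (hypothetical inputs): toAscii('Hé & Ho!') returns 'he-ho',
// and toAscii('My First Post!') returns 'my-first-post', which is used
// below to build the folder name.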
$xml = dirname(__FILE__)."/sitemap.xml";
$parser = new SimpleLargeXMLParser();
$parser->loadXML($xml);
$array = $parser->parseXML();
echo "<pre>";
echo 'Start...<br />';
foreach ($array as $item) {
    $url = $item["loc"][0];

    $post = array();
    $client = new Client();
    $crawler = $client->request('GET', $url);

    $crawler->filter('h3.post-title')->each(function ($node) use (&$post) {
        $post['title'] = $node->text();
    });
    $crawler->filter('h2.date-header')->each(function ($node) use (&$post) {
        $post['date'] = $node->text();
    });
    $crawler->filter('div.post-body')->each(function ($node) use (&$post) {
        $post['post'] = $node->html();
    });
    // Collect all labels into an array; a plain assignment would only keep the last one.
    $crawler->filter('span.post-labels > a')->each(function ($node) use (&$post) {
        $post['tags'][] = $node->text();
    });
    $folder = date("Y-m-d", strtotime($post['date'])).'-'.toAscii($post['title']);

    $content = '';
    $content .= '---'."\n";
    $content .= 'title: '.trim($post['title'])."\n";
    $content .= 'date: '.date("d-m-Y", strtotime($post['date']))."\n";
    if (isset($post['tags'])) {
        $content .= 'taxonomy:'."\n";
        $content .= '    tags: '.implode(', ', $post['tags'])."\n";
    }
    $content .= '---'."\n";

    $markdown = new Markdownify\Converter;
    $content .= $markdown->parseString($post['post']);
    $file = 'post.md';
    if (!file_exists('posts/'.$folder)) {
        mkdir('posts/'.$folder, 0777, true);
    }
    file_put_contents('posts/'.$folder.'/'.$file, strip_tags($content));
    echo 'Wrote '.$folder.'<br />';
}
echo 'Stop...';
echo "</pre>";
?>
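A resulting posts/DATE-TITLE/post.md then looks something like this (a hypothetical post; the values obviously depend on your content):

---
title: My First Post
date: 21-03-2016
taxonomy:
    tags: travel, food
---
The post body, converted to Markdown...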
I’ll send you the resulting posts-folder.