A lot of our redesign projects at Rad Campaign require migrating content. Drupal's Feeds module is a great tool for this. Between the multitude of supported field mappings and the flexibility provided by Feeds Tamper, you can consume pretty much any structured data the world might throw at you. However, inline images (those in body content created with WYSIWYG fields, for example) present a challenge: img tags that reference images hosted on your old site pass through untouched, resulting in broken images in the content of your new site.

In contrast, Feeds’ image field mapper automatically downloads images added through image fields and gets them into Drupal’s managed file database. Let’s make this same thing happen with images embedded in text fields containing HTML.

This post assumes that you’ve already set up a Feeds importer mapping HTML text content to a field in Drupal, most likely the body field. Ok? Let’s go!

First we have to get our hands on the data being imported via Feeds. Fortunately, Feeds provides hooks for intervening at several points in the import process. I chose to implement hook_feeds_after_save because I wanted to make sure I was only going through this process for successfully imported nodes.

/**
 * Implements hook_feeds_after_save().
 */
function my_module_feeds_after_save(FeedsSource $source, $entity, $item, $entity_id) {

$item contains all the data from our import source, and in this case the 'body' element contains the string we want to search for image tags. But how to parse this string and find those image tags? Enter Simple HTML DOM, a PHP library that makes parsing and finding content in HTML really easy. (Much better than regular expressions.) Once we install the simplehtmldom API contrib module and its corresponding library we’re good to go:

  // Create new simplehtmldom object.
  $html = new simple_html_dom();
  // Parse body field markup.
  $html->load($item['body']);
  // Find each image in the body markup.
  $imgs = $html->find('img');
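If you can't install Simple HTML DOM, PHP's built-in DOM extension can do the same job. The following is a minimal sketch of that swapped-in technique, not the approach used in this post. The helper name sketch_find_img_srcs() and the sample markup are made up for illustration.

```php
<?php
/**
 * Hypothetical helper (not part of the module in this post): collects the
 * src attribute of every img tag using PHP's built-in DOM extension.
 */
function sketch_find_img_srcs($markup) {
  $srcs = array();
  // DOMDocument::loadHTML() rejects empty input, so bail out early.
  if (trim($markup) === '') {
    return $srcs;
  }
  $doc = new DOMDocument();
  // Imported markup is often not valid HTML; collect libxml warnings
  // internally instead of printing them.
  $previous = libxml_use_internal_errors(TRUE);
  $doc->loadHTML($markup);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  foreach ($doc->getElementsByTagName('img') as $img) {
    if ($img->hasAttribute('src')) {
      $srcs[] = $img->getAttribute('src');
    }
  }
  return $srcs;
}

// Sample markup, assumed for illustration only.
$sample = '<p>Hello <img src="/sites/default/files/images/a.jpg" alt=""> and <img src="http://example.com/b.png"></p>';
print_r(sketch_find_img_srcs($sample));
```

The rest of this post sticks with Simple HTML DOM, whose find() call keeps the code shorter.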

Wow, that was easy. Next, we’ll loop over our image tags, parse the src attribute using parse_url(), and make a couple of checks to see if we should proceed to download the file.

  foreach ($imgs as $img) {
    $url_parts = parse_url($img->src);
    // No host means the path is relative, i.e. the image is
    // hosted locally.
    $no_host = !isset($url_parts['host']);
    // Somehow raw image data ended up in some of this content, we
    // don’t want that.
    $not_data = !(isset($url_parts['scheme']) && $url_parts['scheme'] == 'data');
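To see how those two checks behave, here is a small standalone sketch that runs them against three kinds of src value. The helper name sketch_is_local_image() and the sample URLs are assumptions for illustration. The extra FALSE check is also an addition, since parse_url() returns FALSE for seriously malformed URLs.

```php
<?php
/**
 * Hypothetical helper mirroring the two checks above.
 * Returns TRUE when the src looks like a file hosted on the old site.
 */
function sketch_is_local_image($src) {
  $url_parts = parse_url($src);
  // parse_url() returns FALSE for seriously malformed URLs.
  if ($url_parts === FALSE) {
    return FALSE;
  }
  // No host means the path is relative to the old site.
  $no_host = !isset($url_parts['host']);
  // Inline data: URIs have no host either, so rule them out by scheme.
  $not_data = !(isset($url_parts['scheme']) && $url_parts['scheme'] == 'data');
  return $no_host && $not_data;
}

// Sample src values, assumed for illustration only.
$samples = array(
  '/sites/default/files/images/photo.jpg',
  'http://example.com/images/photo.jpg',
  'data:image/png;base64,iVBORw0KGgo=',
);
foreach ($samples as $src) {
  print $src . ' => ' . (sketch_is_local_image($src) ? 'download' : 'skip') . "\n";
}
```

With these samples, only the first src passes both checks. The absolute URL has a host, and the data URI has the data scheme.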

If the src passes both checks, we parse the 'path' element of our parsed URL to figure out the filename of the image, the path we’re going to save it to, and the existing URL where we can find the image.

    if ($no_host && $not_data) {
      $path = $url_parts['path'];
      $path_parts = explode('/', $path);
      // The leading slash results in an extra empty path element
      // at the front of the array, lose it.
      array_shift($path_parts);
      // Decode encoded characters in the filename so the filename
      // doesn't get double-encoded on save.
      $filename = urldecode(array_pop($path_parts));
      // Join the path back together and lop off the
      // sites/default/files bit, leaving the directory under the
      // public:// scheme.
      $filepath = str_replace('sites/default/files/', '', implode('/', $path_parts));
      // The image we want is at the original full path under the
      // domain we’re migrating from.
      $img_url = 'http://myoldsite.org' . $path;
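Here is the same path handling pulled out into a standalone sketch so you can try it on a sample path. The helper name sketch_split_image_path() and the sample path are hypothetical, and http://myoldsite.org is the same placeholder domain used above.

```php
<?php
/**
 * Hypothetical helper mirroring the path handling above.
 * $old_base_url is a placeholder for the site being migrated.
 */
function sketch_split_image_path($path, $old_base_url = 'http://myoldsite.org') {
  $path_parts = explode('/', $path);
  // Drop the empty element produced by the leading slash.
  array_shift($path_parts);
  // The last element is the filename; decode it so it is not
  // double-encoded when saved.
  $filename = urldecode(array_pop($path_parts));
  // What remains is the directory; strip the public files prefix.
  $filepath = str_replace('sites/default/files/', '', implode('/', $path_parts));
  return array(
    'filename' => $filename,
    'filepath' => $filepath,
    'img_url' => $old_base_url . $path,
  );
}

// Sample path, assumed for illustration only.
print_r(sketch_split_image_path('/sites/default/files/images/2014/my%20photo.jpg'));
```

For the sample path this yields the filename 'my photo.jpg', the directory 'images/2014', and the original URL with the encoded filename intact. One caveat: for an image sitting directly in sites/default/files, the joined directory has no trailing slash, so str_replace() leaves the prefix in place. Paths like that may need separate handling.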

Now, we get Drupal-y. file_build_uri() builds a stream wrapper URI (e.g. 'public://images') for our target directory. file_prepare_directory() checks that the target directory exists and is writable and, with the FILE_CREATE_DIRECTORY flag, creates it if not. Finally, system_retrieve_file() downloads the file. Note the third argument is TRUE, which indicates that the file should be managed, which is the point of this whole exercise, remember? The fourth argument, FILE_EXISTS_REPLACE, overwrites an existing file at the destination instead of saving a renamed copy, which keeps repeated imports from piling up duplicates.

      // Build a stream wrapper for the destination directory.
      $uri = file_build_uri($filepath);
      // Ensure destination directory exists and is writeable.
      if (file_prepare_directory($uri, FILE_CREATE_DIRECTORY)) {
        $destination = $uri . '/' . $filename;
        // Retrieve image.
        system_retrieve_file($img_url, $destination, TRUE, FILE_EXISTS_REPLACE);
      }  // end if
    }  // end if
  }  // end foreach
}  // end my_module_feeds_after_save

Check out the full implementation below.

Have you faced this problem and come up with a different solution? Can this code be improved? Let us know in the comments!

/**
 * Implements hook_feeds_after_save().
 */
function my_module_feeds_after_save(FeedsSource $source, $entity, $item, $entity_id) {
  // Create new simplehtmldom object.
  $html = new simple_html_dom();
  // Parse body field markup.
  $html->load($item['body']);
  // Find each image in the body markup.
  $imgs = $html->find('img');
  foreach ($imgs as $img) {
    $url_parts = parse_url($img->src);
    // No host means the path is relative, i.e. the image is
    // hosted locally.
    $no_host = !isset($url_parts['host']);
    // Somehow raw image data ended up in some of this content, we
    // don’t want that.
    $not_data = !(isset($url_parts['scheme']) && $url_parts['scheme'] == 'data');
    if ($no_host && $not_data) {
      $path = $url_parts['path'];
      $path_parts = explode('/', $path);
      // The leading slash results in an extra empty path element
      // at the front of the array, lose it.
      array_shift($path_parts);
      // Decode encoded characters in the filename so the filename
      // doesn't get double-encoded on save.
      $filename = urldecode(array_pop($path_parts));
      // Join the path back together and lop off the
      // sites/default/files bit, leaving the directory under the
      // public:// scheme.
      $filepath = str_replace('sites/default/files/', '', implode('/', $path_parts));
      // The image we want is at the original full path under the
      // domain we’re migrating from.
      $img_url = 'http://myoldsite.org' . $path;
      // Build a stream wrapper for the destination directory.
      $uri = file_build_uri($filepath);
      // Ensure destination directory exists and is writeable.
      if (file_prepare_directory($uri, FILE_CREATE_DIRECTORY)) {
        $destination = $uri . '/' . $filename;
        // Retrieve image.
        system_retrieve_file($img_url, $destination, TRUE, FILE_EXISTS_REPLACE);
      }  // end if
    }  // end if
  }  // end foreach
}  // end my_module_feeds_after_save