How do you create a custom image pipeline that can manipulate the images with PIL?


Andreas Bloch

Nov 7, 2013, 4:53:32 AM
to scrapy...@googlegroups.com
I'm looking to create an image pipeline that can:

1) Download the original image (from item['image_urls'])
2) Generate resized thumbs by width only (I do not want square thumbs)
3) Let me control the quality of the image with PIL

How would you do this?

Rolando Espinoza La Fuente

Nov 7, 2013, 2:48:57 PM
to scrapy...@googlegroups.com



Andreas Bloch

Nov 7, 2013, 3:06:31 PM
to scrapy...@googlegroups.com
Yes, I have seen it. 

I'm able to download images and resize them (but only to square thumbs).

This is what I've done so far:

in settings.py:

ITEM_PIPELINES = [
    'project.pipelines.MyImagesPipeline',
]

IMAGES_STORE = '/images'

IMAGES_THUMBS = {
    'thumb': (200, 200),
    'mobile': (320, 320),
}


in pipelines.py:

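from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):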

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

In the pipeline, I want to resize the images to a specified width (without fixing the height) and control the image quality with PIL.

How can I do this?

Rolando Espinoza La Fuente

Nov 7, 2013, 11:54:22 PM
to scrapy...@googlegroups.com
Which versions of Scrapy and PIL are you using?

I ask because Scrapy uses PIL's thumbnail method, which keeps the aspect ratio. If you are sure you are getting square thumbnails for non-square images, can you provide a sample image to pin down the problem?

Regarding image quality, Scrapy doesn't provide a setting to change the resampling parameter (i.e. bicubic, bilinear). So you should override the convert_image method and replicate its logic with your own parameters.


Regards
Rolando
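
To illustrate the aspect-ratio point above: a thumbnail-style resize fits the image inside the requested box rather than stretching it to fill the box. The following is a minimal pure-Python sketch of that arithmetic only. `fit_within` is a hypothetical helper, not part of Scrapy or PIL, and PIL's own rounding may differ by a pixel.

```python
def fit_within(original_size, box):
    """Return the largest (width, height) that fits inside `box` while
    keeping the aspect ratio of `original_size`. Never upscales."""
    width, height = original_size
    max_width, max_height = box
    # Take the tighter of the two constraints; cap at 1.0 so that
    # images already smaller than the box are left unchanged.
    scale = min(max_width / float(width), max_height / float(height), 1.0)
    return max(1, int(round(width * scale))), max(1, int(round(height * scale)))


print(fit_within((800, 600), (200, 200)))  # landscape: limited by the width
print(fit_within((600, 800), (200, 200)))  # portrait: limited by the height
print(fit_within((100, 80), (200, 200)))   # already small: unchanged
```

With a (200, 200) box, an 800x600 source comes out at 200x150, so under this model a square thumbnail would only be expected from a square source.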


Andreas Bloch

Nov 8, 2013, 8:09:54 AM
to scrapy...@googlegroups.com
Thanks for pointing me in the right direction. I overrode the convert_image method of ImagesPipeline.

Here's the code I ended up with.

In settings.py I specify IMAGES_THUMBS for all thumb generation. When I want to scale by width only, I leave the height param out and handle that case in pipelines.py.
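
The width-only convention described here can be sketched on its own, outside Scrapy. This is only an illustration of the arithmetic: `target_size` is a hypothetical helper name, and the one-element tuple is a convention of the custom pipeline in this thread, not something the stock settings are documented to accept.

```python
def target_size(original_size, thumb_size):
    """Resolve an IMAGES_THUMBS-style entry into a full (width, height).

    A one-element tuple such as (200,) means "scale to this width and
    derive the height from the aspect ratio"; a two-element tuple is
    passed through unchanged.
    """
    if len(thumb_size) > 1:
        return tuple(thumb_size[:2])
    width, height = original_size
    new_width = thumb_size[0]
    ratio = new_width / float(width)
    # Truncate like the pipeline code does, but never go below 1 pixel.
    return new_width, max(1, int(height * ratio))


print(target_size((1024, 768), (200,)))    # width-only entry
print(target_size((1024, 768), (50, 50)))  # explicit box, passed through
```

For a 1024x768 source, the `(200,)` entry resolves to 200x150, while `(50, 50)` is left for the normal thumbnail logic to fit.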

settings.py

ITEM_PIPELINES = [
    'project.pipelines.MyImagesPipeline',
]

IMAGES_STORE = '/images'

IMAGES_THUMBS = {
    'thumb': (200,),
    'mobile': (320,),
    'icon': (50, 50),
}

pipelines.py:

from scrapy.contrib.pipeline.images import ImagesPipeline
from cStringIO import StringIO
import PIL
from PIL import Image

class MyImagesPipeline(ImagesPipeline):

    def convert_image(self, image, size=None):
        if image.format == 'PNG' and image.mode == 'RGBA':
            background = Image.new('RGBA', image.size, (255, 255, 255))
            background.paste(image, image)
            image = background.convert('RGB')
        elif image.mode != 'RGB':
            image = image.convert('RGB')

        if size:
            image = image.copy()
            if len(size) > 1:
                # Width and height given: fit inside the box, keeping the aspect ratio.
                image.thumbnail(size, Image.ANTIALIAS)
            else:
                # Width only: derive the height from the original aspect ratio.
                basewidth = size[0]  # the width from settings.py
                wpercent = basewidth / float(image.size[0])
                hsize = int(float(image.size[1]) * wpercent)
                image.thumbnail((basewidth, hsize), Image.ANTIALIAS)

        buf = StringIO()
        image.save(buf, 'JPEG', quality=85)
        return image, buf

    # Storage key for the full-size download
    def image_key(self, url):
        image_guid = url.split('/')[-1]
        return 'full/%s.jpg' % (image_guid)

    # Storage key for each thumbnail version
    def thumb_key(self, url, thumb_id):
        image_guid = thumb_id + url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        # yield Request(item['images'])
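
The quality control in convert_image above comes down to the `quality` argument passed to `save`. As a rough sketch for trying that step outside Scrapy, the hypothetical helper below (`encode_jpeg` is not part of Scrapy or PIL) encodes an image at a chosen quality. It assumes `image` behaves like a PIL image, i.e. offers `mode`, `convert()` and `save()`, and it uses `io.BytesIO` rather than `cStringIO` so that it also runs on Python 3.

```python
from io import BytesIO


def encode_jpeg(image, quality=85):
    """Encode a PIL-style image as JPEG bytes at the given quality.

    `image` is assumed to behave like a PIL.Image.Image object.
    """
    if image.mode != 'RGB':
        # JPEG has no alpha channel or palette, so normalise the mode first.
        image = image.convert('RGB')
    buf = BytesIO()
    image.save(buf, 'JPEG', quality=quality)
    return buf.getvalue()
```

With PIL installed, comparing `len(encode_jpeg(Image.open('photo.jpg'), quality=60))` against the same call with `quality=95` gives a feel for the size trade-off; `photo.jpg` is a placeholder path.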