Can you parse data-uris and extract images from HTML fragments with nokogiri?

William Flanagan

Mar 30, 2015, 8:51:08 AM
to nokogi...@googlegroups.com
I'm working on parsing an HTML fragment that is received from a WYSIWYG editor (summernote). 

The unusual part is that the images arrive not as files but as data-URI-encoded images embedded in the HTML itself, e.g.:

<img src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABhUAAAa...

To be compatible with most email clients, I have to convert these data URI images back into traditional images for insertion into the email message. 

My problem might NOT be Nokogiri-based. But the crux of it is that I get `Encoding::UndefinedConversionError: "\x89" from ASCII-8BIT to UTF-8` (the character changes depending on the image) when I try to extract an inline data-URI image and convert it back to a file.

So, the net of my question: do data URIs work with Nokogiri? I know it converts everything to UTF-8 internally, and I'm wondering whether that is clobbering the data URIs.

Anyone? 

takeharu

Mar 31, 2015, 1:05:46 PM
to nokogi...@googlegroups.com
This may not address the crux of the problem directly, but how about something like this?

require 'nokogiri'

raw_html = <<"EOS"
<body>
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot">
</body>
EOS

doc = Nokogiri.HTML(raw_html)
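doc.xpath('//img[@src]').each do |e|
  # Capture the image subtype (e.g. "png") and the Base64 payload.
  m = /data:image\/(\w+);base64,(.+)/.match(e['src'])
  next unless m
  # Decode the payload and write the raw bytes in binary mode ('wb').
  # Note: this assumes every <img> has an alt attribute to use as the file name.
  File.open(e['alt'] + '.' + m[1], 'wb') { |f| f.write(m[2].unpack('m')[0]) }
end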

__END__
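
The decoding step in the script above can also be looked at on its own, without Nokogiri. The following is only a sketch: `decode_image_data_uri` is a hypothetical helper name, not part of Nokogiri or of the script above, and it assumes the `src` value is a Base64-encoded `data:image/...` URI like the "Red dot" example.

```ruby
# Hypothetical helper (not part of Nokogiri): split a Base64 image data URI
# into its subtype and its decoded bytes. Returns nil for any other src value.
def decode_image_data_uri(src)
  m = %r{\Adata:image/([\w.+-]+);base64,(.+)\z}m.match(src.to_s)
  return nil unless m

  # unpack('m') Base64-decodes; the result is a binary (ASCII-8BIT) string.
  [m[1], m[2].unpack('m')[0]]
end

# Same "Red dot" data URI as in the script above.
src = 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=='

subtype, bytes = decode_image_data_uri(src)
puts subtype          # image subtype taken from the URI
puts bytes.encoding   # encoding of the decoded string
puts bytes.bytesize   # number of decoded bytes
```

The point is that the decoded string is raw binary data, not text, so it should be written with a binary mode such as `'wb'`, as the script above does.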

Mike Dalessio

Apr 28, 2015, 7:11:06 PM
to nokogiri-talk
Hi,

Thanks for asking this question!


On Sat, Mar 28, 2015 at 2:43 PM, William Flanagan <wfla...@audienti.com> wrote:
> I'm working on parsing an HTML fragment that is received from a WYSIWYG editor (summernote).
>
> The unique situation is that the images come not as files, but rather as data URI encoded images in the HTML itself.  (i.e.
>
> <img src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABhUAAAa...)
>
> To be compatible with most email clients, I have to convert these data URI images back into traditional images for insertion into the email message.
>
> My problem might NOT be Nokogiri-based. But, the crux of the problem is that I'm getting "Encoding::UndefinedConversionError: "\x89" from ASCII-8BIT to UTF-8" ( the character changes based on the image) when trying to parse out and convert the inline data-uri-based image back to a file.


I'm unable to reproduce what you're seeing. Can you provide a document and script to help me reproduce it?

Here's my environment:

```
$ nokogiri -v
# Nokogiri (1.6.6.2)
    ---
    warnings: []
    nokogiri: 1.6.6.2
    ruby:
      version: 2.2.2
      platform: x86_64-linux
      description: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/flavorjones/code/oss/nokogiri/ports/x86_64-unknown-linux-gnu/libxml2/2.9.2"
      libxslt_path: "/home/flavorjones/code/oss/nokogiri/ports/x86_64-unknown-linux-gnu/libxslt/1.1.28"
      libxml2_patches:
      - 0001-Revert-Missing-initialization-for-the-catalog-module.patch
      - 0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch
      libxslt_patches:
      - 0001-Adding-doc-update-related-to-1.1.28.patch
      - 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
      - 0003-Initialize-pseudo-random-number-generator-with-curre.patch
      - 0004-EXSLT-function-str-replace-is-broken-as-is.patch
      - 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
      - 0007-Separate-function-for-predicate-matching-in-patterns.patch
      - 0008-Fix-direct-pattern-matching.patch
      - 0009-Fix-certain-patterns-with-predicates.patch
      - 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
      - 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
      - 0014-Fix-for-bug-436589.patch
      - 0015-Fix-mkdir-for-mingw.patch
      compiled: 2.9.2
      loaded: 2.9.2
```

Here's how I'm trying to reproduce it:

```ruby
#! /usr/bin/env ruby

require 'nokogiri'

raw_html = <<"EOS"
<body>
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot">
</body>
EOS

doc = Nokogiri.HTML(raw_html)

puts doc.css('img').first["src"]
# => "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="

puts doc.encoding
# => "UTF-8"
```
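
For what it's worth, the error message itself can be produced without Nokogiri at all. This is only a guess at the cause, since the failing script was never posted: the sketch below assumes the decoded image bytes were being transcoded to UTF-8 somewhere (for example, by writing them to a file that was not opened in binary mode). The variable names are illustrative.

```ruby
require 'tempfile'

# Base64 payload of the same "Red dot" PNG used above.
payload = 'iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=='

# Decoding yields a binary (ASCII-8BIT) string whose first byte is 0x89.
bytes = payload.unpack('m')[0]

# Asking Ruby to transcode those bytes to UTF-8 fails, because 0x89 has no
# defined mapping from ASCII-8BIT to UTF-8.
transcode_failed =
  begin
    bytes.encode('UTF-8')
    false
  rescue Encoding::UndefinedConversionError => e
    warn "#{e.class}: #{e.message}"
    true
  end

# Writing the same bytes through a binary-mode handle involves no transcoding.
written = Tempfile.create(['red-dot', '.png']) do |f|
  f.binmode
  f.write(bytes)
  f.flush
  File.binread(f.path)
end

puts transcode_failed
puts written == bytes
```

If that is what is happening, the attribute value returned by Nokogiri is fine (it is plain ASCII text), and the fix belongs in the code that writes the decoded bytes.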



> So, the net of my question.. is does Data URIs work with Nokogiri? I know it converts everything to UTF8 internally. I'm wondering if that's clobbering data URIs.
>
> Anyone?
