Cleaning up text from Web pages


bobembry

Oct 17, 2009, 10:33:26 AM
to TextSoap
I have a custom cleaner I use for cleaning up articles I copy from
Safari to DEVONthink Pro. They do a fantastic job of preserving the
style info and images.

On occasion I want to preserve the style info but delete the images.
Is there a cleaner that will do this?

Thanks

Bob

Mark Munz

Oct 17, 2009, 2:07:24 PM
to text...@googlegroups.com
I assume these are embedded images. Currently, there is no cleaner to
remove attachments (which are how images are stored) from rich text.
I'll add the feature request.
--
Mark Munz
unmarked software
http://www.unmarked.com/

bobembry

Oct 18, 2009, 8:00:08 AM
to TextSoap
Mark, yes I'm talking about images embedded in the selected text.
Replacing each image with a "new line" (aka return) would work nicely.

Thanks

Bob

On Oct 17, 1:07 pm, Mark Munz <unmar...@gmail.com> wrote:
> I assume these are embedded images. Currently, there is no cleaner to
> remove attachments (which are how images are stored) from rich text.
> I'll add the feature request.
>

Tom White

Oct 18, 2009, 12:30:32 PM
to text...@googlegroups.com
I have the reverse problem. When I forward an email with pictures and
try to remove forwarding (>) characters, it also takes out any
pictures embedded in the text. Try the "Remove forwarding (>)
characters" cleaner and see if it doesn't take out the pictures.

Tom

Mark Munz

Oct 18, 2009, 12:53:51 PM
to text...@googlegroups.com
There is nothing special about Remove forwarding characters vs. other cleaners.

There was an issue in an earlier release where cleaning did not
maintain attachments, particularly when using Services or the
contextual menu in Cocoa apps. Make sure you have the latest version
(6.3.1). I believe the issue I'm referring to was fixed in 6.2.2.

Mark

Tom White

Oct 18, 2009, 1:19:11 PM
to text...@googlegroups.com
Mark,

I am using TextSoap version 6.3.1 and Apple Mail. If I forward an
email with an embedded picture in the text and then try to take off
the forwarding characters with the "Remove forwarding characters"
cleaner, the pictures disappear from the text. I am inserting a
picture of a bluff on the Buffalo River in Arkansas below.

P5100031.jpeg

Mark Munz

Oct 18, 2009, 2:23:51 PM
to text...@googlegroups.com
Ah, Apple Mail.

I think the issue is actually Apple Mail. Apple Mail stores its emails
in an HTML archive. When copying the text from Apple Mail, I believe
it includes HTML, RTF, and plain text versions of the content. TextSoap
doesn't currently read HTML archives, so it goes to the next "richest"
format available, which is RTF (not RTFD, which includes attachments).
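
The fallback described above can be sketched roughly as follows. This is only an illustration in Python, not TextSoap's actual code: the format names and the `pick_richest` helper are hypothetical, and the ordering is an assumption based on the description in this post.

```python
# Hypothetical format names, ordered from richest to plainest.
# These are illustrative labels, not real pasteboard type identifiers.
PREFERENCE = ["html-archive", "rtfd", "rtf", "plain"]


def pick_richest(available, readable):
    """Return the richest format that the source app offers and the
    cleaner can read, or None if there is no overlap."""
    for fmt in PREFERENCE:
        if fmt in available and fmt in readable:
            return fmt
    return None


# Apple Mail (as described above) offers an HTML archive, RTF, and plain text.
mail_offers = {"html-archive", "rtf", "plain"}
# A cleaner that cannot read HTML archives but can read the other formats.
cleaner_reads = {"rtfd", "rtf", "plain"}

chosen = pick_richest(mail_offers, cleaner_reads)
```

Under these assumptions the chosen format is RTF, which carries no attachments, so embedded pictures would be lost before any cleaner runs.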

Dealing with HTML archives is something I've been trying to figure out
how to tackle. It's a much tougher issue, as the standard text
processing doesn't work. In HTML, for example, "forwarded text" doesn't
include any forward marks. Instead, the text is simply placed in a
<blockquote> tag (that you can't see) and the formatting is such that
it has a forwarding vertical bar.
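
To illustrate why removing ">" characters has nothing to match in HTML, here is a minimal Python sketch that "removes forwarding" by unwrapping <blockquote> tags instead. It uses the standard library's html.parser; the `BlockquoteStripper` class and `strip_blockquotes` function are hypothetical names for this example, and it is not how TextSoap works. Comments and doctype declarations are simply dropped in this sketch.

```python
from html.parser import HTMLParser


class BlockquoteStripper(HTMLParser):
    """Re-emit HTML unchanged, except that <blockquote> wrappers are dropped."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "blockquote":
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through as written.
        if tag != "blockquote":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "blockquote":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_blockquotes(html):
    parser = BlockquoteStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


forwarded = '<blockquote type="cite">Hello <img src="p.jpeg"> there</blockquote>'
cleaned = strip_blockquotes(forwarded)
```

Note that the image tag survives, because only the quoting wrapper is removed.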

Likewise, instead of hard returns, the text may actually have <br/>
tags in it. When you look at the text, they never show up because they
are part of the HTML structure, not the underlying text. And what
appears to be a single paragraph may in fact be multiple chunks of text
put together.
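
A small Python sketch of that last point, again using the standard library's html.parser: the visible text arrives as separate chunks, and line breaks exist only as <br> tags that must be converted explicitly. The `TextWithBreaks` class and `html_to_text` function are hypothetical names for this example, not part of any real cleaner.

```python
from html.parser import HTMLParser


class TextWithBreaks(HTMLParser):
    """Collect the visible text chunks, turning <br> tags into newlines."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Called for both <br> and <br/> (the parser's default handling of
        # self-closing tags forwards to this method).
        if tag == "br":
            self.chunks.append("\n")

    def handle_data(self, data):
        # Each run of text between tags arrives as its own chunk.
        self.chunks.append(data)


def html_to_text(html):
    parser = TextWithBreaks()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = "First line<br/>second <b>bold</b> chunk<br>third"
text = html_to_text(sample)
```

Here the second line looks like one run of text once rendered, but the parser delivers it as three separate chunks ("second ", "bold", " chunk").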

It might be possible to hardcode support for forwarding, because I
know exactly what to look for, but the other stuff gets much more
difficult.

I believe that's where the attachment loss is coming from for you,
although I will retest this under Snow Leopard to make sure nothing
has changed from Apple's end of things.

Mark
> Forward this email and before you send do a Command A to select the
> whole body and then try TextSoap cleaner "Remove forwarding
> characters". When I do it, the picture goes away.
>
> Tom