Need help with BBEdit script to delete large blocks of text


shir...@earthlink.net

Jan 4, 2011, 11:14:37 AM
to BBEdit Talk
Hello:

Just joined the group. I hope someone here can help.
The problem: when I archive my e-mail Inbox (Apple Mail), images and
graphics are saved as enormously long, unintelligible strings of
alphanumeric characters. I want to keep the archived "mbox" text files
but remove these big blocks of text, to reduce the files' sizes.
Can BBEdit, on its own or using Unix command-line operations, do
this?
I suppose the solution is something like this: many of the text
strings begin with the same set of three or four characters. I want
the BBEdit script or command to search for these sets, then delete
them and everything following until it reaches the string
"--Apple-Mail".
I'll insanely appreciate any help.

Cheers,

shirasagi

LuKreme

Jan 4, 2011, 2:06:04 PM
to bbe...@googlegroups.com
On 4-Jan-2011, at 09:14, shir...@earthlink.net wrote:
>
> Can BBEdit, on its own or using Unix command-line operations, do
> this?

Yes, but there are better tools, specifically MIME-aware tools. Or even Perl.

A grep for lines that are exactly 76 characters long and contain only base64 characters (letters, digits, + and /) would match the actual encoded attachments, but grabbing the MIME boundaries is trickier.

^[A-Za-z0-9+/]{76}$

This will match every line in the attachment except the last one, which is usually shorter and may end in = padding.
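
For anyone who would rather run this filter outside BBEdit, here is a minimal sketch in Python. The choice of Python and the name strip_full_base64_lines are assumptions for illustration, not something suggested in the thread. It drops only the full 76-character lines, so the shorter final line of each attachment and all MIME headers and boundaries stay in place.

```python
import re

# One full line of base64 data: exactly 76 characters from the base64 alphabet.
BASE64_LINE = re.compile(r"^[A-Za-z0-9+/]{76}$")


def strip_full_base64_lines(text):
    """Return text with every full 76-character base64 line removed."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not BASE64_LINE.match(line.rstrip("\r\n"))
    ]
    return "".join(kept)
```

A long run of letters and digits in ordinary text that happens to be exactly 76 characters wide would also be removed, so it is worth trying this on a copy of the mbox first.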

> I suppose the solution is something like this: many of the text
> strings begin with the same set of three or four characters. I want
> the BBEdit script or command to search for these sets, then delete
> them and everything following until it reaches the string
> "--Apple-Mail".

That way madness lies. Either get a MIME aware tool that will strip the MIME attachments from the mbox file, or simply strip the encoded lines and don't worry about the boundary lines (or at least deal with those in a second step).

Your attachment will look something like this:

--Apple-Mail-2-649664677
Content-Disposition: inline;
    filename*=iso-8859-1''GMT%A0Receipt.pdf
Content-Type: application/pdf;
    name="=?iso-8859-1?Q?GMT=A0Receipt.pdf?="
Content-Transfer-Encoding: base64

JVZERi0xLjMKJcTl8uXrp/Og0MTGCjQgMCBvYmoKPDwgL0xlbmd0aCA1IDAgUiAvRmlsdGVyIC9G
bGF0ZURlY29kZSA+PgpddHJlYW0KeAHNW1lz3LgRfsevQLn8QLkyIx7g5afIWtlxsutL46Qq2TzI
knVsNBrZI23if5+vG2gQIClrOONUxXYNSRBEd3994vAX/V5/0Vmj23peFZXRZV7Ym6qu5gWa9NfP
+m/6Ru8frjN9utYZ/12ffvcjhY/Ox9uDhj3X+68w6MVap/MqLdo8q8buFBFr56Y0ZalNwXeVXuo8

BUT, that --Apple-Mail line will appear multiple times in the email (in one message with a single attachment, "--Apple-Mail-" appeared 8 times), so you cannot just willy-nilly delete everything up to one of those lines.
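
To make the point about repeated boundaries concrete, here is a small Python sketch. The message text in RAW is a made-up example (only the boundary value is borrowed from the sample above), and boundary_line_numbers is a hypothetical helper name. It lists where the boundary lines fall and which parts a MIME parser sees between them.

```python
import email

# A hypothetical two-part message; the boundary value matches the sample above.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="Apple-Mail-2-649664677"

--Apple-Mail-2-649664677
Content-Type: text/plain

See the attached receipt.
--Apple-Mail-2-649664677
Content-Type: application/pdf; name="Receipt.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjMK
--Apple-Mail-2-649664677--
"""


def boundary_line_numbers(raw, boundary):
    """Return the 1-based numbers of the lines that are boundary delimiters."""
    opener = "--" + boundary
    closer = opener + "--"
    return [
        number
        for number, line in enumerate(raw.splitlines(), start=1)
        if line in (opener, closer)
    ]


message = email.message_from_string(RAW)
parts = [part.get_content_type() for part in message.get_payload()]
print(boundary_line_numbers(RAW, message.get_boundary()))
print(parts)
```

Even this tiny message needs three boundary lines for two parts: one before each part and a closing one. Nested multipart sections add their own boundaries on top of that, which is why a plain "delete until the next boundary" search is fragile.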


--
Elves are wonderful. They provoke wonder. Elves are marvellous. They
cause marvels. Elves are fantastic. They create fantasies. Elves are
glamorous. They project glamour. Elves are enchanting. They weave
enchantment. Elves are terrific. They beget terror.

shirasagi

Jan 4, 2011, 3:07:05 PM
to BBEdit Talk
LuKreme:

Thanks for your help.

> Yes, but there are better tools, specifically MIME-aware tools. Or even perl.

Can you recommend a "MIME-aware tool"? I did a Google search and
quickly found PINE, but I'd like to know if there are others better
suited to my task. I'll look into Perl.

> A grep for lines that are 76 characters long and contain no spaces or punctuation would match the actual encoded attachments, but grabbing the MIME boundaries is trickier.
>
> ^[A-Za-z0-9+/]{76}$
>
> This will match all the lines in the attachment except the last one.

Thanks for the sample script. My best guess was something similar, but
I would never have thought of including a limit to the number of
characters in a line.

>
> your attachment will look something like this:
>
[...]
>
> BUT, that --Apple-Mail line will appear multiple times in the email (in one message with a single attachment, "--Apple-Mail-" appeared 8 times), so you cannot just willy-nilly delete everything up to one of those lines.

Interesting. Is it thus too difficult, or impossible, in BBEdit to write
a script that will find lines, e.g., 76 characters long, with no
spaces or punctuation, delete them until it comes to a line that begins
"--Apple-Mail", move on to the next matching lines, and repeat the
process until the end of file?
Thanks again for your help.

Cheers,

shirasagi

Matt Martini

Jan 4, 2011, 5:19:05 PM
to bbe...@googlegroups.com, shir...@earthlink.net
Shirasagi,

I agree with LuKreme that you probably don't want to start mucking with
the mbox files without a MIME-aware tool, lest you risk
corrupting the mbox. I would not even attempt this without Perl and
MIME::Tools (or an equivalent).
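
As one possible "equivalent", here is a minimal sketch that uses Python's standard mailbox and email modules instead of Perl and MIME::Tools. That substitution, the function name strip_attachments, and the placeholder wording are all assumptions for illustration. It writes a new mbox in which every non-text or attachment part is replaced by a one-line note, leaving the MIME boundaries and the readable body alone.

```python
import mailbox

# Text left behind in place of each removed part (wording is arbitrary).
PLACEHOLDER = "[attachment removed: {name}]\n"


def strip_attachments(src_path, dst_path):
    """Copy src_path to dst_path, replacing attachment bodies with a note.

    Returns the number of parts that were replaced.
    """
    replaced = 0
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    try:
        for message in src:
            for part in message.walk():
                if part.is_multipart():
                    continue  # containers only hold other parts
                is_attachment = part.get_content_disposition() == "attachment"
                if part.get_content_maintype() == "text" and not is_attachment:
                    continue  # keep the readable body
                name = part.get_filename() or "unnamed"
                name = name.encode("ascii", "replace").decode("ascii")
                # Swap the encoded data for a note and reset the headers so
                # the part still parses as ordinary plain text.
                for header in ("Content-Type", "Content-Transfer-Encoding",
                               "Content-Disposition"):
                    del part[header]
                part["Content-Type"] = 'text/plain; charset="us-ascii"'
                part.set_payload(PLACEHOLDER.format(name=name))
                replaced += 1
            dst.add(message)
        dst.flush()
    finally:
        src.close()
        dst.close()
    return replaced
```

The source file is only read, never modified, and the output path should not already exist, because mailbox.mbox appends to an existing file. Inline images are removed along with real attachments, since the test is simply "anything that is not text".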

It might be simpler for you to change the Prefs in Apple Mail so that
the "Keep copies of messages for offline viewing" option (under
Accounts->Advanced) is set to "All messages, but omit attachments".

Good Luck
Matt

David Alexander

Jan 4, 2011, 8:03:30 PM
to bbe...@googlegroups.com
On Tue, 4 Jan 2011 08:14:37 -0800 (PST), "shir...@earthlink.net"
<shir...@earthlink.net> wrote:

>Hello:
>
>Just joined the group. I hope someone here can help.
>The problem: when I archive my e-mail Inbox (Apple Mail), images and
>graphics are saved as enormously long, unintelligible strings of
>alphanumeric characters. I want to keep the archived "mbox" text files
>but remove these big blocks of text, to reduce the files' sizes.

It might be a lot easier to remove the attachments before archiving.
In Mail create a new smart folder with the rule "Contains
Attachments". You could restrict it to searching the folders you
intend to archive if needed. Then go into that smart folder, select
all the emails, go to the "Messages" menu and select "Remove
Attachments". Done.

Then archive them.

LuKreme

Jan 4, 2011, 8:24:50 PM
to bbe...@googlegroups.com
On 4-Jan-2011, at 13:07, shirasagi wrote:
>
> Interesting. Is it thus too difficult, or impossible, in BBEdit to write
> a script that will find lines, e.g., 76 characters long, with no
> spaces or punctuation, delete them until it comes to a line that begins
> "--Apple-Mail", move on to the next matching lines, and repeat the
> process until the end of file?


The trouble is, those MIME lines are BOUNDARIES, so they exist at the beginning and end of each MIME part. Also, the message is marked as multipart. If you simply delete the content of the MIME part and the ending boundary, most programs will no longer be able to read the message properly.

I don't have specific recommendations, as the tools for this sort of manipulation on messages are 1) command-line tools or libraries, 2) tricksy, and 3) dangerous.

I would start over by thinking about exactly what problem you're trying to solve (personally, I don't want to keep emails without keeping all their contents, but that's not to say others won't feel differently).

Just as an example, if your actual need is an mbox of just plain-text emails without HTML, attachments, or any 'extraneous' data, then I would pipe the mbox through formail -s procmail and call a simple procmail recipe that calls the command-line web browser links (or lynx) with the -dump option. I used to do this automatically for HTML email 15 years or so ago.

$ links -dump www.google.com
_________________________________________
_________________________________________
_________________________________________
_________________________________________
_________________________________________
_________________________________________
_________________________________________
Web Images Videos Maps News Shopping Gmail more >>
iGoogle | Settings | Sign in
Google

__________________________________________________________ Advanced
[ Google Search ] [ I'm Feeling Lucky ] SearchLanguage
Tools

Advertising ProgramsBusiness SolutionsAbout Google

(c) 2011 - Privacy
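
If links or lynx is not available, the same "dump the visible text" idea can be roughly approximated with Python's standard html.parser module. This is only a sketch under that assumption: the TextDump class and html_to_text function are hypothetical names, and the output is far cruder than what links -dump produces (no layout, no link handling).

```python
from html.parser import HTMLParser


class TextDump(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextDump()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```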

There is MIMEDefang, a tool designed to work with sendmail (or sendmail replacements that support milters, such as Postfix), and there is also demime, which may or may not help. But as I said, these are low-level tools designed for people who really REALLY know what they are doing, and I don't recommend them. They are also beyond the scope of this list.

Short answer: other than writing a Perl script that you execute from within BBEdit, anything beyond simply deleting the data lines is likely to screw up the mbox file. Deleting the data lines should not alter the messages other than to remove all encoded content. Be aware, however, that some emails will consist ONLY of encoded content, so depending on how the messages were encoded you may lose the entire body of a message doing this.
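
One way to find out in advance which messages would lose their readable body is to look for text parts that are themselves base64-encoded. The sketch below does that with Python's standard mailbox module; the function name base64_text_bodies is hypothetical, and the check covers only text parts marked with "Content-Transfer-Encoding: base64".

```python
import mailbox


def base64_text_bodies(mbox_path):
    """Return (index, subject) for each message with a base64-encoded text part."""
    hits = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box):
            for part in message.walk():
                encoding = str(part.get("Content-Transfer-Encoding", ""))
                if (part.get_content_maintype() == "text"
                        and encoding.strip().lower() == "base64"):
                    hits.append((index, str(message.get("Subject", ""))))
                    break  # one hit per message is enough
    finally:
        box.close()
    return hits
```

Any message this reports should be left alone or handled separately, because stripping its data lines would remove the legible text as well.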

--
Be careful what you wish for. You never know who will be listening. Or
what, for that matter.

Marc Reavis

Jan 5, 2011, 5:18:04 PM
to bbe...@googlegroups.com
LuKreme:

Thanks for your reply. There's much useful information there.
I'll look into writing a Perl script to run from within BBEdit, as you suggest. I'm interested in removing only the gobbledygook text into which Mail renders attached or embedded graphics, not the (legible) content of the e-mail message itself.
I've already looked over formail's man page, and will give that option a try.
As for MIMEdefang and other MIME-aware tools, they look pretty formidable and beyond my needs (not to mention my comprehension).


Regards,

shirasagi

On Jan 4, 2011, at 5:24 PM, LuKreme wrote:
> The trouble is, those MIME lines are BOUNDARIES, so they exist at the beginning and end of each MIME part. Also, the message is marked as multipart. If you simply delete the content of the MIME part and the ending boundary, most programs will no longer be able to read the message properly.
>
> I don't have specific recommendations, as the tools for this sort of manipulation on messages are 1) command-line tools or libraries, 2) tricksy, and 3) dangerous.
>
> I would start over by thinking about exactly what problem you're trying to solve (personally, I don't want to keep emails without keeping all their contents, but that's not to say others won't feel differently).
>
> Just as an example, if your actual need is an mbox of just plain-text emails without HTML, attachments, or any 'extraneous' data, then I would pipe the mbox through formail -s procmail and call a simple procmail recipe that calls the command-line web browser links (or lynx) with the -dump option. I used to do this automatically for HTML email 15 years or so ago.
>

lumin...@gmail.com

Jan 8, 2011, 12:04:53 PM
to bbe...@googlegroups.com
Hi Marc

On 1/5/11 at 2:18 PM, shir...@earthlink.net (Marc Reavis) wrote:

>I'm interested in removing only the gobbledygook text into
>which Mail renders attached or embedded graphics

I was browsing an instructional video site and saw the following
snippet. Perhaps it will be helpful for you.

>Enhanced AppleScript for Extracting Email Attachments

http://www.screencastsonline.com/index_files/SCO0250-macmontage15.php


-Said
