extracting text between markers from a text file

Beate

Sep 12, 2010, 12:11:14 PM
to TextSoap
I have a long text file containing several parts of text of interest
embedded between markers:

---here_it_comes--->text of interest<---this_was_it---

The challenge is to export the text between the markers as individual
text files. I hope it is possible to do that from within TextSoap.
Are there any hints on how to achieve this? Many thanks in advance!

Mark Munz

Sep 13, 2010, 2:31:44 AM
to text...@googlegroups.com
There is no completely generic answer to this. How easy it is to
extract the text depends on what the markers are, but there are a
couple of options that should make it fairly straightforward.

For example, if you're trying to extract the textual portion of raw
html, there's a cleaner built in to do that.

Otherwise, you'll need to set up a custom cleaner with an extract text
action. You'll need a regular expression to match the "markers"
(both beginning & ending) and the text. Depending on the complexity of
the markers, it can be a simple expression or something much more
involved. For example, <b> and </b> are fairly easy to find.

Can you offer some additional context to indicate the type of markers
you're dealing with?

Mark

> --
> You received this message because you are subscribed to the Google Groups "TextSoap" group.
> To post to this group, send email to text...@googlegroups.com.
> To unsubscribe from this group, send email to textsoap+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/textsoap?hl=en.
>
>

--
Mark Munz
unmarked software
http://www.unmarked.com/

Beate

Sep 13, 2010, 2:47:47 AM
to TextSoap
The text snippets to be extracted are in plain text files
(one snippet per file, or several per file).
The markers are:

---begin--->
here is the text snippet to be extracted
<---end---

I can change the markers easily to something else if necessary. The
aim is to have plain text files with only the extracted text snippet
(1/file)

I am still not sure how to set this up in TextSoap, so any more
information would be very helpful!

Mark Munz

Sep 13, 2010, 2:58:42 PM
to text...@googlegroups.com
Assuming the two markers you indicated, you want to create a 3-action
custom cleaner:

First action: Extract Text

Expression: (?s)---begin--->.*<---end---

Second action: Find/Replace

Find: ---begin--->
Replace:

Third action: Find/Replace

Find: <---end---
Replace:

The first action will extract the text, including the markers. The
remaining two steps simply find the beginning and end markers and
delete them (by replacing them with nothing).

The (?s) at the beginning of the regular expression means that .* will
match newlines (returns). Without it, . stops at the end of each line,
so a snippet that spans several lines would not be matched.
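
For anyone who wants to check the expression outside TextSoap, here is
a minimal Python sketch of the same idea. It is not how TextSoap works
internally; the names BEGIN, END, SNIPPET and extract_snippets are made
up for illustration. re.DOTALL plays the role of (?s). The sketch
deviates from the expression above in one respect: it uses a non-greedy
.*? with a capture group, so that each begin/end pair stays separate
when a file holds several snippets, and the markers are dropped in one
step instead of with the two Find/Replace actions.

```python
import re

# Marker strings taken from this thread; change them if your files differ.
BEGIN = "---begin--->"
END = "<---end---"

# re.DOTALL is the flag form of (?s): it lets "." match newlines.
# ".*?" is non-greedy, so each begin/end pair is matched on its own.
SNIPPET = re.compile(re.escape(BEGIN) + r"(.*?)" + re.escape(END), re.DOTALL)


def extract_snippets(text):
    """Return the text between each begin/end pair, without the markers."""
    # strip() removes the line breaks that sit right next to the markers.
    return [snippet.strip() for snippet in SNIPPET.findall(text)]


sample = """notes before
---begin--->
first snippet
spanning two lines
<---end---
noise in between
---begin--->second snippet<---end---
"""

for snippet in extract_snippets(sample):
    print(repr(snippet))
```

With a greedy .* the sample above would yield a single match running
from the first begin marker to the last end marker, including the
"noise in between" line.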

Mark

Beate

Sep 13, 2010, 4:12:34 PM
to TextSoap
OK, thank you. When I open a file, select all text and run this
cleaner, I get the text snippet of interest.

Now I just wonder how I could apply this to many text files in a
folder without opening each one and selecting, cleaning, and saving.
Is this possible?

Mark Munz

Sep 13, 2010, 5:41:28 PM
to text...@googlegroups.com
You can accomplish that with the help of Automator:

Create a workflow or application (droplet) with a "Clean Text Files"
action. This will allow you to specify the type of files that can be
processed as well as the TextSoap cleaner to use to process the files.
There are also options to preserve the original copies for safe
cleaning.

With a droplet, you can just drag-n-drop the files you want to process
onto the droplet icon and they'll be cleaned for you.
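
If Automator is not an option, the batch step can be approximated with
a short script. This is only a minimal Python sketch, assuming the
---begin---> and <---end--- markers from this thread, UTF-8 encoded
.txt files, and hypothetical folder names; export_snippets is an
illustrative helper, not part of TextSoap or Automator. It writes the
snippets to a separate folder, so the original files are left
untouched.

```python
import re
from pathlib import Path

# Same markers as in the thread; non-greedy so several snippets per file work.
SNIPPET = re.compile(r"---begin--->(.*?)<---end---", re.DOTALL)


def export_snippets(source_dir, output_dir):
    """Write every marked snippet found in source_dir/*.txt to its own file."""
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() reads the file list up front and gives a stable order.
    for path in sorted(source_dir.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for index, snippet in enumerate(SNIPPET.findall(text), start=1):
            # One output file per snippet, e.g. notes_1.txt, notes_2.txt.
            target = output_dir / f"{path.stem}_{index}.txt"
            target.write_text(snippet.strip() + "\n", encoding="utf-8")
            written.append(target)
    return written


# Example call with hypothetical folder names:
# export_snippets("synced_notes", "extracted_snippets")
```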

Mark

Beate

Sep 14, 2010, 3:40:39 AM
to TextSoap
thank you, very helpful!

Beate

Sep 14, 2010, 11:29:00 AM
to TextSoap
Just succeeded in getting this to work (i.e. cleaning my synced
Simplenotes files for import into a writing application). You helped a
great deal, thanks a lot!