Finding and Deleting Specific Number of Characters from Transcript

Cody Nitcher

Sep 7, 2023, 5:17:26 PM
to TextSoap
Hello,

I'm completely new to TextSoap and regex specifically, but I believe TextSoap can really speed up my workflow in processing transcripts from a podcast I edit. The transcripts come to me in this format:

00;01;07;14 - 00;01;18;07
Evan
Hi. 75 episodes. What a feat. And I feel so honored to have been chosen to be a part of your 75th episode. Super exciting. Thanks for having me.

00;01;18;09 - 00;01;43;01
Abbey
Sure. Thank you for being here. It has been very exciting. And like I said, we've been shocked and pleased with ourselves that we've been able to stick with it for 75 episodes. We had a big celebration on our 50th, so we'll have another big celebration on our 100th episode and we'll even invite you to join it. So, Evan, first, let's start off today for our listeners.

00;01;43;04 - 00;01;50;28
Abbey
Tell us a little bit about you and how you came to be where you are today and what you're doing today.

00;01;51;04 - 00;02;15;18
Evan
Sure. Happy to do that. You know, it's an interesting story and we could we could have a podcast that lasts all weekend. You know, I was diagnosed very early with ADHD back when it was the inattentive type was A.D.D., Right. Somehow I got a little hyperactive toward high school. So years later, somehow I got a little hyperactive and so it really was ADHD.

//

Is there a way to utilize TextSoap/regex to search for and remove the blank line plus the timecode stamps? I've tried looking into regex syntax but it's pretty overwhelming. I appreciate any help anyone has to offer!

Mark Munz

Sep 8, 2023, 1:05:50 PM
to text...@googlegroups.com
You'll need to create a custom cleaner.
To remove the timestamp lines, use something like this:

image.png

This regular expression will:
^ - anchor to the start of the line (since Match Lines option is set)
\d+ - match one or more digits (0-9)
[\d;\x{20}-]+ - match one or more digits (\d), semicolons (;), spaces (\x{20}), and dashes (-). You could write the space as a literal " ", but \x{20} is easier to read when you come back to it later. The [ ] let you specify a set of characters and character classes. See Help > Regex Reference in the TextSoap app for additional information.
\n - matches the EOL (TextSoap 9 uses \n to represent all end of line (EOL) characters internally, no matter what the line-ending style was originally)

The replace is blank, so each match is deleted.
This removes every line containing a timestamp (in the format you specified).
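
If you want to try the pattern outside TextSoap first, here is a rough Python sketch of the same idea. It is only an approximation under a few assumptions: Python's re module does not accept \x{20}, so the space is written as \x20, and re.MULTILINE stands in for the Match Lines option. The function name and the shortened sample text are made up for illustration.

```python
import re

# Approximation of the cleaner's pattern: ^\d+[\d;\x{20}-]+\n
# Assumption: \x20 replaces \x{20}, and re.MULTILINE plays the role
# of TextSoap's "Match Lines" option so ^ anchors at every line start.
TIMESTAMP_LINE = re.compile(r"^\d+[\d;\x20-]+\n", re.MULTILINE)


def remove_timestamp_lines(text: str) -> str:
    """Delete every line made up only of digits, semicolons, spaces, and dashes."""
    return TIMESTAMP_LINE.sub("", text)


# Shortened sample in the transcript format from the question.
sample = (
    "00;01;07;14 - 00;01;18;07\n"
    "Evan\n"
    "Hi. 75 episodes. What a feat.\n"
    "\n"
    "00;01;18;09 - 00;01;43;01\n"
    "Abbey\n"
    "Sure. Thank you for being here.\n"
)

cleaned = remove_timestamp_lines(sample)
print(cleaned)
```

One thing to keep in mind: the pattern does not check for the exact timecode layout, so any other line consisting solely of digits, semicolons, spaces, and dashes would be deleted as well.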

If you want to also delete any remaining blank lines, you can use an action like this:
image.png

This regular expression will:
^ - anchor to the start of the line (since Match Lines option is set)
\n - matches the EOL

Again, the replace is blank, which will delete the match.
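
The blank-line step can be sketched the same way in Python. Again this is an illustration rather than TextSoap's own implementation: re.MULTILINE is assumed to behave like the Match Lines option, and the function name and sample text are hypothetical.

```python
import re

# Approximation of the second pattern: ^\n
# With re.MULTILINE, ^ matches at the start of every line, so "^\n"
# matches a line that contains nothing but its line ending.
BLANK_LINE = re.compile(r"^\n", re.MULTILINE)


def remove_blank_lines(text: str) -> str:
    """Delete every completely empty line."""
    return BLANK_LINE.sub("", text)


# Sample as it might look after the timestamp lines are already gone.
after_timestamps = (
    "Evan\n"
    "Hi. 75 episodes. What a feat.\n"
    "\n"
    "Abbey\n"
    "Sure. Thank you for being here.\n"
    "\n"
    "\n"
    "Abbey\n"
    "Tell us a little bit about you.\n"
)

compact = remove_blank_lines(after_timestamps)
print(compact)
```

Lines that contain only spaces or tabs are not matched by this pattern, since it requires the line ending to come immediately after the start of the line.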

Hope that works for you.

--
Mark Munz
unmarked software
https://textsoap.com/
