Prompt engineering ChatGPT into formatting raw OCR poems

Weijia Cheng

Apr 5, 2023, 9:43:29 PM
to Standard Ebooks
I'm quite far from being a ChatGPT enthusiast, but while working on the Rihani poetry collection I decided to experiment with ChatGPT to see if it could format raw OCR poems for me.

About a quarter of the poems I'm working on are untranscribed, so I have to dig up the raw OCR from the Internet Archive and add HTML tags. Now it's not that hard for me to manually wrap stanzas in <p> tags and then wrap the verses in <span>s (I have a bash script with a bunch of replace commands that helps speed up the verse wrapping). The most annoying thing I have to do is fix the stanzas, since the OCR inserts lots of extra line breaks into the raw text. I was curious to see how much of this ChatGPT could do for me with the right instructions (after all, this is the kind of text processing it's made for) and came up with this sort of prompt (https://pastebin.com/fcwAS8XY).
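
For context, the wrapping itself is mechanical once the stanza breaks are fixed. Here's a rough Python sketch of what my replace commands amount to (not my actual script, and the markup is simplified):

def wrap_poem(text: str) -> str:
    # Assumes the stanza breaks have already been fixed, so stanzas are
    # separated by blank lines. Wraps each stanza in <p> and each verse in <span>.
    html_stanzas = []
    for stanza in text.strip().split("\n\n"):
        verses = [line.strip() for line in stanza.splitlines() if line.strip()]
        if not verses:
            continue
        lines = []
        for i, verse in enumerate(verses):
            lines.append(f"\t<span>{verse}</span>")
            if i < len(verses) - 1:
                lines.append("\t<br/>")  # line break between verses within a stanza
        html_stanzas.append("<p>\n" + "\n".join(lines) + "\n</p>")
    return "\n".join(html_stanzas)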

If you look at the raw OCR text inside the prompt, you can see that it's a mess. There are supposed to be four verses in each stanza, but there's no consistency with the spacing at all. For comparison, this is the original scan (https://archive.org/details/chantofmysticsot00riha/page/50/mode/2up).

The output (https://pastebin.com/4bD0FCz9) is accurate. It gives me the correct grouping of verses into stanzas and inserts the <p> and <span> tags I specified, following the example.

It also works when the number of verses in each stanza is not the same. I could prompt it slightly differently (https://pastebin.com/ESsEaYKZ) by telling it how many verses are in each stanza.

Again, the output (https://pastebin.com/BapxMQDk) is accurate to what I specified.

It doesn't do great with long inputs (I think more than a page or two of poetry at a time causes it to lose sight of the goal) or more complicated instructions (I played with getting it to insert indentation classes, but that tended to cause formatting errors). But it does seem to make this first pass of transcription go much faster. I've checked for errors by using a diff tool and I haven't seen any cases so far of ChatGPT rewriting the text. Since this is raw OCR and I have to proofread closely anyways, even that isn't that big of a deal.
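
The check itself is nothing fancy: strip the tags, split into words, and diff. Roughly like this (a sketch, not the exact commands I ran):

import difflib
import re

def words(text: str) -> list:
    # Strip any tags and split into words so only the wording is compared.
    return re.sub(r"<[^>]+>", " ", text).split()

def report_rewrites(raw_ocr: str, chatgpt_html: str) -> None:
    # Any +/- lines printed here mean ChatGPT changed, added, or dropped a word.
    diff = difflib.unified_diff(words(raw_ocr), words(chatgpt_html), lineterm="")
    for line in diff:
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            print(line)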

While in general AI tools like ChatGPT aren't super relevant to what we do (yet, at least), since our project is all about human editorial judgement, I think they might help more experienced producers speedrun transcriptions in the future without having to rely on existing transcriptions for the first step of the pipeline. With GPT-4 combining image and text capabilities, I imagine you could just feed it some page scans and ask it to output the text in basic HTML form with <p> and <i> tags and such.

Anyways, I hope you found my little experiment with AI-assisted ebook development interesting :)

David at Standard Ebooks

Apr 5, 2023, 11:03:21 PM
to Standard Ebooks
That's very interesting and promising, Weijia. Not that it's directly relevant, but I thought you might be interested in my experience with "A.I." in a different domain.

I've been experimenting with Whisper, the OpenAI audio transcription program, mostly on episodes of my podcast (https://twochairs.website).

The original version of Whisper is remarkable, but it does take a while to work: transcribing one episode of my podcast takes several hours at least, on average about three times the length of the source audio. The original Whisper is written entirely in Python, something I find quite remarkable as a Python coder myself, but Python is an interpreted language and so a bit slow.
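
For anyone curious, running the Python version is only a few lines (a sketch; the model size and file names here are just placeholders):

import whisper  # the original Python package: pip install openai-whisper

# Bigger models are more accurate but slower; "small" is a reasonable middle ground.
model = whisper.load_model("small")
result = model.transcribe("episode.mp3")

print(result["text"])             # the full transcript
for seg in result["segments"]:    # per-segment start/end times, handy for timecodes
    print(f'{seg["start"]:8.2f}  {seg["text"]}')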

Well, a developer recently ported all the Whisper code to C++ (the whisper.cpp project, I believe), compiled it and optimised it to run on Apple Silicon chips (like my iMac has). In that form, it is able to transcribe a ninety-minute podcast in two and a half minutes!!

The results are not perfect, of course, but they are pretty good, good enough so that after some light editing they can be published.

What I found particularly interesting, though, was that in editing the transcription of a recent discussion, listening along with my headphones as I carefully read through the transcribed text, I discovered that Whisper doesn’t generate a straight word-for-word transcription of the audio. Instead, like ChatGPT, it sometimes makes stuff up. Just the odd word or two inserted which is not there in the original, generally making sense of what is being said. It’s not directly transcribing sounds: instead, guided by the audio, it’s putting in what you would be likely to say at a particular moment. Mostly that is an accurate transcription of what you actually did say, but not always.

All of these LLM tools work by predicting what the next word or phrase is likely to be on average, and lo and behold the same principles apply when it is doing audio transcription.

It will also leave out words even when the audio of them is perfectly clear. Again, this is usually useful. If a speaker says the same word or phrase more than once (a bad habit I suffer from), it will only include one of those instances in the transcript. Kind of freaky, though, and it does mean you need to carefully check what it’s done.

With those reservations in mind, the speed with which I can now generate a transcription has led to a change in how I edit the audio of the podcast. I can now use it to generate timecode markers in my audio editing program (think closed captions), which makes it a breeze to scan through and spot things which need to be fixed.
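
Concretely, the segment timestamps in Whisper's output can be dumped into a caption-style file that the editing program then shows as markers. A rough sketch, continuing from the result above:

def srt_timestamp(seconds: float) -> str:
    # Format seconds as an SRT-style HH:MM:SS,mmm timestamp.
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("episode.srt", "w", encoding="utf-8") as srt:
    for i, seg in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n")
        srt.write(f'{srt_timestamp(seg["start"])} --> {srt_timestamp(seg["end"])}\n')
        srt.write(seg["text"].strip() + "\n\n")

(If I remember right, the whisper command-line tool can also write .srt files directly.)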

None of this is directly relevant to SE, of course, but perhaps supports the feeling Weijia expresses that these tools, used with discretion, could make life much easier for our producers.

What I'm hoping for is that these LLM tools can be used to greatly improve OCR, but given what I say above, it would be hard to be confident how accurate the resulting transcription would be. But that's what proofreading is for, I guess.

Brian

Apr 6, 2023, 5:03:56 AM
to standar...@googlegroups.com
> None of this is directly relevant to SE, of course, but perhaps supports the feeling Weijia expresses that these tools, used with discretion, could make life much easier for our producers.

Wow. I feel like what you just described could be very relevant! It
seems like creating an initial transcription base by reading aloud
could be a welcome alternate approach. (It probably wouldn't be faster,
but like, my hands have a limited amount of transcription work they
can do in one sitting.)

> What I'm hoping for is that these LLM tools can be used to greatly improve OCR, but given what I say above, it would be hard to be confident how accurate the resulting transcription would be. But that's what proofreading is for, I guess.

I also think it's good for SE that the project is
volunteer-driven. If there were any money involved, I would worry
about people coming in and trying to use "AI" tools to do half-assed
work. I imagine nearly everyone here has noticed the slow rise of
obvious editorial errors in professional publications over the last
few decades, as human editors get replaced by grammar-checking
programs? It seems likely that this trend is going to accelerate in
the future in for-profit organizations, though the errors may
become less superficially obvious. But the fact that there's nothing
(pecuniary) to be earned in volunteer organizations like this one will
hopefully protect it from people trying to misuse such tools.

Alex Cabal

Apr 6, 2023, 3:44:15 PM
to standar...@googlegroups.com
That's very impressive. I've been continually impressed with what
ChatGPT can do.

Maybe there's room to integrate it into our workflow somewhere, to
assist producers in working on scaffolding like this. On the other hand,
it would be a problem if GPT's output changed correct prose into
incorrect prose as part of its processing.

In any case, we just have to wait and see how GPT continues to develop.


king....@gmail.com

Apr 7, 2023, 6:33:09 PM
to Standard Ebooks
I've used it a little to help identify OCR errors where many existed in a text, to speed up that process. It seemed like a good fit: list letter groups that are very similar to real words but are not actually words. It found many, missed a good number, but kept trying to "improve" the author's word choice.
Also, where many foreign words were sprinkled through the text, it was fairly good at finding them, though it produced many false positives; still, it was somewhat useful.
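
For what it's worth, that first check (letter groups close to real words but not actually words) can be roughed out locally without ChatGPT, using a wordlist and fuzzy matching. A crude sketch; the wordlist path is just a placeholder, and it will also flag proper nouns and archaic spellings:

import difflib
import re

# Placeholder path: any plain wordlist with one word per line will do.
with open("/usr/share/dict/words", encoding="utf-8") as f:
    word_set = {w.strip().lower() for w in f if w.strip()}
word_list = sorted(word_set)

def likely_ocr_errors(text: str):
    # Flag tokens that aren't real words but are very close to one (e.g. "tbe" -> "the").
    # Slow on a whole book, since each unknown token is compared against the wordlist.
    flagged = []
    for token in sorted(set(re.findall(r"[A-Za-z]+", text))):
        if token.lower() in word_set:
            continue
        close = difflib.get_close_matches(token.lower(), word_list, n=3, cutoff=0.85)
        if close:
            flagged.append((token, close))
    return flagged
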
Paul