Escape Sequences

Tony Proctor

unread,
Apr 6, 2013, 5:48:07 AM4/6/13
to root...@googlegroups.com
There are many schemes for 'escape sequences' that we have to deal with in software systems. For instance the '\' syntax in C-like languages, character entities and entity references in XML, UCF (Uncertain Character Format) for transcribing uncertain characters, etc.
 
I'm particularly interested in the requirement of UCF for representing evidence in a data model. However, reserving a whole bunch of extra characters such as [{*?_ is very messy so I'm looking for some neater approach.
 
If data is being represented in XML then you could argue that the UCF sequences be enclosed in a specific element, e.g. <ucf>123?[78]</ucf>. That is workable for data although it limits the usage of such sequences in attribute values. From a purist POV, it can be argued that putting such sequences in an attribute value is wrong, but we're getting down to specifics there. The representation may not even be XML.
 
OK, so you might be thinking 'just reserve one character and use it as a quote-unquote'. For instance: &123?[78]&. Again, this is workable but it's a private approach, and you have to define how to represent embedded quoting characters (e.g. pairing them up or something else).
 
There is a strong parallel between UCF sequences and markup, although the data and markup are not clearly separated in this case.
 
I know a number of you (like me) are old enough to fondly remember ANSI escape sequences. The standard is rarely mentioned these days, and is generally thought of as a way of communicating with h/w terminals. However, you could argue that it was a type of standardised markup, and was used for representing presentational markup (such as bold, underline, colours) in documents. More than that, though, it accommodated private sequences and the APC (Application Program Command) that could be, and occasionally were, used for semantic and procedural markup.
 
In summary, can anyone think of a problem using the syntax of ANSI escape sequences to embrace special character sequences such as UCF? Calling it a "standard way" may not be strictly accurate but it is using an existing standard as a precedent. Has anyone seen this done before?
 
    Tony Proctor

Ben Laurie

unread,
Apr 6, 2013, 7:07:41 AM4/6/13
to root...@googlegroups.com
The obvious issue with ANSI escape sequences is that they start with
ESC - a character that's difficult to persuade many tools to include
in text strings.

UCF already borrows from a standard, namely regular expressions. It
does have the slightly odd characteristic that regexes are normally
used in the search string, not in the target string. This is actually
a problem for search, and one thing I keep musing about is a library
for doing regex-to-regex matching.

Tony Proctor

unread,
Apr 6, 2013, 7:23:13 AM4/6/13
to root...@googlegroups.com
RegEx is in common use, although not strictly a standard AFAIK, and UCF is
only a small subset of it. Sites like FreeBMD also distance themselves from
the similarities with RegEx.

Regarding <ESC>, in what way is it difficult to get into a text string Ben?
I'm assuming you're thinking other than the difficulty with keyboard entry.
I don't see any need for manual entry of such characters if you're using an
appropriate editing tool. Similarly with output - you wouldn't expect to see
raw <ESC> displayed.

What did you mean by a regex-to-regex mapping library?

Tony Proctor

Ben Laurie

unread,
Apr 6, 2013, 7:30:02 AM4/6/13
to root...@googlegroups.com
On 6 April 2013 12:23, Tony Proctor <to...@proctor.net> wrote:
> RegEx is in common use, although not strictly a standard AFAIK, but UCF is
> only a small subset of it. Sites like FreeBMD also distance themselves from
> the similarities with RegEx.

They do?

> Regarding <ESC>, in what way is it difficult to get into a text string Ben?

Obviously I can get <ESC> into a string, here it is. But the actual
character ESC (0x1b) is not so easy. For example, I can't put it in
this email.

> I'm assuming you're thinking other than the difficulty with keyboard entry.
> I don't see any need for manual entry of such characters if you're using an
> appropriate editing tool. Similarly with output - you wouldn't expect to see
> raw <ESC> displayed.

In practice, people like to put transcriptions into all sorts of
things - like emails, or spreadsheets. Specifying a format that
doesn't work with these media is a problem.

> What did you mean by a regex-to-regex mapping library?

I said "matching", not "mapping", and I mean something that knows that
a[bc]d matches [ab]cd, for example.
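A brute-force sketch of that idea: if the patterns contain only literals and [..] character sets (as UCF transcriptions largely do), each pattern denotes a finite set of strings, so two patterns "match" exactly when those sets intersect. This is only an illustration under that assumption; `expand` and `patterns_match` are made-up names, not an existing library:

```python
from itertools import product

def expand(pattern):
    """Enumerate every concrete string a class-only pattern can denote.
    Supports literals and [..] character sets; no repetition operators."""
    parts, i = [], 0
    while i < len(pattern):
        if pattern[i] == '[':
            j = pattern.index(']', i)
            parts.append(pattern[i + 1:j])   # one of these characters
            i = j + 1
        else:
            parts.append(pattern[i])         # a single literal character
            i += 1
    return {''.join(p) for p in product(*parts)}

def patterns_match(p1, p2):
    """True iff some concrete string satisfies both patterns."""
    return bool(expand(p1) & expand(p2))
```

This is exponential in the number of classes, so it only works for short fields, but it captures the semantics cleanly: a[bc]d and [ab]cd match because both can denote "acd".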

Tony Proctor

unread,
Apr 6, 2013, 7:57:34 AM4/6/13
to root...@googlegroups.com
This note appears at: http://freebmd.rootsweb.com/Format.shtml#UCF

"Technical note: Although this UCF format has many similarities to
regular expressions (e.g. Perl, Unix) it is not identical and in particular
there is no escape mechanism."
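For illustration, the similarity can be made concrete by translating a UCF fragment into a Python regex. This is only a sketch assuming the common readings of the symbols (? for one uncertain character, * for an uncertain run, [..] for a set of candidate readings); `ucf_to_regex` is a hypothetical helper, not part of FreeBMD:

```python
import re

# Assumed UCF conventions (per the FreeBMD description, not a formal spec):
#   ?      one uncertain character
#   *      an uncertain run of characters
#   [abc]  one character, believed to be one of a, b, c
UCF_TOKEN = re.compile(r'\[[^\]]*\]|\?|\*|.')

def ucf_to_regex(ucf):
    """Translate a UCF fragment into an equivalent Python regex."""
    out = []
    for tok in UCF_TOKEN.findall(ucf):
        if tok == '?':
            out.append('.')
        elif tok == '*':
            out.append('.*')
        elif tok.startswith('['):
            out.append('[' + tok[1:-1] + ']')   # pass the class through
        else:
            out.append(re.escape(tok))          # literal character
    return ''.join(out)
```

Note there is still no escape mechanism here, exactly as the FreeBMD note says: a literal '?' in the source text cannot be represented.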

Ben Laurie

unread,
Apr 6, 2013, 7:58:50 AM4/6/13
to root...@googlegroups.com
On 6 April 2013 12:57, Tony Proctor <to...@proctor.net> wrote:
> This note appears at: http://freebmd.rootsweb.com/Format.shtml#UCF
>
> "Technical note: Although this UCF format has many similarities to
> regular expressions (e.g. Perl, Unix) it is not identical and in particular
> there is no escape mechanism."

Yeah, only because the data sets we transcribe don't include the UCF
characters :-)

I would not be averse to defining an escape mechanism.

Tony Proctor

unread,
Apr 6, 2013, 9:04:40 AM4/6/13
to root...@googlegroups.com
Sorry about the matching/mapping confusion Ben. If there were a standardised
syntax for UCF, and possibly even a way of escaping it, then such a library
would be relatively easy to provide.

I hate to say this but it sounds like something FHISO would be good at.
Rather than bundle it in with a Data Model standard, it would be a small
standard in its own right.

I see nothing fundamentally wrong with the FreeBSD syntax - other than there
being no explicit escaping - but how many competing schemes are there? What
do libraries and archives use for uncertain transcriptions?

Tony Proctor

unread,
Apr 6, 2013, 9:06:18 AM4/6/13
to root...@googlegroups.com
I meant "FreeBMD", not "FreeBSD". Spelling corrector's mistake :-)

Ben Brumfield

unread,
Apr 6, 2013, 9:43:05 AM4/6/13
to root...@googlegroups.com
So far as I'm aware, UCF is significantly more precise than any notations used by libraries, archives, or documentary editors. 

The print notations described in _A Guide to Documentary Editing_ and _Editing Historical Documents: A Handbook of Practice_ are relatively poor, and not well suited to records.  They tend to be compact but vague, and are supplemented with explanatory footnotes.  If you like, I can quote section 5.23.

About a year ago I tweeted about UCF, and got enthusiastic, "I wish I'd known about this during my dissertation" responses from people who are very well informed indeed.  (I believe that the editor of the Old Bailey Online was one of them, but would need to dive through Twitter archives.)

I just asked on the code4lib IRC channel and was told that "certain cataloging rules use square brackets to indicate inferred or interpreted information", which is essentially an application of print notations used by editors, it seems to me.  Otherwise they use TEI and its UNCLEAR and CHOICE tags.

Ben
http://manuscripttranscription.blogspot.com/

Tony Proctor

unread,
Apr 6, 2013, 9:52:22 AM4/6/13
to root...@googlegroups.com
This stuff is essential for all transcribed records. I think FHISO would be happy to be involved with pushing this towards a standard Ben. How would you feel about that?
 
I also added this question to your blog-post at: http://www.blogger.com/comment.g?blogID=5730930067468816440&postID=4469496210545595842&page=1&token=1365255501002. It relates to other aspects of transcription than uncertain characters:
 
    How does FreeUKGEN represent text that was struck-out in its original form Ben?

    This seems to be separate from UCF but is still relevant when transcribing original data.

    Another case would be 'side notes', say in the margin or between the lines.

Ben Laurie

unread,
Apr 6, 2013, 12:57:33 PM4/6/13
to root...@googlegroups.com
On 6 April 2013 14:04, Tony Proctor <to...@proctor.net> wrote:
> Sorry about the matching/mapping confusion Ben. If there were a standardised
> syntax for UCF, and possibly even a way of escaping it, then such a library
> would be relatively easy to provide.

I'm not so sure about that, but I'd be very happy to be proven wrong.

Tony Proctor

unread,
Apr 6, 2013, 1:50:35 PM4/6/13
to root...@googlegroups.com
It's only a variation of a standard wildcard algorithm. There are efficient
sequential solutions in addition to the inefficient recursive ones used in
student textbooks :-)
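The efficient sequential approach alluded to here is presumably something like the classic two-pointer wildcard matcher, which records the last '*' position and backtracks to it instead of recursing. A sketch (`wildcard_match` is a hypothetical name, and this matches a pattern against plain text, not against another pattern):

```python
def wildcard_match(text, pattern):
    """Iterative */? wildcard match: '?' matches any one character,
    '*' matches any run. No recursion, O(len(text) * len(pattern))
    worst case but close to linear in practice."""
    t = p = 0
    star = -1          # index of the last '*' seen in the pattern
    t_backtrack = 0    # text position to resume from after that '*'
    while t < len(text):
        if p < len(pattern) and pattern[p] in ('?', text[t]):
            t += 1; p += 1                     # plain character match
        elif p < len(pattern) and pattern[p] == '*':
            star = p; t_backtrack = t; p += 1  # remember the star, try empty
        elif star != -1:
            p = star + 1                       # let the star absorb one more
            t_backtrack += 1; t = t_backtrack
        else:
            return False
    while p < len(pattern) and pattern[p] == '*':
        p += 1                                 # trailing stars match empty
    return p == len(pattern)
```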

Ben Laurie

unread,
Apr 6, 2013, 2:34:21 PM4/6/13
to root...@googlegroups.com
On 6 April 2013 18:50, Tony Proctor <to...@proctor.net> wrote:
> It's only a variation of a standard wildcard algorithm. There are efficient
> sequential solutions in addition to the inefficient recursive ones used in
> student text books :-)

I'll believe it when I see it :-)

Ben Brumfield

unread,
Apr 6, 2013, 3:05:03 PM4/6/13
to root...@googlegroups.com
On Saturday, April 6, 2013 8:52:22 AM UTC-5, Tony Proctor wrote:
> This stuff is essential for all transcribed records. I think FHISO would be happy to be involved with pushing this towards a standard Ben. How would you feel about that?

I'd say that I'm very interested in
1) helping prevent a 'wrong' standard
2) publicizing and documenting best practices among family historians, whether that means pointing people towards the _Guide to Documentary Editing_, the TEI Guidelines or UCF

I came to RootsTech prepared to argue for TEI at the FHISO booth and was delighted to see the CfP.  Then I read a bunch of stuff about how to prepare a proper cover letter for a FHISO standard submission and decided I had other things to do that were more important.  Maybe you can suggest the next step?
 
 
> I also added this question to your blog-post at: http://www.blogger.com/comment.g?blogID=5730930067468816440&postID=4469496210545595842&page=1&token=1365255501002. It relates to other aspects of transcription than uncertain characters:
>
>     How does FreeUKGEN represent text that was struck-out in its original form Ben?
>
>     This seems to be separate from UCF but is still relevant when transcribing original data.
>
>     Another case would be 'side notes', say in the margin or between the lines.

This gets to the heart of the difference between UCF and TEI.  TEI describes an entire document, using XML tags for indicating interpretation or features of the original.  (For example http://www.tei-c.org/release/doc/tei-p5-doc/en/html/MS.html#mspham talks about how marginalia are modeled, while http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#COEDADD discusses strike-throughs as a subset of deletions.)  The goal of TEI is to model the whole document in a way that will be generally useful for different kinds of analysis, display, and extraction.

UCF on the other hand exists only within the context of individual fields within structured records in a database.  There is neither a need nor an opportunity to use the kinds of plain-speech description that can be possible with a footnote (in print) or attributes like REASON in TEI, since the database will be searched and displayed within the context of the limits imposed by its schema.  If a strike-through replaces "George" with "Bill", we want "Bill" to be the searchable text. 

If we were building a genetic edition (cf http://www.tei-c.org/Activities/Council/Working/tcw19.html) database, it would be a different matter -- we'd want to preserve George within a time-frame and Bill within another, but that's not the goal of most databases of genealogical records.

To be honest, I'm still thinking about this.  If I understand correctly, a lot of the theory behind genetic editions is still being developed -- though there are some neat uses like the Proust viewer prototype at http://research.cch.kcl.ac.uk/proust_prototype/ .  It complicates matters to consider that it's possible to transform UCF into TEI into print notation, and sometimes back again.

Ben Brumfield
http://manuscripttranscription.blogspot.com/

Tony Proctor

unread,
Apr 6, 2013, 3:31:04 PM4/6/13
to root...@googlegroups.com
It sounds like you may have been put off by a misconception Ben. FHISO (and especially me) would love you to make a submission to their CfP. However, they're not looking for complete standards or entire solutions at this stage.
 
There are 3 main categories of proposal: Technical and Functional (both referring to proposals of a technical nature, or functional requirements, for the main Data Model standard), and Standardisation (referring to a proposal for a standard other than the Data Model).
 
For instance, in order to propose a standard type of markup based on your UCF then you'd only need a page or two, and one of those would probably be a simple header page. There are some examples on the FHISO CfP web page (http://fhiso.org/call-for-papers/) but they don't include a proposal from that category yet, although I have submitted some. Obviously it needs to get the gist across, ideally justified with use-cases if necessary and including some references, but it doesn't have to be a large document at all. This is just getting the proposal "on the table" so that corresponding teams can be set up (which I hope you'd be involved with).
 
One of my functional requirements for the Data Model is that it have support for uncertainty in transcriptions. I didn't specify a technical solution but I would hope that the selected one would either be something else done within FHISO or a recognised standard. I prefer FreeUKGEN's UCF over things like TEI alternatives for many of the reasons you give. Plus a RegEx syntax is powerful from a real processing perspective, as opposed to simply being expressive. RegEx algorithms are well-established.
 
    Tony Proctor

Ben Brumfield

unread,
Apr 6, 2013, 9:20:58 PM4/6/13
to root...@googlegroups.com
On Sat, Apr 6, 2013 at 2:31 PM, Tony Proctor <to...@proctor.net> wrote:
> It sounds like you may have been put off by a misconception Ben. FHISO (and
> especially me) would love you to make a submission to their CfP. However,
> they're not looking for complete standards or entire solutions at this
> stage.
>
The online CfP contains examples, which don't look too intimidating.
Let me mull over what I think the case would be (and get through the
current crunch time on Open Source Indexing) and we'll see what we
come up with.

> One of my functional requirements for the Data Model is that it have support
> for uncertainty in transcriptions. I didn't specify a technical solution but
> I would hope that the selected one would either be something else done
> within FHISO or a recognised standard. I prefer FreeUKGEN's UCF over things
> like TEI alternatives for many of the reasons you give. Plus a RegEx syntax
> is powerful from a real processing perspective, as opposed to simply being
> expressive. RegEx algorithms are well-established.
>
Regarding TEI, I really think it's important to separate the
underlying concepts (which are the product of decades of discussion by
the people with the most experience working with manuscripts) from the
idea of TEI as a data-entry format. There are a lot of objections to
having users type angle brackets, but hand-editing XML is not a
requirement for TEI. (I gave a talk on the topic at the TEI meeting
[transcript here:
http://manuscripttranscription.blogspot.com/2012/11/what-does-it-mean-to-support-tei-for.html
] and will be doing a similar one on your side of the pond at the end
of this month.) Programs may store transcripts in other formats and
use TEI for export and interchange, or (as is the case with
Papyri.info and Itinera Nova) they may use TEI internally but present
traditional print apparatus to editors and readers.

So I wonder what the appropriate yardstick to use is. It's very
easy to imagine scripts to convert the "narrative" concept in STEMMA
to a subset of TEI and back again, or my own FromThePage wiki syntax.
The same is likely true of most print notations.

I don't think it would be possible to convert most genealogical record
"indexing" transcription projects into TEI or printable editions,
however.
To use the example I know best--the FreeREG database of parish
registers--it seems nearly impossible to reconstitute the original
text of an entry from the abstracts we have transcribed in our
database. You cannot get from a baptismal entry from
http://www.freereg.org.uk/howto/enterdata.htm#baptisms into an
approximation of the words on the original register -- there's no way
to know that the original was in Latin in order to print "baptizata"
instead of "baptized", for example. This is why I struggle with
whether to consider an indexing database as a peculiar kind of edition
or as an edition-like abstract. Once you add the needs of querying,
it seems like a different animal from a digital edition of family
letters altogether, though they both require support for uncertainty.

(Meanwhile the TEI folks are really struggling with ways to represent
financial records and other tabular material, though I'm not sure a
structured DB approach would work any better.)

Ben

Tony Proctor

unread,
Apr 7, 2013, 12:16:36 PM4/7/13
to root...@googlegroups.com
As with a standard wildcard match algorithm, there are really only 3 basic
ways of matching:

1) character-for-character match
2) an optional character
3) a series of zero-or-more characters

In a RegEx-to-RegEx, these may be in both search-str and target-str, and
those 3 cases have a little more functionality, but not enough to make it
radically different.

The [] syntax is just extending a case of (1) to make it a character-set
match instead of a mere character match, but this is not that different to a
case-blind or accent-blind match, e.g. [AaBbCc...].

OK, either search-str or target-str could have the equivalent of a "*" in
it. This only needs to be effective in the search-str, and can be treated as
a simple character in the target-str, iff the operation is reciprocated,
i.e. S1 matches S2 and S2 matches S1.

NB: In the situation where a case of (3) occurs in both search-str and
target-str at the current match position, you can imagine that treating the
occurrence in the target-str as anything more than a simple character is
superfluous.

Ben Laurie

unread,
Apr 7, 2013, 12:54:31 PM4/7/13
to root...@googlegroups.com
On 7 April 2013 17:16, Tony Proctor <to...@proctor.net> wrote:
> As with a standard wildcard match algorithm, there are really only 3 basic
> ways of matching:
>
> 1) character-for-character match
> 2) an optional character
> 3) a series of zero-or-more characters
>
> In a RegEx-to-RegEx, these may be in both search-str and target-str, and
> those 3 cases have a little more functionality, but not enough to make it
> radically different.
>
> The [] syntax is just extending a case of (1) to make it a character-set
> match instead of a mere character match, but this is not that different to a
> case-blind or accent-blind match, e.g. [AaBbCc...].
>
> OK, either search-str or target-str could have the equivalent of a "*" in
> it. This only needs to be effective in the search-str, and can be treated as
> a simple character in the target-str, iff the operation is reciprocated,
> i.e. S1 matches S2 and S2 matches S1.

Don't get that. Search string is "abc", target string is "a*c", for example.

Luther Tychonievich

unread,
Apr 8, 2013, 12:44:27 PM4/8/13
to root...@googlegroups.com
Regex-to-regex is a well-known problem in discrete math, equivalent to determining if the intersection of two finite automata is empty.

The basic steps of the algorithm are
(1) turn each regular expression into an NFA
(2) compute the complement of both NFA (determinising first, since complementation needs a DFA)
(3) union the complements
(4) complement the union
(5) search for paths that reach an accepting state: if any, the two regexes "match".

Noticeably more difficult than regular expression vs text, and doesn't include non-regular features like back-references. See http://qntm.org/greenery for one example effort to make it work in practice.

Back-references actually mean that matching is not even context-free, though I do not know whether it is still tractable.
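For intersection emptiness specifically, the complement steps can be avoided by running the two automata in lockstep (a product construction). A minimal Python sketch for a UCF-like subset, assuming ? matches any one character, * any run, and [..] a set of candidates; `parse`, `eps_closure` and `matches` are hypothetical helpers, not an existing implementation:

```python
ANY = None  # transition label that matches every character

def parse(pattern):
    """Tokenise a UCF-like pattern: '?' or '.' is any one character,
    '[abc]' is a set, '*' is zero-or-more of anything, else a literal."""
    tokens, i = [], 0
    while i < len(pattern):
        c = pattern[i]
        if c == '[':
            j = pattern.index(']', i)
            tokens.append(('class', frozenset(pattern[i + 1:j])))
            i = j + 1
        elif c in '?.':
            tokens.append(('class', ANY)); i += 1
        elif c == '*':
            tokens.append(('star',)); i += 1
        else:
            tokens.append(('class', frozenset(c))); i += 1
    return tokens

def eps_closure(tokens, s):
    """A '*' state can be skipped for free; follow those epsilon edges."""
    out = {s}
    while s < len(tokens) and tokens[s][0] == 'star':
        s += 1
        out.add(s)
    return out

def matches(p1, p2):
    """True iff some concrete string satisfies both patterns,
    i.e. the intersection of the two languages is non-empty."""
    t1, t2 = parse(p1), parse(p2)
    def moves(toks, s):
        if s == len(toks):
            return []
        tok = toks[s]
        return [(ANY, s)] if tok[0] == 'star' else [(tok[1], s + 1)]
    def compatible(a, b):
        return a is ANY or b is ANY or bool(a & b)
    todo = [(x, y) for x in eps_closure(t1, 0) for y in eps_closure(t2, 0)]
    seen = set(todo)
    while todo:
        s1, s2 = todo.pop()
        if s1 == len(t1) and s2 == len(t2):
            return True  # both automata accept: a shared string exists
        for a, n1 in moves(t1, s1):
            for b, n2 in moves(t2, s2):
                if compatible(a, b):
                    for x in eps_closure(t1, n1):
                        for y in eps_closure(t2, n2):
                            if (x, y) not in seen:
                                seen.add((x, y))
                                todo.append((x, y))
    return False
```

The state space is at most len(p1) × len(p2) pairs, so this stays cheap for record-sized fields; back-references are the feature that would break the construction.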


Back to the U+1b question (which, by the way, you can embed in an email, though many clients don't display it; e.g., the ANSI "cancel the color" sequence is the four characters U+1b U+5b U+30 U+6b: " [0m"). I personally think a better fit would be using Unicode language tags (http://www.fileformat.info/info/unicode/block/tags/list.htm) like
U+E0001 U+E0072 U+E0065 U+E0067 U+E0065 U+E0078 re[gG]ul.r U+E007f
instead of U+1b because
(1) U+1b uses normal printable characters for the rest of its tag, while Unicode language tags are unprinted;
(2) U+1b doesn't have a common end-of-escape marker built in, while Unicode language tags do;
(3) there is built-in room in a language tag version to have, e.g., regex-pl vs regex-posix vs freebmd, etc.;
(4) Unicode language tags have an accepted XML version (lang="regex")
…although I admit that this does conflict with the fact that UCF could be in many different natural languages…
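Since the tag characters simply mirror ASCII at an offset of U+E0000, wrapping a payload in such a tag is a one-liner. A hypothetical sketch of the example above (`tag_wrap` is a made-up helper):

```python
TAG_BEGIN = '\U000E0001'    # U+E0001 LANGUAGE TAG
TAG_CANCEL = '\U000E007F'   # U+E007F CANCEL TAG
TAG_BASE = 0xE0000          # tag characters mirror ASCII at this offset

def tag_wrap(label, payload):
    """Prefix payload with an invisible tag spelling `label`, e.g.
    tag_wrap('regex', 're[gG]ul.r') yields U+E0001 U+E0072 U+E0065
    U+E0067 U+E0065 U+E0078 re[gG]ul.r U+E007F."""
    tag = ''.join(chr(TAG_BASE + ord(c)) for c in label)
    return TAG_BEGIN + tag + payload + TAG_CANCEL
```

The payload itself stays as plain printable text; only the label is carried in unprinted code points.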

Tony Proctor

unread,
Apr 8, 2013, 12:58:23 PM4/8/13
to root...@googlegroups.com
Re: Unicode Language Tags, this would be an abuse of a deprecated set of tags Luther, since this is not a language issue (http://en.wikipedia.org/wiki/Tags_(Unicode_block))

Luther Tychonievich

unread,
Apr 8, 2013, 1:20:21 PM4/8/13
to root...@googlegroups.com
I am not pretending that this is the purpose of the Unicode tags. It's also not the purpose of the output-centric ANSI codes. I hardly see how using a set of codes meant for things like changing color or moving the cursor is any less of an abuse.

If we are worried about not stepping on toes, and if we want something in-text-stream rather than a data-format tag, then wouldn't the private use control codes U+91 and U+92 or the private use areas be the right place to go?

Tony Proctor

unread,
Apr 8, 2013, 1:39:40 PM4/8/13
to root...@googlegroups.com
The ANSI standard provides for an Application Program Command which is basically just an application-defined string. In 8-bit mode, this only requires one leading character and one trailing character. There is no problem with uncertain terminators, and no risk of embedded quoting characters to deal with. The string payload is not defined, and is unrelated to "changing colour or moving the cursor" of a hardware terminal.
 
In a scheme like XML, representation of non-printables is not a problem.
 
So, there are just two entities needed:
 
    APC = 0x9F (or ESC _ in the old 7-bit mode): Application Program Command
    ST = 0x9C (or ESC \ in the old 7-bit mode): String Terminator
 
I know I have used this somewhere in my past to encode something akin to procedural mark-up but it's too long ago for me to remember now. For instance
 
    APC ?12[68] ST
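A sketch of how such wrapping and unwrapping might look in code, using the C1 APC/ST values above and their 7-bit equivalents; `wrap_ucf` and `extract_ucf` are hypothetical names for illustration only:

```python
import re

APC_8BIT, ST_8BIT = '\x9f', '\x9c'        # C1 controls: APC ... ST
APC_7BIT, ST_7BIT = '\x1b_', '\x1b\\'     # 7-bit forms: ESC _ ... ESC \

def wrap_ucf(ucf, seven_bit=False):
    """Enclose a UCF fragment in ANSI APC ... ST delimiters."""
    if seven_bit:
        return APC_7BIT + ucf + ST_7BIT
    return APC_8BIT + ucf + ST_8BIT

# Accept either form of delimiter; the payload itself is opaque.
APC_RE = re.compile('(?:\x9f|\x1b_)(.*?)(?:\x9c|\x1b\\\\)')

def extract_ucf(text):
    """Pull every APC-delimited payload out of a string."""
    return APC_RE.findall(text)
```

No character inside the payload needs quoting, which is exactly the attraction of this scheme over reserving printable metacharacters.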

Ben Brumfield

unread,
Apr 8, 2013, 2:22:37 PM4/8/13
to root...@googlegroups.com
Can you folks remind me which problem we're trying to solve?  If it's a definition of mark-up for indicating unclear text, that's one thing.  If it's "how do we escape special characters in UCF" or "how do we escape tags in XML", that's a different problem altogether.

Ben

Tony Proctor

unread,
Apr 8, 2013, 2:54:50 PM4/8/13
to root...@googlegroups.com
My original post was related to: "how do we escape special characters in UCF". That is why I elaborated on the ANSI escape sequences below.
 
Re: "definition of mark-up for indicating unclear text", This is not directly related. I strongly believe that FHISO can help with this. There are not many modern computer-readable schemes out there, and ones like UCF are well=established.
 
Re: "do we escape tags in XML", that was a response to the argument of entering non-printable characters being a problem.
 
...are we all on the same page now  :-)

Ben Brumfield

unread,
Apr 8, 2013, 3:01:25 PM4/8/13
to root...@googlegroups.com
Excellent -- thanks.

I'm in favor of using backslash to escape special characters in UCF,
and will be happy to propose that as part of a UCF standard. The
argument for backslash is clear: it's visible on the keyboard sitting
in front of me, which means that it can be used anywhere I'm doing
data-entry. If someone wants to convert it to an invisible ASCII or
Unicode character behind the scenes for internal processing, that's
fine by me, but if standards are used as interchange formats, the
resulting formats will eventually be hand-edited by someone down the
line. (I speak having just manually converted some horrible
DOCX-created, OOffice-converted HTML to legitimate XHTML, by hand.)
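A sketch of that convention, assuming the escapable set is the UCF metacharacters plus the backslash itself (`escape_ucf`/`unescape_ucf` are hypothetical helpers, not part of any UCF definition):

```python
UCF_SPECIALS = '?*[]\\'   # assumed set of characters needing escape

def escape_ucf(text):
    """Backslash-escape literal UCF metacharacters in transcribed text."""
    return ''.join('\\' + c if c in UCF_SPECIALS else c for c in text)

def unescape_ucf(text):
    """Undo escape_ucf: a backslash makes the next character literal."""
    out, i = [], 0
    while i < len(text):
        if text[i] == '\\' and i + 1 < len(text):
            out.append(text[i + 1]); i += 2
        else:
            out.append(text[i]); i += 1
    return ''.join(out)
```

The two functions round-trip, so tools can freely convert between the visible and any internal representation.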

Ben

Ben Laurie

unread,
Apr 11, 2013, 5:01:12 AM4/11/13
to root...@googlegroups.com
On 8 April 2013 17:44, Luther Tychonievich <tychon...@gmail.com> wrote:
> Regex-to-regex is a well-known problem in discrete math, equivalent to
> determining if the intersection of two finite automata is empty.
>
> The basic steps of the algorithm are
> (1) turn each regular expression into an NFA
> (2) compute the complement of both NFA
> (3) union the complements
> (4) complement the union
> (5) search for paths that reach an accepting state: if any, the two regexes
> "match".
>
> Noticeably more difficult than regular expression vs text, and doesn't
> include non-regular features like back-references. See
> http://qntm.org/greenery for one example effort to make it work in practice.

Thanks for that. Perhaps we should let him know that there _is_ a use
for regex-to-regex. You seem to hint there are other implementations.
Do you know of any?

> Back-references actually mean that matching is not even context-free, though
> I do not know if it is still be tractable or not.

He suggests not:

"Strictly speaking, regular-in-the-mathematical-sense languages cannot
include back-references. The inclusion of back-references moves us out
of the regular languages and into what Wikipedia seems to call the
"language of squares". For example, the "regular expression" (.*)\1
describes the set of repeating strings: "aa", "papa", "wikiwiki", and
so on. This is not a regular language. Because it is not a regular
language, it does not have an equivalent finite state machine and the
algorithms employed by this program cannot be applied.

In fact, "square" languages require a full-blown Turing machine to handle.

Turing machines can of course be combined, but turning a Turing
machine back into a regular expression with backreferences is
basically impossible."

I actually slightly doubt that last statement (because you can always
build a universal Turing machine), but I'm prepared to believe it's
not possible to turn it into a _useful_ regex with backreferences :-)

Tony Proctor

unread,
Apr 12, 2013, 3:41:30 AM4/12/13
to root...@googlegroups.com
Do you really mean regex-to-regex, as in the full implementation Ben? I have
assumed in this thread that you were only concerned with the UCF subset of
the regex syntax.

If so then what's the use-case for the full implementation?

Tony Proctor

Ben Laurie

unread,
Apr 12, 2013, 7:31:08 AM4/12/13
to root...@googlegroups.com
On 12 April 2013 08:41, Tony Proctor <to...@proctor.net> wrote:
> Do you really mean regex-to-regex, as in the full implementation Ben? I have
> assumed in this thread that you were only concerned with the UCF subset of
> the regex syntax.

One side should be full regex, the other could be a subset
corresponding to UCF ... I'm not sure that leads to any useful
simplification though. Nor even UCF to UCF.

Tony Proctor

unread,
Apr 12, 2013, 7:41:27 AM4/12/13
to root...@googlegroups.com
This could be a huge bottleneck if you're doing it in database queries, Ben,
even with stored procedures. Even traditional */? wildcards - which have
intrinsic function support - kill query performance because they cannot use
any indexes.

I'm not suggesting that there's a way to avoid it - only that the
implementation has to be extremely efficient for it to be practical.

Luther Tychonievich

unread,
Apr 12, 2013, 8:21:44 AM4/12/13
to root...@googlegroups.com
> You seem to hint there are other implementations. Do you know of any?

Most computational theory textbooks have toy implementations for pedagogical purposes. The one I linked to is the only one I know of that tries to be efficient in any useful sense of the word.

There might be room to make a much simpler and more space-efficient version using Thompson NFAs instead of DFAs. I might look into that.

> Back-references actually mean that matching is not even context-free, though I do not know whether it is still tractable.

> He suggests not: [...]

Even matching regular expressions with back-references to text is NP-complete (and thus intractable in theory), but in practice people write very simple back-reference expressions that are easily evaluated. If back-references are important (which, incidentally, I think for transcription they are; I often find myself thinking "I don't know what this letter is but whatever it is it's the same as that other letter") then the goal is to develop something that is fast for common use cases of back-references.
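Python's re module illustrates both points: back-references are supported in practice even though they take patterns outside the regular languages. A small demonstration of the "this letter equals that other letter" case:

```python
import re

# \1 must repeat whatever group 1 matched: two identical adjacent
# characters, e.g. an illegible doubled letter where we only know
# both letters are the same.
repeat = re.compile(r'(.)\1')

# (.*)\1 describes the "language of squares" (a string followed by
# itself) mentioned in the greenery write-up; not a regular language.
square = re.compile(r'(.*)\1')
```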
 
"Turing machines can of course be combined, but turning a Turing
machine back into a regular expression with backreferences is
basically impossible."

I actually slightly doubt that last statement (because you can always
build a universal Turing machine), but I'm prepared to believe it's
not possible to turn it onto a _useful_ regex with backreferences :-)

This is, as far as I know, an open question. I have not seen any construction for taking an arbitrary Turing Machine and creating a RegEx+BR that matches the same language, nor a proof that such does not exist.


> Do you really mean regex-to-regex, as in the full implementation Ben? I have assumed in this thread that you were only concerned with the UCF subset of the regex syntax.

If you have no unbounded repetition constructs (i.e., no * or + though {n,m} is still allowed) then the problem becomes much simpler (at least in theory) because every regex represents only a finite (though very large) set of finite (and relatively small) strings. I expect this restriction is sufficient for every transcription problem. I do not know of existing work on matching this restricted set when augmented by back-references, but I might look into it next time I feel like messing with computational theory.

> One side should be full regex, the other could be a subset corresponding to UCF ... I'm not sure that leads to any useful simplification though. Nor even UCF to UCF.

If one string contains character sets (e.g., [ceo] or .) but no other regex operations then we can match full regexes to it directly using slight tweaks of existing Thompson NFA algorithms. If we add in alternation (|), bounded repetition ({n,m}), or back-references then it is not immediately obvious how to make matching simpler, but I expect some simpler-than-full-regex-matching algorithm does exist.