Please help me name a module for lossy text compression

Ben Deutsch

Dec 19, 2012, 2:56:10 PM
to module-...@perl.org
Hello,

I'm writing a small module to apply "lossy" filters to text, to enable
better subsequent lossless compression. For example, "Hello, World!"
would become "hello, world!" with the "lowercase" filter, or "Hello
World" with the punctuation removal filter. The module does not perform
the actual compression; it only reduces the entropy of the text in
question.
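
A minimal sketch of the two filters described above. The function names
`lowercase` and `strip_punctuation` are hypothetical, chosen only for
illustration; they are not taken from the module itself.

```perl
use strict;
use warnings;

# Hypothetical filter: fold everything to lower case.
sub lowercase {
    my ($text) = @_;
    return lc $text;
}

# Hypothetical filter: drop ASCII punctuation characters.
sub strip_punctuation {
    my ($text) = @_;
    $text =~ s/[[:punct:]]+//g;
    return $text;
}

print lowercase('Hello, World!'), "\n";          # hello, world!
print strip_punctuation('Hello, World!'), "\n";  # Hello World
```

Both filters discard information (case, punctuation) that a lossless
compressor would otherwise have to encode.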

As a working title, I'm using

Text::Lossy

as the module name. But "Text" is quite a large and well-known top-level
namespace, so I'm asking if this is a good fit, and if not, what I might
call the module instead.

One thing I do *not* want to do is place it in the "Acme" namespace –
the module may sound a bit silly, but it strives to do exactly what it
says on the tin: every filter reduces the entropy while still retaining
most of the meaning. For example, reducing the entire text to the empty
string (while great for compression) is straight out.

Thanks for your time,
Ben Deutsch

Brian Katzung

Dec 19, 2012, 10:26:35 PM
to module-...@perl.org
Ben,

How about creating Text::Filter::LowerCase and Text::Filter::Unpunctuate
as derived classes of Text::Filter?

- Brian
--
Brian Katzung, Kappa Computer Solutions, LLC
Offering web, client/server, open source, and traditional
software development and mixed operating system support
for business, education, and science
Phone: 847.412.0713 http://www.kappacs.com

Brian Katzung

Dec 20, 2012, 11:09:38 AM
to module-...@perl.org
On second thought... Text::Filter::NoPunctuation is probably better than
::Unpunctuate.

However, a more general solution might be Text::Filter::Transliterate
(using "tr" with "from" and "to" mappings passed to the filter) and
Text::Filter::Delete (deleting characters specified according to a
string or regex).
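
A rough sketch of what those two general filters might do internally.
The function names `transliterate` and `delete_chars` are assumptions
made for illustration; this is not the Text::Filter API.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a "from" -> "to" character mapping.
# tr/// builds its table at compile time, so a mapping supplied at
# run time needs a string eval. Only pass trusted strings here.
sub transliterate {
    my ($text, $from, $to) = @_;
    eval "\$text =~ tr/$from/$to/; 1" or die $@;
    return $text;
}

# Hypothetical helper: delete everything matching a qr// pattern.
sub delete_chars {
    my ($text, $pattern) = @_;
    $text =~ s/$pattern//g;
    return $text;
}

print transliterate('Hello, World!', 'A-Z', 'a-z'), "\n";     # hello, world!
print delete_chars('Hello, World!', qr/[[:punct:]]/), "\n";   # Hello World
```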

- Brian


Paul Bennett

Dec 20, 2012, 11:08:19 AM
to module-...@perl.org
I have a similar conundrum. I'm writing a suite of preprocessors for Rozenshtein delta functions (aka Encoded Characteristic functions). The basic idea is really simple, and I imagine embedded quite deeply in the Perl psyche. The idea is that the delta function δ[x⊜y], where ⊜ is any comparison operator and the expression x⊜y returns boolean true or false, evaluates the comparison and returns 1 if true, or 0 if false.

The great thing (for me) is that there are rewrites from delta functions to plain ol' math for all the common comparisons -- rewrites that can be expressed in languages without a built-in "IF" or "CASE" statement. For example, δ[x=y] can be expanded to 1 - abs(sign(x - y)). The Perl expansion of any delta function δ[x⊜y] is simply !!(x⊜y), but not all languages are as forgiving.
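
To make the δ[x=y] rewrite concrete, here is a small Perl sketch that compares the arithmetic expansion with the Perl shortcut. The helper names `sign`, `delta_eq_arith` and `delta_eq_perl` are made up for this illustration; Perl has no built-in sign(), so the sketch derives it from the `<=>` operator.

```perl
use strict;
use warnings;

# sign() is not a Perl builtin; <=> against 0 yields -1, 0 or 1.
sub sign { my ($n) = @_; return $n <=> 0; }

# delta[x = y] as plain arithmetic: 1 - abs(sign(x - y)).
sub delta_eq_arith {
    my ($x, $y) = @_;
    return 1 - abs(sign($x - $y));
}

# The Perl shortcut: !!(x == y). Adding 0 makes the false value print as 0.
sub delta_eq_perl {
    my ($x, $y) = @_;
    return !!($x == $y) + 0;
}

for my $pair ([3, 3], [3, 5], [5, 3]) {
    my ($x, $y) = @$pair;
    printf "x=%d y=%d arith=%d perl=%d\n",
        $x, $y, delta_eq_arith($x, $y), delta_eq_perl($x, $y);
}
```

Both forms give 1 when x equals y and 0 otherwise.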

Anyway. Enough exposition. The problem I'm solving is to provide a suite of rewrite engines for various programming languages that detect embedded δ functions and rewrite them using the appropriate syntax for the containing language.

Right now, I'm considering the Text::Rewrite::DeltaExpression::(.*) namespace where $1 is the target language. Any suggestions from y'all about better places to put it? Does Text::Filter::DeltaExpression::(.*) more closely fit what the community would be expecting?

Thanks,

--
Paul Bennett (aka PWBENNETT)

Darren Chamberlain

Dec 20, 2012, 11:31:47 AM
to module-...@perl.org
On Thu, Dec 20, 2012 at 11:09 AM, Brian Katzung <bri...@kappacs.com> wrote:
> On second thought... Text::Filter::NoPunctuation is probably better than
> ::Unpunctuate.

::StripPunctuation would be even more descriptive.

--
Darren Chamberlain <d...@sevenroot.org>

Ben Deutsch

Dec 20, 2012, 2:15:08 PM
to module-...@perl.org
Hello everyone,

> How about creating Text::Filter::LowerCase and
> Text::Filter::Unpunctuate as derived classes of Text::Filter?

I had peeked at Text::Filter before and judged it great as a transport
mechanism (handling various inputs and outputs, which my module
deliberately would *not* cover). But I came away with the impression
that it was strictly line-based, which is not enough for some of the
filters I'm planning, and that it couldn't handle filter parameters.

However, I've had another look, and discovered the peek-and-pushback
mechanisms. So I'll subclass Text::Filter.

I'm now going with Text::Filter::Lossy, since I want to keep (most of
the) filters in one place, and also to enable easy programmatic
sub-filter selection, for example from command line arguments.
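
One way such programmatic selection could look is a dispatch table keyed
by filter name. This is only a sketch: the filter names `lower` and
`punctuation`, the `%filters` table and `apply_filters` are all
hypothetical and not part of any published module.

```perl
use strict;
use warnings;

# Hypothetical registry mapping filter names to code references.
my %filters = (
    lower       => sub { lc $_[0] },
    punctuation => sub { (my $t = $_[0]) =~ s/[[:punct:]]+//g; $t },
);

# Apply the named filters in order; die on an unknown name.
sub apply_filters {
    my ($text, @names) = @_;
    for my $name (@names) {
        my $filter = $filters{$name} or die "Unknown filter: $name\n";
        $text = $filter->($text);
    }
    return $text;
}

# Filter names come from the command line, with a default selection.
my @selected = @ARGV ? @ARGV : qw(lower punctuation);
print apply_filters('Hello, World!', @selected), "\n";
```

Keeping all filters in one table is what makes the command-line lookup
a single hash access.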

Thanks for the pointers, everyone!

Ben Deutsch

Ben Deutsch

Dec 20, 2012, 2:25:07 PM
to module-...@perl.org
Hello Paul,

> I have a similar conundrum. I'm writing a suite of preprocessors for
> Rozenshtein delta functions (aka Encoded Characteristic functions).
> The basic idea is really simple […]
>
> Right now, I'm considering the Text::Rewrite::DeltaExpression::(.*)
> namespace where $1 is the target language. Any suggestions from y'all
> about better places to put it? Does
> Text::Filter::DeltaExpression::(.*) more closely fit what the
> community would be expecting?

As Brian pointed out in a reply to my question, there's a "Text::Filter"
module, which is meant for subclassing. Text::Filter::DeltaExpression::
would sound like a subclass of this to me. Have you had a look at
Text::Filter? If your filters work primarily line-based, it's probably a
good fit.

Regards,
Ben Deutsch

Ben Deutsch

Feb 3, 2013, 5:35:40 AM
to module-...@perl.org
Hello again,

I've had several looks at Text::Filter, and I've decided not to base my
module on it – it's almost exactly orthogonal, so deriving from it would
gain very little but add a dependency.

I'm going with Text::Lossy again; thanks again though.

I'm building a distribution out of this, using ExtUtils::MakeMaker. I'm
including a (Unix-filter-like) script which applies this module to given
texts. However, I wouldn't want to require everyone who installs the
module to also install the script.

How can I best handle this? Make it accessible to everyone while not
adding it to "installed exe-files"? So far, I'm including the script in
the manifest, inside a "script" directory, but don't have it as part of
Makefile.PL.

Thanks for any input,
Ben Deutsch

Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯

Feb 3, 2013, 8:17:57 AM
to module-...@perl.org
> I'm building a distribution out of this, using ExtUtils::MakeMaker.
One usually establishes the requirements first, then picks the
appropriate build tool. When there is any customisation, EUMM loses out
in favour of M::B.

> I wouldn't want to require everyone who
> installs the module to also install the script.
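use Module::Build;

my $build = Module::Build->new(…);
# Add the example scripts to the install list only if the user agrees.
# "|| {}" guards against script_files being unset.
$build->script_files([keys %{ $build->script_files || {} }, glob 'examples/*'])
    if 'y' eq $build->prompt(
        'Install the example scripts shipping in this distribution? (y/n)', 'y');
$build->create_build_script;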

I advise you to install by default. Think of the consequences. If the
distro installs the scripts and the user doesn't want them, they are easily
deleted. Most of the time, though, they are not noticed and just take
up a tiny bit of disk space.

However, if they are not installed and the user wants them, he must run
the installation manually again, which is super-annoying. This takes
up time, which is a more precious resource. One bad example of picking
the wrong default is libwww-perl: you now need to call
Makefile.PL --aliases to install the scripts that earlier versions
installed without doing anything special.






Ben Deutsch

Feb 6, 2013, 10:42:27 AM
to module-...@perl.org
Hello,

Thanks for the reply, and sorry for the delay.

On 02/03/2013 02:17 PM, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 wrote:
>> I'm building a distribution out of this, using
>> ExtUtils::MakeMaker.
> One usually establishes the requirements first, then picks the
> appropriate build tool. When there is any customisation, EUMM loses
> out in favour of M::B.

It wasn't so much a conscious choice as the default of module-starter.
I'd been planning to use Module::Build for any future work; thanks for
the additional point of view.

>> I wouldn't want to require everyone who installs the module to also
>> install the script.
> […]
> I advise you to install by default. Think of the consequences. If
> the distro installs the scripts and the user doesn't want them, they are
> easily deleted. Most of the time, though, they are not noticed and
> just take up a tiny bit of disk space.
>
> However, if they are not installed and the user wants them, he must
> run the installation manually again, which is super-annoying. This
> takes up time, which is a more precious resource.[…]

True. I'll install the script; it's not an example so much as the
de-facto command line interface to the module.
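
With ExtUtils::MakeMaker, installing a script by default usually comes
down to listing it under EXE_FILES. A minimal Makefile.PL sketch,
assuming the script lives at script/text-lossy and the module at
lib/Text/Lossy.pm (both paths are guesses for illustration):

```perl
use strict;
use warnings;
use ExtUtils::MakeMaker;

WriteMakefile(
    NAME         => 'Text::Lossy',
    VERSION_FROM => 'lib/Text/Lossy.pm',      # assumed module location
    EXE_FILES    => ['script/text-lossy'],    # hypothetical script name
);
```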

Thanks,
Ben Deutsch
