Any easy way to remove accents?

James Mansion

Dec 30, 2015, 4:56:28 PM
to Shake build system
I have a Shake system which copies the directory tree:

WMA/<artists>/<albums>/<stuff>
to
FLAC/<artists>/<albums>/<stuff>

where the copy is direct except in the case where the file has extension
.wma, in which case I use ffmpeg to convert the wmalossless to flac and
change the extension.

This is running on a FreeBSD 10.1 system.

The wma files (and the album art) are written directly using a Samba
mount from Windows, using Windows Media Player.

Periodically the process fails, because ffmpeg refuses to open the
output file. This happens when there are accented characters.


Ideally, I'd like to process the file path strings so that (for example)
an accented e is reduced to an e and so on, for some 'sane' mapping that
leaves the input recognisable.

Is there a library to do such a thing?


Also, my NAS has two CPUs, but I find that Shake seems to be serialising
the cmd executions of ffmpeg and the machine is running at just under
50% load. I didn't do anything fancy to enable threading - I thought
that was unnecessary. Is it expected?

James

Neil Mitchell

Dec 30, 2015, 5:29:46 PM
to James Mansion, Shake build system
Hi James,

> Ideally, I'd like to process the file path strings so that (for example) an
> accented e is reduced to an e and so on, for some 'sane' mapping that leaves
> the input recognisable.
>
> Is there a library to do such a thing?

The built-in Data.Char module from the base library contains
isUpper/toUpper/isAscii, which you can certainly use to spot the
problematic characters and replace them with something like "_".
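
As a minimal sketch of that idea, assuming the paths are ordinary Haskell Strings, the following replaces every non-ASCII character with an underscore. The name sanitisePath is hypothetical and only used for illustration; isAscii is the real Data.Char function.

```haskell
import Data.Char (isAscii)

-- | Replace every non-ASCII character in a path with an underscore.
-- Path separators and dots are ASCII, so the directory layout is kept.
-- (sanitisePath is an illustrative name, not part of any library.)
sanitisePath :: FilePath -> FilePath
sanitisePath = map (\c -> if isAscii c then c else '_')
```

This loses the letter itself, so two names that differ only in their accented characters would map to the same output path.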

Using the text-icu package looks like exactly what you want, but
installing it on Windows is a real pain - I managed it with the steps
at http://stackoverflow.com/a/34538201/160673. Once you have that I am
sure there is a function to do the conversion, but I've no real idea
where.

If it were me, I'd just grab the huge conversion table from
http://stackoverflow.com/a/34272324/160673 and code that up - given
your use case sounds personal, I suspect a pretty good solution with a
few fixups is probably fine.
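
A rough sketch of the table approach, assuming the containers package (shipped with GHC) is available. The table below is a tiny illustrative subset and not the table from the linked answer; accentTable and removeAccents are hypothetical names.

```haskell
import qualified Data.Map.Strict as Map

-- | A deliberately small, illustrative mapping; a real table would
-- cover many more characters.
accentTable :: Map.Map Char String
accentTable = Map.fromList
    [ ('à', "a"), ('á', "a"), ('ç', "c")
    , ('è', "e"), ('é', "e"), ('ñ', "n")
    , ('ö', "o"), ('ü', "u"), ('ß', "ss")
    ]

-- | Replace each character found in the table with its ASCII
-- substitute; characters not in the table are left unchanged.
removeAccents :: String -> String
removeAccents = concatMap (\c -> Map.findWithDefault [c] c accentTable)
```

Mapping to a String rather than a Char lets one character expand to several, as with "ß". Anything missing from the table passes through untouched, so it could still be combined with an isAscii check as a fallback.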

> Also, my NAS has two CPUs, but I find that Shake seems to be serialising the
> cmd executions of ffmpeg and the machine is running at just under 50% load.
> I didn't do anything fancy to enable threading - I thought that was
> unnecessary. Is it expected?

Are you using shakeThreads=2 or -j2? By default Shake uses 1 CPU, but
-j2 will make it use 2, and -j will make it pick appropriately based
on your hardware.
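
For illustration, a sketch of setting the thread count in the build script itself. The rule shown is only an assumption about what the WMA to FLAC conversion might look like (the pattern, paths and ffmpeg arguments are hypothetical); shakeThreads, shakeArgs and cmd_ are the real Shake names, with cmd_ requiring a reasonably recent Shake.

```haskell
import Development.Shake
import Development.Shake.FilePath

main :: IO ()
-- shakeThreads = 2 runs up to two rules at once; 0 asks Shake to
-- match the detected number of processors.
main = shakeArgs shakeOptions{shakeThreads = 2} $ do
    -- Hypothetical rule: build each FLAC file from the matching WMA file.
    "FLAC//*.flac" %> \out -> do
        let src = "WMA" </> dropDirectory1 out -<.> "wma"
        need [src]
        cmd_ "ffmpeg" ["-y", "-i", src, out]
```

Because the script uses shakeArgs, passing -j2 on the command line should have the same effect without changing the source. The sketch has no `want`, so targets would be named on the command line.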

Thanks, Neil

David Turner

Feb 6, 2016, 7:23:02 AM
to Shake build system
Hi James,

On Wednesday, December 30, 2015 at 9:56:28 PM UTC, James Mansion wrote:
> I have a Shake system which copies the directory tree:
>
> WMA/<artists>/<albums>/<stuff>
>    to
> FLAC/<artists>/<albums>/<stuff>
>
> where the copy is direct except in the case where the file has extension
> .wma, in which case I use ffmpeg to convert the wmalossless to flac and
> change the extension.
>
> This is running on a FreeBSD 10.1 system.
>
> The wma files (and the album art) are written directly using a Samba
> mount from Windows, using Windows Media Player.
>
> Periodically the process fails, because ffmpeg refuses to open the
> output file.  This happens when there are accented characters.
>
> Ideally, I'd like to process the file path strings so that (for example)
> an accented e is reduced to an e and so on, for some 'sane' mapping that
> leaves the input recognisable.
>
> Is there a library to do such a thing?

I had almost exactly the same problem and I used iconv successfully:

https://hackage.haskell.org/package/iconv-0.4.1.3/docs/Codec-Text-IConv.html#v:convertFuzzy

However this relies on the GNU libiconv which may take some effort to install on BSD (although I'm sure it is possible).
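
A sketch of how convertFuzzy might be applied to a path, assuming the iconv, text and bytestring packages are installed and that the paths are valid Unicode Strings. toAsciiPath is a hypothetical name. The exact replacement characters depend on the iconv implementation underneath, so no output is shown.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- | Transliterate a path to ASCII via iconv. Characters with no ASCII
-- equivalent are approximated by whatever the underlying iconv chooses.
-- (toAsciiPath is an illustrative name, not part of the iconv package.)
toAsciiPath :: FilePath -> FilePath
toAsciiPath =
      BL.unpack                                                 -- ASCII bytes back to String
    . IConv.convertFuzzy IConv.Transliterate "UTF-8" "ASCII"    -- transliterate
    . TLE.encodeUtf8                                            -- String to UTF-8 bytes
    . TL.pack
```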

Cheers,

David