How to alter a static html filename encoding in order to shorten filename length


oleghbond

Nov 18, 2016, 2:18:24 AM
to TiddlyWiki
The problem is not new. When generating static HTML files from a TW5 wiki with Cyrillic titles, one gets filenames something like this:
A single Cyrillic letter is encoded as 6 (!) ASCII characters. Given a typical path length of 30-40 characters and the filename length limit of about 255 imposed by many operating systems, the usable title length comes out at roughly 35 characters, which can be annoying.
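
The arithmetic can be checked directly in JavaScript. This is only a minimal sketch: it assumes the filename is produced with `encodeURIComponent`, and the 255-character limit and 40-character path are illustrative figures taken from the estimate above, not measured values.

```javascript
// A Cyrillic letter is 2 bytes in UTF-8, and each byte becomes "%XX".
const letter = "\u0410"; // Cyrillic capital A
const encoded = encodeURIComponent(letter);
console.log(encoded, encoded.length); // %D0%90 6

// Illustrative numbers only: a 255-character limit and a 40-character path.
const LIMIT = 255;
const PATH_LENGTH = 40;
const maxTitleLength = Math.floor((LIMIT - PATH_LENGTH) / encoded.length);
console.log(maxTitleLength); // 35
```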

The point is that when, during static HTML generation, the full path-and-filename length exceeds 255, the process aborts with an error message along the lines of "Too long filename".

It would be great to have a way of switching (say, a checkbox in Settings) the static HTML filename encoding method in order to shorten filenames. It could be the transliteration method, which, by the way, is already used when generating ".tid" filenames under Node.js. Or at least a way of changing the static HTML filename encoding method in the JS source code.

Jeremy Ruston

Nov 18, 2016, 2:38:57 PM
to tiddl...@googlegroups.com
Hi Oleg

The problem is not new. When generating static HTML files from a TW5 wiki with Cyrillic titles, one gets filenames something like this:
A single Cyrillic letter is encoded as 6 (!) ASCII characters. Given a typical path length of 30-40 characters and the filename length limit of about 255 imposed by many operating systems, the usable title length comes out at roughly 35 characters, which can be annoying.

The point is that when, during static HTML generation, the full path-and-filename length exceeds 255, the process aborts with an error message along the lines of "Too long filename".

That makes sense. TiddlyWiki5 already incorporates code to transliterate Cyrillic characters to their Latin equivalents, but it is only used within the file system adaptor for filename generation, and isn’t used for permalinks. Here’s the code:


Here’s the code that generates permalinks:


However, fixing the permalinks isn’t as simple as integrating the transliteration. The problem is that the conversion process for mapping tiddler titles needs to be bidirectional: we need to be able to recover the original, Cyrillic tiddler title from the encoded form used in the permalink.

So, transliteration isn’t an option. It may be worth exploring whether there is a more efficient encoding mechanism that we could use. And I guess it would be helpful to understand how other sites/apps deal with this problem, if at all.
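
To illustrate why transliteration is hard to reverse, here is a minimal sketch. The three-letter map is a hypothetical example chosen to show the effect; it is not TiddlyWiki's transliteration table.

```javascript
// Hypothetical pairs, for illustration only (not TiddlyWiki's table).
const PAIRS = new Map([
  ["\u0448", "sh"], // Cyrillic "sha"
  ["\u0441", "s"],  // Cyrillic "es"
  ["\u0445", "h"]   // Cyrillic "ha"
]);

function transliterate(text) {
  return Array.from(text, ch => (PAIRS.has(ch) ? PAIRS.get(ch) : ch)).join("");
}

// One letter and a two-letter sequence collapse to the same Latin string,
// so the original title cannot be recovered from the result.
const one = transliterate("\u0448");       // "sh"
const two = transliterate("\u0441\u0445"); // "sh"
console.log(one === two); // true

// Percent-encoding is longer, but it round-trips exactly.
const title = "\u0448\u0445";
console.log(decodeURIComponent(encodeURIComponent(title)) === title); // true
```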

Best wishes

Jeremy




--
You received this message because you are subscribed to the Google Groups "TiddlyWiki" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tiddlywiki+...@googlegroups.com.
To post to this group, send email to tiddl...@googlegroups.com.
Visit this group at https://groups.google.com/group/tiddlywiki.
To view this discussion on the web visit https://groups.google.com/d/msgid/tiddlywiki/2bd639e4-56a7-4ffd-b273-840f8befb638%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

PMario

Nov 18, 2016, 3:51:46 PM
to TiddlyWiki
On Friday, November 18, 2016 at 8:38:57 PM UTC+1, Jeremy Ruston wrote:
However, fixing the permalinks isn’t as simple as integrating the transliteration. The problem is that the conversion process for mapping tiddler titles needs to be bidirectional: we need to be able to recover the original, Cyrillic tiddler title from the encoded form used in the permalink.

So, transliteration isn’t an option. It may be worth exploring whether there is a more efficient encoding mechanism that we could use. And I guess it would be helpful to understand how other sites/apps deal with this problem, if at all.

Hi,
Just an idea.
If we implement a simple version of fuzzy search, which imo will be highly useful in general, it could use the transliteration extension, if the exact title isn't found.
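
A rough sketch of what such a fallback could look like, assuming a hypothetical `transliterate` helper and a plain array of titles (neither is an existing TiddlyWiki API):

```javascript
// Hypothetical transliteration pairs, for illustration only.
const PAIRS = new Map([
  ["\u043f", "p"], ["\u0440", "r"], ["\u0438", "i"],
  ["\u0432", "v"], ["\u0435", "e"], ["\u0442", "t"]
]);

function transliterate(text) {
  return Array.from(text.toLowerCase(), ch => (PAIRS.has(ch) ? PAIRS.get(ch) : ch)).join("");
}

// Try the exact title first; only then compare transliterated forms.
function findTitle(requested, titles) {
  if (titles.includes(requested)) {
    return requested;
  }
  const wanted = transliterate(requested);
  return titles.find(title => transliterate(title) === wanted) || null;
}

const cyrillicTitle = "\u043f\u0440\u0438\u0432\u0435\u0442"; // Russian "privet"
const titles = [cyrillicTitle, "Home"];
console.log(findTitle("privet", titles) === cyrillicTitle); // true
```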
-m

Jeremy Ruston

Nov 18, 2016, 3:59:41 PM
to tiddl...@googlegroups.com
Hi Mario

Hi,
Just an idea.
If we implement a simple version of fuzzy search, which imo will be highly useful in general, it could use the transliteration extension, if the exact title isn't found.
-m

Interesting idea. I’d need some persuading that there wouldn’t be unintended consequences. The alternative I’ve wondered about is to support an explicit “slug” field, like WordPress:


The slug could be optional, defaulting to the URI encoded title as at present. There would need to be a requirement that the slug only contain the limited set of characters permitted in a URL fragment identifier.
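
As a minimal sketch of that default, assuming tiddlers are modelled as plain objects and that `slug` is the hypothetical new field (nothing like it exists in the core today):

```javascript
// "slug" is a hypothetical field; it is assumed to be already restricted
// to characters that are allowed in a URL fragment identifier.
function permalinkFragment(tiddler) {
  if (tiddler.slug) {
    return "#" + tiddler.slug;
  }
  // No slug: fall back to the URI-encoded title, as at present.
  return "#" + encodeURIComponent(tiddler.title);
}

console.log(permalinkFragment({ title: "\u041f\u0440\u0438\u0432\u0435\u0442", slug: "pryvit" })); // #pryvit
console.log(permalinkFragment({ title: "Hello World" })); // #Hello%20World
```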

Best wishes

Jeremy.





PMario

Nov 19, 2016, 7:09:10 AM
to TiddlyWiki
On Friday, November 18, 2016 at 9:59:41 PM UTC+1, Jeremy Ruston wrote:
Hi Mario

Hi,
Just an idea.
If we implement a simple version of fuzzy search, which imo will be highly useful in general, it could use the transliteration extension, if the exact title isn't found.
-m

Interesting idea. I’d need some persuading that there wouldn’t be unintended consequences. The alternative I’ve wondered about is to support an explicit “slug” field, like WordPress:


The slug could be optional, defaulting to the URI encoded title as at present. There would need to be a requirement that the slug only contain the limited set of characters permitted in a URL fragment identifier.

So as I read it, there would be some generic rules:

 1) the slug-field is optional
 2) the slug character set is limited to eg: Latin characters, digits, and hyphen "-" ... nothing else
 3) if no slug field is present, use the permalink mechanism as it is today.
 4) The whole slug creation process is manual ... so the OT thoughts about some helper functions don't apply.

My concern here is point 4). Since an automatic slug creation process is language specific, it is plugin territory. eg: Cyrillic needs different handling than German.
If we don't provide best practice documentation, we will end up with an unpredictable mess.

As soon as the slug field is implemented, we will get feature requests about auto-creation. So imo it's worth thinking about it now already. See OT.

------------- OT ----------------

Some plugin implementation thoughts

add 1)
 1.1 If the slug-field is empty for eg: "Hello World", what happens if the user opens a permalink +#hello-world?
 1.2 How is a duplication such as "Hello World" vs "hello world" resolved? Both create the same slug: #hello-world
 1.3 How do we resolve transliteration in a human-predictable way? This is language dependent; see the OP, or for German: Hände vs hande vs haende

add 2)
 2.1 The URI-fragment definition allows URLencoded chars. ... We don't want this.
 2.2 So we will limit it to: ALPHA / DIGIT / "-"
 2.3 how do we resolve duplications? see 1.2
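
To make points 1.2, 2.2 and 2.3 concrete, here is a minimal sketch. `slugify` and `findCollisions` are hypothetical helpers, and the rule "lower-case, then replace everything outside a-z and 0-9 with a hyphen" is just one possible choice.

```javascript
// Hypothetical helper: restrict the slug to ALPHA / DIGIT / "-".
function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // any run of other characters becomes "-"
    .replace(/^-+|-+$/g, "");    // trim leading and trailing hyphens
}

// Group titles by slug so that collisions become visible.
function findCollisions(titles) {
  const bySlug = new Map();
  for (const title of titles) {
    const slug = slugify(title);
    bySlug.set(slug, (bySlug.get(slug) || []).concat(title));
  }
  return [...bySlug].filter(([, group]) => group.length > 1);
}

console.log(slugify("Hello World")); // hello-world
console.log(findCollisions(["Hello World", "hello world", "Other"]));
```

Note that this naive version simply drops non-ASCII letters, which is exactly where a language-specific transliteration step would have to come in.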
      
links:
 - URI fragment is defined at: RFC3986 see: appendix A for the chars allowed.

------------- OT end ----------------

have fun!
mario

sini-Kit

Nov 19, 2016, 7:11:50 AM
to TiddlyWiki
I had the same problem, so I use the caption field for the real title of my Russian tiddlers and the title field for the URL.


my new variant with caption field http://heeg.ru/heeg.html#google_form_connect

On Friday, November 18, 2016 at 10:18:24 AM UTC+3, oleghbond wrote:

PMario

Nov 19, 2016, 7:12:01 AM
to TiddlyWiki
On Friday, November 18, 2016 at 9:59:41 PM UTC+1, Jeremy Ruston wrote:
Hi Mario

Hi,
Just an idea.
If we implement a simple version of fuzzy search, which imo will be highly useful in general, it could use the transliteration extension, if the exact title isn't found.
-m

Interesting idea. I’d need some persuading that there wouldn’t be unintended consequences. The alternative I’ve wondered about is to support an explicit “slug” field, like WordPress:

I think fuzzy search could avoid all the problems that we will have to deal with, in the future.

-m

Jeremy Ruston

Nov 19, 2016, 7:39:11 AM
to tiddl...@googlegroups.com
Hi Mario

 1) the slug-field is optional
 2) the slug character set is limited to eg: Latin characters, digits, and hyphen "-" ... nothing else

Actually, there’s no reason to ban other characters; they will just lead to ugly URLs because of the encoding.

 3) if no slug field is present, use the permalink mechanism as it is today. 
 4) The whole slug creation process is manual ... so the OT thoughts about some helper functions don't apply.

The slug creation process doesn’t need to be manual. We could have a checkbox in the edit template whereby the “slug” field is automatically updated via a transliteration macro each time the draft title is changed.

My concern here is point 4). Since an automatic slug creation process is language specific, it is plugin territory. eg: Cyrillic needs different handling than German.
If we don't provide best practice documentation, we will end up with an unpredictable mess. 

We could, of course, make the transliteration function be part of the translation plugin.

As soon as the slug field is implemented, we will get feature requests about auto-creation. So imo it's worth thinking about it now already. See OT.

------------- OT ----------------

Some plugin implementation thoughts 

add 1) 
 1.1 If the slug-field is empty for eg: "Hello World", what happens if the user opens a permalink +#hello-world?

Nothing.

 1.2 how is duplication like: "Hello World" vs "hello world" resolved? it creates the same slug: #hello-world

If multiple slugs match we could actually open all of them. But obviously when using this mechanism in the context of generating static HTML files, then just the first matching slug would be used.
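
A small sketch of that behaviour, again assuming plain objects with a hypothetical `slug` field rather than any existing API:

```javascript
// Return every tiddler whose hypothetical "slug" field equals the requested slug.
function findBySlug(tiddlers, slug) {
  return tiddlers.filter(tiddler => tiddler.slug === slug);
}

const tiddlers = [
  { title: "Hello World", slug: "hello-world" },
  { title: "hello world", slug: "hello-world" },
  { title: "Other", slug: "other" }
];

const matches = findBySlug(tiddlers, "hello-world");

// In the browser: open all matching tiddlers.
const toOpen = matches.map(tiddler => tiddler.title);

// When generating static HTML files: use only the first match.
const staticTarget = matches.length > 0 ? matches[0].title : null;

console.log(toOpen, staticTarget);
```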

 1.3 How do we resolve transliteration in a human-predictable way? This is language dependent; see the OP, or for German: Hände vs hande vs haende

You’re saying that the same character would be transliterated differently in different contexts? Does that even lend itself to an algorithm?

One benefit of decoupling the slug handling from the creation is that the author would always be able to adjust an inappropriate transliteration.


add 2) 
 2.1 The URI-fragment definition allows URLencoded chars. ... We don't want this. 

Why not? The spec says that the fragment is URL encoded, why wouldn’t we go along with that?

 2.2 So we will limit it to: ALPHA / DIGIT / "-"
 2.3 how do we resolve duplications? see 1.2

As I say, I don’t think we should get too hung up about this when we’re already stuck with HTML’s behaviour in this regard.

Best wishes

Jeremy

       

Jeremy Ruston

Nov 19, 2016, 7:41:35 AM
to tiddl...@googlegroups.com
Hi Mario

I think fuzzy search could avoid all the problems that we will have to deal with, in the future.

I remain extremely doubtful and hence would be interested to see experiments. The approach seems to put more emphasis on the quality of our transliteration algorithms, and I can’t see how it could be adequately fuzzy for the job without also being fuzzy enough to lead to misleading behaviour.

Best wishes

Jeremy



oleghbond

Nov 19, 2016, 10:36:02 AM
to TiddlyWiki
Jeremy, thanks a lot for quick reply.

Based on my earlier post https://goo.gl/P3INQY (in Ukrainian), I can suggest transliteration pairs that are both phonetic and, I believe, reciprocal for the Ukrainian (and Russian) alphabet, as follows:
Ukrainian-Russian
А => A, Б => B, В => V, Д => D, Ж => ZH, З => Z, Й => J, К => K, Л => L, М => M, Н => N, О => O, П => P, Р => R, С => S, Т => T, У => U, Ф => F, Х => X, Ц => C, Ч => CH, Ш => SH, Щ => W, Ь => Q, Ю => JU, Я => JA, ' => ~
Ukrainian-only
Г => GH, Ґ => G, Е => E, Є => JE, И => Y, І => I, Ї => JI, ' => ~
Russian-only
Г => G, Е => JE, Ё => JO, Ъ => ~, Ы => Y, Э => E
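
As a rough sketch of how the pairs could be applied, the snippet below uses the shared and Ukrainian-only pairs from the lists above. The helper names are made up for illustration, and the letter-by-letter lookup is an assumption about how the table is meant to be used.

```javascript
// Pairs copied from the lists above: shared Ukrainian-Russian plus Ukrainian-only.
const SHARED = "А=>A, Б=>B, В=>V, Д=>D, Ж=>ZH, З=>Z, Й=>J, К=>K, Л=>L, М=>M, Н=>N, О=>O, П=>P, Р=>R, С=>S, Т=>T, У=>U, Ф=>F, Х=>X, Ц=>C, Ч=>CH, Ш=>SH, Щ=>W, Ь=>Q, Ю=>JU, Я=>JA, '=>~";
const UKRAINIAN_ONLY = "Г=>GH, Ґ=>G, Е=>E, Є=>JE, И=>Y, І=>I, Ї=>JI";

// Turn "X=>Y, ..." into a Map of single letters to Latin replacements.
function parsePairs(text) {
  return new Map(text.split(",").map(pair => pair.trim().split("=>")));
}

const UKRAINIAN = parsePairs(SHARED + ", " + UKRAINIAN_ONLY);

// Letter-by-letter lookup; characters without a pair pass through unchanged.
function transliterate(text, pairs) {
  return Array.from(text.toUpperCase(), ch => (pairs.has(ch) ? pairs.get(ch) : ch)).join("");
}

const title = "\u041a\u0438\u0457\u0432"; // the Ukrainian spelling of Kyiv
console.log(transliterate(title, UKRAINIAN));  // KYJIV
console.log(encodeURIComponent(title).length); // 24, versus 5 transliterated

// With a plain letter-by-letter reading, the two-letter sequence J-sound + U
// and the single letter YU both give "JU", so decoding needs an extra rule.
console.log(transliterate("\u0419\u0423", UKRAINIAN) === transliterate("\u042e", UKRAINIAN)); // true
```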

It'd be great if it helps.

Olegh

On Friday, November 18, 2016 at 9:38:57 PM UTC+2, Jeremy Ruston wrote:

oleghbond

Nov 19, 2016, 11:09:23 AM
to TiddlyWiki
Jeremy,

I'd like to stress another point: there is a rationale for splitting the generation of tiddler permalinks into two cases:
  1. for the TW environment (/#) and
  2. for multiple static files creation (https://goo.gl/JKjpnr).
In the 2nd case the conversion does not need to be reciprocal.

That is the very case where I've suggested transliteration as an option.

I'd be grateful for your consideration and advice.

Olegh

On Friday, November 18, 2016 at 9:38:57 PM UTC+2, Jeremy Ruston wrote:
Hi Oleg

Jeremy Ruston

Nov 19, 2016, 12:10:48 PM
to tiddl...@googlegroups.com
Hi Oleg

I'd like to stress another point: there is a rationale for splitting the generation of tiddler permalinks into two cases:
  1. for the TW environment (/#) and
  2. for multiple static files creation (https://goo.gl/JKjpnr).
In the 2nd case the conversion does not need to be reciprocal.

That is true, and I’d certainly be open to a fix that only worked for the second case, but would much prefer an approach that we can use universally, if that is possible.

That is the very case where I've suggested transliteration as an option.

It would be fairly straightforward to explore with a fork. You’d need to expand the link widget to offer transliteration as one of the encoding options:
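
As a hedged illustration of the idea, the sketch below expands a link template with the `$uri_encoded$` and `$uri_doubleencoded$` placeholders plus a hypothetical `$transliterated$` placeholder. The function name and the `transliterate` argument are assumptions for the sake of the example; this is not the link widget's actual code.

```javascript
// "$transliterated$" and the transliterate callback are hypothetical additions.
function expandLinkTemplate(template, title, transliterate) {
  // Function replacements avoid special handling of "$" in the inserted text.
  return template
    .replace("$uri_encoded$", () => encodeURIComponent(title))
    .replace("$uri_doubleencoded$", () => encodeURIComponent(encodeURIComponent(title)))
    .replace("$transliterated$", () => transliterate(title));
}

// Stand-in transliteration for the example: upper-case Latin only.
const toyTransliterate = title => title.toUpperCase();

console.log(expandLinkTemplate("#$uri_encoded$", "a b", toyTransliterate));           // #a%20b
console.log(expandLinkTemplate("$uri_doubleencoded$.html", "a b", toyTransliterate)); // a%2520b.html
console.log(expandLinkTemplate("$transliterated$.html", "ab", toyTransliterate));     // AB.html
```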


Let me know if I can help,

Best wishes

Jeremy

oleghbond

Nov 19, 2016, 4:14:46 PM
to TiddlyWiki
Jeremy, thanks for discussion.

My conclusion: 
  1. percent-encoding is neither informative (human readable) nor elegant for permalink use,
  2. percent-encoding seriously restricts the title length in Cyrillic (about 6 times shorter than in Latin) when generating multiple static files from TW5,
  3. a possible way out, which I personally would prefer, is a universal approach to permalink generation based on "title-to-slug" conversion using transliteration.
Until such changes are made, many users like me need some kind of patch to work around the "too long filename" issue during static multi-file site generation.

So I'd be grateful for concise instructions, suitable for novices like me, on how to sort out the issue.

Sincerely
Olegh


On Saturday, November 19, 2016 at 7:10:48 PM UTC+2, Jeremy Ruston wrote:

oleghbond

Nov 28, 2016, 10:27:32 AM
to TiddlyWiki
For the time being we applied the following patch:
  • core\modules\widgets\link.js: 86
/*
wikiLinkText = wikiLinkText.replace("$uri_doubleencoded$",encodeURIComponent(encodeURIComponent(this.to)));  // ORIGINAL 
*/
wikiLinkText = wikiLinkText.replace("$uri_doubleencoded$",this.to); // CHANGED
  • core\modules\commands\rendertiddlers.js: 60
/*
var finalPath = exportPath || path.resolve(pathname,encodeURIComponent(title) + extension);  // ORIGINAL 
*/
var finalPath = exportPath || path.resolve(pathname,title + extension); // CHANGED
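
One caveat with using the raw title as a filename is that a title may contain characters that some file systems reject. A minimal sketch of a possible guard follows; `safeFilename` is a hypothetical helper, not part of TiddlyWiki, and the character list is an assumption based on what Windows disallows.

```javascript
// Hypothetical helper: replace characters that are commonly rejected in
// filenames (the exact set is an assumption) with an underscore.
function safeFilename(title) {
  return title.replace(/[<>:"\/\\|?*\x00-\x1f]/g, "_");
}

const extension = ".html";
console.log(safeFilename("a/b:c") + extension); // a_b_c.html

// Cyrillic letters are left untouched, so the filename stays short.
const cyrillic = "\u041a\u0438\u0457\u0432";
console.log(safeFilename(cyrillic) === cyrillic); // true
```

If something like this were used for the filename, the link template would have to apply the same replacement so that links and files still match.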

Olegh


On Saturday, November 19, 2016 at 11:14:46 PM UTC+2, oleghbond wrote: