convert website to Sinhala Unicode


Bhikkhu Mettavihari

Nov 22, 2007, 2:43:20 AM
to sinhala...@googlegroups.com
Dear Friends

Many years ago I generated the Sinhala Tipitaka using the
Tipitaka font.
I would like to convert this website so that it also has a Unicode copy.

Please give me a few hints on what to do.

If any of you would like to help with this, please let me know.

The site is

MettaNet.org/tipitaka

with metta
Mettavihari

රෂාන්

Nov 22, 2007, 2:56:58 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
I think the "Sinhala Unicode font conversion utility" available at
http://ucsc.cmb.ac.lk/ltrl/?page=downloads&lang=en&style=default
can be used to ease the conversion of legacy fonts to Unicode, but I am
not sure that it supports the Tipitaka font. Maybe Asanka
can answer that :)




Chamara Peiris

Nov 22, 2007, 3:20:17 AM
to Sinhala...@googlegroups.com
I can help on the UI.

Cheers,
--
Chamara Peiris | http://apramana.com | http://sewuma.com | +94 712 967 967

tidalbobo

Nov 22, 2007, 3:50:29 AM
to Sinhala...@googlegroups.com
It is not too complicated.

The pseudocode would be:

get a file [source]
create another file [destination]

while there are unread characters
          read a character
          look up the matching unicode char
          write the char to destination

when there are no more characters in the source file, you are done

Get in touch with me if you need any help on this. Glad to lend a hand.
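
The pseudocode in this post could be sketched in Python roughly as follows. This is only the simple one-character-to-one-character case. The two entries in `LEGACY_TO_UNICODE` are invented for illustration, since the real Tipitaka font table is not given in this thread.

```python
import io

# Hypothetical legacy-to-Unicode table (assumed codes, not the real font table).
LEGACY_TO_UNICODE = {
    "l": "\u0D9A",  # assumed legacy code for the letter ka
    "u": "\u0DB8",  # assumed legacy code for the letter ma
}


def convert_stream(source, destination, table=LEGACY_TO_UNICODE):
    """Read one character at a time, look it up, and write the result."""
    while True:
        ch = source.read(1)
        if ch == "":  # no more unread characters
            break
        # Characters missing from the table are passed through unchanged.
        destination.write(table.get(ch, ch))


def convert_text(text, table=LEGACY_TO_UNICODE):
    """Convenience wrapper that runs convert_stream on in-memory strings."""
    src, dst = io.StringIO(text), io.StringIO()
    convert_stream(src, dst, table)
    return dst.getvalue()
```

With real files, `source` and `destination` would be opened with `open(...)` in text mode using whatever encoding the pages are stored in.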

විශ්ව කුමාර

Nov 22, 2007, 11:46:02 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
I would like to help with this project. I made a Unicode
converter just for this purpose, but the program I wrote in VB6
started giving errors on long files. I have downloaded and read
a lot from the Tipitaka you are referring to. It is a great project!

Now I'm on VB.net.

I have successfully made a Wijesekara to Unicode converter. But the
text (font) in the Tipitaka translation is different from the DL
fonts.

This is a long-forgotten project that I abandoned for lack of
support.

I'm Vishva Kumara. I hope you remember me.

The program I have written takes Wijesekara (DL) text in one text
box of the user interface, and when a button is clicked, the Unicode
Sinhala text appears in the other text box. So basically a user has
to copy and paste paragraph by paragraph.
Therefore I have to modify the program to look for files and convert
them one by one as a batch process. This can be done! Almost done!!

The problem I encountered was that here and there in the
Tipitaka there is a page number in another text format, like [p/
001]. This was the major bottleneck I faced on the first attempt.

So if you have a version without page numbering like that (in
square brackets), it will be easier to process.

විශ්ව කුමාර

Nov 22, 2007, 2:45:04 PM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Look: http://groups.google.com/group/singlish-typewriter/web/TipitakaSinhalaToUnicode.PNG
What is left to do is to automate the process for all the files.

But before that, we have to make sure that the program is working
properly.

I'll post the program with its source code very soon. It is in VB.net
(2005).

Oh no!
> This is a long forgotten project I have abondened of the lack of
> support.

Sorry if I wrote something wrong. I tried this before, but back then I
did not have enough knowledge.

There is a small problem in the web pages of the Tipitaka. The font
tag opens before the paragraph tag and closes before the paragraph tag
closes, so the two are not properly nested.
This will cause errors when parsing files as a batch...

And... what is the letter in Tipitaka Sinhala for "Gayanu Kitta"? In
DL fonts it was on "!"

Tyronne Wickramaratne

Nov 22, 2007, 3:13:06 PM
to Sinhala...@googlegroups.com
On Nov 23, 2007 1:15 AM, විශ්ව කුමාර <vishva...@gmail.com> wrote:
> Look : http://groups.google.com/group/singlish-typewriter/web/TipitakaSinhalaToUnicode.PNG
> what is left to do is to automate the process for all the files.

Actually, the output is very nice. It does the job to perfection. Well, it seems we can have a look at the Tipitaka Sinhala Unicode version quite soon :)
One more question: which Unicode font do you use?

-TW


--
/usr/local/tyronne
http://labs.jboss.com/jbossmessaging/
ändern ist gewesen geändert worden

නිරංජන්෴

Nov 22, 2007, 5:06:18 PM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group

I wrote a converter from the Tipitaka font to Sinhala Unicode some
time ago, and it does the job well.
You can have it, but it runs on Windows.

Niranjan

විශ්ව කුමාර

Nov 22, 2007, 9:26:23 PM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
> what's the unicode font do you use ?
I use Iskoola Potha and Potha.

What we have to do is write the program to run as a batch process,
because we cannot assign one person to copy and paste all the text
back into the web pages.

My program is written as a module and exposes an API. Therefore it can
be used inside any other program as a plug-in (in the programming sense).

It is written for DL fonts. There were lots of bugs when using it with
Tipitaka Font 1. Even in the screenshot there were two major errors
that I found and corrected yesterday.

#) We have to select one piece of software (a converter) and make sure
it is 100% consistent with Tipitaka Sinhala 1.

One problem was that the TS1 font's "DU" maps to DL's "DI" (covered).

Another problem is... where is the "Gayanukitta"?

> I have writtern a converter times ago tripitaka font to Sinhala
> Unicode, does the job well,
> you can have it, but it runs on windows
That is good! Can we see some screenshots...

Mine is written in VB.net, so it will run even on Linux with Mono
installed.

We will need lots of people to proofread. It is a lot of text we have
to work with.

One idea I have is to put all the Tipitaka text into a database and make
the web page layout a separate layer (the presentation layer). Then it
will be easier to use the text for many other purposes. For
example, we can use the same database to build web pages, change
the layout later, and use it in external software. Searching
will be easier too.

So... what do you think about using a small XML database to store all
the data and later converting it to web pages?

රෂාන්

Nov 23, 2007, 1:54:15 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Hi,
So it seems that there are already several tools to do the conversion
(excellent!).

I personally agree that the converted Tripitaka text
should be saved in separate storage from which we can retrieve it using
open standards.

As far as I can see, this project has several main tasks:
1. Convert the existing text to its Unicode equivalent (we have to develop/
improve one or more tools)
2. Save the text (we may need to come up with a good data structure to
store it)
3. Implement the presentation layout (starting with the existing web
site, but there can be more in future)

Since there are already people willing to help with each of the tasks
above, let's plan on executing them.

I would like to lend a hand with at least the first two (since I
think I am not good at GUI design :( )

විශ්ව කුමාර

Nov 23, 2007, 5:32:43 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
I can do the GUI. I have made a user interface to some extent. There
is still a lot of coding work to do.

The interface
http://groups.google.com/group/singlish-typewriter/web/OpenAPage.PNG
Opens a page from file-folder tree

http://groups.google.com/group/singlish-typewriter/web/EditParaByPara.PNG
Edit paragraph by paragraph.


http://groups.google.com/group/singlish-typewriter/web/Kinti-s.htm
A sample output page


http://groups.google.com/group/singlish-typewriter/web/Kinti-s.xml
I suggest this XML Database Schema. We have to add more fields to this

The XML Database
http://groups.google.com/group/singlish-typewriter/web/Database1.PNG
http://groups.google.com/group/singlish-typewriter/web/Database2.PNG


N.B.

Please download the source code of the above work from
http://groups.google.com/group/singlish-typewriter/web/TipitakaTranslation.zip

What we have to do:
Find the bugs in the conversion. The output does not always
match the input.

රෂාන්

Nov 23, 2007, 6:02:37 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Excellent stuff!
I will try this in the next few days.

BTW, about your suggested XML schema:

it seems your data schema divides the entire Tripitaka into paragraphs,
which can also be stored easily in one relational table:

t= { id, legacy_content, unicode_content}
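
A minimal sketch of the single-table idea above, using Python's built-in `sqlite3` module. The table name `paragraphs` and the helper functions are assumptions made for illustration; only the three column names come from this post.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open (or create) a database holding one row per paragraph."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS paragraphs ("
        " id INTEGER PRIMARY KEY,"
        " legacy_content TEXT NOT NULL,"
        " unicode_content TEXT)"
    )
    return conn


def add_paragraph(conn, legacy, unicode_text=None):
    """Store one paragraph and return its generated id."""
    cur = conn.execute(
        "INSERT INTO paragraphs (legacy_content, unicode_content) VALUES (?, ?)",
        (legacy, unicode_text),
    )
    conn.commit()
    return cur.lastrowid


def search(conn, needle):
    """Return the ids of paragraphs whose Unicode text contains needle."""
    rows = conn.execute(
        "SELECT id FROM paragraphs WHERE unicode_content LIKE ? ORDER BY id",
        ("%" + needle + "%",),
    )
    return [row[0] for row in rows]
```

Keeping both the legacy and the Unicode text in the same row would let a proofreader compare them side by side.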

What do you think about a small improvement: adding a few more tags
(or attributes) to identify the නිකාය (nikaya), සූත්‍රය (sutra), etc.?

E.g. representing a sutra in XML, assuming <sutra> = set of <paras>:

<Sutra id="103" name="කින්ති සූත්‍රය" nikaya="මජ්ඣිම නිකාය" >
<MetaData>
... some pre-defined meta data
</MetaData>
<Para id="pid" encoding="utf-8">

</Para>
</Sutra>
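
A file in roughly this shape could be generated with Python's standard `xml.etree.ElementTree`, as in the sketch below. The function names are made up for illustration, the `MetaData` element is left empty because its fields have not been decided here, and the encoding is declared once in the XML declaration instead of on each `Para`.

```python
import xml.etree.ElementTree as ET


def build_sutra(sutra_id, name, nikaya, paragraphs):
    """Build a <Sutra> element holding one <Para> per paragraph."""
    sutra = ET.Element(
        "Sutra", {"id": str(sutra_id), "name": name, "nikaya": nikaya}
    )
    ET.SubElement(sutra, "MetaData")  # placeholder for pre-defined metadata
    for pid, text in enumerate(paragraphs, start=1):
        para = ET.SubElement(sutra, "Para", {"id": str(pid)})
        para.text = text
    return sutra


def to_xml_bytes(sutra):
    """Serialise the element as UTF-8 bytes with an XML declaration."""
    return ET.tostring(sutra, encoding="utf-8", xml_declaration=True)
```

The bytes returned by `to_xml_bytes` can be written straight to a file opened in binary mode, one file per sutra.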






SRIshanu

Nov 23, 2007, 8:54:12 AM
to Sinhala...@googlegroups.com
With your permission, Venerable Sir,

If any additional typing is needed beyond what the converter can do, I can contribute to that.

My typing speed is not very high... but given some time I can get it done...

--
ශ්‍රීශානු - SRIshanu
http://srishanu.blogspot.com
http://lankahistory.blogspot.com

විශ්ව කුමාර

Nov 23, 2007, 11:49:13 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
> it seems your data schema divide the entire thripitaka into paragraphs
> which also can be stored easily in one relational table;

No... This file is for the "කින්ති සූත්‍රය" (Kinti Sutra) and I thought of keeping
a separate file for each...


> what do you think about a small improvement by adding few more tags
> ( or attributes) to identify the නිකාය, සූත්‍රය, etc.. ;

> e.g. representing a sutra in an XML assuming, <sutra> = set of <paras>

Yes, it is a good idea. Those are some necessary attributes I forgot to
include.
But this XML file is written directly by the .net XML parser.

See the program: what I have done is make a DataTable and write it
to an XML file. We can add more columns.
At the moment "LegacyFontStr" and "UnicodeFontStr" are the two columns of the
"TipitakaParagraphs" table.
What we can do in .net is create tables, add columns to the tables,
put a set of tables into a DataSet and write the DataSet to an XML file.
Eg: DataSet -> DataTable -> Columns and Rows -> Data
There may be many other ways, but this is the easiest and fastest.

If we put all the Tipitaka into one table, it will be very big and
harder to process in one go. That is why I thought of keeping a separate
index file for the structure.

We can keep metadata in either the index file or the Sutra file.
We can also add a separate table to each sutra file containing
data about which Pitaka, Pannasaka, Nikaya and Sutra it belongs to.



But first we have to ask Ven. Mettavihari. This project is his
idea, and the content of the document should be converted under his
guidance.

We are waiting for your decision...

Bhikkhu Mettavihari

Nov 23, 2007, 10:26:12 PM
to Sinhala...@googlegroups.com
I am simply overwhelmed by all this help coming forward.
In our Buddhist way: "Much merit to all of you"

1. I have thought about the database and find it useful for the net.

2. We also have to keep in mind that I distribute the entire website on a CD.
In that case the files have to be readable on a regular Win/Linux
machine with Unicode installed.

3. For the first several years we will also have to keep it compatible
with Win-98-XP without Unicode,
and hence will perhaps have to keep 2 copies of the files.

with metta
Mettavihari

Tyronne Wickramaratne

Nov 24, 2007, 7:10:37 AM
to Sinhala...@googlegroups.com
Ven Bhikkhu,

We had some discussions and we want to talk about the license that you're going to use for the font. We do _not_ want anything .. what we're trying to do is to make those fonts free, open and public .. so that they can be used on any Linux distribution as well as on windoez .. I know you're well ahead in this conversation.

I want to thank you for your great efforts in this on behalf of the community. I can spare some time, initiate a conf-call and discuss these things a bit further .. probably after mid-December ..

Please accept my apology if you find my words a bit harsh or edgy.

with mettha
Tyronne



Bhikkhu Mettavihari

Nov 24, 2007, 8:34:20 AM
to Sinhala...@googlegroups.com
Dear Tyronne,

> we had some discussions and we want to discuss about the license that you're
> going to use on the font. we do _not_ want anything .. what we're trying to
> do is to make those fonts free, open and public.. where they can be used on
> any Linux distribution as well as on windoez as well.. i know you're well
> ahead in this conversation

So far I am planning only ONE font, and I would like it to be under a
licence like BSD.
My logic is simply that I want font makers to take this font,
paste their glyphs into the required slots,
and still have the freedom to sell the new font they have created.
It is purely a service, so that in time we will have
several new fonts
coming out that follow the right standard.

If you know of a better licence than BSD for this purpose, then I
would like to know.

We are likely to come out later with 7 new fonts, which will be under
the GPL licence.

with metta
Mettavihari

Tyronne Wickramaratne

Nov 24, 2007, 8:59:04 AM
to Sinhala...@googlegroups.com
Ven Bhikkhu,


On Nov 24, 2007 7:04 PM, Bhikkhu Mettavihari <d.mett...@gmail.com> wrote:
> So far I am planning only ONE font and that I would like to be a
> licence like BSD
> My logic is simply that I want font makers to take this font
> paste their glyphs into the required spaces
> and still have the freedom to sell the new font that has been created by them.
> It purely as a service to see that we in time to come will have
> several new fonts
> coming out with the right standard.
>
> If you know of a better licence than BSD for this purpose, then I
> would like to know.

Yes .. we had some discussions on the license and BSD is _ok_, but I was told about another license whose name I cannot remember now. I can get you all this information by c.o.b. Monday the 26th.

At the same time, I'd like to know whether, once we define the license, it is possible for you to create a project and host it on sourceforge? Then we can all monitor the progress, and we can also have a mailing list and a blog and make use of its many other features .. and importantly, we (not me) can make some contributions to the font development work as well.

Please send your suggestions and, depending on the outcome on the license, we can proceed and see what we can do to make things better.

> We are likely later on coming out with 7 new fonts which will be under
> GPL licence.

I was told that there's an issue with the GPL license; I can update you on this on Monday as I said.

with metta,
Tyronne

විශ්ව කුමාර

Nov 24, 2007, 11:19:19 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
How about a Creative Commons licence?
There we have Attribution, Share-Alike and Commercial/Non-Commercial
options.
See http://creativecommons.org/

But isn't this about converting the content to Unicode format...
Once we have converted the content, we can use any Unicode Sinhala
font to view the pages.

For the computers that do not have Unicode, can't we generate image
files with the content...


> 2. We also have to keep in mind that I distribute the entire website on a CD.
> In that case the files has to be readable on a regular Win/Linux
> machine with Unicode installed

Yes, the files are currently HTML web pages. Therefore they can be
viewed on any platform with only a web browser.
So that is the main requirement...

I think we have to start from a "Requirement Analysis" (the classical
way).

Using the database does not make it necessary to deploy the database
to the end user.
We can use it as an intermediate data repository.

With a separate database, we can design the web page layout and
interface, and then integrate.
Or we can later generate the web pages with one click. The basic idea is
to keep the data and the interface independent at the development stage,
which will make the designer's life easier.

When I converted some pages, there were many hidden problems in the
program that showed up in the output. Some of them can be corrected.
The Reepaya comes after the letter in legacy fonts, but it comes
before the letter in Unicode. I have corrected the program to handle
this for a few words like "Dharma", "Karma", "Viirya" that are
likely to be found in the text. But the reepaya is still misplaced in some
instances. Correcting this problem may slow down the
processing. Now it takes about 2 seconds on a 1GHz processor for an
average page from the Majjhima Nikaaya. That is slow...
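
One possible way to handle the reepaya in general, instead of word by word, is a single regular-expression pass. The sketch below is only an assumption about how the legacy text looks: `LEGACY_REPAYA` is a made-up placeholder for the legacy reepaya glyph, and it is assumed to follow the consonant (and any vowel signs) that it sits on. In Unicode the reepaya is written as ra + al-lakuna + ZWJ placed before that consonant.

```python
import re

RA, AL_LAKUNA, ZWJ = "\u0DBB", "\u0DCA", "\u200D"
REPAYA = RA + AL_LAKUNA + ZWJ   # Unicode reepaya sequence, goes before the consonant
LEGACY_REPAYA = "\uE000"        # hypothetical placeholder for the legacy reepaya glyph

# One Sinhala consonant (U+0D9A..U+0DC6) with any dependent signs,
# immediately followed by the legacy reepaya placeholder.
_PATTERN = re.compile(
    "([\u0D9A-\u0DC6][\u0DCA\u0DCF-\u0DDF\u0DF2\u0DF3]*)" + LEGACY_REPAYA
)


def reorder_repaya(text):
    """Move each reepaya in front of the consonant cluster it belongs to."""
    return _PATTERN.sub(lambda m: REPAYA + m.group(1), text)
```

Because this is one pass over the string, it should not add much to the processing time, though that would need to be measured on real pages.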

I think SRIshanu can help in proofreading and correcting those errors
in the output. It would mean a lot!

One reason for splitting the pages into paragraphs is that the proof
reader can then compare the original text with the converted text and make the
necessary changes, as in http://groups.google.com/group/singlish-typewriter/web/EditParaByPara.PNG

I'm still testing the converter. If someone can proofread and test
the software, it will make things easier and faster.

When we are satisfied with the accuracy of the conversion and have agreed on a
storage format (database), we can start the batch processing.

Has anyone downloaded the converter...
It is at http://groups.google.com/group/singlish-typewriter/web/TipitakaTranslation.zip

Tyronne Wickramaratne

Nov 24, 2007, 11:46:33 AM
to Sinhala...@googlegroups.com
On Nov 24, 2007 9:49 PM, විශ්ව කුමාර <vishva...@gmail.com> wrote:

> How is Creative Commons Licence.
> There we have Attribution, Share-Alike and Commercial/Non-Commercial
> options
> See http://creativecommons.org/

Thanks for your input. We're going to have a discussion on the licensing issue again on Monday evening or, in the worst case, on Tuesday. Right now we've got just one FOSS font, which is not enough to ask the general public to get used to this.

So we're looking for someone / an organisation who can develop a few fonts for the community. The ownership of the fonts is going to be with the community, which means the fonts can be used by any person, and the dirty m$ guys too can use the font. For the general public to have access to such a font, it should be free and open source.

When the font is free, anybody can use it. Since the sources are free, anyone can hack it and improve it at any time, so the quality evolves as time goes on. Once the font gets developed, from our side we can fix the issues related to the font as far as the rendering is concerned .. and we could maintain the work. Whatever fixes we make as far as the rendering ++ is concerned will go into all the nix based operating systems.

So that's the plan. Once this fonts project gets hosted at a _public_ repo like sourceforge, we can see the progress of the work and we can help the developers/community accordingly.

So that's the story about fonts .. guys, if you have anything to add on top of this, you're welcome. All critics are welcome !!! Without criticism/suggestions .. we'll never grow or improve ..
-TW



විශ්ව කුමාර

Nov 25, 2007, 1:08:36 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Can we have a separate wiki for this project? Then it will be easier
for many people to collaborate on proofreading at the same time.
Where can we host a wiki for a religious purpose... I mean, does
Wikimedia support such projects...

I don't think that using the existing Wikipedia is a good solution. This
will need a separate wiki installation.

I have seen that UCSC also uses a wiki installation for their work.



It seems that this thread has gone far away from the original
topic... : )

රෂාන්

Nov 25, 2007, 1:47:51 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
"Using the database does not make it necessary to deploy the database
to the end user. We can use it as an intermediate data repository."

Exactly. We should design the database so that we can save the
Tripitaka there and retrieve content from it based on various
requirements.
E.g.: suppose the Tripitaka text is saved in 'the' database deployed
on a dedicated machine, and say we want to prepare a CD on the
සිඟාලෝවාද සූත්‍රය (Sigalovada Sutra).
All we have to do is design the presentation layout; the
content can be taken from the database. If we want to support a
non-Unicode version, we can just implement a converter (Unicode ->
legacy font).
And again, say we want to publish the same sutra on a web page: we do
not need to 'type' the content, we just follow the same procedure as above.
Who knows, one could even build a small FOSS app to view the Tripitaka on
our mobile devices :)

So in summary, all we have to do is store the Tripitaka in a
repository from which we can retrieve/search the text. That will
open up new avenues
for presenting/publishing the Tripitaka.

විශ්ව, I have seen your product and I think it can be used effectively
for the conversion, but it would be good to have more discussion with
others (especially Ven. Bhikkhu Mettavihari) to decide on a good data
structure to store the Tripitaka. Once we have that, I think the
conversion is going to be a very simple thing. {Of course the proof
reading may take some time, but fortunately we have people :) }

"It seems that this thread has gone far away from the original topic"

Yes. Shall we focus our discussion on "conversion of the existing web
site and storing it in Unicode" first?

PS: විශ්ව, maybe you can publish the mapping table, probably as an
image, so that everyone can easily check it.



විශ්ව කුමාර

Nov 25, 2007, 4:09:28 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
What do you mean by the mapping table?
The data structure of the DotNet DataSet...
The DataSet is a data structure built in main memory by the
program. It does not have a graphical view, unless we draw it as a
graphic.


So... what do you think about the Wikimedia thing...

The software I made for the conversion is Free and Open Source.

I have run the batch process on the whole Tipitaka as a test. It
ran for a long time, so I turned off the monitor and went away.

I'm now uploading the zipped database of XML files as an example of
how things will go.

I can make a small viewer to open the XML files and edit them. But
most probably it will be on DotNet. So you will need the DotNet platform
(free to download) or Mono (if on Linux) to run it.

The Data Files: http://groups.google.com/group/Sinhala-Unicode/web/SiUnicodeTipitakaDataFiles.zip

රෂාන්

Nov 25, 2007, 4:39:18 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
"What do you mean by the mapping table?"
I mean the corresponding Unicode character(s) for each
Tipitaka font character(s).
I thought it would make it easier for others (people who can't look into the
source code) to comment on the conversion. (That is optional.)

I too have looked at some resources, such as html2text - http://www.mbayer.de/html2text/
(this converts HTML pages to plain text).
I thought working with plain text would be easier than with HTML pages.
Any comments?

"So... what do you think about the Wikimedia thing..."
There is a discussion on having a wiki for this group here:
http://groups.google.com/group/Sinhala-Unicode/browse_thread/thread/4af1ed3ba0aa0cb4

Shall we start this in that setup?

විශ්ව කුමාර

Nov 25, 2007, 7:50:33 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
The conversion of a legacy font to Unicode is not that easy. It takes
two mapping tables, one to quantize and the other to map.
First I replace characters like "du" with "da" and "paapilla",
that means breaking them into smaller parts. When there is a character
like "rae" we have to make it "ra" and "aelapilla". For "ru" ->
"rayanna" + "paapilla"

Only then can we go on to replace legacy characters with Unicode
characters. Again there is a long algorithm to identify things like
"kombuva", "aelapilla", "paapilla", because in legacy fonts we have
a number of separate kombuva, aelapilla and ispilla glyphs for one letter.
For example, the combination for "koo" has "kombuva", "kayanna",
"aelapilla", "ispilla". Therefore we cannot use a simple mapping table.
When it scans through the input, it identifies whether the current character
is a "letter", "consonant" or a "glyph" ("kombu", "aelapili"). All the
"kombu" and "aelapili" are retained and added together with a binary-tree-like
control structure. Then, when the real letter comes, it flushes
the two retained characters to the output string.
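
The idea described above (quantize first, hold back the kombuva until the consonant arrives, then merge the signs) might look something like the following Python sketch. It is not the VB.net program from this thread: every legacy code in the tables is invented for illustration, a plain list stands in for the binary-tree-like structure, and only a handful of letters and signs are covered.

```python
# Stage 1 table: split composite legacy glyphs into smaller parts.
QUANTIZE = {"\u00FF": "oq"}                  # assumed "du" glyph -> "da" + paa-pilla
# Stage 2 tables: classify each legacy code (all codes are hypothetical).
PRE_BASE = {"f": "\u0DD9"}                   # kombuva, typed before the consonant
CONSONANTS = {"l": "\u0D9A", "o": "\u0DAF"}  # ka, da
POST_BASE = {"d": "\u0DCF", "a": "\u0DCA", "q": "\u0DD4"}  # aela-pilla, al-lakuna, paa-pilla
# Stage 3 table: merge two signs into the single Unicode vowel sign.
COMBINE = {
    ("\u0DD9", "\u0DCF"): "\u0DDC",          # kombuva + aela-pilla -> o
    ("\u0DDC", "\u0DCA"): "\u0DDD",          # o + al-lakuna -> long o
    ("\u0DD9", "\u0DCA"): "\u0DDA",          # kombuva + al-lakuna -> long e
}


def quantize(text):
    """Break composite glyphs into their parts."""
    return "".join(QUANTIZE.get(ch, ch) for ch in text)


def convert(text):
    """Convert quantized legacy text, reordering pre-base signs."""
    out, pending = [], []
    for ch in quantize(text):
        if ch in PRE_BASE:
            pending.append(PRE_BASE[ch])     # retain until the consonant comes
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            out.extend(pending)              # flush retained signs after the letter
            pending.clear()
        elif ch in POST_BASE:
            sign = POST_BASE[ch]
            merged = COMBINE.get((out[-1], sign)) if out else None
            if merged:
                out[-1] = merged
            else:
                out.append(sign)
        else:
            out.extend(pending)              # stray signs before a non-letter
            pending.clear()
            out.append(ch)
    out.extend(pending)
    return "".join(out)
```

With these made-up codes, the sequence kombuva, ka, aela-pilla, al-lakuna collapses to ka followed by the single long-o vowel sign, in one scan of the input.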

If we used a simple mapping table, the software would be very slow.
There would be more than 1660 combinations to go through (as Mr.
Donald says). So the program does the conversion in a three-stage process,
which is faster.

Anyway, going through the HTML is not a big problem in DotNet. Even in
Java we have this IndexOf method for a string. I'm using the IndexOf
method of the String class to find the font tags with the "Tipitaka_Sinhala1"
font. This is done inside a while loop, so all the font tags go
through the loop.
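
For comparison, the same kind of loop in Python would use `str.find`, which plays the role of IndexOf. This is a sketch under assumptions: the function name is made up, tags are assumed to be written in lower case, and nested font tags are not handled.

```python
def font_spans(html, face="Tipitaka_Sinhala1"):
    """Return the text inside each <font> tag whose attributes mention face."""
    spans, pos = [], 0
    while True:
        start = html.find("<font", pos)      # next opening font tag
        if start == -1:
            break
        tag_end = html.find(">", start)      # end of the opening tag
        if tag_end == -1:
            break
        close = html.find("</font>", tag_end)
        if close == -1:
            break
        if face in html[start:tag_end]:
            spans.append(html[tag_end + 1:close])
        pos = close + len("</font>")         # continue after this tag
    return spans
```

Each returned span may still contain other tags (for example a `<p>` that was opened inside the font tag), so it would need further cleaning before conversion.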

The quantization table and conversion table for the program are in an
XML file. : )
I tend to put everything possible in XML files!
I'll post them soon. Anyway, those files are in the "Data" subfolder
inside the "Debug" folder of the DotNet project.



And about the "wiki": I mean, what if we had a separate wiki to keep all
the converted text as separate pages. Something like a Tipitaka wiki.

Now... please forgive me if I'm saying something inappropriate...
In the English Wikipedia there is a special template for the Bible.
All the verses are stored in a separate domain on en.wiki and can be
retrieved through the template.

But we cannot put the Tipitaka inside a Wikipedia. If we take the Sinhala
Wikipedia, it has about 260 pages (the last time I checked). So if we
put the Tipitaka into the existing Sinhala wiki, it will quadruple the
existing size.
There are separate wiki installations for a variety of purposes, like
WikiHow and FMA.Wikia.

So I think, as a separate project, the Sinhala Unicode Tipitaka should
have its own wiki.

විශ්ව කුමාර

Nov 25, 2007, 8:08:46 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
The mapping table: until now I have been calling it the conversion table.
You can see it as a table by clicking the "Convertion Database" button
at the top right corner of the program.

රෂාන්

Nov 26, 2007, 2:05:02 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Yesterday I did a small experiment on converting HTML pages to
plain text, and the following are my results.

The procedure:

Download html2text.py from http://www.aaronsw.com/2002/html2text/html2text.py

Wrote a small bash script called html2text.script to process one file:

######################################################
#!/bin/bash

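# $1 is a path relative to the current directory, as passed by find (e.g. ./a/b.html)
i="$(pwd)${1:1}"
# Write the text next to the HTML file, replacing the last extension with .txt
python html2text.py "$i" > "${i%.*}.txt"
echo "$i processed successfully"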

####################################################

Then, inside the extracted tipitaka folder, I executed the following:

time find . -name "*.html" -exec ./html2text.script '{}' \;

It took about 4 minutes to process all the HTML files, and the resulting
Tipitaka text was about 7.5 MB.

So now I had the Tipitaka in plain text (still in the Tipitaka font), and
hopefully all I need is another small program that implements the
conversion algorithm
and runs in batch mode.
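
For anyone who prefers to do the batch step without a shell, here is a rough Python sketch that uses only the standard library. It does not use html2text.py, it simply keeps the text between tags, and the Latin-1 input encoding is an assumption about how the legacy pages are stored.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Collect text content and drop all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities become characters
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML string with the markup removed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def convert_tree(root, in_encoding="latin-1"):
    """Write a UTF-8 .txt file next to every .html file under root."""
    count = 0
    for page in sorted(Path(root).rglob("*.html")):
        text = html_to_text(page.read_text(encoding=in_encoding))
        page.with_suffix(".txt").write_text(text, encoding="utf-8")
        count += 1
    return count
```

Calling `convert_tree(".")` inside the extracted folder would then play the role of the find command, and the returned count tells how many pages were processed.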

Then I converted it with the tool I mentioned earlier (it was the only tool
available on the machine) and it seems OK.

Hope you can do the same and see the results :)

So I estimate the entire process will complete within 10 minutes :)


As for the wiki, it is better to host it on a private server
(administered by Ven. Mettavihari), as the Dhamma is a very sensitive
thing.

විශ්ව කුමාර

Nov 26, 2007, 2:24:36 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
> The procedure:
> Download html2text.py from http://www.aaronsw.com/2002/html2text/html2text.py
> wrote a small bash script called html2text.script to process one file

Great work...
But I do not use C++ very often.

I have thought about Sinhala WikiBooks and made a stub.
See it at
http://si.wikibooks.org/wiki/%E0%B6%AD%E0%B7%8A%E2%80%8D%E0%B6%BB%E0%B7%92%E0%B6%B4%E0%B7%92%E0%B6%A7%E0%B6%9A_%E0%B6%B4%E0%B7%9C%E0%B6%AD%E0%B7%8A_%E0%B7%80%E0%B7%84%E0%B6%B1%E0%B7%8A%E0%B7%83%E0%B7%9A

These are only the contents pages. No text has been uploaded yet. If we
are not satisfied with the security there, we can delete these contents
pages, or redirect them to the private wiki installation later.


Check http://sinhala-unicode.googlegroups.com/web/SiUnicodeTipitakaDataFiles.zip
for the complete set of converted files.
But there are still some errors here and there caused by unexpected
HTML tags inside paragraphs. The converter treats the HTML tags as Tipitaka
font text and produces gibberish. This happens very rarely and is therefore
hard to find.
It also gives gibberish for HTML entities like "&igrave;", "&ugrave;", "&amp;" (plus lots
of unidentified ones) that occur rarely. This is rare,
but not good at all.
If I have the plain text version this will be all right.
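
If the HTML has to be processed directly, one possible approach is to leave anything that looks like a tag untouched and decode entities before converting the text in between. The sketch below assumes a `convert` function that turns legacy text into Unicode is passed in; the helper name `convert_fragment` is made up for illustration.

```python
import html
import re

_TAG = re.compile(r"(<[^>]*>)")   # captures anything that looks like an HTML tag


def convert_fragment(fragment, convert):
    """Apply convert only to the text between tags, decoding entities first."""
    out = []
    for part in _TAG.split(fragment):
        if part.startswith("<") and part.endswith(">"):
            out.append(part)                  # markup is copied through as-is
        elif part:
            text = html.unescape(part)        # "&igrave;" becomes the real character
            out.append(html.escape(convert(text), quote=False))
    return "".join(out)
```

`html.unescape` knows the named entities such as `&igrave;` and `&ugrave;`, so the converter would see the same characters it sees in plain text.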

You have done something I cannot do right now! So could you please
post the text files, because I will have the converter ready for plain
text very soon.

Just zip them, upload them to this group and post the link.

රෂාන්

Nov 26, 2007, 3:05:54 AM
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Ok i have uploaded it and the link is;
http://groups.google.com/group/Sinhala-Unicode/web/tipitaka_text.zip

I have converted the text files to DOS format, so Linux users
may need to run the following command (or a similar one!) after
extraction:

find . -name "*.txt" -exec dos2unix '{}' \;

PS: the file is about 6.7 MB, so I recommend following the process
I mentioned above (if you can) to generate these files yourself.


විශ්ව කුමාර

Nov 26, 2007, 8:31:43 AM11/26/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
If we are using a wiki like si.wikibooks, we can lock the articles to
prevent vandalism,
or monitor the articles constantly.

It is also possible to undo vandalism on Wikipedia.

විශ්ව කුමාර

Nov 26, 2007, 10:55:34 PM11/26/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
I have converted the text files you have sent to Unicode format.

This time it is 99% accurate, except that it was almost impossible to
keep "[\q xxx /]" from appearing. This seems to be the page
numbering. We may have to remove these manually, because the notation
differs slightly from page to page. I tried to use a regular
expression to find and eliminate this pattern, but it was not a
constant pattern!
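
A more tolerant pattern might still catch most of these markers. This is only a sketch: it assumes every marker starts with "[" followed by "\q" and ends at the next "]", which may not cover every variant in the real files. PAGE_MARKER and strip_page_markers are illustrative names.

```python
import re

# Assumed shape: "[", optional spaces, a backslash and "q", then anything
# that is not a bracket, then "]". Widen this if other variants turn up.
PAGE_MARKER = re.compile(r"\[\s*\\q[^\[\]]*\]")


def strip_page_markers(text):
    """Remove page-number markers and tidy the spaces they leave behind."""
    without_markers = PAGE_MARKER.sub(" ", text)
    return re.sub(r"[ \t]{2,}", " ", without_markers).strip()
```

A paragraph that consists of nothing but a marker comes back as an empty string, so it can be dropped afterwards.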

The basic conversion is done, and all the paragraphs are in the
database.

Now we can generate HTML files. What I need is the Web page Header,
Footer and Paragraph Header, Footer.

Eg:
Web page Header
<html>
<body bgcolor="lightyellow">
<font face="TipitakaSinhalaUnicode1">

Web page Footer
</font>
</body>
</html>

Paragraph Header
<p align="justify"> &nbsp; &nbsp;

Paragraph Footer
</p><br />

We can include graphics, and those will be uniform throughout the whole
site. Or we can make slight variations for each section. A graphic
designer can do this part independently of the page content.
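
To show how the four pieces fit together, here is a small sketch that wraps a list of paragraphs in the example headers and footers given above. The function name render_page is made up, and it assumes the paragraphs are plain text that still needs HTML escaping.

```python
from html import escape

# The four templates from the example above, as plain strings.
PAGE_HEADER = (
    "<html>\n"
    '<body bgcolor="lightyellow">\n'
    '<font face="TipitakaSinhalaUnicode1">\n'
)
PAGE_FOOTER = "</font>\n</body>\n</html>\n"
PARA_HEADER = '<p align="justify"> &nbsp; &nbsp;'
PARA_FOOTER = "</p><br />\n"


def render_page(paragraphs):
    """Wrap each paragraph, then wrap the whole page."""
    body = "".join(PARA_HEADER + escape(p) + PARA_FOOTER for p in paragraphs)
    return PAGE_HEADER + body + PAGE_FOOTER
```

Changing the look of the whole site then only means editing the four template strings and generating the pages again.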


We can also generate graphics for each paragraph. Then it will be
possible to view these even on a Win98 or Mac OS machine. I suggest
PNG would suit this purpose well. We can also include a picture in the
background of the images.

All the processing from now on can run with one click!

With the text files you sent, it was a lot easier. Writing the new
Unicode files took only a few seconds! Less than a minute...

The converted files can be downloaded from
http://groups.google.com/group/Sinhala-Unicode/web/tipitaka_text_unicode.zip

රෂාන්

Nov 27, 2007, 1:26:39 AM11/27/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Is it possible to get the program you used to convert the files, even
the binary is OK :)

As for generating HTML (or XHTML), what do you think about
technologies like XSL and XSLT (given that we already have the data as
XML)?

"if we are using a Wiki like si.wikibooks, we can lock the articles to
prevent vandalism. Or monitor the articles constantly. "

Prevention is better than cure :)

රෂාන්

Nov 27, 2007, 1:45:41 AM11/27/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
hey, i have just tested your files, excellent work man!

විශ්ව කුමාර

Nov 27, 2007, 1:57:16 AM11/27/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Thanks!

I have not thought about other file types. The program works in two
layers. The convert program converts any "String" from legacy format
to Unicode. The upper program reads the file and passes it paragraph by
paragraph to the convert program.

The convert program can be compiled to a separate DLL to be plugged into
a separate program as an API.
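
The two layers described above could look roughly like this. This is a sketch of the idea only, not the actual VB.net code; convert_string, convert_paragraphs and DEMO_TABLE are invented names, and the table is a placeholder rather than the real Tipitaka font mapping.

```python
import re


def convert_string(legacy_text, table):
    """Lower layer: convert one legacy-font string with a mapping table.

    Longer keys are tried first so multi-character sequences win over
    single characters; unmapped characters pass through unchanged.
    """
    keys = sorted((k for k in table if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(legacy_text):
        for key in keys:
            if legacy_text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(legacy_text[i])
            i += 1
    return "".join(out)


def convert_paragraphs(text, table):
    """Upper layer: split on blank lines, convert paragraph by paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [convert_string(p, table) for p in paragraphs]


# Placeholder table for illustration only, NOT the real font mapping.
DEMO_TABLE = {"a": "X", "ab": "Y"}
```

Keeping the lower layer free of any file handling is what would let it be packaged separately and called from other programs.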

I don't know much about XSL or XSLT. But if there is a parser, things
will be easier. Otherwise we have to write a parser. Since the metadata
is in tags, we can scan for "<" and ">" and write a parser easily. But
these legacy fonts use these two tag characters as letters! Therefore
it is harder...

So... What do we do next...

රෂාන්

Nov 27, 2007, 2:19:17 AM11/27/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
XSL and XSLT are not a big deal
just see this for an overview:
http://en.wikipedia.org/wiki/XSL_Transformations
http://www.w3.org/TR/xslt

The idea is:
* we have the data in XML files
* we define the page structure in an XSLT file using XSL and normal
XHTML (this is essentially the presentation layer)
* an XSLT processor takes the above two and generates the output as XML
(or XHTML)

I think we don't need to implement any parser; .NET already has the
required classes, in something like System.Xml.Xsl.
So if we have a small program that just takes the two inputs and generates
the XHTML, we are done. If the layout needs to be changed later,
we just change the XSLT file and re-run our program to regenerate the
files.

Note: Depending on the platform, we can do the above on the fly too.
All the leading programming languages like java, php have this
functionality.

BUT, before doing anything, I think it is now time to get further
instructions from Ven. Bhikkhu Mettavihari :)

රෂාන්

Nov 27, 2007, 6:35:24 AM11/27/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
see this for an example;
http://groups.google.com/group/Sinhala-Unicode/web/Tipitaka_XSLT.zip

This is just an example :) just extract the two files and view the xml
file using a browser


විශ්ව කුමාර

Nov 27, 2007, 8:43:28 AM11/27/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Thanks, That was cool!

Tyronne Wickramarathne

Nov 27, 2007, 9:54:56 AM11/27/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Rashan,

awesome!! i'm a huge fan of XSLT transformations too. glad to see
that you're getting the full use of it.
at the same time if you're planning to create a parser AFAIK, ven
bhikku doesn't have a windoez box.

- TW


විශ්ව කුමාර

Nov 27, 2007, 10:13:34 AM11/27/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
He does have...

Tyronne Wickramarathne

Nov 27, 2007, 10:18:43 AM11/27/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
On Nov 27, 8:13 pm, "විශ්ව කුමාර" <vishva8kum...@gmail.com> wrote:
> He does have...

cool. :)

Sapumal Jayaratne

Nov 27, 2007, 11:26:32 AM11/27/07
to Sinhala...@googlegroups.com
The work is truly excellent. I would like to offer some ideas for the work ahead.

First, the minor defects in the conversion should be removed. I feel that some of these defects were already in the original copy, while others are introduced during the Unicode conversion. In any case, up to a point these can be corrected by technical means. In the end, though, a need for proofreading will very likely arise. What procedure can we follow for proofreading? As you mentioned, this can be done with a wiki. Some technical method can be used to load the information from the XML files into the wiki automatically and, after proofreading, to retrieve it again and format it as XML.

Second, we should discuss how the final data will be stored and what technology will serve it from that store. The idea of storing it as XML files or as an XML database is excellent, because one can then take only the information one needs and convert it to various media very easily. XSL can be used when it needs to be presented as a web page, and it can also be converted easily to media such as PDF. If needed, a data service can be offered through a Web Service. Then anyone can obtain the data, arrange it as they wish, and distribute it.

It is wonderful that Ven. Mettavihari's request could be fulfilled so quickly. However, as Rashan said, I would like to hear his views on how the work should proceed.

Best wishes,
Sapumal.


රෂාන්

Nov 28, 2007, 1:49:36 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
"if you're planning to create a parser AFAIK, ven
> bhikku doesn't have a windoez box."

We don't need a Windows box. The deployment can be done in many ways,
e.g.
1. Publish the XML files and XSL files directly on a web server. In
this mode the transformation is done on the client side by the web
browser.
2. Use a server-side scripting technology (PHP, ASP.NET, etc.)
to do the transformation on the fly on the server and return the
XHTML to the client.
3. Develop a tool to do the transformation once, i.e. XML + XSL ->
XHTML, and host the XHTML files on a web server. I think විශ්ව has
already done some work on this, which is why I suggested continuing
with .NET, but since the processing happens only once, we don't need a
server with .NET.

Each of the above has both advantages and disadvantages, but the good
news is that once we have the Tipitaka as XML files (we have to discuss
the structure more), we can implement all of them with little effort.
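
To make the one-time approach (option 3) concrete, here is a rough sketch. Python's standard library has no XSLT processor, so this uses xml.etree.ElementTree as a stand-in rather than XSLT itself. The element names sutta, title and para are hypothetical, since the XML structure has not been agreed yet.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical data file; the real element names are still to be decided.
SAMPLE_XML = """<sutta>
  <title>Sample title</title>
  <para>First paragraph.</para>
  <para>Second &amp; last paragraph.</para>
</sutta>"""


def xml_to_xhtml(xml_text):
    """Turn one data file into a static page, once, ahead of deployment."""
    root = ET.fromstring(xml_text)
    title = escape(root.findtext("title", default=""))
    paras = "".join(
        "<p>" + escape(p.text or "") + "</p>\n" for p in root.iter("para")
    )
    return (
        "<html><head><title>" + title + "</title></head>\n"
        "<body>\n<h1>" + title + "</h1>\n" + paras + "</body></html>\n"
    )
```

The output is plain static markup, so it can be hosted on any web server without server-side processing.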

I have given examples for the case 1 and will try to send some
examples on case 2 in next few days. Maybe විශ්ව can come up with a
solution on case 3.

Then we can all decide on the best one for each application (here I am
talking not only about the metta web site but about many other applications).

Tyronne Wickramarathne

Nov 28, 2007, 2:07:23 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
On Nov 28, 11:49 am, "රෂාන්" <rashan....@gmail.com> wrote:
> "if you're planning to create a parser AFAIK, ven
>
> > bhikku doesn't have a windoez box."
>
> we dont need a windows box. The deployment can be done in many ways,

True, I was referring to option #3, which Vishwa suggested. :)
_Personally_ I prefer his suggestion, because it avoids the
browser-related issues of option #1. I have also seen several cases where
the transformation engines fail under load when we use option
#2. This is just my personal
suggestion; the decision has to come from you guys, from those
who do the work :)

We're planning to set up a conference call with the bhikkhu as time permits,
to discuss font-related issues. Let us know if you
guys are interested and can shed some light.

thanks
TW



රෂාන්

Nov 28, 2007, 2:37:43 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
As promised, here is an example for case 2:

This code is for PHP5 with the php-xsl extension:

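<?php

// Load the XSLT stylesheet and hand it to the processor.
$stylesheet = new DOMDocument();
$stylesheet->load("suta.xsl");

$xsl = new XSLTProcessor();
$xsl->importStylesheet($stylesheet);

// Load the XML data and print the transformed result.
$data = new DOMDocument();
$data->load("mahagopala.xml");
echo $xsl->transformToXML($data);

?>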

just deploy the above file with the other two i uploaded earlier
(http://sinhala-unicode.googlegroups.com/web/Tipitaka_XSLT.zip) , on
the web server and see :)


As you can see, one could easily extend the above script (e.g. by making
the XML and XSL file names variables and selecting them according
to user request, browser type, action, etc.).




Nalaka Jayasena

Nov 28, 2007, 3:16:15 AM11/28/07
to Sinhala...@googlegroups.com
රෂාන්, විශ්ව කුමාර,

I am very happy about this work.

I tried out the example රෂාන් sent.
It works beautifully!

I did have to make a small change, though.
Just in case, I have zipped all three files and attached them.

Nalaka

--
Nalaka Jayasena (නාලක ජයසේන)

thripitaka-0711281340.zip

Tyronne Wickramaratne

Nov 28, 2007, 7:02:27 AM11/28/07
to Sinhala...@googlegroups.com
i have done the java cut, you can execute it from the command line if you have java installed. you can download java at : http://java.sun.com/javase/downloads/index_jdk5.jsp
-TW






--
/usr/local/tyronne
http://labs.jboss.com/jbossmessaging/
ändern ist gewesen geändert worden
dev.zip

Tyronne Wickramaratne

Nov 28, 2007, 7:09:00 AM11/28/07
to Sinhala...@googlegroups.com
oops.. i have attached the wrong one.. this is the correct one
-TW
dev_correct_one.zip

රෂාන්

Nov 28, 2007, 7:34:34 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Excellent, I have tested your new one, and it is enough to process
all the files in batch mode (on Linux).

Now I think we have the tools required to complete the final part of
deploying the web site (i.e. converting the XML files + XSLT to HTML
files).

So, to summarize the current status:
1. Convert the entire web site in the Tipitaka font to plain
text (less than 5 minutes) - Complete
2. Convert all the plain-text files in the Tipitaka font to
Unicode text (less than 5 minutes?) - Complete
3. Convert all the plain-text files in the Tipitaka font to
XML (less than 5 minutes?) - Can be completed once the XML structure
has been finalized
4. Proofread the content by some method (can use some or all of
the above) - To be done
5. Design a few page layouts - To be done
6. Write some XSLT files based on the page layouts - To be done
7. Generate XHTML using the outputs of 3 and 6 (about 5
minutes) - Complete





රෂාන්

Nov 28, 2007, 7:45:38 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Hey Tyronne,
The following line (in dev_correct_one.zip) should be
changed so that the default file name works; replace
if (args.length <= 2 )
with
if (args.length < 2 )



Tyronne Wickramarathne

Nov 28, 2007, 8:01:55 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
hi Rashan,

OK, no problem, it's all yours. :) BTW, if you like, I can create
a GUI for that, which will accept the input files and generate
the output file.
-TW

රෂාන් අනුෂ්ක

Nov 28, 2007, 8:06:30 AM11/28/07
to Sinhala...@googlegroups.com
No need, what we have is more than enough :)

Tyronne Wickramarathne

Nov 28, 2007, 8:10:39 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group


On Nov 28, 6:06 pm, "රෂාන් අනුෂ්ක" <rashan....@gmail.com> wrote:
> No need, what we have is more than enough :)
>
cool. that saved lot of time :) let's focus on the rest of the items.
-TW
> Rashan Anushka
> eLearning Center
> University of Colombo School of Computing
> Sri Lanka

විශ්ව කුමාර

Nov 28, 2007, 8:39:45 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
We cannot use sophisticated server technologies, because there is a
requirement to deploy the web site on CDs. So it has to be in plain
HTML in the end.

Option 2 is suitable for information that changes with time, such as
Weather, Stock exchange etc...

If we are using our own installation of Wiki, we can install
phpMyAdmin there to run SQL scripts to fill the database at once. Or
else we will have to upload page by page. And also we can make a PHP
WikiBot to fetch data from XML and fill the database. But wikimedia
will not allow such a bot.

What do you think about the Stub I started on Wikibooks...
http://si.wikibooks.org/wiki/%E0%B6%AD%E0%B7%8A%E2%80%8D%E0%B6%BB%E0%B7%92%E0%B6%B4%E0%B7%92%E0%B6%A7%E0%B6%9A_%E0%B6%B4%E0%B7%9C%E0%B6%AD%E0%B7%8A_%E0%B7%80%E0%B7%84%E0%B6%B1%E0%B7%8A%E0%B7%83%E0%B7%9A

It takes only 5 seconds in VB.net to convert all 733 XML files to HTML
with a given Page Header/Footer Paragraph Header/Footer.

But the conversion from Tipitaka_Sinhala1 to Unicode took nearly 1
hour. I don't think we have to do that again.

We don't have to convert the text to XML again, because it was
stored in XML files in the first place. I'll attach them. But we will need to make
slight changes to those XML files to use them with XSLT.

The XML files: http://groups.google.com/group/Sinhala-Unicode/web/SiUnicodeTipitakaDataFiles%283%29.zip

And what do you think about rendering the text to image files?

(But before that we have to agree on a font. I use Iskola Potha) And a
new font will be developed for this purpose. We can use Ola leaf
background for the images. Someone with OpenGL skills can develop an
animation to simulate the page flip effect. I can visualize the final
outcome. The page flip can also be done using Flash MX, but I don't
know how.

See: http://www.enya.com/
Go to Roma's Library in the castle and take a book from the shelf, you
can flip pages. It is a good inspiration for a web library.
But what I picture is a bundle of ola leaves tied together with a string.
It gives more of a feeling of reading than a mechanical e-book does.

Danishka Navin

Nov 28, 2007, 8:57:53 AM11/28/07
to Sinhala...@googlegroups.com

රෂාන්

Nov 28, 2007, 9:00:34 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
On Nov 28, 6:39 pm, "විශ්ව කුමාර" <vishva8kum...@gmail.com> wrote:
> We cannot use sophisticated server technologies, because there is a
> requirement to deploy the web site on CDs. So it has to be in plain
> HTML in the end.

Fine, we can go with the one-time transformation, as the required
tools are already available.
BTW, XSLT has been a W3C Recommendation since 1999
(http://www.w3.org/TR/xslt), and
almost all server technologies support it.

> Option 2 is suitable for information that changes with time, such as
> Weather, Stock exchange etc...

And also for cases where the layouts change or we need to add a new layout quickly.

> What do you think about the Stub I started on Wikibooks... http://si.wikibooks.org/wiki/%E0%B6%AD%E0%B7%8A%E2%80%8D%E0%B6%BB%E0%...
That is nice, may be we can use a wiki for the metta web site too (by
restricting editors)

> It takes only 5 seconds in VB.net to convert all 733 XML files to HTML
> with a given Page Header/Footer Paragraph Header/Footer.
> But the convertion from Tipitaka_Sinhala1 to Unicode took nearly 1
> hour. I don't think we have to do that again.

If there are errors that can easily be corrected by modifying the
converter, then we should do that, since it will save more than 1 hour when
we proofread the files manually. Anyway, I will see whether there is a way
to speed up the process.

රෂාන්

Nov 28, 2007, 9:05:36 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
So we dont need to think about the Pali part now.

On Nov 28, 6:57 pm, "Danishka Navin" <danis...@gmail.com> wrote:
> http://tipitaka.org/sinh/

රෂාන් අනුෂ්ක

Nov 28, 2007, 9:09:41 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
They too use XML files and do the transformation on the fly.

To see the files:
http://tipitaka.org/sinh/cscd/

Maybe we can get some help from this to come up with an XML structure :)



රෂාන් අනුෂ්ක

Nov 28, 2007, 9:12:08 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
And the style sheet they use is
http://tipitaka.org/sinh/cscd/tipitaka-sinh.xsl

Unfortunately I have some serious work tomorrow, so can you guys look into this?

විශ්ව කුමාර

Nov 28, 2007, 9:31:31 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
> If there are errors that can be easily corrected by modifying the
> converter, then we should, since that will save more than 1hour when
> we proof read them manually. Anyway i will see whether is there a way
> to speed up the process.

Yes, I agree with that. But I could not find many errors in the
converter; most of the errors came from the original text. Could
someone check for errors introduced by the conversion?

The Reepaya problem is solved. But the Halpilla is sometimes
doubled in some places. When the Diirgha ispilla and Rakaaraanshaya
come together in legacy fonts, they appear in no fixed order. The
converter does not handle this properly; we have to correct that.
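
Both problems might be fixable as a clean-up pass on the Unicode output. The sketch below rests on two assumptions about what the converter emits: the Halpilla (U+0DCA) appears two or more times in a row, and an ispilla sign sometimes lands before the Rakaaraanshaya sequence (U+0DCA, U+200D, U+0DBB) instead of after it, which is where Unicode text expects the vowel sign. The function name normalise is made up.

```python
import re

AL_LAKUNA = "\u0dca"  # Halpilla
ZWJ = "\u200d"        # zero width joiner
RAYANNA = "\u0dbb"    # the letter ra
RAKARANSAYA = AL_LAKUNA + ZWJ + RAYANNA
# Assumption: only the short and long ispilla signs need reordering.
ISPILLA_SIGNS = "\u0dd2\u0dd3"

DOUBLED_AL_LAKUNA = re.compile(AL_LAKUNA + "{2,}")
SIGN_BEFORE_RAKARANSAYA = re.compile(
    "([" + ISPILLA_SIGNS + "])(" + RAKARANSAYA + ")"
)


def normalise(text):
    """Collapse repeated Halpilla and move ispilla after Rakaaraanshaya."""
    text = DOUBLED_AL_LAKUNA.sub(AL_LAKUNA, text)
    return SIGN_BEFORE_RAKARANSAYA.sub(r"\2\1", text)
```

Doing this on the output keeps the mapping table itself unchanged, at the cost of one extra pass over each paragraph.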

We also have to remove the page numbering ( "[\q zzz /]" ). In the
XML, this comes out as a separate paragraph.

We can introduce binary search for the mapping table's Find method.
Currently it uses linear search, since the algorithm was meant for
converting short sentence parts while I'm typing in Singlish. With N
entries in the table, binary search cuts each lookup from about N
comparisons to about log2(N).
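
As an illustration of the lookup, here is a sketch using Python's standard bisect module. It is not the converter's actual code; build_table and find are invented names, and the table must be sorted once before any lookups.

```python
from bisect import bisect_left


def build_table(pairs):
    """Sort (legacy, unicode) pairs once so lookups can use binary search."""
    ordered = sorted(pairs)
    keys = [legacy for legacy, _ in ordered]
    values = [uni for _, uni in ordered]
    return keys, values


def find(keys, values, key):
    """Return the mapped value for key, or None if it is not in the table."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None
```

A hash table (a Dictionary in .NET) would avoid the sorting step altogether, so that may be worth comparing as well.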


> And if the layouts changes or need to add a new layout quickly.

Oh, yeah. I forgot that...

Tyronne Wickramarathne

Nov 28, 2007, 11:04:09 AM11/28/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
OK, I wrote a simple Java program to get the list of files and transform
them into .html.
I think this will save some time, and we can improve it and reuse it
in future as well.

1 ) $ ls > files.txt
2 ) remove the files.txt entry from the files.txt
3 ) move the FileContentReader.java to the folder where you have all
the xml stuff
4 ) copy the *.xslt file to the same directory where you have the .xml
stuff
5 ) compile the java file.. run it
6 ) it will generate html files for corresponding xml stuff

This is just a quick and dirty piece of code I wrote... you guys can refactor
it so that a few of the steps above can be avoided.
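
For instance, steps 1 and 2 go away if the program lists the XML files itself. Here is a sketch of that idea in Python rather than Java; convert_all is an invented name, and the transform argument stands for whatever turns one XML document into HTML (an XSLT processor, for example).

```python
from pathlib import Path


def convert_all(src_dir, out_dir, transform):
    """Write one .html file per .xml file found directly in src_dir."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # glob replaces the hand-made files.txt listing.
    for xml_path in sorted(src.glob("*.xml")):
        html_path = out / (xml_path.stem + ".html")
        html_text = transform(xml_path.read_text(encoding="utf-8"))
        html_path.write_text(html_text, encoding="utf-8")
        written.append(html_path.name)
    return written
```

Writing the output to a separate directory also means a later run never picks up its own results.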
-TW



Tyronne Wickramaratne

Nov 28, 2007, 11:10:26 AM11/28/07
to Sinhala...@googlegroups.com
errr. damn it.. here's the attachment  :
dev_1.1.zip

රෂාන්

Nov 29, 2007, 1:08:44 AM11/29/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Hey, I appreciate your work, but I was thinking of using a bash script with
your previous Java program to do the conversion.
That is the same technique I used to convert the HTML to plain text.

Anyway, your program won't be a waste; now we can do the conversion on a
Windows box too :)

I will check more on the pali tipitaka link sent by Danishka and
update you on any findings.



SRIshanu

Nov 29, 2007, 1:34:35 AM11/29/07
to Sinhala...@googlegroups.com
Another page about the Tripitaka that has been started on Wikipedia (not Wikibooks)...

http://si.wikipedia.org/wiki/%E0%B6%AD%E0%B7%8A%E2%80%8D%E0%B6%BB%E0%B7%92%E0%B6%B4%E0%B7%92%E0%B6%A7%E0%B6%9A%E0%B6%BA

--
ශ්‍රීශානු - SRIshanu
http://srishanu.blogspot.com
http://lankahistory.blogspot.com

Tyronne Wickramarathne

Nov 29, 2007, 1:42:02 AM11/29/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
hey Rashan,

On Nov 29, 11:08 am, "රෂාන්" <rashan....@gmail.com> wrote:
> hey, I appreciate your work, but i thought using a bash script and
> your previous java prog to do the conversion.
> That is the same technique i used to convert html to plain text.
>
> ANW your prog wont be a wast, now we can do the conversion in a
> windows box too :)
>

No problem. I did the second one while waiting for a 30-minute JUnit test
run to complete :) so it's not a waste at all, and we can use
it on all platforms.
Keep me posted on how things proceed. BTW, let me know if you're
interested in joining the call with the bhikkhu about the new Sinhala fonts.

> I will check more on the pali tipitaka link sent by Danishka and
> update you on any findings.

sure thing.
- TW

රෂාන්

Nov 29, 2007, 1:43:41 AM11/29/07
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
Yes. This gives a good introduction to the Tripitaka. At the end of that
page there is also a link to our experimental site (the one විශ්ව made) :)


Bhikkhu Mettavihari

Dec 25, 2007, 11:22:00 AM12/25/07
to Sinhala...@googlegroups.com
Dear Friends

It looks like you have done all the work for me,  while I have been just silent here.

I would like to know what the next step forward is on this subject.

It would be a bit difficult to have texts of this kind of nature on a wiki, since they have to be authentic in nature. What I mean is: we have taken the text from A. P. Soiza's translations and they must be in accordance with his books, so public editing is not possible. However it is possible that we get together and edit the page numbers to be in accordance with the books.

I would again like to mention that there are a few things I would like to have.

  1. That we keep our setup close to the index pages with links to the different languages
  2. That I can make it available on a CD.
  3. That we keep the 8 bit system for at least one year along with the Unicode version. How is that best done ?
  4. Please let me know where I can get a copy of the newly converted text and get the assistance of some monks to proof read the texts.
  5. If some of you would like to help with the proofreading, you should have a book to work with. Books are available from me, but it is a big job.
with metta
and much gratitude for all your help.

Mettavihari

විශ්ව කුමාර

Jan 31, 2008, 7:33:02 AM1/31/08
to සිංහල යුනිකෝඩ් සමූහය - Sinhala Unicode Group
The Unicode Translation can be downloaded from
http://groups.google.com/group/Sinhala-Unicode/web/SiUnicodeTipitakaDataFiles%283%29.zip

It is currently in XML (Database) files. But these files can be viewed
on a Web Browser.

Those are stored in XML files because it is easier to transform to
"any other" format.

The 8-bit version is also stored in parallel in those files.