FW: ERD comments

Martin Wunderlich

Aug 6, 2005, 9:17:13 AM
to uw-cre...@googlegroups.com

Hi Gerard,

I hope you're enjoying WikiMania. I tried to listen in earlier on the radio,
but it didn't work. Good luck for the workshop and the talk.

As the subject line says, I had a chance to take a look at the ERD and there
are a few things to comment on. Now, this is from a translator's point of view
and on top of that I am not much of a database designer. Anyway, here's what
came to my mind:

- I presume for the purpose of storing translations, you would use the table
Relations, right? You would create a relation called "translation" and use
that to link two words by means of the MeaningID.

- I was wondering if it would not be possible to treat etymology the same
way, i.e. create a relation for that. This would eliminate the need for the
extra table.

- Also, it seems like at the moment the design does not support different
spellings of the same word in the same language. In German, for instance,
there has been this attempt at reforming orthography, which was later
re-reformed and it's a big mess right now. So, you have two different
spellings for German words sometimes: old and new orthography.

- Also, if that is supported, it would be possible to store historical data,
and you could trace how words change over the decades and centuries, a bit
like the info in the OED. That would be interesting. Have you checked whether
the data structure of the OED could be reflected in UW?

- Another thing I was wondering about is how you plan to store the part of
speech. Would that be in English or the same language as the word itself? Is
there a standard list perhaps that covers all the possible parts of speech of
all languages?

Right, that's my comments there. Hope I haven't said anything too stupid and
perhaps it helps.

I am copying this to the google list, just in case.

Cheers,

Martin

Gerard Meijssen

Aug 10, 2005, 1:11:14 PM
to uw-cre...@googlegroups.com
Martin Wunderlich wrote:

>Hi Gerard,
>
>I hope you're enjoying WikiMania. I tried to listen in earlier on the radio,
>but it didn't work. Good luck for the workshop and the talk.
>
>As the subject line says, I had a chance to take a look at the ERD and there
>are a few things to comment on. Now, this is from a translator's point of view
>and on top of that I am not much of a database designer. Anyway, here's what
>came to my mind:
>
>- I presume for the purpose of storing translations, you would use the table
>Relations, right? You would create a relation called "translation" and use
>that to link two words by means of the MeaningID.
>
>
Translations will go in the table SynTrans; this table will include
synonyms, translations and alternative orthographies.

>- I was wondering if it would not be possible to treat etymology the same
>way, i.e. create a relation for that. This would eliminate the need for the
>extra table.
>
>
An etymology is a text explaining how a word came into being. It needs
to be in a separate table because this text needs translating. In the
Relation table we use templates based on meta-data; each piece of
meta-data has a single translation, which it shares with all the data
that uses it. Etymology is therefore essentially different.

>- Also, it seems like at the moment the design does not support different
>spellings of the same word in the same language. In German, for instance,
>there has been this attempt at reforming orthography, which was later
>re-reformed and it's a big mess right now. So, you have two different
>spellings for German words sometimes: old and new orthography.
>
>
The table ValidSpelling allows you to specify from what date to what
date a spelling is valid. Historical data CAN be included; particularly
because of the coming Dutch spelling change in August 2006 was
considered in
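
As a rough illustration of the validity window described above, here is a
minimal Python sketch. The field names (spelling, valid_from, valid_to) and
the example dates are assumptions made for illustration only; the actual
columns of ValidSpelling are not given in this thread.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ValidSpelling:
    """Hypothetical row shape; not the real ValidSpelling table definition."""
    spelling: str
    valid_from: Optional[date]  # None = no known start date
    valid_to: Optional[date]    # None = still valid


def spellings_valid_on(rows: List[ValidSpelling], day: date) -> List[str]:
    """Return the spellings whose validity window contains `day` (inclusive)."""
    return [
        r.spelling
        for r in rows
        if (r.valid_from is None or r.valid_from <= day)
        and (r.valid_to is None or day <= r.valid_to)
    ]


# Old and new German orthography for the same word; dates are illustrative.
rows = [
    ValidSpelling("daß", None, date(1998, 7, 31)),
    ValidSpelling("dass", date(1998, 8, 1), None),
]

print(spellings_valid_on(rows, date(1990, 1, 1)))
print(spellings_valid_on(rows, date(2005, 8, 6)))
```

Leaving either bound open (None) is one way to keep historical spellings
and current ones in the same table.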

>- Also, if that is supported, it would be possible to store historical data,
>and you could trace how words change over the decades and centuries, a bit
>like the info in the OED. That would be interesting. Have you checked whether
>the data structure of the OED could be reflected in UW?
>
>
I have never seen the OED nor its data design. Therefore I cannot make a
comparison.

>- Another thing I was wondering about is how you plan to store the part of
>speech. Would that be in English or the same language as the word itself? Is
>there a standard list perhaps that covers all the possible parts of speech of
>all languages?
>
>
Parts of speech (the abstract) will be in a separate table called
PartOfSpeech, so if we find one more part of speech we will just add
it to the database.

>Right, that's my comments there. Hope I haven't said anything too stupid and
>perhaps it helps.
>
>I am copying this to the google list, just in case.
>
>Cheers,
>
>Martin
>
Thanks,
Gerard

Martin Wunderlich

Oct 23, 2005, 7:47:19 AM
to uw-cre...@googlegroups.com
Dear all,

Based on the discussion on this list so far and on some conversations with
Gerard, I've tried to come up with some specs/requirements for the UW client
application (aka reference implementation). Now, this is very basic stuff so
far and there are a lot of details to fill in and open issues to discuss. In
case the RTF attachment doesn't make it through to the list, I'll post the
text in this message below.

This list has been dormant for quite a while now, so I am hoping to revive
it and I am looking forward to feedback and discussion. It would also be
interesting to get some status update on the progress of the UW project. Any
news there, Gerard?

Cheers,

Martin




Specifications for UW reference implementation

1 Product Constraints
1.1 Purpose
The reference implementation for Ultimate Wiktionary provides a client
application with glossary functionality that uses Ultimate Wiktionary as its
back-end.
The two main functions will be the retrieval and addition of terms and their
translations from the UW database. A verification process for new additions
has to be provided in order to ensure high quality of content. The state of
a particular entry (verified or not) will be visible to the user.
This reference implementation will show the basic functionality of UW and
demonstrate the usefulness of UW as a terminology repository for
translators.

1.2 Users
The anticipated users will be translators, both professional and from the
open-source domain, both in-house and freelancers.

1.3 Requirements Constraints
The client application will be connected to UW via the internet. The client
application must allow for slow connections (dial-up modem) and tolerate the
sudden loss of connectivity. A local database will facilitate a certain
independence from an internet connection.
The client application shall be open source and freely available, as well as
platform independent (implementation note: use Java). The timeframe for the
delivery is envisaged to be the end of the year (as a present for
Christmas/Hanukkah/winter solstice).

1.4 Naming Conventions and Definitions
- Ultimate Wiktionary (UW): A freely accessible database of dictionary and
terminology entries.
- Reference implementation or client application: The software program that
can be used to retrieve and add UW contents.
- Term or word: Corresponds to an entry in the table “word” in the UW
database.
- Source language: The language from which a translation is made;
- Target language: The language into which a translation is made;
- Glossary: A collection of words with their translation for a specific
subject area. May also contain additional context data for the word(s).
- Subject area: specialist field or field of knowledge that is associated
with a word; corresponds to the UW table “collections”.
- (un)verified: refers to the status of the translation of a term.
“Unverified” means the translation has been added by a user, but hasn’t been
confirmed yet. “Verified” means the translation has been confirmed by three
other users.
- UW database (UWDB): The database of UW.
- Local database (LDB): The database that is kept on the user’s machine.
This is a subset of the UWDB, based on the user’s preferences for source
language, target language and subject area.
- Use case: A use case is made up of a set of possible sequences of
interactions between the system and the users and related to a particular
goal.
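
The "(un)verified" definition above can be sketched as a status derived from
a confirmation count. The three-level mapping follows the traffic-light note
in 2.2.2.6 further down; the names Light, VERIFIED_THRESHOLD and
traffic_light are hypothetical and only serve the illustration.

```python
from enum import Enum

# "Verified" = confirmed by three other users, per the definition above.
VERIFIED_THRESHOLD = 3


class Light(Enum):
    RED = "unverified"
    YELLOW = "confirmed by one or two users"
    GREEN = "verified"


def traffic_light(confirmations: int) -> Light:
    """Map the number of confirming users to a display status."""
    if confirmations < 0:
        raise ValueError("confirmations cannot be negative")
    if confirmations == 0:
        return Light.RED
    if confirmations < VERIFIED_THRESHOLD:
        return Light.YELLOW
    return Light.GREEN


for n in (0, 1, 3):
    print(n, traffic_light(n).name)
```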

1.5 Use cases
The client application will support the following use cases:
- Look up translation
- Add translation
- Suggest different translation
- Confirm unverified translation
- Contest unverified translation
- Export as TBX
- Import from TBX
- Synchronise UWDB with LDB.

1.6 Relevant Facts

1.7 Assumptions

2 Functional Requirements

2.1 Scope of the Product
2.2 Functional and Data Requirements
2.2.1 General
2.2.1.1 The client application shall authenticate the user. (implementation
note: use entry of username and password for this and check against data
stored in UWDB).
2.2.1.2 The client application shall allow a new user to register
him/herself. (open issue: The extent of the data required for registration
needs to be defined. E.g. is a username and an e-mail address enough? How
can you avoid anonymous or fake users?)
2.2.1.3 The client application shall check whether a user is correctly
authenticated (implementation note: use a cookie for this; open issue: need
to clarify when the application checks this; before every operation, such as
look up?)
2.2.1.4 The client application shall store the user’s preferences,
comprising preferred language combination, one or more subject areas, save
password y/n, and preferred UI language (optional; if blank, use OS default).
(open issue: It would also be possible to include some statistical data here,
such as entries added, entries verified etc. However, for the reference
implementation this is not necessary.)
(open issue: A list of specialist fields could be constructed either from the
collections or from a standard list such as the Dewey taxonomy. It has to be
clarified what language that list should be in and how to synchronise the
same list for different languages. All users of UW should be working from
the same list.)
2.2.2 Look up translation
2.2.2.1 The client application shall accept the following input: source
language, target language, subject area (optional), source term.
2.2.2.2 The client application shall provide search for exact entry, search
case (in)sensitive, regex/wildcards search.
2.2.2.3 As an option the user may select related terms, by picking a
relation type from a list.
2.2.2.4 The user’s preferences shall be inserted as default, but they can be
changed by the user.
2.2.2.5 The client application shall retrieve any matching target terms from
the LDB.
2.2.2.6 The client application shall display the retrieved target terms to
the user, including display of their current status in a clear manner
(verified, unverified). (implementation note: a traffic light kind of icon
could be used: red=unverified; yellow= verified by one or two people; green=
verified by three or more people).
2.2.2.7 The client application shall allow the user to easily verify a term
(open issue: does this make sense? After all, the user is looking up a term,
so how would (s)he know whether the term is correct?)
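
The look-up inputs and search modes in 2.2.2.1 to 2.2.2.5 could be sketched
roughly as follows, in Python for brevity. The LDB is modelled as a plain
list, and the Entry fields, the mode names and the look_up signature are all
assumptions for illustration, not part of the UW design.

```python
import fnmatch
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical LDB record; not the real UW schema."""
    source_lang: str
    target_lang: str
    source_term: str
    target_term: str
    subject_area: Optional[str] = None


def matches(term: str, query: str, mode: str, case_sensitive: bool) -> bool:
    """Compare one stored term against the query in the chosen search mode."""
    flags = 0 if case_sensitive else re.IGNORECASE
    if mode == "exact":
        if case_sensitive:
            return term == query
        return term.casefold() == query.casefold()
    if mode == "wildcard":
        # fnmatch.translate turns a shell-style pattern (*, ?) into a regex.
        return re.match(fnmatch.translate(query), term, flags) is not None
    if mode == "regex":
        return re.fullmatch(query, term, flags) is not None
    raise ValueError(f"unknown search mode: {mode}")


def look_up(ldb: List[Entry], source_lang: str, target_lang: str,
            source_term: str, subject_area: Optional[str] = None,
            mode: str = "exact", case_sensitive: bool = True) -> List[str]:
    """Return matching target terms; subject_area is optional (2.2.2.1)."""
    return [
        e.target_term
        for e in ldb
        if e.source_lang == source_lang
        and e.target_lang == target_lang
        and (subject_area is None or e.subject_area == subject_area)
        and matches(e.source_term, source_term, mode, case_sensitive)
    ]


ldb = [
    Entry("de", "en", "Haus", "house"),
    Entry("de", "en", "Hausaufgabe", "homework"),
    Entry("de", "en", "Maus", "mouse", "computing"),
]
print(look_up(ldb, "de", "en", "Haus*", mode="wildcard"))
```

In a real client the user's stored preferences would supply the defaults for
the language pair and subject area (2.2.2.4).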
2.2.3 Add translation
2.2.3.1 The client application shall allow the user to add a translation for
a given term. (open issue: How do you determine the source term for adding a
translation? suggestion: use terms the user has looked up, but for which no
translation exists; however: would we have to store such unsuccessful
lookups?).
2.2.4 Suggest different translation
2.2.4.1 The client application shall allow a user to suggest a different
translation for an existing one. (open issue: should there be a discussion
page, similar to Wikipedia or Leo where people can exchange opinions? How
can conflicts be resolved? Should there be a voting procedure?).
2.2.5 Confirm unverified translation
2.2.5.1 The client application shall provide an easy way to confirm an
unverified translation. (see 2.2.2.7).
2.2.6 Contest unverified translation
2.2.6.1 The client application shall provide a mechanism to contest
unverified translations. Note: This does not necessarily mean that the user
has to suggest a different translation. (open issue: should it also be
possible to contest verified translations or are they considered “frozen”?
If they can be contested, should it be harder than for unverified
translations?)
2.2.7 Synchronise
2.2.7.1 The client application shall synchronise the LDB with the UWDB.
2.2.7.2 For this purpose, the client application shall (upon user request)
get all changes made to the LDB since the last date/time of synchronisation
and compare them with the changes made to the UWDB.
2.2.7.3 Any additions to the UWDB that do not conflict with the LDB (that is
pure additions) shall be transferred to the LDB.
2.2.7.4 The user shall be able to select whether changes from the UWDB
overwrite the LDB automatically or whether the user has to confirm any
changes made to the LDB.
2.2.7.5 The client application shall provide a mechanism to resolve any
conflicts. (A conflict will arise when two or more users have added the same
translation with different target terms; implementation note: this could be
treated as the same use case as “suggest different translation” or “contest
unverified translation”).
2.2.7.6 If several users have added the same translation with identical
target terms, then the status will be set accordingly (e.g. to verified if
three users add the same term at the same time).
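
The comparison step in 2.2.7.2 to 2.2.7.6 might look roughly like the sketch
below. It assumes each side's changes since the last synchronisation are
available as a mapping from (source language, target language, source term)
to target term; that representation and the names SyncPlan and plan_sync are
hypothetical. Uploading local additions to the UWDB is not covered here.

```python
from typing import Dict, List, NamedTuple, Tuple

Key = Tuple[str, str, str]  # (source language, target language, source term)


class SyncPlan(NamedTuple):
    to_local: Dict[Key, str]               # pure additions from UWDB (2.2.7.3)
    agreed: List[Key]                      # identical target terms (2.2.7.6)
    conflicts: Dict[Key, Tuple[str, str]]  # (local, remote) targets (2.2.7.5)


def plan_sync(local_changes: Dict[Key, str],
              remote_changes: Dict[Key, str]) -> SyncPlan:
    """Classify UWDB changes against LDB changes since the last sync."""
    to_local: Dict[Key, str] = {}
    agreed: List[Key] = []
    conflicts: Dict[Key, Tuple[str, str]] = {}
    for key, remote_target in remote_changes.items():
        if key not in local_changes:
            to_local[key] = remote_target
        elif local_changes[key] == remote_target:
            agreed.append(key)
        else:
            conflicts[key] = (local_changes[key], remote_target)
    return SyncPlan(to_local, agreed, conflicts)


local = {("de", "en", "Haus"): "house", ("de", "en", "Maus"): "mouse"}
remote = {
    ("de", "en", "Haus"): "house",
    ("de", "en", "Maus"): "mice",
    ("de", "en", "Baum"): "tree",
}
plan = plan_sync(local, remote)
print(plan.conflicts)
```

Entries in "agreed" would count towards verification, and entries in
"conflicts" would be handed to whatever conflict-resolution use case is
finally chosen.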
2.2.8 Export TBX
2.2.8.1 The client application shall provide a function to export contents
to a TBX file. (open issue: should the application also allow other simple
formats such as CSV or tab-delimited text files?)
2.2.8.2 The default contents will be based on the user preferences, but
other content can be selected.
2.2.8.3 If the user decides to select other content, then it has to be
downloaded first from UWDB.
2.2.9 Import TBX
2.2.9.1 The client application shall provide a function to import contents
from a TBX file. (open issue: should the application also allow other simple
formats such as CSV or tab-delimited text files?)
2.2.9.2 Imported data can be added to the LDB or to the UWDB or both.
(contents added to LDB only will be synchronised anyway).
2.2.9.3 This function should have a preview so that the user can check what
will be imported and sent to UW.
2.2.10 API
There should be an open API so that the application can be easily integrated
into other (open source) programs. The application that uses this API should
include a function for marking and adding terms. The application needs to
have two glossaries; a proprietary part and a public part. They stay apart
but they have the same functionality within the application.

This API should provide as a minimum the following methods:
2.2.10.1 login / authentication
2.2.10.2 getEntry(language)
2.2.10.3 getEntry(language, collection)
2.2.10.4 getTranslation(source language, target language)
2.2.10.5 getTranslation(source language, target language, collection)
2.2.10.6 setEntry(language)
2.2.10.7 setEntry(language, collection)
2.2.10.8 setTranslation(source language, target language)
2.2.10.9 setTranslation(source language, target language, collection)
2.2.10.10 verifyEntry(entry)
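
To make the method list above more concrete, here is a toy in-memory sketch
in Python. The spec's signatures do not say how the term itself is passed or
what is returned, so the extra parameters, the return types, the snake_case
names and the class InMemoryGlossary are all assumptions; the overloads with
and without a collection are folded into one optional argument.

```python
from typing import Dict, List, Optional


class InMemoryGlossary:
    """Toy stand-in for the proposed API; it never talks to the real UWDB."""

    def __init__(self, users: Dict[str, str]):
        self._users = users          # username -> password (sketch only)
        self._user: Optional[str] = None
        self._entries = set()        # (term, language, collection)
        self._translations = set()   # (src term, src lang, tgt term, tgt lang, coll)
        self._confirmations = {}     # (term, language) -> set of usernames

    def login(self, username: str, password: str) -> bool:      # 2.2.10.1
        ok = self._users.get(username) == password
        self._user = username if ok else None
        return ok

    def _require_login(self) -> None:
        if self._user is None:
            raise PermissionError("login required")

    def set_entry(self, term: str, language: str,
                  collection: Optional[str] = None) -> None:     # 2.2.10.6/7
        self._require_login()
        self._entries.add((term, language, collection))

    def get_entries(self, language: str,
                    collection: Optional[str] = None) -> List[str]:  # 2.2.10.2/3
        return sorted(
            term for term, lang, coll in self._entries
            if lang == language and (collection is None or coll == collection)
        )

    def set_translation(self, source_term: str, source_language: str,
                        target_term: str, target_language: str,
                        collection: Optional[str] = None) -> None:  # 2.2.10.8/9
        self._require_login()
        self._translations.add(
            (source_term, source_language, target_term, target_language, collection)
        )

    def get_translations(self, source_term: str, source_language: str,
                         target_language: str,
                         collection: Optional[str] = None) -> List[str]:  # 2.2.10.4/5
        return sorted(
            tgt for src, sl, tgt, tl, coll in self._translations
            if (src, sl, tl) == (source_term, source_language, target_language)
            and (collection is None or coll == collection)
        )

    def verify_entry(self, term: str, language: str) -> int:     # 2.2.10.10
        """Record one confirmation per user; return the confirmation count."""
        self._require_login()
        voters = self._confirmations.setdefault((term, language), set())
        voters.add(self._user)
        return len(voters)


glossary = InMemoryGlossary({"martin": "s3cret"})
glossary.login("martin", "s3cret")
glossary.set_translation("Haus", "de", "house", "en")
print(glossary.get_translations("Haus", "de", "en"))
```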

3 Non-functional Requirements

3.1 Look and Feel Requirements
The look-and-feel of the client application shall adapt to the OS.
3.2 Usability Requirements
The user shall be able to use the client application for retrieving terms in
less than 10 minutes after installation.
The technology must be hidden from the user as far as possible.
The client application shall be easy to use for non-tech-savvy translators.
3.3 Performance Requirements

3.4 Operational Requirements
Regarding the transfer, it must be taken into account that not everyone is
online all the time. To allow for dial-up connections, a queuing mechanism
should be in place that checks for pending transmissions and that operates
automatically or triggered by the user (preference).
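
A queuing mechanism of this kind could be sketched as below. The send
callable, the use of ConnectionError to signal a lost connection and the
class name TransmissionQueue are assumptions for illustration; how the
client would really talk to UW is still open.

```python
from collections import deque
from typing import Callable


class TransmissionQueue:
    """Keeps pending items until they can be sent, preserving their order."""

    def __init__(self, send: Callable[[object], None], auto_flush: bool = True):
        self._send = send            # expected to raise ConnectionError when offline
        self._pending = deque()
        self.auto_flush = auto_flush  # user preference: automatic vs. manual

    def submit(self, item: object) -> None:
        """Queue an item; try to send right away if auto_flush is on."""
        self._pending.append(item)
        if self.auto_flush:
            self.flush()

    def flush(self) -> int:
        """Send pending items in order; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # still offline: keep the item for the next attempt
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

An item is only removed from the queue after it has been sent, so a
connection that drops mid-transfer does not lose it.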
3.5 Maintainability and Portability Requirements
3.6 Security Requirements
The transmission of the data should be secure (implementation note: use
https?).
As a minimum the transfer of user data has to be secure. The transfer of
actual content doesn’t have to be secure, since it’s freely available
anyway.
3.7 Cultural and political Requirements
3.8 Legal Requirements

4 Project Issues

4.1 Open Issues (unresolved concerns; proposed changes)
See list above in requirements. Other than this:
4.1.1 Verification process
Users could be sent random non-verified entries, based on their preferences.
These entries pop up at certain intervals and the user can then accept or
reject the entry. The entry’s status changes accordingly. The interval at
which users receive such requests for verification can be set by the user,
but should have a certain minimum time span (e.g. once per day).
BTW: The ERD doesn’t seem to allow for assigning a status flag to an entry.
4.2 New Problems (caused by the new product)
4.3 Tasks (steps to build the product)
4.4 Cutover (data conversion, collection etc.; tasks for transition)
4.5 Risks
4.6 User Documentation
4.7 Waiting room (for later requirements)
UW Specs v2.rtf

Rodolfo Raya

Oct 23, 2005, 8:05:18 AM
to uw-cre...@googlegroups.com
On 10/23/05, Martin Wunderlich <mart...@gmx.net> wrote:

This list has been dormant for quite a while now, so I am hoping to revive
it and I am looking forward to feedback and discussion. It would also be
interesting to get some status update on the progress of the UW project. Any
news there, Gerard?

Hi Martin,

The specs mention as open issue the exchange of data in CSV format as an alternative to TBX. If translators have data in CSV format, they can convert the data to TBX using TBXMaker, a free download at http://www.maxprograms.com

Conversion from TBX to CSV is easy to do. I can prepare another free tool for that if it is necessary.

There is a VERY important detail that I mentioned previously on this list: It is necessary to define an XCS template for the data.

A TBX document consists of two files: one file with the data and another one with a description of the data. Without the description, you have nothing.

There is a design that specifies the tables required for the database. That design must be converted to XCS format and the XCS published before people can convert existing data to TBX. Further, the XCS can be used to refine the constraints of the database.

I don't have the current database design at hand. Having that, I can prepare a draft for the XCS template and submit it for further analysis.

Regards,
Rodolfo
