I hope you're all excited to get started on the Online Linguistic
Database project (OLD)! I want to get the discussion started with
this post. It's a bit of a long one, but I hope it can serve as a
reference and a jumping-off point for discussion.
Contents of this post:
- (A) the technologies I think we should use
- (B) the BLD database structure from 40,000 feet: basic concepts
- (C) the BLD database structure on the runway: details about
tables and columns
- (D) Ideas: features I think should be implemented or changed
Reviewing this document will give you a thorough understanding of the
workings of the BLD (Blackfoot Language Database) web application. If
you didn't already know, the BLD is an existing web application for
storing and sharing linguistic fieldwork data (for Blackfoot). The
OLD will build upon the concepts of the BLD, but will be written from
scratch (in Python instead of PHP!).
If you have no idea what the BLD is, the best way to get up to speed
is to check out the BLD screencast video Ian Larson and I made:
Please review this post and
*** GIVE COMMENTS, IDEAS, SUGGESTIONS, CRITICISMS, etc. ***.
If you don't have time to read the whole thing, read the "(D) Ideas"
section and comment on that. I want us to begin coding very soon, so
the sooner the feedback comes, the better. That being said, we can
always change things and implement new features at a later stage
(though it may be a little bit harder...)
Finally,
*** POST ALL COMMENTS TO THE MAILING LIST ***.
This will serve as a record and reference material for all of us (and
future contributors) as the project evolves.
Thanks everyone,
Joel
####################
####################
(A) OLD Technologies
####################
####################
Here is an overview of the technologies I think we should use:
- programming language: Python
- web framework: Pylons
- database (RDBMS): MySQL
- project repository: Google Project Hosting
(http://code.google.com/p/onlinelinguisticdatabase/)
- version control system: Mercurial
If you would like to suggest using something different from the above,
now's the time to speak up.
If you don't know what any of this means, don't worry. If you want to
learn, ask the group.
#########################
#########################
(B) BLD Database Overview
#########################
#########################
I want to start off the discussion by planning the backbone of the
system: the database.
Remember, our goal is to create a program that stores linguistic data
in a structure that makes sense to fieldwork linguists. We want the
system to be easy to learn and use; we want it to help us use data
more efficiently; we want to encourage consistency whenever possible.
I am describing the BLD (Blackfoot Language Database) database because
I think it's a good starting point for the OLD. From here we can
tweak the model as we see fit.
Primary Entities of the BLD:
1. Forms: morphemes, words or sentences; properties:
transcription, morphBreak, morphGloss, translation, comments, etc.
2. Files: images, text files, audio, video; properties:
description, dateEntered, etc.
3. Collections: ordered lists of Forms; properties: title, type,
description, dateEntered, etc.
Secondary Entities of the BLD:
4. Users: researchers & students who use the BLD; properties:
name, affiliation, email, permissions, settings, etc.
5. Speakers: speakers of the object language; properties: name,
dialect, proficiency, age, etc.
6. MorphoSyntacticCategories (MSC): linguistic categories of
Forms; properties: label, description
7. KeyWords: tags associated to Forms; properties: name,
description
8. ElicitationMethods: tags associated to Forms; properties: name,
description
Inter-relationships between Primary Entities:
- Forms can reference one or more Files; the same File may be
referenced by one or more Forms (many-to-many)
- Collections can reference one or more Files; the same File may
be referenced by one or more Collections (many-to-many)
- Collections can reference a sequence of 0 or more Forms; the
same Form may be referenced by one or more Collections (many-to-many)
Relationships between Primary and Secondary Entities:
- Forms/Files/Collections can reference zero or one Speaker
- Forms/Files/Collections can reference zero or one User (as
elicitor)
- Forms can reference zero or one MSC (e.g., sent, noun, verb,
etc.)
- Forms can reference zero or more KeyWords
- Forms can reference zero or one ElicitationMethods
These entities are stored in the BLD model as tables. I will now
detail the columns of these tables.
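To make the many-to-many relationships above a bit more concrete, here
is a minimal sketch of how a link table could store the Form/File
relationship. Everything in it is an assumption for illustration: the
table and column names are hypothetical, the sample rows are
placeholders, and I use Python's built-in sqlite3 module only so the
example is self-contained (the plan is still MySQL).

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with Forms, Files and a link table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE forms (formid INTEGER PRIMARY KEY, transcription TEXT);
        CREATE TABLE files (fileid INTEGER PRIMARY KEY, name TEXT);
        -- One row per (Form, File) pair: this is the many-to-many link.
        CREATE TABLE forms_files (
            formid INTEGER NOT NULL REFERENCES forms(formid),
            fileid INTEGER NOT NULL REFERENCES files(fileid),
            PRIMARY KEY (formid, fileid)
        );
    """)
    # Placeholder rows, not real BLD data.
    conn.executemany("INSERT INTO forms VALUES (?, ?)",
                     [(1, "form-a"), (2, "form-b")])
    conn.executemany("INSERT INTO files VALUES (?, ?)",
                     [(10, "a.wav"), (11, "b.jpg")])
    conn.executemany("INSERT INTO forms_files VALUES (?, ?)",
                     [(1, 10), (1, 11), (2, 10)])
    return conn


def files_for_form(conn, formid):
    """Names of all Files linked to one Form."""
    rows = conn.execute(
        "SELECT f.name FROM files f "
        "JOIN forms_files ff ON ff.fileid = f.fileid "
        "WHERE ff.formid = ? ORDER BY f.fileid", (formid,))
    return [row[0] for row in rows]


def forms_for_file(conn, fileid):
    """IDs of all Forms that reference one File."""
    rows = conn.execute(
        "SELECT formid FROM forms_files WHERE fileid = ? ORDER BY formid",
        (fileid,))
    return [row[0] for row in rows]
```

The point of the sketch is that both directions of the relationship
become ordinary queries, with no delimited ID strings to parse.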
#########################
#########################
(C) BLD Database Details
#########################
#########################
###################################
1. Forms table ("linguistic_forms")
###################################
1. "formid": unique integer, primary key, auto-increment
2. "form": (form) datatype: characters; transcription of morpheme,
word, phrase, sentence in object language; name should probably be
changed from "form" to "transcription" to avoid the Form/form
ambiguity.
3. "phontranscr": (phonetic transcription) datatype: characters;
not used in BLD; could be useful though ...
4. "morphbreak": (morpheme breakdown) datatype: characters;
component morphemes of a Form; '-' used as a delimiter by convention,
though others could be used additionally/instead (e.g., "=", "+")
5. "morphgloss": (morpheme gloss) datatype: characters; shorthand
gloss for each morpheme in morphbreak; '-' used as a delimiter by
convention, though others could be used additionally/instead (e.g.,
"=", "+")
6. "translation": datatype: characters; translation of form as
offered by speakers or deduced by elicitor; considered changes (see
below): facilitate multiple translations, allow translation "felicity
judgments", change label to "glosses"
7. "comments": (general comments) datatype: characters; a catch-
all field for general comments about the Form
8. "speaker": datatype: characters; speaker who uttered the form,
if applicable; values automatically drawn from Speakers table
9. "elicitor": datatype: characters; researcher who elicited the
form, if applicable; values automatically drawn from Users table
10. "source": datatype: characters; published textual source of
the Form, if applicable
11. "enterer": datatype: characters; BLD user who entered the
form, if applicable; values automatically drawn from Users table
12. "dateelicited": datatype: date; YYYY-MM-DD when Form was
elicited (or year of publication of textual source?); user forced to
choose from dropdown boxes in HTML form from list of valid dates;
current date is the default
13. "dateentered": datatype: date; YYYY-MM-DD when Form was
entered into the BLD; automatically populated by the system
14. "datemodified": datatype: date & time; YYYY-MM-DD hh:mm:ss
when Form was last modified; automatically populated by the system
15. "keywords": (key words) datatype: characters; to promote
consistency, list of possible values drawn from keywords (KeyWords)
table (keywords is User-editable). Note: system automatically adds
";" as a delimiter between keywords when entering them into the table.
16. "semfld": (semantic field) datatype: characters; not used by
the BLD; I initially thought this could be used for hand-entering
semantic relationships, but automated techniques could probably be
devised to do this, if desired...
17. "class": datatype: characters; not used by BLD
18. "syncat": (syntactic category) datatype: characters;
morphological or syntactic category of the Form, e.g., "sent" for
"sentence", "det" for "determiner"; list of possible values drawn from
syncat (MorphoSyntacticCategories) table (which is user-editable)
19. "grammaticality": datatype: characters; grammaticality of the
form as determined by speaker judgment; list of possible values is
hard-coded into the logic of the BLD but this should probably be
changed to a user-editable table in the db
20. "file": datatype: characters; string of zero or more delimited
File IDs representing Files linked to the Form; this should probably
21. "scomments": (speaker comments) datatype: characters; comments
from the speaker that relate to the Form
22. "method": (elicitation method) datatype: characters; method by
which the Form was elicited; list of possible values drawn from
elicitation_method (ElicitationMethod) table
23. "verifier": datatype: characters; name of a User who has
verified the accuracy of the Form (i.e., perhaps by re-eliciting it);
only one verifier currently possible per form; allow more?
24. "syncat_string": (morpho-syntactic category string) datatype:
characters; delimited string programmatically created from the values
in morphbreak and morphgloss: for each morpheme-gloss pair, if BLD
finds a match for that pair in another Form it uses its syncat value
to build the syncat_string of the original form.
25. "morphbreak_formids": (morpheme break form IDs) datatype:
characters; currently not used by BLD BUT SHOULD BE, cf. Ideas for
Change # 1
26. "morphgloss_formids": (morpheme gloss form IDs) datatype:
characters; currently not used by BLD BUT SHOULD BE, cf. Ideas for
Change # 1
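Here is a rough sketch of how the "morphbreak_formids" and
"morphgloss_formids" values might be computed once, when a Form is
saved, rather than on every view. The function name, the in-memory
lookup dict, and the "0" placeholder for unmatched items are all my
own assumptions; in the real system the lookup would be a database
query, and the sample morphemes below are dummies.

```python
import re

# Split on whitespace and on the delimiters mentioned above: "-", "=", "+".
SPLIT = re.compile(r"[\s\-=+]+")


def compute_formids(line, index, missing="0"):
    """Return a "-"-joined string of Form IDs for one morphbreak or
    morphgloss line.

    index maps a morpheme (or gloss) to the ID of the Form that
    represents it. Items with no matching Form get the (assumed)
    placeholder given by `missing`.
    """
    parts = [part for part in SPLIT.split(line) if part]
    return "-".join(str(index.get(part, missing)) for part in parts)


# Hypothetical lookup table: dummy morphemes mapped to Form IDs.
morpheme_index = {"aa": 256, "bb": 2265, "cc": 3}

# Computed at creation/update time and stored with the Form.
stored_ids = compute_formids("aa-bb=cc", morpheme_index)
```

Note that this sketch flattens word boundaries: a multi-word Form ends
up as one flat list of IDs, so a real implementation would probably
need a second delimiter between words.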
########################
2. Files table ("file"):
########################
1. "fileid": unique integer, primary key, auto-increment
2. "name": datatype: characters; original name of the uploaded
file, i.e., as the User named it on his/her machine
3. "retrieve_name": datatype: characters; name of uploaded file as
it is stored on the server; the BLD changes the name of the file
(retaining the extension, of course) to its fileid value. This is a
simple way of ensuring that new files do not overwrite pre-existing
ones. But it's not a very good system. When users download files,
the file they download does not have the name as displayed in the BLD!
4. "type": datatype: characters; the value for this column is
created programmatically by reading the MIME type of the uploaded file
5. "uitype": (user interface type) datatype: characters; this
column has a poorly chosen name! Its value is, I believe, a user-
friendly translation of the MIME type value. However, the BLD only
creates "uitype"s for certain "type"s and if it has no translation,
the "type" is used instead. This system needs to be fixed...
6. "size" (file size) datatype: integers; the size of the uploaded
file in bytes (or bits, I can't remember...). This is a
programmatically created value which is converted to user-friendly
units, i.e., KB, MB, GB at display time
7. "enteredby" (entered by) datatype: characters; the name (first
and last) of the User who entered the data; programmatically created
value
8. "enteredwhen" (entered when) datatype: date; date of File's
entry: YYYY-MM-DD; programmatically created value
9. "description" datatype: characters; user-generated description
of the File; a catch-all field
10. "embedded_file" datatype: characters; when a user creates a
File whose content is a video served from an external site, this field
is populated by the HTML that should be used to display the video,
i.e., an HTML "object" tag; I just realized that this field allows
users to add HTML directly into the system and is probably a security
risk.
11. "embedded_file_password" datatype: characters; if the embedded
file requires a password to be viewed, this field will have been given
a value by the user.
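For the "uitype" and "size" columns above, something like the
following could produce the user-friendly values at display time
instead of storing them. This is only a sketch: the function names
and the three entries in the lookup table are made up for
illustration, and I am assuming the size is stored in bytes.

```python
# Hypothetical mapping from MIME type to a user-friendly label.
FRIENDLY_TYPES = {
    "image/jpeg": "JPEG image",
    "audio/x-wav": "WAV audio",
    "application/pdf": "PDF document",
}


def friendly_type(mime_type):
    """Return a readable label, falling back to the raw MIME type."""
    return FRIENDLY_TYPES.get(mime_type, mime_type)


def friendly_size(num_bytes):
    """Convert a byte count to a short string in bytes, KB, MB or GB."""
    size = float(num_bytes)
    for unit in ("bytes", "KB", "MB", "GB"):
        # Stop at the first unit that fits, or at the largest unit.
        if size < 1024 or unit == "GB":
            break
        size /= 1024
    if unit == "bytes":
        return "%d bytes" % num_bytes
    return "%.1f %s" % (size, unit)
```

With a fallback like this, every File always has something sensible
to display, even for MIME types nobody thought to add to the table.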
###############################
3. Collections table ("story"):
###############################
1. "storyid": unique integer, primary key, auto-increment
2. "type": datatype: characters; the type of Collection, i.e., on
the BLD one of "Elicitation", "Story", "Lunch" or "Other"; we should
probably have default types in the OLD but also allow users/
administrators to add/remove them.
3. "title": datatype: characters; title of the Collection
4. "teller": datatype: characters; first and last name of the
Speaker (if applicable) from whom the Collection was elicited; perhaps
multiple values should be possible (i.e., for conversations...); the
list of possible values should also be programmatically populated from
the
5. "source": datatype: characters; reference to textual source of
Collection, if applicable
6. "elicitor": datatype: characters; User who elicited Collection,
if applicable
7. "enterer": datatype: characters; User who entered Collection;
programmatically created value
8. "dateelicited": datatype: characters; YYYY-MM-DD formatted date
9. "dateentered": datatype: characters; YYYY-MM-DD formatted date;
programmatically created value
10. "description": datatype: characters; description of
Collection; might be a good idea to allow some kind of formatting
markup to be entered here (MoinMoin? Markdown?)...
11. "contents": datatype: characters; list of Form IDs that
constitute the Collection; stored as a serialized PHP array
12. "file": datatype: characters; list of File IDs that are
associated to the Collection; stored as a serialized PHP array
13. "datemodified": datatype: date & time; YYYY-MM-DD hh:mm:ss
when Collection was last modified; automatically populated by the
system
###################
4. Users ("users"):
###################
1. "username": datatype: characters; unique username of BLD User;
by convention it is lowercase first initial of first name followed by
lowercase lastname: "jdunham", etc.
2. "passwd": (password) datatype: characters; password used to
gain access to the BLD; an alphanumeric password encrypted (by the
HTML form or MySQL? I can't remember).
3. "email": datatype: characters;
4. "memory": datatype: characters; list of Form IDs representing
the Forms that the User has in her personal Memory. Memory allows
Users to keep a set of Forms as a kind of temporary workspace, i.e.,
for creating Collections from or just for collecting data while using
the system and then exporting it for another purpose.
5. "permissions": datatype: characters; a simple way of giving
names to different permission levels; if certain functionality is
restricted in some way, the system queries this value in order to
authorize the User. Current values are "ADMINISTRATOR", "FULL_USER"
and "PARTIAL_USER". We will probably want to think about user access
levels and permissions in more depth...
6. "lexicon": datatype: characters; I can't figure out what this
field was intended for. Perhaps it was to fulfill the "memory"
function? It doesn't appear to be in use now. Trash it...
7. "html": datatype: characters; field for Users to enter the
content of their personal page. The personal page is a web page
created automatically by the BLD for each User. Users can add content
to their personal page (e.g., links to papers and presentations that
they have uploaded as files, some details about their history with
Blackfoot) and they can mark up this content using a subset of HTML
tags.
8. "first_name": datatype: characters;
9. "last_name": datatype: characters;
10. "affiliation": datatype: characters; university or other
institution to which the User belongs; not mandatory
11. "picture": file name of a File that is an image; the user
enters the File ID and system derives the file name (which is one of
"fileid.jpg", "fileid.png", etc.)
12. "stable_syncat": (stable morpho-syntactic category) datatype:
characters; field that takes one of two values: "yes" or "no"; User is
forced to choose to have stable syncat "on" or "off" on their
"Settings" page and the system encodes that choice as "yes" or "no".
Stable syncat means that, when you are entering Forms, the system
remembers the morpho-syntactic category of the last form you entered
and makes it the default for the Form that you are currently
creating. Since Forms are most often sentences (at least, in my
experience), this saves time in data entry.
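As a starting point for the discussion of the "permissions" column,
here is one way the named levels could be checked in Python. I am
assuming, purely for the sake of the sketch, that the three current
values form a simple ordering (PARTIAL_USER < FULL_USER <
ADMINISTRATOR); the function name is hypothetical and the BLD itself
may not work this way.

```python
# Assumed ordering of the three permission values currently in the BLD.
PERMISSION_RANK = {
    "PARTIAL_USER": 1,
    "FULL_USER": 2,
    "ADMINISTRATOR": 3,
}


def is_authorized(user_permission, required_permission):
    """True if the user's level is at least the required level.

    Unknown permission names are denied by default.
    """
    try:
        user_rank = PERMISSION_RANK[user_permission]
        required_rank = PERMISSION_RANK[required_permission]
    except KeyError:
        return False
    return user_rank >= required_rank
```

If we end up wanting permissions that do not nest neatly (e.g., a
user who can upload Files but not edit Forms), a flat ordering like
this will not be enough and we would need per-action permissions.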
#########################
5. Speakers ("speakers"):
#########################
1. "speakerid": unique integer, primary key, auto-increment
2. "email": datatype: characters;
3. "html": datatype: characters; the same type of field as the
"html" field of the "users" table
4. "first_name": datatype: characters;
5. "last_name": datatype: characters;
6. "dialect": datatype: characters;
########################################
6. MorphoSyntacticCategories ("syncat"):
########################################
1. "syncatid": unique integer, primary key, auto-increment
2. "name": datatype: characters; shorthand name of the morpho-
syntactic category. By convention values are lowercase and 3-4
characters long; examples from Blackfoot include: "sent" (sentence),
"dem" (demonstrative), etc. Users can add new syncats to the system
but are encouraged to do so with care and deliberation. Redundancies
should be avoided so that the "syncat" and "syncat_string" fields of
the "linguistic_forms" table can be searched efficiently; however,
linguistic knowledge changes, so users should be able to change the
syncats of their system. Note: the values entered here will be those
made available to the "syncat" forced choice list when entering Forms;
also, the "syncat_string" fields of the "linguistic_forms" table will
contain strings of these "names".
3. "description": datatype: characters; here the users can describe
the meaning of the syncat that their "name" refers to. E.g. "Nouns
(n) are defined by the following distribution ...".
#########################
7. KeyWords ("keywords"):
#########################
1. "keywordid": unique integer, primary key, auto-increment
2. "name": datatype: characters; name of the keyword. Keywords
are pretty "freestyle", i.e., users should feel free to add whatever
keywords suit their purposes. Though consistency and avoidance of
redundancy are always helpful...
3. "description": datatype: characters; here the users can describe
the meaning of the keyword that their "name" refers to. E.g. for
"noun incorporation" the description might be "Forms tagged with
``noun incorporation`` have a nominal-like element within the verbal
complex.".
#############################################
8. ElicitationMethods ("elicitation_method"):
#############################################
1. "methodid" (method ID): unique integer, primary key, auto-
increment
2. "name": datatype: characters; name of the elicitation method.
As with keywords and syncats, elicitation_methods are user-editable.
Elicitation_methods currently contained in BLD: translated English to
Blackfoot, judged researcher's Blackfoot utterance, volunteered
without instigation, told story, described stimulus, correction of
ungrammatical utterance, MuDBE, picture task, volunteered as alternate
form.
3. "description": datatype: characters; description of the
elicitation method in prose.
#########################
#########################
(D) Ideas
#########################
#########################
1. ** Self-referential Morpheme Links: each morpheme and each
gloss in the morphbreak and morphgloss lines is displayed as a link to
the Form with which it corresponds (if such a Form exists).
However, the BLD currently builds these links by doing a database
query for each morpheme and gloss EVERY TIME a Form is viewed.
This is EXTREMELY computationally expensive and makes results load far
too slowly. This MUST be changed in the OLD: i.e., the fields
"morphbreak_formids" and "morphgloss_formids" will have their values
(e.g., a list of IDs: "256-2265-3-17") computed at Form creation time
and will be altered only when necessary, i.e., at update time.
2. Multiple Translations: it makes sense for a Form to be
constrained to a single transcription, but multiple translations would
seem to be necessary. The question is, how do we allow users to add
1 or more translations, while maintaining a consistent mode of
delimiting each translation? Having multiple "translation" columns in
the Forms table seems like a bad idea -- how many would we create?
So, assuming the Form table has 1 translation column, multiple
translations must be consistently delimited, e.g., "they danced; they
are dancing". Perhaps the best approach would be to have the HTML
form provide a variable # of translation input fields and then the
system can join them together in a consistent manner (like what is
currently done in the "keywords" field of the "linguistic_forms"
table). One small issue with this approach is that, in order to
search the system well, users may need to be aware of the way in which
translations are stored by the system...
3. Grammaticality Judgments for Translations: I often find myself
transcribing Forms with several translations, some of which are
clearly compatible with the Form and others which are not. I notate
this as follows: "nitsspiyi" - "I danced", "*I am dancing". I.e., this
means that "I am dancing" is incompatible as a translation for
"nitsspiyi". I think such translation "compatibility/grammaticality
marking" should be encodable in the OLD. Note: we may have to
coordinate implementing this feature with the Multiple Translations
feature discussed above.
4. Many-to-many Mappings Should be Represented by Independent
Tables: e.g., the relationship between Forms and Files should be
stored in a new table, perhaps called forms_files, with the following
columns: "formFileID", "formID", "fileID". Same for the other
relationships (e.g., Forms and KeyWords). This will allow searches
like: "give me all Forms that are part of a Collection" or "give me
all Files that are associated to Forms in Collection X". The basics
of how to do this in Pylons with SQLAlchemy can be found in the Pylons
Book: http://pylonsbook.com/
5. Bibliographic Entities?: external texts seem to be an entity
that a system like the BLD/OLD needs to reference. E.g., if many
Forms come from a grammar or dictionary, it is tedious (and generates
inconsistency) to input the "source" reference of each Form by hand.
Perhaps a Bibliography table (somehow linked to the Files table)
should be added. But this is a feature of secondary urgency...
6. Grammaticality Judgments: we want consistency in grammaticality
judgments, but there seem to be different conventions. Does anybody
know if there is a particular set of conventions that it would be wise
to adopt? Or should we just let it grow organically? I think the
idea of making grammaticality judgments a separate field from the
transcription (i.e., "form" in the BLD) field is a good one. And
having a user-editable table for grammaticality judgment categories
(as the BLD does for keywords and elicitation_methods) encourages some
level of consistency.
7. Elicitation Methods: so far, the elicitation_method table of
the BLD has 9 user-generated values. What can we learn from this?
Should we let this grow organically or impose structure and
consistency on it? Are some of the values (partially) redundant? If
anybody would like to comment on this, the names of the
elicitation_method values currently used in the BLD are given in "8.
ElicitationMethods" above.
8. Screening File Uploads by File (MIME) Type: the BLD currently
does NOT do this. This is a security vulnerability and is probably
not a good idea for the OLD (even though accessing the OLD will
require authentication)
9. Using a File's MIME Type to Display User-friendly File Type
Names: this should be a simple feature to implement in the OLD...
10. Embedded Videos as Files. The BLD currently allows users to
create Files whose content is a video stored on an external web site
(e.g., YouTube). PRO: allows users to upload large video files
without taxing our server and via a familiar and user-friendly
interface (i.e., YouTube or Vimeo). CON: all the data for the system
is not in one place so features that would permit downloading of Files
en masse would not be able to include these files (perhaps other cons
also...). Any thoughts on this?
11. Adding Files is a Multi-page HTML Form. When users add a
File, they first upload the file and then a new page appears in which
they can enter a description. I don't think it needs to be this way.
This can be easily changed (I think).
12. Documentation, Documentation, Documentation! In order to make
the OLD usable it is essential to put a good deal of energy into the
documentation, especially the documentation for the end users and that
for the OLD-instance administrators (i.e., those who will need to
figure out how to install and configure the system on their own
server). Developer documentation and good coding documentation are
also important.
13. Filter embedded_file Field of File Table. At present, the BLD
allows (even encourages) users to enter raw HTML into this field.
Allowing users to do this is generally considered bad practice and we
should at least partially filter the HTML or use an existing
lightweight markup language like reStructuredText or Markdown
(http://daringfireball.net/projects/markdown/).
14. OLD Stats. A visual and informative statistics page would be
a nice addition to the OLD. It could show users, e.g., how many forms
there are, a graph of how the system has been growing, who are the
active contributors, what dialects are more prevalent in the data,
etc. Using Python's Matplotlib, this should not be too difficult (?)
15. Extend the Features of Collections. One idea: I think it
would be nice to allow users to enter comments and prose in
Collections and be able to interleave such comments with the Forms
that constitute the "core" of the collection. But then Collections
would basically become user-edited documents with a special syntax for
referencing Forms. Would this cause any foreseeable issues?
16. Deleting Files. Currently the BLD does not allow users to
delete files. This feature should be implemented. But it raises
questions about permissions. Should User A be able to delete a file
uploaded by User B? (see Altering Data and Permissions below).
17. Altering Data and Permissions. The BLD allows anybody to
alter anybody else's data. Is this a good idea? Should we allow
"pending updates" that must be approved by the enterer/elicitor of the
Form? Should we allow non-creators of a Form to make comments on that
Form but not alter it? My intuition was that users would be
responsible and respectful and that, since the system keeps a record
of every version of every Form, reversion to a previous version would
mitigate the risks of free data altering. Thoughts?
18. User Authorization. What levels of users should be created
and what permission/default interfaces should they have? I was
thinking "Administrator" (everything), "Researcher" (view, create,
update, and delete Forms, Files, and Collections, etc.), "User" (view
entities only, specialized interfaces, ...).
19. Form Export. The OLD needs to be able to export Form data in
a variety of formats. Here are the basic file types/syntaxes I was
thinking of: plain text, XML, LaTeX, even SQL(?). But what formats in
particular?: CSV (comma-separated values); TSV (tab); plain text
format for easy copy and paste into Word/OpenOffice documents (?); an
XML dialect that we invent; an XML dialect for use in language games/
software; an XML dialect standardized by some (typological or
computational) linguistics body (i.e., ODIN, ...); LaTeX with various
formatting styles (e.g., Covington for word-aligned formatting of
Forms, other?); .odt files (i.e., OpenOffice.org files -- if we're
really feeling ambitious!).
20. Form Import. I find that entering large amounts of data into
the BLD using the HTML forms can be time-consuming (even with cut-and-
paste, intelligent form defaults and keyboard shortcuts programmed
into the HTML form interface). It would be nice to develop a terse
XML vocabulary that elicitors could use to quickly mark up their
elicitation transcriptions and which the OLD could quickly ingest and
convert into the appropriate Form and Collection formats. In fact, I
have begun developing such an XML vocabulary already, which I call
"elixml" (i.e., Elicitation XML). Let me know if you want to know
more about this.
21. Users Table Needs Primary Key.
22. Create Interface and Permissions for Non-Researcher Users --
see "User Authorization" above.
23. Batch Update. Perhaps we should allow some mechanism for
updates of multiple Forms (or Files or Collections). For example, the
group may learn that what they have been calling a single morpheme is
really two different ones. There should be some way of making the
change in morpheme labels throughout all relevant Forms, without doing
each one by hand. Of course, this would require some kind of
consensus on the change, which may be impossible! Perhaps this
feature, if it is to exist, should only be put in the hands of
Administrators and should only be used after deliberation by the group
has resulted in a decision. This then raises a general question about
what the OLD is supposed to be: is it a (1) pluralistic collection of
data where different theories and notations abound or (2) a monolithic
collection where conventions are enforced and perfect consistency is
the goal (ha!). My guess is that the reality will be somewhere in
between... At any rate, Batch Update is an advanced feature that
certainly does not need to be present in "OLD version 0.1".
24. Forums for Discussion. Users should be able to have
discussions about the data and the system ON the system, i.e., there
should be a forum for threaded discussions. A key feature for the
entries would be a syntax for referencing or embedding entities (i.e.,
Files, Forms, Collections); other features? ...
25. Form History. Viewing the previous manifestations of a Form
ought to be easy. It should also be easy to revert a Form to a
previous version. A related feature would be the highlighting of the
changes between one version of the Form and another (though that may
be a more complex algorithm than it seems...)
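On the Form History point, highlighting changes may be less work than
it sounds, because Python's standard library includes difflib. Here
is a minimal sketch, assuming each version of a Form is available as
a dict of field values; the function name, the field list, and the
"[-...-]" / "{+...+}" markers are my own choices for illustration,
not anything the BLD does today.

```python
import difflib

# Fields compared by default; names follow the BLD columns described above.
DEFAULT_FIELDS = ("form", "morphbreak", "morphgloss", "translation")


def form_diff(old, new, fields=DEFAULT_FIELDS):
    """Return {field: marked-up string} for every field that changed.

    Deleted text is wrapped in [-...-] and inserted text in {+...+}.
    Fields whose values are identical are left out of the result.
    """
    changes = {}
    for field in fields:
        before = old.get(field, "")
        after = new.get(field, "")
        if before == after:
            continue
        matcher = difflib.SequenceMatcher(None, before, after)
        pieces = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                pieces.append(before[i1:i2])
            if tag in ("delete", "replace"):
                pieces.append("[-" + before[i1:i2] + "-]")
            if tag in ("insert", "replace"):
                pieces.append("{+" + after[j1:j2] + "+}")
        changes[field] = "".join(pieces)
    return changes
```

The markers could be swapped for HTML tags at display time. Reverting
would then just mean saving the old dict as a new version.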