Hi all,
Sorry for the slow update; I was busy having a relatively
non-computer weekend. Anyway, I've now placed the current version of
my HOTU rebuild online at
http://hotu.pratyeka.org/
So I guess the major question is - what makes this one different?
Well, besides making everything multilingual-friendly (no small
task!), I've made URL-friendly 'codes' for each of the games,
platforms, companies and people, which enables pretty / SEO-friendly /
human-readable URLs.
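For anyone wondering what generating such a 'code' might look like, here is a minimal Python sketch. It is an illustration only, not the site's actual code: the function name make_code and the exact rules (lowercase ASCII, hyphens between words) are assumptions.

```python
import re
import unicodedata


def make_code(name: str) -> str:
    """Turn a display name into a lowercase, hyphen-separated URL code.

    Hypothetical helper for illustration; the real site may use other rules.
    """
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    code = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower())
    # Remove hyphens left over at either end.
    return code.strip("-")


print(make_code("Sid Meier's Civilization"))
```

Note that a name written entirely in a non-Latin script would come out empty under these rules, so a multilingual site would need a fallback (a numeric ID, say) and a uniqueness check on top of this.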
The site uses AJAX/JSON to load and cache the lists of platforms,
themes and genres, along with other JavaScript techniques, client-side
caching, MySQL caching, the memcached daemon and lighttpd to speed up
load times and make the database easily searchable by any aspect.
Gone are the multi-screen results of the previous site: if you search
for a certain group of games, they will all show immediately on one
results page.
Data import so far includes everything from the Excel sheet, plus all
of the static text data and box/screenshots remaining on the Swiss
mirror (updated to reformat / fix old links, mostly normalise
formatting, etc). Gamebooks are so far inaccessible, but I will add
them ASAP.
During import I used regular expressions to further categorise game
files into various types to make them a little more useful /
delineated. This way we can also track which games have or don't have
various types of additional resources, so that we can more easily
locate or write them. The list is:
map/patch/source/original/remake/tools/crack/demo/guide (which
includes walkthroughs, solutions, guides)/reference.
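As a rough illustration of this kind of regex categorisation, here is a small Python sketch. The patterns below are guesses made up for the example, not the ones used in the import, and only a few of the listed types are covered; categorise_file is a hypothetical name.

```python
import re

# Hypothetical patterns, checked in order: the first one that matches wins.
FILE_TYPE_PATTERNS = [
    ("guide", re.compile(r"walk-?through|solution|guide|hints?", re.I)),
    ("map", re.compile(r"\bmaps?\b", re.I)),
    ("patch", re.compile(r"patch|update", re.I)),
    ("source", re.compile(r"source", re.I)),
    ("remake", re.compile(r"remake", re.I)),
    ("crack", re.compile(r"crack", re.I)),
    ("demo", re.compile(r"\bdemo\b", re.I)),
    ("reference", re.compile(r"manual|reference", re.I)),
]


def categorise_file(description):
    """Return the first matching file type, or None if nothing matches."""
    for file_type, pattern in FILE_TYPE_PATTERNS:
        if pattern.search(description):
            return file_type
    # Unmatched files are left uncategorised for manual review.
    return None


for text in ["Full walkthrough", "Official patch v1.2", "Level maps", "readme"]:
    print(text, "->", categorise_file(text))
```

Counting how many files of each type a game has then makes it easy to list the games that lack, say, a guide or a patch.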
I also similarly categorised URLs which were clearly one of: reviews,
official sites, or unofficial/fan sites.
I've also added an experimental placeholder for DOSBox compatibility
information. In the future this will be auto-harvested from the DOSBox
page (the data ripper only needs to be run once per DOSBox release,
and the compatibility information will auto-generate an appropriate
DOSBox version download link).
Ideally we could also add DOSBox configurations if required; there are
libraries of these out there already...
In testing I noticed that the boxshots archive linked to on the
'files' page of the group was definitely incomplete, so I ripped the
entire Swiss mirror (>800MB!) and then used Unix tools to gather all
of the jpg files and dump them in one directory, which mostly fixed
the problem.
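The gathering step was done with Unix tools, but for those not on Unix the same idea can be sketched in Python. This is an assumed equivalent, not the commands actually used; gather_jpgs and the directory names are hypothetical.

```python
import shutil
from pathlib import Path


def gather_jpgs(mirror_root: Path, dest: Path) -> int:
    """Copy every .jpg/.jpeg under mirror_root into the flat directory dest.

    Returns the number of files copied. If two files share a name, the
    first one found is kept and later ones are skipped.
    """
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for path in mirror_root.rglob("*"):
        if path.is_file() and path.suffix.lower() in {".jpg", ".jpeg"}:
            target = dest / path.name
            if target.exists():
                continue  # name clash: keep the copy we already have
            shutil.copy2(path, target)
            copied += 1
    return copied


if __name__ == "__main__":
    # Hypothetical paths; adjust to wherever the mirror was saved.
    print(gather_jpgs(Path("swiss-mirror"), Path("boxshots")))
```

Flattening into one directory silently drops same-named files from different folders, which is worth keeping in mind when checking for missing box shots.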
There are still some missing box shots, which have filenames in the
Excel sheet but which were not preserved on the Swiss mirror. I will
publish a list of the affected games should anyone want to go hunting.
Another problem was incorrect naming. In the Excel sheet, 'related
games' are sometimes not the exact, current names of the corresponding
games. Coupled with the fact that related game IDs are missing, this
is a bit of a pain. I've used a partial-match mechanism to resolve
some of these, but there is still a shortlist of detected cases that
need to be fixed manually.
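To give a feel for how near-miss names can be resolved, here is a short Python sketch using fuzzy matching from the standard library's difflib. This is a stand-in technique chosen for the example and not necessarily the partial-match mechanism used in the import; resolve_related, the 0.8 cutoff and the sample titles are all assumptions.

```python
import difflib


def resolve_related(name, known_names, cutoff=0.8):
    """Map a possibly inexact 'related game' name to a known game name.

    Returns the matching known name, or None if nothing is close enough,
    in which case the entry goes on the shortlist for manual fixing.
    """
    # Compare case-insensitively, but return the name as stored.
    lookup = {known.lower(): known for known in known_names}
    key = name.strip().lower()
    if key in lookup:
        return lookup[key]
    matches = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


known = ["Ultima VII: The Black Gate", "Ultima Underworld", "SimCity 2000"]
print(resolve_related("Ultima VII - The Black Gate", known))
print(resolve_related("Tetris", known))
```

A high cutoff keeps false matches down at the cost of leaving more names unresolved, which seems the safer trade-off when the leftovers get checked by hand anyway.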
Over the next few days I will try to find the time to write a crawler
to go through the mirrored mirror's files (!!) and extract 'real dog'
status, rating, # rating votes, company intros, and anything else that
appears to be missing from the other data sources. This will hopefully
complete the import. (I'm sure this won't be difficult; it's just a
matter of finding time.)
There are a few known issues: platform information is somewhat
corrupt due to the Excel dump being textual (i.e. "Windows X, Windows
Y, ..." vs "Windows X" vs "Windows Y" entries being a pain to break
up). Around five different game records have weird issues due to some
kind of parsing bug, which causes a few strange bits of data to
appear. Finally, there is a problem with subgenre allocation, so
selecting a subgenre only shows one game. These should all be fixed
soon.
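On the platform issue, one possible way to break up the combined entries is sketched below in Python. It assumes platforms are always separated by commas and that no platform name itself contains a comma; split_platforms is a hypothetical helper, not the planned fix.

```python
def split_platforms(raw_values):
    """Split comma-separated platform strings into a sorted list of
    distinct platform names."""
    platforms = set()
    for raw in raw_values:
        for part in raw.split(","):
            name = part.strip()
            if name:  # ignore empty pieces from stray commas
                platforms.add(name)
    return sorted(platforms)


print(split_platforms(["Windows X, Windows Y", "Windows X", "Windows Y"]))
```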
Please let me know what you all think about the rebuild so far, and
let me know if you can see any errors, since it's important to catch
any issues at the earliest possible stage (i.e. before manual
additions / modifications to the current database).
Also, I really liked Siddhartha's idea of considering the project as a
form of modern-day cultural archivism. It's probably fair to say that
multilingual support is very important for the project if we are going
to consider its goal to be preserving and sharing cultural heritage.
Amusingly, work on HOTU today was made possible by China's
'grave sweeping holiday' - a Confucian tradition about respect for
your ancestors. See
http://en.wikipedia.org/wiki/Qingming_Festival
and
http://en.wikipedia.org/wiki/Along_the_River_During_Qingming_Festival
(one of the most famous ancient Chinese paintings, of the ancient
capital Kaifeng - with a prominent alcohol shop next to the bridge in
the center of town!).
OK, so that's it for now.
Stay well, be happy and respect your ancestors! :)
- Walter