1. We need a better wiki page cleaning mechanism. Maybe you should look
at extracting the printable version. I haven't gone through the wiki
dump, so I'm not sure whether it is included in the dump or not.
2. Of course, we can think of other data sources too.
3. Are you aiming at predicting word sequences or character
sequences? Please rethink the model (see the sketch after this list).
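
To make the distinction in point 3 concrete, a minimal sketch (the data
and names here are placeholders, not from the actual corpus):

    from collections import Counter

    sentences = ["తెలుగు వికీపీడియా"]  # placeholder corpus

    # Word-level: units are whitespace-separated tokens.
    word_bigrams = Counter()
    for s in sentences:
        words = s.split()
        word_bigrams.update(zip(words, words[1:]))

    # Character-level: units are individual code points. Note this
    # splits Telugu aksharas, which combine several code points.
    char_bigrams = Counter()
    for s in sentences:
        char_bigrams.update(zip(s, s[1:]))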
[...] that the dump does not reflect the frequency of word usage in
the Telugu language [...]
On 2/13/13, Rakesh A <rake...@gmail.com> wrote:
> Maybe I should write this out in a little more detail...
> I took the XML dump and extracted the pages. These pages are still in
> wiki markup.
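>
> Something along these lines (a simplified sketch, not my exact script;
> the real dump tags carry an XML namespace, hence the suffix match):
>
>     import xml.etree.ElementTree as ET
>
>     def iter_page_texts(dump_path):
>         # Stream the dump so the whole file never sits in memory.
>         for _, elem in ET.iterparse(dump_path, events=("end",)):
>             if elem.tag.endswith("}text") and elem.text:
>                 yield elem.text  # raw wiki markup for one page
>             elem.clear()  # free parsed elements as we go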
>
> From that wiki-marked-up text, I just removed all English letters,
> punctuation, and special characters.
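>
> Concretely, something like this (a minimal sketch; the regex is the
> idea, not my exact code):
>
>     import re
>
>     # The Telugu script occupies the Unicode block U+0C00 to U+0C7F.
>     TELUGU_WORD = re.compile(r"[\u0C00-\u0C7F]+")
>
>     def telugu_words(text):
>         # Anything outside the Telugu block (English letters,
>         # punctuation, markup symbols) acts as a separator.
>         return TELUGU_WORD.findall(text)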
>
> That way I got these Telugu words. Understandably, words like దస్త్రం,
> వర్గం, and జిల్లా from the templates stayed on.
> Some user names stayed on as well.
>
> I need to use a better system to remove the wiki markup text, like
> this one I think: <http://pastebin.com/idw8vQQK>
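>
> Alternatively, a library such as mwparserfromhell could do the
> stripping (an assumption on my part; I have not compared it against
> that script):
>
>     import mwparserfromhell  # third-party: pip install mwparserfromhell
>
>     def strip_markup(wikitext):
>         # Parse the markup and drop templates, links, and tags,
>         # keeping only the readable text.
>         return mwparserfromhell.parse(wikitext).strip_code()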
--
Dileep.M
+91-897-855-9072