> I just have a small comment about the evaluation script we got,
> langid-evaluate.prl, specifically this part:
>
> while (<GOLD>)
> {
>     chomp;
>     if (my($docid,$docclass) = &process_line($_))
>     {
>         $gold{$docid}{$docclass} = 1;
>         $docclass_list{$docclass} = 1;
>         $goldtotal++;
>     }
> }
>
> close GOLD;
>
> I think chomp does not perform well here, so instead I use $line =~ s/\s+$//;
> I already got errors: F-score, Precision and Recall were all equal to 0.000, just because of a special character that chomp does not catch when I ran it on Mac OS.
You are right that it's brittle: chomp only strips the trailing "\n", so the
"\r" of a DOS-style line break is left at the end of the line. It should be
possible, though, to pre-convert your files to UNIX line breaks before
evaluating with the script, e.g. using dos2unix (or fromdos). Alternatively,
run a simple Perl filter over your output file, e.g. something like:

    perl -i -pe 's/\r\n/\n/g' FILE

This is perhaps a better solution than modifying the evaluation script at this
late stage.
Tim