Creation of Discriminator.txt and Discriminator_temp.txt


Tobias Horsmann

Jan 8, 2018, 2:26:01 PM
to dkpro-lab-developers
Hi,

when using nested subtasks, e.g. DKPro TC cross-validation, Lab creates two Discriminator files, one of which has '_temp' in its name. It appears that the information in both files should be merged.

Discriminator.txt contains, for instance:

#Mon Jan 08 19:43:51 CET 2018

org.dkpro.tc.ml.ExperimentCrossValidation$1|featureMode=document

org.dkpro.tc.core.task.InitTask|threshold=null

org.dkpro.tc.ml.ExperimentCrossValidation$1|useCrossValidationManualFolds=false

org.dkpro.tc.core.task.InitTask|developerMode=false

org.dkpro.tc.core.task.InitTask|readerTest=[org.dkpro.tc.examples.io.TwentyNewsgroupsCorpusReader, sourceEncoding\=UTF-8, useDefaultExcludes\=true, includeHidden\=false, logFreq\=1, sourceLocation\=src/main/resources/data/twentynewsgroups/bydate-test, language\=en, patterns\=[+]*/*.txt]

org.dkpro.tc.core.task.InitTask|readerTrain=[org.dkpro.tc.examples.io.TwentyNewsgroupsCorpusReader, sourceEncoding\=UTF-8, useDefaultExcludes\=true, includeHidden\=false, logFreq\=1, sourceLocation\=src/main/resources/data/twentynewsgroups/bydate-train, language\=en, patterns\=[+]*/*.txt]

org.dkpro.tc.core.task.InitTask|featureSet=[org.dkpro.tc.features.length.NrOfTokens| uniqueFeatureExtractorName, NrOfTokens690265819978297], [org.dkpro.tc.features.ngram.LuceneNGram| ngramUseTopK, 50, ngramMinN, 1, ngramMaxN, 3, uniqueFeatureExtractorName, LuceneNGram690265821454424]

org.dkpro.tc.core.task.InitTask|learningMode=singleLabel

org.dkpro.tc.core.task.InitTask|featureMode=document


while Discriminator_temp.txt contains complementary information about the nested tasks, e.g.:

#Mon Jan 08 19:43:51 CET 2018

org.dkpro.tc.core.task.InitTask|featureMode=document

org.dkpro.tc.core.task.InitTask|readerTrain=[org.dkpro.tc.examples.io.TwentyNewsgroupsCorpusReader, sourceEncoding\=UTF-8, useDefaultExcludes\=true, includeHidden\=false, logFreq\=1, sourceLocation\=src/main/resources/data/twentynewsgroups/bydate-train, language\=en, patterns\=[+]*/*.txt]

org.dkpro.tc.core.task.InitTask|developerMode=false

org.dkpro.tc.core.task.OutcomeCollectionTask|readerTest=[org.dkpro.tc.examples.io.TwentyNewsgroupsCorpusReader, sourceEncoding\=UTF-8, useDefaultExcludes\=true, includeHidden\=false, logFreq\=1, sourceLocation\=src/main/resources/data/twentynewsgroups/bydate-test, language\=en, patterns\=[+]*/*.txt]

org.dkpro.tc.core.task.ExtractFeaturesTask|featureMode=document

org.dkpro.tc.core.task.ExtractFeaturesTask|applyWeighting=false

org.dkpro.tc.core.task.InitTask|featureSet=[org.dkpro.tc.features.length.NrOfTokens| uniqueFeatureExtractorName, NrOfTokens690265819978297], [org.dkpro.tc.features.ngram.LuceneNGram| ngramUseTopK, 50, ngramMinN, 1, ngramMaxN, 3, uniqueFeatureExtractorName, LuceneNGram690265821454424]

org.dkpro.tc.ml.libsvm.LibsvmTestTask|learningMode=singleLabel

org.dkpro.tc.core.task.InitTask|readerTest=[org.dkpro.tc.examples.io.TwentyNewsgroupsCorpusReader, sourceEncoding\=UTF-8, useDefaultExcludes\=true, 

...... and so on


I am not really sure why these two files are created in the first place, but it appears that the information should be stored in one file and the _temp.txt file deleted?
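
To see exactly which entries differ between the two files, a small comparison can help. The following is only a sketch: it assumes both files are in java.util.Properties format (which the timestamp comment and the escaped "\=" suggest, but which is not confirmed here), and the class and method names are made up for illustration. The sample entries are taken from the listings above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of DKPro Lab or DKPro TC.
class DiscriminatorCompare {

    // Parses text in java.util.Properties format (assumed file format).
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    // Returns the sorted keys that occur in 'a' but not in 'b'.
    static Set<String> keysOnlyIn(Properties a, Properties b) {
        Set<String> keys = new TreeSet<>(a.stringPropertyNames());
        keys.removeAll(b.stringPropertyNames());
        return keys;
    }

    public static void main(String[] args) throws IOException {
        // A few entries copied from Discriminator.txt
        Properties main = parse(
            "org.dkpro.tc.core.task.InitTask|threshold=null\n"
            + "org.dkpro.tc.core.task.InitTask|featureMode=document\n");
        // A few entries copied from Discriminator_temp.txt
        Properties temp = parse(
            "org.dkpro.tc.core.task.InitTask|featureMode=document\n"
            + "org.dkpro.tc.core.task.ExtractFeaturesTask|applyWeighting=false\n");

        System.out.println("Only in Discriminator.txt: " + keysOnlyIn(main, temp));
        System.out.println("Only in Discriminator_temp.txt: " + keysOnlyIn(temp, main));
    }
}
```

With the real files, Properties.load(Reader) could be fed from Files.newBufferedReader(path) instead of a string.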

Richard Eckart de Castilho

Jan 8, 2018, 2:46:31 PM
to dkpro-lab-...@googlegroups.com

> On 08.01.2018, at 20:26, Tobias Horsmann <tobias....@gmail.com> wrote:
>
> when using a nested subtasks i.e. DKPro TC Crossvalidation, Lab creates two Discriminator files, one has the name '_temp' in it. It appears as the information in both files should be merged.

I don't know where this _temp comes from - I don't think it comes from DKPro Lab. Search the code for "_temp" and you find nothing.

That said, DKPro Lab does use temporary files with the extension ".tmp" in FileSystemStorageService.storeBinary(). Data is first written to the "XXX.tmp" file,
which is renamed to the final name (XXX) once the data has been completely written.
In a parallel environment, this is important to prevent data from being considered
fully written while some thread is still writing it. But it looks like your "_temp"
must come from somewhere else. Did you search the TC code for the string?

cheers,

-- Richard
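
For readers unfamiliar with the ".tmp" approach described above, here is a minimal sketch of the write-then-rename pattern. It is not the actual FileSystemStorageService.storeBinary() code; the class name, the method name writeViaTempFile, and the use of java.nio are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration, not DKPro Lab code.
class TempThenRename {

    // Writes 'data' to "<name>.tmp" next to 'target', then renames it to 'target'.
    static void writeViaTempFile(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        // While this write is in progress, only the ".tmp" name exists.
        Files.write(tmp, data);
        // The final name appears only after the content is complete.
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lab-demo");
        Path target = dir.resolve("Discriminator.txt");
        writeViaTempFile(target, "featureMode=document\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("final file exists: " + Files.exists(target));
        System.out.println("tmp file left over: " + Files.exists(dir.resolve("Discriminator.txt.tmp")));
    }
}
```

Note that this pattern leaves no file behind once the rename has happened, so a lingering "_temp" file would not be explained by it.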