--
Slovak Academy of Sciences
Ľ. Štúr Institute of Linguistics
Panská 26, SK-81101 Bratislava
Tel +421-2-54431762 Fax -54431756
http://aranea.juls.savba.sk/guest/
https://www.facebook.com/araneawebcorpora/
Hannes,
I am always interested in what others use for de-duplication (I am using onion, by the way: https://corpus.tools/wiki/Onion ), but I would not use it as a solution for our special case. In fact, we already _are_ de-duplicating at the <p> level as well, and I can see that this would catch _some_ parts of _some_ cinema programs, but only a _minority_ of cases!
In our setting we already _do_ have a fine-grained, manually designed and tested CQL query that detects cinema programs with high precision (and hopefully high recall as well). Using that CQL query to construct a "cinemaprog" subcorpus is therefore straightforward. The question was whether there is a way to specify "non-membership" in a subcorpus as an additional condition in NoSke searches. (With add_fields.sh there is a way to turn the subcorpus-membership status into a normal structure attribute, but I was wondering whether there is some _alternative_ to that method.)
Onion usually does not perform well on short text segments, especially when they are shorter than the n-gram length given as the Onion parameter. In fact, such segments are considered duplicates only if the last words of the preceding segment are duplicates as well. This is why we use a simple signature method to deduplicate sentences, and ignore numbers and punctuation when computing the respective hashes.
Best,
Vlado B, 12:55
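
For readers who want to try the signature idea described above, here is a minimal sketch in Python. It is not the actual script used for the corpora discussed in this thread: the function names, the choice of SHA-1, and the whitespace normalization are all assumptions made for illustration. The only part taken from the message is the principle of ignoring numbers and punctuation when hashing each sentence.

```python
import hashlib
import unicodedata


def signature(sentence: str) -> str:
    """Return a hash of the sentence with numbers and punctuation ignored.

    Assumption: "numbers" and "punctuation" are taken to mean the Unicode
    general categories N* and P*; whitespace runs are collapsed as well.
    """
    kept = [
        ch for ch in sentence
        if unicodedata.category(ch)[0] not in ("N", "P")
    ]
    normalized = " ".join("".join(kept).split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(sentences):
    """Keep the first occurrence of each signature, preserving input order."""
    seen = set()
    unique = []
    for sentence in sentences:
        sig = signature(sentence)
        if sig in seen:
            continue  # same signature as an earlier sentence: drop it
        seen.add(sig)
        unique.append(sentence)
    return unique
```

With this sketch, two listing lines that differ only in show times and punctuation get the same signature, so only the first one is kept. Note that sentences consisting solely of numbers and punctuation all normalize to the empty string and would collapse into one.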
--
Dear Vojtech,
great, I will give it a try tomorrow.
Dear Vojtech,
FYI: I first tried simply modifying a "normal" concordance call by adding complement_subc=1,
which unfortunately has no effect:
.../crystal/#concordance?complement_subc=1&...
In contrast the following DOES work:
.../bonito/run.cgi/concordance?complement_subc=1&...
(I simply should have followed your advice from
the beginning ...)
Thanks a lot
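
For anyone scripting the working call above, the request URL could be assembled roughly as follows. This is only a sketch under assumptions: complement_subc=1 and the run.cgi/concordance endpoint come from the message, while the other parameter names (corpname, usesubcorp, q with a leading "q" for a CQL query) are what I would expect from the Bonito API and should be checked against your installation. The base URL, corpus name and query in the usage note are hypothetical placeholders.

```python
from urllib.parse import urlencode


def complement_concordance_url(base_url, corpname, subcorpus, cql):
    """Build a Bonito concordance URL restricted to the complement of a subcorpus.

    base_url is the address of run.cgi on your own server.
    The parameter names other than complement_subc are assumptions.
    """
    params = {
        "corpname": corpname,       # assumed name of the corpus parameter
        "usesubcorp": subcorpus,    # assumed name of the subcorpus parameter
        "complement_subc": 1,       # from the thread: search outside the subcorpus
        "q": "q" + cql,             # assumed: leading "q" marks a CQL query
    }
    return f"{base_url.rstrip('/')}/concordance?{urlencode(params)}"
```

A call such as complement_concordance_url("https://example.org/bonito/run.cgi", "mycorpus", "cinemaprog", '[lemma="film"]') would then produce a URL that searches everything outside the "cinemaprog" subcorpus, provided the server accepts these parameters.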