Corpus search excluding content of a specific sub-corpus?


H Pirker

Mar 1, 2023, 10:01:25 AM
to NoSketch Engine
Dear all,

when searching a corpus, is there a way to make use of the complement of a sub-corpus?
I.e., is there a search option for excluding the content of a specific sub-corpus?

(Concrete example: movie-theatre programs are part of our newspaper corpus, and these can have considerable distorting effects, because words/phrases found in a movie title will be mentioned daily, in each newspaper, for many cinemas, for an extended period of time. We are able to formulate bulky CQL queries to identify these programs and thus "place" them into a sub-corpus. But then we want to run searches on the global corpus that exclude these programs. Our current solution is to use add_fields.sh to mark the members of the sub-corpus with a newly created explicit structure element, but I wonder whether there is a more direct method to deal with that use case.)

Thanks a lot for your advice

Hannes

Vladimír Benko

Mar 2, 2023, 2:14:21 AM
to no...@sketchengine.co.uk
Dear Hannes,

We are solving this problem by deduplicating our corpus data at the paragraph level and also at the sentence level (using two different methods). I'll send you the details off-list if you are interested.

Best,

Vlado B, 8:10


--
Vladimír Benko

Slovak Academy of Sciences
Ľ. Štúr Institute of Linguistics
Panská 26, SK-81101 Bratislava

Tel +421-2-54431762 Fax -54431756

http://aranea.juls.savba.sk/guest/
https://www.facebook.com/araneawebcorpora/

H Pirker

Mar 2, 2023, 5:46:16 AM
to NoSketch Engine, Vladimír Benko
Dear Vlado, 

thanks for your suggestion. 
I am always interested in what others use as a solution for their de-duplication (I am using onion, btw.: https://corpus.tools/wiki/Onion) ... but I would not use that as a solution for our special case. As a matter of fact, we already _are_ performing de-duplication on the <p>-level as well, and I can see that this would catch _some_ parts of _some_ cinema programs, but only a _minority_ of cases!
In our setting we already _do_ have a fine-grained, manually designed + tested CQL query for detecting cinema programs with high precision (and hopefully high recall as well). Thus making use of that CQL query to construct a "cinemaprog-subcorpus" is straightforward. The question was whether there is a way to specify "non-membership" in a sub-corpus as an additional condition in NoSke searches. (With add_fields.sh there exists a means for transforming the information about the sub-corpus membership status into a normal structure attribute, but I was just wondering whether there is some _alternative_ solution to that method.)

cheers

Hannes

Vladimír Benko

Mar 2, 2023, 6:59:30 AM
to no...@sketchengine.co.uk

Hannes,


Onion usually does not perform well on short text segments, especially when they are shorter than the n-gram length used as the Onion parameter. In fact, such segments are considered duplicates only if the last words of the preceding segment are duplicates as well. This is why we use a simple signature method to deduplicate sentences, ignoring numbers and punctuation when computing the respective hashes.
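
To illustrate the idea, here is a minimal sketch of such a signature method in Python. It is not the actual implementation used for the corpora mentioned above: the function names, the regular expression used for normalization, and the choice of SHA-1 as the hash are all assumptions made for the example.

```python
import hashlib
import re


def sentence_signature(sentence: str) -> str:
    """Return a hash of the sentence with numbers and punctuation ignored.

    Assumption: "ignoring" is implemented by replacing every run of
    non-word characters, digits and underscores with a single space.
    """
    normalized = re.sub(r"[\W\d_]+", " ", sentence).strip()
    # SHA-1 is an arbitrary choice here; any stable hash would do.
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(sentences):
    """Keep only the first sentence seen for each signature, in order."""
    seen = set()
    unique = []
    for sentence in sentences:
        signature = sentence_signature(sentence)
        if signature in seen:
            continue  # same words as an earlier sentence, so drop it
        seen.add(signature)
        unique.append(sentence)
    return unique


if __name__ == "__main__":
    programme = [
        "Kino Apollo: 18.30 Uhr, Avatar",
        "Kino Apollo: 20.45 Uhr, Avatar!",
        "Ein anderer Satz.",
    ]
    for kept in deduplicate(programme):
        print(kept)
```

Because digits and punctuation are dropped before hashing, the two cinema listings above differ only in their show times and therefore share one signature. Note that sentences consisting solely of numbers and punctuation all normalize to the empty string and would be collapsed into one.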

Best,

Vlado B, 12:55

Vojtěch Kovář

Mar 2, 2023, 10:47:47 AM
to H Pirker, NoSketch Engine
Dear Hannes,

there is no such option implemented on the front end, however, it is possible on the back end. If you send complement_subc=1 as a parameter to bonito (e.g. in the run.cgi/concordance call), the result will correspond to the complement of the current subcorpus (sent in the usesubcorp parameter).
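
As a rough sketch of what such a call could look like, the following Python snippet only builds the request URL; it does not contact any server. The parameter names complement_subc and usesubcorp and the run.cgi/concordance path come from the description above. Everything else is an assumption for the example: the base URL, the helper name, the sub-corpus name "cinemaprog", and the placeholder corpname and q values that stand in for the remaining parameters of a normal concordance call.

```python
from urllib.parse import urlencode


def complement_concordance_url(base_url, subcorpus, other_params):
    """Build a bonito concordance URL that targets the complement of a sub-corpus.

    base_url     -- hypothetical location of the bonito back end
    subcorpus    -- name of the sub-corpus to exclude
    other_params -- the remaining parameters of an ordinary concordance call
    """
    params = dict(other_params)
    params["usesubcorp"] = subcorpus  # the sub-corpus whose content is excluded
    params["complement_subc"] = 1     # ask for the complement of that sub-corpus
    return f"{base_url.rstrip('/')}/run.cgi/concordance?{urlencode(params)}"


if __name__ == "__main__":
    url = complement_concordance_url(
        "https://example.org/bonito",  # hypothetical installation
        "cinemaprog",                  # hypothetical sub-corpus name
        {"corpname": "newspapers", "q": 'q[word="Avatar"]'},  # placeholder values
    )
    print(url)
```

The helper copies the caller's parameters before adding the two sub-corpus settings, so an existing dictionary of query parameters can be reused unchanged for both the normal and the complement search.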

So far, the front end uses this feature only in the Keywords function where you can select "the rest of the corpus" as the reference subcorpus.

Hope this helps, best regards

Vojtech
Sketch Engine Team


Pirker, Hannes

Mar 2, 2023, 12:40:41 PM
to Vojtěch Kovář, H Pirker, NoSketch Engine

Dear Vojtech, 

great, I will give it a try tomorrow. 

thanks a lot!                
Hannes 
--
Austrian Centre for Digital Humanities 
and Cultural Heritage (ACDH-CH)
Austrian Academy of Science
A-1010 Vienna, Bäckerstraße 13
Tel.: +43 (1) 51581-2293


Pirker, Hannes

Mar 3, 2023, 7:23:24 AM
to Vojtěch Kovář, H Pirker, NoSketch Engine

Dear Vojtech,

FYI: I first tried to simply modify a "normal" concordance call by just adding complement_subc=1, which unfortunately doesn't have any effect:

.../crystal/#concordance?complement_subc=1&...

In contrast, the following DOES work:

.../bonito/run.cgi/concordance?complement_subc=1&...


(I simply should have followed your advice from the beginning ...) 

Thanks a lot

        

Hannes  