
NoSkE ignores document boundaries


Michal Měchura

Nov 22, 2024, 12:40:11 PM
to NoSketch Engine

It seems to me that NoSkE is pretty much completely ignoring the <doc> structure.

  • When doing a concordance search, the concordance sometimes includes matches that stretch over two documents (e.g. the first word in the KWIC is the last word of document 1 and the last word in the KWIC is the first word of document 2). Yes, I know I can work around this by putting within <doc/> in every CQL query. But what if I want to do a basic query? It seems strange that results spanning multiple documents cannot be excluded.

  • In a concordance, the left and right context sometimes include text from the start or end of neighbouring documents. Yes, I know that I can switch from KWIC layout to Sentence layout to prevent this. But what if I do want the KWIC layout? Then there seems to be no way to prevent it.

  • Same for the “wide context” pop-up: it often includes text from neighbouring documents.

  • When looking at the collocations of a concordance, the collocates sometimes include words from neighbouring documents. There seems to be no way to prevent this from happening.

Is this behaviour by design?

If it is, then why? What is the thinking behind it? In my opinion it shouldn’t be like that. Documents do not normally follow on from each other like sentences or paragraphs. Things like left and right context, collocations and everything should only be taken from within the same document, in my opinion. There is no value in seeing, for example, fake “collocates” from documents which just happen to sit next to each other in the corpus. Am I right?

If this isn’t supposed to be happening, then what have I missed? I don’t seem to be able to get my NoSkE installation to respect document boundaries. My corpus has the following structures defined in the corpus configuration file:

STRUCTURE doc {
    ATTRIBUTE file
    ATTRIBUTE medium
    ATTRIBUTE textType
    ATTRIBUTE genre
    ATTRIBUTE topic
    ATTRIBUTE author
    ATTRIBUTE title
    ATTRIBUTE url
    ATTRIBUTE source
    ATTRIBUTE publisher
}
STRUCTURE s
STRUCTURE g {
    DISPLAYTAG 0
    DISPLAYBEGIN "_EMPTY_"
}

My configuration file does not contain a DOCSTRUCTURE option, which means that doc should be treated as the document structure by default.

I have also tried setting DOCSTRUCTURE "doc" explicitly in my configuration file, but the result is the same.

Yours in confusion,
Michal

Miloš Jakubíček

Nov 25, 2024, 5:47:54 AM
to Michal Měchura, NoSketch Engine
Hi Michal,

Yes, that's correct and by design. Corpora are very diverse and might
not even contain any document markup (technically speaking, a corpus
is not a set of documents, but a vector of tokens, with arbitrary
optional additional markup), so we make no assumptions about that.
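
To make the "vector of tokens" idea concrete, here is a toy Python sketch.
It is only an illustration under assumptions, not how NoSkE or its backend
stores anything: `tokens`, `doc_spans` and `clipped_context` are made-up
names, and the document structure is modelled as plain (start, end) token
ranges laid over the flat token list. It shows what clipping a KWIC context
at the enclosing document would amount to.

```python
from bisect import bisect_right

# Toy corpus: one flat token vector. The <doc> structure is just a list of
# (start, end) token ranges over that vector (end is exclusive).
tokens = ["a", "b", "c", "d", "e", "f", "g"]
doc_spans = [(0, 3), (3, 7)]


def clipped_context(pos, width):
    """Return (left, right) context of up to `width` tokens around `pos`,
    cut off at the boundaries of the document that contains `pos`.

    Assumes `pos` lies inside one of the ranges in `doc_spans`.
    """
    starts = [start for start, _ in doc_spans]
    # Index of the last document whose start is <= pos.
    i = bisect_right(starts, pos) - 1
    start, end = doc_spans[i]
    left = tokens[max(start, pos - width):pos]
    right = tokens[pos + 1:min(end, pos + 1 + width)]
    return left, right
```

Without the `max`/`min` clipping, the context window would simply run on
into the neighbouring range, which is the behaviour described in the
original question.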

All of what you suggest is in principle doable, but comes with some
considerations one needs to make so that it is not confusing to users.
The very first thing you should do if you care about document
boundaries is to make them visible by default to users (using the
DEFAULTSTRUCTS directive) -- I'd say this in principle solves all your
points except for the last one which gets solved by using word
sketches instead of a concordance collocation function. As for the
concordance, remember that all queries are CQL queries, the "simple
query" is just one of the wrappers generating the CQL query and you
can modify how the simple query works using the SIMPLEQUERY directive.
Just be aware of the fact that adding "within <doc/>" might have a
significant performance impact in case of big(gish) corpora.
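
If you generate queries in your own client code rather than through the
simple-query wrapper, the same restriction can be added there. A minimal
Python sketch, assuming a hypothetical helper named `within_doc` (it is not
part of NoSkE) that appends the clause mentioned above to a CQL string:

```python
def within_doc(cql, structure="doc"):
    """Append a 'within <structure/>' clause to a CQL query string.

    Hypothetical helper for illustration only. It does plain string
    handling and does not parse or validate the CQL.
    """
    clause = f"within <{structure}/>"
    query = cql.strip()
    # Leave the query alone if the clause is already there.
    if clause in query:
        return query
    return f"{query} {clause}"
```

For example, within_doc('[lemma="cat"] [lemma="sit"]') gives
'[lemma="cat"] [lemma="sit"] within <doc/>'. The performance caveat above
still applies to the resulting query.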

Best
Milos



Milos Jakubicek

CEO, Lexical Computing
http://www.lexicalcomputing.com
http://www.sketchengine.eu

Michal Měchura

Nov 25, 2024, 9:36:28 AM
to NoSketch Engine
Cool, thank you for clarifying that this is intended. I'm glad I wasn't just seeing things.
M.