It seems to me that NoSkE is pretty much completely ignoring the <doc> structure.
When doing a concordance search, the concordance sometimes includes matches that stretch over two documents (e.g. the first word in the KWIC is the last word in document 1 and the last word in the KWIC is the first word in document 2). Yes, I know I can work around this by putting within <doc/> in every CQL query. But what if I want to do a basic query? It seems strange that results spanning multiple documents cannot be excluded.
In a concordance, the left and right context sometimes include text from the start or end of neighbouring documents. Yes, I know that I can switch from KWIC layout to Sentence layout to prevent this. But what if I do want the KWIC layout? Then there seems to be no way to prevent this.
Same for the “wide context” pop-up: it often includes text from neighbouring documents.
When looking at the collocations of a concordance, the collocates sometimes include words from neighbouring documents. There seems to be no way to prevent this from happening.
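For reference, the within <doc/> workaround I mentioned above amounts to something like the following. This is only a minimal sketch of the string manipulation: the helper name restrict_to_doc is my own invention and not part of NoSkE or any of its APIs, and it only helps for CQL queries, not for basic queries, context display, or collocations.

```python
def restrict_to_doc(cql: str, structure: str = "doc") -> str:
    """Append a `within <structure/>` clause to a CQL query.

    Hypothetical helper for illustration only; it does not call NoSkE.
    If the query already ends with the clause, it is returned unchanged.
    """
    clause = f"within <{structure}/>"
    query = cql.strip()
    if query.endswith(clause):
        return query
    return f"{query} {clause}"


# Example: a two-token query restricted to a single document.
print(restrict_to_doc('[lemma="black"] [lemma="cat"]'))
# prints: [lemma="black"] [lemma="cat"] within <doc/>
```

Having to do this for every single query is exactly what I would like to avoid.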
Is this behaviour by design?
If it is, then why? What is the thinking behind it? In my opinion it shouldn’t be like that. Documents do not normally follow on from each other the way sentences or paragraphs do. Left and right context, collocations and so on should only be taken from within the same document. There is no value in seeing, for example, fake “collocates” from documents which just happen to sit next to each other in the corpus. Am I right?
If this isn’t supposed to be happening, then what have I missed? I don’t seem to be able to get my NoSkE installation to respect document boundaries. My corpus has the following structures defined in the corpus configuration file:

    STRUCTURE doc {
        ATTRIBUTE file
        ATTRIBUTE medium
        ATTRIBUTE textType
        ATTRIBUTE genre
        ATTRIBUTE topic
        ATTRIBUTE author
        ATTRIBUTE title
        ATTRIBUTE url
        ATTRIBUTE source
        ATTRIBUTE publisher
    }
    STRUCTURE s
    STRUCTURE g {
        DISPLAYTAG 0
        DISPLAYBEGIN "_EMPTY_"
    }

And my configuration file does not contain a DOCSTRUCTURE option, which means that doc should be treated as documents by default.
I have also tried setting DOCSTRUCTURE "doc" explicitly in my configuration file, but it made no difference: the result is the same.
Yours in confusion,
Michal