Get collocations and concordance for any node by frequency

Evan Brown

Jun 21, 2017, 12:19:57 PM
to NoSketch Engine
Hello everyone,

I am using NoSketchEngine on a project involving legal texts. This is
working fine so far but I have a couple of queries about additional
functionality that I want to implement.

I am using the latest version of Manatee through the Python API.

1. I am able to retrieve collocations and concordances based on a
query for a specific node word and to order the results by frequency.
In addition, I would like to be able to retrieve the most common
collocations irrespective of the node - i.e. the collocations which
occur most frequently in a given corpus for any node. I have tried
querying with:

rangestream = corpus.eval_query("[word=\"*"]")

but I get a regular expression error with this. How can I obtain say
the top ten collocations from a corpus without specifying a particular
node?

2. I am going to include document delimiters in the final version of
my corpus. How can I access the document identifiers once I have a
concordance returned so that I can tell which documents in the corpus
feature the returned language?

I hope that this makes sense. Any help would be gratefully received.

Thanks.

Best wishes

Evan

Evan Brown

Jun 21, 2017, 3:05:12 PM
to Vladimír Benko, NoSketch Engine
Dear Vladimir,

I want to find out what the most common collocations in the corpus are.

A frequency list tells you what the most prevalent words are in the
corpus, but not what they collocate with frequently.

Perhaps I am not expressing a sensible question but I would have
thought that there would be a way to see the most common collocations
in the corpus as a whole.

Thanks for your reply.

Best wishes

Evan

On 21 June 2017 at 19:47, Vladimír Benko <vla...@juls.savba.sk> wrote:
> Dear Evan,
>
> How large are your corpora? And, I am a bit surprised by your query -- what
> do you expect to find out? How should the result differ from, say, a plain
> frequency list?
>
> > I am using NoSketchEngine on a project involving legal texts. This is
> > working fine so far but I have a couple of queries about additional
> > functionality that I want to implement.
> >
> > I am using the latest version of Manatee through the Python API.
> >
> > 1. I am able to retrieve collocations and concordances based on a
> > query for a specific node word and to order the results by frequency.
> > In addition, I would like to be able to retrieve the most common
> > collocations irrespective of the node - i.e. the collocations which
> > occur most frequently in a given corpus for any node. I have tried
> > querying with:
> >
> > rangestream = corpus.eval_query("[word=\"*"]")
>
>
> I do not have any experience with the Python API. The regular expression for
> "everything", however, should be:
>
> ".*"
>
> Or, if you do not want to match "non-words":
>
> "[[:alpha:]]*"
>
> Best,
>
> Vlado B, 20:45
>
>
> --
> Vladimír Benko
>
> Slovak Academy of Sciences
> Ľ. Štúr Institute of Linguistics
> Panská 26, SK-81101 Bratislava
>
> Tel +421-2-54431762 Fax -54431756
>
> http://aranea.juls.savba.sk/guest/
> https://www.facebook.com/araneawebcorpora/

Miloš Jakubíček

Jun 22, 2017, 6:14:31 AM
to Evan Brown, Vladimír Benko, NoSketch Engine
Hi Evan,

I'm not sure I'm getting this right either - a collocation is a pair of words: a headword and a collocate.
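One possible reading of the question is a ranking of adjacent word pairs over the whole corpus, rather than collocates of one chosen headword. The following is only a minimal sketch of that reading in plain Python. It does not use Manatee at all; the function name `top_adjacent_pairs` and the sample sentence are made up for illustration, and it assumes the corpus tokens are already available as a list.

```python
from collections import Counter

def top_adjacent_pairs(tokens, n=10):
    """Return the n most frequent (word, next_word) pairs in a token list."""
    # zip(tokens, tokens[1:]) yields every pair of neighbouring tokens.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(n)

# Hypothetical toy input; a real corpus would supply far more tokens.
tokens = "the court held that the court may order the sale".split()
top = top_adjacent_pairs(tokens, n=3)
for (left, right), freq in top:
    print(left, right, freq)
```

Raw pair frequency tends to favour function words, so a real ranking would probably need an association score or a stop list on top of this.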

Anyway, your line is missing a backslash before the second inner quote, as well as the dot that Vlado pointed out:

rangestream = corpus.eval_query("[word=\".*\"]")

it's more readable with apostrophes used for the Python string:

rangestream = corpus.eval_query('[word=".*"]')

but to get "any" word, you can just do:

rangestream = corpus.eval_query("[]")
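The quoting and the regular expression can both be checked in plain Python before the query is sent anywhere. This is a small sketch only: Python's `re` module stands in for the corpus regex engine, which is an assumption made purely for illustration, and `is_valid_regex` is a hypothetical helper name.

```python
import re

# Both spellings produce exactly the same query string.
escaped = "[word=\".*\"]"
single_quoted = '[word=".*"]'
same_query = escaped == single_quoted

def is_valid_regex(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A bare "*" has nothing to repeat; ".*" repeats "any character".
star_alone_ok = is_valid_regex("*")
dot_star_ok = is_valid_regex(".*")
print(same_query, star_alone_ok, dot_star_ok)
```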

As for the other question, if the corpus marks documents with e.g. a "doc" structure, you can access it through the API as

doc = corpus.get_struct("doc")

See the manatee.i file for a description of the methods.
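A rough sketch of how the structure could be used to look up the document for each hit follows. Everything beyond `eval_query` and `get_struct` is an assumption to be checked against manatee.i: the method names `num_at_pos`, `get_attr`, `pos2str`, `peek_beg`, `next` and `end`, the negative return value for positions outside any structure, and the structure attribute name "id". The helper name `doc_ids_for_hits` is hypothetical.

```python
def doc_ids_for_hits(corpus, query, struct_name="doc", attr_name="id"):
    """Return the document identifier for each hit of a query.

    `corpus` is expected to behave like a manatee.Corpus object.
    Method names other than eval_query and get_struct are assumptions.
    """
    doc = corpus.get_struct(struct_name)
    id_attr = doc.get_attr(attr_name)      # assumed: attribute of the structure
    hits = corpus.eval_query(query)
    ids = []
    while not hits.end():                  # assumed: True once the stream is exhausted
        # assumed: maps a corpus position to the number of the enclosing structure
        num = doc.num_at_pos(hits.peek_beg())
        if num >= 0:                       # assumed: negative means "not inside any doc"
            ids.append(id_attr.pos2str(num))
        hits.next()
    return ids

# Hypothetical usage, with a made-up corpus name:
#   import manatee
#   corpus = manatee.Corpus("mycorpus")
#   print(doc_ids_for_hits(corpus, '[word="contract"]'))
```

Wrapping the result in `set(...)` or a `collections.Counter` would give the distinct documents or the number of hits per document.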

Best
Milos

Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton UK
http://www.sketchengine.co.uk

