High quality word embeddings pre-trained on QatarLiving data

tbmihailov

Jan 11, 2017, 2:21:31 PM

to SemEval-2017 Task 3 CQA

Hi All,

Since this year's SemEval CQA task has the same Subtasks A, B, and C on the Qatar Living data, it seems like a good idea to share some resources obtained from last year's task.
Here are some Word2Vec embeddings trained on the QatarLiving data and fine-tuned for Subtasks A and C; they can be reused for Subtasks A, B, and C: https://github.com/tbmihailov/semeval2016-task3-cqa#resources . The page lists different configurations of the embeddings (vector size, context window, skip-gram, etc.) with download links, along with their performance when used in a simple but powerful linear system based on the similarity between the question and the candidate answers. Feel free to reuse the embeddings in any deep learning or feature-based system for Subtasks A, B, and C.
Ready-to-use pre-trained models can be downloaded and loaded easily with gensim (https://rare-technologies.com/word2vec-tutorial/). A simple example of how to load the embeddings in gensim is also available at https://github.com/tbmihailov/semeval2016-task3-cqa#how-to-use-the-embeddings . Feel free to contact me if you have any questions or problems.
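
For anyone who wants a starting point, here is a rough sketch of how question/answer similarity with word embeddings might look. It is not the system from the repository, just plain averaged word vectors compared by cosine similarity. The helper names (load_vectors, average_vector, rank_answers) and the toy vectors are made up for illustration, and load_vectors assumes the downloaded file is in word2vec format, which may not be the case; check the repository's own loading example for the actual format.

```python
import math


def load_vectors(path, binary=True):
    """Load vectors with gensim. ASSUMPTION: the file is in word2vec format;
    the path and the binary flag are placeholders, not taken from the repo."""
    from gensim.models import KeyedVectors  # imported lazily; needs gensim installed
    return KeyedVectors.load_word2vec_format(path, binary=binary)


def average_vector(tokens, vectors):
    """Mean of the vectors of all tokens found in the vocabulary, or None."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(float(v[i]) for v in found) / len(found) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_answers(question, answers, vectors):
    """Return (score, answer) pairs sorted from most to least similar."""
    q_vec = average_vector(question.lower().split(), vectors)
    scored = []
    for answer in answers:
        a_vec = average_vector(answer.lower().split(), vectors)
        # Texts with no in-vocabulary words get a neutral score of 0.0.
        score = cosine(q_vec, a_vec) if q_vec is not None and a_vec is not None else 0.0
        scored.append((score, answer))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy 2-dimensional vectors standing in for the real embeddings (illustrative only).
toy_vectors = {"visa": [1.0, 0.0], "permit": [0.9, 0.1], "beach": [0.0, 1.0]}
ranked = rank_answers("visa", ["beach", "permit"], toy_vectors)
```

With real embeddings you would pass the object returned by load_vectors instead of toy_vectors, since gensim's KeyedVectors supports the same `word in vectors` and `vectors[word]` lookups. Whitespace tokenization is only a placeholder here; use whatever preprocessing matches how the embeddings were trained.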