Which queries to use depends on the problem you want to solve.
First, the article about ConceptNet is a must-read [1]. Here is a summary with extracts:
While WordNet is optimised for lexical categorisation and word-similarity
determination, and Cyc is optimised for formalised logical
reasoning, ConceptNet is optimised for making practical
context-based inferences over real-world texts.
Context-based inference methods allow ConceptNet to
perform interesting tasks such as the following:
• ‘given a story describing a series of everyday events,
where is it likely that these events will take place, what is
the mood of the story, and what are possible next
events?’ (spatial, affective, and temporal projections),
• ‘given a search query (assuming the terms are
commonsensical) where one of the terms can have
multiple meanings, which meaning is most likely?’
(contextual disambiguation),
• ‘presented with a novel concept appearing in a story,
which known concepts most closely resemble or
approximate the novel concept?’ (analogy-making).
ConceptNet embraces the ease-of-use of WordNet’s semantic
network representation, and the richness of Cyc’s content.
While WordNet excels as a lexical resource, and Cyc excels
at unambiguous logical deduction,
ConceptNet’s forte is contextual commonsense reasoning
making practical inferences over real-world texts, such as
analogy, spatial-temporal-affective projection, and contextual
disambiguation. Tasks ConceptNet can achieve:
- Contextual neighbourhoods (through assoc-space or SimRank?)
- Realm-filtering
- Topic generation
- Analogy-making: “Stated concisely, two ConceptNet nodes are analogous if their sets of back-edges (incoming edges) overlap”
- Projection: following a single transitive relation-type. ‘Los Angeles’ is located in ‘California’, which is located in ‘United States’, which is located on ‘Earth’ is an example of a spatial projection, since LocationOf is a transitive relation.
- Topic gisting
- Disambiguation and classification: similar to the approach taken by statistical classifiers, which compute classifications using cosine distance in a high-dimensional vector space. The main difference in our approach is that the dimensions of our vector space are commonsense-semantic (e.g. along dimensions of time, space, affect) rather than statistical (e.g. features such as punctuation, keyword frequency, syntactic role).
- Novel-concept identification: takes as input a document and a novel concept in that document, and outputs a list of potential things the novel concept might be, by making analogies to known concepts.
- Affect sensing
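The analogy-making idea quoted above ("two ConceptNet nodes are analogous if their sets of back-edges overlap") can be sketched in a few lines. This is a toy illustration on an invented edge list, not the real ConceptNet data or API; the edge triples and the Jaccard-style score are my own stand-ins:

```python
# Toy sketch of back-edge-overlap analogy-making. Edges are invented
# (source, relation, target) triples, not real ConceptNet assertions.
edges = [
    ("cat", "IsA", "pet"),
    ("dog", "IsA", "pet"),
    ("cat", "IsA", "animal"),
    ("dog", "IsA", "animal"),
    ("hamster", "IsA", "animal"),
    ("car", "IsA", "vehicle"),
]

def back_edges(node):
    """Set of (relation, source) pairs pointing into `node` (its back-edges)."""
    return {(rel, src) for src, rel, tgt in edges if tgt == node}

def analogy_score(a, b):
    """Overlap of the two nodes' back-edge sets, as a Jaccard ratio."""
    ba, bb = back_edges(a), back_edges(b)
    if not ba and not bb:
        return 0.0
    return len(ba & bb) / len(ba | bb)
```

Here "pet" and "animal" come out analogous (both receive IsA edges from "cat" and "dog"), while "pet" and "vehicle" share nothing.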
Applications of ConceptNet:
- A tool that observes a user writing an e-mail and proactively suggests photos relevant to the user's story.
- Story-generator that allows a person to interactively invent a story with the system
- Product recommendation for Amazon.com, using ConceptNet to reason about a person's goals and desires, creating a profile of their predicted tastes.
- Speech-based conversation understanding system that uses commonsense to gist the topics of casual conversations.
- ‘Commonsense Predictive Text Entry’ [30] leverages ConceptNet to understand the context of a user’s mobile-phone text-message and to suggest likely word completions.
[1] http://web.media.mit.edu/~push/ConceptNet-BTTJ.pdf

If you don't know about NLP, you should read the Wikipedia article [2] and its list of the major tasks of NLP [3]. This sounded very abstract to me, so I started watching the Stanford NLP course on Coursera [4]. It's helpful as introduction material to get a grasp of the different low-level tasks (parsing, tokenization, data analysis, some algorithms) and some higher-level ones (question answering, summarization, information retrieval). I assume you are interested in the latter.
To be honest, I'm not sure exactly what I am doing. I tried to skip the low-level stuff since it's already done. To start, I assigned myself the task of extracting interesting information from BBC articles [5]. Here is what I do:
- I load every article from the tech category (I chose that category because I know the domain)
- I remove stopwords from the articles
- I link words from the articles to concepts in ConceptNet, going through some heuristics to match stems to concepts when the words don't exist in ConceptNet as-is
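The three steps above can be sketched roughly like this. Everything here is a toy stand-in: the stopword list, the `concepts` set, and the crude suffix-stripping "stemmer" are invented for illustration, not the real ConceptNet lookup or porter2:

```python
# Rough sketch of the pipeline: tokenize, drop stopwords, then link
# words to known concepts, falling back to a stemmed form. All data
# and helpers here are toy stand-ins.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

# Pretend these are nodes already present in ConceptNet.
concepts = {"release", "database", "cassandra", "company"}

def crude_stem(word):
    """Very crude suffix-stripping stand-in for a real stemmer."""
    for suffix in ("ing", "d", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def link_to_concepts(text):
    """Tokenize, remove stopwords, then map each word to a known
    concept, trying the stemmed form when the word itself is unknown."""
    tokens = [w.lower().strip(".,") for w in text.split()]
    tokens = [w for w in tokens if w and w not in STOPWORDS]
    linked = []
    for word in tokens:
        if word in concepts:
            linked.append(word)
        elif crude_stem(word) in concepts:
            linked.append(crude_stem(word))
    return linked
```

For example, "Datastax released a new Cassandra database." links to the concepts release, cassandra, and database, while unknown words like "Datastax" are simply dropped (which is exactly the gap the "add missing concepts" goal below is about).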
So far, that's it. I need to replace the current library I use with spaCy to parse the articles, because its tokenization is better than the porter2 stemming I use now. I plan to use graph-theory algorithms (that's what people call them...) like SimRank, CoSimRank, or shortest paths [6], and to keep an eye on the "term rewriting" stuff. I'm not at that point yet, though. Right now, I'm trying to get a better understanding of ConceptNet through the BBC data by walking and exploring the resulting graph, to understand how it behaves or how it can behave. There is a lot of "manual" work, which translates both into exploring the data and into building dynamic-programming algorithms.
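As a taste of the graph-walking part, here is a minimal shortest-path search (plain BFS) over a toy adjacency list; the graph data is invented, not extracted from ConceptNet:

```python
# Minimal breadth-first shortest-path search on a toy concept graph.
# The adjacency list is invented for illustration.
from collections import deque

graph = {
    "cassandra": ["database"],
    "database": ["software", "cassandra"],
    "software": ["computer", "database"],
    "computer": ["software"],
}

def shortest_path(start, goal):
    """BFS; returns the node list of one shortest path between two
    concepts, or None if the goal is unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

SimRank and CoSimRank replace this hop count with a recursive "two nodes are similar if their neighbors are similar" score, but the exploration pattern over the graph is the same.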
My short-term goals:
- Add missing concepts, or missing links between the articles and ConceptNet
- Extract significant single-word concepts from the articles: topics, brands, geographical locations, and dates
- Make sure that topics are merged into similar concepts
- Create a hierarchy of topics
- Create surface texts (summaries) for articles, for instance: "Datastax released Cassandra"
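The surface-text goal could start as simple template filling over extracted (subject, relation, object) triples. The relation names and templates below are made up for illustration; real ConceptNet relations would need their own templates:

```python
# Hypothetical surface realization: turn an extracted triple into a
# one-line summary via per-relation templates (all invented here).
TEMPLATES = {
    "Released": "{subject} released {object}",
    "LocatedIn": "{subject} is located in {object}",
    "IsA": "{subject} is a {object}",
}

def surface_text(subject, relation, obj):
    """Render one triple as a sentence, or None for unknown relations."""
    template = TEMPLATES.get(relation)
    if template is None:
        return None
    return template.format(subject=subject, object=obj)
```

So a triple like ("Datastax", "Released", "Cassandra") yields exactly the example summary above.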
Other possible goals:
- Summarize Hacker News articles, which have no pre-computed categories (but a lot of data) [7]
- Summarize Stack Overflow "clusters" to add a "group of questions" view that answers a question like «How to create a REST API in Python», listing all questions related to that topic while avoiding possible duplicates. That query could be the parent of a more specific one: «How to create a REST API in Python using Django REST Framework». Clusters already exist implicitly in the SO dumps through the PostLinks table (typed duplicate or related) [8]. You can group them using ML or dynamic programming, excluding duplicate questions, but the goal is really to make sense of the data, and possibly to use the algorithm on future unlabeled data.
- Or implement vague/list search queries like «What are common pitfalls of building a REST API» or «What are the most common security issues in web dev», which are not valid SO questions but could be answered by aggregating SO data. This resembles the new Wikibase «Query» feature, which allows other wikis to use the structured data of Wikibase as a source of knowledge via SPARQL queries.
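The clustering step over the PostLinks table could be as simple as taking connected components of the link graph with a union-find structure. The post IDs and link types below are invented; the real dump encodes link types as numeric LinkTypeId values:

```python
# Sketch: group Stack Overflow questions into clusters by treating
# duplicate/related PostLinks rows as edges and taking connected
# components with a small union-find. Data here is invented.
def cluster_posts(links):
    """links: iterable of (post_id, related_post_id, link_type).
    Returns clusters as a list of frozensets of post ids."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b, _link_type in links:
        union(a, b)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return [frozenset(g) for g in groups.values()]
```

Duplicates could then be collapsed inside each cluster, keeping one canonical question per duplicate chain.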
Getting Wikibase to work with ConceptNet is an interesting topic, I think. It has another way of linking items between them, going through property nodes, which you can link to other properties; I think it's best represented as a hypergraph. I also discovered http://rest.wikimedia.org/ (down/slow at this time), which makes wikis available in various formats, among which HTML and Parsoid, a JSON representation of the wiki markup; this could maybe make the code that builds ConceptNet faster.
IMO, compared to learning programming, the path here is more obscure.
[2] https://en.wikipedia.org/wiki/Natural_language_processing
[3] https://en.wikipedia.org/wiki/Natural_language_processing#Major_tasks_in_NLP
[4] https://class.coursera.org/nlp/lecture
[5] I use the full dataset: http://mlg.ucd.ie/datasets/bbc.html
[6] https://www.hackerrank.com/domains/algorithms/graph-theory
[7]
https://archive.org/details/HackerNewsStoriesAndCommentsDump
[8]
https://archive.org/details/stackexchange