Fwd: [iCQA project:25]: Organizing questions into syllabus-like structure

Yu Wang

Nov 20, 2012, 3:23:31 PM
to iCQA
Hello,

I'd like to share some of the results.

I took the syllabus from a Java textbook and used each chapter title as a query over all Java-related questions. Below are some sample chapters, each with its top 5 questions (question IDs in front):

Using BM25 over question titles

Query 8: A Simple Java Program
        7627006   Java Simple Line Drawing Program
        8027672   Java - What's wrong with this simple program?
        11197363  ELM327 simple connect java program
        2165006   Simple Java Client/Server Program
        2731608   Simple Question:Output of below Java program
Query 9: Creating Compiling and Executing a Java Program
        529391    Eclipse - Compiling & executing program
        7460199   Compiling and executing a java program with a jar library package
        4084466   java program compiling
        2011664   Compiling a java program into an exe
        9813821   Executing external program in Java
Query 14: Reading Input from the Console
        9983500   How to cancel out reading input from console?
        10971598  Reading console input with mpj-express
        1938389   reading int from console
        8510398   Make the input text bold when reading from the console
        8107903   Passing input from JButton to console
Query 22: Numeric Type Conversions
        1564537   Supporting covariant type conversions in Java
        10478429  Casting a double to another numeric type
        1691876   Using the right numeric data type
        1702034   HSSFCell - determining what type of numeric
        463950    String conversions


Using BM25 over question bodies

Query 8: A Simple Java Program
        2259438   How to implement communication between Java client application (Android) and PHP server application?
        10535804  == vs .equals - why different behaviour?
        7504832   Simple FTP program in C or Java
        8677639   Is it possible to invoke a function in VBA from java?
        6556452   How to import libraries from JAR-files into a Java program using TextMate
Query 9: Creating Compiling and Executing a Java Program
        804466    How do I create executable Java program?
        9059526   I want to autocompile a java file using Java Programming (J2SE), is there any script tutorials?
        325114    Accessing non top-level class without a top level class in Java
        7139301   Problem in android
        1766216   Exception while compiling: wrong version 50.0, should be 49.0
Query 14: Reading Input from the Console
        6663698   redirect inputStream to JTextField
        11044499  SocketException- Connection reset on multithreaded client/server program? Java
        7588069   Why does Java takes Input in String format only?
        4193293   Log4j hanging my application
        10052658  Integer vs. String decision making in Java
Query 22: Numeric Type Conversions
        3746365   Conversion between numeric objects in Java
        4718200   Numeric assignment should throw error
        3675365   Assigning integer value to a Float wrapper in Java
        3504521   What Java data type corresponds to the Oracle SQL data type NUMERIC?
        3733644   Java "fresh type variable"

Retrieving on titles alone seems to be a reasonable baseline for our purpose, and the question body should help a little as well.
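
For anyone who wants to reproduce the baseline without a search engine, here is a minimal sketch of BM25 ranking with a chapter title as the query. It is not the code that produced the lists above: the tokenizer, the parameter values k1=1.2 and b=0.75, and the function names are assumptions, and the four titles are just a small sample copied from the results.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters (a simplifying assumption).
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def bm25_rank(query, docs, k1=1.2, b=0.75, top_k=5):
    """Rank docs (dict: id -> text) against the query with Okapi BM25."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))
    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# A few question titles taken from the sample results above.
titles = {
    4084466: "java program compiling",
    2011664: "Compiling a java program into an exe",
    9813821: "Executing external program in Java",
    463950: "String conversions",
}
print(bm25_rank("Creating Compiling and Executing a Java Program", titles))
```

Swapping the title strings for question bodies gives the body-based variant.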

Thanks,
Yu


On Thu, Nov 1, 2012 at 4:52 PM, iCQA on behalf of qiaoling.liu <ic...@googlegroups.com> wrote:
Nice ideas.

Since "representative" is "more like a combination of difficulty level, relevance, and coverage (how crucial
the question is in the scope of the topic)", do we need a model that evaluates the representativeness of the top N questions for a topic jointly, i.e. P(q_1, q_2, q_3, ..., q_N | z)? Note that the N questions here are not independent of each other.
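
One way to make that dependence explicit, offered only as a sketch and not as the model itself, is the chain-rule factorization, which holds for any joint distribution. The second line is the independence approximation that the note above warns against:

```latex
P(q_1, \dots, q_N \mid z) = \prod_{i=1}^{N} P(q_i \mid q_1, \dots, q_{i-1}, z)
\quad\text{and, in general,}\quad
P(q_1, \dots, q_N \mid z) \neq \prod_{i=1}^{N} P(q_i \mid z)
```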

On Thu, Nov 1, 2012 at 2:22 PM, iCQA on behalf of Yu Wang <ic...@googlegroups.com> wrote:
Selecting representative questions for topics in a syllabus.
Since the goal is to organize questions into a syllabus structure, the
whole process can be decomposed into two parts: (1) given a question,
decide which chapter/section/topic in the syllabus it belongs to;
(2) given a topic in the syllabus, find the top N representative
questions.
(1)     A probabilistic way of stating this is P(z|q), where q is a
question in the collection and z is a topic in the syllabus (for
example, the title of a chapter).
(2)     "Representative" is different from popularity (e.g. votes on
Stack Overflow) or a relevance measure (e.g. the score of a ranking
algorithm). Because we are building materials for learners,
"representative" here means usefulness for learners or beginners. It
is more like a combination of difficulty level, relevance, and coverage
(how crucial the question is in the scope of the topic).
Selecting questions asked by learners.
One possible way to accomplish (2) is to find high-quality questions
asked by real learners, so the problem becomes identifying good
learners from their questions. We assume that the learning process
usually follows a predefined order when users learn from books or
online materials. When they ask questions along the way, the topics of
their questions should align with the order defined in those materials.
We propose a topic transition matrix: p(z_i | z_j) is the probability
that topic z_i is learned (introduced in the book) after topic z_j. For
example, when users learn the Java programming language, they usually
start with data types, then if-else statements, and then for loops. So
p(if else | data types) and p(for loop | if else) will be high, but
p(data types | for loop) will be low. Once the topic transition matrix
has been estimated, we can measure how likely a user is to be a real
learner by checking whether the topic transitions in their question
stream follow the matrix.
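
To make the transition-matrix idea concrete, here is a rough sketch. How the matrix is estimated is left open above, so counting consecutive topic pairs in a few topic orderings, the add-alpha smoothing, and the average log-probability score are all assumptions of this example; the function names and the toy orderings are hypothetical too.

```python
import math
from collections import defaultdict


def estimate_transitions(sequences, topics, alpha=1.0):
    """Estimate p(z_i | z_j), stored as probs[z_j][z_i].

    Counts how often topic z_i directly follows topic z_j in the given
    orderings. Add-alpha smoothing (an assumption of this sketch) keeps
    unseen transitions at a small non-zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev in topics:
        total = sum(counts[prev].values()) + alpha * len(topics)
        probs[prev] = {nxt: (counts[prev][nxt] + alpha) / total for nxt in topics}
    return probs


def learner_score(stream, probs):
    """Average log-probability of the topic transitions in a question stream."""
    pairs = list(zip(stream, stream[1:]))
    if not pairs:
        return float("-inf")
    return sum(math.log(probs[prev][nxt]) for prev, nxt in pairs) / len(pairs)


topics = ["data types", "if else", "for loop"]
# Hypothetical topic orderings, e.g. read off textbook tables of contents.
orderings = [topics, topics, ["data types", "for loop"]]
p = estimate_transitions(orderings, topics)

in_order = learner_score(["data types", "if else", "for loop"], p)
out_of_order = learner_score(["for loop", "if else", "data types"], p)
print(in_order > out_of_order)
```

A user whose question topics follow the syllabus order gets a higher score than one whose topics run against it, which is the signal described above for spotting real learners.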


On Wed, Oct 31, 2012 at 8:44 PM, Yu Wang <ywa...@emory.edu> wrote:
Hi Azat,

Sorry for the late response. My research focuses on the temporal part
of the project (the topic transitions in my previous email), and I am
not an expert in CQA. I think Qiaoling could give better insight into
the novelty of our project and also point to some related work.

Thanks,
Yu

On Wed, Oct 31, 2012 at 8:41 PM, Yu Wang <ywa...@emory.edu> wrote:
>
>
> On Thu, Oct 25, 2012 at 3:41 PM, iCQA on behalf of qiaoling.liu
> <ic...@googlegroups.com> wrote:
>> Nice. I think that could be a good start.
>>
>> Actually, I think there are two directions we could take:
>> 1. Develop algorithms to automatically build a syllabus based on the
>> questions.
>> 2. Given any syllabus (e.g. from a textbook or course web page), develop
>> algorithms to organize the questions around the syllabus; more specifically,
>> select important questions for each topic in the syllabus and rank them from
>> easy to hard.
>>
>> Are we going in the second direction? I personally think it's the easier
>> one to start with.
>>
>>
>> On Thu, Oct 25, 2012 at 2:19 PM, iCQA on behalf of Nikita Zhiltsov
>> <icqa+noreply-APn2wQepKAJS0lVTFxy...@googlegroups.com>
>> wrote:
>>>
>>> Azat,
>>>
>>> the task looks quite novel. Let's start by brainstorming possible
>>> solutions for it.
>>> So, given a set of questions tagged with some category of interest, e.g.
>>> 'java', we need to organize them into a reasonable hierarchical
>>> structure (1st layer: topics, 2nd layer: questions), with basic and
>>> easy topics and questions coming first, more complex topics coming
>>> next, and so on.
>>> One idea is as follows (@Yu, please amend it if something's
>>> wrong):
>>>
>>> 1. We have an example syllabus from a textbook or online course, i.e. a
>>>    list of topics.
>>> 2. We use topic titles as queries and issue those queries (and, perhaps,
>>>    some additional queries after applying expansion techniques such as
>>>    synonyms or pseudo-relevance feedback) to a search system. In our case,
>>>    we may use the StackOverflow API or the Sphinx search engine after
>>>    loading the data into the iCQA app and configuring it properly.
>>> 3. We get the top-k results ranked by relevance, represent them as feature
>>>    vectors (relevance score, asker/answerer ratings, etc.) and apply
>>>    clustering techniques. The questions located close to the cluster
>>>    centroids may be good candidates.
>>> 4. The usefulness of the resulting syllabus should be evaluated, e.g. by
>>>    using Amazon MTurk.
>>>
>>>
>>> On Wednesday, October 24, 2012 6:27:50 AM UTC-4, Азат Хасаншин wrote:
>>>>
>>>> Hello, Yu. I'm also interested in this task. Could you give us some
>>>> pointers on how to start working on it, maybe some existing papers?
>>>>
>>>> On 23.10.2012 at 21:22, "iCQA on behalf of Nikita Zhiltsov"
>>>> <icqa+noreply-APn2wQepKAJS0lVTFxy...@googlegroups.com>
>>>> wrote:
>>>>>
>>>>> Let's elaborate on this task proposed by Yu in this thread.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "iCQA" group.
>>>>> To post to this group, send email to ic...@googlegroups.com.
>>>>> To unsubscribe from this group, send email to
>>>>> icqa+uns...@googlegroups.com.
>>>>>
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msg/icqa/-/AIFLvpJY5bkJ.
>>>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>>>
>>>>>
>>>
>>
>>
>>
>>
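
The clustering step in Nikita's quoted message above (represent the top-k results as feature vectors, cluster them, and keep the questions nearest the centroids) could be sketched roughly as follows. The plain k-means with evenly spaced initial centroids, the two features, and every value in the toy data are assumptions for illustration only.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic init (evenly spaced input points)."""
    step = max(1, len(points) // k)
    centroids = [list(p) for p in points[::step][:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def representatives(ids, points, k):
    """For each cluster, return the id of the question closest to its centroid."""
    centroids, labels = kmeans(points, k)
    picks = []
    for c in range(len(centroids)):
        members = [
            (math.dist(p, centroids[c]), qid)
            for qid, p, lab in zip(ids, points, labels)
            if lab == c
        ]
        if members:
            picks.append(min(members)[1])
    return picks


# Hypothetical feature vectors: (relevance score, asker rating), scaled to [0, 1].
ids = ["q1", "q2", "q3", "q4", "q5", "q6"]
features = [(0.9, 0.8), (0.8, 0.9), (0.85, 0.85), (0.2, 0.1), (0.1, 0.2), (0.15, 0.15)]
print(representatives(ids, features, k=2))
```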



Qiaoling Liu

Nov 20, 2012, 4:24:15 PM
to iCQA on behalf of Yu Wang
Nice baselines! I think body-based retrieval is noisier, though it does retrieve some better items than title-based retrieval.

I wonder whether a two-step retrieval would help:
Step 1: run the current retrieval and keep the top N results (e.g. N=1000);
Step 2: treat those N results as the collection (e.g. for computing idf and tf) and retrieve the top K results from it (e.g. K=5).

The intuition is that we first get relevant results from the whole question collection, and then pick the questions that stand out most among them.
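
A rough sketch of the two steps, assuming BM25 is the scorer in both passes and that only the collection statistics (idf, average length) change between them. The function names, the whitespace tokenization, and the tiny sample are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score docs (dict: id -> token list); idf and average length come from `docs` itself."""
    n = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n
    df = Counter(term for toks in docs.values() for term in set(toks))
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
                norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
                score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def two_step(query, docs, n=1000, k=5):
    terms = set(query.lower().split())
    # Step 1: rank against the whole collection and keep the top n matches.
    first = bm25_scores(terms, docs)
    kept = sorted((d for d in first if first[d] > 0), key=first.get, reverse=True)[:n]
    if not kept:
        return []
    # Step 2: recompute the statistics over the kept results only and re-rank.
    subset = {d: docs[d] for d in kept}
    second = bm25_scores(terms, subset)
    return sorted(kept, key=second.get, reverse=True)[:k]


docs = {
    1: "reading int from console".split(),
    2: "reading console input with mpj-express".split(),
    3: "string conversions".split(),
    4: "passing input from jbutton to console".split(),
}
print(two_step("reading input from the console", docs, n=3, k=2))
```

In the second pass, terms shared by nearly all of the N kept results get a low idf, so the re-ranking leans on the terms that separate them.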

Nikita Zhiltsov

Nov 20, 2012, 9:53:48 PM
to ic...@googlegroups.com
I second Qiaoling's idea. Also, it would be interesting to see the results of the following precision-oriented approach:
  • AND query
  • BM25
  • over question titles only.
See Lucene's BooleanQuery and BooleanClause.Occur.MUST.
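
To illustrate the AND semantics without a Lucene setup, here is a small pure-Python stand-in (it does not use Lucene's API): a title is kept only if it contains every query term, and the survivors are ranked by BM25. The stopword list is a hypothetical one added so that words like "the" do not empty the result set; the function names and parameter values are assumptions as well.

```python
import math
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "from", "of", "in", "to"}  # hypothetical list


def terms(text):
    return [t for t in text.lower().split() if t not in STOPWORDS]


def and_bm25(query, titles, k1=1.2, b=0.75, top_k=5):
    docs = {doc_id: terms(title) for doc_id, title in titles.items()}
    q = set(terms(query))
    n = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n
    df = Counter(w for toks in docs.values() for w in set(toks))
    scores = {}
    for doc_id, toks in docs.items():
        # AND semantics: every query term must occur in the title.
        if not q or not q.issubset(toks):
            continue
        tf = Counter(toks)
        scores[doc_id] = sum(
            math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            * tf[w] * (k1 + 1)
            / (tf[w] + k1 * (1 - b + b * len(toks) / avg_len))
            for w in q
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Titles taken from the "Reading Input from the Console" results earlier in the thread.
titles = {
    1938389: "reading int from console",
    10971598: "Reading console input with mpj-express",
    8107903: "Passing input from JButton to console",
    8510398: "Make the input text bold when reading from the console",
}
print(and_bm25("Reading Input from the Console", titles))
```

In Lucene the same constraint would be expressed by adding each term to a BooleanQuery as a clause marked BooleanClause.Occur.MUST.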