sharing Summer of Code application materials


Joe Corneli

Jun 14, 2007, 12:57:07 PM
to plane...@googlegroups.com, pju...@gmail.com, tle...@gmail.com, peb...@gmail.com
Some people have been curious about what the Summer of Code
participants are working on. Since I'm not sure exactly what they are
working on right now, I am just going to share their original
applications. These people have already been accepted to the program
and are, at least in theory, working on these projects right now! This
email is purely meant to be informative for other PM people who might
not otherwise know what's going on with Summer of Code!

Indeed, various members of the PlanetMath community may be
interested in coordinating efforts with these interns. Feel
free to contact them directly, or ask me or Aaron to serve
as your liaison.


Title/Summary Towards Estimating Authority of Users in PlanetMath.org
Student Pawel Jurczyk
Student Email pju...@gmail.com

Student Major Computer Science
Student Degree phd
Student Graduation 2010

Student Home Page http://mathcs.emory.edu/~pjurczy/

Organization PlanetMath
Assigned Mentor Aaron Krowne
Score 7

Abstract

The popularity of portals where users generate content is constantly
growing. Despite this popularity, however, the quality of the content
is uneven: while some users usually provide good content, many others
often provide poor answers. Hence, estimating the authority, or the
expected quality, of users is a crucial task for this emerging domain,
with potential applications to answer ranking and to incentive
mechanism design.

Detailed Description

Current search algorithms for many portals with user-generated
content focus on textual features of posts. However, this approach
does not take into account other features that could be used in the
ranking method. These non-textual features can include the number of
votes for a given post, the length of a post, the average length of a
given user's posts, the total number of votes for a given user, and
many others (see: J. Jeon, W.B. Croft, J.H. Lee, and S. Park. A
framework to predict the quality of answers with non-textual
features. In Proceedings of SIGIR, 2006). Non-textual features, where
available, have been shown to increase the accuracy of ranking
posts. However, in many cases these features are not as effective as
they could be. Therefore, I propose a new non-textual feature that
provides a clear ranking of users and that can be used in the process
of ranking search results. The idea behind ranking users is to adapt
the HITS algorithm. HITS was developed to predict the importance of
web pages by assigning each page a hub and an authority value: a page
is considered a good hub if it links to authoritative pages, and
authoritative pages are linked to by good hubs. This idea has an
intuitive parallel for user-generated content portals. Specifically,
we can consider question authors as hubs and answer authors as
authorities (if posts can be divided into question and answer groups),
or initial authors as hubs and later editors as authorities, and so
on. To compute hub/authority values for users, we start by building a
graph that represents the relations between them: edges in the graph
connect hub users with their respective authority users. Once the
graph has been prepared, we can run the HITS algorithm. Specifically,
the algorithm calculates the hub and authority value of every user
using Kleinberg's HITS formulation:

H(i) = sum of A(j) over all users j that user i links to, and
A(j) = sum of H(i) over all users i that link to user j,

where H(i) is the hub value of each user i and A(j) is the authority
value of each user j. The vectors H and A are initialized to all 0s
and 1s respectively, and are updated iteratively using the equations
above. After each iteration, the values in the H and A vectors are
normalized so that the highest hub value and the highest authority
value are 1. The algorithm iterates until convergence, defined as the
total sum of changes in the hub and authority values falling below
some small ε (for instance 0.001).
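
To make the iteration concrete, here is a minimal sketch in Python
(the proposal does not prescribe an implementation language); the
edge-list input format is an assumption made purely for illustration:

    def hits(edges, eps=0.001, max_iter=100):
        """Iterate hub/authority scores over (hub_user, authority_user) edges."""
        if not edges:
            return {}, {}
        hubs = {u for u, _ in edges}
        auths = {v for _, v in edges}
        H = {u: 0.0 for u in hubs}   # hub values, initialized to 0
        A = {v: 1.0 for v in auths}  # authority values, initialized to 1

        for _ in range(max_iter):
            # H(i) = sum of A(j) over users j that user i links to
            new_H = {u: 0.0 for u in hubs}
            for u, v in edges:
                new_H[u] += A[v]
            # A(j) = sum of H(i) over users i that link to user j
            new_A = {v: 0.0 for v in auths}
            for u, v in edges:
                new_A[v] += new_H[u]
            # normalize so the highest hub and authority values are 1
            max_h = max(new_H.values()) or 1.0
            max_a = max(new_A.values()) or 1.0
            new_H = {u: h / max_h for u, h in new_H.items()}
            new_A = {v: a / max_a for v, a in new_A.items()}
            # stop once the total change falls below epsilon
            change = (sum(abs(new_H[u] - H[u]) for u in hubs) +
                      sum(abs(new_A[v] - A[v]) for v in auths))
            H, A = new_H, new_A
            if change < eps:
                break
        return H, A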


The algorithm needs sufficient data to produce the graph discussed
above. Users in the graph are classified into two groups (the first
consists of users with a hub value and the second of users with an
authority value). Of course, one user can be in both groups, since a
user can have both a hub and an authority value. Edges in the graph
always connect a user from one group with a user from the other
group. Looking at how PlanetMath is organized, one could use the
following criteria for deciding on a user's group:

1) if a user started some topic (has written some article), he is
placed in the hubs group

2) if a user posted any correction to an article and this correction
was accepted, he is placed in the authority group

Edges in the graph would connect users who started some topic with
users who made corrections to it. The meaning of the hub/authority
values for such a setup is as follows. If a user makes many
corrections and his corrections are accepted, it means his knowledge
is 'good'; therefore, his authority value will be high. On the other
hand, if someone writes an article and is corrected by many users with
high authority, he gets a high hub value, since he gathers good users
in one place. Of course, the idea above is not the only possible
one. It even makes perfect sense to reverse the mapping completely:
one could assume that authority values are assigned to people writing
articles and hub values are assigned to people writing
corrections. The right approach needs to be chosen experimentally.
What is important is that, for the setup above, we don't actually need
more information than is currently available in PlanetMath. However,
by using the algorithm, we gain an interesting feature that might be
used to improve the retrieval mechanism.
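
For illustration only, here is a minimal Python sketch of how such a
hub-to-authority edge list might be assembled from article and
correction records; the record layout used here (article id, author,
correcting user, accepted flag) is an assumption, not the actual
Noosphere schema:

    def build_user_graph(articles, corrections):
        """Build (author, corrector) edges from article/correction records.

        `articles` maps an article id to its author's username;
        `corrections` is a list of (article_id, corrector, accepted)
        tuples. Both layouts are hypothetical.
        """
        edges = set()
        for article_id, corrector, accepted in corrections:
            if not accepted:
                continue              # only accepted corrections count
            author = articles.get(article_id)
            if author is None or author == corrector:
                continue              # skip unknown or self-corrections
            # the article's author acts as a hub, the accepted
            # corrector as an authority
            edges.add((author, corrector))
        return list(edges)

The resulting edge list can be fed directly to the hits() sketch
shown earlier.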

There may be much more potential in the algorithm than the ranking of
users discussed here. After implementing this solution, one should
definitely take a closer look at the results. My scope so far has just
been ranking, and your suggestion about communities is quite
interesting. Perhaps one really could use the graph to present a given
user with documents that are interesting from his/her point of view.

Project plan:

1 week - develop a concrete idea for building the graph that
describes relations between users and is applicable to PlanetMath

1 week - get familiar with Noosphere

4 weeks - initial implementation and evaluation of the solution

3 weeks - implementation of extensions of initial solution

2 weeks - testing and fine tuning

Link to Further Information:
http://planetx.cc.vt.edu/AsteroidMeta/Pawel



Title/Summary MUSN - a Multi-User Semantic Network
Student Thomas A Lenius
Student Email tle...@gmail.com

Student Major Anthropology
Student Degree undergrad
Student Graduation 2007

Organization PlanetMath
Assigned Mentor Joseph Corneli
Score 6

Abstract

The MUSN project will develop a semantic network accessible to
multiple users at a time. The network also will integrate machine
learning capabilities. The aim is to relate all types of information
in a network model made up of nodes, where each node is one datum of
arbitrary type/size, and links, where each link describes a direct
relationship between two nodes. The machine learning component will
rely on a question/answer paradigm in which interactive processes
traverse the (constantly built) network and prompt users about how to
build new links. New data will be added to the network according to
statistical properties of the extant network. For example, an article
may automatically link to several other articles based on the presence
of keywords, but only if the words are deemed significant according to
some section of MUSN code. Ideally, MUSN would develop its own linking
methods; in other words, not only would MUSN learn about and utilize
new links, it would learn new ways of discovering them. Automated
learning of linking methods is probably beyond the scope of the
project, however, and in light of this constraint "programmer
intervention" will be necessary to facilitate the network code's use
of new algorithms. Finally, immediate tasks involve
organizing data storage, providing multi-user access, developing
prompt and IO sequences, and exposing methods for adding and using new
MUSN code. The intermediate result will be a "scholia-based content
management system with web front-end." This intermediate result
represents the end state for the summer of code, although it is only
an initial step toward creating a more general system.

More succinctly, MUSN will feature:

- Multi-user access
- Algorithms organized around building links between data

- An extensible code base hosted in-network

- A new alternative for hosting and interfacing with PlanetMath

Some work on the project is available in the ./contrib/ directory of
the Monster Mountain distribution at http://code.google.com/p/mmtn/

Detailed Description

MUSN - a Multi-User Semantic Network

The MUSN project encapsulates a number of goals. Beyond the scope of
the Summer of Code, MUSN ideally will serve as an application
generator capable of offering new services as a result of user
interaction. The ability to generate applications implies a learning
component and indeed a variety of algorithms ultimately will allow
MUSN to "learn" how to handle information in new ways. For example,
support vector machines can be trained on user interactions to
organize nodes by newly (user-)defined semantic relationships.

As a result of its application generation capabilities, MUSN often
will "look like" an application server. That is, users will interact
with MUSN to accomplish tasks, such as collaborating to publish
documents. In this capacity, MUSN will offer compartmentalized views
of the semantic network, each view being defined with particular goals
in mind. This goal is more readily achievable during the Summer of
Code.

More concretely, the MUSN project will develop an application suitable
for hosting the PlanetMath website (or any similar
knowledge-base). The application will be based on the concept of
scholia, organized in a semantic network, and accessible through a
web-based interface.

On Scholia

Strictly speaking, scholia are annotations. There can be different
kinds of scholia, such as scholia intended to be read like dictionary
entries or scholia capable of being executed by a machine. Adapting
these ideas to semantic networks, it is possible to conceive of nodes
forming a network of "annotation of" and "type of"
relationships. Applications can use "type of" relationships to
determine how to represent scholia, while "annotation of"
relationships help applications resolve the data that make up a paper,
article or other relevant user-visible unit of information. Scholia
are particularly useful for representing information important to
specialized knowledge communities, since the model can be extended to
support many different representations (such as field-specific
notation) and kinds of interactions (such as user-generated comments).
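
As a purely illustrative aside (not part of the proposal itself), the
node and link model described above could be represented along these
lines in Python; the field and relation names are invented for the
example:

    from dataclasses import dataclass

    @dataclass
    class Node:
        """One datum of arbitrary type/size, e.g. an article or a comment."""
        node_id: int
        content: object

    @dataclass
    class Link:
        """A directed relationship between two nodes."""
        source: int    # the scholium
        target: int    # the node it annotates or types
        relation: str  # "annotation-of" or "type-of"

    def annotations_of(links, target_id):
        """Collect the scholia attached to one node, so an application
        can assemble a user-visible article from its annotations."""
        return [l.source for l in links
                if l.target == target_id and l.relation == "annotation-of"]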

Implementation

The project will implement the semantic network in Common Lisp and a
database management system. The database will simplify data storage
primarily by providing transactions. Not much more needs to be said
about data storage since most of the work will be handled by
Lisp. Lisp has been chosen for its support for macros, which greatly
enhance Lisp's utility. Support for closures and anonymous functions
also contributes to Lisp's flexibility, which is a major concern in
creating a semantic network with machine learning capabilities. For
example, the network may encounter a situation involving a question
about the location of France. In this situation, the system should be
able to discover how to describe France's location. Subsequent to
learning how to accomplish the task for a specific location, the
system will seek to generalize the problem such that it can develop a
function generator that takes a location argument and returns a
function for finding the location named by the argument. The new
function can then be attached to the node describing the location as a
sort of cached property of the node. This has two benefits: routines
for generating specific functions are never evaluated all at once for
every node identified as "locatable," and the function generator is
called at most once for each locatable node. The ability to bind nodes
to location
functions rather than location function results is desirable in a
moving frame of reference, for example, where each query requires a
new calculation. This ability to generate and save functions is rather
more difficult to accomplish in many other programming languages. In
Java, both typing and object management concerns necessitate a number
of additional design choices. Finally, and related to the previous
point, it is relatively easy to write and test components on a
per-function basis, something that again is somewhat more difficult to
do in many other programming languages, if for no other reason than
the fact that many languages feature clearer distinctions between
steps in the write-compile-run process.
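
To illustrate the cached-function idea in a language-neutral way, here
is a small Python sketch that uses closures where the proposal would
use Lisp closures; the node structure, the gazetteer lookup, and the
France example data are all invented for illustration:

    def make_location_finder(place_name):
        """Hypothetical function generator: given a place name, return a
        function that looks that place up each time it is called."""
        def find_location(gazetteer):
            # recomputed on every call, so a moving frame of
            # reference would be handled naturally
            return gazetteer.get(place_name)
        return find_location

    class Node:
        def __init__(self, name):
            self.name = name
            self._cached = {}    # generated functions cached per node

        def locator(self):
            # the generator runs at most once per "locatable" node
            if "locator" not in self._cached:
                self._cached["locator"] = make_location_finder(self.name)
            return self._cached["locator"]

    # usage: the node stores the function, not a fixed result
    france = Node("France")
    print(france.locator()({"France": (46.2, 2.2)}))   # -> (46.2, 2.2)

The point is that the node carries a function it can re-run, rather
than a value computed once.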

Project Plan

The project plan is broken up into fractions rather than periods of
time.

Basic system programming (1/3)

- Establish data storage and retrieval routines

- Add user access controls, possibly including a separate table of
authorized users

- Develop the logic for link management operations

- Develop interactive prompt from which the system will read requests,
add new system code, and display results

Web programming I (1/3)

- Establish low level links between the network and web server
processes

- Develop security measures for the web environment, e.g., IO cleaning

Web programming II (1/3)

- Add forms for user input

- Organize the web-based interface around document level objects

Testing and the development of a test suite will occur at each stage.

Bio/Qualifications

I have worked for the past 2 years as a research assistant on a large
census project at a major research university. My responsibilities
include assigning occupation codes to census data and developing ways
to automate and verify code assignment. I am also familiar with a
project aimed at linking individuals across census years. The record
linking project makes use of support vector machines; it is through
the linking project that I am becoming familiar with machine
learning. Previously, I worked for almost 2 years as an instructor for
a small organization in Canada, where I taught programming and web
development. I also develop and maintain savegc.org, a site dedicated
to social justice activities in the Minneapolis area.

Some work on the project is available in the ./contrib/ directory of
the Monster Mountain distribution at http://code.google.com/p/mmtn/


Title/Summary Modularizing the Classification and Document Handling of NNexus

Student James Johnson Gardner
Student Email peb...@gmail.com

Student Major Computer Science
Student Degree phd
Student Graduation 2010

Student Home Page http://mathcs.emory.edu/~jgardn3

Organization PlanetMath

Assigned Mentor Aaron Krowne

Score 4

Abstract

Knowledge bases can be viewed as semantic networks. For a user,
learning about a particular topic in most cases requires reading other
related articles. The PlanetMath.org auto-linking system, NNexus,
tries to help users find related information. The auto-linking system
performs the task of linking articles so that authors do not have to
search for related articles and manually link to them. The NNexus
system is a one-of-a-kind process for a dynamic corpus. I propose to
abstractify the handling of classification schemes in order to
facilitate better link steering between documents in multiple corpora,
and to modularize the document handling component of NNexus to allow
for linking document types other than LaTeX.

Detailed Description

Modularizing the Classification and Document Handling of NNexus
James Gardner
peb...@gmail.com

Proposal

Knowledge bases can be viewed as semantic networks. For a user,
learning about a particular topic in most cases requires reading other
related articles. The PlanetMath.org auto-linking system, NNexus,
tries to help users find related information. The auto-linking system
performs the task of linking articles so that authors do not have to
search for related articles and manually link to them. The NNexus
system is a one-of-a-kind process for a dynamic corpus. I propose to
abstractify the handling of classification schemes in order to
facilitate better link steering between documents in multiple corpora,
and to modularize the document handling component of NNexus to allow
for linking document types other than LaTeX. Details of these
enhancements are outlined below.

Abstractifying the handling of classification schemes in NNexus is
necessary. During Summer of Code 2006 I pulled all of the linking code
out of the Noosphere code base and developed the NNexus system as a
standalone linking server. Once NNexus was a standalone system, I was
able to import content from both MathWorld.com and PlanetMath.org. It
is now possible to automatically link documents to both of these sites
(even at the same time) based on user preferences. It is great to be
able to link to multiple sources, but NNexus uses the classification
of objects to determine where to link. MathWorld.com uses the MSC
classification scheme that PlanetMath.org uses for many of its
articles, but not all. In order to automatically link documents among
various corpora, NNexus needs new classification-based steering
functionality. I propose to develop a new classification handling
module that can determine the distance among categories in multiple
classification schemes. This new module will likely use a weighted
directed graph model to determine the distance between categories.
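
As one possible reading of the weighted-graph idea (not the module's
actual design), here is a minimal Python sketch of a shortest-path
distance between categories; the category identifiers, edge weights,
and cross-scheme bridging edges are illustrative assumptions:

    import heapq

    def category_distance(graph, source, target):
        """Shortest weighted path between two category identifiers.

        `graph` maps a category (e.g. an MSC code such as "05C") to a
        dict of neighboring categories and edge weights; edges could
        also bridge categories from different classification schemes.
        """
        dist = {source: 0.0}
        queue = [(0.0, source)]
        while queue:
            d, cat = heapq.heappop(queue)
            if cat == target:
                return d
            if d > dist.get(cat, float("inf")):
                continue              # stale queue entry
            for nbr, w in graph.get(cat, {}).items():
                nd = d + w
                if nd < dist.get(nbr, float("inf")):
                    dist[nbr] = nd
                    heapq.heappush(queue, (nd, nbr))
        return float("inf")           # categories are not connected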


One goal of the NNexus project is that NNexus be developed to a point
where multiple sites and individuals will utilize the NNexus system to
link their own documents to sources that they deem relevant. I have
worked with one professor on linking his class notes to Planetmath.org
and MathWorld.com. The NNexus system can successfully do this, which
supports the idea that NNexus can be used in contexts other than
PlanetMath.org. The class notes are written in LaTeX and give no
trouble to the NNexus system. If more individuals outside the realm
of mathematics are to use NNexus, however, it needs to support
document types other than LaTeX. I propose to separate the linking
code from the LaTeX handling code in the cross-referencing module of
NNexus. A clean separation will allow for multiple document types to
be linked. One possibility for handling multiple document types in
NNexus is to define an NNexus XML document schema and have NNexus link
this type of document. We could then provide converters from LaTeX,
HTML, etc. to this XML format and vice versa.
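
To make the XML idea concrete, here is a purely hypothetical sketch of
what such a document and a reader for it might look like; the element
names and schema are invented for this example and are not an existing
NNexus format:

    import xml.etree.ElementTree as ET

    SAMPLE = """<nnexus-document>
      <metadata>
        <title>Pythagorean theorem</title>
        <classification scheme="msc">51M04</classification>
      </metadata>
      <body>In a right triangle the square of the hypotenuse equals
      the sum of the squares of the other two sides.</body>
    </nnexus-document>"""

    def extract_for_linking(xml_text):
        """Pull out the fields a linker would need, independent of
        whether the source was LaTeX, HTML, or something else."""
        root = ET.fromstring(xml_text)
        return {
            "title": root.findtext("metadata/title"),
            "classification": root.findtext("metadata/classification"),
            "body": (root.findtext("body") or "").strip(),
        }

    print(extract_for_linking(SAMPLE)["title"])   # -> Pythagorean theorem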

Timeline of Development

2 weeks - Develop model and plan for abstractifying the classification handling.
4 weeks - Implement the new classification module.
1 week - Plan the implementation of abstract document handling.
2 weeks - Separate the document handling code from the linking module.
3 weeks - Implement the abstract document handling module.
1 week - Perform thorough testing.

Qualifications

My research interests include Graph Theory, Optimization, Information
Retrieval, and the intersection of these topics.

Education

Emory University, Atlanta, GA, USA
PhD Candidate in Computer Science
Concentration in the intersection of Graph Theory, Optimization, Information
Retrieval, and Data Mining.

I am currently researching parallel frequent itemset algorithms,
protein-protein interaction prediction, and data models. My advisor
is Dr. Li Xiong.

I am proficient in Perl, C, C++, and Java. I also have experience with
C#, x86 Assembly, Visual Basic, and Python, and I have worked with
Linux, Solaris, Mac OS X, and all of the Microsoft operating systems.

My full resume can be found at http://mathcs.emory.edu/~jgardn3/jamescv.pdf

Link to Further Information:
http://mathcs.emory.edu/~jgardn3/

Aaron Krowne

Jun 14, 2007, 4:50:27 PM
to plane...@googlegroups.com
Thanks for posting this.

The work for James and Pawel is proceeding nicely.

Note that Pawel has posted some information and questions regarding surveying aspects of entry quality for testing the reputation system:

 http://planetmath.org/?op=getmsg&id=15556
