Discussing Eric Vin's PR and storage


Luca de Alfaro

Mar 11, 2020, 9:20:24 PM
to WikiTrust Development
I am posting this here as it can be of common interest:

Eric Vin did a very nice PR for algorithms for WikiTrust. I would like to have a meeting to discuss the PR and also the storage structure. In fact, now that Eric and I have met in my office to discuss the general lines, perhaps the best would be to work together on the storage structure first, then on a general plan for the algorithms, and then see how to adapt Eric's excellent code.

Eric, would you like to do this? When would be a convenient time for you in the next few days?
Would others like to be invited to the meeting to keep abreast of the design of WikiTrust? 

Many thanks,

Luca

Eric Vin

Mar 11, 2020, 10:18:04 PM
to WikiTrust Development
Hi Professor De Alfaro,

I'm free Friday afternoon after 2:30, if that works for you. Scheduling is a little rough at the moment with all the disruptions and finals next week, but I'm sure we can find a time that works for everyone.

As far as storage structure goes, I've been working with Luke, who has just finished implementing some databases that I plan to adapt the algorithms to, since I know fairly little about working with databases. Hopefully Luke can attend the meeting to discuss this. We've thought out a way to cache some of the deltas so we can recompute author reputation with additional/different pages much more quickly, though I'll need to do some testing and benchmarking to see if it really provides a significant speedup or if we'll need to explore other options. We can discuss this in depth at the meeting, as I'm sure you could provide some valuable input.
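
To make the delta-caching idea concrete, here is a minimal sketch. It assumes (and this is only an assumption, not something taken from the PR) that each page contributes an additive per-author reputation delta, so that reputation over any set of pages is the sum of the cached deltas. The function name `author_reputation` and the cache layout are hypothetical.

```python
from collections import defaultdict

def author_reputation(cache, pages):
    """Sum the cached per-page deltas for the selected pages only.

    `cache` maps page title -> {author: reputation delta from that page}.
    Assumption: deltas are additive across pages, which may not hold for
    the real reputation algorithm.
    """
    totals = defaultdict(float)
    for page in pages:
        # Pages with no cached deltas simply contribute nothing.
        for author, delta in cache.get(page, {}).items():
            totals[author] += delta
    return dict(totals)

# Hypothetical cached deltas for two pages.
cache = {
    "Page A": {"alice": 1.5, "bob": -0.5},
    "Page B": {"alice": 0.5},
}
```

With this layout, adding or dropping a page only changes which cached entries are summed; no revisions need to be reprocessed. Whether that holds for the real algorithm is exactly what the benchmarking would need to show.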

Let me know what you'd like to do and I look forward to meeting with you!

-Eric

kworcest

Mar 13, 2020, 4:48:39 PM
to WikiTrust Development

Hi WikiTrust folks,

I unfortunately won't be available to meet at 2:30 today (though I can make it after 5:30 if that works for everyone).

Instead I thought I'd put down some architecture ideas for discussion. I'm curious to see what Luca and everyone thinks about them.

Here's my thinking:

On the server:
  1.  Every 24hrs our server would download the past day's revisions from the data dumps here (instead of using the API and putting more stress on Wikipedia's servers)
  2. Then, slowly (within 24hrs!) loop through the revisions and compute the changes to the trust scores for each updated Wikipedia page.
  3. We add the delta trust scores from today to the trust scores we computed the previous day (basically we update the trust scores on a rolling basis every 24hrs)
  4. We save the updated trust score for each page and each user in a static file.
  5. We can put these static files behind a free caching service (e.g., Cloudflare) with a 24hr expiration, so that end users rarely/never hit our server.
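
As a rough illustration of steps 3 and 4 above, here is a minimal sketch of the rolling update. It assumes, purely for illustration, that a trust score is a single number per page or user; the function names and the JSON layout are hypothetical.

```python
import json

def apply_daily_deltas(scores, deltas):
    """Return yesterday's scores plus today's deltas, without mutating the input."""
    updated = dict(scores)
    for key, delta in deltas.items():
        # Pages or users seen for the first time start from 0.0.
        updated[key] = updated.get(key, 0.0) + delta
    return updated

def to_static_file_body(scores, snapshot_date):
    """Serialize the scores to the JSON text that would be saved as a static file."""
    return json.dumps(
        {"snapshot_date": snapshot_date, "scores": scores},
        sort_keys=True,
    )

# Hypothetical example values.
yesterday = {"Page A": 0.5}
today_deltas = {"Page A": 0.25, "Page B": 1.0}
body = to_static_file_body(apply_daily_deltas(yesterday, today_deltas), "2020-03-13")
```

Storing the snapshot date inside the file lets a client tell how old the cached scores are.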

On a user's browser (When they visit a Wikipedia page and click on our extension/bookmarklet):
  1. The extension fetches the static, day-old article trust scores / author trust from our caching server.
  2. The extension also fetches any revisions to this article that have happened within the past 24hrs.
  3. The extension computes these very recent updates to the trust scores / author reputation in the user's browser, running Eric's Python algorithm via Transcrypt.
    • I've already had success transpiling the Python algorithms to JavaScript using Transcrypt (which should be just as fast as Python), which means this involves little-to-no changes to our Python code base.
  4. We display the very latest trust scores to the user.
This has a few advantages that I can think of:
  1. The server processing and client are completely de-coupled, which means we can process as slowly as we want using a very tiny / low cost server.
    • The server also doesn't have to handle requests from clients if we use static caching.
  2. Every user gets the latest trust scores without excessive processing on their machine (we only process a few revisions from the past 24hrs).
Some issues:
  1. We need a way to compute rolling or delta updates to trust scores for new revisions.
  2. We would need to also keep track of some past revisions in case someone vandalizes a page exactly at midnight.
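
One possible way to sidestep the midnight boundary, sketched below, is to record the last processed revision ID in the snapshot instead of relying on the calendar day, and to keep a few already-processed revisions as context. This is only an assumption about how it could work: the function name, the `revid` field, and the `context` parameter are hypothetical, and it relies on revision IDs increasing over time.

```python
def split_for_update(revisions, last_processed_id, context=2):
    """Split revisions into (context, pending).

    `pending` holds revisions newer than the snapshot, oldest first.
    `context` holds the last few already-processed revisions, kept so the
    algorithm can compare new text against recent history.
    """
    ordered = sorted(revisions, key=lambda r: r["revid"])
    old = [r for r in ordered if r["revid"] <= last_processed_id]
    new = [r for r in ordered if r["revid"] > last_processed_id]
    # Slice from an explicit index so that context=0 yields an empty list.
    return old[max(len(old) - context, 0):], new

# Hypothetical revision records; only the ID matters for this sketch.
revisions = [{"revid": i} for i in (5, 1, 4, 2, 3)]
kept, pending = split_for_update(revisions, last_processed_id=3, context=2)
```

Both the server and the browser extension could use the same split, so a revision is never counted twice or skipped regardless of when it was made.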

Luca de Alfaro

Mar 13, 2020, 4:57:47 PM
to WikiTrust Development
Dear All,

let's postpone to next week; I am unfortunately swamped with a paper deadline today and through the weekend.
But let's continue the discussion here.  I like the suggestions below, with some exceptions.

In particular, I don't think we should use the dumps indicated there.  If you look at the size, for enwiki, the latest is 700+ MB.  I am not sure what period they cover.
Rather, I would use the API to fetch changes only to the set of pages that we observe.  And I would fetch those pages as markup, not as HTML; we work on markup internally in the algorithms.
This will enable us to grow in a more gradual way, as we can initially start with small sets of pages under observation.
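
As a rough sketch of what fetching changes for one observed page could look like, the snippet below builds a MediaWiki Action API query for the revisions of a page since a given timestamp, with the content returned as markup. It only constructs the URL and makes no network call; the function name, the choice of `rvprop` fields, and the default limit are assumptions for illustration.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

def revisions_query_url(title, since_timestamp, limit=50):
    """Build a query URL for revisions of `title` made at or after `since_timestamp`.

    `since_timestamp` is an ISO 8601 string such as "2020-03-12T00:00:00Z".
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        # Ask for IDs, timestamps, authors, and the page markup itself.
        "rvprop": "ids|timestamp|user|content",
        "rvslots": "main",
        # Oldest first, starting from the given timestamp.
        "rvdir": "newer",
        "rvstart": since_timestamp,
        "rvlimit": limit,
    }
    return API_URL + "?" + urlencode(params)

url = revisions_query_url("Santa Cruz", "2020-03-12T00:00:00Z")
```

Looping this over a small watch list of page titles would match the gradual approach: the request volume grows only with the number of observed pages.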

I would like next week to work to define the storage structure; best days for me are Monday and Tuesday.  Any time that works for you?  If you are too busy with exams, we can also do the week after even though it's break. 

Luca

Golam Md Muktadir

Mar 13, 2020, 5:09:56 PM
to Luca de Alfaro, WikiTrust Development
I also recommend XML markup. It would be more efficient and consistent for parsing data and metadata. Though it looks like more work in the beginning, it will make our lives easier as we progress further.

Best,
Muktadir 

--
You received this message because you are subscribed to the Google Groups "WikiTrust Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to wikitrust-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/wikitrust-dev/169f2caf-49f6-4164-9f81-f7acbd5cfe7d%40googlegroups.com.
--
Golam Md Muktadir
SID 1707346
mukt...@ucsc.edu

Eric Vin

Mar 13, 2020, 8:37:58 PM
to WikiTrust Development
I am unfortunately going to be swamped next week with finals and travel, given the school closure. Can we postpone till the week after exams?

Aproop Kamat

Mar 14, 2020, 4:03:08 AM
to Eric Vin, WikiTrust Development
I am available on Monday, but I think a meeting during spring break would be a good option as well. 


kworcest

Mar 16, 2020, 1:33:43 PM
to WikiTrust Development
I can also meet anytime today before 5 or after 6:30 or on a later day.

Eric Vin

Mar 23, 2020, 8:31:48 PM
to WikiTrust Development
Now that it's spring break, I'm free all week. What times work for everyone else?

-Eric

Matthew Boisvert

Mar 23, 2020, 9:00:21 PM
to WikiTrust Development
This week I can meet Wednesday after 12PM or anytime Thursday or Friday. If it's convenient maybe we should use a when2meet poll to decide when we should meet for spring quarter?

Luca de Alfaro

Mar 23, 2020, 9:12:22 PM
to Matthew Boisvert, WikiTrust Development
Yes, can someone set a poll up?  I am very busy till Wednesday (paper deadline) but Thursday and Friday are open and I have also time to prepare for the meeting.

Luca



--
Luca de Alfaro
Professor, Computer Science and Engineering
University of California, Santa Cruz

Luca de Alfaro

Mar 23, 2020, 9:14:55 PM
to Matthew Boisvert, WikiTrust Development
Actually, here's a doodle poll.  https://doodle.com/poll/fu3ppyevqcungy3k

Let's use that to decide when to meet.

Luca