Document assignments for paragraphCorpus


Sebastian Arnold

May 6, 2019, 9:52:00 AM
to TREC Car
Hi Laura and CAReers,

we are working on a use case for passage retrieval which requires the passages to appear in their natural context – larger documents. As far as I understand it, the CAR paragraph collection consists of independent paragraphs without any document assignments, right? That makes it impossible to track long-range dependencies over the course of a full document.

Is it somehow possible to merge the given paragraphs back into their original structure? I don't need any URLs, entity labels, or heading labels, just the original assignments and ordering of the passages, e.g.

[ { "doc0" : [ "p0_id", "p1_id", "p2_id", "p3_id" ] }, { "doc1" : [ "p4_id", "p5_id", "p6_id" ] }, ... ]

And would that already be considered "cheating" on the original task?

Best,
Sebastian

Laura Dietz

May 6, 2019, 10:21:47 AM
to trec...@googlegroups.com
Hi Sebastian,

Have a look at the allButBenchmark archive from the data release (v2.1) [1] --- formerly called "halfwiki" (I apologize for the terrible name).


unprocessedAllButBenchmark.v2.1.tar.xz Like unprocessedTrain, but contains nearly everything of Wikipedia, only omitting pages in the benchmarkY2, benchmarkY1, and test200 benchmarks. The splits offered are consistent with the “train” file below.


The TREC CAR Python/Java bindings give you full access to the Wikipedia articles -- see the documentation [2].

"all" lists  the full wikipedia articles in the last preprocessing stage BEFORE we split outlines (used as queries) and paragraphs (corpus), and derive the ground truth (qrels). So you can see exactly where which paragraph was on the page -- not just the title and heading, but also the order of paragraphs.

You can get from the Page to the page skeleton (the hierarchy of headings and the paragraphs they contain). Each paragraph has a unique ID derived from its text content. These are the same IDs you will find in the paragraphCorpus.
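
For example, with the Python bindings (a quick, untested sketch; the exact file name inside the paragraphCorpus archive is an assumption -- use whatever your copy unpacks to):

     # Read the paragraphCorpus and print each paragraph's ID and a text snippet.
     from trec_car.read_data import iter_paragraphs

     with open('paragraphCorpus/dedup.articles-paragraphs.cbor', 'rb') as f:
         for para in iter_paragraphs(f):
             print(para.para_id, para.get_text()[:60])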

Articles, outlines, and paragraphs are all described in CBOR following this grammar. Wikipedia-internal hyperlinks are preserved through ParaLinks.

     Page         -> $pageName $pageId [PageSkeleton] PageType PageMetadata
     PageType     -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
     PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors
     RedirectNames       -> [$pageName] 
     DisambiguationNames -> [$pageName] 
     DisambiguationIds   -> [$pageId] 
     CategoryNames       -> [$pageName] 
     CategoryIds         -> [$pageId] 
     InlinkIds           -> [$pageId] 
     InlinkAnchors       -> [$anchorText] 
     
     PageSkeleton -> Section | Para | Image | ListItem
     Section      -> $sectionHeading [PageSkeleton]
     Para         -> Paragraph
     Paragraph    -> $paragraphId, [ParaBody]
     ListItem     -> $nestingLevel, Paragraph
     Image        -> $imageURL [PageSkeleton]
     ParaBody     -> ParaText | ParaLink
     ParaText     -> $text
     ParaLink     -> $targetPage $targetPageId $linkSection $anchorText
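
For instance, here is a rough sketch (untested; the CBOR file name is an assumption, and the class/attribute names are those of the Python bindings as documented in [2]) that walks each page's skeleton and recovers exactly the document-to-paragraph assignments Sebastian asked for:

     # Collect paragraph IDs in reading order, recursing into sections.
     from trec_car.read_data import iter_pages, Section, Para, List

     def paragraph_ids(skeleton):
         ids = []
         for node in skeleton:
             if isinstance(node, Para):
                 ids.append(node.paragraph.para_id)
             elif isinstance(node, List):          # ListItem in the grammar above
                 ids.append(node.body.para_id)
             elif isinstance(node, Section):
                 ids.extend(paragraph_ids(node.children))
             # Image nodes are skipped here for simplicity.
         return ids

     doc_assignments = {}
     with open('unprocessedAllButBenchmark.cbor', 'rb') as f:  # assumed file name
         for page in iter_pages(f):
             doc_assignments[page.page_id] = paragraph_ids(page.skeleton)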

Thus, you don't have to use the paragraphCorpus and qrels if you prefer to derive the training signal directly yourself. We made some concrete choices about how to clean the data; these are documented on the web page [3] and in the overview report [4].

Keep in mind that there is a high degree of redundancy on Wikipedia (due to copy and pasting). In an earlier stage of the pipeline, we aligned near-duplicate paragraphs (using GloVe-based LSH refined with bigram-overlap metrics) and replaced all near duplicates with ONE representative paragraph (and therefore one paragraph ID). You will therefore find the same paragraph, with the same content and the same paragraph ID, in multiple articles.
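
Building on the doc_assignments dict from the sketch above, you can observe this redundancy directly by inverting the mapping:

     # Invert doc -> [paragraph IDs] into paragraph ID -> {page IDs} to find
     # paragraphs that (post-deduplication) appear on more than one page.
     from collections import defaultdict

     pages_of_para = defaultdict(set)
     for page_id, para_ids in doc_assignments.items():
         for pid in para_ids:
             pages_of_para[pid].add(page_id)

     shared = {pid for pid, pages in pages_of_para.items() if len(pages) > 1}
     print(len(shared), 'paragraph IDs appear on more than one page')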

Sebastian, did this answer your question?

Best,
Laura




Laura Dietz

May 6, 2019, 10:40:07 AM
to trec...@googlegroups.com
I forgot to mention: for the past benchmarks (benchmarkY1train/Y1test/Y2test) we also provide the original pages --- so while they are missing from allButBenchmark, you can take them and put everything back together to get the full Wikipedia if you insist on doing so.

Laura