Common Mistakes queryId.split("/")

Laura Dietz

Jun 27, 2019, 3:05:42 PM
to trec...@googlegroups.com

Dear CAR participants,

I see many of you making the same mistake when getting the PageId from a Query Id.

This gives you incorrect results:

(page_id, section_path) = queryId.split("/")


The problem is that some Wikipedia page names contain slashes.

Example: "Fiat X1/9" (https://en.wikipedia.org/wiki/Fiat_X1/9)

When you split on slash, you will think that the page id is "Fiat X1" when it is actually "Fiat X1/9".
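
To see the failure concretely, here is a small sketch. The query id string below is made up for illustration (I am assuming an "enwiki:" prefix and a single heading called "Design"); check your own outlines for the real ids.

```python
# Hypothetical query id: page "Fiat X1/9", section "Design".
query_id = "enwiki:Fiat%20X1/9/Design"

# Splitting on every slash yields three parts, not two, so the
# two-variable unpacking from the snippet above would raise ValueError.
parts = query_id.split("/")

# Splitting only on the first slash does not help either:
# the page id is cut off at the slash inside the page name.
naive_page_id, naive_section_path = query_id.split("/", 1)

print(parts)               # ['enwiki:Fiat%20X1', '9', 'Design']
print(naive_page_id)       # enwiki:Fiat%20X1
print(naive_section_path)  # 9/Design
```

No choice of split position can be right in general, because the string alone does not tell you which slashes belong to the page name.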


A common symptom is that you cannot find the page in allButBench or benchmark*-pages.cbor. The page is there, but you looked it up under the wrong page id.

So, if that sounds familiar to you, please go back and make sure that you use the right approach.


The correct approach is to load the outlines.cbor file and read the PageId and the Heading Ids from it. If you need to, you can keep a mapping from each query id to them.
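
A minimal sketch of such a mapping follows. It does not read the cbor file itself: `build_query_map` is a hypothetical helper, and `sample_pages` stands in for whatever your outlines reader gives you (a page id plus the heading-id paths of its sections). I am also assuming here that a query id is the page id and the heading ids joined with "/"; verify that against your data.

```python
from typing import Dict, Iterable, List, Tuple


def build_query_map(
    pages: Iterable[Tuple[str, List[List[str]]]]
) -> Dict[str, Tuple[str, List[str]]]:
    """Map each query id to (page_id, heading_ids), built from outline data."""
    mapping: Dict[str, Tuple[str, List[str]]] = {}
    for page_id, heading_paths in pages:
        # Page-level query: no headings.
        mapping[page_id] = (page_id, [])
        for path in heading_paths:
            # Assumed id scheme: page id and heading ids joined with "/".
            query_id = "/".join([page_id] + path)
            mapping[query_id] = (page_id, path)
    return mapping


# Made-up stand-in for what you would read from outlines.cbor.
sample_pages = [
    ("enwiki:Fiat%20X1/9", [["Design"], ["Design", "Engine"]]),
]

query_map = build_query_map(sample_pages)
page_id, heading_ids = query_map["enwiki:Fiat%20X1/9/Design/Engine"]
print(page_id)      # enwiki:Fiat%20X1/9
print(heading_ids)  # ['Design', 'Engine']
```

The point is that the page id is looked up, never parsed back out of the query id, so slashes in page names cannot cause trouble.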


Best,
Laura

P.S. While we are on the subject, you won't be able to guess the query text from the query id in benchmarkY3.

