How to find original contents in MAG for ogbn-arxiv dataset?


黄蔚然

Sep 15, 2020, 12:12:30 PM
to Open Graph Benchmark
I just found this Google group. I had just sent an email to the OGB team, which I now think is worth sharing here.

TL;DR: I want the raw titles and abstracts instead of the preprocessed feature vectors, but I find the MAG documentation confusing, so I need help. Below is the email:

Dear OGB team,

I am working with the OGB datasets, and I really appreciate your great work. Currently I am focusing on the ogbn-arxiv dataset.

I noticed that you obtain the feature vectors by simply averaging the embeddings of the words in each title and abstract, and that the embeddings are generated by skip-gram. This is a simple way to get features, but it could be inaccurate, and the MLP results bear this out.
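
To make the concern concrete, here is a minimal sketch of what averaging word embeddings does. The vocabulary, the 3-dimensional vectors, and the function name `average_embedding` are made up for illustration; this is not the actual OGB preprocessing. Note that word order is discarded, so two texts containing the same words get the same feature vector.

```python
# Toy illustration only: the vocabulary and 3-dimensional vectors below are
# made up; they are not the skip-gram embeddings used for ogbn-arxiv.
toy_embeddings = {
    "graph": [1.0, 0.0, 2.0],
    "neural": [0.0, 1.0, 0.0],
    "network": [2.0, 2.0, 1.0],
}


def average_embedding(text, embeddings, dim=3):
    """Average the vectors of all known words in `text`.

    Words missing from `embeddings` are skipped (an assumption of this
    sketch). Returns a zero vector if no word is known.
    """
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


feature = average_embedding("Graph neural network", toy_embeddings)
reordered = average_embedding("network neural graph", toy_embeddings)
print(feature)
print(feature == reordered)  # word order does not matter
```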

I am trying to figure out how a node's neighbors and the node itself affect classification, so I wonder what performance a model can reach when it uses only the information of each node (i.e., no graph information). I am thinking of feeding the titles and abstracts to BERT, classifying the papers, and seeing what happens.

Thus, I need the title and abstract of each paper in the ogbn-arxiv dataset. After checking the MAG website, I found them hard to fetch, so I am writing to ask whether you can help me with this. Any advice would also be highly appreciated.

Thanks in advance.



Open Graph Benchmark

Sep 15, 2020, 6:58:41 PM
to Open Graph Benchmark
Hi,

Here is what you can do:
2. Map the paper IDs in ogbn-arxiv to those in MAG to get the paper titles and abstracts.
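
The mapping step could be sketched roughly as follows. This is only an illustration under stated assumptions: the in-memory strings stand in for the ogbn-arxiv mapping file and for a MAG extract, and the column names (`node idx`, `paper id`) and the three-column MAG layout are assumptions of this sketch, not a confirmed schema.

```python
import csv
import io

# Hypothetical stand-in for the ogbn-arxiv node-index-to-paper-ID mapping
# file. The column names are assumptions of this sketch.
mapping_csv = io.StringIO(
    "node idx,paper id\n"
    "0,1001\n"
    "1,1002\n"
    "2,1003\n"
)

# Hypothetical stand-in for a MAG extract: tab-separated, no header,
# columns assumed to be (paper ID, title, abstract).
mag_tsv = io.StringIO(
    "1001\tTitle A\tAbstract A\n"
    "1003\tTitle C\tAbstract C\n"
    "9999\tUnrelated paper\tNot in ogbn-arxiv\n"
)

node_to_paper = {
    int(row["node idx"]): row["paper id"] for row in csv.DictReader(mapping_csv)
}
wanted = set(node_to_paper.values())

# Stream the (potentially very large) MAG file once, keeping only wanted rows.
paper_text = {}
reader = csv.reader(mag_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
for paper_id, title, abstract in reader:
    if paper_id in wanted:
        paper_text[paper_id] = (title, abstract)

# Node index -> (title, abstract), or None if the paper ID was not found.
node_text = {idx: paper_text.get(pid) for idx, pid in node_to_paper.items()}
print(node_text)
```

Streaming the large file and filtering against a small set of wanted IDs keeps memory use proportional to the number of ogbn-arxiv papers rather than to the size of MAG.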

Best,
OGB Team

On Tuesday, September 15, 2020 at 9:12:30 UTC-7, 黄蔚然 wrote:

黄蔚然

Sep 16, 2020, 3:16:41 AM
to Open Graph Benchmark
Thanks for your reply. I am wondering, though, whether your team has the raw data and could share it with us? It seems the Azure platform limits the number of transactions per day.

Weihua Hu

Sep 16, 2020, 3:30:54 AM
to Open Graph Benchmark
It would be hard for us to prepare the additional data promptly, as one of our members is currently on leave. That's why we suggest that you fetch the necessary information on your side. Thanks for your understanding.

Best,
Weihua

On Wednesday, September 16, 2020 at 0:16:41 UTC-7, 黄蔚然 wrote:

黄蔚然

Sep 17, 2020, 6:59:53 AM
to Open Graph Benchmark
Dear Weihua,

I tried Azure, and it really takes some effort to extract the info we need. I could sign up for MAG for free, but it seems that manipulating the dataset directly on Azure (since downloading ~200 GB of data locally is hard) requires payment. Azure also limits the number of vCPUs, so I would need to contact the resource manager to increase my quota, etc.

I expect that a text file of all titles and abstracts was an intermediate product when you built the dataset. It should also be rather small: with only ~170K papers, the file should be ~100 MB. It is likely that your team still has it, so I wonder whether it would be possible to ask that member for it? It would be a huge help to us and would greatly lighten our workload.

Thanks,
Weiran

Uri Alon

Sep 17, 2020, 3:45:13 PM
to Open Graph Benchmark
I agree. Scraping the data ourselves would not allow an exact reproduction, because of different filters, tokenization, etc.
It would help me a lot as well if the OGB team could release this data. But I totally understand if this data is no longer available.

Thanks a lot,
Uri

黄蔚然

Sep 18, 2020, 7:04:51 AM
to Open Graph Benchmark
Thanks, Uri. Glad to know you share a similar concern. How did you solve your problem in the end?
Actually, I ran into exactly the problems you mentioned.
Somehow the newest MAG dataset (2020.09.11) is not consistent with ogbn-arxiv. While extracting, I found that some PaperIds in ogbn-arxiv are missing from MAG (as many as 1657 papers).
I am not sure what happened; maybe their PaperIds were changed. In any case, I am stuck. :(

黄蔚然

Sep 18, 2020, 7:11:47 AM
to Open Graph Benchmark
Hi Weihua,

I ran into a problem. I used the mapping file to map node idx to paperId, but 1657 papers' IDs cannot be found in MAG-2020-09-11. I am not sure whether their paperIds were changed or whether they were removed from the dataset. Could you please help check this? I can provide the info for the 1657 nodes.
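
For reference, a check of this kind can be sketched as below. The ID values are made up, and the variable names are hypothetical; in practice the MAG IDs would be streamed from the snapshot rather than held in a list.

```python
# Hypothetical inputs: paper IDs taken from the ogbn-arxiv mapping file, and
# the paper IDs present in a given MAG snapshot. The values are made up.
ogbn_paper_ids = ["1001", "1002", "1003", "1004"]
mag_paper_ids = iter(["1001", "1003", "9999"])  # e.g. streamed from a file

# Start with every ogbn-arxiv ID marked as missing, then cross off each one
# that shows up in MAG.
remaining = set(ogbn_paper_ids)
for paper_id in mag_paper_ids:
    remaining.discard(paper_id)

missing = sorted(remaining, key=int)
print(len(missing), "paper IDs not found:", missing)
```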

On Wednesday, September 16, 2020 at 3:30:54 PM UTC+8, <weih...@gmail.com> wrote:

Yuxiao Dong

Oct 29, 2020, 2:11:59 AM
to Open Graph Benchmark
Hi Weiran, 

Could you share the 1657 paper ids that were missing from your side? Thank you very much! 

Best,
Yuxiao

Open Graph Benchmark

Nov 3, 2020, 6:56:19 PM
to Open Graph Benchmark
Here is where the raw texts of titles and abstracts can be downloaded.

Hope this helps,
OGB Team

On Wednesday, October 28, 2020 at 23:11:59 UTC-7, Yuxiao Dong wrote: