Some records have odd author names


xiaome...@gmail.com

Apr 22, 2019, 2:35:53 AM
to Open Academic Graph
Hello, I am a junior researcher working on big scholarly data.

I'm trying to load the data into neo4j.
My steps are:
   -1 Merge the author and paper data from MAG and Aminer.
   -2 Store all the data as entity CSV files, ready to be inserted into neo4j with the neo4j-import tool.
Because the neo4j-import command limits string length, some exceptions were thrown.
I then went back to check the data in the original JSON file and in my CSV file,
and found an author with an extremely long name.
The author id is '2492590802'.

I then filtered the data for abnormally long author names
and found the following authors whose names are clearly wrong (id, name length):
(2883534986,53789)
(2809509190,231359)
(2151762742,249334)
(2126799623,351576)
(2129913879,699812)
(2130138085,1291987)
(1903406832,1758860)
(2619296525,2496188)
(244247712,3728819)
(2492590802,5276865)
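
A check like the one described above could be sketched as follows. This is only a minimal sketch: it assumes the author file holds one JSON object per line with `id` and `name` fields, and the threshold `MAX_NAME_LEN` and the function name `find_long_names` are hypothetical choices, not part of the dataset or any tool.

```python
import json

# Hypothetical threshold: real author names should be far shorter than this.
MAX_NAME_LEN = 200

def find_long_names(lines, max_len=MAX_NAME_LEN):
    """Return (id, name length) pairs for records whose name exceeds max_len."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        name = record.get("name", "")
        if len(name) > max_len:
            hits.append((record.get("id"), len(name)))
    # Sort by name length, shortest first.
    return sorted(hits, key=lambda pair: pair[1])

# Made-up sample records for illustration only.
sample = [
    json.dumps({"id": "1", "name": "Jinho Kim"}),
    json.dumps({"id": "2", "name": "x" * 500}),
]
print(find_long_names(sample))
```

For the real file, the same function can be given an open file handle instead of the sample list, since it only iterates over lines.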

All of the abnormal author names look like this:
"name": "HocomAdvies #x F\t\t1\t0\t2016-08-23\r\n2492590803\t18993\tjinho kim\tJinho Kim\t188068037\t4\t31\t2016-08-23\r\n2492590804\t20879\tantonio carlos de andrade s..."
Each one is made up of many names combined with ids and dates.

Is something wrong? Or do these records represent a group of authors?


gdk...@gmail.com

May 13, 2019, 11:30:36 AM
to Open Academic Graph


On Monday, April 22, 2019 at 8:35:53 AM UTC+2, xiaome...@gmail.com wrote:
The abnormal author names look like this:
"name": "HocomAdvies #x F\t\t1\t0\t2016-08-23\r\n2492590803\t18993\tjinho kim\tJinho Kim\t188068037\t4\t31\t2016-08-23\r\n2492590804\t20879\tantonio carlos de andrade s..."
Each one is made up of many names combined with ids and dates.

It seems to me that the data is not being split at newlines: "\r\n" usually marks the end of a tuple, and you have several of them in one line.
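
To illustrate the point, here is a rough sketch that splits such a merged name back into rows on "\r\n" and then into fields on tabs. The function name `split_embedded_rows` is made up for this example, and the sample string is only the complete part of the value quoted above. What each column means is not stated in this thread, so the code treats the fields as plain positions.

```python
def split_embedded_rows(name):
    """Split a merged name on CRLF.

    Returns the leading fragment (the tail of the record the name belongs to)
    and a list of the embedded rows, each split into tab-separated fields.
    """
    head, *rest = name.split("\r\n")
    rows = [row.split("\t") for row in rest if row]
    return head, rows

# Complete portion of the value quoted in this thread.
merged = (
    "HocomAdvies #x F\t\t1\t0\t2016-08-23\r\n"
    "2492590803\t18993\tjinho kim\tJinho Kim\t188068037\t4\t31\t2016-08-23"
)

head, rows = split_embedded_rows(merged)
print(head.split("\t")[0])      # text before the first tab of the leading fragment
print(rows[0][0], rows[0][3])   # first and fourth field of the embedded row
```

Each embedded row starts with what looks like another author id, which is consistent with several tuples having been read as one.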

regards,
-1

Fanjin Zhang

May 14, 2019, 9:45:32 AM
to Open Academic Graph
Hi, there may be some encoding issues in how the MAG raw files were processed. As a temporary workaround, you can use ```name.split('\t')[0]``` (in Python) to obtain the author name.
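
As a rough sketch of how this workaround might be applied while writing the CSV, assuming each author is a dict with `id` and `name` keys (the helper `clean_name`, the sample records, and the `id,name` header are all hypothetical, not the format neo4j-import requires):

```python
import csv
import io

def clean_name(name):
    """Keep only the text before the first tab, as suggested above."""
    return name.split('\t')[0]

# Sample records for illustration; the second name is the beginning of the
# merged value quoted in this thread.
authors = [
    {"id": "2492590803", "name": "Jinho Kim"},
    {"id": "2492590802",
     "name": "HocomAdvies #x F\t\t1\t0\t2016-08-23\r\n2492590803\t18993\tjinho kim"},
]

# Write to an in-memory buffer here; a real run would open a file instead.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["id", "name"])  # hypothetical header
for author in authors:
    writer.writerow([author["id"], clean_name(author["name"])])

csv_text = buf.getvalue()
print(csv_text)
```

Names without a tab pass through unchanged, so the helper is safe to apply to every record rather than only the abnormal ones.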

肖濛

May 22, 2019, 2:23:34 AM
to Open Academic Graph
Thanks for your attention.
This could be an encoding problem, but how can it happen? A paper with 1000+ authors?
I simply deleted all the records with the wrong encoding.
Are you a contributor to this organization?
Could you email me if the problem gets fixed in the source?

On Tuesday, May 14, 2019 at 9:45:32 PM UTC+8, Fanjin Zhang wrote:

Fanjin Zhang

May 22, 2019, 9:14:33 AM
to Open Academic Graph
It happens when processing some lines with special characters in the MAG author CSV files. We found that Python's codecs package does not handle these encoding issues well; other methods may work. We will try to fix this in the next update.
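
One possible "other method", offered only as an untested idea and not as the fix used for the dataset: read the raw file as bytes, split on "\r\n" only, and decode each row afterwards. The sketch below assumes the raw files are UTF-8, tab-separated, and CRLF-terminated; the function name `iter_rows` and the sample bytes are made up for illustration.

```python
import io

def iter_rows(stream, chunk_size=1 << 16):
    """Yield tab-split rows from a binary stream, splitting only on CRLF."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last CRLF is complete; keep the remainder.
        *complete, buffer = buffer.split(b"\r\n")
        for raw in complete:
            # Decode after splitting; undecodable bytes become U+FFFD.
            yield raw.decode("utf-8", errors="replace").split("\t")
    if buffer:
        # Final row without a trailing CRLF.
        yield buffer.decode("utf-8", errors="replace").split("\t")

# Made-up sample data; a real run would pass open(path, "rb") instead.
raw_bytes = b"1\tAnn\r\n2\tBob \xc3\xa9\r\n3\tCy"
parsed = list(iter_rows(io.BytesIO(raw_bytes), chunk_size=5))
print(parsed)
```

Splitting the bytes before decoding is safe for UTF-8 because the bytes of a multi-byte character never include CR or LF, so a row boundary cannot fall inside a character.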