Obtaining initial data for iCQA from StackExchange


Ramis Yamilov

unread,
Oct 14, 2012, 6:03:58 AM10/14/12
to ic...@googlegroups.com
I suggest using the StackExchange Data Explorer to preprocess data before importing it into iCQA.
The workflow might look like this:
  1. Write a query (for example, one that retrieves questions tagged 'java') and download the results as a CSV file.
  2. Preprocess the data (a sketch follows at the end of this message).
  3. Import the prepared data into an iCQA instance.
In any case, steps 1-2 must be performed manually. In step 3, the user has to create proper XML files for uploading.
So we have to find out how to make step 3 easier and define a more general way of importing data.
Are there any other suggestions?
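
A minimal sketch of step 2 in Python, for illustration (the column names Id, Title, Body, and Tags are an assumption about the Data Explorer's CSV export, not verified):

import csv

# Filter a Data Explorer CSV export down to rows that actually carry a body,
# before building the XML for iCQA.
with open('QueryResults.csv', 'rb') as src, open('prepared.csv', 'wb') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row.get('Body'):  # drop rows without content
            writer.writerow(row)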

Nikita Zhiltsov

unread,
Oct 14, 2012, 12:14:18 PM10/14/12
to ic...@googlegroups.com
How curious, thanks for the links. Can we easily patch the default importing mechanism to add support for loading CSV files? Does the StackExchange Data Explorer have an API, e.g. for making connections from Python code?

On Sunday, October 14, 2012 at 6:03:58 AM UTC-4, Ramis Yamilov wrote:

Ramis Yamilov

unread,
Oct 14, 2012, 5:08:07 PM10/14/12
to ic...@googlegroups.com
The SX Data Explorer doesn't have an API, but there is an API for the StackExchange sites themselves.
Adding support for loading CSV files is pretty straightforward, but the current code of the sximporter module is a little messy.
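
For instance, a minimal Python 2 sketch against the StackExchange API (the 2.1 endpoint and parameter names are an assumption from the public docs; note the API gzip-compresses every response):

import gzip
import json
import urllib2
from StringIO import StringIO

url = ('https://api.stackexchange.com/2.1/questions'
       '?tagged=java&site=stackoverflow&pagesize=100')
raw = urllib2.urlopen(url).read()
body = gzip.GzipFile(fileobj=StringIO(raw)).read()  # responses are gzipped
for q in json.loads(body)['items']:
    print q['question_id'], q['title'].encode('utf-8')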

On Sunday, October 14, 2012 at 8:14:18 PM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Oct 18, 2012, 5:53:49 PM10/18/12
to ic...@googlegroups.com
Ok. Let's postpone it and consider alternatives. I wonder whether it would be easier to add filtering by tag (e.g. 'java') when we upload the data from XML as usual. Of course, data consistency must be preserved: the imported set must contain the answers and comments associated with every question carrying the chosen tag.


On Sunday, October 14, 2012 at 5:08:07 PM UTC-4, Ramis Yamilov wrote:

Ramis Yamilov

unread,
Oct 22, 2012, 4:21:06 AM10/22/12
to ic...@googlegroups.com
It's not easier; it's the same amount of work to implement, because we would need to parse the whole dump and create a DB migration that adds all the data. Then we would need another migration (or an edit to the current one) that cleans the DB of unneeded tags and posts. In that case cascading deletes would be used, so data consistency would be preserved (see the sketch below).
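
A hypothetical shape for that cleanup (the forum_node, node_type, and tagnames names match the schema quoted later in this thread; the tag test is simplified):

-- Drop questions that don't carry the 'java' tag; with ON DELETE CASCADE
-- in place, their answers and comments are removed as well.
-- tagnames is a space-separated list, hence the padded LIKE pattern.
DELETE FROM forum_node
WHERE node_type = 'question'
  AND CONCAT(' ', tagnames, ' ') NOT LIKE '% java %';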

On Friday, October 19, 2012 at 1:53:49 AM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Oct 22, 2012, 3:26:31 PM10/22/12
to ic...@googlegroups.com
Ok, Ramis,

I have a few questions about data import. I'm populating iCQA with the latest StackOverflow dump (a ~10.5 GB ZIP file). How can we import the data without using the admin UI? Is the importing tool able to take advantage of several cores? What configuration should be set up for that?

Ramis Yamilov

unread,
Oct 22, 2012, 4:14:55 PM10/22/12
to ic...@googlegroups.com
There is no multithreading support in the 'sximporter' module. 
You can use the Django shell to import the data:

$ python manage.py shell
>>> from forum_modules.sximporter import importer
>>> from django.contrib.auth.models import User
>>> importer.sximport('/path/to/dir/with/xml/files/', {'authenticated_user': User.objects.get(id=1)})

I haven't tested the code above, but it should work.


On Monday, October 22, 2012 at 11:26:31 PM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Oct 22, 2012, 4:33:40 PM10/22/12
to ic...@googlegroups.com
After starting from scratch and adding a new user, I got the following error:

>>> from forum_modules.sximporter import importer
>>> from django.contrib.auth.models import User
>>> importer.sximport('/tmp/se-import-data', {'authenticated_user': User.objects.get(id=2)})
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/nzhiltsov/Documents/projects/iCQA/qa-engine/forum_modules/sximporter/importer.py", line 829, in sximport
    uidmap = userimport(dump, options)
  File "/Users/nzhiltsov/Documents/projects/iCQA/qa-engine/forum_modules/sximporter/importer.py", line 282, in userimport
    readTable(path, "users", callback)
  File "/Users/nzhiltsov/Documents/projects/iCQA/qa-engine/forum_modules/sximporter/importer.py", line 92, in readTable
    parser.parse(f)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 304, in end_element
    self._cont_handler.endElement(name)
  File "/Users/nzhiltsov/Documents/projects/iCQA/qa-engine/forum_modules/sximporter/importer.py", line 75, in endElement
    if self.el_data: self.callback(self.el_data)
  File "/Users/nzhiltsov/Documents/projects/iCQA/qa-engine/forum_modules/sximporter/importer.py", line 190, in callback
    uidmapper[owneruid] = osqau.id
  File "/Users/nzhiltsov/Documents/projects/iCQA/qa-engine/forum_modules/sximporter/importer.py", line 147, in __setitem__
    super(IdMapper, self).__setitem__(int(key), int(value))
TypeError: int() argument must be a string or a number, not 'NoneType'

Ramis Yamilov

unread,
Oct 22, 2012, 4:54:26 PM10/22/12
to ic...@googlegroups.com
I guess there are some problems with the ORM and getting the user id. Try passing an empty dict to the function:

>>> importer.sximport('/tmp/se-import-data', {})


On Tuesday, October 23, 2012 at 12:33:40 AM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Oct 22, 2012, 4:59:10 PM10/22/12
to ic...@googlegroups.com
This doesn't work either (see the trace below). Ramis, could you please sort out these issues on your own ASAP? This sub-task is of high priority now.

>>> importer.sximport('/tmp/se-import-data', {})
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/nzhiltsov/Documents/projects/iCQA/qa-engine/forum_modules/sximporter/importer.py", line 861, in sximport
    db.commit_transaction()
  File "/Library/Python/2.7/site-packages/south/db/generic.py", line 964, in commit_transaction
    transaction.commit(using=self.db_alias)
  File "/Library/Python/2.7/site-packages/django/db/transaction.py", line 200, in commit
    set_clean(using=using)
  File "/Library/Python/2.7/site-packages/django/db/transaction.py", line 125, in set_clean
    raise TransactionManagementError("This code isn't under transaction management")
TransactionManagementError: This code isn't under transaction management

Ramis Yamilov

unread,
Oct 23, 2012, 2:05:17 AM10/23/12
to ic...@googlegroups.com
I've tried this and it worked well for me:

>>> from forum_modules.sximporter import importer
>>> from south.db import db
>>> db.start_transaction()
>>> importer.sximport('/home/ramis/042012 Meta Server Fault', {})
['Users', 'Posts', 'Comments', 'Votes', 'Badges']

Content of the '/home/ramis/042012 Meta Server Fault' folder:
badges.xml
comments.xml
license.txt
posthistory.xml
posts.xml
readme.txt
users.xml
votes.xml


On Tuesday, October 23, 2012 at 12:59:10 AM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Oct 23, 2012, 1:00:37 PM10/23/12
to ic...@googlegroups.com
Ok, thanks. This works for me too.

Nikita Zhiltsov

unread,
Nov 1, 2012, 4:35:00 PM11/1/12
to ic...@googlegroups.com
Ramis,

I have two questions:

1) I've tried to import the whole of StackOverflow. This fails at some point (only part of the data was loaded successfully) with the following error (see error.txt). Could you please fix that, i.e. skip bad items, because it makes no sense to halt the overall process when such an error occurs?

2) Importing a large dump such as StackOverflow's requires a lot of time (e.g. the error interrupted the process after 1.5 days of execution on our pretty powerful machine, and my estimate is that only 10% of the data had been loaded). How can we speed it up? Do we really need files like posthistory.xml or badges.xml for our current task? Is it possible to load separate files in parallel? In the latter case, we could switch off the MySQL foreign key constraints for a while and populate even the related tables at once (see the sketch below).
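
A rough mysql client sketch of the constraint toggling (the setting is per-session, so every parallel connection must set it; 'part-0.sql' is a hypothetical split file):

SET FOREIGN_KEY_CHECKS = 0;   -- per-session setting
SOURCE part-0.sql;            -- other sessions load the remaining parts
SET FOREIGN_KEY_CHECKS = 1;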
error.txt

Ramis Yamilov

unread,
Nov 1, 2012, 5:30:17 PM11/1/12
to ic...@googlegroups.com
The simplest way to speed up the import is to create SQL files from the XML data and execute them in parallel. So in that case we have to write a utility that converts the XML files to SQL instead of changing the importing mechanism. I think such a solution will be much faster.

On Friday, November 2, 2012 at 12:35:00 AM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Nov 1, 2012, 7:14:32 PM11/1/12
to ic...@googlegroups.com
Ramis,

let's try it, then. How long will it take? And what about posthistory and badges: are those files really unnecessary?

Ramis Yamilov

unread,
Nov 2, 2012, 5:21:08 PM11/2/12
to ic...@googlegroups.com
Ok, I'll do it step by step. I've finished the initial functionality and implemented processing of 'posts.xml' for now: https://github.com/CLLKazan/iCQA/blob/convert_to_sql/qa-engine/convert_to_sql.py
Please check it and give me feedback.
Usage:
$ python convert_to_sql.py /path/to/posts.xml > some.sql


On Friday, November 2, 2012 at 3:14:32 AM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Nov 2, 2012, 6:24:40 PM11/2/12
to ic...@googlegroups.com
It failed:

SET FOREIGN_KEY_CHECKS = 0
Traceback (most recent call last):
  File "convert_to_sql.py", line 137, in <module>
    convert(sys.argv[1])
  File "convert_to_sql.py", line 125, in convert
    codecs.open(path, 'r', 'utf-8').readlines(),
  File "/usr/lib/python2.7/codecs.py", line 679, in readlines
    return self.reader.readlines(sizehint)
  File "/usr/lib/python2.7/codecs.py", line 588, in readlines
    data = self.read()
  File "/usr/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
MemoryError

Ramis Yamilov

unread,
Nov 2, 2012, 7:40:16 PM11/2/12
to ic...@googlegroups.com
I've updated the script and switched the parser to lxml as described here, so it no longer loads the entire file into memory.
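
The pattern looks roughly like this (a sketch, not the script's actual code; the dump files put one record per <row> element):

from lxml import etree

def convert(path, out):
    # Stream <row> elements one at a time instead of reading the whole file.
    for _, elem in etree.iterparse(path, tag='row'):
        attrs = dict(elem.attrib)  # Id, Body, Tags, ... per dump row
        out.write(u'-- an INSERT built from row %s goes here\n' % attrs.get('Id'))
        # Release the processed element so memory use stays flat.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]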

On Saturday, November 3, 2012 at 2:24:40 AM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Nov 5, 2012, 3:53:47 PM11/5/12
to ic...@googlegroups.com
This works quite well and fast, but we need to split the large output file (~11 GB) into 6 parts (the number of cores on the machine available for experiments). Please rewrite the script for this purpose. I believe the statements that switch off the constraints should be kept out of these 6 files, so that they contain INSERT statements only. Then we will be able to parallelize the execution of these scripts. You could also proceed with creating scripts for the remaining data (users and the rest).

Ramis Yamilov

unread,
Nov 7, 2012, 2:29:00 PM11/7/12
to ic...@googlegroups.com
I've implemented the splitting of the output. Usage now looks like this:
$ python convert_to_sql.py /path/to/xml number_of_files

For example, if we run
$ python convert_to_sql.py /path/to/users.xml 4

the script will produce 2 files for the 'auth_user' table and 2 files for the 'forum_user' table. The output files are named like 'auth_user-0.sql'.

I've also completed the processing of votes and users.
Importing of 'posthistory.xml' and 'badges.xml' will be added if needed.
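
The split files can then be loaded in parallel straight from the shell, e.g. (a sketch; the 'osqa' user and database names are an assumption):

# One mysql client per INSERT-only file, all running concurrently.
for f in posts-*.sql; do
    mysql -u osqa osqa < "$f" &
done
wait   # block until every background load has finished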

On Tuesday, November 6, 2012 at 12:53:48 AM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Nov 8, 2012, 1:29:55 PM11/8/12
to ic...@googlegroups.com
I ran convert_to_sql.py for 6 splits, then executed the resulting posts-0.sql file. It failed with the following error:

ERROR at line 34131: Unknown command '\''.

The content of lines 34131-34140 is as follows:

INSERT INTO forum_node
            (id, title, tagnames, author_id, body, node_type, parent_id,
            added_at, score, state_string, last_activity_by_id, last_activity_at,
            active_revision_id, extra_count, marked) VALUES (35594,'','',1934,'<p>It really comes down to the nature of your database and application. What kind of application(s) are hitting SQL Server? In my experience, it only handles 5-10 users with a heavy read/write application.</p>
','answer',35559,'2008-08-30 02:15:52',0,'',1,'2008-08-30 02:15:52',22801,0,0),
(35602,'','',1915,'<p>Also mostly news sites but will help you keep up with what is happening the java ecosystem :</p>

<ul>
<li><a href="http://www.theserverside.com/" rel="nofollow">The Server Side</a></li>
<li><a href="http://java.dzone.com/" rel="nofollow">Java.dzone.com</a> (formerly Java Lobby)</li>

Nikita Zhiltsov

unread,
Nov 9, 2012, 6:34:12 PM11/9/12
to ic...@googlegroups.com
Can we borrow the escaping mechanism from the default OSQA importer for the XML-to-SQL conversion?

Ramis Yamilov

unread,
Nov 10, 2012, 4:23:40 PM11/10/12
to ic...@googlegroups.com
Is it just this one error, or are there many errors like this? If there are only a few, they can be handled manually.
The other solution is using the re.escape() function. I'll check it soon.

On Thursday, November 8, 2012 at 10:29:56 PM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Nov 11, 2012, 11:02:43 PM11/11/12
to ic...@googlegroups.com
The current problem looks like it is caused by the need to escape quotes. I guess it's a common problem, though. We need the same escaping as in the standard OSQA importer.

Ramis Yamilov

unread,
Nov 12, 2012, 11:48:53 AM11/12/12
to ic...@googlegroups.com
I've added the escaping mechanism from the MySQLdb library.

On Monday, November 12, 2012 at 8:02:43 AM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Nov 12, 2012, 12:05:05 PM11/12/12
to ic...@googlegroups.com
This does not work:

Traceback (most recent call last):
  File "convert_to_sql.py", line 303, in <module>
    convert(sys.argv[1], files_count)
  File "convert_to_sql.py", line 292, in convert
    converter.convert(context, files_count)
  File "convert_to_sql.py", line 148, in convert
    values = u',\n' + self.make_sql(elem.attrib)
  File "convert_to_sql.py", line 112, in make_sql
    obj.get('OwnerUserId', '1'), escape_string(obj['Body']),
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 248: ordinal not in range(128)

Nikita Zhiltsov

unread,
Nov 12, 2012, 1:00:06 PM11/12/12
to ic...@googlegroups.com
BTW, the version of Python is 2.7.1. 

Ramis Yamilov

unread,
Nov 12, 2012, 1:42:04 PM11/12/12
to ic...@googlegroups.com
I've added a workaround (smart_str and smart_unicode come from django.utils.encoding, escape_string from MySQLdb):

from django.utils.encoding import smart_str, smart_unicode
from MySQLdb import escape_string

def escape(string):
    # Encode to UTF-8 bytes, escape for MySQL, then decode back to unicode.
    return smart_unicode(escape_string(smart_str(string)))

It should work.


On Monday, November 12, 2012 at 9:05:05 PM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Nov 12, 2012, 7:08:40 PM11/12/12
to ic...@googlegroups.com
Ok, thanks. It seems to work properly.

Nikita Zhiltsov

unread,
Nov 15, 2012, 1:00:11 PM11/15/12
to ic...@googlegroups.com
Ramis,

I discovered a problem with the loaded data. The steps to reproduce are:
1) Execute a query "select * from forum_tag where name like 'java';"
+----+------+---------------+---------------------+------------+
| id | name | created_by_id | created_at          | used_count |
+----+------+---------------+---------------------+------------+
| 70 | java |             1 | 2012-11-12 13:51:13 |     277652 |
+----+------+---------------+---------------------+------------+

2) Execute a query "select count(*) from forum_node_tags where tag_id=70;"
+----------+
| count(*) |
+----------+
|        4 |
+----------+

3) At the same time, we get a different result from the query "select count(*) from forum_node where tagnames like 'java %';"
+----------+
| count(*) |
+----------+
|   259290 |
+----------+

In other words, the data is inconsistent. Here is the list of files I got after conversion and loaded into the database:
  • auth_user-0.sql
  • forum_user-0.sql
  • forum_action-0.sql  
  • forum_vote-0.sql 
  • posts-0.sql  
  • posts-1.sql  
  • posts-2.sql
  • posts-3.sql
  • posts-4.sql  
  • posts-5.sql
  • posts-misc.sql
What's wrong with that?

Ramis Yamilov

unread,
Nov 16, 2012, 10:07:37 AM11/16/12
to ic...@googlegroups.com
Try reloading posts-misc.sql; maybe there were some errors while loading it.
I've tested these queries, and here is what I got:
select * from forum_tag where name like 'cc.complexity-theory';

id, name, created_by_id, created_at, used_count
'7', 'cc.complexity-theory', '1', '2012-11-16 18:47:40', '841'

select count(*) from forum_node_tags where tag_id=7;

count(*)
'841'

select count(*) from forum_node where tagnames like '%cc.complexity-theory%';

count(*)
'841'



On Thursday, November 15, 2012 at 10:00:11 PM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Nov 20, 2012, 9:48:23 PM11/20/12
to ic...@googlegroups.com
Yep, something's wrong with the posts-misc.sql file. Here is the log from my bash script that runs the scripts sequentially:

*** auth_user-0.sql start loading ***
*** auth_user-0.sql done ***
*** forum_action-0.sql start loading ***
*** forum_action-0.sql done ***
*** forum_user-0.sql start loading ***
*** forum_user-0.sql done ***
*** forum_vote-0.sql start loading ***
*** forum_vote-0.sql done ***
*** posts-0.sql start loading ***
*** posts-0.sql done ***
*** posts-1.sql start loading ***
*** posts-1.sql done ***
*** posts-2.sql start loading ***
*** posts-2.sql done ***
*** posts-3.sql start loading ***
*** posts-3.sql done ***
*** posts-4.sql start loading ***
*** posts-4.sql done ***
*** posts-5.sql start loading ***
*** posts-5.sql done ***
*** posts-misc.sql start loading ***
ERROR 1062 (23000) at line 31551: Duplicate entry '601-167' for key 'node_id'
*** posts-misc.sql done ***

Nikita Zhiltsov

unread,
Dec 11, 2012, 10:09:42 PM12/11/12
to ic...@googlegroups.com
Ramis,

did you manage to reproduce the error?

Ramis Yamilov

unread,
Dec 12, 2012, 3:39:16 PM12/12/12
to ic...@googlegroups.com
No, I didn't. I can't figure out how duplicate node IDs appear there.

On Wednesday, December 12, 2012 at 7:09:42 AM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Dec 12, 2012, 3:43:23 PM12/12/12
to ic...@googlegroups.com
So, does it mean that you do have the whole StackOverflow dump imported into iCQA on your machine?

Ramis Yamilov

unread,
Dec 12, 2012, 3:45:50 PM12/12/12
to iCQA
No. I don't have the StackOverflow data dump

2012/12/13 Nikita Zhiltsov via iCQA <icqa+noreply-APn2wQepKAJS0lVTFxy...@googlegroups.com>



--
Ramis Yamilov
Kazan Federal University, Institute of Computer Mathematics and Information Technologies

Nikita Zhiltsov

unread,
Dec 12, 2012, 3:53:25 PM12/12/12
to ic...@googlegroups.com
Thus, our most urgent common goal is to acquire exactly that.


On Wednesday, December 12, 2012 3:45:50 PM UTC-5, Ramis Yamilov wrote:

Ramis Yamilov

unread,
Jan 17, 2013, 2:47:43 AM1/17/13
to ic...@googlegroups.com
I've finally managed to import posts.xml after many tries. There were some problems with the Python script's memory usage, so I used SQLite to store the intermediate data.

mysql> select * from forum_tag where name like 'java';
+----+------+---------------+---------------------+------------+
| id | name | created_by_id | created_at          | used_count |
+----+------+---------------+---------------------+------------+
| 70 | java |             1 | 2013-01-12 23:26:27 |     277652 |
+----+------+---------------+---------------------+------------+
1 row in set (0.11 sec)

mysql> select count(*) from forum_node_tags where tag_id=70;
+----------+
| count(*) |
+----------+
|   277651 |
+----------+
1 row in set (2.03 sec)

mysql> select count(*) from forum_node where tagnames rlike '[[:<:]]java[[:>:]]';
+----------+
| count(*) |
+----------+
|   286190 |
+----------+
1 row in set (27 min 33.82 sec)
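
The intermediate-storage idea, as a minimal sketch (the real code is in the convert_to_sql branch; this table layout is illustrative only):

import sqlite3

db = sqlite3.connect('intermediate.db')  # on disk, not in RAM
db.execute('CREATE TABLE IF NOT EXISTS post '
           '(id INTEGER PRIMARY KEY, parent_id INTEGER, tagnames TEXT)')

def remember(attrs):
    # attrs is the attribute dict of one parsed <row>
    db.execute('INSERT OR REPLACE INTO post VALUES (?, ?, ?)',
               (int(attrs['Id']), attrs.get('ParentId'), attrs.get('Tags')))

# Cross-references are then resolved by queries instead of in-memory dicts:
for post_id, parent_id in db.execute(
        'SELECT id, parent_id FROM post WHERE parent_id IS NOT NULL'):
    pass  # emit the SQL linking answer post_id to question parent_id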




On Thursday, December 13, 2012 at 12:53:25 AM UTC+4, Nikita Zhiltsov wrote:

Nikita Zhiltsov

unread,
Feb 10, 2013, 1:20:46 AM2/10/13
to ic...@googlegroups.com
All right, I didn't have much time to look through that. Let's discuss it and other things over Skype at the beginning of next week, together with Azat.

Guofeng

unread,
Jun 13, 2013, 4:59:50 AM6/13/13
to ic...@googlegroups.com
Hi,

When I load the SQL files, e.g. "mysql -u osqa -p osqa < posts-0.sql",
I get an error: "ERROR: Duplicate entry '1' for key 'PRIMARY'".

Can someone help me?

Many thanks.

Азат Хасаншин

unread,
Jun 13, 2013, 5:42:59 AM6/13/13
to iCQA on behalf of Guofeng
Hello,

You need to delete your administrator user. You can do this from the Django shell or the mysql client.
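
For example, from the Django shell (assuming the administrator you created got id 1; check with User.objects.all() first):

$ python manage.py shell
>>> from django.contrib.auth.models import User
>>> User.objects.get(id=1).delete()  # assuming the admin is user id 1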





--
Азат Хасаншин