Hi all,
I'm a junior at Peking University majoring in Computer Science. I only learned about the GSoC project recently... so sorry for posting this so late.
During my time at university, I've done a lot of work on web crawlers and have joined the "Search Engine and Web Mining" group at my school to learn more about crawling data. I've always found crawling websites very interesting, because it demands that I delve into the internals of particular websites and trace the data to see when and where it is produced, much like detective work : )
I've gone through the list of ideas you provided and found "Support for Spiders in other languages" very fascinating! I have used several open-source crawlers in Java, like Heritrix and Nutch. Honestly, they are difficult to configure and not as extensible as Scrapy... so it would be good news for Java developers if this idea came true! Besides, I have some experience with Hadoop as well. Hadoop Streaming provides a convenient interface for developers to write their Mapper/Reducer in languages other than Java. I think it's an awesome and feasible idea to make Scrapy's spiders as flexible as Hadoop's Mapper/Reducer.
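For context, here is a minimal sketch of the Hadoop Streaming contract I'm referring to: the framework feeds raw input lines to the mapper on stdin and expects tab-separated key/value pairs on stdout, so any language that can read stdin and write stdout can participate. The function names and the word-count example below are my own illustration, not part of any Scrapy or Hadoop API:

```python
import sys

def map_line(line):
    """Word-count mapper step: emit a (word, 1) pair for each word in a line."""
    return [(word, 1) for word in line.strip().split()]

def run_mapper(stream):
    # In a real streaming job, `stream` would be sys.stdin and Hadoop would
    # shuffle and group the emitted pairs by key before the reducer runs.
    for line in stream:
        for word, count in map_line(line):
            sys.stdout.write(f"{word}\t{count}\n")

if __name__ == "__main__":
    # Demonstration on a sample line instead of real stdin input.
    run_mapper(["to be or not to be"])
```

A spider interface for Scrapy along these lines could, in principle, exchange requests and scraped items over the same kind of line-oriented protocol.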
So could anyone tell me whom I should contact next if I want to participate in this project? I hope it's not too late...
Thanks a lot!
- Leo