Fwd: [berkmanfriends] Open Syllabus Project opening

Brian Keegan

Apr 4, 2016, 4:49:17 PM

---------- Forwarded message ----------

Seeking Full-stack Engineer / Data Scientist, Open Syllabus Project

April 4, 2016

The Open Syllabus Project is an academic data-mining project at Columbia and Stanford that's extracting structured information from a corpus of 1M+ college course syllabi. What's actually being taught in college classrooms? How has this changed over time? What can we learn about the organization of the modern university from large-scale trends in the texts that are being assigned? How can insights from these data be applied to curriculum development, education policy, and lifelong learning?

We launched a beta version of the platform with an op-ed in the New York Times in January, and since then the project has appeared in The Washington Post, Time, The Chronicle of Higher Education, MarketWatch, Der Spiegel, Business Insider, Lifehacker, FiveThirtyEight, WNYC, QZ, and elsewhere. It's also been picked up by major news outlets in Europe, Russia, China, Japan, South Korea, Ukraine, Egypt, and Mexico.

We're looking for someone who has experience with large-scale data analysis, natural language processing, web archiving, and web application development to help us grow OSP into a comprehensive, feature-rich authority about teaching trends in higher education. Some of the things we're going to be working on in the coming months:

Build a scalable infrastructure for crawling university websites for syllabi, with the goal of growing the corpus to 4-5M documents in the next 6 months.

Expand the universe of books and articles that we search for in syllabi by identifying new bibliographic databases (Citeseer, arXiv) and integrating them into OSP's data extraction pipeline.

Write classifiers to improve the accuracy of the citation and metadata extraction jobs.

As we extract more (and higher quality) data about the corpus, expand the public-facing web application to surface new types of information - visualize change in assignment trends over time, add profile pages for authors and publishers, and build richer ways to explore the citation graph.

Help develop a research program around the data. We're interested in applications to information science, literary studies, education policy, history of science, and canon / university studies.

If these kinds of projects sound interesting, we'd love to hear from you! In terms of specific skills, OSP is very much a full-stack project. The work is split evenly across web application development, devops, and what's often lumped together as "data science." We often begin with tasks that have the look and feel of academic research questions, but the solutions then have to be baked off into well-tested, production-ready code that can work at scale. We'd love to work with someone who has experience with (and enjoys) both of these modes of work.

All of our code can be found on GitHub here and here. We use Python for the data extraction rig and the public-facing website (Flask), Elasticsearch for citation extraction, React+Redux on the front end, and Ansible to manage infrastructure on AWS. Beyond specific technologies, though - first and foremost we're looking for a collaborator and partner who can help us build on what we have and push the project in new directions.

Salary: Competitive

Commitment: Full time, initially one year, longer term possible

Location: Remote or NYC

If interested, please send an email and CV to syllab...@gmail.com
