Lesson learned - don't be afraid to ask.

Matthew Cronin

Dec 12, 2019, 6:30:03 PM

to Penny University

As a new member of the group, I'd like to leave a short review of a chat between John Berryman, Jeremy Jordan, Chang Lee, and myself on the #data channel. While we didn't spin off into a full-on Penny Chat, it immediately made me appreciate what Penny-U offers as a community.

It's been a little over a year since I decided that I wanted to make the transition from academic research as an MRI physicist into industry and the practice of Data Science. Starting out was easy enough: I worked through online courses and exercises to learn basic Python, then data manipulation and visualization using NumPy/Pandas/Matplotlib/Seaborn etc., followed by the array of tools and models in scikit-learn. As I got closer to actually being eligible to apply for jobs, it became clear that it would be beneficial to devise a totally independent side project and document it on GitHub.

As my data analysis and machine learning experience to date had been primarily based on applying various regression techniques in the numerical modeling of MRI physics and analysis of MRI data, I decided to branch out and take a crack at some NLP/document classification. I have been a sometime reader of the /r/personalfinance page on Reddit since moving to the US, and realized that there was a convenient trove of user-classified ('flair' tagged) unstructured text documents staring me in the face. I was fairly certain that there would be an API available to retrieve the data. A quick Google search led me to the Python Reddit API Wrapper (https://praw.readthedocs.io/en/latest/) and I set about gathering data, learning different methods of tokenization and classification, and devising some goals beyond simply predicting the user-assigned flair of unseen posts.

On the upside, this project has proven to be a powerful learning experience, both in the core skills of data acquisition, restructuring, tokenization and classification, and in peripheral skills such as working with virtual environments, a standardized project structure (thank you, Cookiecutter Data Science team), and consistent use of Git to track the project. On the downside, the open-ended project scope (and the learning curve involved) led me right into the trap of analysis paralysis, particularly when it came to assessing my efforts at classification. I ended up spinning my wheels and felt like my progress towards a project that I'd be pleased to have another person (especially a hiring manager) read over was stalling.

Fortunately, through a conversation with BettyAnn Chodkowski (veteran programmer, long-time MRI researcher, and now Data Scientist at HCA) I had been put in touch with John on LinkedIn, who in turn connected me to Penny University. I explained what I was trying to do and where I was feeling stuck, asked if anybody had any advice, and was very quickly pointed to some useful tools (multi-class confusion matrix new in scikit-learn v0.22) and given constructive advice (can I label misclassified documents any better than my model?) by Jeremy and Chang. This was very helpful in getting me out of the rut in my classification project and back on the path to completion and pushing on with the next project.
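
For anyone curious what those two suggestions look like in practice, here is a minimal sketch in plain Python. It is not the code from my project: the posts, the flair names, and the helper functions `confusion_counts` and `misclassified` are all hypothetical and only there for illustration. It tallies (true flair, predicted flair) pairs, which is the information a multi-class confusion matrix summarizes, and then pulls out the misclassified posts so they can be read and re-labeled by hand.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def misclassified(docs, y_true, y_pred):
    """Return (doc, true_label, predicted_label) for every wrong prediction."""
    return [(d, t, p) for d, t, p in zip(docs, y_true, y_pred) if t != p]


# Hypothetical posts and flairs, made up for this example.
docs = [
    "How do I roll over a 401k?",
    "Paying off a car loan early",
    "Best way to budget for rent",
]
y_true = ["Retirement", "Debt", "Budgeting"]
y_pred = ["Retirement", "Budgeting", "Budgeting"]

counts = confusion_counts(y_true, y_pred)
errors = misclassified(docs, y_true, y_pred)

# Each off-diagonal pair (true != predicted) is a kind of mistake.
for (true_label, pred_label), n in sorted(counts.items()):
    print(f"true={true_label:<10} predicted={pred_label:<10} count={n}")

# Read these by hand: could a person have labeled them any better?
for doc, true_label, pred_label in errors:
    print(f"{doc!r}: tagged {true_label}, model said {pred_label}")
```

In a real project, `confusion_matrix` from `sklearn.metrics` gives the same tallies as an array, and reading through the misclassified posts is where Jeremy and Chang's question gets answered.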

All of that is to say that I think Penny U is a great community and an invaluable resource, especially in a field whose paths to entry are so diverse, and where it could easily be quite intimidating to put your cards on the table when trying to step in from a different career path. Hopefully in the near future I can pay it forward to another new arrival.
