A simple data processing pipeline example (for proximity)

Oren Lederman

May 20, 2019, 4:08:27 PM
to Rhythm Badges
Hi all,

I recently finished anonymizing an analysis environment that I used for some of my studies. It's not perfect, but it can provide you with some nice examples of how to handle your data. The scripts themselves can be found here - https://github.com/HumanDynamics/rhythm-public-analysis-deltav18-anon

What can you expect to find there?
  • The environment is structured based on the cookiecutter data science project template. It separates code (on GitHub) from data (stored in AWS S3). There are folders for reusable Python code (under /src), some scripts for synchronizing your data with Amazon S3, and a place for your notebooks (/notebooks)
  • A data processing "pipeline" with the following steps: 
    • download - pulls data files from the server. You'll need to dig into the settings and code and tweak them to your needs, but in general you'll need to use your project key, list the hubs that generated the data, and give it a range of dates you are interested in. The server maintains a daily file for each hub, so this script just iterates over the hubs and dates and pulls the files (see the download sketch after this list)
    • gzip - a hacky stage where we gzip the files before the next steps. It saves a lot of storage space
    • group - creates per-hour files from all available data. Makes it easier for the next steps to run pre-processing in parallel (see the grouping sketch after this list)
    • process - preprocessing. Creates the basic data structures (member-to-member, member-to-badge, etc.)
    • clean - a set of functions that clean up the data. Based on your configuration (in the config.py file), it will:
      • Remove data before the experiment started and after it ended
      • For each participant (member), it allows you to define when the person joined and left (see the members file example for reference)
      • Removes data from when I manually changed the batteries (Sunday night). That's project-specific, but I didn't have a chance to generalize it
    • analysis - various analysis steps that I was able to generalize. In particular:
      • Compliance - for each member, this step identifies when the badge was in use and when it wasn't. Based on this information, it keeps member-to-member records only from when the badges were actually in use. There is no magic here - you need to structure your experiment in a very specific way to do this. I set up location beacons in the reception area (marked in the metadata as "board" type); when a badge is closest to one of these beacons, I assume it is not being used (see the compliance sketch after this list)
      • Connections - aggregates the member-to-member data frames to create connection tables (see the connections sketch after this list)
        • You can set a list of RSSI cutoffs to use, and it will create a different table for each one
        • It aggregates at two temporal resolutions (daily and overall/annual) and at several grouping levels (member-to-member, member-to-company, company-to-company). In my example, people belonged to different companies, so it was convenient to group the data and measure the interaction between companies. Companies are defined in the members.csv
  • Some examples to look at 
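
To give a flavor of what the download step does, here is a minimal sketch of the idea in Python. This is not the actual script from the repo - the server URL, endpoint path, and file layout below are placeholders, so check the settings and code in the repo for the real ones:

import datetime
import os
import requests

SERVER_URL = "https://your-badge-server"   # placeholder
PROJECT_KEY = "your-project-key"           # placeholder
HUBS = ["hub-01", "hub-02"]                # the hubs that generated your data
START = datetime.date(2019, 5, 1)
END = datetime.date(2019, 5, 7)
OUT_DIR = "data/raw"

def daterange(start, end):
    # yield every date from start to end, inclusive
    for offset in range((end - start).days + 1):
        yield start + datetime.timedelta(days=offset)

# the server maintains one daily file per hub, so iterate over hubs x dates
os.makedirs(OUT_DIR, exist_ok=True)
for hub in HUBS:
    for day in daterange(START, END):
        url = f"{SERVER_URL}/{PROJECT_KEY}/{hub}/{day.isoformat()}"  # made-up path
        resp = requests.get(url)
        if resp.ok:
            out_path = os.path.join(OUT_DIR, f"{hub}_{day.isoformat()}.txt")
            with open(out_path, "wb") as f:
                f.write(resp.content)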
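
The group step is conceptually just re-bucketing all raw records by hour so later stages can run in parallel. A rough sketch, assuming (and this is an assumption - check the repo for the real format) that the raw files are gzipped JSON lines with a "timestamp" field:

import glob
import gzip
import json
import os
from collections import defaultdict
from datetime import datetime, timezone

# collect raw lines into per-hour buckets
buckets = defaultdict(list)
for path in glob.glob("data/raw/*.txt.gz"):
    with gzip.open(path, "rt") as f:
        for line in f:
            record = json.loads(line)  # assumes one JSON record per line
            ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
            buckets[ts.strftime("%Y-%m-%d_%H")].append(line)

# write each hourly bucket out as its own gzipped file
os.makedirs("data/grouped", exist_ok=True)
for hour, lines in buckets.items():
    with gzip.open(f"data/grouped/{hour}.txt.gz", "wt") as f:
        f.writelines(lines)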
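
The core trick in the compliance step can be sketched in a few lines of pandas. The file and column names here are made up for illustration (the real data structures come out of the process step), but the logic is the same - find each member's closest beacon at each point in time, and treat "closest to a board beacon" as "badge not in use":

import pandas as pd

# member-to-badge pings: one row per (datetime, member, beacon, rssi)
m2b = pd.read_csv("member_to_badge.csv", parse_dates=["datetime"])
# beacon metadata; the "board" type marks the reception-area beacons
beacons = pd.read_csv("beacons.csv")  # columns assumed: beacon, type

# for each member and timestamp, keep only the strongest (closest) beacon
idx = m2b.groupby(["datetime", "member"])["rssi"].idxmax()
closest = m2b.loc[idx].merge(beacons, on="beacon")

# if the closest beacon is a "board" beacon, assume the badge is not in use
closest["in_use"] = closest["type"] != "board"
usage = closest[["datetime", "member", "in_use"]]

# keep member-to-member records only when both badges were in use
m2m = pd.read_csv("member_to_member.csv", parse_dates=["datetime"])
m2m = (m2m
       .merge(usage.rename(columns={"member": "member1", "in_use": "in_use1"}),
              on=["datetime", "member1"], how="left")
       .merge(usage.rename(columns={"member": "member2", "in_use": "in_use2"}),
              on=["datetime", "member2"], how="left"))
m2m = m2m[m2m["in_use1"].fillna(False).astype(bool)
          & m2m["in_use2"].fillna(False).astype(bool)]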
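
And the connections aggregation boils down to filtering by an RSSI cutoff and grouping. Again, the file names, column names, and cutoff values below are hypothetical:

import pandas as pd

m2m = pd.read_csv("member_to_member.csv", parse_dates=["datetime"])
members = pd.read_csv("members.csv")  # columns assumed: member, company

RSSI_CUTOFFS = [-75, -67]  # example values; pick cutoffs that fit your setup

connections = {}
for cutoff in RSSI_CUTOFFS:
    strong = m2m[m2m["rssi"] >= cutoff].copy()
    strong["date"] = strong["datetime"].dt.date

    # daily member-to-member: number of pings per pair per day
    daily = (strong.groupby(["date", "member1", "member2"])
                   .size().rename("count").reset_index())

    # overall member-to-member: collapse the date dimension
    overall = (strong.groupby(["member1", "member2"])
                     .size().rename("count").reset_index())

    # company-to-company: map members to their companies, then group
    company = dict(zip(members["member"], members["company"]))
    c2c = (strong.assign(company1=strong["member1"].map(company),
                         company2=strong["member2"].map(company))
                 .groupby(["company1", "company2"])
                 .size().rename("count").reset_index())

    connections[cutoff] = {"daily": daily, "overall": overall, "company": c2c}
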
I can't make my data available just yet, so I'm attaching my members and beacons metadata files for reference

Hope this helps. Feel free to ask questions.

Oren
beacons.csv
members.csv