best practices for transcript download/analysis


Ellen Hampton Filgo

Mar 5, 2015, 1:29:52 PM
to LibraryH3lp
All,

I'd like to ask the libraryh3lp user community whether anyone has best practices for downloading, storing, cleaning up, and anonymizing transcripts in order to do a textual analysis.

Thanks!

Ellen Filgo
Baylor University

smudgy

Mar 5, 2015, 3:03:24 PM
to Ellen Hampton Filgo, LibraryH3lp
There's some built-in anonymization in the admin console, of course, but if I'm understanding it right, that only does things like scrub IP addresses and (ha-ha) AIM usernames. The bigger problem is that people inevitably communicate personal information when they're trying to get account help and the like. In those cases, any anonymizing is going to have to be done by hand, which means some human is going to have to look at the data. I don't know if there are established best practices for this kind of thing, but I wonder if a reasonable approach would be to have one person dedicated to anonymizing, and then have that person not do any of the coding or analysis.

As for downloading and storing, you basically get what LH3 gives you, which is lots of nested files and folders. For text analysis we like to work from a random sample, but the nested directory structure makes that non-trivial. Somewhere, though, I've got a Java file that'll automatically crawl a directory for .txt files and copy some percentage of them to a destination directory. Let me know if you're interested.
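If it helps in the meantime, here's the same idea as a rough Python sketch (this isn't my actual Java file; the folder names and the 10% default are placeholders to adjust):

import random
import shutil
from pathlib import Path

def sample_transcripts(src_dir, dest_dir, fraction=0.10):
    """Copy a random fraction of the .txt files under src_dir into dest_dir."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    transcripts = list(src.rglob("*.txt"))   # recurses through the nested folders
    if not transcripts:
        return 0
    k = max(1, int(len(transcripts) * fraction))
    for path in random.sample(transcripts, k):
        # prefix with the parent folder's name so flattened files don't collide
        shutil.copy(path, dest / f"{path.parent.name}_{path.name}")
    return k

print(sample_transcripts("transcripts", "sampled"))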

-dre.


Chad Haefele

Mar 5, 2015, 3:15:38 PM
to smudgy, Ellen Hampton Filgo, LibraryH3lp
Dre's advice is exactly what we've done to anonymize transcripts in the past. We do what we can in the admin console, then have someone read through each transcript individually. We blank out usernames, real names, university IDs and any other identifying numbers, phone numbers, and email addresses. We do this as a first step by itself, not simultaneously with any analysis.
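For what it's worth, a quick regex sweep could blank out the machine-detectable items before the manual read-through. This is a sketch, not our actual process, and the patterns are illustrative only; names and free-text identifiers won't match anything, so a person still reads every transcript afterward:

import re

PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),  # US-style numbers
    (re.compile(r"\b\d{8,10}\b"), "[ID]"),  # long digit runs, e.g. university IDs
]

def scrub(text):
    # apply the email pattern first so phone/ID patterns don't eat its digits
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(scrub("I'm jdoe@example.edu, ID 900123456, call 555-123-4567."))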

The transcripts are downloaded in that nested structure to begin with, but in Windows I just search for all .txt files in the main folder and then copy/paste the results into one folder somewhere.

-Chad

LibraryH3lp Support

Mar 5, 2015, 3:30:59 PM
to Ellen Hampton Filgo, LibraryH3lp
Hi Ellen,

Strictly in terms of the mechanics of the LH3-specific part, be aware that the "anonymize" option in the admin dashboard nukes IP addresses, phone numbers for texting patrons, and IM gateway screennames, and, very importantly, it also deletes the transcript. This leaves you with just scrubbed metadata and no transcripts at all. The transcripts are deleted by that automated function because they're usually full of identifying info that's hard to auto-detect.

For managing downloaded transcripts, it's easy to break them out of the default nested directory structure and also to combine a bunch of transcripts into a single file if desired.  PC-based instructions here.  We can help if needed.
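If you'd rather script that combining step than follow the PC instructions, something like this rough Python sketch (adjust the folder and file names to your own download) does the same thing:

from pathlib import Path

def combine_transcripts(src_dir, out_file):
    # Flatten the nested download and concatenate every transcript into
    # one file, with a header line marking where each transcript starts.
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).rglob("*.txt")):
            out.write(f"===== {path} =====\n")
            out.write(path.read_text(encoding="utf-8", errors="replace"))
            out.write("\n\n")

combine_transcripts("transcripts", "all_transcripts.txt")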

Now, on to the much more difficult issue: textual analysis of chat transcripts. Identifying info is really hard to auto-detect, so if you need to build a large corpus AND make sure it is scrubbed, that's usually a big manual job: someone has to read each transcript. If the corpus will only be used in-house for internal analysis, consider how thoroughly it really needs to be scrubbed. Data gathered and used as part of the day-to-day internal functioning of the library might be treated differently than data analyzed for a published study or shared with third parties, such as researchers beyond the library. But that's a matter for each institution's internal policies.

Anyway, all of this is one reason sampled data becomes attractive, as opposed to a truly huge transcript corpus: there is less of it to scrub by hand. One method for getting a good random sample is to download the CSV chat metadata for your desired date range. Open it in Excel and you'll see that each chat has its own ID number. Delete the rows for unanswered chats (no operator identified). Note the number of remaining rows (if you have 1,000 chats, there will be 1,000 rows), generate X random numbers in that range, use those row numbers to grab the chat IDs, and download just those transcripts. You'll probably also want a rule that says, "if a selected transcript is just a thanks-bye, pick the next transcript in its place."
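The same recipe can be scripted instead of done in Excel. This is just a sketch, and the column names ("id", "operator") are guesses, so check the header row of your own CSV export first:

import csv
import random

def sample_chat_ids(csv_path, n=100):
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # drop unanswered chats: no operator identified
    answered = [row for row in rows if (row.get("operator") or "").strip()]
    return [row["id"] for row in random.sample(answered, min(n, len(answered)))]

print(sample_chat_ids("chat_metadata.csv"))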

Very best,
Pam
LibraryH3lp support




Margaret Smith

Mar 5, 2015, 5:38:15 PM
to Ellen Hampton Filgo, LibraryH3lp
For the "mining" part, I splurged on this thing called FileLocator Pro ($50 for 3-4 PCs of downloads, but there's a free 30-day trial, if you only need it for a short time). It lets you do boolean, "whole word", and regex searching. Then, it retrieves however many lines before and after the actual words/characters in your search (I set this # absurdly high, to get the full transcript), and you can output these as new text files (which also gets rid of the nested folders).

After that, I batch-anonymize the transcripts using regex in TextCrawler (free). To do this, I search for:
\b[A-Za-z0-9._%+-]+@libraryh3lp\.com\b
and replace with:
libr...@libraryh3lp.com

I have no idea if this is the *most efficient* way, but it works great for me. Feel free to contact me directly if you have any questions.
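If you want to skip TextCrawler too, the same find-and-replace is a short Python loop. A sketch, assuming the transcripts are already flattened into one folder, and using a placeholder replacement address:

import re
from pathlib import Path

# same substitution as the TextCrawler step, applied in place to every .txt file
JID = re.compile(r"\b[A-Za-z0-9._%+-]+@libraryh3lp\.com\b")

for path in Path("sampled").glob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="replace")
    path.write_text(JID.sub("anon@libraryh3lp.com", text), encoding="utf-8")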
