Examine Contents shows nothing

Jarad Buckwold

Jul 26, 2018, 3:00:41 PM
to archivematica
Hi all,

Fairly new to Archivematica - particularly the appraisal tab - and am having an issue with the "examine contents" part of the analysis window. Namely, it shows nothing. I put 2 test files into the system riddled with personal information (addresses, credit card numbers, emails, etc.), but neither of the files shows up when selecting that option. I've tried selecting the files both on the backlog side of the tab and the file list side, but can't get anything to appear. It just shows what look like the headers of a table with no content. Am I missing something or totally misunderstanding the purpose of this function? Any assistance would be greatly appreciated. Thanks!

Screenshot for reference:


Max Eckard

Jul 26, 2018, 3:18:17 PM
to archiv...@googlegroups.com
Hi Jarad,

You might want to check that the Examine Contents micro-service is turned on in your processing configuration, which is in the Administration tab (I believe it's off by default). If that micro-service doesn't run, there's no Bulk Extractor report for it to show you.

Hope that helps!

Max

--
You received this message because you are subscribed to the Google Groups "archivematica" group.
To unsubscribe from this group and stop receiving emails from it, send an email to archivematic...@googlegroups.com.
To post to this group, send email to archiv...@googlegroups.com.
Visit this group at https://groups.google.com/group/archivematica.
For more options, visit https://groups.google.com/d/optout.


--
Max Eckard
Lead Archivist for Digital Initiatives


1150 Beal Avenue
Ann Arbor, MI 48109-2113

Jarad Buckwold

Jul 26, 2018, 3:31:13 PM
to archivematica
Hi Max,

Thanks for the reply. The Examine Contents micro-service is indeed turned on. I can see the micro-service taking place as the package goes through the transfer process. It just shows nothing for some reason, despite the fact that I purposely loaded the documents with personal info. Either bulk extractor isn't catching them (which is a bit scary), or the dashboard just isn't showing them. Or I'm doing something wrong.

Sarah Romkey

Jul 26, 2018, 3:48:42 PM
to archiv...@googlegroups.com
If you want to check whether Bulk Extractor is catching anything, you can do that through the Ingest tab; there's a feature there that existed before the Appraisal tab. Click on "Search transfer backlog" and make sure the "Show metadata and logs" directory option is checked. You should be able to navigate through the transfer to find the BE logs, which should be nested within the logs directory. If you do not see logs there, then no BE logs were written. Archivematica discards logs that are 0 bytes, so that can mean that BE ran but found nothing. You can also check for errors in the Examine Contents micro-service by clicking on the gear icon in the job, which might tell you if there was some kind of error running the tool.

For what it's worth, in our sample transfers we made a DemoTransfer (https://github.com/artefactual/archivematica-sampledata/tree/master/SampleTransfers/DemoTransfer) and we put in both credit card numbers and SSNs. The credit card numbers are caught but the SSNs aren't; we haven't had the opportunity to figure out why that might be.
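
The log check described above can also be done from a shell on the server, if you have access to the transfer's directory. This is only a rough sketch: the function name and the example path are made up for illustration, and it assumes the Bulk Extractor reports are plain .txt files somewhere under the transfer's logs directory. It lists every report that is not empty, since empty ones would have been discarded anyway.

```python
from pathlib import Path


def nonempty_reports(logs_dir):
    """Return (relative path, size in bytes) for each non-empty .txt file under logs_dir.

    Hypothetical helper: the directory layout is an assumption, not
    something taken from Archivematica's documentation.
    """
    root = Path(logs_dir)
    found = []
    # rglob walks the whole tree, so nested report folders are included.
    for path in sorted(root.rglob("*.txt")):
        if path.is_file():
            size = path.stat().st_size
            if size > 0:
                found.append((path.relative_to(root).as_posix(), size))
    return found
```

Calling nonempty_reports("/path/to/transfer/logs") (a placeholder path) and getting an empty list back would be consistent with BE having run but written nothing.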

Sarah Romkey, MAS,MLIS
Archivematica Program Manager
Artefactual Systems
604-527-2056
@archivematica / @accesstomemory




Jarad Buckwold

Jul 26, 2018, 4:10:54 PM
to archivematica
Hi Sarah,

I looked through the logs and though it missed most of it (all CC numbers, names, and addresses), it did log domain names, email addresses (well, one), and zips, none of which were displayed in the appraisal tab. 

I then checked the gear icon as you suggested and did encounter what I have to assume is an error on both files:

Attempt to open /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/examineContentsChoice/pii_test-68eb7c29-02e0-494b-8a5c-aaee4625571b/objects/10-MB-Test.docx
Attempt to open /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/examineContentsChoice/pii_test-68eb7c29-02e0-494b-8a5c-aaee4625571b/objects/pii.txt

Not sure if that's an error that would cause my issue and, if so, what to do about it, but there it is.

Thanks,
Jarad

Sarah Romkey

Jul 26, 2018, 4:44:58 PM
to archiv...@googlegroups.com
Hi Jarad,

The appraisal tab only has an interface right now for showing results from the credit card number and PII scans (meaning SSNs; I don't know why the BE scanner is named that way), so it makes sense that you didn't see your hits from domains, etc.

I'm not sure those are actually errors. Archivematica's interface can be misleading sometimes about what is an error and what is just tool output (see: https://github.com/artefactual/archivematica/issues/1181). Is there any other output there that you can share?

Cheers,

Sarah

Sarah Romkey, MAS,MLIS
Archivematica Program Manager
Artefactual Systems
604-527-2056
@archivematica / @accesstomemory




Jarad Buckwold

Jul 26, 2018, 4:59:45 PM
to archivematica
Yeah, then that'd do it. I take it this feature is fairly new.

There isn't much else in terms of output. "STDOUT" just gives the MD5 of the "Disk Image", which I'm assuming is just the default vocabulary since these certainly aren't disk images, and then there's general log info (file names, UUIDs, start/end times, duration, etc.).

Obviously bulk extractor has some significant limitations in terms of what it can pick up. That's good to know! And good to know my instance of Archivematica isn't wonky.

Thanks for your help!

Jarad

Sarah Romkey

Jul 26, 2018, 6:45:03 PM
to archiv...@googlegroups.com
That's right, the feature is a couple of years old. In terms of which scanners are displayed in the Appraisal tab, I think the original thinking was we'd make it configurable so you could choose which you see (Max, does that sound familiar?), but other project priorities came first and it didn't make it in. That would still be an excellent enhancement, IMO!

I don't think there's anything wonky here. Having said that, (a) it's always a little hard to tell from afar, and (b) it's possible that Archivematica is using Bulk Extractor in a sub-optimal manner somehow. Maybe there are some flags or something that we could add to the command to make it work better. What I'd be interested to know is whether anyone has ever run the same content through both Archivematica's Examine Contents micro-service and Bulk Extractor locally, and whether they got the same results. Anyone have experience with that?
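
For anyone who wants to try that comparison, here is a minimal sketch of one way to diff two runs. It assumes the usual Bulk Extractor feature-file layout (comment lines starting with "#", then tab-separated offset, feature, and context); the function names are made up for illustration, and it compares only the feature values, not offsets or context.

```python
def parse_features(lines):
    """Collect the feature column from the lines of a feature file.

    Assumes tab-separated offset / feature / context, with '#' marking
    comment lines. Lines that do not fit that shape are skipped.
    """
    features = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 2:
            features.add(fields[1])
    return features


def compare_features(lines_a, lines_b):
    """Report which features appear in only one run, and which in both."""
    a = parse_features(lines_a)
    b = parse_features(lines_b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "both": sorted(a & b),
    }
```

In practice you would open the same report (say, the email or credit card report) from the Archivematica run and from the local run, and pass the two file objects in; opening them with encoding="utf-8-sig" and errors="replace" is a cautious choice in case of a byte-order mark or stray bytes.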

Cheers,

Sarah

Sarah Romkey, MAS,MLIS
Archivematica Program Manager
Artefactual Systems
604-527-2056
@archivematica / @accesstomemory




Max Eckard

Jul 27, 2018, 9:30:21 AM
to archiv...@googlegroups.com
Hi Sarah,

Yes, that sounds familiar! I think we may have also talked about having the ability to customize Archivematica's use of Bulk Extractor in the first place (or maybe that's just wishful retrospective thinking). Outside of Archivematica, for example, we have used more or fewer of Bulk Extractor's scanners and reports, most notably customized find lists, depending on what we're working with. So I agree that both would be excellent enhancements!

Thanks,
Max

P.S. At the moment (and correct me if I'm wrong), I don't even think that Archivematica's use of Bulk Extractor is configurable in the FPR like other types of tools (e.g., for normalization, extraction, etc.). We have been able to make edits directly to the command in the code, though: https://github.com/artefactual/archivematica/blob/0c748d9f448b8d18961fc8cb764c0149e56fd11b/src/MCPClient/lib/clientScripts/examineContents.py#L8-L11

Timothy Walsh

Aug 22, 2018, 10:56:53 AM
to archivematica
Chiming in here as it relates a bit to some of my work this summer - bulk_extractor is a very powerful tool, but also one that needs to be configured carefully to have it find what you're looking for. For instance, it offers three "ssn_mode" configuration options for matching SSNs, which use different regular expressions to match SSNs based on their label and syntax. Whether SSNs are found in particular files will also depend on whether the scanners necessary for properly reading through a given file format or decompressing compressed binary data are enabled. I'd recommend using bulk_extractor outside of Archivematica, reading the User Manual, and tinkering with configurations until you get the results you are expecting, at which point it might be interesting to look into how to bring some of those configuration options into Archivematica...
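
As a starting point for that kind of tinkering, here is a small sketch that only builds a command line, so you can inspect it before running anything. The helper name is hypothetical. The flags reflect my understanding of the bulk_extractor 1.x command line (-o for the output directory, -S name=value for scanner settings such as ssn_mode, -e and -x to enable or disable a scanner); please check them against `bulk_extractor -h` for your installed version. This is not the command Archivematica itself runs.

```python
def build_command(source, output_dir, ssn_mode=0, enable=(), disable=()):
    """Build an argument list for a local bulk_extractor run.

    Hypothetical helper; flag names are assumed from bulk_extractor 1.x
    and should be verified with `bulk_extractor -h`.
    """
    if ssn_mode not in (0, 1, 2):
        raise ValueError("ssn_mode must be 0, 1, or 2")
    cmd = ["bulk_extractor", "-o", str(output_dir), "-S", f"ssn_mode={ssn_mode}"]
    for name in enable:
        cmd += ["-e", name]   # turn an extra scanner on
    for name in disable:
        cmd += ["-x", name]   # turn a scanner off
    cmd.append(str(source))   # the file to scan goes last
    return cmd
```

The resulting list can be passed to subprocess.run once it looks right. Running the same file with each ssn_mode value in turn, into separate output directories, is one way to see which setting matches the SSNs in your test data.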

Tim