Script available to scrub HTML


Sarah Romkey

Jul 16, 2015, 6:53:12 PM
to ica-ato...@googlegroups.com
Dear colleagues,

In our release announcement for AtoM 2.2, we noted that AtoM 2.2 escapes HTML for security purposes. In layperson's terms, this means that if you have HTML tags in your description fields, as of AtoM 2.2 they will display as plain text. If you have HTML tags in your descriptions, you may want to run this script to remove them.
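To make the effect concrete, here is a minimal Python sketch of the two behaviours described above: escaping (tags show up as literal text) and scrubbing (tags are removed). It is an illustration only, not the AtoM script, which is written in PHP; the `TagStripper` class, the `strip_tags` helper and the sample description are all made up for this example, and unlike a real migration this sketch simply discards link targets.

```python
import html
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text content and discards every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are ignored.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


def strip_tags(value):
    """Return value with all HTML tags removed (link URLs are dropped too)."""
    stripper = TagStripper()
    stripper.feed(value)
    stripper.close()
    return stripper.text()


# Made-up description field containing formatting and a link.
description = 'Fonds of <b>Jane Doe</b>, see <a href="http://example.com">finding aid</a>.'

# An escaping output layer turns markup characters into entities,
# so the browser shows the tags as plain text instead of rendering them.
escaped = html.escape(description)

# A scrub step leaves only the text, without the markup.
scrubbed = strip_tags(description)

print(escaped)
print(scrubbed)
```

Running the scrub once against stored data is a one-way change, which is why the backup step below matters.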

Note: The script is written to work with 2.2. The assumption therefore is that you have completed your upgrade to 2.2 when you run this script. There will also be a script available for the 2.3 release when that time comes.

The first and very important step:

1) Back up your AtoM database before running the script. Information on doing this can be found here:


2) On the command line, change to the AtoM root directory:

  $ cd <AtoM root directory>

3) Download the HTML translation script:

  $ curl -L http://tinyurl.com/pkf3p27 > remove-html.php

4) Run the script:

  $ php symfony tools:run remove-html.php

Questions? Concerns? Let us know!

Cheers,

Sarah

Sarah Romkey, MAS,MLIS
Systems Archivist
Artefactual Systems
604-527-2056
@archivematica / @accesstomemory


L Snider

Jul 16, 2015, 7:23:12 PM
to ica-ato...@googlegroups.com
Hi Sarah,

Is the online demo this version?

Cheers

Lisa

--
You received this message because you are subscribed to the Google Groups "ICA-AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-user...@googlegroups.com.
To post to this group, send email to ica-ato...@googlegroups.com.
Visit this group at http://groups.google.com/group/ica-atom-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/CAAr2QtsEhyTEYH-k49WAZ-ys2x6x1yQTEhDsxkNwc2PA9feHpA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Sarah Romkey

Jul 16, 2015, 7:51:55 PM
to ica-ato...@googlegroups.com
We haven't yet had a chance to update the online demo; it's still on 2.1.2. We'll let you know via the user forum when it's been updated.

Cheers,

Sarah

Sarah Romkey, MAS,MLIS
Systems Archivist
Artefactual Systems
604-527-2056
@archivematica / @accesstomemory



L Snider

Aug 6, 2015, 12:30:49 PM
to ica-ato...@googlegroups.com
Hi Sarah,

Any update on this one being available in the sandbox yet?

Cheers

Lisa

Sarah Romkey

Aug 6, 2015, 12:56:57 PM
to ica-ato...@googlegroups.com
Hello Lisa and all,

Since we plan on doing a 2.2.1 release (more details here: https://groups.google.com/d/msg/ica-atom-users/oQGsOO4ZJLo/iFSANCseBQAJ), updating the demo site will come sometime after that release. I would suggest keeping an eye on the user forum for the release announcement; you can expect to see the demo updated some time after that (we will remember to post in the forum when the demo site is upgraded).

Cheers,

Sarah

Sarah Romkey, MAS,MLIS
Systems Archivist
Artefactual Systems
604-527-2056
@archivematica / @accesstomemory



Myron Groover

Dec 8, 2015, 11:08:11 AM
to ICA-AtoM Users
We've been working for the better part of the year on transitioning to AtoM, working in 2.1 but planning to upgrade to the latest version shortly before launch.

Our descriptive practice makes very heavy use of HTML tags and of HTML linking in general within the fields of our archival descriptions, so when we tried a 2.2 testbed with our database we immediately noticed that our descriptions were badly broken.

Is there any way to replicate the previous functionality of HTML tags within descriptive elements, in particular as pertains to:

1) link display (as in <a href="http://example.htm">example link</a>)
2) text formatting (bold, italic, etc.)

Would really appreciate any help with this; this is a reasonably major launch hiccup for us. Thanks a lot.

--
Myron C. Groover, MA (Hons), MAS, MLIS
Archives and Rare Books Librarian

McMaster University Library
1280 Main Street W.

Hamilton, ON L8S 4L6

905-525-9140 (ext. 22790)

Dan Gillean

Dec 8, 2015, 12:47:54 PM
to ICA-AtoM Users
Hi Myron,

There is good news on the way, though it may mean being patient and waiting for the 2.2.1 release. Please see the following post, made yesterday, where I outline the options for a custom linking markup that is coming to AtoM:

The even better news is that it does sound likely, given the feedback we've received about this change, that we will be backporting the custom linking markup syntax to stable/2.2.x and including it in the 2.2.1 release. The HTML scrub script is also being backported (so you don't have to download it as a patch to use it in 2.2.1), and it has already been updated so that, when run, it will replace HTML links with the new syntax, instead of simply stripping out the HTML and leaving only the raw link.

We still don't have a firm release date for 2.2.1, but given the rapidly-approaching holiday season, I expect that it will be available in early 2016, with AtoM 2.3 following a couple months later. Because I know you have a developer working with you, however, I can let you know when we've backported the linking markup syntax to stable/2.2.x, and you can then do a pull against our GitHub repository if you don't want to wait for us to package a tarball release.

Unfortunately, this solution does not yet include an alternative syntax for styling elements, such as bolding and italics. We would definitely like to see something like this incorporated, but rather than adding further custom markup solutions, our developers would prefer to implement a full markup or markdown library - something like Parsedown, for example. Doing so would require sponsored development for us to be able to undertake... which is where I insert my usual aviso:

If your institution is interested in sponsoring such a development and would like an estimate from Artefactual, please feel free to contact me off-list. Any development sponsored by our community is then incorporated into the next public release, so the entire community benefits from your contributions, and you benefit from the contributions of others. Alternatively, if your developers might be interested in taking on this work, please consider having them post in the user forum, and contribute the work back to the public project as a pull request. Our developers can offer implementation suggestions, and by sharing your work back, Artefactual takes on the maintenance of the feature through successive public releases, so you don't have to maintain it locally. We also have a few developer resources on our wiki; our developers can supplement these initial guidelines via conversations on the user forum.

Let me know if you have any questions.

Cheers,


Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory


Dan Gillean

Dec 11, 2015, 6:28:09 PM
to ICA-AtoM Users
Hi again Myron,

Just wanted to let you know that the backporting has been done - I've tested it in my local VM, and everything worked as expected! Feel free to check out the most recent code from stable/2.2.x in our GitHub repository, or you can wait until we package up the 2.2.1 release - likely in early 2016 at this point.

Regards,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

Clara Rosales

May 22, 2017, 5:47:52 AM
to ica-ato...@googlegroups.com
We recently upgraded an archive from ICA-AtoM 1.3.2 to AtoM 2.3.1.
We tested the script Sarah provided and found that it works perfectly in AtoM 2.3.1. The only limitation is that it covers archival descriptions only. In case it is of interest, the attached file contains scripts for actors, repositories, notes and rights.
Regards!

clean html.rar

Dan Gillean

May 22, 2017, 12:37:12 PM
to ICA-AtoM Users
Hello Clara,

Thank you so much for sharing this with the AtoM community!

I have added a section to our community resources, here: https://wiki.accesstomemory.org/Community/Community_resources/Development#HTML_scrub_scripts_for_other_entities

However, I have realized I don't know what institution you work for - who should I credit for this work?

Regards,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory


Darryl Friesen

May 24, 2017, 12:58:46 PM
to AtoM Users
The timing of Clara's post couldn't have been better!  We recently upgraded our AtoM instances from 1.3.x to 2.4 and noticed some "peculiarities" in the display.  Dan's link to the i18n:remove-html-tags command and Clara's additional scripts proved to be super helpful!  Thanks to both of you!

Our cleanup needs went beyond HTML tag removal however, as quite a few of our records had HTML entity issues -- characters like &, ç and î existed in the database as &amp; &ccedil; &icirc; etc.  Unlike AtoM 1.x, AtoM 2.x double-encodes these entities, so rather than displaying the desired character in the browser, the entity name itself displays, like this:

    Beardy ** [k&acirc;-m&icirc;yastow&ccedil;sit

instead of:

    Beardy ** [kâ-mîyastowçsit
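The double-encoding described above can be reproduced with a minimal Python sketch. This is an illustration of the general mechanism using Python's standard `html` module, not AtoM's PHP code; the variable names are made up, and the assumption is that the output layer escapes the stored text once and the browser decodes it once.

```python
import html

# Value as it sits in the database after a 1.x-era import (entities stored literally).
stored = "Beardy ** [k&acirc;-m&icirc;yastow&ccedil;sit"

# An escaping output layer treats each '&' as a literal character and encodes it again.
sent_to_browser = html.escape(stored)

# The browser decodes one level only, so the reader sees the entity names.
displayed = html.unescape(sent_to_browser)

# Decoding the stored value once, before it is saved back, avoids the problem.
cleaned = html.unescape(stored)
displayed_after_cleanup = html.unescape(html.escape(cleaned))

print(sent_to_browser)
print(displayed)
print(displayed_after_cleanup)
```

The design point is that the fix belongs in the data, not the display: once the stored text holds the real characters, escaping on output leaves them untouched.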

Using Clara's submitted scripts, and a few additional tweaks for the entity encoding, I was able to modify AtoM's i18n:remove-html-tags command to remove HTML tags from additional tables, and to also translate the HTML entities back into their non-entity equivalents.  I've committed my changes to a forked version of AtoM qa/2.4 if anyone wants to have a look:


Before I submit this as a pull request back to Artefactual, I thought I'd seek out Dan's opinion (or that of anyone else at Artefactual) on whether I should, and check with Clara that she feels this is alright for me to do (as a good chunk of the code was submitted by her).


- Darryl


Dan Gillean

May 24, 2017, 5:53:39 PM
to ICA-AtoM Users
Hi Darryl,

Great work! I personally think that this sounds like a great enhancement to the existing command-line task. If Clara agrees, we would be happy to review that as a pull request for inclusion in an upcoming release!

Regards,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory


Clara Rosales

May 29, 2017, 7:48:20 AM
to ica-ato...@googlegroups.com
Ohhh, Darryl!
It's perfect!
Cheers


Darryl Friesen

Jun 5, 2017, 11:35:27 AM
to AtoM Users
Just to finish off this thread:

I took Clara's changes to the original Artefactual scripts, added a bit of additional checking for HTML entities, and updated the AtoM code that runs as part of the "php symfony i18n:remove-html-tags" command line task.  I submitted these changes back to Artefactual and they have been accepted into the 2.4 code branch.

- Darryl


Dan Gillean

Jun 5, 2017, 11:44:54 AM
to ICA-AtoM Users
Thank you to Clara and to Darryl for modeling such excellent community collaboration!

Here is the related issue ticket: https://projects.artefactual.com/issues/11207
...and the related pull request: https://github.com/artefactual/atom/pull/568

I also added it to the Roadmap page for the 2.4 release!

https://wiki.accesstomemory.org/Releases/Roadmap#Command-line_tools

Regards,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory


Clara Rosales

Jun 6, 2017, 4:17:36 AM
to ica-ato...@googlegroups.com
Thank you for your perfect work, Darryl and Dan!!!!!
