Invalid Tag Names


Robert Sanders

Jul 25, 2012, 4:21:41 PM
to tagsoup...@googlegroups.com
Hi,

I've been tasked with converting some legacy HTML into an XML format for import into a CMS. The problem is that some metadata is encoded in non-compliant "tags" left over from whatever was used to generate the HTML in the first place (and whoever was doing that is long gone). And when I say "non-compliant" I don't mean the "bogons"; I mean stuff like:

<body>
  <!!uid 5usc500>
  <h2>This is the page title</h2>


So, is there any relatively simple way to transform the <!! type tags into something like an element, processing instruction, or comment? If not, I'll probably end up just scanning for them with a regexp and then parsing the rest of the HTML using TagSoup.
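
For reference, here is a minimal sketch of the regexp fallback, using only java.util.regex. It assumes every such tag has the shape "<!!", a name, an optional value, then ">", which is a guess based on the one sample above. The class name BangTagRewriter and the method toComments are hypothetical, and the "name: value" comment layout is just one possible target. The rewritten string would then be handed to TagSoup as usual (not shown here).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BangTagRewriter {
    // Assumed tag shape: "<!!", a name, an optional value, then ">".
    private static final Pattern BANG_TAG =
        Pattern.compile("<!!\\s*([A-Za-z][\\w-]*)\\s*([^>]*?)\\s*>");

    /** Rewrites each <!!name value> tag as an ordinary HTML comment. */
    public static String toComments(String html) {
        Matcher m = BANG_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            // "--" is not allowed inside a comment, so break it up.
            String value = m.group(2).replace("--", "- -");
            String comment = "<!-- " + name + ": " + value + " -->";
            // quoteReplacement keeps any '$' or '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(comment));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<body>\n  <!!uid 5usc500>\n  <h2>This is the page title</h2>";
        System.out.println(toComments(html));
    }
}
```

Changing the replacement string would produce an element or a processing instruction instead of a comment. If the metadata has to survive into the XML output as comments, the downstream handler needs to receive comment events.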
