Excel quote marks in TXT files

956 views
Skip to first unread message

Ross Higashi

unread,
Apr 1, 2011, 3:43:56 PM4/1/11
to ctat-...@googlegroups.com
Not a strictly CTAT-related question, but we're encountering a very
annoying issue in our mass production process where Excel will
mysteriously enclose SOME text fields in "quotation marks" when saving
as tab-delimited text, while other seemingly identical text fields are
left (correctly) as-is with no surrounding quotes. They're all
General-formatted cells in Excel 2010, being converted to tab-delimited
text using Save As.

I'm currently doing manual cleanup on the tab-delimited .TXT files, but
this is obviously not a good long-term solution in a mass production
process. Has anyone else encountered issues like this, or might happen
to know a good workaround out of general Excel knowledge?

Thanks,
Ross Higashi
Robotics Academy

Ross Higashi

unread,
Apr 1, 2011, 4:32:37 PM4/1/11
to ctat-...@googlegroups.com
Following up on this issue, it's at least not arbitrary which fields get the surrounding quotes... according to Microsoft, it's anything containing a "comma, quotation mark, or other character that could be misinterpreted when you reopen the file." Unfortunately, the fields with this issue are question prompts, and so will sometimes contain punctuation. A related behavior of Excel's is to convert all " that ACTUALLY exist in the text string into "" (two quote marks).

I believe that due to the structure of the tab-delimited files expected by CTAT, the only string modifications that would be needed to neutralize the issue would be:
  • Convert all strings consisting of <tab>" to just <tab> (i.e. strip any quotation mark that follows a tab)
  • Convert all strings consisting of "<tab> to just <tab> (i.e. strip any quotation mark preceding a tab)
  • Convert all strings consisting of "<crlf> to just <crlf> (i.e. strip any quotation mark preceding a line break)
  • Convert all strings consisting of "" to " (i.e. return any double-double-quotes to a regular double quote mark)
This should still permit (and in fact, fix) user strings that deliberately start with a ".

Given Excel's popularity, it might even be nice for end users if these conversions could be built into CTAT's merge function some day (since Excel has had this weird behavior since 2000 and seems unlikely to change). In the mean time, I guess I'll have to fall back on small-scale solutions for the time being.

Thanks,
Ross

Jonathan Sewall

unread,
Apr 5, 2011, 10:29:31 AM4/5/11
to ctat-...@googlegroups.com, Ross Higashi
Thanks very much, Ross.  To be sure I get it, you mean that CTAT should apply the changes you list below as it reads a .txt file, on the assumption that it's been massaged in Excel.  I suppose we could ask about Excel use when we prompt the author for the merge file names, to handle the cases where that might not be true.  One other tweak, my recollection is that Excel will insert a bare line feed (lf, not crlf) in a quoted cell value to indicate an embedded newline.  Here's an example of a file with one row and 4 columns, exposed with the Unix "cat -tev" tool, which shows tabs as ^I and line feeds as $ (followed by a new line) and CRs as ^M:
# cat -tev c:/Temp/Book1.txt
"line$
feed"^I"embed""quote"^I"""leadqu"^I"trailqu"""^M$
--
You received this message because you are subscribed to the Google Groups "CTAT Users" group.
To post to this group, send email to ctat-...@googlegroups.com.
To unsubscribe from this group, send email to ctat-users+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/ctat-users?hl=en.

Ross Higashi

unread,
Apr 5, 2011, 11:04:36 AM4/5/11
to sew...@cs.cmu.edu, ctat-...@googlegroups.com
Assuming there's not a more clever way of detecting Excel's fingerprints, then a checkbox somewhere would be nice. I would guess that a large portion of potential CTAT content creators would use Excel, due to its popularity at academic institutions.

Line break characters are a debate all on their own, but I guess in the end what matters is that CTAT's CommTextArea and CommHintWindow, Flash's dynamic text object, and Java's various widgets all render them the same way on all operating systems. Ideally, that common ground would preserve manual line breaks in text.

Thanks,
Ross
Reply all
Reply to author
Forward
0 new messages