Hi Nathan,
>
> I'm in need of being able to render inline italics and bold
> formatting.
>
> The readme at http://www.lexicon.net/sjmachin/xlrd.html mentions it
> not being available for xlrd 0.6.1 in the "Formatting features not
> included in xlrd version 0.6.1" section. Is this the case for 0.7.1
> also? I assume it is because I can't seem to figure out how to do it.
It is not included. You are the first to ask, IIRC.
> If this is indeed not possible right now, how much would it take to
> add this functionality? I would be willing to contribute if it seemed
> like a worthy undertaking.
Let's start by roughing out the API:
Sheet object: would need a dictionary rich_text_sstx_map which would, for
each cell that had rich text, map (rowx, colx) to SST (shared string
table) index.
Book object: would need a dictionary rich_text_runlist_map which would,
for each string in the SST that had rich text, map SST index to the
corresponding list of formatting runs.
What is a "list of formatting runs"? Here's an example: Suppose you had a
string "plainbolditalic" formatted as the contents suggest.
plainbolditalic
012345678901234
The list of formatting runs would be:
[
(5, # offset where bold formatting starts
a_non_negative_integer), # an index into Book.font_list
(9, # offset where italic formatting starts
a_non_negative_integer), # an index into Book.font_list
]
Note that a Font object has many attributes besides bold and italic ...
what you do with these is up to you.
So for a given cell, you'd need something like this:
sstx = sheet.rich_text_sstx_map.get((rowx, colx))
if sstx is not None:
run_list = book.rich_text_runlist_map[sstx]
for offset, fontx in runlist:
font = book.font_list[fontx]
render()
Note that if (as in the example) the first run doesn't start with offset
0, you would need to fill in the gap using the "normal" font for the cell:
xf_index = sheet.xf_index(rowx, colx)
xf = book.xf_list[xf_index]
run_list[0:0] = (0, xf.font_index)
You might find it easier to work with the run_list if you appended
(len(the_text), None) to it ...
Comments, please.
Cheers,
John
Great! Thanks for the head start.
Is the xlrd source anywhere accessible so I could maybe do a branch for this?
I think what you're saying makes sense. That was kind of how I assumed
it would end up working since that is congruent with the way the rest
of the package works with fonts, formats, etc.
What was odd to me was when I was stepping through the code, I
couldn't seem to find the place where those styles were getting
stripped out. But I didn't dig too deep into it all. Would this just
be a different record code that is being used for inline fonts?
Should we continue discussing this on the list from here on out or
should I email you personally if I have further questions?
Thanks,
Nathan
> --
> You received this message because you are subscribed to the Google Groups "python-excel" group.
> To post to this group, send an email to python...@googlegroups.com.
> To unsubscribe from this group, send email to python-excel...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/python-excel?hl=en-GB.
>
>
https://secure.simplistix.co.uk/svn/xlrd/trunk/
...but no, I'm afraid you're not getting commit rights at this stage.
Please work up a patch against the trunk.
> What was odd to me was when I was stepping through the code, I
> couldn't seem to find the place where those styles were getting
> stripped out.
It's not a case of stripping anything out, it's a case of writing new
code to parse record types that aren't currently parsed.
> Should we continue discussing this on the list from here on out or
> should I email you personally if I have further questions?
On list please!
Chris
--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk
It's not a head start. Discussion of the API and agreement of what the
product is to be are essential prerequisites.
Other requirements:
1. The additional data structures are to be built only if formatting_info
is true.
2. The code must run on Python 2.1 to 2.7 inclusive.
3. No tabs, no camelCase, PEP8 rules about spaces next to punctuation, ...
Hi. Sorry about the previous message -- I hit send by accident. Ignore it,
its content is repeated here.
>
> Great! Thanks for the head start.
>
It's not a head start. Discussion of the API and agreement of what the
product is to be are essential prerequisites.
Other requirements:
1. The additional data structures are to be built only if formatting_info
is true.
2. The code must run on Python 2.1 to 2.7 inclusive.
3. No tabs, no camelCase, PEP8 rules about spaces next to punctuation, ...
> Is the xlrd source anywhere accessible so I could maybe do a branch for
> this?
See Chris's answer :-)
> What was odd to me was when I was stepping through the code, I
> couldn't seem to find the place where those styles were getting
> stripped out. But I didn't dig too deep into it all. Would this just
> be a different record code that is being used for inline fonts?
No. Where it's done is in __init__.py in the function unpack_SST_table,
which is called by the Book method handle_sst ...
Please note that this is critical code, and is complicated by the need to
handle the SST being split over an SST record and 0 or more CONTINUE
records (a legacy of the days when 640Kb of memory was all anyone would
ever need) and the arcane rules about what data can be split where. Care
in coding and lots of testing on both real-world files and try-to-break-it
files are indicated.
You will note that there is a mild kludge with the variable rtsz in that
it holds the size of the rich text runs being skipped AND the size of the
phonetic (Furigana etc) stuff being skipped. You will need to unwind the
kludge by introducing a separate phosz variable.
> Should we continue discussing this on the list from here on out or
> should I email you personally if I have further questions?
On-list, please.
Cheers,
John
Cheers,
John
Yes. I've been thinking about it.
Cheers,
John
> Hello,
>
> Alright, I've got some time to put into this. The hardest part is just
> wrapping your head about the weird way the data is stored.
>
> I have it working now basically how you described earlier. I figured we'd
> get something to work with before we spent too much time finalizing the api.
The contents and semantics of the sheet-level data structure ARE the
API, hence the request for comments rather than code.
Cheers,
John
However this means that it needs to be renamed from rich_text_sstx_map
to rich_text_runlist_map.
I'm strongly suggesting that both the sheet-level mapping and the book-
level mapping are sparse
A. Needs in-code documentation of Sheet.rich_text_runlist_map. Should
point out that it's a sparse mapping. Explain runlist.
B. Book.rich_text_runlist_map (1) needs a _ at the start of the name
(2) could usefully be deleted along with Book._sharedstrings when that
is deleted in some circumstances ... look for "del
self._sharedstrings" in __init__.py
C. Patch has extraneous empty lines and trailing spaces is a few cases
D. Please don't invent new jargon like "inlineStyles" when it's of
dubious interpretation (they're inline font objects, not styles) and
there is existing jargon ("rich text") anyway.
E. Unit tests are fine as far as they go. We also need some code to do
a thrash test that has a high chance of detecting problems with
continue records. This will be aided by the newly-in-svn xlwt code for
writing rich text. I would also like to see a round-trip check of rich
text driven by a text file containing test specs in xlwt.easyxf
format, to be used to create an xls file with xlwt, which can then be
read back using xlrd ... this would need a converter from xlrd.Font
object to xlwt.easyxf text. Any volunteers to help with this?
So I see. I'm sorry that I was not more explictit. In response to your
mentioning an HTML renderer, I said "Let's get this show on the road
first :-)", "this show" meaning the basic rich text functionality.
That was not intended to be an invitation to include code for
including your renderer into xlrd in the patch, and I find it
difficult to imagine how you could construe it otherwise.
At this stage, runlist should be either None or a list (or maybe a
tuple); if a list etc, it should not be empty (sparse, remember) but
in any case "not []" is treated as true; I can't imagine a case where
"or len(runlist) == 0" would be required. Secondly, the "value"
should be formatted according to the "cellfont"; there's no guarantee
that "cellfont" doesn't require any markup.
Cheers,
John
Hi Nathan, I do hope that pointing out a bug or two wasn't a waste of my
time :-)
>
> At this stage, runlist should be either None or a list (or maybe a
>
> tuple); if a list etc, it should not be empty (sparse, remember) but
>
> in any case "not []" is treated as true; I can't imagine a case where
>
> "or len(runlist) == 0" would be required. Secondly, the "value"
>
> should be formatted according to the "cellfont"; there's no guarantee
>
> that "cellfont" doesn't require any markup.
>
> Just an artifact of when there were empty lists in...
Sorry, I don't understand; an empty list doesn't make a difference; "or
len(runlist) == 0" is redundant in any case:
| >>> same = lambda rl: (not rl or len(rl) == 0) == (not rl)
| >>> map(same, [None, [], [(9, 99)]])
| [True, True, True]
I was however wrong about this: """Secondly, the "value"
should be formatted according to the "cellfont"; there's no guarantee
that "cellfont" doesn't require any markup.""" --- I ignored this
shapeshifter:
value = to_html(value, cellfont)
However there's still a bug lurking there; if the "cellfont" includes an
attribute requiring markup, value will become e.g. u"<i>foo</>". Then, if
there is rich text and the runlist doesn't start at offset 0, it starts
operating on the marked-up value when it should work on the raw value:
val = value[:runlist[0][0]]
> Again, thanks for the help in putting this feature into action and I
> appreciate all the work you do for the library--it's a great help to many
> people.
And many thanks to you too.
Progress report:
Changes to xlrd put into working copy. Passes elementary tests.
Efficiency improvement in LABELSST code (just about every text cell in an
Excel 97-2003 file goes through there).
Rounded out the functionality offering by adding support for rich text in
Excel 5.0/95 files (no SST; need to slurp rt info out of the end of
RSTRING records); Excel 2003 files saved as 5.0/95 deliver the same rt
results (limits: max offset 255, max font index 255).
By the way, your test file richtext.xls has a problem: cell A15 contains a
41590-character string. This is OK for OOo Calc and gnumeric but the max
for Excel 2003 and 2007 is 32767. When file is opened, Excel goes into
"recovery" mode which blows away ALL the strings from the SST, leaving
only the number cells. Worse: it doesn't even say what the problem was.
xlwt already (since a month ago) restricts strings to 32767 characters
(was 65535 limit because the length is an unsigned int). I'll add a
warning to xlrd.
I'm making good progress with a combination thrash test and round trip
test. My idea of a nasty file: 384 different fonts {Excel 2003 maxes out
at "about 400" fonts, 100 rows each with a 32767-character string each
with a 32767-element runlist. This is round-tripping OK with xlrd. Excel
is happy enough with this. If you select all cells and change the font
height, it blinks and flashes for say 30 seconds. That's relatively OK.
Trying to open a file with only 8191-byte strings, OOo Calc goes into
lala-land; after about 20 MINUTES of watching it grab extra memory very
slowly in 4Kb chunks, I killed it. Haven't tried Gnumeric yet. I'll do
some more mucking about (this is fun!) and publish the script. Note that a
32K-char unicode string takes 64KB and then you need to add 4 x 32Kb for
the runlist; each CONTINUE record is only about 8200 bytes. So it looks
like we don't have any problems with SST data being split over CONTINUE
records.
Cheers,
John