Readability of html

Joel Nation

Sep 22, 2008, 5:21:49 AM
to php-text-statistics
The problem with these readability scores is that they don't take into
consideration the way HTML works. For instance, you very rarely put a
full stop in a heading tag (e.g. <h1>Hello.</h1>), but leaving it out
affects most of the scores, as the heading's words are then added to
the next sentence, making it longer than it actually is. And with lots
of headings you are actually making the page more readable. Ditto for
lists, as (at least at my work) we don't generally put a full stop
after a list item. Running the scores on one of our pages
(http://www.accc.gov.au/content/index.phtml/itemId/815360) I initially
get a Flesch-Kincaid Grade Level of 24.9 if I run it over the entire
page, and 18.2 if I strip the HTML tags out. But I get a much better
score of 8.3 if I add full stops after the correct tags before
stripping the tags out. Of course, running it over just the content I
start with a reading level of 11, but I still end up with 8.3 once I
add the full stops. I would suggest that the code should be modified to
take this into consideration. It's only a few lines of extra code and
I'm happy to check in my changes to a branch if possible.
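
To illustrate the idea, here is a minimal Python sketch (not the
project's PHP code, and not the change that was eventually checked in).
It turns closing block tags into full stops before stripping the
markup, then computes the standard Flesch-Kincaid Grade Level. The tag
list, the function names, and the crude vowel-group syllable counter
are all assumptions made for the example.

```python
import re

# Assumed list of block-level tags whose text normally ends a sentence.
CLOSING_BLOCK = re.compile(
    r"</(?:h[1-6]|li|p|dt|dd|td|th|div|blockquote)\s*>", re.IGNORECASE
)


def strip_html(html, terminate_blocks=True):
    """Strip tags, optionally ending each block element with a full stop."""
    if terminate_blocks:
        html = CLOSING_BLOCK.sub(". ", html)
    text = re.sub(r"<[^>]+>", " ", html)       # drop the remaining tags
    text = re.sub(r"\s+", " ", text)           # collapse whitespace
    text = re.sub(r"\s+([.!?])", r"\1", text)  # no space before punctuation
    text = re.sub(r"([.!?])\.+", r"\1", text)  # "prices.." -> "prices."
    return text.strip()


def count_syllables(word):
    """Very rough estimate: one syllable per group of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """0.39 * words/sentences + 11.8 * syllables/words - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)
```

Calling flesch_kincaid_grade(strip_html(page)) and comparing it with
flesch_kincaid_grade(strip_html(page, terminate_blocks=False)) shows
the effect described above: the word and syllable counts are the same,
but headings and list items now count as sentences of their own.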

David Child

Sep 25, 2008, 2:17:32 PM
to php-text-...@googlegroups.com
Hi Joel,

Good points all. I've added you as a member to the project at
http://code.google.com/p/php-text-statistics/ - you should be able to
commit code now. Looking forward to seeing your additions!

Dave

--
AddedBytes.com - Web Marketing and Development

Joel Nation

Oct 7, 2008, 6:00:58 AM
to php-text-statistics
Okay, I've checked in my first changes. This covers all the HTML tags
we use at my work that should have a full stop in front of them. There
may be a couple of others, but this should cover the vast majority of
HTML use. I didn't use preg_replace, as I'm more comfortable outside
the world of regexps! I don't have PHP4, but I'll hopefully check in a
PHP4 version tomorrow. For that I'll have to use strtolower and then
just use str_replace. I don't have the PHPUnit framework at home so I
haven't checked in a test, but I have a test that I've been running
that I'll check in once I have the PHPUnit framework up and running.

I noticed you've added the dale_chall list to the wiki. Are you
planning to add the dale_chall function in? I've already written a
quick implementation for work and I can check that in also if you
want.

-Joel

David Child

Oct 7, 2008, 6:27:20 AM
to php-text-...@googlegroups.com
Hi Joel,

Great work. Will run tests against PHPUnit when at home later, but all
looks fine.

I've been working, sporadically, on a few of the other various
readability scores, including Spache and Dale-Chall. Would be great to
see what you've come up with for Dale-Chall so far.

Some of the readability scores are decidedly ropey, I've come to
realise. Certainly none seem to make use of the power of computers in
any meaningful way. Perhaps it's time to come up with a better
readability score?

Dave

Joel Nation

Oct 9, 2008, 2:53:37 AM
to php-text-statistics
I agree, they do look a little dodgy (especially when the numbers in
them look really random), but they have all been tested and their
relative effectiveness has been ranked (Dale-Chall being the best one I
know of). I think we can use computers in another way - to suggest how
to improve the text. Since most of the scores rely on sentence length,
the easiest thing to do is provide a way for the code to highlight your
longest sentences. I'm doing this at work at the moment, but it's
actually not terribly useful, as it's hard to determine what effect
shortening a sentence will have. A smarter way would be for the system
to analyse your longest sentences (say the top 10) and then determine
how much effect it would have on the reading level if you halved the
top few (since on average you're probably going to split a sentence in
half). As it goes down the list, the effect on the reading level would
reduce, and you could stop when a split wasn't reducing it by more
than a certain amount (say half a grade point).
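
A rough Python sketch of that walk down the list, assuming the
Flesch-Kincaid Grade Level as the score. The function names, the
top_n and min_gain parameters, and the idea of passing in word counts
per sentence are assumptions for illustration only.

```python
def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59


def suggest_splits(sentence_lengths, syllables, top_n=10, min_gain=0.5):
    """Return (sentence_index, grade_after_split) for each worthwhile split.

    sentence_lengths holds the word count of every sentence and syllables
    is the total syllable count of the text. Splitting a sentence in half
    keeps the words and syllables but adds one sentence.
    """
    words = sum(sentence_lengths)
    count = len(sentence_lengths)
    grade = fk_grade(words, count, syllables)
    # Indexes of the longest sentences, longest first.
    longest = sorted(range(count), key=lambda i: sentence_lengths[i],
                     reverse=True)[:top_n]
    suggestions = []
    for index in longest:
        count += 1
        new_grade = fk_grade(words, count, syllables)
        if grade - new_grade < min_gain:
            break  # further splits gain less than the threshold
        suggestions.append((index, new_grade))
        grade = new_grade
    return suggestions
```

One thing the sketch makes visible: with Flesch-Kincaid the drop from a
split depends only on the new sentence count, not on which sentence is
split, so ranking by length mainly helps pick the sentences that can
realistically be cut in two.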

Another way we could improve it is to use the Dale-Chall common word
list to highlight the 'complex' words (words not in the common list)
and then suggest alternatives that are in the Dale-Chall list. The
problem here is that you have to
- somehow work out synonyms (maybe a mash-up with an online
  thesaurus)
- determine which words it makes sense to replace with common ones (as
  some words are unavoidable - proper nouns, domain terms, etc.)
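
As a sketch of how that could look in Python: the word list below is a
tiny stand-in for the real Dale-Chall list, the synonym table is a
hypothetical placeholder for whatever thesaurus lookup gets used, and
the allowed set is one assumed way of exempting proper nouns and domain
terms.

```python
import re

# Tiny stand-in for the Dale-Chall common word list (assumption).
COMMON_WORDS = {"a", "the", "you", "must", "can", "buy", "use", "help",
                "goods", "before", "them", "to"}
# Hypothetical synonym table; a real tool might query a thesaurus instead.
SYNONYMS = {"purchase": ["buy", "acquire"], "utilise": ["use"],
            "assist": ["help"]}
# Words that are unavoidable: proper nouns, domain terms and so on.
ALLOWED = {"accc"}


def flag_complex_words(text, common=COMMON_WORDS, synonyms=SYNONYMS,
                       allowed=ALLOWED):
    """Map each word missing from the common list to its common synonyms."""
    flagged = {}
    for word in re.findall(r"[A-Za-z']+", text):
        key = word.lower()
        if key in common or key in allowed or key in flagged:
            continue
        # Only suggest alternatives that are themselves on the common list.
        flagged[key] = [s for s in synonyms.get(key, []) if s in common]
    return flagged
```

A word that comes back with an empty list is complex but has no known
common alternative, so it can be highlighted without a suggestion.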

David Child

Oct 11, 2008, 9:00:07 AM
to php-text-...@googlegroups.com
I've run the unit tests and your changes work fine on the test text -
great stuff, Joel.

It would be useful to have some test HTML to run unit tests against.
I'll start putting some together. I'm also trying to sort out the
Dale-Chall and Spache unit tests. I've added the word lists to the
repository already so others can have a play with them if they want.

Dave

David Child

Oct 11, 2008, 9:04:38 AM
to php-text-...@googlegroups.com
And I completely forgot to actually reply to the bulk of your message,
Joel. Sorry about that - was distracted by bacon :)

I think that's a great idea - identifying places for improvements, and
highlighting difficult words, would make a really useful tool. Making
blanket suggestions is a good start, and a synonym mashup would be
very cool.

Also, is it my imagination or do none of the readability scores take
account of commas and semi-colons in text? Surely their addition can
make text far more easily readable?

Dave
