
Using Mathematica to read a website


kevin

Jun 13, 2010, 4:13:09 AM
Hi Guys,

Is there any way to use Mathematica to read all the words of a
website, say www.bloomberg.com? Thanks in advance.

Best,
Kevin

Christopher Arthur

Jun 13, 2010, 6:52:46 PM

Look at the Miscellaneous`Dictionary and Web Services packages for easy
ways; otherwise it depends on how nice the website is.
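
For a quick first try, plain Import might already do - a minimal
sketch, assuming the page yields usable text via the "Plaintext"
element:

(* grab the visible text of the page and split it on whitespace *)
text = Import["http://www.bloomberg.com", "Plaintext"];
words = StringSplit[text]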


telefunkenvf14

Jun 13, 2010, 6:53:50 PM
On Jun 13, 3:13 am, kevin <kevin999ko...@gmail.com> wrote:
> Hi Guys,
>
> Is there any way to use Mathematica to read all the words of a
> website, say www.bloomberg.com? Thanks in advance.
>
> Best,
> Kevin

I'm not experienced with this kind of thing, but for the case at hand
I think you'll have to Import[] the page as "Source" and then parse
the source code for what you want.

For a specific example, I'll start on the Technology summary page,
with the eventual goal of grabbing the title of each article, the url,
and the corresponding short summary:

raw = Import[
  "http://www.bloomberg.com/news/industries/technology.html",
  "Source"]

After looking at the raw source code in Firefox, I think this will get
us close to grabbing some relevant info:

StringPosition[ToString[raw],
 "<a class=\"summheadline\"" ~~ __ ~~ " </p>"]

Then use StringTake[] to grab the text at those positions:

raw2 = StringTake[raw, %]

My inexperience in parsing strings stops me here... Hopefully someone
else can chime in. I'd like to learn.
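
As a guess at the next step (an untested sketch - it assumes the
"summheadline" anchors look like the snippet above, with class before
href):

(* pull out {url, title} pairs from each headline anchor *)
StringCases[ToString[raw],
 ShortestMatch[
   "<a class=\"summheadline\" href=\"" ~~ url__ ~~ "\"" ~~ ___ ~~
    ">" ~~ title__ ~~ "</a>"] :> {url, title}]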

-RG

Hans Michel

Jun 15, 2010, 2:30:42 AM
Try

ReadList[StringToStream[
StringReplace[Import["http://www.bloomberg.com/", "Source"],
RegularExpression["<(.|\\n)*?>"] -> " "]], Word,
WordSeparators -> {" ", "\t", "\n"}]

This particular page does not parse well as Plaintext. Even the XMLObject is
missing the body element, so Data, FullData, Hyperlinks, and Plaintext come
back blank or empty.

In[1]:= Import["http://www.bloomberg.com/","Elements"]
Out[1]= {Data,FullData,Hyperlinks,Plaintext,Source,Title,XMLObject}

So read the source, strip the tags with some brute-force regex, and stream
and parse the result into a list of words.
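
To turn that raw list into word frequencies, a quick follow-up sketch
(words stands for the output of the ReadList above):

words = %;  (* the ReadList output *)
(* most frequent words first *)
Take[Reverse[SortBy[Tally[ToLowerCase /@ words], Last]], 25]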

Hans
"kevin" <kevin9...@gmail.com> wrote in message
news:hv23ul$5hl$1...@smc.vnet.net...

Leonid Shifrin

Jun 15, 2010, 2:31:07 AM
Hi Kevin,

I think the notion of "words" is not sharply defined in this context,
since html is not plain text - you have words in the main text, in links,
in image descriptions, and in many other places.

I wrote a simplistic HTML parser a couple of years ago for my own
purposes. I intended (and still intend) to polish it, convert it into one
or several packages, and make it available to anybody interested, but I
never had the time. Your post prompted me to speed things up, and as a
first step here is the code as it stands - maybe you or someone else will
find it of some use, since it will still take me a while to bring it to
proper package format.

-------------------------------------------------------------------------------------

ClearAll[listSplit, reconstructIntervals, groupElements,
groupPositions, processPosList, groupElementsNested, getAllUsedTags,
refineTags, getTagTitle, getPairedTags, makeTagReplaceRules,
getTagNames, makeTagHashRules, tagSplit, splitText, preparse,
openCloseEnumerate, getOpenCloseForm, makeTagDepthList,
oneStepParse, tagProcess, openCloseProcess, documentParse,
refineParsed, parseText, removeLeaves, attribContainer, postProcess];

listSplit[x_List, lengthlist_List, headlist_List] :=
MapThread[#1 @@ Take[x, #2] &, {headlist,
Transpose[{Most[#] + 1, Rest[#]} &[
FoldList[Plus, 0, lengthlist]]]}];

reconstructIntervals[listlen_Integer, ints_List] :=
Module[{missed, startint, lastint},
startint = If[ints[[1, 1]] == 1, {}, {1, ints[[1, 1]] - 1}];
lastint =
If[ints[[-1, -1]] == listlen, {}, {ints[[-1, -1]] + 1, listlen}];
missed =
Map[If[#[[2, 1]] - #[[1, 2]] >
1, {#[[1, 2]] + 1, #[[2, 1]] - 1}, {}] &,
Partition[ints, 2, 1]];
missed = Join[missed, {lastint}];
Prepend[Flatten[Transpose[{ints, missed}], 1], startint]];

groupElements[lst_List, poslist_List, headlist_List] /;
And[OrderedQ[Flatten[Sort[poslist]]],
Length[headlist] == Length[poslist]] :=
Module[{totalheadlist, allints, llist},
totalheadlist =
Append[
Flatten[
Transpose[
{Array[Sequence &, {Length[headlist]}], headlist}],
1],
Sequence];
allints = reconstructIntervals[Length[lst], poslist];
llist = Map[If[# === {}, 0, 1 - Subtract @@ #] &, allints];
listSplit[lst, llist, totalheadlist]];

(* To work on general heads, we need this *)

groupElements[h_[x__], poslist_List, headlist_List] :=
h[Sequence @@ groupElements[{x}, poslist, headlist]];

(* If we have a single head *)
groupElements[expr_, poslist_List, head_] :=
groupElements[expr, poslist, Table[head, {Length[poslist]}]];


groupPositions[plist_List] :=
Reap[Sow[Last[#], {Most[#]}] & /@ plist, _, List][[2]];


processPosList[{openlist_List, closelist_List}] :=
Module[{opengroup, closegroup, poslist},
{opengroup, closegroup} = groupPositions /@ {openlist, closelist} ;
poslist =
Transpose[Transpose[Sort[#]] & /@ {opengroup, closegroup}];
If[UnsameQ @@ poslist[[1]],
Return[(Print["Unmatched lists!", {openlist, closelist}]; {})],
poslist =
Transpose[{poslist[[1, 1]], Transpose /@ Transpose[poslist[[2]]]}]
]
];

groupElementsNested[nested_, {openposlist_List, closeposlist_List},
head_] /; Head[head] =!= List :=
Fold[Function[{x, y},
MapAt[groupElements[#, y[[2]], head] &, x, {y[[1]]}]], nested,
Sort[processPosList[{openposlist, closeposlist}],
Length[#2[[1]]] < Length[#1[[1]]] &]];


(* Find all the tags in the document*)
getAllUsedTags[text_String] :=
Module[{htmlTagsposlist, result, chars = Characters[text], x},
htmlTagsposlist =
StringPosition[text,
ShortestMatch["<" ~~ x__ ~~ Whitespace | ">"], Overlaps -> True];
result =
Union[Map[ToLowerCase,
StringJoin @@@
Map[Take[chars, {#[[1]], #[[2]] - 1}] &,
htmlTagsposlist]] /. {"<br/" :> "<br"}]];

(* get rid of tags with non-alphabetic characters *)
refineTags[tags_List] :=
Module[{alphabet = Characters["abcdefghijklmnopqrstuvwxyz/"]},
DeleteCases[tags,
x_ /; ! MemberQ[alphabet, StringTake[x, {2, 2}]]]];

getTagTitle[tag_String] :=
If[StringTake[tag, {2, 2}] === "/", StringDrop[tag, {2}], tag];

(* Find all paired tags *)
getPairedTags[tags_List] :=
Reverse /@
Select[Reap[Sow[#, getTagTitle[#]] & /@ tags, _, #2 &][[2]],
Length[#] == 2 &];

(* Prepare string replacement rules to be used when we tokenize the string *)
makeTagReplaceRules[pairedtags_List, unpairedtags_List] :=
Sort[Join[
Rule @@@
Map[{#, {StringDrop[#, 1], "Unpaired", "Open"}} &, unpairedtags],
Rule @@@
Flatten[Transpose[{#, {{StringDrop[#[[1]], 1],
"Open"}, {StringDrop[#[[2]], 2], "Close"}}}] & /@
pairedtags, 1], {Rule[">", {">", "UnpairedClose"}]}],
StringLength[#1[[2, 1]]] > StringLength[#2[[2, 1]]] &];

getTagNames[pairedtags_List, unpairedtags_List] :=
StringDrop[#, 1] & /@ Join[Transpose[pairedtags][[1]], unpairedtags];

makeTagHashRules[tagnames_List] :=
Dispatch[MapThread[Rule, {#, Range[Length[#]]} &[tagnames]]];

(* Tokenize *)
tagSplit[text_String, {tagrules__Rule}] :=
DeleteCases[StringSplit[text, {tagrules}],
x_ /; StringMatchQ[x, Whitespace | "" ~~ ">" ~~ Whitespace | ""]];

splitText[text_String, pairedtags_List, unpairedtags_List] :=
tagSplit[text, makeTagReplaceRules[pairedtags, unpairedtags]];

(* Insert containers for attributes as a pre-parsing stage *)
preparse[text_] :=
Module[{step1},
With[{pos = Position[text, {_, "Open"} | {_, _, "Open"}, Infinity]},
step1 =
ReplacePart[text, attribContainer /@ Extract[text, pos + 1],
pos + 1, List /@ Range[Length[pos]]]]];

(* insert tag depth for each open/close tag sublist in the tokenized list *)
openCloseEnumerate[splittext_List, pairedtags_List,
unpairedtags_List] :=
Module[{tagnames, taghashrules, tagtitlecounters,
unpairedstack = {}, temptag},
tagnames = getTagNames[pairedtags, unpairedtags];
taghashrules = makeTagHashRules[tagnames];
tagtitlecounters = Table[0, {Length[tagnames]}];
Map[Switch[#, {x_, "Open"}, {#[[1]],
"Open", ++tagtitlecounters[[#[[1]] /. taghashrules]]}, {x_,
"Close"}, {#[[1]], "Close",
tagtitlecounters[[#[[1]] /. taghashrules]]--},
{x_, "Unpaired", "Open"},
AppendTo[unpairedstack, #]; {#[[1]],
"Open", ++tagtitlecounters[[#[[1]] /. taghashrules]]}, {"/>" |
">", "UnpairedClose"},
If[Length[unpairedstack] > 0,
temptag = unpairedstack[[-1]];
unpairedstack = Most[unpairedstack];
{temptag[[1]], "Close",
tagtitlecounters[[temptag[[1]] /. taghashrules]]--},
(* else *)
#
] ,
_, #] &, splittext]];

getOpenCloseForm[text_String, pairedtags_List, unpairedtags_List] :=
openCloseEnumerate[
preparse@splitText[text, pairedtags, unpairedtags], pairedtags,
unpairedtags];

(* Create a list of all used tags and their maximal depth *)
makeTagDepthList[opencloseform_List, pairedtags_List,
unpairedtags_List] :=
DeleteCases[{#,
Max[Cases[
opencloseform, {#, "Open" | "Close", x_Integer} :> x]]} & /@
getTagNames[pairedtags, unpairedtags], {x_, -Infinity}];

(* Parse a single depth level of a given tag *)
oneStepParse[parsed_, depth_Integer, tag_String, head_] :=
Module[{plist =
Position[parsed, {tag, #, depth}, Infinity] & /@ {"Open",
"Close"}}, groupElementsNested[parsed, plist, head]];

(* Parse given tag, all levels *)
tagProcess[parseme_, {tag_String, maxdepth_Integer}] :=
Module[{hd = ToExpression[ tag <> "Container"], result},
With[{hd1 = hd, ourtag = tag},
hd1[{ourtag, "Open", n_}, x__, {ourtag, "Close", n_}] := hd1[x];
result =
Fold[oneStepParse[#1, #2, ourtag, hd1] &, parseme,
Range[maxdepth, 1, -1]]];
Clear[hd];
result];

(* parse all tags *)
openCloseProcess[opencloseform_List, pairedtags_List,
unpairedtags_List] := Fold[tagProcess, opencloseform,
makeTagDepthList[opencloseform, pairedtags, unpairedtags]];

(* A few higher - level functions to combine steps *)

documentParse[text_String, pairedtags_List, unpairedtags_List] :=
openCloseProcess[
getOpenCloseForm[text, pairedtags, unpairedtags],
pairedtags, unpairedtags];

refineParsed[parsed_] :=
If[# === {}, #, First@#] &@Cases[parsed, _htmlContainer];

parseText[text_String] :=
Module[{tags, paired, unpaired, parsed},
tags = refineTags@getAllUsedTags@text;
paired = getPairedTags@tags;
unpaired = Complement[tags, Flatten@paired];
parsed =
refineParsed@
documentParse[text, paired, unpaired] /. {">",
"UnpairedClose"} :> ">";
{parsed, paired, unpaired}];

(* Keep the page skeleton only *)
removeLeaves[parsed_] := DeleteCases[parsed, _, {-1}];

(* Some optional post-processing - but leads to information loss *)
postProcess[parsed_] := DeleteCases[parsed, ">" | "", Infinity];

-------------------------------------------------------------------------------------

Some tidbits on the implementation: I developed a groupElementsNested
function which can be useful in many contexts - it wraps elements inside
possibly nested expressions in additional wrappers, in a reasonably
efficient manner, based on lists of their positions. The parser is built on
this functionality and is breadth-first rather than depth-first. This
allows it to parse individual html tags and depth levels if needed - for
example, we could instruct it to parse only <div> tags. It also means that
it will not break completely on malformed html - it may just fail to parse
some tags or levels.
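
For illustration, the flat groupElements wraps the elements at the given
inclusive {start, end} positions and splices everything else back in - on
a toy list it should give:

In[1]:= groupElements[{a, b, c, d, e}, {{2, 3}}, {f}]
Out[1]= {a, f[b, c], d, e}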

The parser computes the used html tags dynamically from the document and is
completely heuristic, in the sense that no strict html format rules are
built into it. So it is not at all industrial strength or production
quality. However, I have used it for my own purposes and it worked
reasonably well for me (in terms of correctness and efficiency).

What it gives you is a symbolic tree representation of the parsed html
document, with the various tags transformed into ...Container[content]
expressions, for example divContainer[aContainer["Some text"]]. The
container names mirror the html tag names, except for attribContainer,
which holds tag attributes. Also, without post-processing no information is
lost: I was able to reconstruct the original html files from this symbolic
representation, I think in all cases I looked at.

Given this structure, one can use Mathematica rules and patterns to process
the document further as needed. For example, here we reconstruct the text
of the Wikipedia page about parsers, reasonably well IMO:

page = Import["http://en.wikipedia.org/wiki/Parser", "Text"];
parsed = postProcess@parseText[page];
StringJoin @@
Cases[Cases[parsed, _pContainer, Infinity] /. _attribContainer :>
Sequence[], _String, Infinity]
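
In the same spirit, one could fish out the link attributes - a sketch
assuming the attribute text survives as raw strings inside attribContainer,
which is how preparse wraps it:

(* collect attribute strings mentioning href *)
Cases[parsed,
 attribContainer[s_String] /; ! StringFreeQ[s, "href"] :> s, Infinity]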

The code was (and is) intended to go under the GPL, so you can treat it as
such. Once again, it has not been polished and certainly contains some
bugs. If you discover any, please let me know.

Hope this helps.

Regards,
Leonid
