website scraping in Shen/tk


dr.mt...@gmail.com

May 16, 2024, 7:24:49 AM
to Shen
This will eventually evolve into a small set of tools for scraping
text off the web.  Quite an interesting problem actually.

1. Raw ASCII

(time (url "https://shenlanguage.org/"))

run time: 0.09300000220537186 secs
[60 104 116 109 108 62 13 10 32 32 32 32 32 32 32 32 32 32 32 32 ... etc] : (list number)
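
To read the raw output it can help to turn the codes back into characters. A minimal sketch, assuming only the standard Shen primitives n->string and cn; codes->string is a name made up for this illustration and is not one of the scraping tools.

```shen
\\ Join a list of character codes into one string.
\\ n->string maps a code to a unit string; cn concatenates two strings.
\\ codes->string is a hypothetical helper, not part of Shen/tk.
(define codes->string
  [] -> ""
  [N | Ns] -> (cn (n->string N) (codes->string Ns)))

\\ The first six codes of the output above.
(codes->string [60 104 116 109 108 62])
```

The first six codes spell "<html>"; the 13 10 that follows is a carriage return and line feed, and the run of 32s is spaces.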


2. Text as Strings filtering out HTML

(time (let ASCII (url "https://shenlanguage.org/")
           Text  (url->text ASCII)
        Text))

run time: 0.13999998569488525 secs
["The" "Shen" "Group" "var" "sc" "_" "project" "=" "12669375" ";" "var" "sc" "_" "invisible" "=" "1" ";" "var" "sc" "_" "security" "=" """ "a31d15ef" """ ";" "&" "nbsp" ";" "The" "Shen" "Group" "&" "nbsp" ";" "&" "nbsp" ";" "Home" "Learn" "Download" "Community" "Open" "Science" "Donate" "Contact" "Our" "mission" "is" "to" "bring" "the" "power" "of" "Shen" "technology" "to" "every" "major" "programming" "platform" "used" "by" "industry" "and" "deliver" "to" "programmers" "the" "great" "power" "of" "Shen" "." "The" "word" "'" "Shen" "'" "means" "'" "highest" "spirit" "'" "in" "Chinese" "and" "indicates" "our" "goal" "is" "to" "transcend" "the" "divisions" "between" "computer" "languages" "." "Since" "2021" "Shen" "has" "been" "based" "on" "the" "S" "series" "kernels" "." "Features" "pattern" "matching" "," "lambda" "calculus" "consistency" "," "macros" "for" "defining" "domain" "specific" "languages" "," "optional" "lazy" "evaluation" "," "static" "type" "checking" "based" "on" "sequent" "calculus" "," "one" "of" "the" "most" "powerful" "systems" "for" "typing" "in" "functional" "programming" "," "an" "integrated" "fully" "functional" "Prolog" "," "an" "inbuilt" "compiler" "-" "compiler" "," "a" "BSD" "kernel" "under" "15" "languages" "(" "Lisp" "," "Python" "," "Javascript" "," "C" "." "." "." ")" "and" "operating" "systems" "(" "Windows" "," "Linux" "," "OS" "/" "X" ")" "," "is" "extensively" "documented" "in" "a" "book" "has" ... etc] : (list string)
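
How url->text works is not shown here. As a rough sketch of one way a filter of this kind could work, the functions below drop everything between < and > from a list of character codes. The names strip-tags and skip-tag are assumptions made for illustration, and the real function also splits the text into word and punctuation strings, which this sketch does not attempt.

```shen
\\ Hypothetical sketch, not the Shen/tk url->text.
\\ 60 is the code for < and 62 is the code for >.

\\ Discard codes up to and including the next >.
(define skip-tag
  [] -> []
  [62 | Codes] -> Codes
  [_ | Codes] -> (skip-tag Codes))

\\ Copy codes through, skipping each <...> span.
(define strip-tags
  [] -> []
  [60 | Codes] -> (strip-tags (skip-tag Codes))
  [N | Codes] -> [N | (strip-tags Codes)])

\\ The codes for <b>Hi</b>
(strip-tags [60 98 62 72 105 60 47 98 62])
```

The call returns [72 105], the codes for Hi. A filter that only removes tags keeps whatever sits between an opening script tag and its closing tag, since that text is not itself inside angle brackets.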


The filtering is not 100% effective because some JavaScript gets through.  Trying it on the Wikipedia article
on my home town, Leeds, a lot more junk gets through because the page uses more technology outside HTML.  Using the sentence filter improves this; the argument 50 is the maximum length of any acceptable sentence.

3. Text as Sentences

(time (let ASCII     (url "https://en.wikipedia.org/wiki/Leeds")
           Text      (url->text ASCII)
           Sentences (text->sentences Text 50)
        Sentences))

run time: 0.20300006866455078 secs

[["." "startUp" """ "," """ "ext" "." "gadget"] ["cx" "." "eventlogging" "." "campaigns" """ "," """ "ext" "." "cx" "." "uls" "." "quick" "." "actions" """ "," """ "wikibase" "." "client" "." "vector" "-" "2022" """ "," """ "ext" "." "checkUser" "." "clientHints" """ "," """ "ext" "." "growthExperiments"] ["SuggestedEditSession" """ "]" ";" "(" "RLQ" "=" "window"] ["mw" "-" "parser" "-" "output" "." "hatnote" "i" "{" "font" "-" "style" ":" "normal" "}" "." "mw" "-" "parser" "-" "output" "." "hatnote" "+" "link" "+" "." "hatnote" "{" "margin" "-" "top" ":" "-" "0" "." "5em" "}" "This" "article" "is" "about" "the" "city"] ["For" "the" "district" "," "see" "City" "of" "Leeds"] ["For" "other" "uses" "," "see" "Leeds" "(" "disambiguation" ")"] ["." "79750" ";" "-" "1" "." "54361" "Leeds" "is" "a" "city" "&" "#" "91" ";" "a" "&" "#" "93" ";" "in" "West" "Yorkshire" "," "England"] ["It" "is" "the" "largest" "settlement" "in" "Yorkshire" "and" "the" "administrative" "centre" "of" "the" "City" "of" "Leeds" "Metropolitan" "Borough" "," "which" "is" "the" "second" "most" "populous" "district" "in" "the" "United" "Kingdom"] ["It" "is" "built" "around" "the" "River" "Aire" "and" "is" "in" "the" "eastern" "foothills" "of" "the" "Pennines"] ["The" "city" "was" "a" "small" "manorial" "borough" "in" "the" "13th" "century" "and" "a" "market" "town" "in" "the" "16th" "century"] ["town" "during" "the" "Industrial" "Revolution" "alongside" "other" "surrounding" "villages" "and" "towns" "in" "the" "West" "Riding" "of" "Yorkshire"] ["(" "M" ")"] [""" "the" "region" "which" "is" "called" "Loidis" """ ")"] ["An" "inhabitant" "of" "Leeds" "is" "locally" "known" "as" "a" "Loiner" "," "a" "word" "of" "uncertain" "origin" "." 
"&" "#" "91" ";" "23" "&" "#" "93" ";" "The" "term" "Leodensian" "is" "also" "used" "," "from" "the" "city" "'" "s" "Latin" "name"] ["Economic" "development" "[" "edit" "]" "The" "Leeds" "and" "Liverpool" "Canal" "at" "Granary" "Wharf" "The" "Leeds" "Corn" "Exchange" "opened" "in" "1864"] ... etc] : (list (list string))
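
The definition of text->sentences is not given. A minimal sketch of the general idea, assuming sentences end at a "." token and that length is counted in tokens; split-at-stops and short-enough are hypothetical names. The output above keeps some "." tokens inside sentences, so the actual rule is evidently more refined than this.

```shen
\\ Hypothetical sketch, not the Shen/tk text->sentences.
\\ Split a token list at each "." token, dropping empty sentences.
\\ The second argument accumulates the current sentence in reverse.
(define split-at-stops
  [] [] -> []
  [] Sentence -> [(reverse Sentence)]
  ["." | Tokens] [] -> (split-at-stops Tokens [])
  ["." | Tokens] Sentence -> [(reverse Sentence) | (split-at-stops Tokens [])]
  [Token | Tokens] Sentence -> (split-at-stops Tokens [Token | Sentence]))

\\ Keep only the sentences of at most Max tokens.
(define short-enough
  _ [] -> []
  Max [S | Ss] -> [S | (short-enough Max Ss)] where (<= (length S) Max)
  Max [_ | Ss] -> (short-enough Max Ss))

(short-enough 4 (split-at-stops ["It" "is" "built" "." "Leeds" "is" "a" "city" "in" "England" "."] []))
```

With a limit of 4 the call returns [["It" "is" "built"]]; the second sentence has six tokens and is dropped. The idea behind a length cap is that long runs of tokens with no sentence break are more likely to be script or stylesheet text than prose.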

4. Text as Parsable Sentences 

(time (let ASCII     (url "https://en.wikipedia.org/wiki/Leeds")
           Text      (url->text ASCII)
           Sentences (text->sentences Text 50)
           Parsable  (filter (fn parsable?) Sentences)
        Parsable))
???

This would filter out the remainder of the junk if the parser were effective.
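
No parser is defined here, so parsable? is left open. Purely as a placeholder, a crude token test can stand in for it: the sketch below rejects any sentence containing a token that suggests leftover script or stylesheet text. The names junk-token?, plausible? and keep-plausible and the choice of tokens are assumptions for illustration; this is a heuristic, not a parser.

```shen
\\ Hypothetical stand-in for parsable?; a real version would call a parser.
\\ Tokens that suggest leftover script or stylesheet text.
(define junk-token?
  Token -> (element? Token ["{" "}" "=" ";"]))

\\ A sentence is plausible if none of its tokens is a junk token.
(define plausible?
  [] -> true
  [Token | _] -> false where (junk-token? Token)
  [_ | Tokens] -> (plausible? Tokens))

\\ Keep only the plausible sentences.
(define keep-plausible
  [] -> []
  [S | Ss] -> [S | (keep-plausible Ss)] where (plausible? S)
  [_ | Ss] -> (keep-plausible Ss))

(keep-plausible [["RLQ" "=" "window"] ["For" "other" "uses"]])
```

The call returns [["For" "other" "uses"]]. A test like this also throws away genuine sentences that contain entity residue such as "&" "#" "160" ";", which is one reason a real parser would be the better filter.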

Mark

dr.mt...@gmail.com

May 16, 2024, 7:36:48 AM
to Shen
With an improvement of the sentence filter this is what we get
from scraping the Wikipedia article; abridged because of length.

run time: 0.07799994945526123 secs
[["SuggestedEditSession" """ "]" ";" "(" "RLQ" "=" "window"] ["For" "the" "district" "," "see" "City" "of" "Leeds"] ["For" "other" "uses" "," "see" "Leeds" "(" "disambiguation" ")"] ["It" "is" "the" "largest" "settlement" "in" "Yorkshire" "and" "the" "administrative" "centre" "of" "the" "City" "of" "Leeds" "Metropolitan" "Borough" "," "which" "is" "the" "second" "most" "populous" "district" "in" "the" "United" "Kingdom"] ["It" "is" "built" "around" "the" "River" "Aire" "and" "is" "in" "the" "eastern" "foothills" "of" "the" "Pennines"] ["The" "city" "was" "a" "small" "manorial" "borough" "in" "the" "13th" "century" "and" "a" "market" "town" "in" "the" "16th" "century"] ["An" "inhabitant" "of" "Leeds" "is" "locally" "known" "as" "a" "Loiner" "," "a" "word" "of" "uncertain" "origin" "." "&" "#" "91" ";" "23" "&" "#" "93" ";" "The" "term" "Leodensian" "is" "also" "used" "," "from" "the" "city" "'" "s" "Latin" "name"] ["Economic" "development" "[" "edit" "]" "The" "Leeds" "and" "Liverpool" "Canal" "at" "Granary" "Wharf" "The" "Leeds" "Corn" "Exchange" "opened" "in" "1864"] ["Leeds" "developed" "as" "a" "market" "town" "in" "the" "Middle" "Ages" "as" "part" "of" "the" "local" "agricultural" "economy"] ["The" "new" "charter" "incorporated" "the" "entire" "parish" "," "including" "all" "eleven" "townships" "," "as" "the" "Borough" "of" "Leeds" "and" "withdrew" "the" "earlier" "charter"] ["Leeds" "Borough" "Police" "force" "was" "formed" "in" "1836" "," "and" "Leeds" "Town" "Hall" "was" "completed" "by" "the" "corporation" "in" "1858"] ["In" "1866" "," "Leeds" "and" "each" "of" "the" "other" "townships" "in" "the" "borough" "became" "civil" "parishes"] ["It" "gained" "both" "borough" "and" "city" "status" "and" "is" "known" "as" "the" "City" "of" "Leeds"] ["Initially" "," "local" "government" "services" "were" "provided" "by" "Leeds" "City" "Council" "and" "West" "Yorkshire" "County" "Council"] ["When" "the" "county" "council" "was" "abolished" "in" "1986" "," "the" "city" 
"council" "absorbed" "its" "functions" "," "and" "some" "powers" "passed" "to" "organisations" "such" "as" "the" "West" "Yorkshire" "Passenger" "Transport" "Authority"] ["Suburban" "growth" "[" "edit" "]" "1866" "map" "of" "Leeds" "19th" "-" "century" "Briggate" "," "Leeds" "In" "1801" "," "42" "%" "of" "the" "population" "of" "Leeds" "lived" "outside" "the" "township" "," "in" "the" "wider" "borough"] ["Cholera" "outbreaks" "in" "1832" "and" "1849" "caused" "the" "authorities" "to" "address" "the" "problems" "of" "drainage" "," "sanitation" "," "and" "water" "supply"] ["Water" "was" "pumped" "from" "the" "River" "Wharfe" "," "but" "by" "1860" "it" "was" "too" "heavily" "polluted" "to" "be" "usable"] ["When" "pollution" "became" "a" "problem" "," "the" "wealthier" "residents" "left" "the" "industrial" "conurbation" "to" "live" "in" "Headingley" "," "Potternewton" "and" "Chapel" "Allerton" "which" "led" "to" "a" "50" "%" "increase" "in" "the" "population" "of" "Headingley" "and" "Burley" "from" "1851" "to" "1861"] ["The" "slums" "of" "Quarry" "Hill" "were" "replaced" "by" "the" "innovative" "Quarry" "Hill" "flats" "," "which" "were" "demolished" "in" "1975"] ["Another" "36" "," "000" "houses" "were" "built" "by" "private" "sector" "builders" "," "creating" "suburbs" "in" "Gledhow" "," "Moortown" "," "Alwoodley" "," "Roundhay" "," "Colton" "," "Whitkirk" "," "Oakwood" "," "Weetwood" "," "and" "Adel"] ["Many" "developments" "boasting" "luxurious" "penthouse" "apartments" "have" "been" "built" "close" "to" "the" "city" "centre"] ["The" "northern" "boundary" "follows" "the" "River" "Wharfe" "for" "several" "miles" "but" "crosses" "the" "river" "to" "include" "the" "part" "of" "Otley" "which" "lies" "north" "of" "the" "river"] ["Briggate" "," "the" "principal" "north" "–" "south" "shopping" "street" "," "is" "pedestrianised" "and" "Queen" "Victoria" "Street" "," "a" "part" "of" "the" "Victoria" "Quarter" "," "is" "enclosed" "under" "a" "glass" "roof"] ["Millennium" 
"Square" "is" "a" "significant" "urban" "focal" "point"] ["River" "Aire" "Inner" "and" "southern" "areas" "of" "Leeds" "lie" "on" "a" "layer" "of" "coal" "measure" "sandstones" "forming" "the" "Yorkshire" "Coalfield"] ["Leeds" "centre" "," "there" "are" "a" "number" "of" "suburbs" "and" "exurbs" "within" "the" "district"] ["The" "district" "ranges" "from" "1" "," "115" "feet" "(" "340" "&" "#" "160" ";" "m" ")" "in" "the" "far" "west" "on" "the" "slopes" "of" "Ilkley" "Moor" "to" "about" "33" "feet" "(" "10" "&" "#" "160" ";" "m" ")" "where" "the" "rivers" "Aire" "and" "Wharfe" "cross" "the" "eastern" "boundary"] ["Land" "rises" "to" "198" "&" "#" "160" ";" "m" "(" "650" "&" "#" "160" ";" "ft" ")" "in" "Cookridge" "," "just" "6" "miles" "(" "9" "." "7" "&" "#" "160" ";" "km" ")" "from" "the" "city" "centre"] ["The" "northern" "boundary" "follows" "the" "River" "Wharfe" "for" "several" "miles" "(" "several" "kilometres" ")" "," "but" "it" "crosses" "the" "river" "to" "include" "the" "part" "of" "Otley" "which" "lies" "north" "of" "the" "river"] ["Larger" "outlying" "towns" "and" "villages" "are" "exempt" "from" "the" "green" "belt" "area"] ["However" "," "smaller" "villages" "," "hamlets" "and" "rural" "areas" "are" "'" "washed" "over" "'" "by" "the" "designation"] ["Newsam" "Park" "and" "House" "with" "golf" "course" "," "Rothwell" "Country" "Park" "," "Middleton" "Park" "," "Kirkstall" "Abbey" "ruins" "and" "surrounding" "park" "," "Bedquilts" "recreation" "grounds" "," "Waterloo" "lake" "," "Roundhay" "castle" "and" "park" "," "and" "Morwick" "," "Cobble" "and" "Elmete" "Halls"] ["Climate" "[" "edit" "]" "Sunny" "early" "-" "June" "2006" "day" "at" "Park" "Square" "Leeds" "has" "a" "climate" "that" "is" "oceanic" "(" "K" "ö" "ppen" ":" "Cfb" ")" "," "and" "influenced" "by" "the" "Pennines"] ["Summers" "are" "usually" "mild" "," "with" "moderate" "rainfall" "," "while" "winters" "are" "chilly" "," "cloudy" "with" "occasional" "snow" "and" "frost"] ["Temperatures" 
"above" "30" "&" "#" "160" ";" "°" "C" "(" "86" "&" "#" "160" ";" "°" "F" ")" "and" "below" "?" "10" "&" "#" "160" ";" "°" "C" "(" "14" "&" "#" "160" ";" "°" "F" ")" "are" "not" "very" "common" "but" "can" "happen" "occasionally"] ["It" "is" "likely" "this" "was" "exceeded" "during" "the" "heatwaves" "of" "July" "2019" "and" "July" "2022" "where" "many" "other" "areas" "broke" "their" "all" "time" "records"] ["However" "Leeds" "weather" "centre" "closed" "in" "the" "2000s"] ["As" "is" "typical" "for" "many" "sprawling" "cities" "in" "areas" "of" "varying" "topography" "," "temperatures" "can" "change" "depending" "on" "location"] ["This" "is" "2" "&" "#" "160" ";" "°" "C" "(" "3" "." "6" "&" "#" "160" ";" "°" "F" ")" "milder" "than" "the" "typical" "summer" "temperature" "at" "Leeds" "Bradford" "airport" "weather" "station" "(" "shown" "in" "the" "chart" "below" ")" "," "at" "an" "elevation" "of" "208" "metres" "(" "682" "feet" ")"] ["Situated" "on" "the" "eastern" "side" "of" "the" "Pennines" "," "Leeds" "is" "among" "the" "driest" "cities" "in" "the" "United" "Kingdom" "," "with" "an" "annual" "rainfall" "of" "660" "&" "#" "160" ";" "mm" "(" "25" "." 
"98" "&" "#" "160" ";" "in" ")"] ["Though" "extreme" "weather" "in" "Leeds" "is" "relatively" "rare" "," "thunderstorms" "," "blizzards" "," "gale" "-" "force" "winds" "and" "even" "tornadoes" "have" "struck" "the" "city"] ["The" "population" "density" "was" "4" "," "066" "inhabitants" "per" "square" "kilometre" "(" "10" "," "530" "/" "sq" "&" "#" "160" ";" "mi" ")" "," "slightly" "higher" "than" "the" "rest" "of" "the" "West" "Yorkshire" "Urban" "Area"] ["It" "accounts" "for" "20" "%" "of" "the" "area" "and" "62" "%" "of" "the" "population" "of" "the" "City" "of" "Leeds"] ["Leeds" "is" "the" "largest" "component" "of" "the" "West" "Yorkshire" "Urban" "Area" "&" "#" "91" ";" "62" "&" "#" "93" ";" "and" "is" "counted" "by" "Eurostat" "as" "part" "of" "the" "Leeds" "-" "Bradford" "larger" "urban" "zone"] ["Leeds" "has" "seen" "many" "new" "different" "countries" "of" "birth" "as" "of" "the" "UK" "Census" "including" "Zimbabwe" "," "Iran" "," "India" "and" "Nigeria" "all" "included" "in" "the" "top" "ten" "countries" "of" "birth" "in" "the" "city"] ["Large" "Pakistani" "communities" "can" "be" "seen" "in" "wards" "such" "as" "Gipton" "and" "Harehills"] ["The" "City" "of" "Leeds" "is" "the" "local" "government" "district" "covering" "Leeds" "," "and" "the" "local" "authority" "is" "Leeds" "City" "Council"] ["The" "council" "is" "composed" "of" "99" "councillors" "," "three" "for" "each" "of" "the" "district" "'" "s" "wards"] ["Elections" "are" "held" "three" "years" "out" "of" "four" "," "on" "the" "first" "Thursday" "of" "May"] ["One" "third" "of" "the" "councillors" "are" "elected" "," "for" "a" "four" "-" "year" "term" "," "in" "each" "election"] ["The" "council" "is" "currently" "controlled" "by" "Labour"] ... etc]

Mark

dr.mt...@gmail.com

May 17, 2024, 5:52:10 AM
to Shen