| epub from the posts | Attila Lendvai | 04/12/13 22:16 | hi,
i've downloaded an epub of the lesswrong posts, and was reading it with great pleasure. then i saw that there are some broken external links, so i looked at how it was generated. then i saw that the whole lesswrong site is opensource, so here i am looking for feedback on the following plans of mine:
i'm considering implementing a feature that generates ebooks in epub format from the posts, just like the numerous (?!) crawlers that are available, but as an integral part of the lesswrong site working from the database.
http://dato.github.io/lesswrong-bundle/
https://github.com/jb55/lesswrong-print
http://hg.ciphergoth.org/scrape-sequences/
https://github.com/OneWhoFrogs/lw2ebook.git
would this be desirable? is the database dump available for download? if not, then is there any way i could work on this feature?
any hints on what to use as an organizational structure? i've just started to consume the content, so that part is beyond me for a while... i.e. how many standalone, non-interlinked books can be generated? maybe just one big epub? what ordering and/or categorization to use?
and another idea would be to write a piece of code that checks for broken links and creates a report, so that admins can deal with the situation either by linking to archive.org or by looking up the new home of the content if available. maybe a way to store and present link alternatives?
-- • attila lendvai • PGP: 963F 5D5F 45C7 DFCD 0A39
-- Your conscience never stops you from doing anything. It just stops you from enjoying it. |
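For readers wondering what "generating an epub from the posts" involves mechanically, here is a minimal sketch using only the Python standard library. It is not how any of the linked scrapers work, and it has not been validated with an EPUB checker. The function name `build_epub`, the `(title, xhtml_body)` input shape, the file layout under `OEBPS/`, and the placeholder `book_id` are all assumptions made for illustration. Post bodies are assumed to be well-formed XHTML already.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTAINER = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

CHAPTER = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>{title}</title></head>
<body><h1>{title}</h1>
{body}
</body>
</html>
"""

OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
    <dc:identifier id="bookid">{book_id}</dc:identifier>
    <dc:language>{language}</dc:language>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    {manifest}
  </manifest>
  <spine toc="ncx">
    {spine}
  </spine>
</package>
"""

NCX = """<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content="{book_id}"/></head>
  <docTitle><text>{title}</text></docTitle>
  <navMap>
    {nav_points}
  </navMap>
</ncx>
"""


def build_epub(book_title, posts, language="en",
               book_id="urn:uuid:12345678-1234-5678-1234-567812345678"):
    """Return the bytes of a minimal EPUB 2 container.

    posts is a list of (title, xhtml_body) pairs in reading order.
    book_id is a placeholder; a real book needs its own unique identifier.
    """
    manifest, spine, nav_points = [], [], []
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry has to come first and be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER)
        for i, (title, body) in enumerate(posts, start=1):
            name = f"post{i:04d}.xhtml"
            safe_title = escape(title)
            zf.writestr(f"OEBPS/{name}",
                        CHAPTER.format(title=safe_title, body=body))
            manifest.append(
                f'<item id="p{i}" href="{name}" media-type="application/xhtml+xml"/>')
            spine.append(f'<itemref idref="p{i}"/>')
            nav_points.append(
                f'<navPoint id="n{i}" playOrder="{i}">'
                f'<navLabel><text>{safe_title}</text></navLabel>'
                f'<content src="{name}"/></navPoint>')
        common = dict(title=escape(book_title), book_id=escape(book_id))
        zf.writestr("OEBPS/content.opf", OPF.format(
            language=escape(language),
            manifest="\n    ".join(manifest),
            spine="\n    ".join(spine), **common))
        zf.writestr("OEBPS/toc.ncx", NCX.format(
            nav_points="\n    ".join(nav_points), **common))
    return buf.getvalue()
```

Writing the result to disk is then a matter of `open("posts.epub", "wb").write(build_epub("Posts", posts))`. The packaging works the same whether the posts come from a scraper or from the database. What differs is where the `posts` list comes from, which is the question debated in this thread.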
| Re: epub from the posts | Attila Lendvai | 10/12/13 09:05 | is anyone listening who has the authority to answer my inquiry?
i'm asking because i'm trying to decide whether to move on to a different forum with my questions, or just wait until people catch up with life.
-- “A private central bank issuing the public currency is a greater menace to the liberties of the people than a standing army. [...] We must not let our rulers load us with perpetual debt.” — Thomas Jefferson (1743–1826) |
| Re: epub from the posts | Jeff Schwaber | 10/12/13 09:46 | You probably want to email the tricycle folks directly. They do respond here sometimes, but not as readily. Jeff -- |
| Re: epub from the posts | Matthew Fallshaw | 11/12/13 17:40 | On Thursday, December 5, 2013 5:16:09 PM UTC+11, Attila Lendvai wrote:
> …format from the posts…
Implementing this in the code doesn't seem to be significantly better than implementing an independent scraper, and it increases the amount of code we have to maintain. I think this is not a desirable feature.
> …
No, it is not. The db dump is full of secrets like private messages and information that could link public user accounts with personal info. We do not share the dump easily.
> … and another idea would be to write a piece of code that checks for broken links and creates a report, so that admins can deal with the…
There are a lot of broken links, and updating them all manually is a lot of work. Rather than implementing this feature in the codebase, I again suggest scraping. If you or any of the editors has the appetite for this work, I suggest using an existing link checker (e.g. http://wummel.github.io/linkchecker/).
… but if you're keen to contribute programming skill to the project, there is lots we'd love help with. Have a look at https://code.google.com/p/lesswrong/issues/list and see if there's anything there you'd like to have a go at, then see https://github.com/tricycle/lesswrong/wiki for an intro to developing the codebase.
(Sorry about the delay responding.) |
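As a rough illustration of the "scrape and check" approach suggested above, the sketch below pulls the external links out of post HTML and reports the ones that fail. It uses only the Python standard library and is not the API of the linkchecker tool mentioned in the reply. The names `extract_links`, `http_status` and `broken_links`, and the `{post_url: html}` input shape, are assumptions made for this example. An existing tool will handle many cases this ignores, such as redirects to error pages, rate limiting and robots.txt.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) targets of <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def http_status(url, timeout=10):
    """Return the HTTP status code, or None if no response was received.

    Uses a HEAD request; some servers reject HEAD, so a real checker
    would probably retry with GET before declaring a link broken.
    """
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:  # URLError, timeouts and connection errors
        return None


def broken_links(posts, fetch_status=http_status):
    """Return (post_url, link, status) for every failing link.

    posts maps a post URL (or any identifier) to that post's HTML.
    fetch_status is injectable so the logic can be tried offline.
    """
    cache = {}  # check each distinct link only once
    report = []
    for post_url, html in posts.items():
        for link in extract_links(html):
            if link not in cache:
                cache[link] = fetch_status(link)
            status = cache[link]
            if status is None or status >= 400:
                report.append((post_url, link, status))
    return report
```

Each report entry names the post that contains the link, so an editor can go straight to the page that needs fixing.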
| Re: epub from the posts | Attila Lendvai | 13/12/13 00:53 |
if you exclude the implementer's efforts from the weighting then you're right, it's not significantly better. i understand your perspective: it would be more LoC in something you maintain, and if you don't see enough value in the feature itself, then it's just extra burden. my perspective is different because i think there's value in a publicly available and automatically generated epub snapshot (i am often at places where there's no internet connection and i have time to read).
well, of course. sorry for not being clear enough: what i meant is whether anything already exists that slices out the public part of the database (and potentially offers it for public backup). i've seen too much useful stuff disappear from the internet not to be at least slightly worried about this in general. it would also help contributors test their changes.
the same applies again regarding the weighting. i don't care about broken links on the site in general, i only care about broken links in the more valuable content - the posts. writing a remote scraper that extracts the useful content is wasted effort compared to going through the database and setting up a cron job, as part of the site, that mails the results to interested parties every once in a while.
i had gone through those links already, but motivation is a bitch... ;) i see enough value in a nicely organized and packaged ebook of the posts to motivate me to consider working on it - hence this mail and my research on the past, arguably wasted and duplicated, efforts. but having to process derived (and for my purposes obfuscated) information, that can change and render all my code obsolete... is weight on the scale, next to the fact that there are already epubs of varying quality available.
no worries, thank you for your time, and for lesswrong in general!
“If pigs could vote, the man with the slop bucket would be elected swineherd every time, no matter how much slaughtering he did on the side.” — Orson Scott Card (1951–) |
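The "cron job that drops a mail to interested parties" idea from this last message could look roughly like the sketch below, whichever side produces the list of broken links. It only builds the message. The function name `build_report_mail` and the `(post_url, link, status)` tuple shape are assumptions for illustration, and the addresses in the usage note are placeholders.

```python
from email.message import EmailMessage


def build_report_mail(broken, sender, recipients):
    """Build a plain-text report mail from (post_url, link, status) tuples.

    status is an HTTP status code, or None when no response was received.
    """
    # Group the broken links by the post that contains them.
    by_post = {}
    for post_url, link, status in broken:
        by_post.setdefault(post_url, []).append((link, status))

    lines = []
    for post_url in sorted(by_post):
        lines.append(post_url)
        for link, status in by_post[post_url]:
            label = "no response" if status is None else f"HTTP {status}"
            lines.append(f"  {link} ({label})")

    msg = EmailMessage()
    msg["Subject"] = (
        f"Broken link report: {len(broken)} link(s) in {len(by_post)} post(s)")
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content("\n".join(lines) or "No broken links found.")
    return msg
```

A script run from cron could then hand the message to a local mail server, for example with `smtplib.SMTP("localhost").send_message(msg)`, assuming such a server is available on the host.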