Reverse citations / WhatLinksHere implementation? Letting pages list other pages which link to them

Gwern Branwen

Nov 29, 2019, 12:43:32 PM
to hakyll
Has anyone seen or made a Hakyll equivalent of Wikipedia's
WhatLinksHere, where a page lists reverse citations (i.e. all the pages
which link *to* it)? These can be interesting for someone reading
about a topic, to see where it comes up in other contexts.

But such a thing can't easily be implemented on a local per-page
basis, manual creation of reverse-citation lists is a serious pain,
and it's not obvious to me how you would go about it in Hakyll.
I can vaguely see that some sort of two-pass approach with versioning
might work: do the first normal pass, parse the final HTML files for
all internal links, then do a second HTML->HTML pass and substitute in
the hits. But I don't see how that would actually work in Hakyll.

--
gwern
https://www.gwern.net

Beerend Lauwers

Nov 30, 2019, 8:59:30 AM
to hakyll
Perhaps you could preprocess this information similarly to how buildTagsWith does it: use getMatches to load the identifiers of all the content you want to be link-aware. This could be a simple list, something like:

```
main = hakyll $ do
    ...
    identifiers <- getMatches "content/*"
    ...
```

Then, you can get the bodies from the identifiers with something like resourceBody (https://jaspervdj.be/hakyll/reference/Hakyll-Core-Provider.html#v:resourceBody).

After extracting the URLs from internal paths (Hakyll even has a getUrls function that works with TagSoup), you can keep these in a map from Identifier to [URL]. You could also go over that map and transform it to a map of URL -> [Identifier], which would be the reverse citation list of that URL.
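
For example, here is a rough sketch of that inversion (reverseLinkMap is a made-up name, and it assumes you have already built the Identifier -> [URL] map):

```
import qualified Data.Map.Strict as M

-- Invert an Identifier -> [URL] map into URL -> [Identifier]:
-- each URL ends up with the list of pages that link to it.
reverseLinkMap :: M.Map Identifier [String] -> M.Map String [Identifier]
reverseLinkMap m = M.fromListWith (++)
    [ (url, [ident]) | (ident, urls) <- M.toList m, url <- urls ]
```

That inverted map is what the magicalExtraction placeholder below stands for: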

```
main = hakyll $ do
    ...
    reverseCitations <- magicalExtraction identifiers
    ...
```

Then, in the compile step for your content, you should be able to get the route (which is essentially the relative URL for the content you're compiling) with getRoute:

match "content/*" $ do 
    route $ setExtension "html" -- Or something
    compile $ do
        ...
        theRoute <- getRoute
        ....

You can use that result to look up the reverse citations in your URL map. If it works out better code-wise, you can also pass the reverseCitations map into your compiler context (see how it's done with tags for an example), expose it as a listField, and iterate over it inside your Hakyll template to print it out.
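
As a rough, untested sketch of what that lookup might look like (assuming reverseCitations is the URL -> [Identifier] map from above, that its keys are site-root-relative URLs matching the routes Hakyll produces, and with M qualified as Data.Map.Strict as in the earlier sketch):

```
match "content/*" $ do
    route $ setExtension "html"
    compile $ do
        ident  <- getUnderlying
        mRoute <- getRoute ident
        -- Pages linking here; in practice you may need to normalize the
        -- extracted link URLs so that they match the compiled routes.
        let linkers = maybe [] (\r -> M.findWithDefault [] (toUrl r) reverseCitations) mRoute
            linkCtx = field "linkerUrl" $ \item ->
                          maybe "#" toUrl <$> getRoute (itemBody item)
            ctx     = listField "whatlinkshere" linkCtx (mapM makeItem linkers)
                      <> defaultContext
        pandocCompiler >>= loadAndApplyTemplate "templates/default.html" ctx
```

The template side would then loop over whatlinkshere much as it would over tags.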

On Friday, November 29, 2019 at 18:43:32 UTC+1, Gwern Branwen wrote:

Gwern Branwen

Dec 3, 2019, 11:11:01 AM
to hakyll
I see. Thanks for the outline, but I think that's a bit beyond my
Hakyll skills these days. If there's no more straightforward way to do
it, I'll have to drop the idea of this feature. It's complex enough
that it may be something Hakyll should package up and include in the
context, since it could be useful for other things: you could have
front pages with auto-populated 'Key Posts' sections, based on how
many pages link to a given page, for example.

--
gwern
https://www.gwern.net

Ashton Charbonneau

Dec 3, 2019, 3:17:09 PM
to hakyll
This is a bit beyond my skill level too, but here's how I would consider attempting it. The first bit is based on code I use to transform LaTeX/TikZ in math environments to SVG files, which was inspired by this post. It's not really all that complicated, but it isn't very clean, it isn't very fast, and it has some limitations.

I'll try to implement magicalExtraction. First we want a data type to hold our reverse citations.

```
data ReverseCitation = ReverseCitation
    { source      :: Identifier -- ^ Identifier of the page hosting the link
    , destination :: String     -- ^ Destination of the link
    } deriving (Show)
```

I don't know how to use the Provider type in Hakyll, which is required as an argument for resourceBody, so instead I just parse the raw files sitting on disk. Once we have the Pandoc AST, we can easily search it for the links on the page. It should look roughly like this:

```
-- | This function needs to happen outside of the Compiler monad
getReverseCitationsR :: Pattern -> Rules [ReverseCitation]
getReverseCitationsR p = do
    identifiers <- getMatches p
    -- concatMapM comes from Control.Monad.Extra (or use: concat <$> mapM ...)
    preprocess $ concatMapM getReverseCitationsIO identifiers

getReverseCitationsIO :: Identifier -> IO [ReverseCitation]
getReverseCitationsIO identifier = do
    fileContents <- TI.readFile $ toFilePath identifier -- TI = Data.Text.IO
    ast <- runIOorExplode $ readMarkdown pandocReaderOptions fileContents -- Copy readPandocWith for other types or actual error handling
    return $ query (getReverseCitationsAST identifier) ast -- query is from Text.Pandoc.Walk

getReverseCitationsAST :: Identifier -> Inline -> [ReverseCitation]
getReverseCitationsAST identifier (Link _ _ (url, _))
    | isInternal url = [ReverseCitation identifier url]
    | otherwise = []
getReverseCitationsAST _ _ = []
```

Here's an iffy way to determine whether a link is internal. It might actually be useful to collect external links, but let's toss them for now.

```
isInternal :: String -> Bool
isInternal url
    | "https://www.gwern.net" `isPrefixOf` url = True
    | "www.gwern.net" `isPrefixOf` url = True
    | "gwern.net" `isPrefixOf` url = True
    | "." `isPrefixOf` url = True
    | "#" `isPrefixOf` url = False -- Throw these out for now, they're basically self-referencing.
    | "//" `isPrefixOf` url = False -- Protocol relative links.
    | "/" `isPrefixOf` url = True
    | otherwise = False
```

Now we can get a big list of reverse citations, with getReverseCitationsR playing the role of the magicalExtraction function:

```
main :: IO ()
main = hakyll $ do
    ...
    reverseCitations <- getReverseCitationsR "**.page"
    ...
```

Injecting the relevant citations into a context for each page as described by Beerend Lauwers isn't something that I've done elsewhere, so treat this as a rough, untested guess. The signature could be something like `reverseCitationsField :: String -> [ReverseCitation] -> Context String`. In that context, we would need to use getRoute to get the route of the current identifier, filter (and deduplicate) the list of reverse citations, then return a list field with metadata about the source of each citation (title, URL, etc.).

```
reverseCitationsField :: String -> [ReverseCitation] -> Context String
reverseCitationsField key rcList = listFieldWith key citationCtx $ \item -> do
    mRoute <- getRoute $ itemIdentifier item
    -- Keep citations targeting this page; assumes link URLs match the compiled route.
    let relevant = maybe [] (\r -> filter ((== toUrl r) . destination) rcList) mRoute
    mapM makeItem relevant -- TODO: deduplicate by source
  where
    citationCtx =
        field "url"   (\i -> maybe "#" toUrl <$> getRoute (source (itemBody i))) <>
        field "title" (\i -> fromMaybe "Untitled" <$>
                             getMetadataField (source (itemBody i)) "title")
```

Once we have that function, it would be used like so:

```
main :: IO ()
main = hakyll $ do
    ...
    reverseCitations <- getReverseCitationsR "**.page"
    match "**.page" $ do
        route idRoute
        compile $ pandocCompiler
              >>= loadAndApplyTemplate "templates/default.html"
                      -- defaultContext supplies $body$, $title$, etc. alongside the citations.
                      (reverseCitationsField "rc" reverseCitations <> defaultContext)
    ...
```

And the template would be easy:

```
$if(rc)$
<h2>What Links Here</h2>
<ul>
$for(rc)$
<li><a href="$url$">$title$</a></li>
$endfor$
</ul>
$endif$
```

It wouldn't be very hard to expand the ReverseCitation data type to also hold the text of the link, or other attributes of the page that holds the link (as long as they are in the AST). Those could be integrated into the list to give more information about why certain pages link there. It should also be possible to support a default list of reverse citations taken from page metadata (if a page is linked by an external site, you may want to display that in the same list) by simply appending it to the reverseCitations list:


```
main :: IO ()
main = hakyll $ do
    ...
    reverseCitations <- (++ defaultCitations) <$> getReverseCitationsR "**.page"
    ...
```

I'm not sure how I would generalize the process for a 'Key Posts' type of section. I suppose you would search the reverseCitations list for the most common destinations and expose those in a Context. Converting the destination to an identifier could get tricky, as you would have to undo the routing to parse it (you are no longer in the Compiler monad where that info is available).
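
As a rough, untested sketch of that counting step (keyDestinations is a made-up name), you could tally how many collected links point at each destination and sort descending:

```
import qualified Data.Map.Strict as M
import Data.List (sortOn)
import Data.Ord (Down (..))

-- Destinations ranked by how many reverse citations point at them.
keyDestinations :: [ReverseCitation] -> [(String, Int)]
keyDestinations rcs =
      sortOn (Down . snd)
    . M.toList
    $ M.fromListWith (+) [ (destination rc, 1) | rc <- rcs ]
```

Mapping the top destinations back to Identifiers (for titles and the like) still runs into the route-inversion problem mentioned above.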

Ashton