Use Pandoc With Pygments To Highlight Source Code

Ozell Marcel

Dec 5, 2023, 1:00:32 PM

Until recently, I used a JavaScript plugin on this blog to format source code. This bothered me, since using JavaScript just to display some source code seems like overkill and makes people have to turn on JavaScript in their browsers just to see the source code formatted nicely. I wanted to do better than that.
The way I normally write my blog posts is, I start with a Markdown article and then use pandoc to convert it to HTML, which I then copy and paste into Wordpress (if there is a better way to do this, please contact me). I noticed pandoc provides a switch, --filter, where you can specify an executable that transforms the parsed document before pandoc writes the output. The only problem is, you have to write a filter. Luckily, I found a GitHub gist that has already figured out how to write one. Here is some Haskell for you:
Note that this program invokes another program, pygmentize, to actually highlight the source code (pygmentize is part of the Pygments project). So, install pygmentize with your favorite package manager, install Haskell if you have not done so already, and then compile pygments.hs with:
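For readers who would rather not install Haskell, the same idea can be sketched in Python. This is not the gist's code, just a minimal illustration under a few assumptions: the output format is HTML, pygmentize is on the PATH, and the code block's first class names its language. The function names pygmentize and walk are made up for this example.

```python
import json
import subprocess
import sys


def pygmentize(code, lang):
    """Return HTML for `code`, produced by the external pygmentize program."""
    result = subprocess.run(
        ["pygmentize", "-l", lang, "-f", "html", "-O", "noclasses=True"],
        input=code, capture_output=True, text=True, check=True,
    )
    return result.stdout


def walk(node, highlighter=pygmentize):
    """Recursively copy pandoc's JSON AST, swapping CodeBlocks for raw HTML."""
    if isinstance(node, list):
        return [walk(item, highlighter) for item in node]
    if isinstance(node, dict):
        if node.get("t") == "CodeBlock":
            # A CodeBlock carries [[identifier, classes, key-value pairs], code].
            (_ident, classes, _pairs), code = node["c"]
            if not classes:
                return node  # no language given: leave the block alone
            try:
                html = highlighter(code, classes[0])
            except (subprocess.CalledProcessError, FileNotFoundError):
                return node  # unknown lexer or pygmentize missing: keep as-is
            return {"t": "RawBlock", "c": ["html", html]}
        return {key: walk(value, highlighter) for key, value in node.items()}
    return node


# pandoc passes the output format as the first argument when it runs a filter,
# so this branch only fires when the script is actually used as one.
if __name__ == "__main__" and len(sys.argv) > 1:
    json.dump(walk(json.load(sys.stdin)), sys.stdout)
```

Saved under a hypothetical name such as pygments_filter.py and made executable, it could be tried with something like pandoc --filter ./pygments_filter.py -t html post.md, where post.md stands in for your own article.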
By default, pandoc will use LaTeX to create the PDF, which requires that a LaTeX engine be installed (see --pdf-engine below). Alternatively, pandoc can use ConTeXt, roff ms, or HTML as an intermediate format. To do this, specify an output file with a .pdf extension, as before, but add the --pdf-engine option or -t context, -t html, or -t ms to the command line. The tool used to generate the PDF from the intermediate format may be specified using --pdf-engine.
When using LaTeX, the following packages need to be available (they are included with all recent versions of TeX Live): amsfonts, amsmath, lm, unicode-math, iftex, listings (if the --listings option is used), fancyvrb, longtable, booktabs, graphicx (if the document contains images), hyperref, xcolor, soul, geometry (with the geometry variable set), setspace (with linestretch), and babel (with lang). If CJKmainfont is set, xeCJK is needed. The use of xelatex or lualatex as the PDF engine requires fontspec. lualatex uses selnolig. xelatex uses bidi (with the dir variable set). If the mathspec variable is set, xelatex will use mathspec instead of unicode-math. The upquote and microtype packages are used if available, and csquotes will be used for typography if the csquotes variable or metadata field is set to a true value. The natbib, biblatex, bibtex, and biber packages can optionally be used for citation rendering. The following packages will be used to improve output quality if present, but pandoc does not require them to be present: upquote (for straight quotes in verbatim environments), microtype (for better spacing adjustments), parskip (for better inter-paragraph spaces), xurl (for better line breaks in URLs), bookmark (for better PDF bookmarks), and footnotehyper or footnote (to allow footnotes in tables).
The --shift-heading-level-by option shifts heading levels by a positive or negative integer. For example, with --shift-heading-level-by=-1, level 2 headings become level 1 headings, and level 3 headings become level 2 headings. Headings cannot have a level less than 1, so a heading that would be shifted below level 1 becomes a regular paragraph. Exception: with a shift of -N, a level-N heading at the beginning of the document replaces the metadata title. --shift-heading-level-by=-1 is a good choice when converting HTML or Markdown documents that use an initial level-1 heading for the document title and level-2+ headings for sections. --shift-heading-level-by=1 may be a good choice for converting Markdown documents that use level-1 headings for sections to HTML, since pandoc uses a level-1 heading to render the document title.
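The shifting rule is easy to misread, so here is a small Python model of it. It only illustrates the behaviour described above and is not pandoc's code. The function name shift_headings and the (level, text) tuples are made up for the example, and the first list entry is assumed to be a heading at the very start of the document.

```python
def shift_headings(headings, shift):
    """Model how a heading shift treats a list of (level, text) headings."""
    result = []
    for index, (level, text) in enumerate(headings):
        new_level = level + shift
        if new_level >= 1:
            # Normal case: the heading simply moves to its new level.
            result.append(("heading", new_level, text))
        elif new_level == 0 and index == 0:
            # Exception: a level-N heading at the start, shifted by -N,
            # replaces the metadata title.
            result.append(("title", None, text))
        else:
            # Anything else pushed below level 1 becomes a regular paragraph.
            result.append(("paragraph", None, text))
    return result


document = [(1, "My Post"), (2, "Setup"), (3, "Details"), (1, "Stray")]
for kind, level, text in shift_headings(document, -1):
    print(kind, level, text)
```

With a shift of -1, the opening level-1 heading becomes the title, the level-2 and level-3 headings move up one level, and the later level-1 heading turns into a paragraph.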
The --preserve-tabs option preserves tabs instead of converting them to spaces. (By default, pandoc converts tabs to spaces before parsing its input.) Note that this will only affect tabs in literal code spans and code blocks. Tabs in regular text are always treated as spaces.
The --abbreviations option specifies a custom abbreviations file, with abbreviations one to a line. If this option is not specified, pandoc will read the data file abbreviations from the user data directory or fall back on a system default. To see the system default, use pandoc --print-default-data-file=abbreviations. The only use pandoc makes of this list is in the Markdown reader. Strings found in this list will be followed by a nonbreaking space, and the period will not produce sentence-ending space in formats like LaTeX. The strings may not contain spaces.
Note: some readers and writers (e.g., docx) need access to data files. If these are stored on the file system, then pandoc will not be able to find them when run in --sandbox mode and will raise an error. For these applications, we recommend using a pandoc binary compiled with the embed_data_files option, which causes the data files to be baked into the binary instead of being stored on the file system.
The --wrap option determines how text is wrapped in the output (the source code, not the rendered version). With auto (the default), pandoc will attempt to wrap lines to the column width specified by --columns (default 72). With none, pandoc will not wrap lines at all. With preserve, pandoc will attempt to preserve the wrapping from the source document (that is, where there are nonsemantic newlines in the source, there will be nonsemantic newlines in the output as well). In ipynb output, this option affects wrapping of the contents of markdown cells.
The --highlight-style option specifies the coloring style to be used in highlighted source code. Options are pygments (the default), kate, monochrome, breezeDark, espresso, zenburn, haddock, and tango. For more information on syntax highlighting in pandoc, see Syntax highlighting, below. See also --list-highlight-styles.
The --print-highlight-style option prints a JSON version of a highlighting style, which can be modified, saved with a .theme extension, and used with --highlight-style. This option may be used with -o/--output to redirect output to a file, but -o/--output must come before --print-highlight-style on the command line.
The --syntax-definition option instructs pandoc to load a KDE XML syntax definition file, which will be used for syntax highlighting of appropriately marked code blocks. This can be used to add support for new languages or to use altered syntax definitions for existing languages. This option may be repeated to add multiple syntax definitions.
The --listings option uses the listings package for LaTeX code blocks. The package does not support multi-byte encoding for source code. To handle UTF-8 you would need to use a custom template. This issue is fully documented here: Encoding issue with the listings package.
For each name, the first layout found with that name will be used. If no layout is found with one of the names, pandoc will output a warning and use the layout with that name from the default reference doc instead. (How these layouts are used is described in PowerPoint layout choice.)
You can also modify the default reference.pptx: first run pandoc -o custom-reference.pptx --print-default-data-file reference.pptx, and then modify custom-reference.pptx in MS PowerPoint (pandoc will use the layouts with the names listed above).
The --ipynb-output option determines how ipynb output cells are treated. all means that all of the data formats included in the original are preserved. none means that the contents of data cells are omitted. best causes pandoc to try to pick the richest data block in each output cell that is compatible with the output format. The default is best.
The default is to render TeX math as far as possible using Unicode characters. Formulas are put inside a span with class="math", so that they may be styled differently from the surrounding text if needed. However, this gives acceptable results only for basic math; usually you will want to use --mathjax or another of the following options.
The --dump-args option prints information about command-line arguments to stdout, then exits. This option is intended primarily for use in wrapper scripts. The first line of output contains the name of the output file specified with the -o option, or - (for stdout) if no output file was specified. The remaining lines contain the command-line arguments, one per line, in the order they appear. These do not include regular pandoc options and their arguments, but do include any options appearing after a -- separator at the end of the line.
For bidirectional documents, native pandoc spans and divs with the dir attribute (value rtl or ltr) can be used to override the base direction in some output formats. This may not always be necessary if the final renderer (e.g. the browser, when generating HTML) supports the Unicode Bidirectional Algorithm.
In LaTeX, smart means to use the standard TeX ligatures for quotation marks (`` and '' for double quotes, ` and ' for single quotes) and dashes (-- for en-dash and --- for em-dash). If smart is disabled, then in reading LaTeX pandoc will parse these characters literally. In writing LaTeX, enabling smart tells pandoc to use the ligatures when possible; if smart is disabled pandoc will use Unicode quotation mark and dash characters.
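To make the ligature table concrete, here is a tiny Python sketch that maps the TeX ligatures listed above to the Unicode characters they stand for. It is a naive string replacement for illustration only, not how pandoc parses LaTeX. The name tex_ligatures_to_unicode is made up, and a lone ' is assumed to be a closing quote or apostrophe.

```python
# Longer ligatures come first so that "---" is not consumed as "--" plus "-".
LIGATURES = [
    ("---", "\u2014"),  # em-dash
    ("--", "\u2013"),   # en-dash
    ("``", "\u201c"),   # opening double quote
    ("''", "\u201d"),   # closing double quote
    ("`", "\u2018"),    # opening single quote
    ("'", "\u2019"),    # closing single quote or apostrophe
]


def tex_ligatures_to_unicode(text):
    """Replace TeX quote and dash ligatures with Unicode characters."""
    for ligature, character in LIGATURES:
        text = text.replace(ligature, character)
    return text


print(tex_ligatures_to_unicode("``Hi'' -- it's pages 1--2 --- ok"))
```

The ordering of the table is the only subtle part: replacing the shorter ligatures first would break the longer ones apart.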
When converting from docx, read all docx styles as divs (for paragraph styles) and spans (for character styles) regardless of whether pandoc understands the meaning of these styles. This can be used with docx custom styles. Disabled by default.
In addition to standard indented code blocks, pandoc supports fenced code blocks. These begin with a row of three or more tildes (~~~) and end with a row of tildes that must be at least as long as the starting row. Everything between these lines is treated as code. No indentation is necessary:
Here mycode is an identifier, haskell and numberLines are classes, and startFrom is an attribute with value 100. Some output formats can use this information to do syntax highlighting. Currently, the only output formats that use this information are HTML, LaTeX, Docx, Ms, and PowerPoint. If highlighting is supported for your output format and language, then the code block above will appear highlighted, with numbered lines. (To see which languages are supported, type pandoc --list-highlight-languages.) Otherwise, the code block above will appear as follows:
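The identifier, classes, and attribute described above can be illustrated with a small parser. The sketch below assumes the attributes were written as {#mycode .haskell .numberLines startFrom="100"}, which matches that description, and returns them in the [identifier, classes, key-value pairs] shape pandoc uses in its JSON output. It is a simplified illustration, not pandoc's parser, and parse_attributes is a made-up name.

```python
import re

# One token is "#identifier", ".class", or key=value (value optionally quoted).
TOKEN = re.compile(
    r'#(?P<id>[^\s}]+)'
    r'|\.(?P<cls>[^\s}]+)'
    r'|(?P<key>[^\s=}]+)=(?:"(?P<quoted>[^"]*)"|(?P<bare>[^\s}]+))'
)


def parse_attributes(text):
    """Split a {...} attribute string into [identifier, classes, pairs]."""
    identifier, classes, pairs = "", [], []
    for match in TOKEN.finditer(text.strip().strip("{}")):
        if match.group("id"):
            identifier = match.group("id")
        elif match.group("cls"):
            classes.append(match.group("cls"))
        else:
            quoted = match.group("quoted")
            value = quoted if quoted is not None else match.group("bare")
            pairs.append([match.group("key"), value])
    return [identifier, classes, pairs]


print(parse_attributes('{#mycode .haskell .numberLines startFrom="100"}'))
```

A filter or writer that understands these attributes can then pick the language from the classes and the starting line number from the startFrom pair.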