Convert a PhD thesis to markdown

Say you have your own little website, which you made as a hobby. And say you wrote a PhD thesis in LaTeX. Would it not be nice to also publish it on that website? How hard can it be?

As it turns out, not that hard, thanks to a wonderful tool called pandoc. Pandoc converts many document types into other document types. Since this website is made using Hugo, which uses markdown files, I can simply convert the LaTeX files to markdown using pandoc and paste them into my website.

There are some bumps in the road though, and I thought I would write them down, for myself and for whoever else might find them useful. Do note that some of the instructions below are quite Hugo-specific, so just skip those if you don't need them.

How to convert .tex to .md for Hugo

Before you start, install pandoc!

Start by running:

pandoc -t markdown-citations -s main.tex -o ./md/thesis.md --bibliography thesislib-manual.bib --citeproc --csl https://raw.githubusercontent.com/citation-style-language/styles/master/nature.csl

The .csl file I put in there determines what the citations will look like. You can swap in any other CSL style; the URL above points into the citation-style-language/styles repository on GitHub.

The -t markdown-citations argument switches off pandoc's citations extension for the output, so no markdown-style citations are left in the result, because Hugo does not support those :(
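If you would rather run the conversion from a script, a minimal Python sketch of the same command could look like this. The function names build_pandoc_command and convert are made up for illustration; the flags and file names are the ones from the command above, and the script assumes pandoc is installed and on your PATH.

```python
import subprocess
from pathlib import Path

# CSL style from the command above; any other style URL or local .csl file works too.
NATURE_CSL = (
    "https://raw.githubusercontent.com/citation-style-language/"
    "styles/master/nature.csl"
)


def build_pandoc_command(tex_file, out_file, bib_file, csl=NATURE_CSL):
    """Return the pandoc argument list for the LaTeX -> markdown conversion."""
    return [
        "pandoc",
        "-t", "markdown-citations",  # markdown output, citations extension switched off
        "-s",                        # standalone document
        str(tex_file),
        "-o", str(out_file),
        "--bibliography", str(bib_file),
        "--citeproc",                # resolve citations and build the bibliography
        "--csl", csl,
    ]


def convert(tex_file, out_file, bib_file, csl=NATURE_CSL):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    # Make sure the output folder (./md in the command above) exists first.
    Path(out_file).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pandoc_command(tex_file, out_file, bib_file, csl), check=True)
```

Calling convert("main.tex", "./md/thesis.md", "thesislib-manual.bib") should then be equivalent to the one-liner above.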

This is a good start, and really does the vast majority of the heavy lifting.

Now there is quite a bit of stuff that still needs (semi-)manual fixing. You will need a text editor with regex find-and-replace support, like VS Code or Notepad++. Below are the rest of the steps I took. It is messy, but should be clear, I think (?), so good luck!

  1. Replace ^X^ (where X is a number) with the following format: \\(^X\\). This is inline KaTeX math. If you want each citation to link to a separate reference page, use the patterns from step 3 instead.
  2. Because there is no coupling between the place of a citation and the bibliography, the numbering must be kept the same. So we just generate the list of entries in the right order and put it on a separate webpage:
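    1. First remove the csl tag: find and replace \{#ref-.+\} (regex!) with nothing (an empty replacement).
    2. Find+replace all {.csl-left-margin} and {.csl-right-inline} tags with nothing. Do not leave a space behind, or the bracket patterns in the next steps will no longer match.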
    3. Fix numbering: find+replace (regex mode in VS Code): (\[)([0-9]+\. )(\]), replace with $2, which is the second group in the search pattern.
    4. Remove starting square bracket: find+replace (regex) ([0-9]\. )\[ with $1.
    5. Remove trailing square bracket: find+replace (regex) \.\] with a single period (.).
    6. Remove leftover LaTeX commands, especially the extra line-break positions I manually inserted for names and such:
      • find+replace (not regex!): `\-`{=latex} with nothing.
      • find+replace (not regex!): `\enskip`{=latex} with a space.
      • find+replace (regex!): \[([^\]]+)\]\{\.nocase\} with $1
    7. That should fix 99% of the problems; the rest I fixed by manually going through the page as rendered in Hugo.
    8. You could make every entry a low-level heading (#####) on the reference page with a named anchor ({#name} after the entry). Then you can link from the text to the correct reference. To do that, you first need to make sure the entire reference is on one line (not split over multiple lines). To enforce that, run find+replace (with regex): ([0-9]+\. .+)\n(.+) into $1 $2, and repeat it until no more matches are found. To add the heading and anchor: find+replace (with regex): ([0-9]+)(\.) (.+) into $1$2 ##### $3 {#$1}.
  3. Now, in all the other files, each ref needs to become a link with an anchor, so {{< ref "document.md#anchor" >}} needs to be added. A link will look like: Blahblah[\\(^1\\)]({{< ref "refs.md#1" >}}). So the find and replace will be: \^([0-9]+)\^ into [\\\\(^{$1}\\\\)]({{< ref "refs.md#$1" >}}). However, this does NOT take into account references like 12-15 or 12,23,43, and each of those shapes needs its own pattern. These are the options:
    1. 32: from: \^([0-9]+)\^, into [\\\\(^{$1}\\\\)]({{< ref "refs.md#$1" >}})
    2. 32,33: from: \^([0-9]+),([0-9]+)\^ into: [\\\\(^{$1}\\\\)]({{< ref "refs.md#$1" >}})[\\\\(^{,$2}\\\\)]({{< ref "refs.md#$2" >}})
    3. 32,33,34: from: \^([0-9]+),([0-9]+),([0-9]+)\^ into: [\\\\(^{$1}\\\\)]({{< ref "refs.md#$1" >}})[\\\\(^{,$2}\\\\)]({{< ref "refs.md#$2" >}})[\\\\(^{,$3}\\\\)]({{< ref "refs.md#$3" >}})
    4. 34–36: from: \^([0-9]+)--([0-9]+)\^ into: [\\\\(^{$1}\\\\)]({{< ref "refs.md#$1" >}})[\\\\(^{-$2}\\\\)]({{< ref "refs.md#$2" >}})
    5. And the rest must just be done by hand :(
  4. Replace math: \$(.+?)\$ into \\\\($1\\\\). But it will need manual intervention for sure.
  5. References are weird. Inter-chapter references definitely need to be fixed manually. Remove the LaTeX leftovers ({reference-type=...) with find+replace (regex): \{(reference-type=)(.+\nreference=)(.+?)\}, replace with nothing, and \{reference-type="ref" reference=(.+?)\} with nothing (the tag is sometimes broken over two lines, sometimes not).
  6. Replace some of my custom latex stuff to do with equations:
    1. Replace \celsius with °C, \um with μm
    2. Look for the SI and num commands (\SI, \num), especially inside math, and replace them.
      1. Try: \\\\\(\\SI\{(.+?)\}\{(.+?)\}\\\\\) into $1 $2
      2. It is also smart to look for _: these sometimes need to be escaped, even in equations!
    3. * basically always needs to be escaped in equations!
  7. Figures:
    1. Replace them with the correct links; there is very little I can automate there, I am afraid.
    2. Note that in the caption, KaTeX, links and HTML tags are mostly unavailable, which is a massive pain, but ok.
    3. As a side note, a PDF file can be converted into an image using ImageMagick, with a command like convert -density 300 blag.pdf -resize 100% blag.png, or Ghostscript (easier): gs -dSAFER -r150 -sDEVICE=pngalpha -o new.png original.pdf
  8. References:
    1. Links to figures must be removed, but the number must stay:
      • from (\[)([0-9]+?)(\.)([0-9]+?)(\])(\(#.+?\)) into $4
    2. Other references look like this: {reference-type="ref" reference="fig:emergence"}b)., and need manual fixing (to other chapters etc.). Look for them with ctrl+f: {reference-type=
    3. Inter-chapter references can still be made! Just look for section and chapter and manually fix the references: [Chapter 2]({{< ref "2.metcon.md" >}})
  9. Tables: these should be okay, but check them to be sure.
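Steps 2 and 3 can also be scripted instead of clicked through in an editor. The Python sketch below is one possible way to do that, not the workflow described above. clean_bibliography applies the find-and-replace patterns from sub-steps 2.1 to 2.5 in order, and link_citations produces the same link format as step 3 but handles any mix of commas and ranges in one pass. The function names and the refs.md default are assumptions for illustration. Note that, like the editor patterns, it will also hit genuine numeric superscripts written as ^2^, so check the result.

```python
import re

# Matches pandoc superscript citations such as ^12^, ^12,23,43^ or ^12--15^.
CITATION = re.compile(r"\^(\d+(?:(?:,|--)\d+)*)\^")


def clean_bibliography(text):
    """Apply sub-steps 2.1 to 2.5: strip the CSL markup around each entry."""
    text = re.sub(r"\{#ref-.+?\}", "", text)          # 2.1: csl anchors
    for tag in ("{.csl-left-margin}", "{.csl-right-inline}"):
        text = text.replace(tag, "")                  # 2.2: csl span classes
    text = re.sub(r"\[([0-9]+\. )\]", r"\1", text)    # 2.3: "[1. ]" becomes "1. "
    text = re.sub(r"([0-9]\. )\[", r"\1", text)       # 2.4: opening bracket
    text = re.sub(r"\.\]", ".", text)                 # 2.5: closing bracket
    return text


def link_citations(text, ref_page="refs.md"):
    """Turn each numeric superscript citation into KaTeX superscripts linked to ref_page."""

    def replace(match):
        # Split "12,14--16" into ["12", ",", "14", "--", "16"].
        tokens = re.split(r"(,|--)", match.group(1))
        links = []
        separator = ""
        for token in tokens:
            if token in (",", "--"):
                # The separator is shown inside the next superscript; a range gets one dash.
                separator = "," if token == "," else "-"
                continue
            links.append(
                f'[\\\\(^{{{separator}{token}}}\\\\)]'
                f'({{{{< ref "{ref_page}#{token}" >}}}})'
            )
        return "".join(links)

    return CITATION.sub(replace, text)
```

For example, link_citations("Blahblah^1^") gives Blahblah[\\(^{1}\\)]({{< ref "refs.md#1" >}}), and a citation such as ^3,7--9^ becomes three linked superscripts (3, then ,7, then -9). As with the manual patterns, only the two ends of a range get a link.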
