Using Markdown AST to migrate Cloudflare's documentation from Hugo shortcodes to Astro MDX


AST stands for Abstract Syntax Tree. An Abstract Syntax Tree represents only the details relevant to the structure of what you’re parsing. For example, a JavaScript AST doesn’t care whether you use single or double quotes when assigning a string literal, since that isn’t relevant to the syntax: both define a StringLiteral. Likewise, whether or not you end a declaration with a semicolon, the VariableDeclaration doesn’t change.

Code
const foo = "bar";
AST
{
"type": "File",
"start": 0,
"end": 18,
"loc": {
"start": {
"line": 1,
"column": 0,
"index": 0
},
"end": {
"line": 1,
"column": 18,
"index": 18
}
},
"range": [0, 18],
"errors": [],
"program": {
"type": "Program",
"start": 0,
"end": 18,
"loc": {
"start": {
"line": 1,
"column": 0,
"index": 0
},
"end": {
"line": 1,
"column": 18,
"index": 18
}
},
"range": [0, 18],
"sourceType": "module",
"interpreter": null,
"body": [
{
"type": "VariableDeclaration",
"start": 0,
"end": 18,
"loc": {
"start": {
"line": 1,
"column": 0,
"index": 0
},
"end": {
"line": 1,
"column": 18,
"index": 18
}
},
"range": [0, 18],
"declarations": [
{
"type": "VariableDeclarator",
"start": 6,
"end": 17,
"loc": {
"start": {
"line": 1,
"column": 6,
"index": 6
},
"end": {
"line": 1,
"column": 17,
"index": 17
}
},
"range": [6, 17],
"id": {
"type": "Identifier",
"start": 6,
"end": 9,
"loc": {
"start": {
"line": 1,
"column": 6,
"index": 6
},
"end": {
"line": 1,
"column": 9,
"index": 9
},
"identifierName": "foo"
},
"range": [6, 9],
"name": "foo"
},
"init": {
"type": "StringLiteral",
"start": 12,
"end": 17,
"loc": {
"start": {
"line": 1,
"column": 12,
"index": 12
},
"end": {
"line": 1,
"column": 17,
"index": 17
}
},
"range": [12, 17],
"extra": {
"rawValue": "bar",
"raw": "\"bar\""
},
"value": "bar"
}
}
],
"kind": "const"
}
],
"directives": []
},
"comments": [],
"tokens": [...snip...]
}

Why did we pick ASTs?

Well, we could use regular expressions or lots of replaceAll calls, but that would be very fragile. We have 4,813 Markdown files in the cloudflare-docs repository; they vary greatly and may have Hugo quirks.

For example, Hugo won’t choke on technically invalid tags such as <br>, whereas acorn (the parser used by MDX) expects a well-formed <br/>. Hugo shortcodes that have no closing tag and no inner content aren’t written as self-closing tags, whereas the equivalent MDX components will be.

Since there are lots of these small differences that we need to be mindful of, and also lots that we’re not bothered about, like spacing or other formatting aspects, transforming just the structure of the document is perfect.
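To make the fragility point concrete, here is a minimal sketch of fixing <br> through the tree rather than with a global string replace. The node shapes are hand-written and only loosely modelled on mdast, and fixBreaks is a hypothetical helper rather than code from the migration itself.

```typescript
// Hand-written node shapes loosely modelled on mdast; not the real `mdast` types.
type HtmlNode = { type: "html"; value: string };
type CodeNode = { type: "code"; lang: string | null; value: string };
type TextNode = { type: "text"; value: string };
type ParagraphNode = { type: "paragraph"; children: DocNode[] };
type RootNode = { type: "root"; children: DocNode[] };
type DocNode = HtmlNode | CodeNode | TextNode | ParagraphNode | RootNode;

// Hypothetical helper: rewrite bare <br> tags to <br/>, but only inside html nodes.
function fixBreaks(node: DocNode): void {
  if (node.type === "html") {
    node.value = node.value.replace(/<br\s*>/g, "<br/>");
  }
  if ("children" in node) {
    for (const child of node.children) {
      fixBreaks(child);
    }
  }
}

const tree: RootNode = {
  type: "root",
  children: [
    {
      type: "paragraph",
      children: [
        { type: "text", value: "Line one" },
        { type: "html", value: "<br>" },
        { type: "text", value: "Line two" },
      ],
    },
    // A code sample that mentions <br> should be left exactly as written.
    { type: "code", lang: "html", value: "<p>Hello<br>World</p>" },
  ],
};

fixBreaks(tree);
```

A blanket replaceAll("<br>", "<br/>") over the file would also have rewritten the code sample, which is exactly the sort of collateral damage we want to avoid.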

mdast

So, mdast is just an Abstract Syntax Tree that represents the structure of a Markdown document! What structures can mdast represent?

  • Blockquote
  • Break
  • Code
  • Definition
  • Emphasis
  • Heading
  • Html
  • Image
  • ImageReference
  • InlineCode
  • Link
  • LinkReference
  • List
  • ListItem
  • Paragraph
  • Root
  • Strong
  • Text
  • ThematicBreak
  • YAML (with mdast-util-frontmatter)

We’re not interested in all of these, but there are a few that will make the migration a lot easier! Notably, we want to handle:

  • YAML
  • Text
    • We want to remove Hugo’s Go templating braces, {{ and }}
  • HTML
    • These will be our shortcodes, like <glossary>, and the invalid HTML we want to fix along the way, like <br>.
  • Image
  • Heading
    • We want to remove headings with a depth of one, since Starlight adds those automatically based on the frontmatter title.
  • Code
    • We need to lift our custom frontmatter solution to attributes on the opening code fence that Starlight expects.
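As an illustration of how small these per-node changes can be, the Heading rule above might be sketched like this. The types are simplified stand-ins for the real mdast ones, and dropTopLevelHeadings is a made-up name for this example.

```typescript
// Simplified stand-ins for the mdast Heading, Paragraph and Root types.
type TextLeaf = { type: "text"; value: string };
type Heading = { type: "heading"; depth: 1 | 2 | 3 | 4 | 5 | 6; children: TextLeaf[] };
type Paragraph = { type: "paragraph"; children: TextLeaf[] };
type Root = { type: "root"; children: (Heading | Paragraph)[] };

// Hypothetical helper: return a copy of the tree without depth-one headings.
function dropTopLevelHeadings(root: Root): Root {
  return {
    ...root,
    children: root.children.filter(
      (node) => !(node.type === "heading" && node.depth === 1),
    ),
  };
}

const page: Root = {
  type: "root",
  children: [
    { type: "heading", depth: 1, children: [{ type: "text", value: "Get started" }] },
    { type: "heading", depth: 2, children: [{ type: "text", value: "Prerequisites" }] },
    { type: "paragraph", children: [{ type: "text", value: "Install the CLI." }] },
  ],
};

const migrated = dropTopLevelHeadings(page);
```

Returning a new root instead of mutating the input is just a choice for this sketch; mutating the children array in place would work equally well inside a tree walker.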

Whilst this blog post won’t cover them all, let’s take a look at Code nodes.

Code

interface Code <: Literal {
  type: 'code'
  lang: string?
  meta: string?
}

Our code snippets will be moving from an in-house design using Prism and frontmatter within the codeblock to Expressive Code. With Expressive Code, configuring a given code block’s title, line markers (i.e. highlights) and line-wrapping happens on the opening code fence.

Hugo
```js
---
title: index.js
highlight: 1
---
const foo = "bar";
```
Expressive Code
```js title="index.js" {1}
const foo = "bar";
```

With mdast, a Hugo code block like this one (here with the title Example Worker) is represented like this:

{
"type": "root",
"children": [
{
"type": "code",
"lang": "js",
"meta": null,
"value": "---\ntitle: Example Worker\nhighlight: 1\n---\nconst foo = \"bar\";",
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 7,
"column": 4,
"offset": 71
}
}
}
],
"position": {
"start": {
"line": 1,
"column": 1,
"offset": 0
},
"end": {
"line": 7,
"column": 4,
"offset": 71
}
}
}

The real power here lies in being able to convert the AST back to Markdown after we’re done. What we will do here is move the attributes out of the code block’s frontmatter and into their Expressive Code (EC) equivalents in the node’s meta property.

astray is an awesome tool by someone who used to be on our very own Developer Relations team, Luke Edwards. In fact, this is the very same tool (and approach) that he used during the migration from Gatsby to Hugo!

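import { fromMarkdown } from "mdast-util-from-markdown";
import { toMarkdown } from "mdast-util-to-markdown";
import * as astray from "astray";
import type * as MDAST from "mdast";
import fm from "front-matter";

// Shape of the frontmatter we expect inside a Hugo code block.
type CodeAttributes = { title?: string; highlight?: string | number };

const markdown = await Bun.file("example.md").text();
/*
```js
---
title: Example Worker
highlight: 1
---
const foo = "bar";
```
*/

const AST = fromMarkdown(markdown);

astray.walk<MDAST.Root, void, any>(AST, {
  code(node: MDAST.Code) {
    // Split the in-block frontmatter from the code itself.
    const { attributes, body } = fm<CodeAttributes>(node.value);
    const { title, highlight } = attributes;

    // Collect the parts first so a missing title can't leave "null" in the meta string.
    const meta: string[] = [];
    if (title) {
      meta.push(`title="${title}"`);
    }
    if (highlight) {
      meta.push(`{${highlight}}`);
    }
    if (meta.length > 0) {
      node.meta = meta.join(" ");
    }

    // Keep only the code, without the frontmatter.
    node.value = body;
    return;
  },
});

console.log(toMarkdown(AST));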
/*
```js title="Example Worker" {1}
const foo = "bar";
```
*/

Compare that with the alternative: finding opening code fences, keeping track of when we’ve seen an opening fence but not yet its closing fence, and then replacing the opening fence line with our new string. Instead, we can rely on mdast to give us an AST, use astray to walk it, use front-matter to extract the attributes and write them to the meta property of the code node, and then use mdast-util-to-markdown to write out our new Starlight-ready code block!
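If you’re curious what “walking” the tree amounts to, here is a rough, dependency-free sketch of the idea: visit each node and call a handler named after its type. This is not astray’s implementation or API; SketchNode, Visitors and walkSketch are names invented for this example.

```typescript
// Illustrative node shape; the real types live in the `mdast` package.
interface SketchNode {
  type: string;
  children?: SketchNode[];
  [key: string]: unknown;
}

// Hypothetical visitor map: one optional callback per node type.
type Visitors = Partial<Record<string, (node: SketchNode) => void>>;

// Depth-first walk: call the matching visitor (if any), then recurse into children.
function walkSketch(node: SketchNode, visitors: Visitors): void {
  const visit = visitors[node.type];
  if (visit) {
    visit(node);
  }
  for (const child of node.children ?? []) {
    walkSketch(child, visitors);
  }
}

const doc: SketchNode = {
  type: "root",
  children: [
    { type: "heading", depth: 2, children: [{ type: "text", value: "Example" }] },
    { type: "code", lang: "js", meta: null, value: 'const foo = "bar";' },
  ],
};

// Record the order in which the visitors fire.
const seen: string[] = [];

walkSketch(doc, {
  code(node) {
    node.meta = 'title="index.js"';
    seen.push("code");
  },
  text(node) {
    seen.push(`text:${String(node.value)}`);
  },
});
```

Because handlers are keyed by node type, each migration rule stays small and independent, which is what made it practical to apply across thousands of files.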
