Vendored Markdown - turning MDX into Markdown for AI consumption
When picking a format for your content to be best consumed by AI, you are likely going to pick Markdown!
It’s plain text and the lack of markup - like <html> or { "key": "value" } - makes it cheaper on input tokens.
# Hello
- Unordered lists
1. Numbered lists
**Strong**
However, in Cloudflare’s docs, we use MDX, where components and even JavaScript expressions are used heavily.
This makes it easier for us to write content but means the source foo.mdx file depends on other files, components and build processes that AI isn’t aware of.
Components
import { Render } from "~/components";
<Render file="simple-props" params={{ name: "world", }}/>
In this case, our Render component lets you import reusable Markdown snippets. This means our MDX source isn’t always representative of the final content shown on the HTML page & isn’t too helpful for AI.
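To make the gap between source and output concrete, here is a minimal sketch of what a Render-style expansion might look like. It is not the real component: the in-memory `partials` map, the `{props.name}` placeholder syntax and the `renderPartial` helper are all assumptions made for illustration.

```typescript
// Hypothetical in-memory stand-in for the snippet files a Render-like
// component would load. The snippet text is invented for this example.
const partials: Record<string, string> = {
  "simple-props": "Hello {props.name}!",
};

// Replace each {props.key} placeholder with the matching param value.
// Unknown keys are left untouched so a missing param stays visible.
function renderPartial(file: string, params: Record<string, string>): string {
  const source = partials[file];
  if (source === undefined) {
    throw new Error(`Unknown partial: ${file}`);
  }
  return source.replace(
    /\{props\.(\w+)\}/g,
    (match: string, key: string) => params[key] ?? match,
  );
}

const rendered = renderPartial("simple-props", { name: "world" });
console.log(rendered); // Hello world!
```

The MDX source only contains the `<Render />` call, while the text a reader sees exists only after an expansion step like this has run.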
Interactive elements that have no Markdown representation
A very common pattern is tabs, panels of information that are hidden from the user until they click on a tab.
This is powered by JavaScript and so, in a plain-text format like Markdown, doesn’t really exist. If we run our tabs component through an off-the-shelf tool like turndown then we’ll get this:
- [One](#tab-panel-6)
- [Two](#tab-panel-7)
One Content
Two Content
The anchor links which used to swap panels now appear in a list & the content has nothing signifying which panel it is related to.
Transforming custom components into plain HTML
I created a custom rehype plugin which:
- Removes non-content tags (script, style, link, etc.) via a tag allowlist:
const ALLOWED_ELEMENTS = [
  // Content sectioning
  "address",
  "article",
  "aside",
  "footer",
  "header",
  "h1",
  "h2",
  "h3",
  // ...
- Transforms custom elements like starlight-tabs into standard unordered lists:
if (tag === "starlight-tabs") {
  const tabs = selectAll('[role="tab"]', element);
  const panels = selectAll('[role="tabpanel"]', element);

  element.tagName = "ul";
  element.properties = {};
  element.children = [];

  for (const tab of tabs) {
    // ...
- Adapts our Expressive Code code block HTML to the HTML that CommonMark expects:
const language = element.properties.dataLanguage;
if (!language) return;

const code = element.children.find(
  (child) => child.type === "element" && child.tagName === "code",
);
if (!code) return;

(code as Element).properties.className = [`language-${language}`];
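As a rough illustration of the first and third steps, the sketch below filters a tiny tree against an allowlist and then tags a code block with its language. It defines its own minimal hast-like types so it runs without any packages. The shortened `ALLOWED` set, the `filterTree` and `tagCodeLanguage` names, and the choice to drop a disallowed element together with its whole subtree are assumptions, not taken from the real plugin.

```typescript
// Minimal hast-like node types, defined inline so the sketch runs
// without the real hast/unified packages.
type TextNode = { type: "text"; value: string };
type ElementNode = {
  type: "element";
  tagName: string;
  properties: Record<string, unknown>;
  children: TreeNode[];
};
type TreeNode = TextNode | ElementNode;

// Hypothetical, heavily shortened allowlist for this example only.
const ALLOWED = new Set(["article", "h1", "p", "pre", "code"]);

// Return a copy of the tree without elements that are not allowlisted.
// Assumption: a disallowed element is dropped together with its subtree.
function filterTree(node: ElementNode): ElementNode {
  const children: TreeNode[] = [];
  for (const child of node.children) {
    if (child.type === "text") {
      children.push(child);
    } else if (ALLOWED.has(child.tagName)) {
      children.push(filterTree(child));
    }
  }
  return { ...node, properties: { ...node.properties }, children };
}

// Copy the data-language value of a <pre> onto its <code> child as a
// language-* class, which is the shape CommonMark-style HTML uses.
function tagCodeLanguage(pre: ElementNode): void {
  const language = pre.properties.dataLanguage;
  if (typeof language !== "string") return;
  const code = pre.children.find(
    (child): child is ElementNode =>
      child.type === "element" && child.tagName === "code",
  );
  if (!code) return;
  code.properties.className = [`language-${language}`];
}

const page: ElementNode = {
  type: "element",
  tagName: "article",
  properties: {},
  children: [
    { type: "element", tagName: "script", properties: {}, children: [] },
    {
      type: "element",
      tagName: "pre",
      properties: { dataLanguage: "ts" },
      children: [
        {
          type: "element",
          tagName: "code",
          properties: {},
          children: [{ type: "text", value: "let x = 1;" }],
        },
      ],
    },
  ],
};

const cleaned = filterTree(page);
const pre = cleaned.children[0];
if (pre && pre.type === "element") tagCodeLanguage(pre);
```

After this runs, `cleaned` contains only the `pre` element, and its `code` child carries the class `language-ts`.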
Taking the Tabs example from the previous section and running it through our plugin will now give us a normal unordered list with the content properly associated with a given list item:
- One

  One Content

- Two

  Two Content
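The tabs transformation can be sketched end to end on a simplified tree. This is an approximation rather than the plugin itself: `findByRole` stands in for the `selectAll` calls, the tab markup is reduced to the bare minimum, and pairing each tab with a panel by position is an assumption, since the original loop body is elided.

```typescript
// Minimal hast-like node types, defined inline so the sketch is self-contained.
type TextNode = { type: "text"; value: string };
type ElementNode = {
  type: "element";
  tagName: string;
  properties: Record<string, unknown>;
  children: TreeNode[];
};
type TreeNode = TextNode | ElementNode;

const text = (value: string): TextNode => ({ type: "text", value });
const el = (
  tagName: string,
  properties: Record<string, unknown>,
  children: TreeNode[],
): ElementNode => ({ type: "element", tagName, properties, children });

// Collect descendants with the given role, in document order.
// Stand-in for selectAll('[role="..."]', element).
function findByRole(node: ElementNode, role: string): ElementNode[] {
  const found: ElementNode[] = [];
  for (const child of node.children) {
    if (child.type !== "element") continue;
    if (child.properties.role === role) found.push(child);
    found.push(...findByRole(child, role));
  }
  return found;
}

// Concatenate all text beneath a node.
function textOf(node: TreeNode): string {
  return node.type === "text" ? node.value : node.children.map(textOf).join("");
}

// Rewrite <starlight-tabs> in place as a <ul> with one <li> per tab.
// Assumption: the nth tab belongs to the nth panel.
function tabsToList(element: ElementNode): void {
  if (element.tagName !== "starlight-tabs") return;
  const tabs = findByRole(element, "tab");
  const panels = findByRole(element, "tabpanel");
  element.tagName = "ul";
  element.properties = {};
  element.children = tabs.map((tab, index): ElementNode => {
    const panel = panels[index];
    return el("li", {}, [
      text(textOf(tab).trim()),
      ...(panel ? panel.children : []),
    ]);
  });
}

// Simplified tab markup, not the exact HTML the real component emits.
const tabsEl = el("starlight-tabs", {}, [
  el("ul", { role: "tablist" }, [
    el("li", {}, [el("a", { role: "tab", href: "#tab-panel-6" }, [text("One")])]),
    el("li", {}, [el("a", { role: "tab", href: "#tab-panel-7" }, [text("Two")])]),
  ]),
  el("div", { role: "tabpanel", id: "tab-panel-6" }, [text("One Content")]),
  el("div", { role: "tabpanel", id: "tab-panel-7" }, [text("Two Content")]),
]);

tabsToList(tabsEl);

// Print each list item as a label line followed by its indented content.
for (const item of tabsEl.children) {
  if (item.type !== "element") continue;
  const [label, ...content] = item.children;
  console.log(`- ${label ? textOf(label) : ""}`);
  console.log(`  ${content.map(textOf).join("")}`);
}
```

Because each panel's children are moved inside the list item for its tab, a generic HTML-to-Markdown converter no longer has to guess which content belongs to which label.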
Saving on tokens
Most AI pricing is based on input & output tokens, and our approach greatly reduces the number of input tokens required.
For example, let’s take a look at the number of tokens required for the Workers Get Started page using OpenAI’s tokenizer:
- HTML: 15,229 tokens
- turndown: 3,401 tokens (4.48x less than HTML)
- index.md: 2,110 tokens (7.22x less than HTML)
When providing our content to AI, we can see a real-world ~7x saving in input token cost.
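The multipliers above follow directly from the token counts. A small sketch of the arithmetic, where the `reductionFactor` and `percentSaved` helper names are just illustrative:

```typescript
// Token counts quoted above for the Workers Get Started page.
const tokenCounts = { html: 15229, turndown: 3401, indexMd: 2110 };

// How many times fewer tokens a format needs than the baseline.
function reductionFactor(baseline: number, tokens: number): number {
  return baseline / tokens;
}

// Share of the baseline's tokens that is saved, as a percentage.
function percentSaved(baseline: number, tokens: number): number {
  return (1 - tokens / baseline) * 100;
}

console.log(reductionFactor(tokenCounts.html, tokenCounts.turndown).toFixed(2)); // 4.48
console.log(reductionFactor(tokenCounts.html, tokenCounts.indexMd).toFixed(2)); // 7.22
console.log(percentSaved(tokenCounts.html, tokenCounts.indexMd).toFixed(1)); // 86.1
```

Put another way, index.md needs roughly 14% of the input tokens that the HTML page does.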
Conclusion
If you would like to take a look at all of the code involved:
- filter-elements.ts: the plugin, which has an element allowlist & transforms component markup
- markdown.ts: where we turn an HTML document into a Markdown file with frontmatter
This MDX -> HTML -> MD pipeline powers our llms-full.txt files and our index.md files.
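The last step of the pipeline, attaching frontmatter to the generated Markdown, can be approximated in a few lines. This is only a sketch: the `withFrontmatter` helper and the `title` field are hypothetical and are not taken from markdown.ts.

```typescript
// Prepend YAML frontmatter to a Markdown body.
// Values are written as JSON strings, which are also valid
// double-quoted YAML scalars, so quotes and colons are escaped safely.
function withFrontmatter(fields: Record<string, string>, body: string): string {
  const lines = Object.entries(fields).map(
    ([key, value]) => `${key}: ${JSON.stringify(value)}`,
  );
  return ["---", ...lines, "---", "", body.trim(), ""].join("\n");
}

// Hypothetical field and body, for illustration only.
const markdownFile = withFrontmatter({ title: "Get started" }, "# Get started\n");
console.log(markdownFile);
```

Keeping metadata in frontmatter means the body stays plain Markdown while details like the page title still travel with the file.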
We actually started a “How we docs” series of pages in the Cloudflare Docs & one of them is about our other approaches to AI consumability - take a look!