Jul 24, 2025

Vendored Markdown - turning MDX into Markdown for AI consumption

When picking a format for your content to be best consumed by AI, you are likely going to pick Markdown!

It’s plain text and the lack of markup - like <html> or { "key": "value" } - makes it cheaper on input tokens.

# Hello

- Unordered lists

1. Numbered lists

**Strong**

However, in the Cloudflare Docs, we use MDX, where components and even JavaScript expressions are used heavily.

This makes it easier for us to write content but means the source foo.mdx file depends on other files, components and build processes that AI isn’t aware of.

Components

import { Render } from "~/components";

<Render
  file="simple-props"
  params={{
    name: "world",
  }}
/>

In this case, our Render component lets you import reusable Markdown snippets. This means our MDX source isn’t always representative of the final content shown on the HTML page, so on its own it isn’t very helpful for AI.

Interactive elements that have no Markdown representation

A very common pattern is tabs: panels of information that are hidden from the user until they click on a tab.

This is powered by JavaScript and so, in a plain-text format like Markdown, doesn’t really exist. If we run our tabs component through an off-the-shelf tool like turndown then we’ll get this:

- [One](#tab-panel-6)
- [Two](#tab-panel-7)

One Content

Two Content

The anchor links that used to swap panels now appear as a plain list, and nothing indicates which panel each piece of content belongs to.

Transforming custom components into plain HTML

I created a custom rehype plugin which is responsible for:

Removing non-content tags (script, style, link, etc) via a tags allowlist

const ALLOWED_ELEMENTS = [
  // Content sectioning
  "address",
  "article",
  "aside",
  "footer",
  "header",
  "h1",
  "h2",
  "h3",
  // ...

Transforming custom elements like starlight-tabs into standard unordered lists

if (tag === "starlight-tabs") {
    const tabs = selectAll('[role="tab"]', element);
    const panels = selectAll('[role="tabpanel"]', element);

    element.tagName = "ul";
    element.properties = {};
    element.children = [];

    for (const tab of tabs) {
        // ...

Adapting our Expressive Code codeblocks HTML to the HTML that CommonMark expects

const language = element.properties.dataLanguage;
if (!language) return;

const code = element.children.find(
  (child) => child.type === "element" && child.tagName === "code",
);
if (!code) return;

(code as Element).properties.className = [`language-${language}`];
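To make the allowlist step more concrete, here is a minimal, self-contained sketch of the idea. It is not the actual plugin: the node types are simplified stand-ins for hast nodes, and the shortened `ALLOWED` set and the `filterElements` helper are hypothetical names used only for illustration.

```typescript
// Simplified stand-ins for hast nodes (assumption: the real plugin uses the hast types).
type TextNode = { type: "text"; value: string };
type ElementNode = {
  type: "element";
  tagName: string;
  children: Array<ElementNode | TextNode>;
};
type TreeNode = ElementNode | TextNode;

// Hypothetical, shortened allowlist for illustration only.
const ALLOWED = new Set(["article", "h1", "p", "ul", "li", "a"]);

// Return a copy of the tree in which every element that is not on the
// allowlist is dropped together with its whole subtree.
function filterElements(node: ElementNode): ElementNode {
  const children: TreeNode[] = [];
  for (const child of node.children) {
    if (child.type === "text") {
      children.push(child);
    } else if (ALLOWED.has(child.tagName)) {
      children.push(filterElements(child));
    }
  }
  return { ...node, children };
}

const tree: ElementNode = {
  type: "element",
  tagName: "article",
  children: [
    { type: "element", tagName: "script", children: [{ type: "text", value: "track()" }] },
    { type: "element", tagName: "p", children: [{ type: "text", value: "Hello" }] },
  ],
};

// The <script> element is removed; the <p> element survives.
const filtered = filterElements(tree);
```

An allowlist (rather than a blocklist) means any new, unknown tag is excluded by default, which is the safer failure mode when the goal is content-only output.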

Taking the Tabs example from the previous section and running it through our plugin will now give us a normal unordered list with the content properly associated with a given list item:

- One

  One Content

- Two

  Two Content
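The tabs transformation could look roughly like the sketch below. This is an illustration under several assumptions rather than the real plugin code: the node types are simplified stand-ins for hast, `findByRole` is a tiny hypothetical replacement for a selector library, the input markup is a guess at what a tabs component might render, and tabs are paired with panels purely by position.

```typescript
// Simplified stand-ins for hast nodes (assumption, not the real hast types).
type TextNode = { type: "text"; value: string };
type ElementNode = {
  type: "element";
  tagName: string;
  properties: Record<string, string>;
  children: Array<ElementNode | TextNode>;
};

// Collect all descendant elements with the given `role` (hypothetical helper).
function findByRole(node: ElementNode, role: string): ElementNode[] {
  const found: ElementNode[] = [];
  for (const child of node.children) {
    if (child.type !== "element") continue;
    if (child.properties.role === role) found.push(child);
    found.push(...findByRole(child, role));
  }
  return found;
}

// Concatenate all text beneath a node.
function textOf(node: ElementNode | TextNode): string {
  return node.type === "text" ? node.value : node.children.map(textOf).join("");
}

// Rewrite a tabs element in place as <ul><li>label + panel content</li>...</ul>.
function tabsToList(element: ElementNode): void {
  const tabs = findByRole(element, "tab");
  const panels = findByRole(element, "tabpanel");

  element.tagName = "ul";
  element.properties = {};
  element.children = [];

  tabs.forEach((tab, index) => {
    const panel = panels[index]; // assumption: the nth tab belongs to the nth panel
    element.children.push({
      type: "element",
      tagName: "li",
      properties: {},
      children: [
        { type: "text", value: textOf(tab).trim() },
        ...(panel ? panel.children : []),
      ],
    });
  });
}

// Small builders to keep the example input readable.
function makeElement(
  tagName: string,
  properties: Record<string, string>,
  children: Array<ElementNode | TextNode>,
): ElementNode {
  return { type: "element", tagName, properties, children };
}
const makeText = (value: string): TextNode => ({ type: "text", value });

// Assumed input markup, for illustration only.
const tabsElement = makeElement("starlight-tabs", {}, [
  makeElement("ul", { role: "tablist" }, [
    makeElement("li", {}, [makeElement("a", { role: "tab", href: "#tab-panel-6" }, [makeText("One")])]),
    makeElement("li", {}, [makeElement("a", { role: "tab", href: "#tab-panel-7" }, [makeText("Two")])]),
  ]),
  makeElement("div", { role: "tabpanel" }, [makeElement("p", {}, [makeText("One Content")])]),
  makeElement("div", { role: "tabpanel" }, [makeElement("p", {}, [makeText("Two Content")])]),
]);

tabsToList(tabsElement);
```

Because each panel's children are moved inside the matching list item, an HTML-to-Markdown converter can then emit the content indented under its label, as in the output above. Nested tabs would need extra handling that this sketch does not attempt.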

Saving on tokens

Most AI pricing is based on input and output tokens, and our approach greatly reduces the number of input tokens required.

For example, let’s take a look at the number of tokens required for the Workers Get Started page, measured with OpenAI’s tokenizer:

HTML: 15,229 tokens
turndown: 3,401 tokens (4.48x fewer than HTML)
index.md: 2,110 tokens (7.22x fewer than HTML)

When providing our content to AI, we can see a real-world ~7x saving in input token cost.
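For anyone wanting to reproduce the multipliers, they are simply the HTML token count divided by each alternative's count. A small sketch using the figures quoted above (the `reductionFactor` helper is a hypothetical name):

```typescript
// Token counts as quoted in this post.
const tokenCounts = { html: 15229, turndown: 3401, indexMd: 2110 };

// How many times smaller `count` is than `baseline`, rounded to two decimals.
function reductionFactor(baseline: number, count: number): string {
  return (baseline / count).toFixed(2);
}

const turndownFactor = reductionFactor(tokenCounts.html, tokenCounts.turndown); // "4.48"
const indexMdFactor = reductionFactor(tokenCounts.html, tokenCounts.indexMd); // "7.22"

console.log(`turndown: ${turndownFactor}x, index.md: ${indexMdFactor}x`);
```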

Conclusion

If you would like to take a look at all of the code involved:

filter-elements.ts plugin which has an element allowlist & transforms component markup
markdown.ts where we turn an HTML document into a Markdown file with frontmatter

This MDX -> HTML -> MD pipeline powers our llms-full.txt files and our index.md files.

We actually started a “How we docs” series of pages in the Cloudflare Docs & one of them is about our other approaches to AI consumability - take a look!
