Why Markdown is the Perfect Format for OCR
You have a scanned PDF, a photo of a contract, or a file you cannot select text from. Retyping is the old option. OCR is the faster one, but many tools return one big plain-text blob and throw away headings, lists, and tables.
OCRMD outputs Markdown instead. This post covers what Markdown is, why we use it for OCR, and the syntax you will see in exported files.
What is Markdown? A Brief History
Markdown is a lightweight markup language John Gruber released in 2004, with help from Aaron Swartz. The idea is simple: write in plain text that humans can read, then convert it to HTML when you need to publish.
Technical Specifications
| Specification | Details |
|---|---|
| Internet Media Type | text/markdown |
| Uniform Type Identifier | net.daringfireball.markdown |
| Type of Format | Open file format |
| Filename Extensions | .md, .markdown |
| Latest Release | 1.0.1 (December 17, 2004) |
| Extended To | Pandoc, MultiMarkdown, Markdown Extra, CommonMark, RMarkdown |
The core philosophy behind Markdown is readability. A Markdown file should be publishable as-is, as plain text, without looking like it's been cluttered with tags or formatting instructions. This is a stark contrast to heavier languages like HTML or RTF, which are difficult for humans to read in their raw state.
Over the years, Markdown's popularity has led to the development of several variants, or "flavors," that extend its core functionality. Extensions like GitHub Flavored Markdown (GFM) and Markdown Extra add support for features like tables, footnotes, and fenced code blocks. In 2014, a standardization effort called CommonMark was launched to create an unambiguous specification for Markdown, and it has since been adopted by platforms like GitHub, Reddit, and Stack Overflow.
Why Markdown fits OCR output
Most OCR tools return plain text. Headings, lists, bold text, and tables collapse into one block.
OCRMD maps structure in your source file to Markdown: headings become # lines, lists keep their bullets, tables stay tables.
Here's why that matters:
- Structure is preserved: A heading in your PDF becomes `# My Header`. A list stays a list.
- Editable immediately: Change text, add list items, or fix a table without rebuilding layout from scratch.
- Works everywhere: Notes apps, static site generators, and CMS tools all speak Markdown.
- Less cleanup: Skip the hour of reformatting after every scan.
Markdown output turns a static scan into something you can search, edit, and reuse.
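The mapping from page structure to Markdown can be sketched in a few lines of Python. This is an illustration only, not OCRMD's actual implementation: the `Block` type, its `kind` values, and `blocks_to_markdown` are hypothetical names, and the sketch assumes an earlier layout step has already classified each region as a heading, paragraph, list, or table (with at least a header row).

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One region detected on the page (hypothetical structure)."""
    kind: str                                    # "heading", "paragraph", "list", or "table"
    text: str = ""                               # used by headings and paragraphs
    level: int = 1                               # heading level, 1-6
    items: list = field(default_factory=list)    # list items
    rows: list = field(default_factory=list)     # table rows; the first row is the header


def blocks_to_markdown(blocks):
    """Render classified blocks as Markdown, separated by blank lines."""
    parts = []
    for block in blocks:
        if block.kind == "heading":
            parts.append("#" * block.level + " " + block.text)
        elif block.kind == "list":
            parts.append("\n".join("- " + item for item in block.items))
        elif block.kind == "table":
            header, *body = block.rows
            lines = ["| " + " | ".join(header) + " |",
                     "|" + "|".join("---" for _ in header) + "|"]
            lines += ["| " + " | ".join(row) + " |" for row in body]
            parts.append("\n".join(lines))
        else:
            # Anything else is treated as a plain paragraph
            parts.append(block.text)
    return "\n\n".join(parts)
```

Calling `blocks_to_markdown([Block("heading", text="My Header"), Block("list", items=["a", "b"])])` yields a `# My Header` line, a blank line, and a two-item bullet list.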
The Building Blocks: Markdown Basics
Markdown is incredibly easy to learn. Here are the fundamental elements.
- Headers: Use hash symbols (`#`) to create headings. The number of hashes corresponds to the heading level: `# Header 1`, `## Header 2`, `### Header 3`.
- Paragraphs: A paragraph is one or more consecutive lines of text, separated from the next paragraph by a blank line.
- Emphasis: Use single asterisks or underscores for italics (`*This text is italic.*`, `_This is also italic._`) and double ones for bold (`**This text is bold.**`, `__This is also bold.__`).
- Lists: Use numbers for ordered lists (`1. First item`, `2. Second item`) and asterisks, pluses, or hyphens for unordered lists (`- Bullet one`, `* Bullet two`, `+ Bullet three`).
- Links: Create a link by wrapping the link text in square brackets followed by the URL in parentheses: `[Visit our site](http://ocrmd.com)`.
- Images: Image syntax is similar to links but is prefixed with an exclamation mark: ``.
- Blockquotes: Use the greater-than sign (`>`), familiar from email, to create a blockquote: `> This is a quote.`
- Code: Wrap inline code in backticks (`), and indent a block of code with four spaces or a tab.
The Complete Guide: Markdown Syntax Deep Dive
This section walks through the core syntax in more detail, based on the official documentation.
Block Elements
Paragraphs and Line Breaks
A paragraph is one or more lines of text followed by a blank line. If you want to create a line break (<br>) without starting a new paragraph, end a line with two or more spaces and then press return.
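The two-trailing-spaces rule is easy to miss because the spaces are invisible. A minimal sketch of how a renderer might apply it is shown here; `render_paragraph` is a hypothetical helper written for this post, not part of any Markdown library, and it assumes the paragraph has already been split into lines.

```python
def render_paragraph(lines):
    """Join a paragraph's lines, emitting <br /> where a line ends in 2+ spaces."""
    out = []
    for i, line in enumerate(lines):
        last = i == len(lines) - 1
        if not last and line.endswith("  "):
            # Two or more trailing spaces before a newline: hard line break
            out.append(line.rstrip() + "<br />\n")
        else:
            # Otherwise the newline is kept as ordinary whitespace
            out.append(line.rstrip() + ("" if last else "\n"))
    return "<p>" + "".join(out) + "</p>"
```

For example, `render_paragraph(["Roses are red,  ", "violets are blue."])` puts a `<br />` after the first line, while the same input without the trailing spaces stays a single flowing paragraph.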
Headers
Markdown supports two styles of headers.
- Atx-style: Uses 1-6 hash marks at the start of the line, for example `# Header 1` or `## Header 2`.
- Setext-style: Underlines the text on the following line, with equal signs for H1 and hyphens for H2. `Header 1` followed by a line of `========` is an H1; `Header 2` followed by a line of `--------` is an H2.
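Both header styles can be recognised with a small amount of Python. The sketch below is a simplification rather than a full parser: `parse_headings` is a hypothetical helper, and it ignores edge cases such as headers inside code blocks or list items.

```python
import re

# 1-6 hashes, whitespace, the text, and optional closing hashes
ATX = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def parse_headings(markdown):
    """Return (level, text) pairs for Atx- and Setext-style headers."""
    headings = []
    lines = markdown.splitlines()
    for i, line in enumerate(lines):
        match = ATX.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
            continue
        # Setext: a non-empty text line followed by a line made only of = or -
        nxt = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if line.strip() and nxt and set(nxt) <= {"="}:
            headings.append((1, line.strip()))
        elif line.strip() and nxt and set(nxt) <= {"-"}:
            headings.append((2, line.strip()))
    return headings
```

A list of `(level, text)` pairs like this is enough to build a table of contents from an exported file.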
Blockquotes
Use > for blockquotes. They can be nested by adding more > symbols.
> This is a quote.
>
> > This is a nested quote.
Lists
Unordered lists use *, +, or -. Ordered lists use numbers followed by a period. The actual numbers you use don't matter; the HTML output will be a correctly ordered list. To create a multi-paragraph list item, indent the subsequent paragraphs by 4 spaces or one tab.
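To illustrate why the numbers in an ordered list do not matter, here is a rough sketch of a renderer that discards them. `render_ordered_list` is a hypothetical helper that handles only single-line items; it is not how any particular Markdown library is implemented.

```python
import re

# A number, a period, whitespace, then the item text
ITEM = re.compile(r"^\d+\.\s+(.*)$")


def render_ordered_list(lines):
    """Render '<number>. text' lines as an HTML ordered list, ignoring the numbers."""
    items = [m.group(1) for m in map(ITEM.match, lines) if m]
    return "<ol>\n" + "".join(f"<li>{item}</li>\n" for item in items) + "</ol>"
```

The inputs `["1. First", "2. Second", "3. Third"]` and `["1. First", "1. Second", "8. Third"]` produce identical HTML. Note that CommonMark-based renderers do use the first number as the list's starting value, so only the later numbers are ignored there.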
Code Blocks
To create a pre-formatted code block, indent every line of the block by at least 4 spaces or one tab. Ampersands (&) and angle brackets (<, >) are automatically converted to HTML entities.
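The entity conversion can be reproduced with Python's standard `html.escape`. The sketch below assumes the block's lines have already been collected; `render_code_block` is a hypothetical helper name.

```python
import html


def render_code_block(indented_lines):
    """Turn lines indented by 4 spaces or a tab into an HTML <pre><code> block."""
    stripped = []
    for line in indented_lines:
        if line.startswith("\t"):
            stripped.append(line[1:])       # remove one tab of indentation
        elif line.startswith("    "):
            stripped.append(line[4:])       # remove four spaces of indentation
        else:
            stripped.append(line)
    # quote=False escapes only &, <, and >, and leaves quotes alone
    body = html.escape("\n".join(stripped), quote=False)
    return "<pre><code>" + body + "\n</code></pre>"
```

Because of this escaping, HTML pasted into a code block is displayed as source instead of being interpreted by the browser.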
Horizontal Rules
Create a horizontal rule (<hr/>) by placing three or more hyphens, asterisks, or underscores on a line by themselves.
---
***
___
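The "three or more hyphens, asterisks, or underscores" rule can be expressed as one regular expression. This is a sketch under the assumption that spaces between the characters are allowed, as in the original Markdown syntax; `is_horizontal_rule` is a hypothetical helper.

```python
import re

# The same character (-, *, or _) at least three times, optionally separated by spaces
RULE = re.compile(r"^ {0,3}([-*_])(?: *\1){2,} *$")


def is_horizontal_rule(line):
    """Return True if the line consists only of a horizontal-rule marker."""
    return RULE.match(line) is not None
```

The backreference `\1` is what prevents mixed lines such as `-*-` from matching.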
Span Elements
Links
Markdown supports two link styles:
- Inline: `[An example](http://example.com/ "Title")`. The title is optional.
- Reference: `[An example][id]`. You then define the link elsewhere in the document: `[id]: http://example.com/ "Title"`. There is also an "implicit link name" shortcut. If the link text is the same as your reference name, you can write `[Google][]` and then define `[Google]: http://google.com/`.
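Reference links are resolved in two passes: collect the definitions, then substitute them into the text. The sketch below shows the idea in Python. It is a simplification with hypothetical names (`resolve_reference_links` and the two patterns), and it only handles double-quoted titles and definitions that sit on one line.

```python
import re

# [id]: url "optional title"
DEFINITION = re.compile(r'^\s*\[([^\]]+)\]:\s*(\S+)(?:\s+"([^"]*)")?\s*$')
# [link text][id] or [link text][]
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")


def resolve_reference_links(markdown):
    """Rewrite [text][id] links as inline links and drop the definition lines."""
    definitions = {}
    body = []
    for line in markdown.splitlines():
        match = DEFINITION.match(line)
        if match:
            label, url, title = match.groups()
            definitions[label.lower()] = (url, title)   # labels are case-insensitive
        else:
            body.append(line)

    def to_inline(match):
        text, label = match.groups()
        key = (label or text).lower()       # [Google][] falls back to the link text
        if key not in definitions:
            return match.group(0)           # leave unknown references untouched
        url, title = definitions[key]
        suffix = f' "{title}"' if title else ""
        return f"[{text}]({url}{suffix})"

    return REFERENCE.sub(to_inline, "\n".join(body))
```

Keeping URLs in definitions at the end of the document is what makes long paragraphs with many links stay readable as plain text.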
Emphasis
Use one asterisk or underscore for <em> (italic) and two for <strong> (bold).
*italic* _italic_
**bold** __bold__
Code
To indicate inline code, wrap it with backtick quotes (`).
To include a literal backtick inside inline code, use double backticks as the delimiters: ``There is a backtick (`) here.``
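When generating Markdown programmatically, the double-backtick rule generalises: pick a delimiter one backtick longer than the longest run inside the text. The following is a minimal sketch with a hypothetical helper name, `inline_code`.

```python
import re


def inline_code(text):
    """Wrap text in a backtick delimiter longer than any backtick run inside it."""
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    fence = "`" * (longest + 1)
    # Pad with spaces when the text itself starts or ends with a backtick,
    # so the delimiter and the content do not run together
    pad = " " if text.startswith("`") or text.endswith("`") else ""
    return fence + pad + text + pad + fence
```

Text without backticks gets the usual single-backtick wrapping, so the helper is safe to use for every inline code span.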
Images
Image syntax resembles link syntax but with a leading exclamation mark.
- Inline: ``
- Reference: `![Alt text][id]` with a definition elsewhere: `[id]: /path/to/img.jpg "Title"`
Miscellaneous Elements
Automatic Links
Markdown automatically turns URLs and email addresses into clickable links if you enclose them in angle brackets.
<http://example.com/>
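A rough sketch of the URL half of this rule, using a regular expression, is shown here. `render_autolinks` is a hypothetical helper; it covers only `http`, `https`, and `ftp` URLs and deliberately skips email addresses, which need extra handling.

```python
import re

# An absolute URL wrapped in angle brackets, with no spaces inside
AUTOLINK = re.compile(r"<((?:https?|ftp)://[^\s<>]+)>")


def render_autolinks(text):
    """Replace <url> with an HTML anchor whose link text is the URL itself."""
    return AUTOLINK.sub(r'<a href="\1">\1</a>', text)
```

Ordinary HTML tags such as `<div>` are left alone because they do not start with a URL scheme.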
Backslash Escapes
If you want to use a character that has special meaning in Markdown's syntax as a literal character, use a backslash escape. For example, `\*this is not italic\*` renders the text surrounded by literal asterisks instead of italicizing it.
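Escaping matters when text is converted automatically, because a scanned page can contain literal asterisks or underscores that should not turn into emphasis. Below is a minimal sketch of an escaping helper; `escape_markdown` is a hypothetical name, and it escapes every character on the original syntax page's list, which is more aggressive than strictly necessary.

```python
# Characters the original Markdown syntax page lists as backslash-escapable
SPECIAL = "\\`*_{}[]()#+-.!"


def escape_markdown(text):
    """Prefix every Markdown-special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)
```

Over-escaping is harmless to the rendered result, since a backslash before any of these characters is removed by the renderer.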
Inline HTML
For any markup not covered by Markdown's syntax, you can simply use HTML itself. For block-level elements like <table> or <div>, separate them from surrounding content with blank lines.
Freedom to Use: The Markdown License
Markdown is free software, available under a BSD-style open-source license. Anyone can use, modify, and build upon it with few restrictions beyond the license's attribution terms. This permissive license has fostered an ecosystem of tools, parsers, and editors, and it has left room for extensions and specifications such as CommonMark and GFM, so Markdown continues to evolve with the needs of its users.
Put it into Practice: Markdown Resources
Ready to try it out? Here are some great tools:
- Official Dingus: A simple web-based tool from John Gruber to test out basic Markdown syntax.
- Markdown Guide: A comprehensive resource for learning Markdown, including basic syntax, extended syntax, and best practices.
- Markpad: A free, beautiful, and more advanced online Markdown editor that provides a live preview as you type.
Conclusion
Markdown is plain text with enough structure to survive OCR. OCRMD uses it so you keep headings, lists, and tables instead of one undifferentiated string.
Sources
- Daring Fireball: The official home of Markdown by its creator, John Gruber.
- Wikipedia: For general history and facts about Markdown and its extensions.