Let's Write a JSON Parser From Scratch
Hi,
It's been a long time since I wrote something in this newsletter. Recently I've been learning about language parsing and abstract syntax trees, and after picking up the basics, I decided to write a JSON parser from scratch.
Parsing:
Parsing is the process of analysing the structure of a string (for example, the syntax of a programming language) in order to determine its meaning. Writing a parser for a full programming language is a very complex task, because programming languages generally have many keywords and syntax rules, and handling all of them can be overwhelming. JSON, on the other hand, has a very limited number of keywords and syntax rules, so writing a JSON parser is a relatively easy task.
Tokenization:
Tokenization is the step that happens before parsing. It means breaking a string of characters down into small, categorized units called tokens. The table below will give you a solid idea of what tokens look like.
JSON:
{
"name": "iPhone 6s",
"price": 649.99,
"isAvailable": true
}
Token        : Type
-------------:--------------
{            : BRACE_OPEN
name         : STRING
:            : COLON
iPhone 6s    : STRING
,            : COMMA
price        : STRING
:            : COLON
649.99       : NUMBER
,            : COMMA
isAvailable  : STRING
:            : COLON
true         : TRUE
}            : BRACE_CLOSE
Once tokenization breaks the string down into tokens, those tokens are handed to a parser, which creates an Abstract Syntax Tree. We will discuss that later in this post; the first step is to create a tokenizer.
Writing The Tokenizer:
Now let’s get into the code part where we tokenize a JSON string.
I wrote a function called Tokenize which takes a JSON string and returns a list of tokens. It loops through the string, character by character, and breaks it down into meaningful pieces like {, "key", :, or 123. These pieces are what we call tokens.
Here’s the full code, broken down step by step.
We'll first create some basic types for all the tokens.
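The full code is on GitHub; here's a minimal sketch of what these types could look like. The exact names (Token, TokenType, and the constants) are my assumptions, not necessarily the ones in the original repo. The imports cover all the snippets that follow.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// TokenType labels the category of a token.
type TokenType string

// These constants match the token types from the table above.
const (
	BraceOpen    TokenType = "BRACE_OPEN"
	BraceClose   TokenType = "BRACE_CLOSE"
	BracketOpen  TokenType = "BRACKET_OPEN"
	BracketClose TokenType = "BRACKET_CLOSE"
	Colon        TokenType = "COLON"
	Comma        TokenType = "COMMA"
	String       TokenType = "STRING"
	Number       TokenType = "NUMBER"
	True         TokenType = "TRUE"
	False        TokenType = "FALSE"
	Null         TokenType = "NULL"
)

// Token pairs a raw value with its type.
type Token struct {
	Type  TokenType
	Value string
}
```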
We start with a current pointer to keep track of where we are in the string. stringLength helps us not go out of bounds, and tokens is the slice where we'll collect all the tokens we generate.
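Something like this skeleton, with the loop body filled in step by step below (a sketch; only the Tokenize name is confirmed by the post):

```go
func Tokenize(input string) ([]Token, error) {
	current := 0               // where we are in the string
	stringLength := len(input) // keeps us from going out of bounds
	tokens := []Token{}        // the tokens we collect

	for current < stringLength {
		// each of the cases described in the sections below goes here
	}
	return tokens, nil
}
```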
Skipping Whitespace
We loop through the entire string. If we hit whitespace, we skip it, because whitespace outside of strings doesn't matter in JSON.
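Inside the loop, that check could look like this (a sketch):

```go
char := input[current]

// JSON whitespace is just spaces, tabs, newlines, and carriage returns.
if char == ' ' || char == '\t' || char == '\n' || char == '\r' {
	current++
	continue
}
```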
Switch through known single-character tokens
Then we handle all the simple symbols. These don't need much logic — just push them to the tokens list and move on.
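A sketch of that switch; each case appends a token and advances the pointer, and continue jumps back to the top of the loop:

```go
switch char {
case '{':
	tokens = append(tokens, Token{BraceOpen, "{"})
	current++
	continue
case '}':
	tokens = append(tokens, Token{BraceClose, "}"})
	current++
	continue
case '[':
	tokens = append(tokens, Token{BracketOpen, "["})
	current++
	continue
case ']':
	tokens = append(tokens, Token{BracketClose, "]"})
	current++
	continue
case ':':
	tokens = append(tokens, Token{Colon, ":"})
	current++
	continue
case ',':
	tokens = append(tokens, Token{Comma, ","})
	current++
	continue
}
```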
Handling Strings
When we encounter a ", we start reading a string. We look for the closing quote, while also making sure to skip escaped quotes like \". If the string is never closed, we throw an error. Otherwise, we extract the string and add it as a token.
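A sketch of the string case. It tracks a small escaped flag so that a \" inside the string doesn't end it:

```go
if char == '"' {
	end := current + 1
	escaped := false
	for end < stringLength {
		c := input[end]
		if escaped {
			escaped = false // the escaped character is part of the string
		} else if c == '\\' {
			escaped = true
		} else if c == '"' {
			break // found the real closing quote
		}
		end++
	}
	if end >= stringLength {
		return nil, fmt.Errorf("unterminated string at index %d", current)
	}
	// Note: the value keeps its escape sequences; unescaping can happen later.
	tokens = append(tokens, Token{String, input[current+1 : end]})
	current = end + 1
	continue
}
```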
Literals and Numbers
We check for true, false, and null first. If we see one of these keywords, we push it to the tokens list and jump ahead accordingly.
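A sketch of those checks, using strings.HasPrefix:

```go
if strings.HasPrefix(input[current:], "true") {
	tokens = append(tokens, Token{True, "true"})
	current += len("true")
	continue
}
if strings.HasPrefix(input[current:], "false") {
	tokens = append(tokens, Token{False, "false"})
	current += len("false")
	continue
}
if strings.HasPrefix(input[current:], "null") {
	tokens = append(tokens, Token{Null, "null"})
	current += len("null")
	continue
}
```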
Numbers (slightly tricky)
JSON numbers can get complex. They might contain decimals, negative signs, and exponential notation (like 1.2e+10). We carefully walk through each character to build the number string. I also added validations to reject bad formats like 00, multiple dots, or missing exponent digits.
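A sketch of the number case. It greedily consumes every character that can appear in a number, then hands the candidate to a validation helper (isValidNumber is a hypothetical name; the helper itself is sketched a bit further down):

```go
if char == '-' || (char >= '0' && char <= '9') {
	start := current
	current++ // consume the '-' or the first digit
	for current < stringLength &&
		strings.ContainsRune("0123456789.eE+-", rune(input[current])) {
		current++
	}
	value := input[start:current]
	if !isValidNumber(value) {
		return nil, fmt.Errorf("invalid number %q at index %d", value, start)
	}
	tokens = append(tokens, Token{Number, value})
	continue
}
```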
If nothing matches
If none of the above matched, the character is invalid in JSON — so we just throw an error.
Number validation helper
This helper is used to check for invalid leading zeros in numbers (like 00).
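A sketch of such a helper (the name isValidNumber is an assumption). It rejects leading zeros itself and leans on strconv.ParseFloat for the remaining rules, like multiple dots or missing exponent digits:

```go
func isValidNumber(s string) bool {
	body := strings.TrimPrefix(s, "-")
	// Reject leading zeros like "00" or "01" ("0" and "0.5" are fine).
	if len(body) > 1 && body[0] == '0' && body[1] != '.' {
		return false
	}
	// ParseFloat rejects the other bad shapes: "1..2", "1e", "--1", and so on.
	_, err := strconv.ParseFloat(s, 64)
	return err == nil
}
```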
Printer
This just prints the list of tokens in a nice readable format. Super handy when testing your tokenizer.
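A sketch of such a printer (the name and output format are my assumptions):

```go
func PrintTokens(tokens []Token) {
	for _, token := range tokens {
		fmt.Printf("%-12s : %s\n", token.Value, token.Type)
	}
}
```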
Tokenization Output
If we give the JSON below to our tokenizer, we will get the following output.
{
"name": "iPhone 6s",
"price": 649.99,
"isAvailable": true
}
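With the printer sketched above, the output would look roughly like this (the exact formatting depends on your printer):

```
{            : BRACE_OPEN
name         : STRING
:            : COLON
iPhone 6s    : STRING
,            : COMMA
price        : STRING
:            : COLON
649.99       : NUMBER
,            : COMMA
isAvailable  : STRING
:            : COLON
true         : TRUE
}            : BRACE_CLOSE
```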
Now our JSON is nicely tokenized, and each piece has been given its appropriate token type.
Complete Tokenizer Code: GitHub
Parsing & AST
We have now created a tokenizer that converts a JSON string into tokens. The next step is to create a parser that can convert these tokens into an abstract syntax tree. But first, let's understand what an abstract syntax tree is.
Abstract Syntax Tree (AST)
An AST is a tree structure that represents the syntactic structure of source code. To learn more about it, refer to this awesome article: https://dev.to/balapriya/abstract-syntax-tree-ast-explained-in-plain-english-1h38
Writing The Parser:
This is the base interface for all AST node types. Every node will implement the Type() method, which is a simple way to identify what kind of data (Object, Array, String, etc.) it holds.
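A sketch of that interface plus the node types used in the rest of this section (the concrete struct names, beyond Type() itself, are my assumptions):

```go
// ASTNode is the base interface for every node in the tree.
type ASTNode interface {
	Type() string
}

type StringNode struct{ Value string }
type NumberNode struct{ Value float64 }
type BoolNode struct{ Value bool }
type NullNode struct{}
type ObjectNode struct{ Value map[string]ASTNode }
type ArrayNode struct{ Value []ASTNode }

func (StringNode) Type() string { return "String" }
func (NumberNode) Type() string { return "Number" }
func (BoolNode) Type() string   { return "Boolean" }
func (NullNode) Type() string   { return "Null" }
func (ObjectNode) Type() string { return "Object" }
func (ArrayNode) Type() string  { return "Array" }
```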
This is the main function you call to parse the token stream (see the sketch after this list). It:
- Checks that there's anything to parse.
- Initializes a current pointer (used as an index into the tokens slice).
- Delegates to parseValue, which handles all the different types.
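A sketch of that entry point (the signature is an assumption):

```go
func Parse(tokens []Token) (ASTNode, error) {
	if len(tokens) == 0 {
		return nil, errors.New("no tokens to parse")
	}
	current := 0 // index into the tokens slice
	return parseValue(tokens, &current)
}
```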
Basic safety check: if the current token is past the end, return an error.
Now comes the type-checking (sketched below):
- String token → wrap it in a StringNode.
- Number → parse it into a float64.
- Booleans and null are direct mappings.
- Objects ({}) and arrays ([]) are delegated to specialized functions.
- Anything unexpected → throw an error.
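Put together, parseValue might look like this (a sketch):

```go
func parseValue(tokens []Token, current *int) (ASTNode, error) {
	if *current >= len(tokens) {
		return nil, errors.New("unexpected end of tokens")
	}
	token := tokens[*current]
	switch token.Type {
	case String:
		*current++
		return StringNode{token.Value}, nil
	case Number:
		*current++
		n, err := strconv.ParseFloat(token.Value, 64)
		if err != nil {
			return nil, err
		}
		return NumberNode{n}, nil
	case True:
		*current++
		return BoolNode{true}, nil
	case False:
		*current++
		return BoolNode{false}, nil
	case Null:
		*current++
		return NullNode{}, nil
	case BraceOpen:
		return parseObject(tokens, current)
	case BracketOpen:
		return parseArray(tokens, current)
	default:
		return nil, fmt.Errorf("unexpected token %q", token.Value)
	}
}
```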
To parse an object (sketched below), we:
- Skip the { token.
- Initialize an ObjectNode.
- Extract each key from the object and parse its value using the already written parseValue() function.
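A sketch of parseObject; it also expects commas between pairs and errors on an unterminated object:

```go
func parseObject(tokens []Token, current *int) (ASTNode, error) {
	*current++ // skip the '{' token
	obj := ObjectNode{Value: map[string]ASTNode{}}
	for *current < len(tokens) && tokens[*current].Type != BraceClose {
		key := tokens[*current]
		if key.Type != String {
			return nil, fmt.Errorf("expected string key, got %q", key.Value)
		}
		*current++ // move past the key
		if *current >= len(tokens) || tokens[*current].Type != Colon {
			return nil, errors.New("expected ':' after object key")
		}
		*current++ // move past the ':'
		value, err := parseValue(tokens, current)
		if err != nil {
			return nil, err
		}
		obj.Value[key.Value] = value
		// a comma separates key/value pairs
		if *current < len(tokens) && tokens[*current].Type == Comma {
			*current++
		}
	}
	if *current >= len(tokens) {
		return nil, errors.New("object was never closed with '}'")
	}
	*current++ // skip the '}' token
	return obj, nil
}
```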
To parse an array (sketched below), we:
- Start by skipping the [ token.
- Loop through all the values.
- Parse each value with parseValue and append it to the initialized array.
- Ensure the array ends correctly with ].
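A sketch of parseArray along the same lines:

```go
func parseArray(tokens []Token, current *int) (ASTNode, error) {
	*current++ // skip the '[' token
	arr := ArrayNode{}
	for *current < len(tokens) && tokens[*current].Type != BracketClose {
		value, err := parseValue(tokens, current)
		if err != nil {
			return nil, err
		}
		arr.Value = append(arr.Value, value)
		// a comma separates array values
		if *current < len(tokens) && tokens[*current].Type == Comma {
			*current++
		}
	}
	if *current >= len(tokens) {
		return nil, errors.New("array was never closed with ']'")
	}
	*current++ // skip the ']' token
	return arr, nil
}
```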
Output:
Below is the output we get when we parse the tokens of the JSON we used above.
{map[isAvailable:{true} name:{iPhone 6s} price:{649.99}]}
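That output is just Go's default formatting of the root object node. With the sketches above, a minimal driver like this would print something very close to it:

```go
func main() {
	input := `{"name": "iPhone 6s", "price": 649.99, "isAvailable": true}`
	tokens, err := Tokenize(input)
	if err != nil {
		panic(err)
	}
	node, err := Parse(tokens)
	if err != nil {
		panic(err)
	}
	fmt.Println(node) // default struct/map formatting of the AST
}
```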
Additional Challenge:
Try converting this AST node to a native Golang data structure and using it.
Let’s Connect
I'm always happy to make new connections.
GitHub | LinkedIn | Twitter | Instagram
Source Code
Final Thoughts
So that's how you can implement a simple JSON tokenizer and parser from scratch. If you have any suggestions or doubts, you can always comment below. Consider subscribing to my newsletter to get notified about new posts.