String.Parser
Functions for turning unstructured strings into structured data.
Parsers
A Parser helps turn a String into nicely structured data. For example,
we can run the int parser to turn a String into an Int:
run int "123456" == Ok 123456
The cool thing is that you can combine Parser values to handle much more
complex scenarios.
Try a parser. Here are some examples using the keyword parser:
run (keyword "true") "true" == Ok {}
run (keyword "true") "True" == Err ...
run (keyword "true") "false" == Err ...
run (keyword "true") "true!" == Ok {}
Notice the last case! A Parser will chomp as much as possible and not worry
about the rest. Use the end parser to ensure you made it to the end
of the string!
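For instance, here is a sketch that pairs keyword with end (exactlyTrue is an illustrative name, not part of this module):

```
exactlyTrue : Parser {}
exactlyTrue =
    succeed {}
        |> skip (keyword "true")
        |> skip end

-- run exactlyTrue "true" == Ok {}
-- run exactlyTrue "true!" == Err ...
```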
Building Blocks
Parse integers.
run int "1" == Ok 1
run int "1234" == Ok 1234
run int "0123" == Ok 0
run int "123a" == Ok 123
run int "1.34" == Ok 1
run int "1e31" == Ok 1
run int "0x1A" == Ok 0
run int "-789" == Err ...
This parser does not look ahead, which is why "0123" and "123a" both succeed with just the digits they managed to parse.
If you want to handle a leading + or - you should do it with a custom
parser like this:
myInt : Parser Int
myInt =
oneOf
[ succeed negate
|> skip (token "-")
|> keep int
, int
]
Parse exactly the given string, without any regard to what comes next.
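For example, a quick sketch of how token differs from keyword on the same inputs:

```
run (token "let") "let"       == Ok {}
run (token "let") "letters"   == Ok {} -- token does not look ahead
run (keyword "let") "letters" == Err ...
```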
A potential pitfall when parsing keywords is getting tricked by variables that
start with a keyword, like let in letters or import in important. This
is especially likely if you have a whitespace parser that can consume zero
characters. So the keyword parser is defined with token and a
trick to peek ahead a bit:
keyword : String -> Parser {}
keyword kwd =
succeed identity
|> skip (backtrackable (token kwd))
|> keep (oneOf
[ map (\_ -> True) (backtrackable (chompIf isVarChar))
, succeed False
])
|> andThen (checkEnding kwd)
checkEnding : String -> Bool -> Parser {}
checkEnding kwd isBadEnding =
if isBadEnding then
problem ("expecting the `" ++ kwd ++ "` keyword")
else
commit {}
isVarChar : Char -> Bool
isVarChar char =
Char.isAlphaNum char || char == '_'
This definition is specially designed so that (1) if you really see let you
commit to that path and (2) if you see letters instead you can backtrack and
try other options. If I had just put a backtrackable around the whole thing
you would not get (1) anymore.
Parse keywords like let, when, and type.
run (keyword "let") "let" == Ok {}
run (keyword "let") "var" == Err ... (ExpectingKeyword "let") ...
run (keyword "let") "letters" == Err ... (ExpectingKeyword "let") ...
Note: Notice the third case there! keyword actually looks ahead one
character to make sure it is not a letter, number, or underscore. The goal is
to help with parsers like this:
succeed identity
|> skip (keyword "let")
|> skip spaces
|> keep grenVar
|> skip spaces
|> skip (token "=")
The trouble is that spaces may chomp zero characters (to handle expressions
like [1,2] and [ 1 , 2 ]), and in that case, letters could be parsed as
let ters, leaving us to wonder where the equals sign is! Check out the
token docs if you need to customize this!
Create a parser for variables. If we wanted to parse type variables in Gren, we could try something like this:
import Char
import Parser exposing (..)
import Set
typeVar : Parser String
typeVar =
variable
{ start = Char.isLower
, inner = \c -> Char.isAlphaNum c || c == '_'
, reserved = Set.fromArray [ "let", "in", "when", "is" ]
}
This is saying it must start with a lower-case character. After that, characters can be letters, numbers, or underscores. It is also saying that if you run into any of these reserved names, it is definitely not a variable.
Check if you have reached the end of the string you are parsing.
justAnInt : Parser Int
justAnInt =
succeed identity
|> keep int
|> skip end
-- run justAnInt "90210" == Ok 90210
-- run justAnInt "1 + 2" == Err ...
-- run int "1 + 2" == Ok 1
Parsers can succeed without parsing the whole string. Ending your parser
with end guarantees that you have successfully parsed the whole string.
Pipelines
A parser that succeeds without chomping any characters.
run (succeed 90210 ) "mississippi" == Ok 90210
run (succeed 3.141 ) "mississippi" == Ok 3.141
run (succeed {} ) "mississippi" == Ok {}
run (succeed Nothing) "mississippi" == Ok Nothing
Seems weird on its own, but it is very useful in combination with other functions. The docs for keep and andThen have some neat examples.
Keep values in a parser pipeline. For example, we could say:
type alias Point = { x : Int, y : Int }
point : Parser Point
point =
succeed (\x y -> { x = x, y = y })
|> skip (token "(")
|> skip spaces
|> keep int
|> skip spaces
|> skip (token ",")
|> skip spaces
|> keep int
|> skip spaces
|> skip (token ")")
All the parsers in this pipeline will chomp characters and produce values. So
token "(" will chomp one paren and produce a {} value. Similarly, int
will chomp some digits and produce an Int value. The skip and keep
functions just decide whether we give the values to the Point function.
So in this case, we skip the {} from token "(", we skip the {} from
spaces, we keep the Int from int, etc.
Skip values in a parser pipeline. For example, maybe we want to parse some JavaScript variables:
var : Parser String
var =
succeed {}
|> skip (chompIf isStartChar)
|> skip (chompWhile isInnerChar)
|> getChompedString
isStartChar : Char -> Bool
isStartChar char =
Char.isAlpha char || char == '_' || char == '$'
isInnerChar : Char -> Bool
isInnerChar char =
isStartChar char || Char.isDigit char
chompIf isStartChar can chomp one character and produce a {} value.
chompWhile isInnerChar can chomp zero or more characters and produce a {}
value. The skip function is saying to still chomp all the characters, but
skip the two {} values that get produced. No one cares about them.
Helper to define recursive parsers. Say we want a parser for simple boolean expressions:
true
false
(true || false)
(true || (true || false))
Notice that a boolean expression might contain other boolean expressions. That means we will want to define our parser in terms of itself:
type Boolean
= MyTrue
| MyFalse
| MyOr Boolean Boolean
boolean : Parser Boolean
boolean =
oneOf
[ succeed MyTrue
|> skip (keyword "true")
, succeed MyFalse
|> skip (keyword "false")
, succeed MyOr
|> skip (token "(")
|> skip spaces
|> keep (lazy (\_ -> boolean))
|> skip spaces
|> skip (token "||")
|> skip spaces
|> keep (lazy (\_ -> boolean))
|> skip spaces
|> skip (token ")")
]
Notice that boolean uses boolean in its definition! In Gren, you can
only define a value in terms of itself if it is behind a function call. So
lazy helps us define these self-referential parsers. (andThen can be used
for this as well!)
Parse one thing andThen parse another thing. This is useful when you want
to check on what you just parsed. For example, maybe you want U.S. zip codes
and int is not suitable because it does not allow leading zeros. You could
say:
zipCode : Parser String
zipCode =
getChompedString (chompWhile Char.isDigit)
|> andThen checkZipCode
checkZipCode : String -> Parser String
checkZipCode code =
if String.length code == 5 then
succeed code
else
problem "a U.S. zip code has exactly 5 digits"
First we chomp digits andThen we check if it is a valid U.S. zip code. We
succeed if it has exactly five digits and report a problem if not.
Note: If you are using andThen recursively and blowing the stack, check
out the loop function to limit stack usage.
Indicate that a parser has reached a dead end. "Everything was going fine until I ran into this problem." Check out the andThen docs to see an example usage.
Branches
If you are parsing JSON, the values can be strings, floats, booleans,
arrays, objects, or null. You need a way to pick oneOf them! Here is a
sample of what that code might look like:
type Json
= Number Float
| Boolean Bool
| Null
json : Parser Json
json =
oneOf
[ map Number float
, map (\_ -> Boolean True) (keyword "true")
, map (\_ -> Boolean False) (keyword "false")
, map (\_ -> Null) (keyword "null")
]
This parser will keep trying parsers until oneOf them starts chomping
characters. Once a path is chosen, it does not come back and try the others.
Transform the result of a parser. Maybe you have a value that is
an integer or null:
nullOrInt : Parser (Maybe Int)
nullOrInt =
oneOf
[ map Just int
, map (\_ -> Nothing) (keyword "null")
]
-- run nullOrInt "0" == Ok (Just 0)
-- run nullOrInt "13" == Ok (Just 13)
-- run nullOrInt "null" == Ok Nothing
-- run nullOrInt "zero" == Err ...
It is quite tricky to use backtrackable well! It can be very useful, but
also can degrade performance and error message quality.
commit is almost always paired with backtrackable in some way, and it
is tricky to use well.
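One common pattern, sketched here with an illustrative checkBox parser, is to wrap the first consumed token in backtrackable so that oneOf can still try the other alternatives if a later step fails:

```
checkBox : Parser Bool
checkBox =
    oneOf
        [ succeed True
            |> skip (backtrackable (token "["))
            |> skip (token "x")
            |> skip (token "]")
        , succeed False
            |> skip (token "[")
            |> skip (token "]")
        ]
```

On "[]" the first alternative fails at token "x", but since the "[" was chomped inside backtrackable, the second alternative still gets a chance. The cost is that "[" is parsed twice on that path, which is part of why backtrackable can degrade performance.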
Loops
Handle things like lists and records, but you can customize the details however you need. Say you want to parse C-style code blocks:
import Parser exposing (Parser, Trailing(..))
block : Parser (List Stmt)
block =
Parser.sequence
{ start = "{"
, separator = ";"
, end = "}"
, spaces = spaces
, item = statement
, trailing = Mandatory -- demand a trailing semi-colon
}
-- statement : Parser Stmt
Note: If you need something more custom, do not be afraid to check out the implementation and customize it for your case. It is better to get nice error messages with a lower-level implementation than to try to hack high-level parsers to do things they are not made for.
What's the deal with trailing commas? Are they Forbidden?
Are they Optional? Are they Mandatory? Welcome to shapes
club!
A parser that can loop indefinitely. This can be helpful when parsing repeated structures, like a bunch of statements:
statements : Parser (Array Stmt)
statements =
loop [] statementsHelp
statementsHelp : Array Stmt -> Parser (Step (Array Stmt) (Array Stmt))
statementsHelp stmts =
oneOf
[ succeed (\stmt -> Loop (Array.pushLast stmt stmts))
|> keep statement
|> skip spaces
|> skip (token ";")
|> skip spaces
, succeed {}
|> map (\_ -> Done stmts)
]
IMPORTANT NOTE: Parsers like succeed {} and chompWhile Char.isAlpha can
succeed without consuming any characters. So in some cases you may want to use
getOffset to ensure that each step actually consumed characters.
Otherwise you could end up in an infinite loop!
Note: Anything you can write with loop, you can also write as a parser
that chomps some characters andThen calls itself with new arguments. The
problem with calling andThen recursively is that it grows the stack, so you
cannot do it indefinitely. So loop is important because it enables tail-call
elimination, allowing you to parse however many repeats you want.
Decide what steps to take next in your loop.
If you are Done, you give the result of the whole loop. If you decide to
Loop around again, you give a new state to work from. Maybe you need to add
an item to a list? Or maybe you need to track some information about what you
just saw?
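As a minimal sketch (stars and starsHelp are illustrative names, not part of this module), a loop that counts consecutive * characters uses both constructors:

```
stars : Parser Int
stars =
    loop 0 starsHelp

starsHelp : Int -> Parser (Step Int Int)
starsHelp n =
    oneOf
        [ succeed (Loop (n + 1))
            |> skip (token "*")
        , succeed (Done n)
        ]
```

Each * chomped takes another Loop around with an incremented count; anything else ends the loop with Done. Since token "*" always consumes a character, the loop cannot spin forever.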
Whitespace
Parse zero or more ' ', '\n', and '\r' characters.
The implementation is pretty simple:
spaces : Parser {}
spaces =
chompWhile (\c -> c == ' ' || c == '\n' || c == '\r')
So if you need something different (like tabs) just define an alternative with the necessary tweaks! Check out lineComment and multiComment for more complex situations.
Parse single-line comments:
gren : Parser {}
gren =
lineComment "--"
js : Parser {}
js =
lineComment "//"
python : Parser {}
python =
lineComment "#"
This parser is defined like this:
lineComment : String -> Parser {}
lineComment str =
token str
|> skip (chompUntilEndOr "\n")
So it will consume the remainder of the line. If the file ends before you see a newline, that is fine too.
Parse multi-line comments. So if you wanted to parse JS whitespace, you could say:
js : Parser {}
js =
loop 0 <| ifProgress <|
oneOf
[ lineComment "//"
, multiComment "/*" "*/" NotNestable
, chompWhile (\c -> c == ' ' || c == '\n' || c == '\r' || c == '\t')
]
ifProgress : Parser a -> Int -> Parser (Step Int {})
ifProgress parser offset =
succeed identity
|> skip parser
|> keep getOffset
|> map (\newOffset -> if offset == newOffset then Done {} else Loop newOffset)
Note: The fact that chompWhile comes last in the definition of js is very
important! It can succeed without consuming any characters, so if it were the
first option, it would always succeed and bypass the others! This possibility of
success without consumption is also why we need the ifProgress helper. It
detects if there is no more whitespace to consume.
Not all languages handle multi-line comments the same. Multi-line comments
in C-style syntax are NotNestable, meaning they can be implemented like this:
js : Parser {}
js =
token "/*"
|> skip (chompUntil "*/")
In fact, multiComment "/*" "*/" NotNestable is implemented like that! It is
very simple, but it does not allow you to nest comments like this:
/*
line1
/* line2 */
line3
*/
It would stop on the first */, eventually throwing a syntax error on the
second */. This can be pretty annoying in long files.
Some languages allow you to nest multi-line comments, but your parser needs
to be a bit fancier to handle this. After you start a comment, you have to
detect if there is another one inside it! And then you have to make sure all
the /* and */ match up properly! Saying multiComment "/*" "*/" Nestable
does all that for you.
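For example, Gren itself allows nested {- -} comments, so a parser for them could be sketched as:

```
grenComment : Parser {}
grenComment =
    multiComment "{-" "-}" Nestable
```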
Chompers
Sometimes parsers like int or variable cannot do exactly what you
need. The "chomping" family of functions is meant for that case! Maybe you
need to parse valid PHP variables like $x and $txt:
php : Parser String
php =
succeed {}
|> skip (chompChar '$')
|> skip (chompIf (\c -> Char.isAlpha c || c == '_'))
|> skip (chompWhile (\c -> Char.isAlphaNum c || c == '_'))
|> getChompedString
The idea is that you create a bunch of chompers that validate the underlying
characters. Then getChompedString extracts the underlying String efficiently.
Note: Maybe it is helpful to see how you can use getOffset and getSource to implement this function:
getChompedString : Parser a -> Parser String
getChompedString parser =
succeed String.slice
|> keep getOffset
|> skip parser
|> keep getOffset
|> keep getSource
Chomp one character if it passes the test.
chompUpper : Parser {}
chompUpper =
chompIf Char.isUpper
So this can chomp a character like T and produces a {} value.
Chomp one specific character.
chompChar 'c'
This is the same as
chompIf (\c -> c == 'c')
Chomp zero or more characters if they pass the test. This is commonly useful for chomping whitespace or variable names:
whitespace : Parser {}
whitespace =
chompWhile (\c -> c == ' ' || c == '\t' || c == '\n' || c == '\r')
grenVar : Parser String
grenVar =
succeed {}
|> skip (chompIf Char.isLower)
|> skip (chompWhile (\c -> Char.isAlphaNum c || c == '_'))
|> getChompedString
Note: a chompWhile parser always succeeds! This can lead to tricky
situations, especially if you define your whitespace with it. In that case,
you could accidentally interpret letx as the keyword let followed by
"spaces" followed by the variable x. This is why the keyword and number
parsers peek ahead, making sure they are not followed by anything unexpected.
Chomp until you see a certain string. You could define C-style multi-line comments like this:
comment : Parser {}
comment =
token "/*"
|> skip (chompUntil "*/")
I recommend using multiComment for this particular scenario though. It can be trickier than it looks!
Chomp until you see a certain string or until you run out of characters to chomp! You could define single-line comments like this:
gren : Parser {}
gren =
token "--"
|> skip (chompUntilEndOr "\n")
A file may end with a single-line comment, so the file can end before you see a newline. Tricky!
I recommend just using lineComment for this particular scenario.
This works just like getChompedString but gives a bit more flexibility. For example, maybe you want to parse Gren doc comments and get (1) the full comment and (2) all of the names listed in the docs.
You could implement mapChompedString like this:
mapChompedString : (String -> a -> b) -> Parser a -> Parser b
mapChompedString func parser =
succeed (\start value end src -> func (String.slice start end src) value)
|> keep getOffset
|> keep parser
|> keep getOffset
|> keep getSource
Errors
A parser can run into situations where there is no way to make progress.
When that happens, I record the row and col where you got stuck and the
particular problem you ran into. That is a DeadEnd!
Note: I count rows and columns like a text editor. The beginning is row=1
and col=1. As I chomp characters, the col increments. When I reach a \n
character, I increment the row and set col=1.
When you run into a DeadEnd, I record some information about why you
got stuck. This data is useful for producing helpful error messages.
Note: If you feel limited by this type (i.e. having to represent custom
problems as strings) I highly recommend switching to Parser.Advanced. It
lets you define your own Problem type. It can also track "payload" which
can improve error messages a ton! This is how the Gren compiler produces
relatively nice parse errors!
Turn all the DeadEnd data into a string that is easier for people to
read.
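As a hypothetical usage sketch (reusing the json parser from the oneOf docs):

```
report : String -> String
report input =
    when run json input is
        Ok _ ->
            "It is valid JSON!"

        Err deadEnds ->
            deadEndsToString deadEnds
```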
Positions
Code editors treat code like a grid, with rows and columns. The start is
row=1 and col=1. As you chomp characters, the col increments. When you
run into a \n character, the row increments and col goes back to 1.
In the Gren compiler, the start and end position of every expression is tracked like this:
type alias Located a =
{ start : { row: Int, col: Int }
, value : a
, end : { row: Int, col: Int }
}
located : Parser a -> Parser (Located a)
located parser =
succeed (\start value end -> { start = start, value = value, end = end })
|> keep getPosition
|> keep parser
|> keep getPosition
So if there is a problem during type inference, I use this saved position information to underline the exact problem!
Note: Tabs count as one character, so if you are parsing something like
Python, I recommend sorting that out after parsing. So if I wanted the ^^^^
underline like in Gren, I would find the row in the source code and do
something like this:
makeUnderline : String -> Int -> Int -> String
makeUnderline row minCol maxCol =
String.toArray row
|> Array.indexedMap (toUnderlineChar minCol maxCol)
|> String.fromArray
toUnderlineChar : Int -> Int -> Int -> Char -> Char
toUnderlineChar minCol maxCol col char =
if minCol <= col && col <= maxCol then
'^'
else if char == '\t' then
'\t'
else
' '
So it would preserve any tabs from the source line. There are tons of other ways to do this though. The point is just that you handle the tabs after parsing but before anyone looks at the numbers in a context where tabs may equal 2, 4, or 8.
This is a more efficient version of map (\pos -> pos.row) getPosition. Maybe
you just want to track the line number for some reason? This lets you do that.
See getPosition for an explanation of rows and columns.
This is a more efficient version of map (\pos -> pos.col) getPosition.
This can be useful when you need to track or check the current column position
during parsing.
See getPosition for an explanation of rows and columns.
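For example, here is a sketch of enforcing indentation with getCol (checkIndent is an illustrative helper, not part of this module):

```
checkIndent : Int -> Parser {}
checkIndent indent =
    getCol
        |> andThen
            (\col ->
                if col > indent then
                    succeed {}
                else
                    problem "expecting more indentation"
            )
```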
Editors think of code as a grid, but behind the scenes it is just a flat
array of UTF-16 characters. getOffset tells you your index in that flat
array. So if you chomp "\n\n\n\n" you are on row 5, column 1, and offset 4.
Note: JavaScript uses a somewhat odd version of UTF-16 strings, so a single
character may take two slots. So in JavaScript, 'abc'.length === 3 but
'🙈🙉🙊'.length === 6. Try it out! And since Gren runs in JavaScript, the offset
moves by those rules.
Get the full string that is being parsed. You could use this to define
getChompedString or mapChompedString if you wanted:
getChompedString : Parser a -> Parser String
getChompedString parser =
succeed String.slice
|> keep getOffset
|> skip parser
|> keep getOffset
|> keep getSource