String.Parser
Functions for turning unstructured strings into structured data.
Parsers
A Parser helps turn a String into nicely structured data. For example,
we can run the int parser to turn a String into an Int:
run int "123456" == Ok 123456
The cool thing is that you can combine Parser values to handle much more
complex scenarios.
Try a parser. Here are some examples using the keyword parser:
run (keyword "true") "true" == Ok {}
run (keyword "true") "True" == Err ...
run (keyword "true") "false" == Err ...
run (keyword "true") "true!" == Ok {}
Notice the last case! A Parser will chomp as much as possible and not worry
about the rest. Use the end parser to ensure you made it to the end
of the string!
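For instance, here is a sketch that pairs keyword with end (exactlyTrue is an illustrative name, not part of this module):

```
exactlyTrue : Parser {}
exactlyTrue =
    succeed {}
        |> skip (keyword "true")
        |> skip end

-- run exactlyTrue "true" == Ok {}
-- run exactlyTrue "true!" == Err ...
```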
Building Blocks
Parse integers.
run int "1" == Ok 1
run int "1234" == Ok 1234
run int "0123" == Ok 0
run int "123a" == Ok 123
run int "1.34" == Ok 1
run int "1e31" == Ok 1
run int "0x1A" == Ok 0
run int "-789" == Err ...
This parser does not look ahead, which is why "0123" and "123a" both succeed with just the digits they managed to parse.
If you want to handle a leading + or - you should do it with a custom
parser like this:
myInt : Parser Int
myInt =
oneOf
[ succeed negate
|> skip (token "-")
|> keep int
, int
]
Parse exactly the given string, without any regard to what comes next.
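For example, a quick sketch of how token differs from keyword on the same inputs:

```
run (token "let") "let"       == Ok {}
run (token "let") "letters"   == Ok {} -- token does not look ahead
run (keyword "let") "letters" == Err ...
```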
A potential pitfall when parsing keywords is getting tricked by variables that
start with a keyword, like let in letters or import in important. This
is especially likely if you have a whitespace parser that can consume zero
characters. So the keyword parser is defined with token and a
trick to peek ahead a bit:
keyword : String -> Parser {}
keyword kwd =
succeed identity
|> skip (backtrackable (token kwd))
|> keep (oneOf
[ map (\_ -> True) (backtrackable (chompIf isVarChar))
, succeed False
])
|> andThen (checkEnding kwd)
checkEnding : String -> Bool -> Parser {}
checkEnding kwd isBadEnding =
if isBadEnding then
problem ("expecting the `" ++ kwd ++ "` keyword")
else
commit {}
isVarChar : Char -> Bool
isVarChar char =
Char.isAlphaNum char || char == '_'
This definition is specially designed so that (1) if you really see let you
commit to that path and (2) if you see letters instead you can backtrack and
try other options. If I had just put a backtrackable around the whole thing
you would not get (1) anymore.
Parse keywords like let, when, and type.
run (keyword "let") "let" == Ok {}
run (keyword "let") "var" == Err ... (ExpectingKeyword "let") ...
run (keyword "let") "letters" == Err ... (ExpectingKeyword "let") ...
Note: Notice the third case there! keyword actually looks ahead one
character to make sure it is not a letter, number, or underscore. The goal is
to help with parsers like this:
succeed identity
|> skip (keyword "let")
|> skip spaces
|> keep grenVar
|> skip spaces
|> skip (token "=")
The trouble is that spaces may chomp zero characters (to handle expressions
like [1,2] and [ 1 , 2 ]), and in that case, letters could be parsed as
let ters, leaving us to wonder where the equals sign is! Check out the
token docs if you need to customize this!
Create a parser for variables. If we wanted to parse type variables in Gren, we could try something like this:
import Char
import Parser exposing (..)
import Set
typeVar : Parser String
typeVar =
variable
{ start = Char.isLower
, inner = \c -> Char.isAlphaNum c || c == '_'
, reserved = Set.fromArray [ "let", "in", "when", "is" ]
}
This is saying it must start with a lower-case character. After that, characters can be letters, numbers, or underscores. It is also saying that if you run into any of these reserved names, it is definitely not a variable.
Check if you have reached the end of the string you are parsing.
justAnInt : Parser Int
justAnInt =
succeed identity
|> keep int
|> skip end
-- run justAnInt "90210" == Ok 90210
-- run justAnInt "1 + 2" == Err ...
-- run int "1 + 2" == Ok 1
Parsers can succeed without parsing the whole string. Ending your parser
with end guarantees that you have successfully parsed the whole string.
Pipelines
A parser that succeeds without chomping any characters.
run (succeed 90210 ) "mississippi" == Ok 90210
run (succeed 3.141 ) "mississippi" == Ok 3.141
run (succeed {} ) "mississippi" == Ok {}
run (succeed Nothing) "mississippi" == Ok Nothing
Seems weird on its own, but it is very useful in combination with other functions. The docs for keep and andThen have some neat examples.
Keep values in a parser pipeline. For example, we could say:
type alias Point = { x : Int, y : Int }
point : Parser Point
point =
succeed (\x y -> { x = x, y = y })
|> skip (token "(")
|> skip spaces
|> keep int
|> skip spaces
|> skip (token ",")
|> skip spaces
|> keep int
|> skip spaces
|> skip (token ")")
All the parsers in this pipeline will chomp characters and produce values. So
token "(" will chomp one paren and produce a {} value. Similarly, int
will chomp some digits and produce an Int value. The skip and keep
functions just decide whether we give the values to the Point function.
So in this case, we skip the {} from token "(", we skip the {} from
spaces, we keep the Int from int, etc.
Skip values in a parser pipeline. For example, maybe we want to parse some JavaScript variables:
var : Parser String
var =
succeed {}
|> skip (chompIf isStartChar)
|> skip (chompWhile isInnerChar)
|> getChompedString
isStartChar : Char -> Bool
isStartChar char =
Char.isAlpha char || char == '_' || char == '$'
isInnerChar : Char -> Bool
isInnerChar char =
isStartChar char || Char.isDigit char
chompIf isStartChar can chomp one character and produce a {} value.
chompWhile isInnerChar can chomp zero or more characters and produce a {}
value. The skip function is saying to still chomp all the characters, but
skip the two {} values that get produced. No one cares about them.
Helper to define recursive parsers. Say we want a parser for simple boolean expressions:
true
false
(true || false)
(true || (true || false))
Notice that a boolean expression might contain other boolean expressions. That means we will want to define our parser in terms of itself:
type Boolean
= MyTrue
| MyFalse
| MyOr Boolean Boolean
boolean : Parser Boolean
boolean =
oneOf
[ succeed MyTrue
|> skip (keyword "true")
, succeed MyFalse
|> skip (keyword "false")
, succeed MyOr
|> skip (token "(")
|> skip spaces
|> keep (lazy (\_ -> boolean))
|> skip spaces
|> skip (token "||")
|> skip spaces
|> keep (lazy (\_ -> boolean))
|> skip spaces
|> skip (token ")")
]
Notice that boolean uses boolean in its definition! In Gren, you can
only define a value in terms of itself if it is behind a function call. So
lazy helps us define these self-referential parsers. (andThen can be used
for this as well!)
Parse one thing andThen parse another thing. This is useful when you want
to check on what you just parsed. For example, maybe you want U.S. zip codes
and int is not suitable because it does not allow leading zeros. You could
say:
zipCode : Parser String
zipCode =
getChompedString (chompWhile Char.isDigit)
|> andThen checkZipCode
checkZipCode : String -> Parser String
checkZipCode code =
if String.length code == 5 then
succeed code
else
problem "a U.S. zip code has exactly 5 digits"
First we chomp digits andThen we check if it is a valid U.S. zip code. We
succeed if it has exactly five digits and report a problem if not.
Note: If you are using andThen recursively and blowing the stack, check
out the loop function to limit stack usage.
Indicate that a parser has reached a dead end. "Everything was going fine until I ran into this problem." Check out the andThen docs to see an example usage.
Branches
If you are parsing JSON, the values can be strings, floats, booleans,
arrays, objects, or null. You need a way to pick oneOf them! Here is a
sample of what that code might look like:
type Json
= Number Float
| Boolean Bool
| Null
json : Parser Json
json =
oneOf
[ map Number float
, map (\_ -> Boolean True) (keyword "true")
, map (\_ -> Boolean False) (keyword "false")
, map (\_ -> Null) (keyword "null")
]
This parser will keep trying parsers until oneOf them starts chomping
characters. Once a path is chosen, it does not come back and try the others.
Transform the result of a parser. Maybe you have a value that is
an integer or null:
nullOrInt : Parser (Maybe Int)
nullOrInt =
oneOf
[ map Just int
, map (\_ -> Nothing) (keyword "null")
]
-- run nullOrInt "0" == Ok (Just 0)
-- run nullOrInt "13" == Ok (Just 13)
-- run nullOrInt "null" == Ok Nothing
-- run nullOrInt "zero" == Err ...
It is quite tricky to use backtrackable well! It can be very useful, but
also can degrade performance and error message quality.
commit is almost always paired with backtrackable in some way, and it
is tricky to use well.
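One common pattern, sketched here with an illustrative checkBox parser, is to wrap the first consumed token in backtrackable so that oneOf can still try the other alternatives if a later step fails:

```
checkBox : Parser Bool
checkBox =
    oneOf
        [ succeed True
            |> skip (backtrackable (token "["))
            |> skip (token "x")
            |> skip (token "]")
        , succeed False
            |> skip (token "[")
            |> skip (token "]")
        ]
```

On "[]" the first alternative fails at token "x", but since the "[" was chomped inside backtrackable, the second alternative still gets a chance. The cost is that "[" is parsed twice on that path, which is part of why backtrackable can degrade performance.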
Loops
Handle things like lists and records, but you can customize the details however you need. Say you want to parse C-style code blocks:
import Parser exposing (Parser, Trailing(..))
block : Parser (List Stmt)
block =
Parser.sequence
{ start = "{"
, separator = ";"
, end = "}"
, spaces = spaces
, item = statement
, trailing = Mandatory -- demand a trailing semi-colon
}
-- statement : Parser Stmt
Note: If you need something more custom, do not be afraid to check out the implementation and customize it for your case. It is better to get nice error messages with a lower-level implementation than to try to hack high-level parsers to do things they are not made for.
What's the deal with trailing commas? Are they Forbidden?
Are they Optional? Are they Mandatory? Welcome to shapes
club!
A parser that can loop indefinitely. This can be helpful when parsing repeated structures, like a bunch of statements:
statements : Parser (Array Stmt)
statements =
loop [] statementsHelp
statementsHelp : Array Stmt -> Parser (Step (Array Stmt) (Array Stmt))
statementsHelp stmts =
oneOf
[ succeed (\stmt -> Loop (Array.pushLast stmt stmts))
|> keep statement
|> skip spaces
|> skip (token ";")
|> skip spaces
, succeed {}
|> map (\_ -> Done stmts)
]
IMPORTANT NOTE: Parsers like succeed {} and chompWhile Char.isAlpha can
succeed without consuming any characters. So in some cases you may want to use
getOffset to ensure that each step actually consumed characters.
Otherwise you could end up in an infinite loop!
Note: Anything you can write with loop, you can also write as a parser
that chomps some characters andThen calls itself with new arguments. The
problem with calling andThen recursively is that it grows the stack, so you
cannot do it indefinitely. So loop is important because it enables tail-call
elimination, allowing you to parse however many repeats you want.
Decide what steps to take next in your loop.
If you are Done, you give the result of the whole loop. If you decide to
Loop around again, you give a new state to work from. Maybe you need to add
an item to a list? Or maybe you need to track some information about what you
just saw?
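As a minimal sketch (stars and starsHelp are illustrative names, not part of this module), a loop that counts consecutive * characters uses both constructors:

```
stars : Parser Int
stars =
    loop 0 starsHelp

starsHelp : Int -> Parser (Step Int Int)
starsHelp n =
    oneOf
        [ succeed (Loop (n + 1))
            |> skip (token "*")
        , succeed (Done n)
        ]
```

Each * chomped takes another Loop around with an incremented count; anything else ends the loop with Done. Since token "*" always consumes a character, the loop cannot spin forever.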
Whitespace
Parse zero or more ' ', '\n', and '\r' characters.
The implementation is pretty simple:
spaces : Parser {}
spaces =
chompWhile (\c -> c == ' ' || c == '\n' || c == '\r')
So if you need something different (like tabs) just define an alternative with the necessary tweaks! Check out lineComment and multiComment for more complex situations.
Parse single-line comments:
gren : Parser {}
gren =
lineComment "--"
js : Parser {}
js =
lineComment "//"
python : Parser {}
python =
lineComment "#"
This parser is defined like this:
lineComment : String -> Parser {}
lineComment str =
token str
|> skip (chompUntilEndOr "\n")
So it will consume the remainder of the line. If the file ends before you see a newline, that is fine too.
Parse multi-line comments. So if you wanted to parse JS whitespace, you could say:
js : Parser {}
js =
loop 0 <| ifProgress <|
oneOf
[ lineComment "//"
, multiComment "/*" "*/" NotNestable
, chompWhile (\c -> c == ' ' || c == '\n' || c == '\r' || c == '\t')
]
ifProgress : Parser a -> Int -> Parser (Step Int {})
ifProgress parser offset =
succeed identity
|> skip parser
|> keep getOffset
|> map (\newOffset -> if offset == newOffset then Done {} else Loop newOffset)
Note: The fact that chompWhile comes last in the definition of js is very
important! It can succeed without consuming any characters, so if it were the
first option, it would always succeed and bypass the others! This possibility of
success without consumption is also why we need the ifProgress helper. It
detects if there is no more whitespace to consume.
Not all languages handle multi-line comments the same. Multi-line comments
in C-style syntax are NotNestable, meaning they can be implemented like this:
js : Parser {}
js =
token "/*"
|> skip (chompUntil "*/")
In fact, multiComment "/*" "*/" NotNestable is implemented like that! It is
very simple, but it does not allow you to nest comments like this:
/*
line1
/* line2 */
line3
*/
It would stop on the first */, eventually throwing a syntax error on the
second */. This can be pretty annoying in long files.
Some languages allow you to nest multi-line comments, but your parser needs
to be a bit fancier to handle this. After you start a comment, you have to
detect if there is another one inside it! And then you have to make sure all
the /* and */ match up properly! Saying multiComment "/*" "*/" Nestable
does all that for you.
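For example, Gren itself allows nested {- -} comments, so a parser for them could be sketched as:

```
grenComment : Parser {}
grenComment =
    multiComment "{-" "-}" Nestable
```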
Chompers
Sometimes parsers like int or variable cannot do exactly what you
need. The "chomping" family of functions is meant for that case! Maybe you
need to parse valid PHP variables like $x and $txt:
php : Parser String
php =
succeed {}
|> skip (chompChar '$')
|> skip (chompIf (\c -> Char.isAlpha c || c == '_'))
|> skip (chompWhile (\c -> Char.isAlphaNum c || c == '_'))
|> getChompedString
The idea is that you create a bunch of chompers that validate the underlying
characters. Then getChompedString extracts the underlying String efficiently.
Note: Maybe it is helpful to see how you can use getOffset and getSource to implement this function:
getChompedString : Parser a -> Parser String
getChompedString parser =
succeed String.slice
|> keep getOffset
|> skip parser
|> keep getOffset
|> keep getSource
Chomp one character if it passes the test.
chompUpper : Parser {}
chompUpper =
chompIf Char.isUpper
So this can chomp a character like T and produces a {} value.
Chomp one specific character.
chompChar 'c'
This is the same as
chompIf (\c -> c == 'c')
Chomp zero or more characters if they pass the test. This is commonly useful for chomping whitespace or variable names:
whitespace : Parser {}
whitespace =
chompWhile (\c -> c == ' ' || c == '\t' || c == '\n' || c == '\r')
grenVar : Parser String
grenVar =
succeed {}
|> skip (chompIf Char.isLower)
|> skip (chompWhile (\c -> Char.isAlphaNum c || c == '_'))
|> getChompedString
Note: a chompWhile parser always succeeds! This can lead to tricky
situations, especially if you define your whitespace with it. In that case,
you could accidentally interpret letx as the keyword let followed by
"spaces" followed by the variable x. This is why the keyword and number
parsers peek ahead, making sure they are not followed by anything unexpected.
Chomp until you see a certain string. You could define C-style multi-line comments like this:
comment : Parser {}
comment =
token "/*"
|> skip (chompUntil "*/")
I recommend using multiComment for this particular scenario though. It can be trickier than it looks!
Chomp until you see a certain string or until you run out of characters to chomp! You could define single-line comments like this:
gren : Parser {}
gren =
token "--"
|> skip (chompUntilEndOr "\n")
A file may end with a single-line comment, so the file can end before you see a newline. Tricky!
I recommend just using lineComment for this particular scenario.
This works just like getChompedString but gives a bit more flexibility. For example, maybe you want to parse Gren doc comments and get (1) the full comment and (2) all of the names listed in the docs.
You could implement mapChompedString like this:
mapChompedString : (String -> a -> b) -> Parser a -> Parser b
mapChompedString func parser =
succeed (\start value end src -> func (String.slice start end src) value)
|> keep getOffset
|> keep parser
|> keep getOffset
|> keep getSource
Errors
A parser can run into situations where there is no way to make progress.
When that happens, I record the row and col where you got stuck and the
particular problem you ran into. That is a DeadEnd!
Note: I count rows and columns like a text editor. The beginning is row=1
and col=1. As I chomp characters, the col increments. When I reach a \n
character, I increment the row and set col=1.
When you run into a DeadEnd, I record some information about why you
got stuck. This data is useful for producing helpful error messages.
Note: If you feel limited by this type (i.e. having to represent custom
problems as strings) I highly recommend switching to Parser.Advanced. It
lets you define your own Problem type. It can also track "payload" which
can improve error messages a ton! This is how the Gren compiler produces
relatively nice parse errors!
Turn all the DeadEnd data into a string that is easier for people to
read.
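As a hypothetical usage sketch (reusing the json parser from the oneOf docs):

```
report : String -> String
report input =
    when run json input is
        Ok _ ->
            "It is valid JSON!"

        Err deadEnds ->
            deadEndsToString deadEnds
```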
Positions
Code editors treat code like a grid, with rows and columns. The start is
row=1 and col=1. As you chomp characters, the col increments. When you
run into a \n character, the row increments and col goes back to 1.
In the Gren compiler, the start and end position of every expression is tracked like this:
type alias Located a =
{ start : { row: Int, col: Int }
, value : a
, end : { row: Int, col: Int }
}
located : Parser a -> Parser (Located a)
located parser =
succeed (\start value end -> { start = start, value = value, end = end })
|> keep getPosition
|> keep parser
|> keep getPosition
So if there is a problem during type inference, I use this saved position information to underline the exact problem!
Note: Tabs count as one character, so if you are parsing something like
Python, I recommend sorting that out after parsing. So if I wanted the ^^^^
underline like in Gren, I would find the row in the source code and do
something like this:
makeUnderline : String -> Int -> Int -> String
makeUnderline row minCol maxCol =
String.toArray row
|> Array.indexedMap (toUnderlineChar minCol maxCol)
|> String.fromArray
toUnderlineChar : Int -> Int -> Int -> Char -> Char
toUnderlineChar minCol maxCol col char =
if minCol <= col && col <= maxCol then
'^'
else if char == '\t' then
'\t'
else
' '
So it would preserve any tabs from the source line. There are tons of other ways to do this though. The point is just that you handle the tabs after parsing but before anyone looks at the numbers in a context where tabs may equal 2, 4, or 8.
This is a more efficient version of map (\pos -> pos.row) getPosition. Maybe
you just want to track the line number for some reason? This lets you do that.
See getPosition for an explanation of rows and columns.
This is a more efficient version of map (\pos -> pos.col) getPosition.
This can be useful when you need to track or check the current column position
during parsing.
See getPosition for an explanation of rows and columns.
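For example, here is a sketch of enforcing indentation with getCol (checkIndent is an illustrative helper, not part of this module):

```
checkIndent : Int -> Parser {}
checkIndent indent =
    getCol
        |> andThen
            (\col ->
                if col > indent then
                    succeed {}
                else
                    problem "expecting more indentation"
            )
```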
Editors think of code as a grid, but behind the scenes it is just a flat
array of UTF-16 characters. getOffset tells you your index in that flat
array. So if you chomp "\n\n\n\n" you are on row 5, column 1, and offset 4.
Note: JavaScript uses a somewhat odd version of UTF-16 strings, so a single
character may take two slots. So in JavaScript, 'abc'.length === 3 but
'🙈🙉🙊'.length === 6. Try it out! And since Gren runs in JavaScript, the offset
moves by those rules.
Get the full string that is being parsed. You could use this to define
getChompedString or mapChompedString if you wanted:
getChompedString : Parser a -> Parser String
getChompedString parser =
succeed String.slice
|> keep getOffset
|> skip parser
|> keep getOffset
|> keep getSource