Documentation

Regex

Regex #

A regular expression engine written in Lean 4.

This library is based on the Rust regex crate and extended for compatibility with PCRE2.

Contents:

  1. Syntax
  2. Unicode support
  3. Api
  4. Examples
  5. Performance
  6. Inspect tool

Syntax #

There are to syntax flavors:

The following features are not yet implemented in the Pcre flavor:

The following features are not implemented in the Rust flavor:

Unicode #

This library implements part of “Basic Unicode Support” (Level 1) as specified by the Unicode Technical Standard #18.

RL1.1 Hex Notation #

RL1.1

Hex Notation refers to the ability to specify a Unicode code point in a regular expression via its hexadecimal code point representation. These forms of hexadecimal notation are supported:

RL1.2 Properties #

RL1.2

A unicode character may have properties. A property has a name and a type (Unicode.PropertyType) and may have values.

These forms of property notation are supported:

The type Unicode.PropertyName contains all supported property names.

RL1.3 Subtraction and Intersection #

RL1.3

This library implements union, intersection and subtraction of sets of characters.

Api #

Main api for Regex library

Examples #

Get captures of "∀ (n : Nat), 0 ≤ n" :

def Main : IO Unit := do
  let re := regex% r"^\p{Math}\s*.(?<=\\()([a-z])[^,]+,\s*(\p{Nd})\s*(\p{Math})\s*\1$"
  let captures := Regex.captures "∀ (n : Nat), 0 ≤ n" re
  IO.println s!"{captures}"

Output is

fullMatch: '∀ (n : Nat), 0 ≤ n', 0, 22
groups: #[(some ('n', 5, 6)), (some ('0', 15, 16)), (some ('≤', 17, 20))]

Components of regular expression:

Performance #

The performance is tested for 2 different kinds of regular expressions and haystacks.

The performance data is generated with the benchmark tool .

Test 1 #

The performance is tested for the regular expression a?<sup>n</sup>a<sup>n</sup> and the haystack a<sup>n</sup> where n means repetition i.e. a?a?aa for n = 2. To find a match for this expression the BoundedBacktracker has to traverse the entire search space.

nsecondsvisited valuesbacktracks
1000.005204045050
5000.150502004125250
10000.6002004004500500
20002.20080080042001000

The visited values are the encoded (StateID, HaystackOffset) pairs that have been visited by the backtracker within a single search (BoundedBacktracker.SearchState).

Test 2 #

The performance is tested for the regular expression (?i)Xyz and a haystack which contains xyz as last element.

size of haystacksecondsvisited values
1 kb0.1705010
100 kb0.340500010
1000 kb1.7205000010

Inspect tool #

The inspect tool command line options:

Output of inspect ast 'a|b'

Alternation
  0: Syntax.AstItems.Literal (a|b,0,1) Verbatim 'a'
  1: Syntax.AstItems.Literal (a|b,2,3) Verbatim 'b'

Output of inspect hir 'a|b'

Alternation
  0: Syntax.HirKind.Literal 'a'
  1: Syntax.HirKind.Literal 'b'

Output of inspect compile 'a|b'

0: NFA.Checked.State.UnionReverse [ 2 3 ]
1: NFA.Checked.State.Empty => 0
2: NFA.Checked.State.SparseTransitions [ 0x00-0xd7ff => 1 0xe000-0x10ffff => 1 ]
3: NFA.Checked.State.Capture (pid=0, group=0, slot=0) => 6
4: NFA.Checked.State.ByteRange a-a => 7
5: NFA.Checked.State.ByteRange b-b => 7
6: NFA.Checked.State.Union [ 4 5 ]
7: NFA.Checked.State.Empty => 8
8: NFA.Checked.State.Capture (pid=0, group=0, slot=1) => 9
9: NFA.Checked.State.Match (0)

Output of inspect captures 'a|b' a

fullMatch: 'a', 0, 1
groups: #[]