Documentation

Regex

Regex #

A regular expression engine written in Lean 4.

This library is based on the Rust regex crate and extended for compatibility with PCRE2.

Contents:

  1. Syntax
  2. Unicode support
  3. Api
  4. Examples
  5. Performance
  6. Inspect tool

Syntax #

There are to syntax flavors:

The following features are not yet implemented in the Pcre flavor:

The following features are not implemented in the Rust flavor:

Unicode #

This library implements part of “Basic Unicode Support” (Level 1) as specified by the Unicode Technical Standard #18.

RL1.1 Hex Notation #

RL1.1

Hex Notation refers to the ability to specify a Unicode code point in a regular expression via its hexadecimal code point representation. These forms of hexadecimal notation are supported:

RL1.2 Properties #

RL1.2

A unicode character may have properties. A property has a name and a type (Unicode.PropertyType) and may have values.

These forms of property notation are supported:

The type Unicode.PropertyName contains all supported property names.

RL1.3 Subtraction and Intersection #

RL1.3

This library implements union, intersection and subtraction of sets of characters.

Api #

Main api for Regex library

Examples #

Get captures of "∀ (n : Nat), 0 ≤ n" :

def Main : IO Unit := do
  let re := regex% r"^\p{Math}\s*.(?<=\\()([a-z])[^,]+,\s*(\p{Nd})\s*(\p{Math})\s*\1$"
  let captures := Regex.captures "∀ (n : Nat), 0 ≤ n" re
  IO.println s!"{captures}"

Output is

fullMatch: '∀ (n : Nat), 0 ≤ n', 0, 22
groups: #[(some ('n', 5, 6)), (some ('0', 15, 16)), (some ('≤', 17, 20))]

Components of regular expression:

Performance #

The performance is tested for 2 different kinds of regular expressions and haystacks.

The performance data is generated with the benchmark tool .

Test 1 #

The performance is tested for the regular expression a?<sup>n</sup>a<sup>n</sup> and the haystack a<sup>n</sup> where n means repetition i.e. a?a?aa for n = 2. To find a match for this expression the BoundedBacktracker has to traverse the entire search space.

n seconds visited values backtracks
100 0.005 20404 5050
500 0.150 502004 125250
1000 0.600 2004004 500500
2000 2.200 8008004 2001000

The visited values are the encoded (StateID, HaystackOffset) pairs that have been visited by the backtracker within a single search (BoundedBacktracker.SearchState).

Test 2 #

The performance is tested for the regular expression (?i)Xyz and a haystack which contains xyz as last element.

size of haystack seconds visited values
1 kb 0.170 5010
100 kb 0.340 500010
1000 kb 1.720 5000010

Inspect tool #

The inspect tool command line options:

Output of inspect ast 'a|b

Alternation
  0: Literal (a|b,0,1) Verbatim 'a'
  1: Literal (a|b,2,3) Verbatim 'b'

Output of inspect hir 'a|b

Alternation
  0: Literal 'a'
  1: Literal 'b'

Output of inspect compile 'a|b

0: UnionReverse [ 2 3 ]
1: Empty => 0
2: SparseTransitions [ 0x00-0xd7ff => 1 0xe000-0x10ffff => 1 ]
3: capture(pid=0, group=0, slot=0) => 6
4: byte-range a-a => 7
5: byte-range b-b => 7
6: Union [ 4 5 ]
7: Empty => 8
8: capture(pid=0, group=0, slot=1) => 9
9: Match(0)

Output of inspect captures 'a|b' a

fullMatch: 'a', 0, 1
groups: #[]