PLY (Python Lex-Yacc) is a Python library that implements the popular parsing tools lex and yacc. It consists of two modules, `lex.py` and `yacc.py`, which create a lexer and a parser, respectively. In this Answer, we'll explore the `lex.py` module to create a lexical analyzer for a simple calculator.
For simplicity, we’ll create a lexer that can tokenize a string of numbers and basic arithmetic operators (+, -, *, /) and print an error if it encounters any other character. Let’s start by importing the library:
```python
import ply.lex as lex
```
Next, we'll create a list of token names that our lexer will be able to recognize:
```python
# Define token names
tokens = (
    'PLUS',
    'MINUS',
    'MULTIPLY',
    'DIVIDE',
    'INTEGER',
)
```
Once the tokens have been defined, the next step is to define a regular expression for each token. Each token name is prefixed with `t_` so that PLY can distinguish token rules from other functions and variables. Regexes for the simpler tokens can be defined as strings:
```python
t_PLUS = r'\+'
t_MINUS = r'\-'
t_MULTIPLY = r'\*'
t_DIVIDE = r'/'
```
For more complex tokens, the rules can be defined using functions:
```python
def t_INTEGER(t):
    r'\d+'
    t.value = int(t.value)
    return t
```
In the rule for the `INTEGER` token, the function's docstring defines the regex that matches numbers. We then convert the matched string to a Python integer before returning the token. This ensures that the token values are treated as integers rather than strings.
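To see why this conversion matters, consider how arithmetic behaves on the raw matched strings versus the converted values (a standalone illustration, not part of the lexer itself):

```python
# Without the int() conversion, token values are strings,
# so '+' would concatenate rather than add:
print('1' + '3')             # string concatenation

# After int() conversion, the values behave as numbers:
print(int('1') + int('3'))   # integer addition
```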
Since the input string can contain whitespace, we'd like our lexer to ignore it. We can use the `t_ignore` rule, reserved by `ply.lex` for characters that should be ignored, to disregard any whitespace in the input data:
```python
# Define characters to ignore (spaces and tabs)
t_ignore = ' \t'
```
Finally, we'll define a rule to handle any errors while lexing:
```python
# Define a rule to handle errors while lexing
def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)
We have now defined all the rules for our lexer. All that's left is to create an instance of the lexer, feed it the input string, and iterate over the generated tokens using a `while` loop:
```python
lexer = lex.lex()

test = '1 + 1 * 3 - 4 / 2'
lexer.input(test)

while True:
    token = lexer.token()
    if not token:
        break
    print(token)
```
Let's combine all of the code snippets and test the lexer we created. For each token in the input string `test`, the following code will print its type, value, line number, and position in the input:
```python
import ply.lex as lex

# Define token names
tokens = (
    'PLUS',
    'MINUS',
    'MULTIPLY',
    'DIVIDE',
    'INTEGER',
)

# Define regex for each token
t_PLUS = r'\+'
t_MINUS = r'\-'
t_MULTIPLY = r'\*'
t_DIVIDE = r'/'

def t_INTEGER(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Define characters to ignore (spaces and tabs)
t_ignore = ' \t'

# Define a rule to handle errors while lexing
def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)

lexer = lex.lex()

test = '1 + 1 * 3 - 4 / 2'
lexer.input(test)

while True:
    token = lexer.token()
    if not token:
        break
    print(token)
```
Our lexer functions as intended. The `ply.lex` module is straightforward and easy to use. Feel free to extend this code to construct a lexer capable of recognizing and extracting more complex tokens.
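Under the hood, `lex.py` compiles all the token rules into a single master regular expression and repeatedly matches it against the input. The sketch below mimics that idea for our calculator tokens using Python's standard `re` module; the names `TOKEN_RULES` and `tokenize` are hypothetical and not part of PLY's API:

```python
import re

# Hypothetical stand-in for PLY's internal master regex:
# each token rule becomes a named group in one big alternation.
TOKEN_RULES = [
    ('INTEGER',  r'\d+'),
    ('PLUS',     r'\+'),
    ('MINUS',    r'-'),
    ('MULTIPLY', r'\*'),
    ('DIVIDE',   r'/'),
    ('SKIP',     r'[ \t]+'),   # whitespace, analogous to t_ignore
]
MASTER = re.compile('|'.join(f'(?P<{name}>{rx})' for name, rx in TOKEN_RULES))

def tokenize(text):
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            # Analogous to t_error: report the character and skip it
            print(f"Illegal character '{text[pos]}'")
            pos += 1
            continue
        pos = m.end()
        if m.lastgroup != 'SKIP':
            # Convert INTEGER matches to int, like t_INTEGER does
            value = int(m.group()) if m.lastgroup == 'INTEGER' else m.group()
            yield (m.lastgroup, value)

print(list(tokenize('1 + 1 * 3')))
```

This is only a conceptual model: the real `lex.py` also tracks line numbers and token positions, validates the rules at build time, and orders them carefully, but the core loop is the same match-advance-yield cycle shown here.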