How to create a lexer using PLY in Python

PLY (Python Lex-Yacc) is a Python library that is, in essence, an implementation of the popular parsing tools lex and yacc. The library consists of two modules, lex.py and yacc.py, which create a lexer and a parser, respectively. In this Answer, we'll explore the lex.py module to create a lexical analyzer for a simple calculator.

Creating a lexer

For simplicity, we’ll create a lexer that can tokenize a string of numbers and basic arithmetic operators (+, -, *, /) and print an error if it encounters any other character. Let’s start by importing the library:

import ply.lex as lex

Defining tokens

Next, we'll create a list of token names that our lexer will be able to recognize:

# Define token names
tokens = (
    'PLUS',
    'MINUS',
    'MULTIPLY',
    'DIVIDE',
    'INTEGER'
)

Defining token rules

Once the tokens have been defined, the next step is to define a regex for each token. Each token name is prefixed with t_ to help PLY distinguish token rules from other definitions. Rules for the simpler tokens can be defined as strings, as shown below:

t_PLUS = r'\+'
t_MINUS = r'\-'
t_MULTIPLY = r'\*'
t_DIVIDE = r'/'
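
Note the backslashes: + and * are regex metacharacters (quantifiers), so they must be escaped to match the literal operator characters; escaping - is optional outside a character class but harmless. A quick sanity check using only Python's built-in re module (independent of PLY):

```python
import re

# Escaped patterns match the literal operator characters.
print(re.findall(r'\+', '1 + 2 * 3'))   # ['+']
print(re.findall(r'\*', '1 + 2 * 3'))   # ['*']

# An unescaped '+' is an invalid pattern on its own ("nothing to repeat").
try:
    re.compile(r'+')
except re.error as exc:
    print('invalid pattern:', exc)
```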

For the complex tokens, the rules can be defined using a function:

def t_INTEGER(t):
    r'\d+'
    t.value = int(t.value)
    return t

In the rule for the INTEGER token, the function's docstring holds the regex that identifies numbers. We then convert the matched string to a Python integer and return the token. This ensures that numbers are treated as integers rather than strings.
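
The effect of the conversion can be seen with a quick stdlib-only check (using Python's re module, not PLY):

```python
import re

match = re.match(r'\d+', '42 + 7')   # same pattern as t_INTEGER
raw = match.group()                  # '42' -- still a string
value = int(raw)                     # 42  -- what t_INTEGER stores in t.value

print(repr(raw), repr(value))   # '42' 42
print(raw * 2, value * 2)       # '4242' vs 84: why the conversion matters
```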

Since the input string can contain whitespace, we'd like our lexer to ignore it. We can use the t_ignore rule, reserved by ply.lex for characters that should be skipped, to disregard any spaces or tabs in the input data:

# Define regex to ignore whitespace
t_ignore = ' \t'
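
t_ignore handles spaces and tabs, but newlines deserve a separate rule: PLY never updates line numbers on its own. The conventional t_newline rule increments t.lexer.lineno manually. Here's a condensed, self-contained sketch (reusing a subset of the tokens above, with an assumed two-line input):

```python
import ply.lex as lex

tokens = ('PLUS', 'INTEGER')

t_PLUS = r'\+'
t_ignore = ' \t'

def t_INTEGER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)   # keep lineno in sync with the input

def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('1 +\n2')
toks = list(lexer)                   # a PLY lexer is also iterable
for tok in toks:
    print(tok.type, tok.value, tok.lineno)
```

Because t_newline returns nothing, the newline itself produces no token; it only advances the line counter, so the final INTEGER reports line 2.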

Finally, we'll define a rule to handle any errors while lexing:

# Define a rule to handle errors while lexing
def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)

Lexer execution and token generation

We have defined the rules for our lexer. All we have to do now is create an instance of the lexer, feed it the input string, and iterate over the generated tokens using a while loop:

lexer = lex.lex()
test = '1 + 1 * 3 - 4 / 2'
lexer.input(test)
while True:
    token = lexer.token()
    if not token:
        break
    print(token)
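
Each value returned by lexer.token() is a LexToken object whose printed form shows the token's type, value, line number, and position; the same fields are available as attributes. A condensed, self-contained sketch (assuming a minimal two-token lexer):

```python
import ply.lex as lex

# Condensed two-token lexer, just to inspect one LexToken object.
tokens = ('PLUS', 'INTEGER')

t_PLUS = r'\+'
t_ignore = ' \t'

def t_INTEGER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('1 + 2')
first = lexer.token()
# A LexToken carries type, value, lineno, and lexpos attributes.
print(first.type, first.value, first.lineno, first.lexpos)   # INTEGER 1 1 0
```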

Testing the lexer

Let's combine all of the code snippets and test the lexer we created. For each token in the input string test, the following code will print its type, value, line number, and position (the character index at which the token starts):

import ply.lex as lex

# Define token names
tokens = (
    'PLUS',
    'MINUS',
    'MULTIPLY',
    'DIVIDE',
    'INTEGER'
)

# Define regex for each token
t_PLUS = r'\+'
t_MINUS = r'\-'
t_MULTIPLY = r'\*'
t_DIVIDE = r'/'

def t_INTEGER(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Define regex to ignore whitespace
t_ignore = ' \t'

# Define a rule to handle errors while lexing
def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)

lexer = lex.lex()
test = '1 + 1 * 3 - 4 / 2'
lexer.input(test)
while True:
    token = lexer.token()
    if not token:
        break
    print(token)

Our lexer functions as intended. The ply.lex module is pretty straightforward and easy to use. Feel free to extend this code to construct a lexer capable of recognizing and extracting complex tokens.
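
As a starting point for such an extension, here's a hedged sketch that adds floating-point numbers and parentheses (the token names FLOAT, LPAREN, and RPAREN are our own choices, not part of the original lexer). PLY tries function rules in the order they are defined, so t_FLOAT must appear before t_INTEGER; otherwise '1.5' would lex as INTEGER 1, an illegal '.', and INTEGER 5:

```python
import ply.lex as lex

tokens = (
    'PLUS', 'MINUS', 'MULTIPLY', 'DIVIDE',
    'INTEGER', 'FLOAT', 'LPAREN', 'RPAREN',
)

t_PLUS = r'\+'
t_MINUS = r'\-'
t_MULTIPLY = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_ignore = ' \t'

# Defined before t_INTEGER so '1.5' matches as a FLOAT first.
def t_FLOAT(t):
    r'\d+\.\d+'
    t.value = float(t.value)
    return t

def t_INTEGER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('(1.5 + 2) * 3')
toks = list(lexer)
for tok in toks:
    print(tok.type, tok.value)
```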

Copyright ©2025 Educative, Inc. All rights reserved