GPT-3 Can Write Regular Expressions

I have written here numerous times about GPT-3, the language model developed by OpenAI, which has produced stunning and sometimes scary results on problems such as text completion, question answering, writing computer code, and generating text-based adventure games. Just search for “GPT-3” in the site search box for links.

Now, “Aarya” has applied GPT-3 to one of the most arcane corners of programming: composing regular expressions to match patterns in text. If you have never encountered regular expressions, you may consider yourself as having lived a privileged life, although perhaps deemed too sheltered by gnarly-fingered programmers. Regular expressions pack a lot of power in a few characters, but can drive you crazy to write and debug. A simple example might be to find words that begin with a vowel and end with “ology”.

^[aeiouy][a-z]*ology$
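
To see how such a pattern behaves, here is a minimal Python sketch that tries it on a few words, with the vowel class written as `[aeiouy]`. The sample words and the helper name `is_vowel_ology` are my own, chosen only for illustration. Note that the pattern accepts lowercase input only.

```python
import re

# Words that begin with a vowel (y included) and end with "ology".
# Lowercase only, because the character classes are lowercase.
VOWEL_OLOGY = re.compile(r"^[aeiouy][a-z]*ology$")

def is_vowel_ology(word: str) -> bool:
    """Return True if the whole word matches the pattern."""
    return VOWEL_OLOGY.match(word) is not None

# Illustrative words, not taken from the article.
for word in ["ecology", "astrology", "biology", "Ontology", "ology"]:
    print(word, is_vowel_ology(word))
```

"ology" on its own fails because the leading vowel and the "ology" suffix cannot share characters.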

Here’s one that validates MasterCard numbers.

^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$
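
A small Python sketch of how this pattern might be used follows. The helper name `looks_like_mastercard` is hypothetical, and the sample numbers are assumptions for illustration (the first is a widely published test number). The pattern checks only the prefix range and the 16-digit length. It does not verify the Luhn checksum, so a match does not mean the number is a real card.

```python
import re

# MasterCard prefixes 51-55 and 2221-2720, followed by 12 more digits.
MASTERCARD = re.compile(
    r"^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}"
    r"|27[01][0-9]|2720)[0-9]{12}$"
)

def looks_like_mastercard(number: str) -> bool:
    """Prefix and length check only; no checksum validation."""
    return MASTERCARD.match(number) is not None

print(looks_like_mastercard("5555555555554444"))  # 55xx prefix, 16 digits
print(looks_like_mastercard("2221000000000009"))  # 2221 prefix, 16 digits
print(looks_like_mastercard("4111111111111111"))  # starts with 4, rejected
print(looks_like_mastercard("555555555555444"))   # only 15 digits, rejected
```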

How about one that matches an HTML tag in a document and its matching closing tag?

<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)
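
The sketch below runs this pattern through Python's `re` module to show what the capture groups hold. The helper name `match_tag` and the sample strings are assumptions for illustration. The back-reference `\1` is what ties the closing tag to the opening tag name. A pattern like this copes with simple, flat fragments only, not nested or malformed markup.

```python
import re

# Group 1: tag name, group 2: attributes (if any), group 3: inner text.
TAG = re.compile(r"<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)")

def match_tag(fragment: str):
    """Return (tag name, inner text) or None if no tag pair is found."""
    m = TAG.search(fragment)
    if m is None:
        return None
    return m.group(1), m.group(3)

print(match_tag("<b>bold</b>"))           # paired tag
print(match_tag('<a href="x">link</a>'))  # paired tag with an attribute
print(match_tag("<br />"))                # self-closing, no inner text
print(match_tag("<b>bold</i>"))           # mismatched closing tag
```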

The GPT-3 based tool is called AutoRegex, and requires you to create a free account or sign in with Google in order to use it. You can enter an English description of what you want to match, and it generates a regular expression to do that. For example:

duplicate words
\b(\w+)\b\s+\1

words beginning with a vowel and ending with “ology”
/^[aeiouAEIOU]\w+ology$/

valid IPv4 address
^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25
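
The duplicate-words result above is easy to try out. This is a minimal Python sketch with made-up sample sentences and a hypothetical helper, `find_duplicates`. It also shows one weakness: the generated pattern has no word boundary after `\1`, so it reports "the theory" as a duplicate. The `STRICT` variant, with a trailing `\b`, is my own suggested tightening, not something the tool produced.

```python
import re

# Pattern as generated by the tool.
GENERATED = re.compile(r"\b(\w+)\b\s+\1")
# Assumed fix: require a word boundary after the repeated word.
STRICT = re.compile(r"\b(\w+)\b\s+\1\b")

def find_duplicates(pattern, text):
    """Return the words that the pattern reports as repeated."""
    return [m.group(1) for m in pattern.finditer(text)]

print(find_duplicates(GENERATED, "Paris in the the spring"))
print(find_duplicates(GENERATED, "the theory"))  # false positive
print(find_duplicates(STRICT, "the theory"))
```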

You can also give it a regular expression and ask it to explain, in English, what it does. Let’s try the HTML tag matcher from above.

<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)
The regular expression matches an opening HTML tag (<), followed by one or more characters ([a-z]+) for the tag name, followed by zero or more characters ([^<]+) that are not “<” characters, followed by a closing angle bracket (>). The regular expression also includes a non-capturing group (?:>) that matches either a closing HTML tag with the same name as the opening tag, or a space followed by a forward slash (/).

As with everything GPT-3 related, it is not perfect, but it is phenomenal. If you get any interesting or enlightening results, please post them here as comments.

This is related: Using GPT-3 to explain how code works

There’s an unreleased ‘assistant’ model that excels at this.

Not sure if it has a sense of humor or is just being obtuse, but when I typed “IPv4 Address” (omitting the word “valid”) it gave me:

\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}

Which has the pattern of an IP address but will let you use any numbers, including ones above 255.
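
To illustrate, here is a rough Python sketch comparing that pattern with the standard library's `ipaddress` module, which does check the range of each octet. The helper names are hypothetical, and `ipaddress` is swapped in as a reference check; it has nothing to do with how the tool works.

```python
import ipaddress
import re

# Pattern exactly as the tool returned it: shape only, no range check.
SHAPE_ONLY = re.compile(r"\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}")

def looks_like_ipv4(text: str) -> bool:
    """True if the text has the four-groups-of-digits shape."""
    return SHAPE_ONLY.fullmatch(text) is not None

def is_real_ipv4(text: str) -> bool:
    """True if the standard library accepts the text as an IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True

for candidate in ["192.168.0.1", "999.999.999.999"]:
    print(candidate, looks_like_ipv4(candidate), is_real_ipv4(candidate))
```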

It’s also wrong in that it doesn’t escape the periods, which means they’re interpreted as metacharacters that match any single character. Thus it would consider “123-201:218=11” valid.
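
A quick Python check bears this out. The `ESCAPED` variant is my own correction of the dots only; it still does not restrict each number to 0-255.

```python
import re

# As generated: "." matches any character, not just a literal period.
UNESCAPED = re.compile(r"\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}")
# Assumed fix: "\." matches a literal period only.
ESCAPED = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def accepts(pattern, text: str) -> bool:
    """True if the pattern matches the entire string."""
    return pattern.fullmatch(text) is not None

print(accepts(UNESCAPED, "123-201:218=11"))  # accepted by mistake
print(accepts(ESCAPED, "123-201:218=11"))    # rejected
print(accepts(ESCAPED, "192.168.0.1"))       # still accepted
```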
