"A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects." (Robert A. Heinlein)

Saturday, 3 October 2009

Random text generation with Polygen

I've already talked about random text generation in my early posts, showing some simple database techniques. Now I'm going to talk about Polygen: a simple Linux program that can be programmed to produce random text of virtually any complexity desired.


Polygen is installed on Ubuntu with a simple apt-get command:
sudo apt-get install polygen polygen-data
Once installed, it can easily be tested by calling it with one of the example grammars as a parameter. For example:
polygen /usr/share/polygen/eng/genius.grm

It should write a random text like this:
How can I do for receiving a RW space bar from Photoshop NT?

You neither should mount the modem to the desktop, nor have to click a ROM virus to the DVD driver but from Office and from the control tools inside Internet Explorer 2000 you neither can ever unmount a printer, nor can load the wordprocessor for pinging a display on a BIOS display.

Writing a simple grammar

Polygen generates its output by interpreting a set of syntactical rules defined in a grammar text file.
Let's see a simple example grammar that generates a set of “I'm doing something” phrases.
The first solution, simpler but less flexible, is to enumerate all the phrases I want to be generated, like this:
S ::= "I'm watching TV" |
"I am going home" |
"I'm listening to the radio" |
"I'm reading a book";
The “S” symbol is called a “non-terminal symbol”, while the quoted phrases are called “terminal symbols”. This grammar will randomly select one of the pipe-separated phrases. Many non-terminal symbols can be defined; they are identified by an initial capital letter, and by default the “S” symbol is the one that produces the output. Using more non-terminal symbols, we can try to write a more flexible grammar like this:
S ::= ("I'm" | "I am") Verb Target;
Verb ::= "watching" | "going" | "listening to" | "reading";
Target ::= "TV" | "home" | "a book" | "the radio";
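As a rough check of what this grammar is able to produce, here is a small Python sketch that lists every possible phrase. It is not part of Polygen and does not read .grm files: the three lists are copied by hand from the rules above, and the names subjects, verbs and targets are mine.

```python
from itertools import product

# The three independent choices of the grammar above, copied by hand.
subjects = ["I'm", "I am"]
verbs = ["watching", "going", "listening to", "reading"]
targets = ["TV", "home", "a book", "the radio"]

# Every phrase the grammar can derive: one item from each list, in order,
# joined by single spaces.
phrases = [" ".join(parts) for parts in product(subjects, verbs, targets)]

print(len(phrases))  # 2 * 4 * 4 = 32
for phrase in phrases:
    print(phrase)
```

The list contains the four original phrases, but also combinations such as "I'm reading home", because every verb is allowed to pair with every target.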
This grammar can generate a wider set of phrases, but not all of them make sense. We can refine our grammar a little more, like this:
S ::= ("I'm" | "I am") (Go Places | Watch ThingsToLook | Listen ThingsToListen | Read ThingsToRead);
Go ::= "going" | "walking" | "running" | "driving";
Watch ::= "watching" | "looking at";
Listen ::= "listening to" | "hearing" | "overhearing";
Read ::= "reading" | "studying";
Places ::= "home" | "to" ("Genoa" | "Camogli");
ThingsToLook ::= "TV" | "a movie" | "a play";
ThingsToListen ::= "the radio" | "the music";
ThingsToRead ::= "a" ["good"] "book" | "the newspaper";
This grammar can produce quite acceptable output, like this:
maxx@eeepc900:~$ polygen test3.grm
I'm walking to Genoa
maxx@eeepc900:~$ polygen test3.grm
I am reading a good book
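
To give an idea of what happens behind the scenes, here is a minimal Python sketch of the same kind of expansion. It is only an illustration and makes several assumptions: it is not how Polygen is actually implemented, the grammar is translated by hand into a dictionary, each alternative is assumed to be picked with equal probability, and the helper symbols Subject, Action, City and OptGood are hypothetical names I introduced to replace the inline groups and the optional ["good"].

```python
import random

# Hand-translated version of the refined grammar. Each non-terminal maps to
# a list of alternatives; each alternative is a list of symbols. Any symbol
# that is not a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "S": [["Subject", "Action"]],
    "Subject": [["I'm"], ["I am"]],
    "Action": [["Go", "Places"], ["Watch", "ThingsToLook"],
               ["Listen", "ThingsToListen"], ["Read", "ThingsToRead"]],
    "Go": [["going"], ["walking"], ["running"], ["driving"]],
    "Watch": [["watching"], ["looking at"]],
    "Listen": [["listening to"], ["hearing"], ["overhearing"]],
    "Read": [["reading"], ["studying"]],
    "Places": [["home"], ["to", "City"]],
    "City": [["Genoa"], ["Camogli"]],
    "ThingsToLook": [["TV"], ["a movie"], ["a play"]],
    "ThingsToListen": [["the radio"], ["the music"]],
    "ThingsToRead": [["a", "OptGood", "book"], ["the newspaper"]],
    "OptGood": [["good"], []],  # the empty alternative models an optional word
}


def expand(symbol, rng):
    """Return the list of words produced by expanding one symbol."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit it unchanged
    alternative = rng.choice(GRAMMAR[symbol])  # pick one alternative at random
    words = []
    for item in alternative:
        words.extend(expand(item, rng))
    return words


def generate(rng=None):
    """Produce one random phrase starting from the S symbol."""
    rng = rng or random.Random()
    return " ".join(expand("S", rng))


for _ in range(3):
    print(generate())
```

Because each verb group is tied to its own list of objects, a phrase like "I'm reading home" can no longer come out.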

Polygen is, to me, mainly a program to have some fun “the old way”. When I was 15 I had a lot of fun writing programs like this, though not as complex, on my first computer. But even if fun is the main use of Polygen, the theory lying behind it is far from trivial. The educational value of using a program like Polygen and writing a grammar for it shouldn't be underestimated.