View Full Version : Chatterbot ideas

07-06-2008, 01:51 AM
Hi again :)

I've been tinkering with OpenCyc yet again, along with putting together the framework for a chatbot system.

My idea is as follows:

Get a word list that allows me to reference the parts of speech for any English word.
Create a system which parses words, sentences, paragraphs, pages, chapters, and books (conceptually, a book is just a large number of related chapters).
Record the valid structures for each construct (sentences, paragraphs, etc.). Each type of sentence will be correlated to a particular Cyc microtheory, which will let me assign a metadata tag to any sentence the system recognizes.
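The first step above (a part-of-speech lookup over a flat word list) can be sketched roughly like this. The file format and tags here are illustrative, not the actual 12dicts format:

```python
# Hypothetical sketch: look up possible parts of speech from a flat word
# list. Assumes tab-separated lines like "run<TAB>noun,verb" -- this format
# is made up for illustration, not the real 12dicts layout.

def load_pos_table(lines):
    """Build a word -> set-of-POS-tags mapping from tab-separated lines."""
    table = {}
    for line in lines:
        word, _, tags = line.strip().partition("\t")
        if word:
            table[word.lower()] = set(tags.split(",")) if tags else set()
    return table

def tag_sentence(sentence, table):
    """Return (word, possible-POS) pairs; unknown words get an empty set."""
    words = sentence.lower().rstrip(".?!").split()
    return [(w, table.get(w, set())) for w in words]

sample = ["run\tnoun,verb", "the\tarticle", "dog\tnoun", "fast\tadjective,adverb"]
table = load_pos_table(sample)
print(tag_sentence("The dog can run fast.", table))
```

Unknown words ("can" in the sample) come back with an empty tag set, which is exactly where the predictive spellchecking mentioned below would hook in.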

A microtheory is a concept in the Cyc ontology system that encapsulates particular ideas that may be factually or semantically different from other concepts. Think of inheritance and polymorphism, but utilized to create a hierarchical view of the world, using hundreds of thousands of basic common-sense concepts and millions of assertions related to those concepts. It allows an AI system to use the proper granularity for arbitrary contexts, by drawing on inferences, hypotheses, and static data.
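The inheritance analogy can be made concrete with a toy sketch: contexts inherit assertions from more general contexts, and a more specific context can override them. The names echo Cyc (e.g. "BaseKB"), but this data structure is purely illustrative, not how Cyc stores anything:

```python
# Toy sketch of the microtheory idea: a fact is looked up in the most
# specific context first, falling back to more general ones, so the same
# query gets context-appropriate answers.

class Microtheory:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.assertions = {}          # fact -> truth value in this context

    def assert_fact(self, fact, value=True):
        self.assertions[fact] = value

    def holds(self, fact):
        """Return the fact's value here or in an ancestor context, else None."""
        mt = self
        while mt is not None:
            if fact in mt.assertions:
                return mt.assertions[fact]
            mt = mt.parent
        return None                   # unknown in every reachable context

base = Microtheory("BaseKB")
base.assert_fact("birds can fly")
penguins = Microtheory("PenguinMt", parent=base)
penguins.assert_fact("birds can fly", False)   # local override

print(base.holds("birds can fly"))      # True
print(penguins.holds("birds can fly"))  # False
```

The override is the key property: the general context still answers True, while the specialized one answers False, which is the "proper granularity" behavior described above.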

The problem with most chatbots is a lack of real intelligent flexibility. You can create a billion template/response structures and your bot will imitate intelligent conversation, but what you're really doing is running an if/then routine over and over again. It takes only about three degrees of separation from the original input before the 'natural' train of thought gets confused and you hit the limitations of any chatbot built on this structure.

However, training a bot over time on actual conversation has some serious drawbacks as well. You end up with rules being created on the fly that may be entirely wrong, with no way of preventing it unless you edit the data manually.

My solution is to semantically tag valid English sentence structures, and assign a particular microtheory to each tag, or class of tags. By using those tags in Cyc microtheories, telling the system what questions it has to answer about particular sentences in order to 'understand' them, I can create an input parser that takes any given English sentence and stores it as a semantically valid data construct.

Once stored, I will again turn to Cyc and classify sentence types, and potential 'proper' responses for particular sentence classes. Statements, queries, imperatives, and so on will each have responses appropriate to their semantic data. I will program responses and link the response categories to each sentence class.
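As a rough sketch of that last step, here is what linking sentence classes to response categories might look like. The classification below is deliberately naive (surface cues only, with made-up cue lists); in the actual design the class would come from the Cyc semantic tags, not from punctuation:

```python
# Hedged sketch: dispatch from a sentence class (statement, query,
# imperative) to a canned response category. Cue words and responses are
# invented for illustration.

RESPONSES = {
    "query": "Let me look that up.",
    "imperative": "I'll try to do that.",
    "statement": "Noted. Tell me more.",
}

QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "is", "are",
                  "do", "does", "can", "could", "will", "would"}

def classify_sentence(sentence):
    s = sentence.strip()
    first = s.split()[0].lower() if s.split() else ""
    if s.endswith("?") or first in QUESTION_WORDS:
        return "query"
    if s.endswith("!") or first in {"please", "stop", "go", "tell"}:
        return "imperative"
    return "statement"

def respond(sentence):
    return RESPONSES[classify_sentence(sentence)]

print(respond("What is a microtheory?"))   # Let me look that up.
print(respond("Tell me a story!"))         # I'll try to do that.
print(respond("Cyc has many assertions.")) # Noted. Tell me more.
```

The point of the dispatch-table shape is that new sentence classes and response categories can be added as data, without touching the control flow.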

So I have a concept for a chatbot system which can take arbitrary (syntactically valid) English sentences, understand them by linking them to an ontological database and creating new concepts as necessary, and return a response based on the actual content of the sentence.

I've gotten a few Cyc microtheories drawn up. I have a 210,000 word 'parts of speech' database. What I need now is ideas for large quantities of text which I can parse to get as broad a range as possible for sentence, paragraph, page, chapter, and book structures.

Does anyone know where I can find really large corpora?

A potential offshoot of this is classification of corpora, and using each class as inputs in a neural net, in order to train patterns specific to styles of writing and genre... so you could create a microtheory that described a story, and have the chatbot output a novel. Given a large enough corpus, you could train on particular authors' styles of writing. I would of course market this software, make millions of dollars, and take over the world.

Anyway, what I'm looking for is ideas as to where to look for parsable data. I'd need structured content, like news articles, books, and so on. The only hardcoding I'm going to do is for things like predictive spellchecking for unknown words, and dealing with broad classes of inputs and responses. I'm hoping that such a system would be able to handle specific inputs and outputs dynamically, and easily pass the Turing test.

I'm also considering IRC logs, chatroom logs, and other "conversation" corpora, but those present problems such as slang, deliberate misspellings, horrible grammar, and extreme ambiguity. I think I should leap one hurdle at a time... so the first is a consistent, pre-edited, dry corpus.

07-06-2008, 02:39 AM
I would generally think of corpora as just word lists. In that case, probably the 12dicts dictionaries. Check SourceForge.

Given that you're looking for large amounts of contextual text, I would probably consider parsing the Gutenberg Project - but that has to be done somewhat carefully, since much of it is pretty arcane and employs deprecated language.

Any of that helpful?

07-06-2008, 03:28 AM
Yup! Actually, I've been using the 12dicts files for my framework so far. You're right, the Gutenberg Project material is a little archaic. I've been getting a list of AP feeds together so that I can parse news articles. I'll probably get as many free e-books as I can as well; I'm also searching the Python pyNLP project forums and mailing list for big text files, but it's a no-go so far.

I'm also wondering if there's a programmatic way to find all possible valid English sentence structures up to a given length/number of words... For paragraphs, however, there's too much variation dependent on the content and context for me to assume that I can determine valid paragraph or page structures. That's where a neural net can provide the 'creative' intelligence.
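Enumerating sentence structures up to a length bound is mechanical once a grammar is fixed. Here is a sketch over a toy context-free grammar; the grammar itself is a made-up fragment, nowhere near real English coverage:

```python
# Hedged sketch: enumerate all fully terminal expansions of a toy CFG up
# to a maximum number of symbols. GRAMMAR and TERMINALS are illustrative.

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["N"]],
    "VP": [["V"], ["V", "NP"]],
}
TERMINALS = {"Det", "N", "V"}

def expand(symbols, max_len):
    """Yield all all-terminal expansions of `symbols` with at most max_len symbols."""
    if len(symbols) > max_len:
        return                        # prune: productions never shrink here
    i = next((k for k, s in enumerate(symbols) if s not in TERMINALS), None)
    if i is None:
        yield tuple(symbols)          # fully terminal: a sentence structure
        return
    for production in GRAMMAR[symbols[i]]:
        yield from expand(symbols[:i] + production + symbols[i + 1:], max_len)

structures = sorted(set(expand(["S"], 4)))
for s in structures:
    print(" ".join(s))
```

For this toy grammar with a bound of 4 symbols, that yields five structures (e.g. "Det N V" and "N V Det N"). The catch, of course, is that the result is only as valid as the grammar you start from, which is exactly why a broad parsed corpus matters.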

Does anyone know if the metadata in copyrighted content is copyrighted? If it is then I would have to obtain permission to parse published content. By metadata I mean the particular way in which the words and sentences are strung together.

07-06-2008, 04:23 AM
That sounds like a legal advice question, which you should consult an attorney for.

In my personal opinion and understanding, anything that is committed to media or reduced to publication is automagically copyrighted.