Frequently Asked Questions (FAQ)
for people wanting to
study computational linguistics with me

Michael A. Covington
Institute for Artificial Intelligence
The University of Georgia

Thank you for your interest in working with me. Some of what I say here may seem discouraging. That's not intentional. Computational linguistics is a hot area, often leading to high salaries, but it requires special talent and hard work, and I don't want people to underestimate what they're getting into.

Because I can't work with an unlimited number of students, I have to choose the ones who are most likely to be successful. Supervising each student takes time.

More importantly, I do externally funded research, not just teaching. Research projects are not exercises for students; they are jobs where we need to get the work done, whether the work is easy or hard, and whether or not students are involved. This is radically different from taking courses and doing homework designed for students.


(1) "What are natural language processing and computational linguistics?"

Natural language processing is the use of computers to handle information expressed in human languages. That includes both shallow natural language processing (information retrieval and related technologies for searching and indexing texts) and deep natural language processing (trying to understand texts in a humanlike way).
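
To make the shallow/deep distinction concrete, here is a minimal sketch of a shallow method: an inverted index with boolean AND search. The function names (build_index, search) and the three sample sentences are invented for this illustration; real retrieval systems add ranking, stemming, and much more.

```python
from collections import defaultdict
import re

def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word of the query (boolean AND)."""
    words = re.findall(r"[a-z']+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, made up for this sketch
docs = {
    1: "The dog chased the cat.",
    2: "The cat chased the dog.",
    3: "A bird sang.",
}
index = build_index(docs)
print(search(index, "dog chased"))
```

The search for "dog chased" returns documents 1 and 2 alike, even though the two sentences describe opposite events. Word-level indexing cannot tell who chased whom; recovering that is the job of deep methods.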

Computational linguistics is where computation meets linguistics. That is, it is roughly the same thing as deep natural language processing. It involves getting computers to recognize the structure and meaning of human-language utterances. It includes speech synthesis and recognition, syntactic parsing, and semantic analysis (but not simple text searching or vocabulary analysis).

Computational linguistics is not just linguistics with computers. All linguists should be using computers as tools, but that isn't computational linguistics unless genuinely new techniques are being developed.


(2) "Does the University of Georgia have a program in computational linguistics?"

The easiest way to study computational linguistics is to enroll in the two-year master's degree program in artificial intelligence (M.S. in AI) and choose computational linguistics as your specialty.

You can also take computational linguistics options in various degrees in the Linguistics Program and the Department of Computer Science, provided you have a good grounding in those subjects. That is the only way to do a Ph.D. in computational linguistics.

Note that AI, Linguistics, and Computer Science are three different units of the University of Georgia; none of them is a branch of another. Note also that we have only one computational linguistics faculty member (myself) and, normally, only one computational linguistics course (LING/CSCI 8570, which has CSCI/ARTI 6540 as a prerequisite). Expect to do a considerable amount of research on your own.

I encourage incoming students to join an existing research project or do something related. However, I can also direct thesis projects in other areas of computational linguistics that are within my area of expertise. My expertise does not include speech synthesis and recognition, nor the processing of non-Western languages. There are no other computational linguists at the University of Georgia.


(3) "What background and skills do I need in order to go into computational linguistics?"

Computational linguistics is both linguistics and computer programming. You have to be able to build software to test your ideas. You also have to know enough linguistics to make contact with the cutting edge of knowledge about human language.

Suggested reading:

  • Jurafsky and Martin, Speech and Language Processing (good overview and handbook, more than anyone can cover in one course)

  • Bird, Klein, and Loper, Natural Language Processing with Python (designed to teach both NLP and the Python programming language to motivated beginners; popular with computer programming enthusiasts; slanted toward shallow methods, but not exclusively)

  • Allen, Natural Language Understanding (classic book on deep methods)

  • Covington, Natural Language Processing for Prolog Programmers (similar to Allen but with algorithms in Prolog; when used today, we supplement it extensively with handouts and updates)

  • Manning, Raghavan, and Schütze, Introduction to Information Retrieval (up-to-date book on shallow methods)

You also have to be a well-prepared student. This implies:

  • You speak and understand English well enough to make subtle judgments about the form and meaning of English sentences. This is necessary in order to do linguistic analysis and to judge whether your NLP software is working correctly. You are also expected to do lots of reading in English (sometimes a whole book in just a few days) and to be able to write papers in perfect English, with correct grammar and spelling.

  • You know how to use a library (not just Google!) and how to write a scientific paper, including constructing a bibliography and citing sources. (This is undergraduate-level knowledge and will not be taught to you in graduate school. Fill gaps in your preparation before you arrive.)

  • You have a commitment to high quality. Teaching is an alliance: you are here to do good work and I am here to help you make it better. At all stages, your work should look like the work of a professional researcher, not a student hastily doing homework. At the outset, you will of course be constrained by how much you know, but your work should never be careless, sloppy, or knowingly inaccurate.

People coming from a linguistics background need to develop a working knowledge of software development, going beyond a single programming language to encompass algorithms, data structures, and some awareness of how to engineer large computer programs. A working computational linguist should know a general-purpose programming language such as Java or C# (the latter is preferred in my lab), a symbolic language such as Prolog (which we teach you), and a text-oriented language such as Perl or Python (again, the latter is preferred). You don't have to know all of this coming in, but you need to make a credible start and demonstrate some talent.

A key cultural expectation is that adept computer programmers are self-taught. If you wait for a course to teach you everything, you'll never catch up. Conversely, if you sit through courses but don't retain the contents, you'll never catch up. People from a humanities background are sometimes daunted by the amount of technical material that needs to be mastered and the speed with which people actually master it. If, on the other hand, you're delighted with the prospect of digging into new technical material, you're our kind of person.

People coming from a computer science background are sometimes dismayed to find that linguistics has a large repository of unsolved problems. It's not a set of finished results you can learn quickly and apply immediately. Nor do meta-techniques (such as machine learning) solve the problem. We genuinely do not know the whole structure of even one human language. We do not even know what kind of formal mechanisms the structure consists of; that's why there's such a thing as theoretical linguistics. Although you can get parsers for natural language as ready-to-run packages (e.g., the Stanford Parser), their outputs are often incorrect, and unless you have a good grounding in linguistics, you won't know when you're looking at incorrect output, or whether the error affects your application.
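
One reason parser output can be wrong is structural ambiguity: a grammar may license several analyses of the same sentence, and a parser has to pick one. The sketch below counts the analyses that a toy grammar assigns to a sentence, using CKY-style chart parsing. The grammar, the lexicon, and the function name count_parses are invented for this illustration and are not taken from the Stanford Parser or any other package.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form: (parent, left child, right child).
# These rules are an assumption made for this sketch only.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # prepositional phrase modifies the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # prepositional phrase modifies the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "i": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}

def count_parses(sentence, root="S"):
    """Count the distinct parse trees the toy grammar assigns to the sentence."""
    words = sentence.lower().split()
    n = len(words)
    # chart[(i, j)][X] = number of trees rooted in X that span words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[(i, i + 1)][category] += 1
    # Build wider spans from narrower ones, trying every split point k
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][root]

print(count_parses("I saw the man"))
print(count_parses("I saw the man with the telescope"))
```

Under this toy grammar the first sentence has one analysis and the second has two: either the seeing was done with the telescope, or the man had the telescope. A ready-made parser reports just one of them, and it takes linguistic judgment to tell whether it chose the one your application needs.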

Accordingly, people coming in from a computer science background need courses, or at least background reading, in general linguistics (e.g., Fromkin et al., An Introduction to Language) and a course in syntactic theory. When available, a course in semantics is also very valuable. Much of this is hard to learn by just reading; like advanced mathematics, linguistic analysis is something you learn by doing.

If your interests are more practical than theoretical, and/or if they involve psychology, you should also know some basic statistics. You don't need to be able to derive statistical methods from their mathematical foundations, but you need to understand such things as mean, standard deviation, significance level, t-test, and clustering. The University has many good graduate and undergraduate courses on statistical techniques. You can also learn statistics from books such as Motulsky's Intuitive Biostatistics.
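
As a rough indication of the level intended, the sketch below computes a mean, a sample standard deviation, and Welch's t statistic for two groups, using only Python's standard library. The helper name welch_t and the numbers are made up for illustration and are not data from any study. Turning the t statistic into a significance level requires the t distribution, for which one would normally use a statistics package (for example, scipy.stats.ttest_ind with equal_var=False).

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Made-up numbers for illustration only
group_a = [12.0, 15.0, 11.0, 14.0, 13.0]
group_b = [9.0, 10.0, 8.0, 11.0, 12.0]

print("mean A:", statistics.mean(group_a))
print("mean B:", statistics.mean(group_b))
print("standard deviation A:", statistics.stdev(group_a))
print("Welch t:", welch_t(group_a, group_b))
```

With these numbers the means are 13 and 10, both sample variances are 2.5, and the t statistic comes out at 3. Knowing what that number does and does not tell you is the kind of understanding meant here.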


(4) "What subjects do you specifically do research on?"

At the moment (January 2011) I am involved in two major projects and two minor ones.

The CASPR project uses computational linguistics as a tool to study mental illness; we are presently working with the Emory University School of Medicine to study the speech of people suffering from schizophrenia.

The IHUT project uses relatively shallow natural language processing methods to track developments in international relations. That is, we do such things as read the news media (via the Internet) and notice developing international conflicts.

The ARC project has as its goal the knowledge representation and language understanding necessary to process descriptions of architecture, with Gothic cathedrals as a starting point.

I also have a small collaborative project with the University of Maryland to work on a self-aware reasoning system that processes human language.


(5) "Can I join your research group?"

About 10 to 20 students and colleagues are doing research with me at any given time. If you want to join the group, I'm going to ask you about your relevant skill set — that is, what can you do for us? I will also ask you for samples of previous research, or at least term papers on scientific subjects, to verify that you are able to write up your work effectively.

If you appear to be in a position to make a contribution, I will invite you to attend weekly research group meetings (see the date and time on http://www.ai.uga.edu/mc/schedule.txt). Attending group meetings is an absolute requirement — I cannot set up a separate meeting with each individual, and we can't reschedule the meeting every time another person joins. It may well be the case that you simply don't have time to join us after all, and that's OK. Nobody has time to do everything, and we don't want you to try to do too many different things at once.

You cannot contribute to a research project in just an hour or two a week. Joining a research group implies that you will be devoting several hours a week to research and that you are likely to choose course project topics and a thesis topic in the relevant area.

Just to avoid misunderstanding: Most members of the research group do not have assistantships. When I have assistantships available, I award them to students who are already doing productive work, not new applicants whose work I have not yet seen.


(6) "Can I say on my résumé that I have been in your research group?"

No; instead, do some research and then say, on your résumé, what research you've accomplished. It is misleading to say that you have belonged to my research group if you can't point to specific papers co-authored or other research achievements.

Naturally, if you have an assistantship, you should list it on your résumé.


(7) "Do you have a funded position for a Ph.D. student in your lab?"

See the earlier questions. We don't operate the way the biological sciences do, with renewable 5-year grants and a specific number of Ph.D. RA positions that are almost like regular jobs. Our external funding comes and goes on a much shorter cycle and is less predictable. In the M.S. program in AI, funding is fairly abundant but cannot be guaranteed in advance. In the computer science and linguistics Ph.D. programs, teaching assistantships are common.


(8) "Your research sounds fascinating. Can I just come along and help in the lab?"

If you are a first-rate computer programmer with appreciable knowledge of linguistics or experimental psychology, let's talk. Otherwise, there are a couple of reasons why we can't accept casual visitors or volunteer helpers. One is that we don't have easy work to give to beginners. Anyone joining the group needs to bring skills that will lighten others' workloads. The other reason is data confidentiality. Our schizophrenia data comes from confidential medical records and can only be shown to people who have had the appropriate training in ethical and legal requirements. Our industrial contracts often involve confidential industry data.

See also question (5) above, and note that we are not looking for entry-level computer programmers. Everyone entering the relevant degree programs can program a computer.