Frequently Asked Questions (FAQ)
for people wanting to
study computational linguistics with me
Thank you for your interest in working with me.
Some of what I say here may seem discouraging. That's not intentional.
Computational linguistics is a hot area, often leading to high salaries,
but requires special talent and hard work, and I don't want people to underestimate
what they're getting into.
Because I can't work with an unlimited number of students,
I have to choose the ones who are most likely to be successful.
Supervising each student takes time.
More importantly, I do externally funded research, not just teaching.
Research projects are not exercises for students — they are jobs where
we need to get the work done,
whether the work is easy or hard, and whether students are involved or not.
This is radically different from taking courses and doing homework designed
for students.
(1) "What are natural language processing and computational linguistics?"
Natural language processing is the use of computers
to handle information expressed in human languages. That includes both
shallow natural language processing (information retrieval and related
technologies for searching and indexing texts) and deep natural language processing
(trying to understand texts in a humanlike way).
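As a rough illustration of the shallow end of that spectrum, here is a minimal Python sketch of indexing and searching texts with an inverted index. The three documents and the function names (tokenize, build_index, search) are invented for this example; this is a toy, not a description of any particular retrieval system.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical toy documents, invented for this illustration.
documents = {
    1: "The dog chased the cat.",
    2: "The cat chased the dog.",
    3: "The dog slept.",
}
index = build_index(documents)
hits = search(index, "dog chased")
print(sorted(hits))
```

The query matches documents 1 and 2 alike, although they describe opposite events. Word-level indexing cannot tell who chased whom; recovering that requires the structural analysis that deep methods attempt.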
Computational linguistics is where computation meets linguistics. That is, it is
roughly the same thing as deep natural language processing. It involves getting computers
to recognize the structure and meaning of human-language utterances. It includes speech
synthesis and recognition, syntactic parsing, and semantic analysis (but not simple text
searching or vocabulary analysis).
Computational linguistics is not just linguistics with
computers. All linguists should be using computers as tools, but that isn't
computational linguistics unless genuinely new techniques are being developed.
(2) "Does the University of Georgia have a program in computational linguistics?"
The easiest way to study computational linguistics is to enroll in the
two-year
master's degree in artificial intelligence (M.S. in AI)
and choose computational linguistics as your specialty.
You can also take computational linguistics options in various degrees in the
Linguistics Program and the
Department of Computer Science,
provided you have a good grounding in those subjects. That is the only way to
do a Ph.D. in computational linguistics.
Note that AI, Linguistics, and Computer Science are three different units of the
University of Georgia; none of them is a branch of another.
Note also that we have only one computational linguistics faculty member (myself)
and, normally, only one computational linguistics course (LING/CSCI 8570, which
has CSCI/ARTI 6540 as a prerequisite). Expect to do a considerable amount of
research on your own.
I encourage incoming students to join an existing research project or do something related.
However, I can also direct thesis projects in other areas of computational linguistics that
are within my area of expertise. My expertise does not include speech synthesis and recognition,
nor the processing of non-Western languages. There are no other computational linguists
at the University of Georgia.
(3) "What background and skills do I need in order to go into computational linguistics?"
Computational linguistics is both linguistics and computer programming.
You have to be able to build software to test your ideas. You also have to know enough
linguistics to make contact with the cutting edge of knowledge about human language.
Suggested reading:
- Jurafsky and Martin, Speech and Language Processing (good overview and handbook,
more than anyone can cover in one course)
- Bird, Klein, and Loper, Natural Language Processing with Python (designed
to teach both NLP and the Python programming language to motivated beginners; popular
with computer programming enthusiasts; slanted toward shallow methods, but not
exclusively)
- Allen, Natural Language Understanding (classic book on deep methods)
- Covington, Natural Language Processing for Prolog Programmers (similar to
Allen but with algorithms in Prolog; when used today, we supplement it extensively
with handouts and updates)
- Manning, Raghavan, and Schütze, Introduction to Information Retrieval
(up-to-date book on shallow methods)
You also have to be a well-prepared student. This implies:
- You speak and understand English well enough to make subtle judgments about the form
and meaning of English sentences. This is necessary in order to do linguistic analysis
and to judge whether your NLP software is working correctly. You are also expected to do
lots of reading in English (sometimes a whole book in just a few days)
and to be able to write papers
in perfect English, with correct grammar and spelling.
- You know how to use a library (not just Google!) and how to write a scientific paper,
including constructing a bibliography and citing sources.
(This is undergraduate-level knowledge and will not be taught to you
in graduate school. Fill gaps in your preparation before you arrive.)
- You have a commitment to high quality. Teaching is an alliance — you are
here to do good work and I am here to help you make it better. At all stages, your work
should look like the work of a professional researcher, not a student hastily doing homework.
At the outset, you will of course be constrained by how much you know, but your work
should never be careless, sloppy, or knowingly inaccurate.
People coming from a linguistics background need to develop a working knowledge
of software development, going beyond a single programming language
to encompass algorithms, data structures, and some awareness of how to engineer large computer programs.
A working computational linguist should know a general-purpose
programming language such as Java or C# (the latter is preferred in my lab),
a symbolic language such as Prolog (which we teach you), and a text-oriented
language such as Perl or Python (again, the latter is preferred).
You don't have to know all of this coming in, but you need to make a credible start
and demonstrate some talent.
A key cultural expectation is that adept computer programmers are self-taught.
If you wait for a course to teach you everything, you'll never catch up. Conversely,
if you sit through courses but don't retain the contents, you'll never catch up.
People from a humanities background are sometimes daunted by the amount of technical
material that needs to be mastered and the speed with which people actually master it.
If, on the other hand, you're delighted with the prospect of digging into new technical
material, you're our kind of person.
People coming from a computer science background are sometimes dismayed to find that
linguistics has a large repository of unsolved problems.
It's not a set of finished results you can learn quickly and apply immediately.
Nor do meta-techniques (such as machine learning) solve the problem.
We genuinely do not know the whole structure of even one human language.
We do not even know what kind of formal mechanisms the structure consists of;
that's why there's such a thing as theoretical linguistics.
Although you can get parsers for natural language as ready-to-run packages
(e.g., the Stanford Parser), their outputs are often incorrect, and unless you have a good
grounding in linguistics, you won't know when you're looking at incorrect output,
or whether the error affects your application.
Accordingly, people coming in from a computer science background need courses,
or at least background reading, in general linguistics (e.g., Fromkin et al.,
An Introduction to Language) and a course in syntactic theory.
When available, a course in semantics is also very valuable. Much of this is
hard to learn by just reading; like advanced mathematics, linguistic analysis is
something you learn by doing.
If your interests are more practical than theoretical, and/or if they involve psychology,
you should also know some basic statistics. You don't need to be able to derive
statistical methods from their mathematical foundations, but you need to understand such
things as mean, standard deviation, significance level, t-test, and clustering.
The University has many good graduate and undergraduate courses on statistical techniques.
You can also learn statistics from books such as Motulsky's Intuitive Biostatistics.
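To give a sense of the level intended, the following minimal Python sketch computes a mean, a sample standard deviation, and Welch's t statistic for two independent samples, using only the standard library. The two groups of numbers are invented for illustration, and welch_t is a hypothetical helper name, not part of any library.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic for two independent samples."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical measurements for two groups, invented for this example.
group_a = [12.0, 15.0, 11.0, 14.0, 13.0]
group_b = [9.0, 10.0, 8.0, 11.0, 12.0]

print("mean of A:", statistics.mean(group_a))
print("standard deviation of A:", statistics.stdev(group_a))
print("mean of B:", statistics.mean(group_b))
print("standard deviation of B:", statistics.stdev(group_b))
print("Welch's t:", welch_t(group_a, group_b))
```

Turning the t statistic into a significance level requires the t distribution, which the standard library does not provide; in practice one would use a table or a statistics package (for example, scipy.stats.ttest_ind with equal_var=False).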
(4) "What subjects do you specifically do research on?"
At the moment (January 2011) I am involved in two major projects and two
minor ones.
The CASPR project
uses computational linguistics as a tool to study mental illness; we are
presently working with the Emory University School of Medicine to study the
speech of people suffering from schizophrenia.
The IHUT project uses relatively
shallow natural language processing methods to track developments in
international relations. That is, we do such things as read the news media
(via the Internet) and notice developing international conflicts.
The ARC project has as its goal
the knowledge representation and language understanding necessary to process
descriptions of architecture, with Gothic cathedrals as a starting point.
I also have a small collaborative project with the University of Maryland to work
on a self-aware reasoning system that processes human language.
(5) "Can I join your research group?"
About 10 to 20 students and colleagues are doing research with me at any given time.
If you want to join the group, I'm going to ask you about your relevant skill set —
that is, what can you do for us? I will also ask you for samples of previous research,
or at least term papers on scientific subjects, to verify that you are able to write up
your work effectively.
If you appear to be in a position to make a contribution,
I will invite you to attend weekly research group
meetings (see the date and time
on http://www.ai.uga.edu/mc/schedule.txt).
Attending group meetings is an absolute requirement — I cannot set up a separate meeting with
each individual, and we can't reschedule the meeting every time another person joins.
It may well be the case that you simply don't have time to join us after all, and that's OK.
Nobody has time to do everything, and we don't want you to try to do too many different
things at once.
You cannot contribute to a research project in just an hour or two a week. Joining a research
group implies that you will be devoting several hours a week to research and that you are likely
to choose course project topics and a thesis topic in the relevant area.
Just to avoid misunderstanding: Most members of the research group do not have assistantships.
When I have assistantships available, I award them to students who are already doing productive work,
not new applicants whose work I have not yet seen.
(6) "Can I say on my résumé that I have been in your research group?"
No; instead, do some research and then say, on your résumé, what
research you've accomplished. It is misleading to say that you have belonged to my
research group if you can't point to specific papers co-authored or other research achievements.
Naturally, if you have an assistantship, you should list it on your résumé.
(7) "Do you have a funded position for a Ph.D. student in your lab?"
See the earlier questions. We don't operate the way the biological sciences do, with renewable
5-year grants and a specific number of Ph.D. RA positions that are almost like regular jobs.
Our external funding comes and goes on a much shorter cycle and is less predictable.
In the M.S. program in AI, funding is fairly abundant but cannot be guaranteed in advance.
In the computer science and linguistics Ph.D. programs, teaching assistantships are common.
(8) "Your research sounds fascinating. Can I just come along and help in the lab?"
If you are a first-rate computer programmer with appreciable knowledge of linguistics or
experimental psychology, let's talk. Otherwise, there are a couple of reasons why we can't
accept casual visitors or volunteer helpers. One is that we don't have easy work to give to beginners.
Anyone joining the group needs to bring needed skills that will lighten others' workloads.
The other reason is data confidentiality. Our schizophrenia data comes from confidential
medical records and can only be shown to people who have had the appropriate training in ethical
and legal requirements. Our industrial contracts often involve confidential industry data.
See also question (5) above, and note that we are not looking for entry-level computer
programmers. Everyone entering the relevant degree programs can program a computer.