Michael A. Covington
> Courses
> Natural Language Processing Techniques
> Blog
|
2010 February 17 |
Some LaTeX resources
My web page links for people wanting to learn LaTeX
Sample LaTeX document (input)
Sample LaTeX document (output) |
|
2010 February 11 |
Contact information for David and Ryan for arranging help sessions
Ryan Karetas has office hours in 301 Boyd GSRC, 11-12 Tuesdays. He says to knock on the door, as it is a shared office suite.
David Robinson does not have fixed office hours but is often available in the Institute for Artificial Intelligence (111 Boyd GSRC). He can be contacted at:
|
|
2010 January 31 |
Important differences between Python 2 and Python 3
There aren't many important differences. Here are the ones you're likely to notice.
(1) In Python 2, / performs integer division if both arguments are integers, and floating-point division otherwise. In Python 3, / always performs floating-point division and // always performs integer (floor) division. You can make Python 2 act like Python 3 by giving the command:
(2) In Python 2, range creates a list of numbers, and xrange creates an abstract object that gives the numbers upon request (thus saving some work). In Python 3, range works like xrange.
(3) In Python 2, raw_input is the normal way to get input from the keyboard. If you use input, the string that you type will be evaluated before it is returned to your program (which is usually inconvenient). In Python 3, input works like raw_input.
(4) The print statement is quite different, as we went over in class: in Python 3, print is a function, so its arguments go in parentheses. Use help("print") to get some details. |
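Here is a small sketch of points (1), (2), and (4) as they behave under Python 3. The variable names are only illustrative. The comment about `from __future__ import division` is my guess at the command meant in point (1), not something confirmed by this page.

```python
# Python 3 behavior. Under Python 2, putting the line
#     from __future__ import division
# at the top of a file gives "/" the Python 3 meaning (assumed to be
# the command that point (1) refers to).

true_quotient = 7 / 2        # floating-point division: 3.5
floor_quotient = 7 // 2      # integer (floor) division: 3

numbers = range(5)           # a lazy range object, not a list
as_list = list(numbers)      # ask for all the numbers at once

print(true_quotient, floor_quotient)
print(type(numbers).__name__, as_list)
print("a", "b", sep="-")     # print is a function, so it takes keyword arguments
```

Point (3) is left out because it needs keyboard input: in Python 3, `input()` simply returns the typed text as a string.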
|
2010 January 29 |
The NLTK book's web site
Corrections to the NLTK textbook are online at http://nltk.googlecode.com/svn/trunk/nltk/doc/book/errata.txt. (If you bought the book very recently, you may have gotten the second printing, in which these errors have been corrected.)
Apparently the entire text of the NLTK book is online (in HTML, without page numbers) at http://www.nltk.org/book. I have not verified that it is identical to the printed book. |
|
2010 January 19 |
Windows 7 64-bit, and other minor matters
I had no trouble installing Python and the other tools under 64-bit Windows 7 according to my own instructions (note that they have just today undergone a minor revision). Everything works. Be sure to ignore the warnings about "unknown publisher" since some versions of Windows object to installers that aren't digitally signed.
There are two pesky problems. First, the download window and the Matplotlib window tend to pop up behind instead of in front of the window I'm working in. Second, some warnings such as the following seem to be coming up gratuitously (maybe some programmer wasn't as fastidious as he should have been):
Warning (from warnings module):
File "C:\Python26\lib\site-packages\nltk\__init__.py", line 588
DeprecationWarning: object.__new__() takes no parameters
Warning (from warnings module):
File "C:\Python26\lib\site-packages\matplotlib\dates.py", line 91
import pytz.zoneinfo
ImportWarning: Not importing directory
'C:\Python26\lib\site-packages\pytz\zoneinfo': missing __init__.py
|
|
2010 January 15 |
Updates
The lecture notes linked in yesterday's blog entry have been revised with better recommendations for how to install NLTK and the NLTK data sets.
The assignment for Tuesday is: |
|
2010 January 14 |
Lecture notes
Notes from the second part of today's lecture, including advice on how to install Python and NLTK: Click here. |
|
2010 January 12 |
The LINGUIST List
Even if you're not a linguist, let me suggest that you read the LINGUIST List, a mailing list that is headquartered at www.linguistlist.org. (Originally it was just a mailing list and not a web site.) This is where job announcements and conference announcements are posted. Of course, one course in computational linguistics does not make you qualified for most of those jobs, but you can see what's out there. |
|
2010 January 10 |
If you're totally new to computers...
Those of you who are totally new to computers, and have no mental model of what goes on inside a computer (ROM, RAM, CPU, etc.), might want to read a short book that I wrote back in 1990: http://www.covingtoninnovations.com/books/Covington-CS-Study-Keys.pdf
This predates the Windows era but explains a lot of important concepts, especially why there are programming languages, and how the parts of a computer work together to perform computations. |
|
2010 January 6 |
A handful of useful links
Download and install Python and NLTK: http://www.nltk.org/download
My research group pages: CASPR, IHUT
ACL Anthology (natural language processing research literature): http://aclweb.org/anthology-new/ |
|
2010 January 6 |
Some term project ideas
These are a few simple, vocabulary-oriented project ideas. Ideas for projects that go deeper into the structure of language will be posted later.
Visualizing literary style as color: It is well known that some writers use more verbs and some use more nouns (e.g., some people say "the invaders destroyed the city" and others write about "the destruction of the city by the invaders"). Actually, the relative proportions of all the different parts of speech are an indication of the writer's style. Some visualization of literary style and color has been done by Keim and Oelke, "Literature Fingerprinting," but they have just scratched the surface. Here's my idea:
(1) Run a lot of texts through a tagger to label all the parts of speech (nouns, verbs, adjectives, etc.).
(2) Use factor analysis or principal component analysis to reduce the part-of-speech proportions to three variables that account for most of the variation.
(3) Map the three variables onto R, G, and B to produce a color and brightness.
Then render texts as sets of tiles, or even as a continuum indicating the moving average, just the way Keim and Oelke did.
What words are leading economic indicators? Collect economics news from the Web for a period going about 5 years into the past. Look at the changing vocabulary (e.g., the rise of words like "recession"). Figure out which vocabulary indicators are good at predicting changes in the economy (vs. indicating that a change has already happened).
Lying language: The University has a corpus of tobacco industry documents, many of which are internal memos indicating scheming to lie to the public. Follow up on Cati Brown's dissertation using machine learning or other more powerful exploratory techniques.
Sentiment analysis: Gather Web news stories about some chosen topic and rate whether people are saying favorable or unfavorable things about it. (We are doing a large funded project in this area, but it would be useful to have a sentiment analyzer that was developed independently and is all our own.) |
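To make the style-as-color idea a little more concrete, here is a rough sketch of steps (1) and (3) only. It assumes the text has already been tagged, and it skips step (2) entirely by using three raw tag proportions in place of the three reduced variables. The tag names, the function names `pos_proportions` and `to_rgb`, and the tiny sample are all made up for illustration.

```python
from collections import Counter

def pos_proportions(tagged_tokens, tags=("NOUN", "VERB", "ADJ")):
    """Return the share of each listed tag among all tagged tokens."""
    counts = Counter(tag for _word, tag in tagged_tokens)
    total = sum(counts.values())
    return [counts[tag] / total if total else 0.0 for tag in tags]

def to_rgb(values, lows, highs):
    """Min-max scale three values to 0..255 and return an (R, G, B) tuple."""
    channels = []
    for value, low, high in zip(values, lows, highs):
        span = high - low
        scaled = (value - low) / span if span else 0.5
        scaled = min(1.0, max(0.0, scaled))   # clamp out-of-range values
        channels.append(round(scaled * 255))
    return tuple(channels)

# Hypothetical hand-tagged sample; a real run would use tagger output.
sample = [("the", "DET"), ("invaders", "NOUN"), ("destroyed", "VERB"),
          ("the", "DET"), ("city", "NOUN")]
props = pos_proportions(sample)
color = to_rgb(props, lows=[0.0, 0.0, 0.0], highs=[1.0, 1.0, 1.0])
print(props, color)
```

In a real project the `lows` and `highs` would come from the range of each variable across the whole collection of texts, so that the colors spread out instead of clustering in one corner of the color space.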
|
2010 January 5 |
UGA Bookstore has reportedly ordered the wrong edition of one textbook
In order to save you some money, I specified the 7th edition of Fromkin, An Introduction to Language, which is widely available used, instead of the current edition (9th now, 8th just a few weeks ago). There is nothing wrong with the 9th edition except that it costs about $113. I specified the 7th edition and that is what I would like you all to have; you can find it used on Amazon and other sites. The UGA Bookstore has reportedly ordered the 9th edition, but if you bought it, you should return it and get your $113 back. Here is a link to the 7th edition on Amazon: http://www.amazon.com/Introduction-Language-Victoria-Fromkin/dp/015508481X/ref=tmm_pap_title_sr.
Even if you already know Python and/or linguistics, please get all three of the specified textbooks. We have an unusually large class and it's going to be hard enough to keep people in step without having divergent books. |
|
2010 January 5 |
Welcome!
Welcome to LING/CSCI 8570! This is the course blog, where useful reference information will be placed for you. Although intended for course participants, this blog is publicly readable. It is written only by the instructor (there is no comment section). New information is added at the top.
This blog is not a complete record of assignments or lecture notes. It is not intended to make it unnecessary to attend class.
Now that we've gotten that clear, I again want to welcome you to what may be the largest section of 8570 we've ever had! |