Natural Language Processing, Machine Learning & the Web
Due to the vast amount of available language data, the Web both enables and benefits from machine learning and natural language processing techniques. This course will cover 1) seminal and state-of-the-art approaches to language understanding that are robust and/or scalable, 2) machine learning and data analysis technologies that are well-suited to web data including online training, ranking, active learning and outlier detection, 3) core web technologies and APIs, and 4) ensemble methods for merging evidence from disparate sources.
This course satisfies the "Corpus Analysis" or "Advanced Natural Language Processing" requirement of the CUNY Graduate Center Computational Linguistics MA/PhD Certificate Program. Linguistics students must have successfully completed Methods in Computational Linguistics I and II. Completion of Language Technology is also strongly recommended.
An editorial note: I will do my best to balance the needs of Linguistics and Computer Science students. Regarding the programming expectations: assignments will include programming from scratch as well as using external packages and resources. All students should feel comfortable converting an algorithm or technique that is presented in pseudocode, diagrams, or natural language into code. Lecture material will not cover specific implementation issues or data structures. It is the student's responsibility to solve implementation issues independently. Similarly, it is the student's responsibility to be able to install, run, and interact with external packages and resources.
Week 1 - Because that's where the money is.
Week 2 - Document Classification and Word Representations.
Week 3 - Conclusion of Lecture 2. Neural Networks.
Week 4 - Interacting with the Web
Week 5 - Language Modeling / Proposal Description
Week 6 - Question Answering / Project Proposal Presentations
Week 7 - Information Extraction
Week 8 - Linear Time Parsing. Guest speaker: Liang Huang
Week 9 - Ensemble Methods
Week 10 - Information Retrieval and Ranking
Week 11 - Sentiment Analysis / Crowdsourcing
Week 12 - Clustering and Outlier Detection
Week 13 - Guest speaker: Karen Livescu
Week 14 - Student Presentations
Upon successful completion of this course, a student can expect to be able to:
Implement a number of word representation techniques
Perform document and sentiment classification
Implement simple classification algorithms
Correctly evaluate natural language processing applications
Correctly evaluate machine learning algorithms
Use at least two open web APIs
Write programs that process data from the web
Write systems that integrate disparate information
Understand the basics of crowdsourcing
Understand approaches to outlier detection
Speech and Language Processing, Authors: Jurafsky and Martin, Publisher: Prentice Hall, Edition: 2, ISBN: 978-0131873216
Convex Optimization, Authors: Stephen Boyd and Lieven Vandenberghe, Publisher: Cambridge University Press, ISBN: 978-0521833783. (Available as a free download.)
Come to class. A major component of this class is participation and presentation.
Cell phones must be on silent and are not to be checked or used during class. If you are expecting an urgent call, tell the instructor at the start of class.
There is a strong preference against using laptops, tablets, or other computers in class.
Cell phone policy: One warning; after that, 5 points off the next homework.
The Final Letter Grade will be based on a scaled adjustment of the Final Numeric Grade. When the scale has been determined, the class will be informed either in class or by email, and the scale will be posted to the course webpage.
Do not cheat. You may discuss assignments with your classmates, but write or program your assignment alone. Do not ask for or offer to share code or written assignments. If you discuss an assignment with a classmate or on an online forum, include the name of the classmate or the URL of the forum on your assignment or in the documentation of your code. The first instance of cheating results in an automatic zero for the assignment (or final project). A second instance of cheating results in a zero (F) for the course. The Computer Science Department will be notified in writing of all instances of cheating. On a second instance, a report will be submitted to the Office of Academic Integrity.
Assignments will be posted to the course website after class on the date they are assigned.
All assignments will be scored out of 100 points.
There are 4 assignments. Each assignment will have a theoretical (pen-and-paper) component. Assignments may also include an implementation (coding) component.
Assignments will be due by 11:59pm on their due date. Assignments should be delivered electronically, via email to the instructor.
No late assignments will be accepted. If you need an extension, let me know as early as possible. I will do my best to be reasonable to you and fair to the rest of the class. No extensions will be granted less than 24 hours before the assignment is due.
If an assignment has a programming component, it can be written in C++, Java, or Python.
In general, grading will be 65% Implementation (compilation, passing tests, implementational details) and 35% Documentation and Style. This may be adjusted for some assignments. Always read the assignment for the grading breakdown.
Detailed requirements will accompany each assignment. The instructions and requirements on a particular assignment always take precedence over the general guidelines on the course website.
Coding assignments should be submitted electronically. Submitting multiple times is fine. The latest submission made on time will be graded. If you submit an assignment late after already submitting it on time, you must let me know by email that you would like the late submission graded instead.
Written assignments should also be delivered electronically, via email or Google Docs.
Electronic copies must be in one of the following formats: .pdf, Microsoft Word .doc/.docx, Google Docs.
Points for each question will be described in each assignment.
In extenuating circumstances, students may be given an Incomplete if material has not been completed by the end of the semester. When an Incomplete is granted, the student and instructor will specify, in writing, a timeframe for all outstanding material to be submitted. If no other timeframe has been specified in writing, the deadline for submitting all outstanding material to resolve an Incomplete will be one month following the last meeting of the class. This semester, that makes the deadline January 9, 2014. An Incomplete that is not resolved by the deadline will become an F.
The Final Project will be an original research project. Possible project ideas will be presented in class. Depending on the size of the class, team projects may be acceptable. The goal will be 12-15 projects for the class. Part of the project will be a short (5-10 minute) presentation of your work.
The goal of the project is to carry out research that incorporates Natural Language Processing, Machine Learning, and Web Technologies. Acceptable project ideas will involve either a modification to an existing approach to a problem or an entirely novel problem. Note: a successful project does not need to generate state-of-the-art results. Novelty, however, is expected. A short (4-page) report on the algorithm, dataset/problem, and evaluation is expected as part of the project.