Here is a video lecture by Google's Director of Research, Peter Norvig. The full title of the lecture is "Theorizing from Data: Avoiding the Capital Mistake."
In 1891 Sir Arthur Conan Doyle had Sherlock Holmes remark that "it is a capital mistake to theorize before one has data." These words remain true today.
In this talk, Peter gives insight into what large amounts of data can do for problems in language understanding, translation and information extraction. The talk is accompanied by a bunch of examples from various Google services.
Moments from the lecture:
- [00:35] Peter Norvig came to Google from NASA in 2001 because that's where the data was.
- [01:30] Peter says that the way to make progress in AI (Artificial Intelligence) is to have more data. If you don't have data, you won't make progress with fancy algorithms alone.
- [04:40] In 2001 a study of several different algorithms for disambiguating words in sentences showed that the worst of the algorithms, trained on a larger corpus of words, outperformed the best of them trained on less data. Link to the original paper: Scaling to Very Very Large Corpora for Natural Language Disambiguation
- [06:30] It took at least 30 years to go from a linguistic text collection of 1 million words (10^6 words, the Brown Corpus) to what we now have on the Internet (around 100 trillion (10^14) words).
- [06:55] Google harvested one trillion words (10^12) from the net, counted them up and published the counts through the Linguistic Data Consortium. Announcement here, you can buy 6 DVDs of the words here (the price is $150).
- [10:00] Example: Google Sets was the first experiment done with large amounts of data. It's a clustering algorithm that returns a group of similar words. Try "dog and cat" and then "more and cat" :)
- [11:55] Example: Google Trends shows the popularity of search terms over time, based on data collected from searches performed by users.
- [13:15] Example: Query refinement suggestions.
- [13:40] Example: Question answering.
- [15:30] Principles of machine reading - concepts, relational templates, patterns.
- [16:32] Example of learning relations and patterns with machine reading.
- [18:40] Learning classes and attributes (for example, computer games and their manufacturers).
- [21:18] Statistical Machine Translation (See Google Language Tools).
- [24:25] Example of Chinese to English machine translation.
- [26:27] The main components of machine translation are the Translation Model, the Language Model and the Decoding Algorithm (a small sketch of this setup follows the list).
- [29:35] More data helps!
- [29:45] Problem: How many bits to use to store probabilities?
- [31:10] Problem: How to reduce the space used to store words from the training data during the translation process?
- [35:25] Three turning points in the history of the development of information.
- [37:00] Q and A!
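The noisy-channel setup from the machine translation part of the talk is simple enough to sketch in code. Here is a minimal toy illustration, not Google's actual system: the phrase table, the bigram language model and the brute-force decoder below are invented stand-ins for the real Translation Model, Language Model and Decoding Algorithm.

```python
import itertools
import math

# Toy translation model: P(english phrase | foreign phrase).
# All numbers here are invented for illustration.
translation_model = {
    "la casa": {"the house": 0.7, "the home": 0.3},
    "azul":    {"blue": 0.8, "sad": 0.2},
}

# Toy bigram language model: P(word | previous word).
language_model = {
    ("<s>", "the"): 0.5, ("the", "house"): 0.4, ("the", "home"): 0.1,
    ("house", "blue"): 0.05, ("home", "blue"): 0.02,
    ("house", "sad"): 0.01, ("home", "sad"): 0.01,
}

def lm_logprob(words):
    """Sum of log P(w_i | w_{i-1}), with a small floor for unseen bigrams."""
    total = 0.0
    for prev, word in zip(["<s>"] + words, words):
        total += math.log(language_model.get((prev, word), 1e-6))
    return total

def decode(foreign_phrases):
    """Brute-force decoder: score every combination of phrase translations
    and keep the one that maximizes log P(e) + log P(f | e)."""
    options = [translation_model[f].items() for f in foreign_phrases]
    best, best_score = None, float("-inf")
    for combo in itertools.product(*options):
        english = " ".join(e for e, _ in combo).split()
        tm_score = sum(math.log(p) for _, p in combo)
        score = lm_logprob(english) + tm_score
        if score > best_score:
            best, best_score = " ".join(english), score
    return best, best_score

print(decode(["la casa", "azul"]))   # -> ('the house blue', -5.18...)
```

A real decoder also has to handle word reordering and search the space with a beam rather than brute force, which is where the storage and efficiency problems discussed around 29:45 and 31:10 come in.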
There were some interesting questions in the Q and A session:
- [37:15] Have you applied any of the theories used in stock markets to language processing?
- [38:08] Are you working on any tools to assist writers?
- [39:50] How far off are you from automated translation without disfluencies?
- [41:58] 1) Is the GOOG-411 service actually used to gather a huge corpus of spoken data? 2) Are there any advances with data other than text?
- [43:50] Would the techniques you described in your talk work in speech-to-text processing?
- [44:50] Will there be any services for fighting comment and form spam?
- [46:00] Do you also take into account information such as which links users click when displaying search results?
- [47:22] How do you measure the difference between someone finding something and someone being satisfied with what they found?
- [49:23] When doing machine translation, how can you tell that you're not learning from a website which was already translated with another machine translation service?
- [50:49] How do you take into account that one person uses slang and another does not, and does it affect your translation tools?
- [51:40] Can you speak a little about methods in OCR (Optical Character Recognition)?
The question at 44:50 got me very interested. The person asked if Google was going to offer any services for fighting spam. Peter said that it was an interesting idea, but that it would be better to ask Matt Cutts.
Having a hacker's mindset, I started thinking: what if someone emailed their comments through Gmail? If a comment was spam, Gmail's spam system would detect it and label the message as spam. Otherwise the message would end up in the Inbox. All the messages in the Inbox could then be posted back to the website as good comments. If there were false positives, you could go through the Spam folder and move the non-spam messages back to the Inbox. What do you think?
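Here is a rough sketch of how that experiment could look in Python. The Gmail address and password are placeholders, the whole thing assumes a dedicated account, and I haven't checked whether Gmail's terms of service would allow this kind of use.

```python
import email
import imaplib
import smtplib
from email.mime.text import MIMEText

# Placeholder credentials for a dedicated comment-filtering account.
GMAIL_USER = "comment.filter.example@gmail.com"
GMAIL_PASS = "secret"

def submit_comment(author, text):
    """Mail a blog comment to the Gmail account so its spam filter can judge it."""
    msg = MIMEText(text)
    msg["Subject"] = "Blog comment from " + author
    msg["From"] = GMAIL_USER
    msg["To"] = GMAIL_USER
    server = smtplib.SMTP_SSL("smtp.gmail.com", 465)
    server.login(GMAIL_USER, GMAIL_PASS)
    server.sendmail(GMAIL_USER, [GMAIL_USER], msg.as_string())
    server.quit()

def fetch_good_comments():
    """Whatever Gmail left in the Inbox is treated as a legitimate comment;
    anything it routed to the Spam folder never gets posted."""
    imap = imaplib.IMAP4_SSL("imap.gmail.com")
    imap.login(GMAIL_USER, GMAIL_PASS)
    imap.select("INBOX")
    _, data = imap.search(None, "ALL")
    comments = []
    for num in data[0].split():
        _, msg_data = imap.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        comments.append((msg["Subject"], msg.get_payload()))
    imap.logout()
    return comments   # post these back to the website as good comments
```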
Have fun!