Friday, July 18, 2008

What am I searching for?

Yesterday I went to a talk at MSR in Redmond. A guy from Germany was talking about "suggested search". His idea is pretty cool, but the implementation is sort of "lousy". As I remember from my own experience with the Google Toolbar in Firefox, Google has done something similar. It is a search tool that, as you type your query, automatically suggests the closest possible queries based on the keywords you have typed so far, which can "guide" you during your search process.

I like this guided search a lot, because most of the time I find that I do not know exactly how best to describe what I want to search for. For example, if I want to find a cheap air ticket from New York to Shanghai, should I search for "cheap New York airtickets", or "cheap Shanghai airtickets", or more precisely, "cheap airticket from New York to Shanghai"? Unfortunately, in my experience the last query will most likely FAIL, since it seems to contain too many keywords. See, more is not always better :-(

When you search for something, the search engine offers you candidate pre-defined queries to help you better define your own and find what you need. This looks like a trend for the future. There should be a pre-processing step that clusters past queries and then classifies incoming ones into one of the resulting categories. These new queries can in turn help improve the clustering results before the next round. Mmm... sounds like "active learning". This work is quite challenging, since it needs semantic-level natural language analysis to interpret what the words mean, instead of just doing simple string matching (it seemed to me that the German guy did only string matching using some distance computation).
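
To make the "string matching with some distance computation" idea concrete, here is a minimal sketch in Python. It is only my guess at the general approach, not the speaker's actual implementation: the function names `edit_distance` and `suggest` and the tiny `QUERY_LOG` are made up for illustration, and I assume plain Levenshtein distance against a list of previously seen queries.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def suggest(typed: str, known_queries: list, k: int = 3) -> list:
    """Return the k known queries closest to what the user typed."""
    typed = typed.lower().strip()
    ranked = sorted(known_queries,
                    key=lambda q: (edit_distance(typed, q.lower()), q))
    return ranked[:k]


# Hypothetical query log, invented for this example.
QUERY_LOG = [
    "cheap flights new york",
    "cheap flights shanghai",
    "cheap hotels new york",
    "new york to shanghai flights",
]

if __name__ == "__main__":
    print(suggest("cheap flihgts shanghai", QUERY_LOG, k=2))
```

Note that this only catches typos and near-duplicates. It has no idea that "airticket" and "flights" mean the same thing, which is exactly why I think the semantic level is the hard part.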

"Search, search, search~~" We are doing keyword search every day.
However, before we rush to the search bar, should we think twice about what exactly we are searching for? Or should we not?

Perhaps one day, others will know it better than ourselves.


(PS: Something from the Official Google Blog.)

Technologies behind Google ranking
7/16/2008 10:53:00 AM

In my previous post, I introduced the philosophies behind Google ranking. As part of our effort to discuss search quality, I want to tell you more about the technologies behind our ranking. The core technology in our ranking system comes from the academic field of Information Retrieval (IR). The IR community has studied search for almost 50 years. It uses statistical signals of word salience, like word frequency, to rank pages. (See "Modern Information Retrieval: A Brief Overview" for a quick overview of IR technology.) IR gave us a solid foundation, and we have built a tremendous system on top using links, page structure, and many other such innovations.
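
(A toy sketch of the word-frequency signal mentioned above, in Python. This is only my illustration of classic TF-IDF scoring from the IR literature, not Google's ranking: the function name `tf_idf_rank`, the smoothing in the IDF term, the whitespace tokenization, and the sample documents are all my own assumptions.)

```python
import math
from collections import Counter


def tf_idf_rank(query, docs):
    """Rank documents for a query by a simple TF-IDF score.

    Returns document indices, best match first.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                # Smoothed inverse document frequency: rare words count more.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                # Term frequency, normalized by document length.
                score += (tf[term] / len(tokens)) * idf
        scores.append((score, idx))

    # Highest score first; ties broken by original document order.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores]


if __name__ == "__main__":
    pages = [
        "cheap flights from new york to shanghai",
        "new york pizza guide",
        "shanghai travel tips and cheap hotels",
    ]
    print(tf_idf_rank("cheap shanghai", pages))
```

A real system would of course add the links, page structure, and other signals the post goes on to describe; this covers only the statistical word-salience part.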

Search in the last decade has moved from "give me what I said" to "give me what I want". User expectations from search have rightly increased. We work hard to fulfill the expectations of each and every user, and to do that we need to better understand the pages, the queries, and our users. Over the last decade we have pushed the technologies for understanding these three components (of the search process) to completely new dimensions.

When we talk about queries at Google, we use square brackets [ ] to mark the beginning and end of queries (see "How to write queries" by Matt Cutts), a notation I will use throughout this post. (Pages and search results change frequently, so in time, some examples used here may not behave as explained.)
  • Understanding pages: Over the years we have invested heavily in our crawl and indexing system. As a result we have a very large and very fresh index. In addition to size and freshness, we have improved our index in other ways. One of the key technologies we have developed to understand pages is associating important concepts to a page even when they are not obvious on the page. We find the official homepage for Sprovieri Gallery in London for the Italian query [galleria sprovieri londra], even though the official page does not have either London or Londra on it. In the U.S., a user searching for [cool tech pc vancouver, wa] finds the homepage www.cooltechpc.com even though the page does not mention anywhere that they are in Vancouver, WA. Other technologies we have developed include distinctions between important and less important words in the page and the freshness of the information on the page.
  • Understanding queries: It is critical that we understand what our users are looking for (beyond just the few words in their query). We have made several notable advances in this area including a best-in-class spelling suggestion system, an advanced synonyms system, and a very strong concept analysis system.
Most users have used our spelling suggestion system at one time or another. It knows that someone searching for [kofee annan] is really searching for Mr. Kofi Annan, and is prompted: Did you mean: kofi annan; whereas someone searching for [kofee beans] is actually looking for coffee beans. Doing this internationally with very high accuracy is hard, and we do it well.

Synonyms are the foundation of our query understanding work. This is one of the hardest problems we are solving at Google. Though sometimes obvious to humans, it is an unsolved problem in automatic language processing. As a user, I don't want to think too much about what words I should use in my queries. Often I don't even know what the right words are. This is where our synonyms system comes into action. Our synonyms system can do sophisticated query modifications, e.g., it knows that the word 'Dr' in the query [Dr Zhivago] stands for Doctor whereas in [Rodeo Dr] it means Drive. A user looking for [back bumper repair] gets results about rear bumper repair. For [Ramstein ab], we automatically look for Ramstein Air Base; for the query [b&b ab] we search for Bed and Breakfasts in Alberta, Canada. We have developed this level of query understanding for almost one hundred different languages, which is what I am truly proud of.
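
(To see why the 'Dr' example needs context, here is a deliberately naive Python sketch. It is my own toy, not how Google does it: the function name `expand_dr` and the position rule are assumptions, namely that a trailing 'Dr' after at least one other word means Drive and anything else means Doctor.)

```python
def expand_dr(query: str) -> str:
    """Expand the abbreviation 'Dr' using only its position in the query.

    Toy rule (an assumption, not a real synonym system):
    a trailing 'Dr' preceded by another word is a street suffix,
    any other 'Dr' is a title.
    """
    words = query.split()
    expanded = []
    for i, word in enumerate(words):
        if word.lower().rstrip(".") == "dr":
            is_street_suffix = i > 0 and i == len(words) - 1
            expanded.append("Drive" if is_street_suffix else "Doctor")
        else:
            expanded.append(word)
    return " ".join(expanded)


if __name__ == "__main__":
    for q in ["Dr Zhivago", "Rodeo Dr"]:
        print(q, "->", expand_dr(q))
```

A rule this simple breaks as soon as the query gets longer (think of a street name in the middle of an address), which is probably why the post calls synonyms one of the hardest problems.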

Another technology we use in our ranking system is concept identification. Identifying critical concepts in the query allows us to return much more relevant results. For example, our algorithms understand that in the query [new york times square church] the user is looking for the well-known church in Times Square and not for articles from the New York Times. We don't just stop at identifying concepts; we further enhance the query with the right concepts when, for instance, someone looking for [PC and its impact on people] is in fact looking for impact of computers on society, or someone who searches for [rainforest instructional activities for vocabulary] is really looking for rain forest lesson plans. Our query analysis algorithms have many such state-of-the-art techniques built into them, and once again, we do this internationally in almost every language we serve.
  • Understanding users: Our work on interpreting user intent is aimed at returning results people really want, not just what they said in their query. This work starts with a world class localization system, and adds to it our advanced personalization technology, and several other great strides we have made in interpreting user intent, e.g. Universal Search.
Our clear focus on "best locally relevant results served globally" is reflected in our work on localization. The same query typed in multiple countries may deserve completely different results. A user looking for [bank] in the US should get American banks, whereas a user in the UK is either looking for the Bank Fashion line or for British financial institutions. The results for this query should return local financial institutions in other English speaking countries like Australia, Canada, New Zealand, South Africa. The fun really starts when this query is typed in non-English-speaking countries like Egypt, Israel, Japan, Russia, Saudi Arabia, Switzerland. Likewise the query [football] refers to entirely different sports in Australia, the UK, and the US. These examples mostly show how we get the localized version of the same concept correctly (financial institution, sport, etc.). However, the same query can mean entirely different things in different countries. For example, [Côte d'Or] is a geographic region in France - but it is a large chocolate manufacturer in neighboring French-speaking Belgium; and yes, we get that right too :-).

Personalization is another strong feature in our search system which tailors search results to individual users. Users who are logged-in while searching and have signed up for Web History get results that are more relevant for them than the general Google results. For example, someone who does a lot of football-related searches might get more football related results for [giants], while other users might get results related to the baseball team. Similarly, if you tend to prefer results from a particular shopping site, you will be more likely to get results from that site when you search for products. Our evaluation shows that users who get personalized results find them to be more relevant than non-personalized results.

Another case of user intent can be observed for the query [chevrolet magnum]. Magnum is actually made by Dodge and not Chevrolet. So we present the results for Dodge Magnum with the prompt See results for: dodge magnum in our result set.

Our work on Universal Search is another example of how we interpret user intent to give them what they (sometimes) really want. Someone searching for [bangalore] not only gets the important web pages, they also get a map, a video showing street life, traffic, etc. in Bangalore -- watching this video I almost feel I am there :-) -- and at the time of writing there is relevant news and relevant blogs about Bangalore.
Finally let me briefly mention the latest advance we have made in search: Cross Language Information Retrieval (CLIR). CLIR allows users to first discover information that is not in their language, and then using Google's translation technology, we make this information accessible. I call this advance: "give me what I want in any language". A user looking for Tony Blair's biography in Russia who types the query in Russian [Тони Блэр биография] is prompted at the bottom of our results to search the English web.
Similarly a user searching for Disney movie songs in Egypt with the query [أغاني أفلام ديزني] is prompted to search the English web. We are very excited about CLIR as it truly brings us closer to our mission to organize the world's information and make it universally accessible and useful.

I could go on and on showing examples of state-of-the-art technology that we have developed to make our ranking system as good as it is, but the fact is that search is nowhere close to being a solved problem. Many queries still don't get satisfactory results from Google, and each such query is an opportunity to improve our ranking system. I am confident that with numerous techniques under development in our group, we will make large improvements to our ranking algorithms in the near future.
I hope my two posts about Google ranking have made it clear that we live and breathe search, and we are more passionate than ever about it. Our fervor for serving all our users worldwide is unprecedented. We pride ourselves in running a very good ranking system, and are working incredibly hard every day to make it even better.

Sunday, June 8, 2008

The SIGMOD Jim Gray Doctoral Dissertation Award

I was reading my Google Reader RSS today and found this:

"SIGMOD has established the annual SIGMOD Jim Gray Doctoral Dissertation Award to recognize excellent research by doctoral candidates in the database field. Until 2008, this award was known as the "SIGMOD Doctoral Dissertation Award." In 2008, SIGMOD, with the unanimous approval of ACM Council, decided to rename the award to honor Dr. Jim Gray. SIGMOD Jim Gray Doctoral Dissertation Award winners and runners-up will be recognized at the SIGMOD conference, and their dissertations will be included in the SIGMOD DiSC and on the SIGMOD Online web site. The award winner will also receive a plaque and present his or her work together with the winners of the SIGMOD Innovations and Test of Time awards."

This reminded me of a respected person, Jim Gray, who was at Microsoft Research and has been missing at sea since early last year. People are still looking for him, but there is no good news yet. That is really sad. What made me sadder was that not until today did I realize that he also helped in the development of Virtual Earth, an advanced online geomapping service that helps us locate ourselves, and I am using it right now! Suddenly I feel that he was not a Turing Award winner far away in CA, but a person who was so close to me! I can't believe that something so great is part of my life, yet one of its inventors can no longer enjoy it with us.

I do not know if I have a chance to win this award, since it is not exactly my field. But I am very encouraged that it has been renamed in his honor. Because of his efforts, we will never get lost in the future.

Saturday, May 31, 2008

A taste of teaching

Last week we had our last class of the semester with Prof. Foster Provost. It is a PhD-level seminar discussing all kinds of topics related to data mining and machine learning. As the only three registered PhD students at Stern, Xiaohan, Mihaela and I were "pushed" to give (bi-)weekly presentations and lead the discussion of every paper on those topics.

Oh, god! That was hard! I couldn't understand it. When I sit in class and listen to the professors, they are all talking and smiling, making all kinds of jokes, writing gracefully and drawing nice pictures on the board. They teach as if it were something really, really, really easy. However, when I stood in front of the class, no matter how hard I had prepared, I felt nervous and awkward, and then suddenly forgot what I should say. My tone wavered and my voice froze. My confidence was quickly fading... In fact, I had been pretty confident in my presentation skills, because I already had some conference and workshop presentation experience. I always felt proud of staying cool in front of a group of people. But the truth was that it did not work here! Teaching a class is totally different from giving a short 20-minute talk! For this, I really admire Foster! He is such a sharp person and a great professor. He always notices the key point in our thoughts and helps us sort it out right away. Oftentimes, his questions are actually helpful and informative "hints", which inspire us to think about what we have neglected and then better organize our thoughts.

Prof. Anindya Ghose once told me that when you talk to people, you should try to make your point as clear as you can the first time. Do not wait for people to get confused and then ask you. I believe this is important, but it is not easy to achieve. Sometimes, when we explain something, we tend either to say too much, which creates redundancy, or to say too little, which leads to ambiguity. (It seems that the distribution of the intensity of our explanatory words is "bimodal", either too high or too low.) I like Prof. Panos Ipeirotis's teaching, because his way is highly logical. You feel like you are led into a room and then get to explore by yourself, with encouragement from time to time. He does not show the whole picture at once, but leaves it to us to find it out. That is the coolest part. You never know how big the picture is! Just like an adventure game!

I sometimes imagine myself in the future: can I do this well when I become a real professor? Will my students enjoy my teaching too? Yeah, I believe so! That is my goal, and I will just keep going :-)

Tuesday, May 20, 2008

Data Mining Blogs: The Big List (reposted from Sandro Saitta)

Sandro Saitta has put together a full list of data mining blogs. It is very nice, so I am sharing it here :)

  • Abbott Analytics: both industry and research oriented posts covering any topic related to data mining (Will Dwinnell and Dean Abbott)

  • Crime Analysis and Data Mining: everything is in the title (Shyam Varan Nath)

  • Data Miners Blog: data analysis and visualization from an industry point of view (Data Miners Team)

  • Data Mining, Analytics and Artificial Intelligence: this blog gives news about data mining and AI very frequently (Alberto Roldan)

  • Data Mining et al.: A new blog about data mining with details on particular applications in this field (Georg Russ)

  • Data Mining Lab: the blog of the data mining laboratory at Brigham Young University, mainly about social communities and meta-learning (Data Mining Lab)

  • Data Mining: Text Mining, Visualization and Social Media: a focus on data visualization and the blogosphere (Matthew Hurst)

  • Data Mining in MATLAB: posts related to the use and possibilities of Matlab for data mining related problems (Will Dwinnell)

  • DataSciences Analytics: discusses statistics and predictions among other interesting topics (John Aitchison)

  • Data Strategy: This new blog (started in June) discusses data strategy in general. Data acquisition, visualization and data mining are examples of topics (Chuck Lam)

  • Data Wrangling: comprehensive posts on technology and news related to data mining and machine learning. Also a lot of very useful resources (Pete Skomoroch)

  • Diamond Information and Analytics: analytics and its applications in marketing and operations (Amaresh Tripathy)

  • Foraging in the Data Forest: although not updated recently, this blog has interesting posts about data visualization and statistics (Donald Farmer)

  • Intelligent Machines: news related to data mining, machine learning and artificial intelligence (Damien François)

  • Jamie's Junk: a blog that focuses on data mining using Microsoft SQL Server (Jamie Mac)

  • Juice Analytics: data analytics with an emphasis on data visualization and corresponding tools (Juice Team)

  • Machine Learning, etc: Theory behind machine learning and news related to this field (Yaroslav Bulatov)

  • Machine Learning (Theory): a strong emphasis on theoretical aspects of machine learning (John Langford)

  • Machine Learning Thoughts: philosophical and theoretical discussions about machine learning in general (Olivier Bousquet)

  • Math Stats and Data Mining: data mining with a point of view from statistics (Rachel Graham)

  • MineThatData: data mining from the marketing point of view (Kevin Hillstrom)

  • Oracle Data Mining and Analytics: A blog focusing on the use of Oracle for data mining. It covers news, code and applications related to Oracle (Marcos M. Campos)

  • Shane's Blog: a personal view on data mining with posts on different applications and news (Shane Butler)

  • Smart (Enough) Systems: data mining and analytics (among others) for decision management (James Taylor)

  • Undirected Grad: a machine learning blog from a PhD student at Cambridge (Jurgen Van Gael)

  • Yet Another Machine Learning Blog: more machine learning oriented but contains a lot of useful information (Pierre Dangauthier)

Sunday, May 18, 2008