Thursday, December 20, 2007

Naukri : Job search 3.0

It has been a keenly awaited enhancement and, needless to say, we were all excited about it: the new algorithmic tweak in the job search engine. We have enhanced the algorithm to return more relevant results. A keyword "suggester" is now also available for jobseekers. It is truly a "suggester", as it has its own intelligence, and is therefore not a mere "autocomplete" feature. There is a well-thought-out algo behind this small yet powerful feature. We have taken the first step of a long but interesting journey. The new engine will empower us to experiment with new tweaks and enhancements. Watch out for this search engine. The work has just begun.
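Just to illustrate how a "suggester" differs from plain autocomplete, here is a toy sketch of a frequency-ranked prefix matcher (entirely hypothetical — this is not our actual algorithm, only the general idea of ranking suggestions by how often people search them):

```python
from collections import Counter

class Suggester:
    """Toy prefix suggester ranked by search frequency (hypothetical sketch)."""

    def __init__(self):
        self.freq = Counter()

    def record(self, keyword):
        # Learn from every search the users perform.
        self.freq[keyword.lower()] += 1

    def suggest(self, prefix, n=5):
        # Rank matching keywords by how often they were searched, so the
        # popular queries surface first -- not plain alphabetical autocomplete.
        prefix = prefix.lower()
        matches = [(kw, c) for kw, c in self.freq.items() if kw.startswith(prefix)]
        matches.sort(key=lambda kc: -kc[1])
        return [kw for kw, _ in matches[:n]]
```

A real suggester would of course add more signals (recency, spelling tolerance, the jobseeker's profile), which is where the "intelligence" comes in.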

Tuesday, October 09, 2007

Outsourcing/In-sourcing

I am not an avid reader, but when my friend Anshum praised and recommended a book to me, I thought I'd give it a try. I am still not finished with "The World Is Flat" by Thomas L. Friedman, but I could agree with and relate to many of the author's observations. An important one was about conceding part of your work to others. He talked about big business houses outsourcing and in-sourcing their work and benefiting from it. This is not limited to big business units; it is relevant everywhere. The old school of thought was to accumulate as much work as possible, sometimes more than you could handle. The focus was to stay in control and have more direct authority to drive things. There was a sense of insecurity. As a result of this approach, the focus spreads from the core activity across other activities, which may not be your strength and may be beyond your competency.
Today corporates are ready to part with some of their business activities and concentrate on their core activity. The idea is to let the best people do the work. The resistance to letting go is no longer there. The main requirement here, however, is trust. If you don't trust your vendor, or if the vendor is not able to live up to your trust, the decision may backfire. That is why a thorough analysis is required before such decisions are taken.
This strategy works well within a team too. I have seen some good leaders delegating work with ease and achieving their objectives efficiently. Some new leads, on the other hand, can be seen clinging to important tasks even when they don't have the time and resources for them. This is not good for anybody, or for the bigger objective.
The sooner we understand this, the better it is for us. I could correlate this with another book I was reading, "Now, Discover Your Strengths". This book emphasizes discovering your strengths and working on them to get the best results, rather than concentrating on your weaknesses. Though it was more about soft skills, I thought it was in sync with what Friedman discusses in his book.

Thursday, September 27, 2007

Sujith Nair - PMP

Hmm.. sounds good. Well, not a big deal for many, but it is a refreshing feeling for me to be a certified Project Management Professional. Frankly, I didn't study seriously in the beginning, but pushed myself towards the end. The main worry was that my 25K examination fee would go down the drain if I didn't do well :-) Got a lot of support from Shikha (my wife). I must admit that for a lazy guy like me, and coming from a not-so-big company like Naukri, it wasn't easy. Firstly, the industry I am in does not follow processes by the book. I don't have the kind of process exposure that a PM from a Wipro or a TCS has. However, my planning and execution are somewhat strong (this is what I feel), and that did help me relate to the questions asked in the exam. Something new has happened in my academic life after 7 yrs.. whoops.... (7 yrs makes me feel older). Well, it did help, and now I am back in "study" mode. Feels good to be a student again... at least I feel a little younger ;-)

Friday, September 21, 2007

Homepage - new look !

Have made some changes to my homepage. Go take a look .... A couple of changes I have made:
  • Changed the base colour to black. Well, black is my latest choice. It looks more cool and rocking than plain white.
  • Added the Flickr "badge" at the bottom of the page. Watch out for my latest pics.
I am planning to add more content and sections to my site. Have to get the space increased before that. I am planning to install WordPress and then take it from there. I don't want to waste time coding different sections which WordPress provides ready-made. I'd rather use that time to add content and put in some cool features.

Friday, August 10, 2007

Brijj.com - Professional Networking

Naukri launched brijj.com, a networking site for professionals. With this, the company has entered the big league of web 2.0 companies focusing on user contributions. The focus will be on providing a platform for passive users to reach out to other professionally linked people. I like the UI, as I find it refreshing. The next step, I guess, would be to increase stickiness on the site, thereby increasing the number of members. By concentrating on professional networking, Naukri will have an edge over its competitors, as it has carefully not drifted away from its domain. As for advertisers, they won't get a better place than Naukri and Brijj combined to reach out to the largest database of professionals in India. Way to go ....

Wednesday, July 25, 2007

Simplicity and usefulness is the essence of a good product

A few days back, a friend suggested an offline wiki for taking notes. I decided to check it out and found it really cool. The best thing about it is its simplicity; yet it is very useful. That, according to me, is the essence of any good product.
TiddlyWiki is a very powerful concept. You can work offline in an HTML file and save it. It uses HTML, CSS and JavaScript and runs in any browser. Some of the important features it provides are:
  • Formatting and indenting
  • Various interface options
  • Backup and autosave
  • Embedded images
  • RSS feeds
  • Keyboard shortcuts
  • Search facility
To use it, all you have to do is save the source of TiddlyWiki and start using it. When you save a note, it simply rewrites its own file. Your notes are saved under headings provided by you, and these can be browsed from the list on the right side of the page. The best part is that this small file lets you search through the content.
If you have a little HTML or JavaScript knowledge, you can play around with the code and make the most of this concept. I must admit ... some really good thinking here by Jeremy Ruston, who originally created this wiki.
The success of an idea, or for that matter, a product, depends on its usefulness and the ease/simplicity with which it can be put into shape. That's what matters.. other factors can follow.

Thursday, July 19, 2007

My homepage

Had put up an index page on my website a few days back. It is raw though... will work on it whenever I get time. Visit me @ sujithnair.com . Got to look for a decent hosting service with LAMP support.

Saturday, June 23, 2007

Web 2.0 in a restaurant ???

Ever wondered what would happen if some of the web 2.0 (or so-called) concepts were used in a restaurant or a dress store? Well, I was just thinking about it when I came across these funny, yet interesting, thoughts.
Let's take the very popular "wisdom of crowd" concept, Amazon's baby, as used by KFC. I order a small bucket of crispy chicken and they tell me that most of the customers who prefer it have it with salsa sauce. Well, I wouldn't mind spending an extra 7 bucks on that. Or better still, they offer me a discount on Original Recipe chicken, because the same group of people who prefer crispy chicken has a taste for the Original Recipe as well. Now what will KFC gain out of it? They will increase sales of other items and at the same time build an impressive list of alternative items. What will I gain? I'll shed my resistance to trying new items and satisfy my taste buds.
Scene 2. I walk into Levi's and try out some cargos. They offer me a discount on cool backpacks. Here, the probability of my buying this piece is higher, as they are promoting an item after knowing my preference (cargos). But why backpacks? Because most of the customers who bought cargos showed interest in backpacks as well. This has to be in their records.
Another popular baby of web 2.0 is "tags". It is the result of user classification. If dress stores defined sections based on user preferences rather than product types, chances are high that the customer would end up buying more different products. If a section has a collection of all sporty products, right from tees to shoes to sweat bands, a customer who prefers sports apparel may end up picking each one of these. She may lose interest if she has to go to a separate shoes or sports accessories section. The point is to arrange or classify sections based on user interests and not product types.
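The "customers who prefer X also have a taste for Y" idea in both scenes boils down to counting which items co-occur on the same bill. A toy sketch (the items and data are made up for illustration):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(bills):
    """Count how often each pair of items appears on the same bill."""
    pairs = defaultdict(int)
    for bill in bills:
        for a, b in combinations(sorted(set(bill)), 2):
            pairs[(a, b)] += 1
    return pairs

def also_liked(item, bills, n=3):
    """Items most often bought together with `item`, best first."""
    scores = defaultdict(int)
    for (a, b), count in cooccurrence(bills).items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

So if most bills with crispy chicken also carry salsa sauce, `also_liked("crispy chicken", bills)` puts salsa sauce on top, and that is exactly the item the counter staff should pitch.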
Pardon my understanding and interpretations (if wrong), as I am clueless about the functioning of these industries. Having said that, I sometimes feel it must be the other way round... that is, maybe the online industry is taking a cue from other industries. You would agree if you closely studied the marketing and product strategies of these industries. Come to think of it, there would be endless opportunities if experiments were done based on learnings from other industries. It may well be that these concepts were already prevalent in other industries but didn't have such flashy names or jargon.

Saturday, May 26, 2007

JobSearch 2.0 Beta

It is over a week since we launched the job search beta on our site. Some interesting feedback has come our way. The job search 2.0 beta was undertaken as a stealth project, and I am happy that most of the projects undertaken in stealth are able to see the light of day. With support from all departments, especially product, we are able to showcase some good technical stuff. Hope to keep it rolling. Some more interesting stuff is in the pipeline. Watch this space :-)
Vector Space Model

Saturdays are dull in office. Not interested in doing the regular work, I decided to do a little research. While searching for a specific algorithm, I stumbled upon the good old vector-based model for search. Thought of brushing up my knowledge on it and writing a comprehensible writeup.
In this model, the frequency of words is the key. In Salton's vector space model, both local and global information is considered.

Term Weight = Wi = tFi * log (D/dFi)

Here,
* tFi = term frequency (term counts) or number of times a term i occurs in a document. This is the local (within a document) information.
* dFi = document frequency or number of documents containing term i
* D = number of documents in a database.
* i = the term
dFi/D is the probability of finding term i in a document out of the D documents. log(D/dFi) is therefore called the "inverse document frequency" (IDF); it is high for rare terms and low for terms that appear in many documents.

Ok, now let's go through a simple example. I have adapted it from an example found in an online paper.

Let's say there are 3 documents (D=3) with the given data,

D1 = "the Lotus is in the pond"
D2 = "Garden has a pond"
D3 = "Lotus is a flower in the center"
The query Q = "Lotus Garden Flower"

To start, we must do the following,
  1. Take out all unique words from the three documents and sort them.
  2. Determine how many times each word appears in the respective documents. For example, here, "Garden" appears 1 time in D2, "Lotus" appears 1 time each in D1 and D3, and "the" appears 2 times in D1 and 1 time in D3.
    Therefore,
    The term frequency tFi for "Lotus" is 1 in D1, 0 in D2 and 1 in D3.
    The document frequency dFi for "Lotus" is 2 (found in D1 and D3).
    The ratio D/dFi for "Lotus" is 3/2 = 1.5
    The inverse document frequency (IDF = log(D/dFi)) for "Lotus" = log(1.5) = 0.1761
    IDF is a global value for each term.
  3. Now that we have the parameters for the formula, we can calculate the weight of term 't' in each document 'd'.
    Wt,d = tFi * IDF
    So (Wt,d) for "Lotus" in D1 = 1 * 0.1761 = 0.1761
    (Wt,d) for "Lotus" in D2 = 0 * 0.1761 = 0
    Similarly, (Wt,d) for all documents and the query is found.
    (Wt,d) for "Lotus" in query Q = tFi * IDF = 1 * 0.1761 = 0.1761
    Please note, if the term frequency (tFi) is 0, then the (Wt,d) is also 0.
    Thus we have weights (Wt,d) for Q, D1, D2 and D3.
  4. Now, we find the vector lengths. The length of a vector is the square root of the sum of the squares of its term weights:
    |D| = sqrt( sum over all terms t of (Wt,d)² )
  5. Now, we calculate all the dot products (skipping the all-zero products):
    Q·D = sum over all terms t of (Wt,Q * Wt,d)
  6. Now, we calculate the similarity values using the cosine measure:
    Sim(Q,D) = (Q·D) / (|Q| * |D|)
  7. In the end, we sort on the cosine values in descending order; the document at the top is ranked highest. In this case, the ranking will be

    Rank1 : D2 = 0.39
    Rank2 : D3 = 0.32
    Rank3 : D1 = 0.15
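The whole exercise can be scripted in a few lines of Python. This is my own sketch, not the source paper's code; the absolute cosine values depend on details like stop-word removal and the log base, so they need not match the figures above exactly, but with stop words removed the ranking comes out in the same order (D2, D3, D1):

```python
import math
from collections import Counter

STOP = {"a", "has", "in", "is", "the"}  # the common words from the example

def tokens(text):
    return [w for w in text.lower().split() if w not in STOP]

def tf_idf(counts, df, ndocs):
    # Wt,d = tFi * log(D / dFi) for every term in the document
    return {t: c * math.log10(ndocs / df[t]) for t, c in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * \
           math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "D1": "the Lotus is in the pond",
    "D2": "Garden has a pond",
    "D3": "Lotus is a flower in the center",
}
counts = {d: Counter(tokens(t)) for d, t in docs.items()}
df = Counter(t for c in counts.values() for t in c)   # document frequencies
vecs = {d: tf_idf(c, df, len(docs)) for d, c in counts.items()}
qvec = tf_idf(Counter(tokens("Lotus Garden Flower")), df, len(docs))

ranking = sorted(docs, key=lambda d: cosine(qvec, vecs[d]), reverse=True)
```

Note that the query is simply treated as one more (very short) document and weighted with the corpus IDF values.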
The above is a simple example just to demonstrate the vector space model. In a real search engine, there are other factors which need to be considered. The important ones are -
  • Stop words - While extracting all unique terms from the documents, very common words (stop words) are removed; here, for example - a, has, in, is, the. Special characters like ',' '.' '-' etc. can also be handled.
  • Handle different forms of a word. As a pure keyword search is done here, trimming each term down to its root word and then matching it to the query is a good idea. A good stemming algorithm will help, matching all words like develop, developer, developers etc.
  • It is also important to know where within a document a term is found.
The vector space concept is widely used in search engines. MySQL is one good example: its fulltext engine is inspired by this model.

In the MySQL implementation an application uses database tables consisting of N rows. Each row corresponds to a document, where

  1. Li, j = (log(dtf)+1)/sumdtf; i.e., local information based on logarithmic term counts
  2. Gi = log((N-nf)/nf); i.e., global information based on probabilistic IDF
  3. Nj = U/(1+0.0115*U); i.e., normalization based on pivoting

Here,

dtf = number of times the term appears in the document (row)
sumdtf = sum of (log(dtf)+1)'s for all terms in the same document (row)
U = number of unique terms in the document (row)
N = total number of document rows
nf = number of documents (rows) containing the term

The following equation is thus derived,
w = (log(dtf)+1)/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf)
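To get a feel for the numbers, the derived equation can be typed out directly. This is just a plain transcription of the formula above, not MySQL's actual source code:

```python
import math

def mysql_ft_weight(dtf, sumdtf, U, N, nf):
    """Term weight per the MyISAM full-text formula: w = L * Nj * G."""
    local = (math.log(dtf) + 1) / sumdtf   # L: logarithmic term count, normalized
    pivot = U / (1 + 0.0115 * U)           # Nj: pivoted unique-term normalization
    idf = math.log((N - nf) / nf)          # G: probabilistic inverse document frequency
    return local * pivot * idf
```

One consequence worth noticing: when nf exceeds half of N, the log((N-nf)/nf) factor turns negative, which is why MySQL's natural-language fulltext mode ignores words present in more than 50% of the rows.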

Well, now that we have an understanding of this model, we can always tweak it for our own use and derive a new algorithm. Gonna be fun :-)

Friday, March 23, 2007

Concept ... I'm working on.

"Every man belongs to some community! Similarly, every word should belong to one or more groups or categories."

"The more communities there are, the stronger the chance of a man fitting into one. The same is true for a word... in other words, the more the data, the better the chance of a word fitting into a pattern."

Thursday, March 01, 2007

Stealth Projects

It's a fresh day, there is a bunch of Great Ideas, your energy level is high, and it is time for some action... but before it even starts, your ideas are exposed to a number of double-barrel guns shooting continuously with reasons why we should not work on the "Great Ideas". Now what???
One option is to lie back and work only on pre-approved ideas, or "requirements" in other words. But hey, we are innovators and we believe in our ideas. So we decide to take the other path and start work on them. Well... in STEALTH mode though, for obvious reasons :-)
This is an approach I started around 4-5 months back, and it is yielding rich dividends. I started it in the NI team and slowly brought it to the search team as well. I can see my boss endorsing this approach, quite happy with the developments.
Though it has been a success so far, I am expecting a little more. I wish that one day we'll have more self-motivated engineers who will start their own research projects with a bunch of other developers and deliver stuff our competitors can't even think about. That day, there won't be anything like 'stealth'...
I am soon going to start a big new stealth project which will involve some great research work. Needless to say, I am quite excited about it. Will write about it once it is over. Well, that's why it is in stealth mode ;-)

Friday, February 09, 2007

Search by "Intent"... not "Content"

I had been busy with some developments in my search product, but nothing interesting enough to write about. My team has just started putting into shape one of our long-pending objectives. Hope to release it by the end of Feb.
Now, after seeing some action happening in this direction, I am off to visualize and analyze my next aim: "search by intent". I have always felt that internet search, which looks good today, is not even 10% of what it can be. There is a lot of intelligence that can be introduced. I am referring to behavioural intelligence here. Now what's that? Hmm.. it is my term ;-) Well, it is the intelligence derived from user behaviour. Patient analysis of your search dump will help you develop this intelligence.
Coming back to search... if we know the intention of the user, we can provide much better results. If we crack this problem and are able to convert CONTENT into INTENT, we will have solved half the problem. Many of the big search engines have started researching along this line.
Yahoo has already launched the beta version of its search engine "Yahoo! Mindset". It provides a cool slider to dynamically re-rank results, showing them based on their informational or commercial aspect.
I happened to stumble upon a blog which writes about Matt Cutts (Google) hinting that Google is doing scientific research in this field. I won't take it as a joke. We could soon see some intent-oriented search happening on Google.
A user's intention can be discovered and catered to by:
  • Search dump - A proper analysis of the search dump will throw light on the "intent" behind every search. This study has to be done w.r.t. the results the user clicked on from the search result set.
  • Personalization - Proper tracking of user activities can open a whole new world of knowledge for search engines. If I visit a search engine every week and click on a set of links, the search engine should damn well know what I am looking for every time. And if you have the user's profile with you, then what else can you ask for?
  • NLP (Natural Language Processing) - Natural language search is what every user is comfortable doing. The engine should be able to derive the intent of the user from the text provided in the search box.
  • Phrase searches - The emphasis should be more on phrase searches. The more the user writes in the search box, the clearer his/her intent. Of course, the search engine should be geared up to handle the shrinking of the result set on a long phrase search; here, the result-boosting algorithms come into play.
  • Wisdom of crowd - Now, this one is a much-talked-about concept out of the web 2.0 books. The entire trail of all user activities should be logged. We can know a completely new user's intent by comparing his/her search with searches made by other users and establishing a pattern. If most users search for "apache" and click on pages dedicated to the Apache web server, we can assume that the new user is also interested in the Apache web server and not in the Apache tribes. Here, the results pointing to the former can be ranked higher in the search. Of course, you still have to include all results.
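The "wisdom of crowd" point boils down to a simple tally over the click log. A toy sketch (the log format and categories here are made up for illustration):

```python
from collections import Counter, defaultdict

def dominant_intent(click_log):
    """Map each query to the result category its users click most often.

    click_log: iterable of (query, clicked_category) pairs.
    """
    by_query = defaultdict(Counter)
    for query, category in click_log:
        by_query[query][category] += 1
    # The most-clicked category is taken as the crowd's intent for that query.
    return {q: cats.most_common(1)[0][0] for q, cats in by_query.items()}
```

If 8 out of 10 people who searched "apache" clicked web-server pages, the table says the dominant intent for "apache" is the web server, and results in that category can be boosted for the next new user.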
I had read somewhere a quote from Peter Norvig, director of research at Google, talking about how Google returns results for a search query.
"We want to do a better job of understanding the user's intent and the content provider's intentions,"
and
"We mostly rely on matching keywords, but we'd like to get closer to matching the intent."
The same sentiments are echoed by Adam Sohn from Microsoft,
"If someone is searching for 'Jaguar', the smarts to distinguish between 'he's looking for a car' and 'a big cat in the jungle' - that's coming."
Here, the "intent" of the big players in the search business is clear.

A major chunk of the work in search engine development goes into large-scale data collection and analysis. Better log analyzers and information systems should be in place. Thorough research on this information set is carried out, and the outputs serve as inputs for the algorithms. Continuous analysis of user behaviour is important to monitor pattern deviations. Studies have revealed that phrase searches have become more common compared to searches done 6-8 years ago. No doubt, therefore, that n-grams are hot again.

I have got a lot to do now. I am yet to have a good information system, and there is a lot of tracking which still needs to be done. The more I think of it, the more I realize that the entire data management effort, right from tracking, is a different area. I, as a techie, won't be able to do much justice to it. I'll be more interested in the output of these studies, to convert them into complex algorithms. Hmm... till we have a separate team for these studies, tech has to manage it. Can't complain though... it is giving me much-needed exposure.