Technology Bits - Amandeep Singh






Technology Bits

Thursday, February 09, 2006

Google Suggest

What is that? Check it out here.

As you type into the search box, Google Suggest guesses what you're typing and offers suggestions in real time. This is similar to Google's "Did you mean?" feature that offers alternative spellings for your query after you search, except that it works in real time. For example, if you type "bass," Google Suggest might offer a list of refinements that include "bass fishing" or "bass guitar." Similarly, if you type in only part of a word, like "progr," Google Suggest might offer you refinements like "programming," "programming languages," "progesterone," or "progressive." You can choose one by scrolling up or down the list with the arrow keys or mouse.

This uses a wide range of information to predict the queries users are most likely to want to see. For example, Google Suggest uses data about the overall popularity of various searches to help rank the refinements it offers. An example of this type of popularity information can be found in the Google Zeitgeist.
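The ranking idea can be sketched in a few lines. This is only an illustration and not Google's actual algorithm: the `suggest` function and the popularity counts below are made up for the example.

```javascript
// Hypothetical popularity table: query -> number of times it was searched.
// The counts are invented for illustration only.
const popularity = {
  "programming": 900,
  "programming languages": 400,
  "progressive": 650,
  "progesterone": 120,
  "bass fishing": 300,
  "bass guitar": 500
};

// Return up to `limit` known queries that start with `prefix`,
// most popular first.
function suggest(prefix, table, limit) {
  const needle = prefix.toLowerCase();
  return Object.keys(table)
    .filter(function (query) { return query.indexOf(needle) === 0; })
    .sort(function (a, b) { return table[b] - table[a]; })
    .slice(0, limit);
}

console.log(suggest("progr", popularity, 3));
// [ 'programming', 'progressive', 'programming languages' ]
console.log(suggest("bass", popularity, 3));
// [ 'bass guitar', 'bass fishing' ]
```

A real service would presumably keep the table on the server and use a faster lookup structure than a linear scan, but the filter-then-rank shape stays the same.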

For people who want to see how this works, check out the Google JavaScript code. At first glance this code looks cryptic, since the function and variable names are obfuscated. Some really good guys have decoded it into a presentable form; check this out. The code looks much simpler with proper comments in place.

The magic lies here..
function callGoogle(Rb){
  if(_xmlHttp&&_xmlHttp.readyState!=0){
    _xmlHttp.abort()
  }
  _xmlHttp=getXMLHTTP();
  if(_xmlHttp){
    // We end up calling:
    // /complete/search?hl=en&js=true&qu= ...
    _xmlHttp.open("GET",_completeSearchEnString+"&js=true&qu="+Rb,true);
    // Note that this function will ONLY be called when we get a complete
    // response back from google!!
    _xmlHttp.onreadystatechange=function() {
      if(_xmlHttp.readyState==4&&_xmlHttp.responseText) {
        var frameElement=B;
        if(_xmlHttp.responseText.charAt(0)=="<"){
          _timeoutAdjustment--
        }else{
          // The response text gets executed as javascript...
          eval(_xmlHttp.responseText)
        }
      }
    } ;
    // DON'T TRY TO TALK WHEN WE'RE LOCAL...
    // Comment out when running from a local file...
    _xmlHttp.send(null)
  }
}


On each key press, an XMLHTTP request carrying the partial query is sent to Google, and the results are returned to the page where the request originated.
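One detail worth noticing in callGoogle is that it aborts the previous request before sending a new one, so only the latest keystroke gets an answer. Here is a rough sketch of that pattern on its own. The `transport` argument is a hypothetical stand-in for the XMLHTTP object (so the example also runs outside a browser), and `createSuggestClient` and `createFakeTransport` are names I made up.

```javascript
// `transport(query, callback)` is a hypothetical stand-in for XMLHTTP:
// it starts a request and returns an object with an abort() method.
function createSuggestClient(transport, onResults) {
  let pending = null;
  return function onKeyPress(query) {
    if (pending) {
      pending.abort(); // same idea as _xmlHttp.abort() in callGoogle
    }
    pending = transport(query, function (results) {
      pending = null;
      onResults(query, results);
    });
  };
}

// Fake transport for the demo: it holds requests until flush() is called,
// so we control when each "response" arrives.
function createFakeTransport() {
  const inFlight = [];
  function transport(query, callback) {
    const request = {
      query: query,
      aborted: false,
      callback: callback,
      abort: function () { request.aborted = true; }
    };
    inFlight.push(request);
    return request;
  }
  // Deliver a response for every request that was not aborted.
  transport.flush = function () {
    inFlight.splice(0).forEach(function (request) {
      if (!request.aborted) {
        request.callback([request.query + " guitar", request.query + " fishing"]);
      }
    });
  };
  return transport;
}

const fake = createFakeTransport();
const answered = [];
const onKeyPress = createSuggestClient(fake, function (query, results) {
  answered.push(query);
});

onKeyPress("b");
onKeyPress("ba");
onKeyPress("bas");
fake.flush();
console.log(answered); // [ 'bas' ]
```

The first two requests are aborted before they complete, so only the results for "bas" ever reach the page.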


The general way to communicate with a server without reloading the page (as on Orkut's rating page, where you click a particular image and a request is posted to the server without affecting the original page) is to embed a hidden iframe element. When a query needs to be posted, the page in the iframe is loaded; it does the processing on the server and returns the results to the parent page (the one that includes the iframe) by calling a JavaScript function on the parent, which handles the response and reflects it on the parent page. Did you realise this is what is known as a remote procedure call? :D Without any add-ons you have devised a method for remote procedure calls. It looks good and simple. Maybe I'll put a working demo here in the future.
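Until there is a real demo, here is a rough sketch of the parent-page side of the hidden iframe trick. Everything in it is an assumption for illustration: the "/rate.php" endpoint, the parameter names and the `callServer` and `buildRpcUrl` functions are all hypothetical. Plain objects stand in for the iframe and the window so the sketch can run outside a browser.

```javascript
// Counter used to give every call its own callback name.
let nextCallbackId = 1;

// Build the URL the hidden iframe will load; keys and values are escaped.
function buildRpcUrl(endpoint, params, callbackName) {
  const pairs = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + "=" + encodeURIComponent(params[key]);
  });
  pairs.push("callback=" + encodeURIComponent(callbackName));
  return endpoint + "?" + pairs.join("&");
}

// Parent-page side of the call.
//   frame: the hidden iframe element (anything with a `src` property)
//   host:  the object the iframe page reaches as `parent` (the window)
function callServer(frame, host, endpoint, params, onResult) {
  const callbackName = "rpcCallback" + nextCallbackId++;
  host[callbackName] = function (result) {
    delete host[callbackName]; // one-shot: clean up after the reply
    onResult(result);
  };
  frame.src = buildRpcUrl(endpoint, params, callbackName);
  return callbackName;
}

// Simulation: plain objects stand in for the iframe and the window.
const fakeFrame = { src: "" };
const fakeWindow = {};
let reply = null;
const registeredName = callServer(
  fakeFrame, fakeWindow, "/rate.php",
  { photo: "42", rating: "cool" },
  function (result) { reply = result; }
);
console.log(fakeFrame.src);
// /rate.php?photo=42&rating=cool&callback=rpcCallback1

// The page loaded in the iframe would run: parent[callbackName](result)
fakeWindow[registeredName]({ saved: true });
console.log(reply); // { saved: true }
```

In a real page, `frame` would be an iframe hidden with CSS and `host` would be `window`; the server-generated page inside the iframe would contain a small script that calls the named function on `parent`.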

Thursday, February 02, 2006

The Google Thing...

Google, being one of the most popular Internet search engines, requires large computational resources in order to provide its service. This article describes Google's technological infrastructure, as presented in the company's public announcements.

Network topology
Google has several clusters in various locations across the world. When an attempt to connect to Google is made, Google's DNS servers perform load balancing to allow the user to access Google's content most rapidly. This is done by sending the user the IP address of a cluster that is not under heavy load, and is geographically proximate to them. Each cluster has a few thousand servers, and upon connection to a cluster further load balancing is performed by hardware in the cluster, in order to send the queries to the least loaded Google Web Server.
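The two-level selection described here can be sketched as follows. This is not how Google's DNS servers or load-balancing hardware are actually implemented; the cluster names, distances, load figures and the `pickCluster` and `pickServer` functions are invented for illustration.

```javascript
// Hypothetical clusters as seen from one user; all numbers are invented.
const clusters = [
  { name: "us-west", distanceKm: 50, load: 0.95 },
  { name: "us-east", distanceKm: 4000, load: 0.40 },
  { name: "europe", distanceKm: 9000, load: 0.20 }
];

// Step 1 (DNS level): choose the nearest cluster whose load is below
// `maxLoad`; if every cluster is busy, fall back to the least loaded one.
function pickCluster(list, maxLoad) {
  const available = list.filter(function (c) { return c.load < maxLoad; });
  if (available.length === 0) {
    return list.reduce(function (best, c) {
      return c.load < best.load ? c : best;
    });
  }
  return available.reduce(function (best, c) {
    return c.distanceKm < best.distanceKm ? c : best;
  });
}

// Step 2 (inside the cluster): send the query to the web server
// that is currently handling the fewest queries.
function pickServer(servers) {
  return servers.reduce(function (best, s) {
    return s.activeQueries < best.activeQueries ? s : best;
  });
}

console.log(pickCluster(clusters, 0.9).name); // us-east
console.log(pickServer([
  { name: "gws-1", activeQueries: 12 },
  { name: "gws-2", activeQueries: 3 },
  { name: "gws-3", activeQueries: 7 }
]).name); // gws-2
```

The nearest cluster, "us-west", is skipped because its load is above the 0.9 threshold, so the user is sent to the next nearest one.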

Racks are custom-made and contain 40 to 80 servers (20 to 40 1U servers on either side); new servers are 2U rackmount systems. Each rack has a hub. Servers are connected via a 100 Mbit/s Ethernet link to the local hub. Hubs are connected to a core gigabit hub using one or two gigabit uplinks.

Main Index
Since queries are composed of words, an inverted index of documents is required. Such an index allows obtaining a list of documents by a query word. The index itself is quite large due to the number of documents stored in the servers, therefore it needs to be split up into "index shards". Each shard is hosted by a set of index servers. The load balancer decides which index server to query based on the availability of each server.
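A toy version of a sharded inverted index looks like this. It is a minimal sketch under my own assumptions: documents are assigned to shards by docid modulo the shard count (the text does not say how Google splits its index), and `buildIndex`, `buildShards` and `lookup` are hypothetical names.

```javascript
// Build an inverted index: word -> list of IDs of documents containing it.
function buildIndex(docs) {
  const index = new Map();
  docs.forEach(function (doc) {
    // A Set makes sure each word counts once per document.
    const words = new Set(doc.text.toLowerCase().split(/\W+/).filter(Boolean));
    words.forEach(function (word) {
      if (!index.has(word)) {
        index.set(word, []);
      }
      index.get(word).push(doc.id);
    });
  });
  return index;
}

// Split the documents across `shardCount` shards (assumption: by docid
// modulo shardCount) and build one small index per shard.
function buildShards(docs, shardCount) {
  const groups = [];
  for (let i = 0; i < shardCount; i++) {
    groups.push([]);
  }
  docs.forEach(function (doc) {
    groups[doc.id % shardCount].push(doc);
  });
  return groups.map(buildIndex);
}

// Ask every shard for the word and merge the docid lists.
function lookup(shards, word) {
  const hits = [];
  shards.forEach(function (shard) {
    (shard.get(word.toLowerCase()) || []).forEach(function (id) {
      hits.push(id);
    });
  });
  return hits.sort(function (a, b) { return a - b; });
}

const docs = [
  { id: 1, text: "Bass guitar lessons" },
  { id: 2, text: "Bass fishing tips" },
  { id: 3, text: "Guitar tuning guide" },
  { id: 4, text: "Fishing with bass lures" }
];
const shards = buildShards(docs, 2);
console.log(lookup(shards, "bass"));   // [ 1, 2, 4 ]
console.log(lookup(shards, "guitar")); // [ 1, 3 ]
```

Each shard only knows about its own documents, which is why a lookup has to visit all of them and merge the answers.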

Server types
Google's server infrastructure is divided into several types, each assigned to a different purpose:

* Google Web Servers coordinate the execution of queries sent by users, then format the result into an HTML page. The execution consists of sending queries to index servers, merging the results, computing their rank, retrieving a summary for each hit (using the document server), asking for suggestions from the spelling servers, and finally getting a list of advertisements from the ad server.

* Data-gathering servers are permanently dedicated to spidering the Web. They update the index and document databases and apply Google's algorithms to assign ranks to pages.

* Index servers each contain a set of index shards. They return a list of document IDs ("docid"), such that documents corresponding to a certain docid contain the query word. These servers need less disk space, but suffer the greatest CPU workload.

* Document servers store documents. Each document is stored on dozens of document servers. When performing a search, a document server returns a summary for the document based on query words. They can also fetch the complete document when asked. These servers need more disk space.

* Ad servers manage advertisements offered by services like AdWords and AdSense.

* Spelling servers make suggestions about the spelling of queries.
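To see how these server types fit together, here is a rough sketch of the coordination work a Google Web Server does for one query. The `handleQuery` function and all the back ends are hypothetical stubs passed in by the caller; real servers would be reached over the network and the result would then be formatted as HTML.

```javascript
// Coordinate one query. Every entry in `backends` is a hypothetical stub.
function handleQuery(query, backends) {
  // 1. Ask every index server for matching docids and merge the lists.
  const docids = [];
  backends.indexServers.forEach(function (indexServer) {
    indexServer(query).forEach(function (id) {
      if (docids.indexOf(id) === -1) {
        docids.push(id);
      }
    });
  });

  // 2. Rank the merged hits, highest score first.
  docids.sort(function (a, b) {
    return backends.rank(b) - backends.rank(a);
  });

  // 3. Fetch a summary for each hit from the document server.
  const hits = docids.map(function (id) {
    return { docid: id, summary: backends.documentServer(id, query) };
  });

  // 4. Ask for a spelling suggestion, and 5. for advertisements.
  return {
    hits: hits,
    suggestion: backends.spellingServer(query),
    ads: backends.adServer(query)
  };
}

// Stub back ends with invented data.
const scores = { 1: 0.9, 2: 0.5, 4: 0.7 };
const backends = {
  indexServers: [
    function (q) { return q === "bass" ? [2, 4] : []; },
    function (q) { return q === "bass" ? [1] : []; }
  ],
  rank: function (id) { return scores[id]; },
  documentServer: function (id, q) { return "doc " + id + " mentions " + q; },
  spellingServer: function (q) { return null; },
  adServer: function (q) { return ["ad for " + q]; }
};

const page = handleQuery("bass", backends);
console.log(page.hits.map(function (hit) { return hit.docid; })); // [ 1, 4, 2 ]
```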


Server hardware and software

Original Hardware
The original hardware used by Google included:

* Sun Ultra II with dual 200MHz processors, and 256MB of RAM. This was the main machine for the original Backrub system.
* 2 x 300 MHz dual Pentium II servers donated by Intel; they included 512MB of RAM and 9 x 9GB hard drives between the two. It was on these that the main search ran.
* F50 IBM RS6000 donated by IBM, included 4 processors, 512MB of memory and 8 x 9GB hard drives.
* Two additional boxes included 3 x 9GB hard drives and 6 x 4GB hard drives respectively (the original storage for Backrub). These were attached to the Sun Ultra II.
* IBM disk expansion box with another 8 x 9GB hard drives donated by IBM.
* Homemade disk box which contained 10 x 9GB SCSI hard drives

Current Hardware

Servers are commodity-class x86 PCs running customized versions of Linux. Indeed, the goal is to purchase CPU generations that offer the best performance per unit of power, not absolute performance. The biggest cost that Google faces is power consumption given the huge amount of computing power required. For this reason, the Pentium II has been the most favoured processor, but this could change in the future as processor manufacturers are increasingly limited by the power output of their devices.

Published specifications:

* 100,000 servers ranging from 533 MHz Intel Celeron to dual 1.4 GHz Intel Pentium III (as of 2005)
* One or more 80GB hard disk per server. (2003)
* 2–4 GiB memory per machine (2004)

The exact size and whereabouts of the data centers Google uses are unknown, and official figures remain intentionally vague. According to John Hennessy and David Patterson's Computer Architecture: A Quantitative Approach, Google's server farm computer cluster in the year 2000 consisted of approximately 6000 processors, 12000 common IDE disks (2 per machine, and one processor per machine), at four sites: two in Silicon Valley, California and two in Virginia. Each site had an OC 48 (2488 Mbit/s) internet connection and an OC 12 (622 Mbit/s) connection to other Google sites. The connections are eventually routed down to 4 x 1 Gbit/s lines connecting up to 64 racks, each rack holding 80 machines and two ethernet switches. Google has almost certainly dramatically changed and enlarged their network architecture since then.

Based on the Google IPO S-1 form released in April 2004, Tristan Louis estimated the current server farm to contain something like the following:

* 719 racks
* 63,272 machines
* 126,544 CPUs
* 253 THz of processing power
* 126,544 GB (approx. 123.58 TB) of RAM
* 5,062 TB (approx. 4.77 PB) of hard drive space

According to this estimate, the Google server farm constitutes one of the most powerful supercomputers in the world. At 126–316 teraflops, it can perform at over one third the speed of the Blue Gene supercomputer, which is (as of 2005) the top entry in the TOP500 list of most powerful unclassified computing machines available to humanity.
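Most of the totals in Tristan Louis's list can be reproduced with simple arithmetic. The per-machine figures below are my assumptions, chosen because they are consistent with the listed totals; they are not stated in the text above.

```javascript
// Assumed inputs (not from the list itself, but consistent with its totals).
const racks = 719;
const machinesPerRack = 88;
const cpusPerMachine = 2;
const ghzPerCpu = 2;
const ramGbPerMachine = 2;
const diskGbPerMachine = 80;

const machines = racks * machinesPerRack;            // 63272
const cpus = machines * cpusPerMachine;              // 126544
const totalThz = (cpus * ghzPerCpu) / 1000;          // about 253 THz
const ramGb = machines * ramGbPerMachine;            // 126544 GB
const ramTb = ramGb / 1024;                          // about 123.58 TB
const diskTb = (machines * diskGbPerMachine) / 1000; // about 5062 TB

console.log(machines, cpus);
console.log(Math.round(totalThz), ramTb.toFixed(2), Math.round(diskTb));
```

Note that the RAM figure converts GB to TB by dividing by 1024, while the disk total of 5,062 TB only comes out when dividing by 1000.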

Server operation
Most operations are read-only. When an update is required, queries are redirected to other servers, so as to simplify consistency issues. Queries are divided into sub-queries, which may be sent to different servers in parallel, thus reducing latency.
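Why parallel sub-queries reduce latency is easy to see with a little arithmetic. The response times below are invented for the example.

```javascript
// Hypothetical response times, in milliseconds, of four index servers
// that each answer one sub-query.
const subQueryMs = [12, 30, 18, 25];

// One after another: the user waits for the sum of all response times.
const sequentialMs = subQueryMs.reduce(function (sum, ms) {
  return sum + ms;
}, 0);

// In parallel: the user only waits for the slowest server.
const parallelMs = Math.max.apply(null, subQueryMs);

console.log(sequentialMs); // 85
console.log(parallelMs);   // 30
```

The flip side is that the slowest server sets the pace, so one overloaded machine can delay the whole answer.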

In order to avoid the effects of unavoidable hardware failure, data stored in the servers may be mirrored using hardware RAID. Software is also designed to be fault tolerant. Thus when a system goes down, data is still available on other servers; keeping several copies also increases throughput.

Monday, January 23, 2006

Orkut, a social networking site named after creator and Google employee Orkut Buyukkokten, is "currently garnering the most hype among the Internet cognoscenti". Orkut combines many of the features of its competitors, encouraging users to create profiles about their interests, professional life and personal life. The site's design appears inspired by Buyukkokten's previous work on Club Nexus, a system developed at Stanford University in 2001. Created by Stanford students, Club Nexus was designed to assist students' communication requirements. Membership on Orkut is invitation-only; new members must be invited to join by an existing member. This exclusivity has caused Orkut to gain a certain social currency that comes with being a member of a private club. Some enterprising individuals have even gone as far as auctioning Orkut invitations on eBay.

Google claims that Buyukkokten developed Orkut during his "personal project time" while at work, with the help of "a few other engineers". All Google employees are encouraged to spend part of their time working on personal projects in order to boost creativity and innovation. This "official" story about Orkut being a side project is not believed by the authors of rafer.net, who point out that Orkut is the "most fully featured social network in existence," and "grew from almost zero page views to serving (probably) 3 [million] pages per day... that is a lot of work... it wasn't the work of one man anytime in the last several months."

This challenge to the pervasive myth of Orkut being a "pet project" leads into the possible business case for why Google has sponsored such a project. Revenues and profitability may not exist yet, but a possible business model exists in selling subscriptions, classifieds and targeted advertising. Social networks can be called the "next generation of online classifieds". Some believe Orkut is another asset in Google's business strategy for positioning itself as a market leader; others feel it is a way of creating a larger database of user behaviour for better data-mining capabilities.

Orkut's design model

While Orkut allows people to create profiles and links to friends, and participate in discussion communities, Orkut also has a 'feature' that has become very contentious among members: Orkut has a jail. Jail is ostensibly a 'time-out' area where someone who has been abusing the system is put for a limited period of time. While in jail, the 'offender' may not post or send messages on Orkut, but can view and read content. It essentially restricts users to viewing the site in a read-only mode. However, the rest of the parameters about jail are confusing, and are muddied by the fact that there is no official mention of jail on the Orkut help page. There is no warning when you are jailed: you log on and instead of your picture appearing on your profile page, a shadowy image of someone in prison appears. There are no phone calls, no lawyer, no judge, no jury... nothing. The closest action one can take to an appeal is to email "help@Orkut.com" and hope for the best. Usually most people are 'released' from jail within 24 hours, but this isn't always the case. Some report being in jail for minutes, some for days.

Over time, users have managed to piece together theories about what constitutes a "jailable" offence. Orkut does have a codified "law", in that site administrators maintain Terms of Service (TOS) and Community Standards documents, which outline certain legal issues regarding using the service, such as copyright, ownership of content, acceptable use, non-commercial use, and so on. The Community Standards document outlines many of these issues in plain language, and also acts as a 'bill of rights'-type document, with general standards for outlawing hate speech, harassment and discrimination. However, many instances of users being put in jail do not seem to come from contravening the TOS or Community Standards (the most common offence being the use of a pseudonym or an obviously fake image, leading some people to use "real" sounding fake names and photos).

Orkut also allows individuals to help "police" the site by having a "report as bogus" link on each person's profile page, where Orkut users can "tattle" on others. Clicking on "report as bogus" essentially places that person in jail until Orkut administration can review their profile. It also seems that the jailing procedure is at least partially automated to detect "robot"-like actions, such as joining too many communities at once, editing too many posts, or performing other actions in rapid succession.

Part of many people's problem with the "Jail" design is that there is almost no user feedback: contacting "help@Orkut.com" almost invariably results in either no response or a canned email reply. Seemingly in response to this and other requests for information, the site administrators created "OrkutGuy", a persona who could communicate with users in communities directly. OrkutGuy is personified as a police officer, which is interesting in light of Williams' warning to choose leadership metaphors carefully: Orkut apparently sees its leadership personification as a cop (and a white, male one at that).

The Orkut Loser Patrol

The second case is one that blurs lines of appropriate behaviour and good taste, and almost seems designed to challenge the notion of "community standards" in general. It involves a community called the "Orkut Loser Patrol" (OLP). At times Orkut can feel cliquish and immature as people try to collect as many friends and see how high they can get their "cool" and "sexy" rankings (an exercise in vanity that this author must also admit to indulging in). The OLP was a facetious group created to point out people who were "losers" on Orkut, while simultaneously parodying the kind of "high school popularity contest" nature of the Orkut experience. The very existence of the OLP was contentious: some thought the very idea cruel, vulgar and against the spirit of Orkut as a community; these people perhaps did not share the darker or more ironic sense of humour displayed by many of the members. After several weeks and a large number of users joining OLP (and during this time the original sarcastic nature of the overriding joke had been significantly diluted by the influx of new members, some of whom took the joke literally), the moderator played a cruel joke on the community members. He changed the community name to "Orkut Pedophile Society", along with switching the former community image of a deranged clown to an innocent-looking picture of a young boy in red sweater and the caption "let us touch your little bits just a little bit... LEGALIZE PEDOPHILIA, NOW". The sick genius of this move was that the hundreds of members of the OLP suddenly found themselves members of a group that no-one in their right mind would want to be associated with; this information was also available to anyone who would look at their profile and see what communities they belonged to. Not surprisingly, there was a massive outcry from members, many of whom quickly deleted the group from their communities list. 
Some thought the joke funny and absurd, while others found the whole situation utterly disgusting and beyond humour, saying pedophilia was a topic that could not be comical in any context. One result was that the moderator who had made the switch was most likely the subject of a barrage of "report as bogus" claims from angry users, and thus was banished to jail for almost three days. Since he was in jail, he could not edit the community name. The joke quickly became painful, and eventually "OrkutGuy" stepped in, changing the community name back to its original (although, inexplicably, misspelled) and explaining that the moderator powers to change community names and images were temporarily suspended. The result of this whole incident was a serious blow to the trust of many in Orkut. While the site was out to create a "trusted community", it was apparent that in some cases, some people could not be trusted to uphold the community standards and to take the whole experiment seriously. Others thought the incident indicated that the emperor had no clothes: while using a heavy-handed control system in one place, Orkut's design allowed for abuse on many other levels if one only used a little imagination. Eventually the moderator ability to modify community names and content was reinstated.