Let’s Google How Google Works!

Ruhaan Pratap Singh
Feb 21, 2022

Updated: Dec 28, 2022

Table of Contents

The Beginning
The ‘Smart’ Idea
String Search Engines
Few Initial Search Engines
Search Engine Categories
The Era of Google
Working of the Search Engine (Google)
Usability of Webpages
Context and Settings
Important Algorithm Updates by Google
Future of Search Engines
Conclusion

The Beginning

The idea of hypertext and a memory extension originates from an article by Vannevar Bush titled ‘As We May Think’, published in The Atlantic Monthly in July 1945. In it, Bush urged scientists to work together to build a body of knowledge for all mankind. He then proposed a virtually limitless, fast, reliable, extensible, associative memory storage and retrieval system, which he named the memex. All of the documents used in the memex would be microfilm copies, either acquired as such or, in the case of personal records, converted to microfilm by the machine itself. The memex would also use new retrieval techniques based on a new kind of associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select another immediately and automatically, creating personal "trails" through linked documents. Bush anticipated that these new procedures for storing and retrieving information would lead to entirely new forms of encyclopedia. In 1965 Bush took part in MIT's Project INTREX, which developed technology to mechanize the processing of information for library use. In his 1967 essay "Memex Revisited", he observed that the development of the digital computer, the transistor, video, and similar devices had made such mechanization more feasible, but that costs would delay its realization.

The ‘Smart’ Idea

Gerard Salton, who died on August 28, 1995, was the father of modern search technology. His teams at Harvard and Cornell developed the SMART information retrieval system. SMART, short for Salton's Magic Automatic Retriever of Text, included important concepts such as the vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values, and relevance feedback mechanisms. He also wrote a 56-page book called A Theory of Indexing, which explains many of his tests, upon which search is still largely based.
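To make the vector space model, TF and IDF a little more concrete, here is a minimal Python sketch. It is only an illustration, not the SMART system's code: the weighting formula (term frequency multiplied by the logarithm of inverse document frequency), the whitespace tokenizer, the function names and the three sample documents are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, weighting terms by TF * IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
docs = [
    "search engines index web pages",
    "web crawlers fetch web pages",
    "cats sleep all day",
]
vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # the two web-related documents overlap
print(cosine(vectors[0], vectors[2]))  # no shared terms, so similarity is zero
```

Documents become points in a space with one dimension per term, and similarity is the angle between them, which is the core of the vector space idea.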

String Search Engines

In 1987 an article was published describing the development of a character string search engine (SSE) for rapid text retrieval. The SSE used a novel string-search architecture that combines 512-stage finite-state automaton (FSA) logic with a content-addressable memory (CAM) to achieve approximate string comparison at 80 million strings per second. Each CAM cell consisted of four conventional static RAM cells and a read/write circuit. Concurrent comparison of 64 stored strings of variable length was achieved in 50 ns for an input text stream of 10 million characters per second, and matching still worked when single-character errors were present in the character codes. The chip also supported non-anchored string search and variable-length `don't care' (VLDC) string search.
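The notion of matching despite single-character errors can be illustrated in software. The sketch below is a plain sliding-window comparison in Python and makes no attempt to model the chip's FSA and CAM hardware; the function name and the `max_errors` parameter are assumptions made for this example.

```python
def approx_find(text, pattern, max_errors=1):
    """Return start positions where pattern matches text with at most
    max_errors substituted characters (non-anchored: any position may match)."""
    hits = []
    m = len(pattern)
    if m == 0:
        return hits
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        # Count positions where the window and the pattern disagree.
        errors = sum(1 for a, b in zip(window, pattern) if a != b)
        if errors <= max_errors:
            hits.append(start)
    return hits


# "cat" matches exactly at 4, and with one wrong character at "sat" and "mat".
print(approx_find("the cat sat on the mat", "cat"))  # [4, 8, 19]
```

The hardware described above compared many stored strings against the stream at once; this loop checks one pattern at a time and is meant only to show what "approximate" means here.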

Few Initial Search Engines

Archie - Archie, the first Internet search engine, was developed in 1990 by Alan Emtage, a student at McGill University in Montreal. It predates the web and searched the file listings of FTP archives. The author originally wanted to call the program "archives", but shortened the name to comply with the Unix convention of giving programs and files short, cryptic names such as grep, cat, and sed.
Veronica - Veronica was created by the University of Nevada System Computing Services group in 1993 as a search tool similar to Archie but for Gopher files. Not long after, another Gopher search service called Jughead appeared, probably for the sole purpose of rounding out the comic-strip triumvirate of Archie, Veronica and Jughead. Jughead is an acronym for Jonzy's Universal Gopher Hierarchy Excavation and Display, although, as with Veronica, it is probably safe to assume that the creator worked backwards from the name to the acronym.
The Lone Wanderer - Created by Matthew Gray in 1993, the World Wide Web Wanderer was the first robot on the web and was designed to track the web's growth. Initially, the Wanderer counted only web servers, but soon after its introduction it began to capture URLs as well. The database of captured URLs became Wandex, the first web database.
Excite - Excite, originally called Architext, was started by six Stanford undergraduates in February 1993. Their idea was to use statistical analysis of word relationships to provide more efficient searches through the large amount of information on the Internet. By mid-1993 their project was fully funded. Once funding was secured, they released a version of their search software for webmasters to use on their own sites. At the time the software was called Architext, but it is now known as Excite for Web Servers.
Yahoo! - In April 1994, David Filo and Jerry Yang, two Stanford University Ph.D. candidates, created some pages that became rather popular. They called the collection of pages Yahoo! Their official explanation for the name was that they considered themselves a pair of yahoos. As the number of links grew and their pages began to receive thousands of hits a day, the team devised ways to organise the data better. To aid data retrieval, Yahoo! (www.yahoo.com) became a searchable directory.
Lycos - Michael Mauldin created the Lycos search engine in July 1994, while on leave from Carnegie Mellon University.

Search Engine Categories

Search engines designed specifically for web pages, documents, and images were created to make it easier to search through a large, loosely organised mass of unstructured resources. They follow a multi-stage process: crawling the practically endless stock of pages and documents to extract the key terms from their contents, indexing those terms in a semi-structured form (such as a database), and finally resolving user queries to return mostly relevant results and links to the crawled documents or pages.
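The indexing and query-resolution stages can be sketched with an inverted index, a common structure for this job. The Python below is a toy under stated assumptions: the page URLs and texts are made up, tokenization is a simple split on whitespace, and a query only matches pages containing every query word.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs of pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical crawled pages.
pages = {
    "https://example.com/a": "how search engines work",
    "https://example.com/b": "search tips for students",
    "https://example.com/c": "cooking for students",
}
index = build_index(pages)
print(sorted(search(index, "search")))           # pages a and b
print(sorted(search(index, "search students")))  # page b only
```

Because the index is built once and queried many times, looking up a word is a dictionary access rather than a scan of every document.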

Crawl

In a purely textual search, the first step in classifying web pages is to find an 'index item' that relates specifically to the 'search term.' In the past, search engines began with a short list of URLs, the so-called seed list, fetched the content, and parsed the links on those pages for relevant information, which in turn provided new links. The process was highly cyclical and continued until the search engine had found enough pages to use. Today a continuous crawl is used instead of incidental discovery based on a seed list. The continuous crawl is an extension of the discovery method just described, except that the system never stops crawling, so it no longer depends on a fixed seed list.
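The seed-list procedure described above can be sketched as a breadth-first traversal. To keep the example self-contained and runnable, it crawls a small in-memory "web" rather than the real one; `TOY_WEB`, the `fetch_links` callback and the `max_pages` limit are assumptions for illustration.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first from the seed list, following newly found links."""
    frontier = deque(seed_urls)  # URLs waiting to be fetched
    seen = set(seed_urls)        # URLs already queued, to avoid loops
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link structure: each page maps to the pages it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
print(crawl(["a"], lambda url: TOY_WEB.get(url, [])))  # ['a', 'b', 'c', 'd']
```

A real crawler would replace the callback with an HTTP fetch and HTML link extraction, and a continuous crawl would keep revisiting pages instead of stopping once the frontier is empty.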

Link Map

The pages discovered by web crawls are often distributed and fed into another computer, which builds a map of the resources found. The resulting cluster resembles a graph in which the pages are small nodes connected by the links between them. This mass of data is kept in multiple data structures that give certain algorithms quick access to it. Those algorithms compute a popularity score for each web page based on how many links point to it, which is how a search for, say, resources on diagnosing psychosis can surface the most widely linked pages.
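As a rough illustration of such a link map, the Python sketch below stores the graph as two dictionaries and scores each page by its number of distinct inbound links. The function names and the sample links are hypothetical, and counting inbound links is the simplest possible popularity score, not what any particular engine uses.

```python
from collections import defaultdict


def build_link_map(links):
    """Build outgoing and incoming adjacency sets from (source, target) pairs."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for source, target in links:
        if source != target:  # ignore pages linking to themselves
            outgoing[source].add(target)
            incoming[target].add(source)
    return outgoing, incoming


def popularity(incoming):
    """Rank pages by number of distinct inbound links, most linked first."""
    scores = ((page, len(sources)) for page, sources in incoming.items())
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Hypothetical crawl output: duplicate links and self-links are tolerated.
links = [("a", "c"), ("b", "c"), ("c", "a"), ("a", "c"), ("d", "d")]
outgoing, incoming = build_link_map(links)
print(popularity(incoming))  # [('c', 2), ('a', 1)]
```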

Database Search Engines

Searching for text-based content in databases presents a few special challenges, which have led to a variety of specialised search engines. Databases can be slow when handling complex queries (with multiple logical or string-matching arguments). Databases allow pseudo-logical queries, which full-text searches do not use. A database does not need to be crawled because its data is already structured. However, it is often necessary to index the data in a more compact form to allow faster searching.

Mixed Search Engines

Sometimes the data being searched includes both database content and web pages or documents. Search engine technology has developed to address both sets of requirements. Most mixed search engines are large web search engines, such as Google. They search through both structured and unstructured data sources. Take the word 'ball,' for example. In its simplest form, it returns almost 40 different variations on Wikipedia alone. Did you mean a ball in the sense of a social gathering or a dance? A soccer ball? The ball of the foot? Pages and documents are crawled and indexed in a separate index. Databases from a variety of sources are indexed as well. Search results are then generated by querying these multiple indexes in parallel and combining the results according to "rules."

The Era of Google

While traditional search engines ranked results by counting how many times the search terms appeared on a page, Google took a different approach and analysed the relationships between websites. Larry Page explained his ideas to Scott Hassan, who began writing the code to implement them.

Before moving forward, here is a brief explanation of the PageRank algorithm. PageRank (PR) is a Google Search algorithm that ranks web pages in search engine results. It determines the relevance of a page from the number of pages that link back to it and the importance of those linking pages. PageRank is named after Larry Page, one of Google's founders. Although it is not the only algorithm Google uses to order search results, it is the company's first and best-known one. The PageRank algorithm outputs a probability distribution representing the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for a collection of documents of any size. Several research papers assume that the distribution is evenly divided among all documents in the collection at the start of the computation. The computation then requires several passes through the collection, called "iterations", to adjust the approximate PageRank values so that they more closely reflect the true theoretical value.
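The iterative computation can be sketched in a few lines of Python. This is a simplified textbook-style version, not Google's production code: the damping factor of 0.85, the fixed number of iterations, the handling of pages without outgoing links and the four-page sample graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}
    # Start from a uniform distribution over all pages.
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its rank equally among the pages it links to.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical link graph: "c" receives links from three pages, "d" from none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph the heavily linked page "c" ends up with the highest score and the unlinked page "d" with the lowest, and the scores sum to 1 because they form a probability distribution.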

Because the method evaluated backlinks to estimate the importance of a site, Page and Brin initially named the new search engine "BackRub". Page and Brin credited Hassan and Alan Steremberg with being critical to the development of Google. Rajeev Motwani and Terry Winograd later co-authored with Page and Brin the first paper about the project, describing PageRank and the initial prototype of the Google search engine, which was published in 1998. Héctor García-Molina and Jeff Ullman were also named as contributors to the project. PageRank was influenced by a similar page-ranking and site-scoring algorithm invented by Robin Li in 1996 for RankDex, and Larry Page's PageRank patent cites Li's earlier RankDex patent; Li went on to found Baidu.

They eventually changed the name to Google, a play on the word googol, the very large number 10^100 (1 followed by 100 zeros), chosen to indicate that the search engine was intended to provide large amounts of information.

Google's first homepage was simple because the company's founders had little experience with HTML, the markup language used to build websites. The domain name www.google.com was registered on September 15, 1997, and the company was incorporated on September 4, 1998. It was based in Susan Wojcicki's garage in Menlo Park, California. Craig Silverstein, a Stanford PhD student, was hired as the company's first employee.

Working of the Search Engine (Google)

Meaning of a Query - To return relevant results for your query, Google first needs to work out what information you are looking for, that is, the intent behind the query. Understanding intent is fundamentally about understanding language, and it is a crucial part of Search. Google builds language models to work out which strings of words it should look up in the index. This includes steps such as interpreting spelling mistakes and trying to understand the type of query you have typed by applying the latest research on natural language understanding. Its synonym system, for example, helps Search work out what you mean by recognising that several words mean the same thing. This lets Search match the query "How to change a lightbulb" with pages that explain how to replace a lightbulb. According to Google, the system took more than five years to develop and improves results in over 30% of searches across languages.
Relevance of Web Pages - Algorithms then analyse the content of webpages to assess whether a page contains information relevant to what you are looking for. The most basic signal of relevance is that a webpage contains the same keywords as your search query. The information is more likely to be relevant if those keywords appear on the page, or if they appear in the headings or body of the text. Beyond simple keyword matching, Google uses aggregated and anonymised interaction data to assess whether search results are relevant to queries. It converts that data into signals that help its machine-learning systems estimate relevance.
Quality of Content - In addition to matching the words in your query with relevant documents on the web, search engines try to prioritise the most reliable sources available. To do this, Google's systems are designed to identify signals that help determine which pages demonstrate expertise, authoritativeness, and trustworthiness on a given topic. Google looks for sites that many users seem to value for similar queries. For example, if other prominent websites link to the page (a signal known as PageRank), that is a good sign that the information is trustworthy. Aggregated feedback from Google's Search quality evaluation process is used to further refine how its systems judge the quality of information.
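The synonym matching and keyword signals described above can be illustrated with a deliberately tiny Python sketch. It is not how Google's systems work internally: the `SYNONYMS` table, the scoring weights and the two sample pages are invented for this example, which only shows the general idea of expanding a query and counting heading matches more heavily than body matches.

```python
# Hypothetical synonym table; a real system would learn such relations from data.
SYNONYMS = {"change": {"replace", "swap"}}


def expand(query):
    """Turn each query word into a set of acceptable alternatives."""
    return [{word} | SYNONYMS.get(word, set()) for word in query.lower().split()]


def score(page, groups, heading_weight=2.0):
    """Score a page: a heading match counts more than a body match."""
    heading = set(page["heading"].lower().split())
    body = set(page["body"].lower().split())
    total = 0.0
    for group in groups:
        if group & heading:
            total += heading_weight
        elif group & body:
            total += 1.0
    return total


# Hypothetical pages.
page_1 = {"heading": "How to replace a lightbulb", "body": "turn off the power first"}
page_2 = {"heading": "Lightbulb history", "body": "edison and others"}

groups = expand("change lightbulb")
print(score(page_1, groups))  # 4.0: "replace" matches via the synonym table
print(score(page_2, groups))  # 2.0: only "lightbulb" matches
```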

Usability of Webpages

When ranking results, Google Search also considers how user-friendly a website is. When Google identifies persistent user pain points, it develops algorithms that, all other factors being equal, favour more usable pages over less usable ones. These algorithms look for signals that indicate whether all users can view the result: whether the site displays correctly in different browsers, whether it is designed for all device types and sizes (desktops, tablets, and smartphones), and whether the page loads quickly enough for users with slow Internet connections.

Context and Settings

Your location, past Search history, and Search settings all help Google tailor your results to what is most useful and relevant for you at that moment. Google uses your country and location to deliver content relevant to your area. For example, if you search "football" in Chicago, Google will most likely return results about American football and the Chicago Bears. If you search "football" in London, Google will prioritise results about soccer and the Premier League. Search settings are also an indicator of which results you are likely to find useful, for example if you have set a preferred language or opted in to SafeSearch (a tool that helps filter out explicit results).

Important Algorithm Updates by Google

Google Panda

This algorithm update, released in 2011, targeted poor practices such as keyword stuffing and duplicate content. It introduced a "quality score" that ranked web pages according to how people would perceive the content rather than how many keywords they contained. To "survive" Google Panda, marketers made sure they produced informative, high-quality content, revised underperforming articles, and used keywords strategically.

Google Penguin

This update, which went live in 2012, targeted "black hat" SEO techniques such as link directories and spammy backlinks. Like the Panda update, it also penalised keyword stuffing.

The idea was to shift away from relying on link volume to increase a page's search ranking and instead focus on high-quality content that featured only useful, engaging links.

Google Hummingbird

This 2013 update focused on bridging the gap between what people searched for and what they actually wanted to find. In other words, it tried to make the search experience more human by promoting the most useful and relevant content to the top of the results page. In response, marketers improved their chances of meeting readers' expectations by including more keyword variations and relevant search phrases.

Google RankBrain

RankBrain, an extension of Hummingbird, was launched by Google in 2015. It scores pages according to how well they appear to answer a user's search intent. In other words, it promotes the most relevant and useful content for a keyword or search phrase. To do well under RankBrain, understand the user intent behind the keywords people search for and produce rich, high-quality content that meets their expectations.

Future of Search Engines

We can see how business models and breakthroughs have shaped search in the past. Google became the most successful search engine by instantly connecting consumers to content across the internet that solved their problems. YouTube offers a plethora of how-to, unboxing, and review videos that have helped people solve their problems. Amazon has connected people with the things they want to buy at ever-faster speeds, again solving their problems in the process. On this premise, any product, technology, or service that can solve consumers' problems faster and more simply than those available today is likely to lead the future of search. The predictions below for future search patterns follow from this.

Search will be Experience Led

When search first began, it was mostly a text-in, text-out experience. Users would type their query as text, and the search engine would return ten text results. The experience improved over time with the development of rich, universal search results, which introduced additional media types such as images, products, maps, news, and videos. Today there are a plethora of new input experiences (voice, image, and video) and new output experiences (360-degree images, interactives, and augmented reality).

Search will Exist Everywhere

As we consider the possibilities of these new experiences, we can begin to blend inputs and outputs (and, more importantly, the devices on which they sit). Indeed, the impact of these additional connected devices on the search landscape could be significant. Combining units shipped with sales forecasts for desktops, mobiles, and other connected devices suggests that, while the mobile device is currently the most popular, it may be overtaken by the sheer number of other connected devices by 2022. As these devices become more common, search will become more frequent.

Search will be More Personal

Google Assistant, Amazon Alexa, Apple's Siri, and Microsoft's Cortana are all at an early stage of development, but they will eventually power everything around us. Imagine Google.com leading straight into a conversation with Google Assistant, or talking to Alexa rather than visiting Amazon.com, and you get a sense of what could happen. This level of deep personalisation will require consumers to be willing to hand over personal information in exchange for better service. Experience suggests, however, that consumers will see the value in doing so if it helps them in the long run.

Search will be More Action Based

This is already happening in Google's SERPs (Search Engine Results Pages). For queries such as "what time is it?" and "who were the Mayans?", Google already answers directly in the SERP. Consider a scenario in which search engines can help with more complex requests, such as "Please order me a brown hair dye to match my regular colour," "Please order me an Uber," or "Please buy some toys for my daughter for the holidays." With access to personal data, calendars, address books, and past purchase patterns, direct action on all of these requests is feasible. Search will become more action-oriented as personal assistants remove barriers to problem-solving.

Search will be More Pre-emptive

Pre-emptive search is now available on the Android platform, with push notifications notifying you of forthcoming restaurant reservations, hotel reservations, flights, or travel. This pre-emptive search becomes viable with access to calendars and emails. Another example is Google's Discover platform, which presents material based on a user's previous search history before they search for it. Search will become more pre-emptive as it learns what consumers will require before they even realise it.

Conclusion

To conclude, we have seen where the web search industry began and how it has evolved over the years. We have also seen how some landmark decisions had a huge impact and helped the industry grow further.

All in all, this is a field with a lot of potential, and it is expected to bring new innovations that make the experience even more user-friendly. Although a few big players have dominated for a long time, only time will tell whether a new company will be able to make its mark in the future.

References

https://en.wikipedia.org/wiki/Search_engine_technology

https://omarzahran.medium.com/the-evolution-of-the-search-engine-c9b0bb08bfb0

https://www.iprospect.com/en/gb/news-and-insights/news/the-future-of-search/

https://www.thinkwithgoogle.com/intl/en-apac/consumer-insights/consumer-trends/evolution-search/