The Yioop search engine is designed to allow users to produce indexes of a web-site or a collection of web-sites. The number of pages a Yioop index can handle ranges from small sites to sites containing tens or hundreds of millions of pages. The largest index so far created using Yioop is slightly over a billion pages. In contrast, a search engine like Google maintains an index of tens of billions of pages. Nevertheless, since you, the user, have control over the exact sites which are being indexed with Yioop, you have much better control over the kinds of results that a search will return. Yioop provides a traditional web interface to do queries, an RSS API, and a function API. It also supports many common features of a search portal such as user discussion groups, blogs, wikis, and a news and trends aggregator. In this section we discuss some of the different search engine technologies which exist today, how Yioop fits into this eco-system, and when Yioop might be the right choice for your search engine needs. In the remainder of this document after the introduction, we discuss: how to get and install Yioop; the files and folders used in Yioop; the various crawl, search, social portal, and administration facilities in Yioop; localization in the Yioop system; building a site using the Yioop framework; embedding Yioop in an existing web-site; customizing Yioop; and the Yioop command-line tools.

Since the mid-1990s a wide variety of search engine technologies have been explored, and understanding some of this history is useful in understanding Yioop's capabilities. In 1994, Web Crawler, one of the earliest still widely-known search engines, had an index of only about 50,000 pages, which was stored in an Oracle database. Today, databases are still used to create indexes for small to medium size sites; an example of such a search engine written in PHP is Sphider. Given that a database is being used, one common way to associate a word with a document is to use a table with columns like word id, document id, and score. Even if one is only extracting about a hundred unique words per page, this table would need hundreds of millions of rows for even a million page index. This edges towards the limits of the capabilities of database systems, although techniques like table sharding can help to some degree. The Yioop engine uses a database to manage some things like users and roles, but uses its own web archive format and indexing technologies to handle crawl data. This is one of the reasons that Yioop can scale to larger indexes.

By 1997, commercial sites like Inktomi and AltaVista already had tens or hundreds of millions of pages in their indexes. Google circa 1998, in comparison, had an index of about 25 million pages. These systems used many machines, each working on parts of the search engine problem. On each machine there would, in addition, be several search-related processes, and for crawling, hundreds of simultaneous threads would be active to manage open connections to remote machines. Without threading, downloading millions of pages would be very slow. PHP is the `P' in the very popular LAMP web platform; this is one of the reasons PHP was chosen as the language of Yioop. Unfortunately, PHP does not have built-in threads. However, the PHP language does have a multi-curl library (implemented in C) which uses threading to support many simultaneous page downloads. Like these early systems, Yioop also supports the ability to distribute the task of downloading web pages to several machines.

There are several aspects of a search engine besides downloading web pages that benefit from a distributed computational model. One of the reasons Google was able to produce high quality results was that it was able to accurately rank the importance of web pages. The computation of this page rank involves repeatedly applying Google's normalized variant of the web adjacency matrix to an initial guess of the page ranks. This problem naturally decomposes into rounds: within a round, the Google matrix is applied to the current page rank estimates of a set of sites. This operation is reasonably easy to distribute to many machines. Since the problem of managing many machines becomes more difficult as the number of machines grows, Yioop further provides a web interface for turning on and off the processes related to crawling on remote machines managed by Yioop.
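The round-by-round page rank computation is essentially power iteration. The following is a minimal sketch in Python, using an assumed toy four-page link graph and the standard 0.85 damping factor; it is illustrative only, not Yioop's actual implementation:

```python
# Toy 4-page web graph: links[i] lists the pages that page i links to.
# Illustrative data only, not anything from Yioop itself.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = 4
damping = 0.85  # standard PageRank damping factor

def apply_google_matrix(v):
    """One round: damped link-following plus a uniform 'teleport' term.

    Assumes every page has at least one outlink, as in the toy graph.
    """
    new = [(1 - damping) / n] * n
    for i, outs in links.items():
        share = damping * v[i] / len(outs)
        for j in outs:
            new[j] += share
    return new

ranks = [1.0 / n] * n  # initial guess: uniform over all pages
for _ in range(100):   # repeat rounds until the estimates settle
    ranks = apply_google_matrix(ranks)

print([round(r, 3) for r in ranks])  # ranks sum to 1
```

Each pass of the loop is one round; in a distributed setting, different machines would apply the matrix to the rank estimates of different subsets of sites and exchange the results between rounds.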
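Multi-curl lets a single PHP process keep many HTTP transfers in flight at once. The same effect can be sketched in Python with a thread pool; the fetch function below is a stand-in that only simulates network latency rather than performing a real download:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for an HTTP fetch: sleeps like a slow server would.

    A real crawler would issue an actual HTTP request here.
    """
    time.sleep(0.2)  # simulated network latency
    return (url, 200)

urls = ["http://example.com/page%d" % i for i in range(10)]

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.time() - start

# With 10 worker threads the ten 0.2 s fetches overlap, so the total
# time is far below the roughly 2 s that sequential downloads would take.
print(len(results), round(elapsed, 1))
```

Swapping the simulated fetch for a real HTTP request turns this into a basic concurrent downloader, which is the role multi-curl plays inside Yioop's fetcher.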
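The word-id/document-id/score table can be sketched in a few lines using SQLite from Python; the schema and values here are illustrative, not Sphider's or Yioop's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE postings (
    word_id INTEGER, doc_id INTEGER, score REAL)""")

# One row per (word, document) pair. At ~100 unique words per page,
# the row count is about 100 times the number of pages, hence hundreds
# of millions of rows for a million-page index.
rows = [(1, 10, 0.8), (1, 11, 0.3), (2, 10, 0.5)]
conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)

# A query answers "which documents contain word 1?", best score first.
docs = conn.execute(
    "SELECT doc_id FROM postings WHERE word_id = 1 ORDER BY score DESC"
).fetchall()
print(docs)  # [(10,), (11,)]
```

An index on word_id keeps lookups fast, but as the paragraph above notes, the sheer row count at web scale is what pushes this design past what general-purpose databases handle comfortably.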
If you have downloaded a prior version of the Yioop software, you may prefer to use one of the following PDF captures of the software documentation: