Today’s Internet search engines compute their centralized index by crawling web content. This approach entails two major problems: first, large and relevant parts of the Internet’s content are not reachable by crawling and thus remain inaccessible to search engines; second, the available bandwidth and its growth impose harsh limits on the currency of a central index and on the share of the vastly growing information that can be indexed at all.
The obvious solution is a distributed approach to information retrieval that makes better use of the available bandwidth in order to achieve higher index currency and improved coverage, including deep web content. Forward knowledge – such as keyword indices – has to be stored closer to the searchable information sources than the central index approach allows today. Furthermore, it has to be updated in a more bandwidth-efficient manner than the change-detection heuristics and “brute force” crawling methods in use today.
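The idea of provider-side forward knowledge can be sketched as follows: the content host maintains its own inverted keyword index and publishes only the changes since the last publication, rather than being re-crawled in full. This is a minimal illustration under assumed names – the class, its methods, and the delta format are hypothetical, not the prototype’s actual interface:

```python
from collections import defaultdict


class ProviderIndex:
    """Hypothetical provider-side keyword index: the content host indexes
    its own documents and ships only incremental updates ("deltas") to a
    distributed search layer, instead of being crawled by brute force."""

    def __init__(self):
        self.index = defaultdict(set)    # keyword -> set of document ids
        self.pending = defaultdict(set)  # postings not yet published

    def add_document(self, doc_id, text):
        # Naive tokenization for illustration only.
        for word in set(text.lower().split()):
            if doc_id not in self.index[word]:
                self.index[word].add(doc_id)
                self.pending[word].add(doc_id)

    def publish_delta(self):
        """Return only the postings changed since the last publication --
        far cheaper in bandwidth than re-transferring all content."""
        delta = {word: sorted(ids) for word, ids in self.pending.items()}
        self.pending.clear()
        return delta


idx = ProviderIndex()
idx.add_document("d1", "distributed search for the deep web")
idx.publish_delta()                  # first full publication
idx.add_document("d2", "deep web search")
print(idx.publish_delta()["deep"])   # only the new posting: ['d2']
```

The point of the sketch is the asymmetry: after the initial publication, each update costs bandwidth proportional to what changed, not to the size of the provider’s whole information space.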
Collected metrics show that the amount of publicly accessible information on the Internet grows much faster than the available Internet backbone bandwidth. Application-generated content, in particular, turns out to grow especially fast. Crawling-based technologies for global Internet search will not be able to keep pace with these growth rates. Moreover, much of the application-generated content is hidden from crawlers in the “deep web” and out of their reach entirely.
The solution lies in reversing the paradigm of Internet search: content providers will have to contribute to the searchability of their own information space, thereby making search more bandwidth-efficient and opening deep web content to search.
A prototype demonstrating how this can be done has been put online, together with a white paper explaining the most important concepts. It combines best-of-breed techniques from the field of information retrieval with an architectural approach to designing searchable online applications.
(Cooperation: Interactive Objects Freiburg)