Thursday, November 4, 2010

Programming the Internet 1.1


The impact of the deep web on search engines, academia and commercial sites can be summarized by the emergence of new commercial products aimed directly at the information management sector.

The issue with the deep web is that the various commercial bodies that contribute to the internet at large also maintain large repositories of competitive knowledge, or repositories that have no external links or connections; these may be characterized as collections of “trade secrets” and “information”. The primary example would be the recipe for fries at McDonald’s: you will find the nutritional make-up of the fries on McDonald’s website, but you would be hard pressed to find the details of their preparation anywhere. These commercial bodies engage in public or semi-public communication with their commercial partners via the internet; but just as not all radio communication is for public consumption[i], neither are all web servers.

Michael K. Bergman wrote:

“Traditional search engines create their indices by spidering or crawling surface Web pages. To be discovered, the page must be static and linked to other pages. Traditional search engines can not "see" or retrieve content in the deep Web — those pages do not exist until they are created dynamically as the result of a specific search. Because traditional search engine crawlers can not probe beneath the surface, the deep Web has heretofore been hidden.”[ii]
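To make Bergman’s point concrete, the sketch below (a minimal illustration using only the Python 3 standard library, not a production crawler) follows static links breadth-first from a seed page. Any page that is reachable only through a query form, or that has no inbound links, is simply never enqueued, which is exactly the blind spot described above.

```python
# A minimal sketch of a link-following ("surface web") crawler, Python 3 stdlib only.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags on a single page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50):
    """Breadth-first crawl from seed_url, following static links only."""
    seen, queue, index = set(), deque([seed_url]), {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable pages are simply skipped
        index[url] = html  # a real engine would tokenize and index here
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))  # only linked pages are ever discovered
    return index
```

A real engine adds robots.txt handling, politeness delays and deduplication; the point here is only that discovery depends entirely on links already present on crawled pages.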

The only available method to search the “deep web”, as stated by Bergman, is to conduct direct searches of non-linked sites using cross-referencing technology such as that cited in Cyveillance’s studies[iii]. Realistically, the only method that would not rely on educated guesses would involve using a network-scanning utility such as Nmap, Nessus or Metasploit to crawl the address space from 0.0.0.0 to 255.255.255.255 and index those findings against the various engines used, or to leverage existing heat maps from CAIDA to establish the publicly routable space as the primary scope[iv] for indexing. The major issue is that this approach faces a number of legal barriers, since in many countries “port scanning” constitutes a crime, and building a stateful web crawler for the entire space poses a real technical and financial challenge.
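A toy sketch of that brute-force probing follows, assuming Python 3 and plain TCP connects rather than a real scanner such as Nmap. The CIDR range shown (192.0.2.0/29, a documentation-only block) is an illustrative placeholder; as noted above, probing address space you do not control may be illegal.

```python
# Probe a small, assumed-to-be-authorized IPv4 range for listening web servers.
import socket
from ipaddress import ip_network


def find_web_servers(cidr="192.0.2.0/29", port=80, timeout=0.5):
    """Return the addresses in `cidr` that accept a TCP connection on `port`."""
    responsive = []
    for host in ip_network(cidr).hosts():
        try:
            with socket.create_connection((str(host), port), timeout=timeout):
                responsive.append(str(host))  # something is listening; hand off to a crawler/indexer
        except OSError:
            pass  # closed, filtered, or unreachable
    return responsive


if __name__ == "__main__":
    print(find_web_servers())
```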

The issue with search engines, as stated in Bergman’s paper, is quality. Although there is a significant quantity of deep web sites, most of which are topical databases, search engines are more concerned with the quality and accuracy of their results than with their quantity.[v]

Academia is concerned primarily with the quality, accuracy and relevance of information rather than its quantity; as such, various search engine providers are offering services that cater to the volume of knowledge held within the traditional houses of excellence.[vi] These include publishers such as Prentice Hall, Springer, Deitel, IEEE, ACM, and other academic organizations.

Commercial entities desire competitive intelligence in addition to labour and resources. Competitive intelligence is based primarily on what is known about one’s competition; next to unintentional disclosure, the volume of information available online, via both traditional search engines and deep web sources such as customs import and landing databases, would allow any corporate entity to determine a number of characteristics of its competition that would otherwise remain unknown. Businesses already exist to mine this volume of information, and competitive intelligence is an emerging market in which various companies offer services of this nature.[vii]

The effect of the deep web on future search engines will be a granular focus on content and on content analysis. As the deep web grows and holds more information of value, search engines will have to develop non-index-based databases built on tertiary page characteristics and information such as metadata, captured through intelligent techniques such as machine learning[viii]. This future, although currently dark, will be illuminated by the businesses that stand to gain the most from data mining, business intelligence and information capture.
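As a rough illustration of the metadata-driven, content-focused indexing anticipated above, the sketch below (Python 3 standard library; the field names chosen are assumptions, not a fixed schema) pulls the title and meta descriptors from a page so that downstream analysis, machine-learning or otherwise, can work on page characteristics rather than on link structure alone.

```python
# Extract <title> text and <meta name=... content=...> pairs as indexing features.
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collect the page title and named meta tags from an HTML document."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs and "content" in attrs:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_characteristics(html):
    """Return the descriptive fields a content-focused index might store."""
    parser = MetadataExtractor()
    parser.feed(html)
    return {"title": parser.title.strip(),
            "description": parser.meta.get("description", ""),
            "keywords": parser.meta.get("keywords", "")}
```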


References



[i] Sokol, Brett; Miami New Times; Espionage Is in the Air [Online] World Wide Web, Available from: http://www.miaminewtimes.com/2001-02-08/news/espionage-is-in-the-air/ (Accessed on November 4, 2010)
[ii] Bergman, Michael K.; White Paper: The Deep Web: Surfacing Hidden Value [Online] World Wide Web, Available from: http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104 (Accessed on November 4, 2010)
[iii] Murray, Brian H.; Moore, Alvin; Cyveillance; Sizing the Internet: A White Paper [Online] PDF document, Available from: http://www.cs.toronto.edu/~leehyun/papers/Sizing_the_Internet.pdf (Accessed on November 4, 2010)
[iv] N.A.; The Cooperative Association for Internet Data Analysis (CAIDA); Measuring the Use of the IPv4 Space with Heat Maps [Online] World Wide Web, Available from: http://www.caida.org/research/traffic-analysis/arin-heatmaps/ (Accessed on November 4, 2010)
[v] Bergman, Michael K.; White Paper: The Deep Web: Surfacing Hidden Value [Online] World Wide Web, Available from: http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104 (Accessed on November 4, 2010)
[vi] N.A.; Google Inc.; Google Scholar Search Engine [Online] World Wide Web, Available from: http://scholar.google.com (Accessed on November 4, 2010)
[vii] N.A.; ImportGenius; About Import Genius [Online] World Wide Web, Available from: http://www.importgenius.com/about.html (Accessed on November 4, 2010)
[viii] Mitchell, Tom M.; CMU; July 2006; The Discipline of Machine Learning [Online] PDF document, Available from: http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf (Accessed on November 4, 2010)
