The first exposure I had to solving the problem I'm going to present in this blog was at the National Science Foundation (NSF), which is charged with funding grants for research across an ever-evolving set of sciences, generally those outside of medicine since that falls to the Department of Health and Human Services. The practical face of their version of this problem is the sheer number of proposals submitted to apply for money from these grants. Generally the grants are structured to solicit the solving of a problem rather than a particular way of solving it, which means scientists from many fields will take a stab at proposing a solution to win the grant money and carry out their research.
NSF's biggest expenditure outside of the grants themselves is bringing in experts in the field of study who can judge the merit of these proposals; they convene a panel to collectively decide who should receive the grant. While that sounds simple enough, the minutiae of each approach and the science it entails pose a large problem: simply categorizing each submission so that it can be reviewed by the appropriate experts. This 'pre-processing' task alone accounts for a majority of the operating budget at NSF and consumes time from some of the best and brightest people available.
The initial solution proposed to alleviate this heavy lifting was a mixture of text mining and RDF based on work done with MEDLINE and HHS, as seen here and here. This back end was coupled with tools for modeling taxonomy and ontology such as Protege and TopBraid, plus a middle layer for visualizing results called Siderean Seamark. While this approach seemed logical, the problem at NSF was that no corpus like MEDLINE is available for 99% of the sciences documented, a list that keeps evolving. You can in fact find a shallow taxonomy of these fields and their children in directories on the internet, but not the deeper concepts that must be represented to accurately 'bucket' the research proposals. Therefore the initial data to prime the system for the desired result just didn't exist.
Fast forward to another project, which was looking at a search solution for the data collected during the Bush administration. This was purported to be in the petabyte range and consisted mostly of email and attachments as well as policy and official releases. While the choice of OutsideIn to parse the data (the same technology Google uses for the deep web) was fairly obvious, what concerned the National Archives was that the Clinton library was only a few terabytes and already performing poorly. Even granting that the hardware running that data set may have been slightly anemic, we were dealing with several orders of magnitude more data here, and a machine scaled up by that factor may not even exist. The alternate solution, of course, is the Google one, where a data center of commodity blades is used to process searches. Google, along with Microsoft FAST, were competitors in this technology evaluation.
After looking a bit more closely at the alternatives, including the scalability of inverted indexes, hardware vendor analysis from IBM leveraging Nutch and Lucene, and distributed file systems like Hadoop used by Google and others, I came to the conclusion that out-of-the-box technologies (at least from Oracle) had only one chance to compete with more established technologies, and it wasn't going to be the Text index shown here:
While this was a good design, it didn't seem to scale up or out efficiently even when put on hardware like this, which supports 128TB of RAM, or about an eighth of the size of the content itself. In reality it's the size of this type of index that proves cumbersome, so we attempted a new approach that would get away from an index that is a high percentage (sometimes half) of the size of the content itself.
At the root of the solution was the idea that we would use OutsideIn to parse all email, HTML, XML and binary document content into a homogeneous XML that could be stored in XML DB and run through an XSLT stylesheet to produce 'snippets' when those entries were tagged in a search. The real crux of the performance issue was addressed at crawling and parsing time by stripping away the Wordlist, Stoplist and Lexer from the 'index' structure that would be queried in searches. This very important part of any full-text index was abstracted into a flat table of the entire English language as extracted from WordNet. There was also a semantic version of WordNet that would allow for expanding search terms. Since this was an all-English corpus the approach was linguistically valid, as there are any number of off-the-shelf solutions that translate end results to other languages. Searching in other languages was never intended to be supported.
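To make the dictionary-table idea concrete, here is a minimal sketch of parse-time reduction: every English word gets a small integer id (a plain dict stands in for the WordNet-derived relational table), and each crawled document is reduced to (doc_id, word_id, ordinal) rows rather than a classic full-text index structure. All names and structures here are illustrative, not the actual NSF/NARA code.

```python
import re

# Toy stand-in for the ~1M-word WordNet-derived dictionary table.
dictionary = {}

def word_id(word):
    """Return the integer id for a word, adding new words as crawls find them."""
    w = word.lower()
    if w not in dictionary:
        dictionary[w] = len(dictionary) + 1
    return dictionary[w]

def parse_document(doc_id, text):
    """Strip a document down to (doc_id, word_id, ordinal) posting rows."""
    rows = []
    for ordinal, token in enumerate(re.findall(r"[A-Za-z']+", text), start=1):
        rows.append((doc_id, word_id(token), ordinal))
    return rows

rows = parse_document(1, "The quick brown fox jumps over the lazy dog")
```

Note that repeated words ("The" and "the" above) map to the same id, which is exactly what keeps the stored footprint small.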
As new words and acronyms were encountered by subsequent crawls they would be added to the relational table as well as the semantic table. Support for white space, punctuation and other noise such as special characters was added for use in exact quote matches. Of importance here is the realization that the English language base is only ~1 million words; even if it doubles or triples with slang, acronyms, etc., it is still a relatively small data field when keyed with a 3-byte integer. In addition there was an XML CLOB to support the retrieval of snippets. The primary key columns contained a number for the document id and one for each discrete instance of a word in each document. The latter would serve as the ordinal position for retrieving words quoted in the search criteria; it would receive its own index and be backed by a sequence that reset for each document. The total in-line record size of these primitive types would be 3+4+4=11 bytes, so tens of billions of total words would yield only roughly a terabyte of integer storage and could be managed on TimesTen. We now had petabytes of content stored as searchable in terabytes' worth of integers and indexes. The cardinality of the dictionary word id to its foreign key in the table would be supported via a bitmap index or bitmap join index. The issues left to conquer were the maintenance of that index on subsequent crawls (that is, inserts into the table) and how long it would take to create a multi-terabyte index of this type, since recreation is the recommended approach: incremental maintenance on the index would be prohibitive and yield fragmentation. Partitioning and its globally partitioned indexes could not be used, as they did not support the bitmap index type at this point.
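Here is a small sketch of how the per-document ordinal column supports exact quoted phrases: a phrase matches when its word ids occur at consecutive ordinals within a single document. The table layout and names are illustrative stand-ins for the relational structure described above.

```python
from collections import defaultdict

def build_postings(rows):
    """Map word_id -> {doc_id -> set of ordinals} from (doc, word, ordinal) rows."""
    postings = defaultdict(lambda: defaultdict(set))
    for doc_id, wid, ordinal in rows:
        postings[wid][doc_id].add(ordinal)
    return postings

def phrase_match(postings, word_ids):
    """Return doc ids where the word ids appear at consecutive ordinals."""
    hits = set()
    first = postings.get(word_ids[0], {})
    for doc_id, starts in first.items():
        for start in starts:
            # Each subsequent word must sit at the next ordinal in this doc.
            if all((start + i) in postings.get(w, {}).get(doc_id, set())
                   for i, w in enumerate(word_ids[1:], start=1)):
                hits.add(doc_id)
                break
    return hits

rows = [(1, 10, 1), (1, 11, 2), (1, 12, 3),   # doc 1: words in phrase order
        (2, 11, 1), (2, 10, 2), (2, 12, 3)]   # doc 2: same words, shuffled
postings = build_postings(rows)
```

In the real design this consecutive-ordinal check would be an SQL self-join on the posting table rather than Python, but the logic is the same.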
The real value of the solution for the President Bush library is that it would run on a traditional scale-up, shared-global-memory system, saving data center costs as well as labor costs, since maintenance would involve the standard off-the-shelf components used in many places today. As an SQL-based system, queries would be expanded with a semantic query against the RDF version of WordNet; 3-byte integers would be retrieved from the relational representation of the WordNet data and used in the main query against the bitmap index (or bitmap join index), resulting in a 'greedy' get on all of the 'hits' where that word, synonym, etc. occurs in the body of crawled data. This inner query of a correlated subquery would move the pertinent records out to another cache area where they could again be semantically filtered based on the search terms and used to retrieve their XML representation, which gets merged with an XSLT to produce snippets of the results.
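The query flow just described can be sketched with toy stand-ins: a search term is expanded through a synonym map (the semantic WordNet layer), resolved to small integer ids (the relational WordNet table), and the ids drive a union of posting lookups (the 'greedy' get against the bitmap index). All names and data here are hypothetical.

```python
synonyms = {"car": ["auto", "automobile"]}               # semantic expansion layer
word_ids = {"car": 101, "auto": 102, "automobile": 103}  # relational word table
postings = {101: {1, 4}, 102: {2}, 103: {4, 7}}          # word_id -> doc ids

def expanded_search(term):
    """Expand a term, map each variant to its integer id, union the doc hits."""
    terms = [term] + synonyms.get(term, [])
    hits = set()
    for t in terms:
        wid = word_ids.get(t)
        if wid is not None:
            hits |= postings.get(wid, set())
    return hits
```

The key property is that the expensive text never enters the query: everything after expansion is integer comparisons, which is what makes the bitmap-index approach viable at this scale.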
It's not hard to see that clickstream or other 'ranking' mechanisms for search results would be an easy bolt-on to this simplest of data representations. I should credit some documentation that I used to validate this approach here, and some insight on the binary state of the bitmap index here and here. Since so much of what gets stored in this situation is redundant, Advanced Compression techniques can be used against the document base and XML representation for significant savings. What I'm getting at is that all of the tools are available on standard platforms and standard off-the-shelf software to build your own mini Google, as it were; albeit constrained to a single language, it opens up a world of possibilities.
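As one hypothetical example of such a bolt-on, a click counter per document id is all it takes to start ordering hits by popularity; the names below are illustrative, not part of the original design.

```python
from collections import Counter

# Accumulated clicks per document id, fed by the search front end.
clicks = Counter()

def record_click(doc_id):
    clicks[doc_id] += 1

def rank(hits):
    """Order result doc ids by accumulated click counts, most-clicked first."""
    return sorted(hits, key=lambda d: clicks[d], reverse=True)

record_click(7); record_click(7); record_click(4)
```

Because results are already just sets of integer doc ids, any ranking signal keyed on doc id slots in without touching the core structure.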
Let me tie all of this together and talk about why it's important that more mainstream computing power be used for Web 3.0, rather than leaving it in the hands of the few who can fund their own data centers for 'clouds' or win multi-million dollar grants to author a more single-purpose system. There are many folks doing practical things with mainstream technology, and I'll refer you to two such folks here and here. I find it interesting that the lead of the Web 3.0 entry in Wikipedia says "it refers to aspects of the Internet which, though potentially possible, are not technically or practically feasible at this time", and that is what I think can be changed here. An old colleague from Oracle makes my case here. Although Oracle doesn't have a research division per se in the vein of a Microsoft or IBM, what they do have are technologies that are practical, well thought out and (generally) ready for use without a PhD to understand them.
Back to the original National Science Foundation problem, and the understanding that you start with essentially an infant system that doesn't understand much of anything except how to answer a search query. Using the collective intelligence of the people browsing the information, you can begin to build an understanding of the relative bonds within it, and through experts tagging results with their interpretations, build an 'understanding' of the underlying data. You've taken natural language processing in its classical sense somewhat out of the picture and let the users interacting with the system render the interpretive results, language included.
Now imagine if experts in the subject matter were able to infuse their knowledge via OWL ontologies. There is a good book I've read on this area of research called Data Mining with Ontologies: Implementations, Findings and Frameworks, which really begins to show how semantic content can not only be used to enhance search queries and results navigation but in fact control the way in which bodies of data are mined for intelligence. Powerful, huh? Now your browsing history and favorites can be made into a semantic package that gives you context when you interface with this source; Google Desktop and others have seen this vision as well. One could make the case that it will be startups like 33Across and Peer39 that actually monetize Facebook and other social networking sites.
Not surprisingly, Oracle chose to put the semantic component inside their Spatial extension to the database. When you look at the Wikipedia Web 3.0 entry under Other Potential Research you see tremendous opportunity inside this data structure format, as the data itself becomes dimensionally navigable. For one of the best explanations of this complex paradigm I refer you to a publication called The Geometry of Information Retrieval, a brilliantly thought out explanation of the 'existentialism' of the information itself. Mathematical results like the Cauchy-Schwarz inequality shed light on how to use data mining techniques such as probability and variances to bucket data into correlated informational assets.
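As one concrete instance of that mathematics, the Cauchy-Schwarz inequality is what guarantees that the cosine similarity between two term-weight vectors in a vector-space retrieval model is a bounded, comparable score (this is a standard result, offered here only to illustrate the point):

```latex
\[
  \lvert \langle x, y \rangle \rvert \;\le\; \lVert x \rVert \, \lVert y \rVert
  \quad\Longrightarrow\quad
  -1 \;\le\; \cos\theta = \frac{\langle x, y \rangle}{\lVert x \rVert\,\lVert y \rVert} \;\le\; 1
\]
```

It is exactly this bound that lets you compare similarity scores across documents of wildly different lengths and bucket them into correlated sets.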
I believe Second Life, and the quote below from Sir Tim Berners-Lee:
"I think maybe when you've got an overlay of scalable vector graphics—everything rippling and folding and looking misty—on Web 2.0 and access to a semantic Web integrated across a huge space of data, you'll have access to an unbelievable data resource."
certainly point to the power of what we're dealing with: not only a distributed model of collective intelligence that learns over time through interaction and data acquisition, but an interface that allows users to immerse themselves inside a navigable information space born from the very cognitive representation of the knowledge they seek. The real power comes in the future (hopefully in my lifetime) when we understand how to translate this knowledge base into any language and into other human interface devices, realizing the socio-technological research discussed in the other area of Potential Research shown on the Wikipedia Web 3.0 entry. It gives hope that the transfer of this knowledge, and therefore enlightenment and understanding, would be much easier for all to achieve.
As my hacker ethic would have me believe, all of this should and will be accomplished by the masses, since they will be its recipients; and it will not succeed without the proper input, where the 'Internet', Web 3.0 and beyond, remains an intangible, untaxed and unowned amorphous entity that can be used for the greater good.