Thursday, March 24, 2011

Business Intelligence for Your Cloud

As cloud computing hits the initial incline of the maturity curve, you begin to see a coupling of capabilities from disciplines that might previously have been considered strange bedfellows. There are many examples of this, such as security's impact on power usage efficiency through the enabling of multi-tenancy. The one I want to focus on in this posting is business intelligence for your cloud operations. On the surface this sounds benign enough, since we are often asked to produce business intelligence reports measuring the things that matter to our respective organizations, but this posting looks further into the future. In many ways that future is available now and should be factored into what is a relatively green field in IT operations: the journey from virtualization to cloud computing. Calling anything a 'green field' is bold in any context, but the general direction to virtualize does create some separation of concerns; 'lift and shift' is the term I've often heard for this type of situation. In that respect it is an opportunity to rethink how your approach to something as complex as cloud computing might evolve over time, and how to align management practices to this new paradigm so that proper controls can be exercised.

At the top of that maturity curve you often hear the term 'utility' computing thrown around as well. In fact my 'crystal ball' is in part shaped by Telco experiences in days of yore: first supporting a billing system that collected data from a 5ESS switch, then managing an FCC fairness reporting system so that one of the baby Bells could enter the long distance market. One anecdotal takeaway is that the 5ESS switch ran the network carrying phone calls, so capturing adequate data to convert for BSS and other financial systems was a main focus of operations. Another is that the OSS data from such efforts was then 'rolled up' in many ways, not only for internal study of profitability but also for regulators, and thereby for competitors' scrutiny as well, given the open nature of those data sets. These business intelligence rollups are what executives need to discuss the state of the business in terms like compliance and profitability. There is an opportunity for net new applications, built on a modern Platform as a Service offering such as VMware vFabric and running in a modern cloud infrastructure like CloudFoundry, to exhibit the elastic operational capabilities and the transparency necessary to achieve a true utility model. The number of applications that will not get that opportunity for a rewrite in the near term is vast, so the target pattern for what will likely be the mainstay of cloud computing, until the scale tips toward those 'utility'-capable applications, is what is known as 'IT as a Service'. More succinctly, this is a cloud that provides enough automation and manageability for consumers to request capabilities from it, and for the IT organization supporting existing applications that have been made cloud ready to understand how that cloud ecosystem can support the requirements driven by a concurrent diversity of consumers.

Perhaps the best analog I can think of to illustrate how this new breed of cloud management tools combines to form a gestalt is the way manufacturing, supply chain and logistics, inventory management and point of sale have become such a cohesive whole that enterprises controlling all of these facets, from creation of goods to their retail sale, can literally optimize their entire operation by simply funding more advertising or other campaigns. This capability exists because they have complete, near real time visibility into the discrete functional elements of their enterprise, know how to add resources to each element to support greater throughput, and understand both their total capacity and the limits of individual facets, including how those limits impact adjacent, likely tangible, dependencies. Those who can master the ability to procure and manage this pattern within a cloud infrastructure, its cost and the methods to leverage it in the most efficient ways, will experience the highest margins. They will also have a customer base that, coming from on-premise solutions with perpetual licensing and other high initial capital outlays, will be more amenable to consuming the cloud-based service at a price point that matches their business terms, e.g. per user per time period.

So let's talk for a minute about the title of this posting, Business Intelligence for Your Cloud. For VMware and our customers the simple message of 'Your Cloud' reflects our belief that your journey to the cloud begins with virtualizing your own systems in your own data center. One of the main reasons for this is the requirement for your organization to internalize what it means to 'control' a virtualized infrastructure. I use the term 'control' in the vein of IT Portfolio Management or the Balanced Scorecard, understanding that what can be measured can be verified. The main reason for establishing this baseline of control is that as you move from a virtualized enterprise to a private and eventually a hybrid or other cloud model, there will undoubtedly be third parties inserted into how your enterprise IT gets delivered. It will become critical to transpose these mechanisms of control to your cloud-hosting provider, as an example, so that contracts and SLAs can be concretely negotiated and then verified on an active, ongoing basis. At the end of the day this will involve collecting business intelligence, just as it does for most efforts in other, more tangible areas of the enterprise.

To cover what business intelligence FOR your cloud means, I'll start by saying what it isn't: business intelligence IN your cloud. That would be an interesting blog topic, but it would likely include Business Activity Monitoring, Complex Event Processing, Service Oriented Business Intelligence, Real Time Decisioning, etc. configured to study some other line of business within your enterprise. At VMware, business intelligence comes in the form of capturing the 'instrumentation' of all layers of your virtualization and cloud infrastructure. The operational data needed to accurately monitor these applications and their supporting virtualization infrastructure is available via VMware vFabric Hyperic, which provides mechanisms to correlate all of this time series data into collections that provide meaningful, event-based insight. You may also want to automate the tracking of operational SLA status using VMware vCenter AppSpeed, achieve event-based notification of configuration compliance using VMware vCenter Configuration Manager, or utilize VMware vCenter Chargeback to translate usage clearly into costs.
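
To make the idea of correlating time series data into event-based insight concrete, here is a minimal sketch in Python. It is purely illustrative and not the Hyperic API; the metric names, thresholds and sample feed are hypothetical.

# Illustrative sketch only: a toy model of rolling up time-series operational
# metrics into event-based insight. Metric names and thresholds are
# hypothetical; this is not the vFabric Hyperic API.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Sample:
    resource: str      # e.g. a VM, app server, or datastore
    metric: str        # e.g. "cpu_pct", "response_time_ms"
    timestamp: int     # epoch seconds
    value: float

THRESHOLDS = {"cpu_pct": 85.0, "response_time_ms": 500.0}  # assumed SLA limits

def correlate(samples):
    """Group samples by resource and emit an 'event' whenever a metric
    breaches its threshold, so operators see incidents, not raw numbers."""
    events = defaultdict(list)
    for s in sorted(samples, key=lambda s: s.timestamp):
        limit = THRESHOLDS.get(s.metric)
        if limit is not None and s.value > limit:
            events[s.resource].append(
                {"at": s.timestamp, "metric": s.metric, "value": s.value, "limit": limit}
            )
    return dict(events)

if __name__ == "__main__":
    feed = [
        Sample("web-vm-01", "cpu_pct", 1300, 92.0),
        Sample("web-vm-01", "response_time_ms", 1305, 620.0),
        Sample("db-vm-02", "cpu_pct", 1300, 40.0),
    ]
    for resource, evts in correlate(feed).items():
        print(resource, evts)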

These VMware applications afford the appropriate perspectives into cloud operations as they happen, but anyone who has been responsible for a business intelligence effort recently knows that the data is also required to make better forward-looking decisions. That's where tools like VMware vCenter CapacityIQ, which leverages historical trend data to recommend resource planning for virtualized infrastructure, come into play. There are also applications that use more predictive methods to turn operations data into near real time events, in the form of VMware vCenter Operations (which now includes CapacityIQ and Configuration Manager in the Advanced and Enterprise editions), where 'events' are correlated across all applicable cloud layers, including hardware, network and OS, measured over time, with notifications and/or actions generated as anomalies occur. Now you're slicing, dicing and provisioning capacity in the cloud with the measurements you need to manage your operations in flight as well as to plan for the future. Leveraging the abstraction of virtualization as a vehicle to the cloud at the infrastructure layer, and leveraging this homogeneous instrumentation capability, is likely much easier than trying to wrestle your existing enterprise servers' and application assets' operational data into something like a SIEM or Governance, Risk and Compliance (GRC) solution.
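
As an illustration of the kind of forward-looking analysis a capacity planning tool performs, here is a small sketch that fits a linear trend to historical usage and projects when a pool runs out of headroom. This is a stand-in under my own assumptions, not CapacityIQ's actual method.

# Illustrative sketch only: forecasting when a resource pool runs out of
# headroom by fitting a simple linear trend to historical usage.
def days_until_exhausted(daily_usage_gb, capacity_gb):
    """Fit a least-squares line to daily usage and project the day on which
    usage crosses total capacity. Returns None if usage is flat or falling."""
    n = len(daily_usage_gb)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage_gb) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage_gb)) / denom
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    days_to_full = (capacity_gb - intercept) / slope
    return max(0, round(days_to_full - (n - 1)))

if __name__ == "__main__":
    history = [410, 418, 431, 440, 455, 463, 479]   # hypothetical GB used per day
    print(days_until_exhausted(history, capacity_gb=600), "days of headroom left")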

Gaining the proper perspective to harness the data coming from your virtualized infrastructure and supporting applications is not unlike capturing data from any group of sensors in the tangible world. Harnessing that data to measure key performance indicators over time is the way to assert control, even over something as remote and, in certain aspects, intangible as cloud IT operations. It is the line of sight into, and transformation of, the data captured from all layers of the cloud operation, IT and business, that will enable your cloud to become a strategic, agile extension of what your business aspires to accomplish and allow the CIO to participate fully in delivering new offerings as strategic differentiators in the marketplace. It will be important not only to give the proper perspective to all stakeholders but also to treat the cloud as a portfolio of assets that accepts input from those stakeholders in terms they speak, effectively letting the business drive the evolution of the cloud configuration.

Looking at CIO priorities, it's clear they want the agility of cloud computing to make IT a driver of business strategy; however, most remain wary of how much control they may be required to give up as they move to cloud computing. In most cases these two goals are in tension, yet the move to virtualization and the cloud brings an opportunity to automate not only for agility's sake but also to capture the operational data needed for all types of control to be established. Ultimately this means avoiding reinventing a very complex wheel over and over. To revisit my Telco analogy, consider a complex QoS-managed circuit, tailored to an individual customer profile, that is difficult to support and even more difficult to price effectively for profit and loss. As that architecture evolves, productizing added capabilities on the fly, e.g. more Internet throughput, more HD channels, more bundled long distance minutes or calling features at a competitive, market-driven price point, becomes inherent to the culture. Having business intelligence for your virtualization infrastructure will elevate IT directly into the strategic line of conversation as an asset to the business instead of a cost center or liability, while delivering the means to control the move to your cloud.

Thursday, February 17, 2011

Trusted Cloud

Introduction
As an executive you're familiar with the value propositions for the agility and economics that cloud computing ostensibly provides. While appealing, these advantages have a significant barrier to their realization that can be summed up in a single word: Trust. There are many concepts used to deliver Trust in the enterprise environment today. Since the decision to use a cloud for the delivery of IT services is best made by starting from the knowledge and experience gained in previous work, this paper will illuminate methods and technologies that are mainstream in the Enterprise today and show how they can be leveraged to reach the maturity level necessary for cloud readiness.

Key Components
While the Trust concept itself is somewhat subjective, we will attempt to address how technology patterns can be combined to achieve what is often the most challenging effort to undertake: a finite definition of what Trust means to all stakeholders involved. This is critical because it must be agreed upon in delivering a trusted solution so that service levels and risk can be well understood and monitored for compliance. To begin with, there are physical levels of trust that are well defined and understood, for instance, moving enterprise applications for the Federal government to FISMA-compliant data centers. This, coupled with the deployment of secure enterprise networks, assures that the data center provides the means necessary to run these applications in an outsourced fashion. Another key component of providing this type of service is the Identity and Access Management (IAM) solution that assures appropriate access to these systems occurs in a consistent fashion. Like many other applications, these IAM technologies are offered, via Service Oriented Architecture (SOA), 'as a Service', e.g. the 'aaS' you often see when referring to various Cloud architectures. Perhaps the most critical component available and in place in many enterprises today is Virtualization. The advantages of 'virtualizing' hardware infrastructure are not new, but the capabilities for doing so on the x86 architecture have made great strides, providing a hypervisor that imposes little to no overhead compared with running operating systems and applications on the 'bare metal' itself.

Taking Key Components to the Cloud
The key components previously discussed have reached a certain maturity level in most enterprises; however, even when coupled with newer technologies like a Security Information and Event Management (SIEM) system, they lack the level of control necessary to 'templatize' these seemingly disparate technology patterns into a coherent whole that can be outsourced to a cloud service provider. In this section we will look at an approach to tie these key components together in such a way as to fashion them into a holistic 'Trusted' entity that can be repeated and measured.

The overarching continuum that will provide this level of Trust within Cloud architectures lies in Service Oriented Architecture and a concept we'll call 'Cloud Orchestration'. This concept, which runs virtualization on top of servers enabled with Intel Trusted Execution Technology (TXT), extends the compliant physical layer of trust into the automated provisioning of 'Virtual Applications', collections of virtual machines initialized to bring about a certain business function, e.g. a Business Process Management System (BPMS), an object-relational cache or a Portal/Web 2.0 presentation layer. Because the physical boundaries of the data center are mapped to a physical set of servers hosting what is now a 'Trusted' hypervisor by way of Intel TXT, you can provision what are essentially 'Secured Virtual Enclaves' of these Virtual Applications. These Virtual Applications leverage the clustering and load balancing mechanisms inherent to the applications for availability while also creating a truly 'on demand' elasticity capability. This also allows instances of the Virtual Applications to exist in an unmanaged or 'zero touch' state, eliminating needs such as physical access and change control governance.
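
To illustrate the orchestration idea, here is a minimal sketch of placement logic that will only land a Virtual Application on hosts whose hypervisor has passed a TXT-style attestation check. The host records and the attested flag are hypothetical stand-ins for what a real attestation service would report.

# Illustrative sketch only: gating placement of a "virtual application" onto
# hosts that have passed a TXT-style attestation check.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    txt_attested: bool      # measured launch verified (assumed attribute)
    free_vm_slots: int

def place_virtual_app(hosts, vm_count):
    """Return a host-to-VM-count plan using only attested hosts, or raise if
    the trusted pool cannot hold the whole enclave."""
    plan = {}
    remaining = vm_count
    for h in sorted(hosts, key=lambda h: -h.free_vm_slots):
        if not h.txt_attested or remaining == 0:
            continue
        take = min(h.free_vm_slots, remaining)
        if take:
            plan[h.name] = take
            remaining -= take
    if remaining:
        raise RuntimeError("trusted capacity exhausted; refusing untrusted placement")
    return plan

if __name__ == "__main__":
    pool = [Host("esx-01", True, 4), Host("esx-02", False, 8), Host("esx-03", True, 2)]
    print(place_virtual_app(pool, vm_count=5))   # uses only esx-01 and esx-03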

We've now mentioned SOA in several facets of this architecture, but let's take it a step further to try to crystallize a couple of key points. So far we've asserted that you can take a reference architecture stack like Cisco UCS/Nexus and deploy it with a trusted virtualization layer using a virtualization technology stack like VMware's vCloud Director and its inherent service-oriented capabilities, complete with virtual TCP/IP addressing. Because all of these functions are exposed via XML, it is now possible to leverage this virtual 'container' in ways that blend what was historically considered a 'management band' activity with the business policy that drives these operations in a trusted fashion. A perfect example of a use case that requires this type of solution is the requirement to provide true multi-tenancy in a cloud environment where Top Secret, Secret and other protection levels must be supported for a combination of application stakeholders from government and industry, forming a scenario known as a 'Community Cloud'. The usage model for these combined technologies also eliminates the need for 'self service' provisioning of new virtual compute capabilities, since a portal/business process flow for 'Add New Project' would possess inherent policy-based provisioning.
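
As a sketch of what the policy behind an 'Add New Project' flow might look like in a Community Cloud, the following toy example matches a project's protection level to an enclave accredited to host it. The levels and enclave names are hypothetical.

# Illustrative sketch only: policy-driven (not self-service) placement of a
# new project into an enclave cleared for its protection level.
PROTECTION_ORDER = ["UNCLASSIFIED", "SECRET", "TOP SECRET"]

ENCLAVES = {
    "enclave-a": "TOP SECRET",    # highest level each enclave is accredited for
    "enclave-b": "SECRET",
    "enclave-c": "UNCLASSIFIED",
}

def select_enclave(project_level):
    """Pick the least-privileged enclave accredited at or above the project's
    protection level."""
    rank = PROTECTION_ORDER.index(project_level)
    candidates = [
        (PROTECTION_ORDER.index(level), name)
        for name, level in ENCLAVES.items()
        if PROTECTION_ORDER.index(level) >= rank
    ]
    if not candidates:
        raise ValueError(f"no enclave accredited for {project_level}")
    return min(candidates)[1]

if __name__ == "__main__":
    print(select_enclave("SECRET"))        # enclave-b
    print(select_enclave("TOP SECRET"))    # enclave-a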

Leveraging Security and Policy for Control
While this combination purports to solve the 'inner sanctum' challenges of the more complex cloud use cases, what will be used to orchestrate the virtualization, provide secure access to virtualized applications and produce the required 'Audit Band' so you can operate with the control necessary to Trust your Cloud? The linchpin technology for this overall solution is a service gateway, which can run in a tamper-proof hardware form factor or as a virtualized software application. This enables positioning the service gateway at multiple vantage points for policy-based control of how management, application and audit services are offered. It does this by combining a number of technology standards, TLS, X.509, WS-Security, WS-Policy, WS-Trust, SAML, LDAP, XACML, etc., with policy to generate artifacts, essentially chains of trust, for the Audit Band.
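
Here is a minimal sketch of the audit-artifact idea: each record the gateway emits is chained to the previous one and signed, so the Audit Band can verify the sequence. HMAC over JSON stands in for the X.509/WS-Security machinery a real service gateway would use, and the key handling is deliberately simplified.

# Illustrative sketch only: tamper-evident audit records, hash-chained and
# keyed, as a stand-in for a gateway's signed audit artifacts.
import hashlib, hmac, json, time

AUDIT_KEY = b"demo-audit-key"     # hypothetical; a real gateway would use a protected key

def append_audit_record(chain, subject, operation, decision):
    """Append a record whose signature covers the previous record's digest,
    forming a verifiable chain of trust for the audit band."""
    prev_digest = chain[-1]["digest"] if chain else ""
    record = {
        "at": int(time.time()),
        "subject": subject,
        "operation": operation,
        "decision": decision,
        "prev": prev_digest,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["digest"] = hmac.new(AUDIT_KEY, payload, hashlib.sha256).hexdigest()
    chain.append(record)
    return record

if __name__ == "__main__":
    chain = []
    append_audit_record(chain, "alice", "provision-vapp", "permit")
    append_audit_record(chain, "bob", "read-payload", "deny")
    print(json.dumps(chain, indent=2))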

This alphabet soup of standards offers a set of meaningful usage patterns that work in concert to accomplish the shared goals of security, privacy and trust. The XACML policy elements described on Wikipedia.org (Policy Administration Point, Policy Decision Point, Policy Enforcement Point and Policy Information Point) are a good analogy for how all of these items provide this level of trust, enforced and orchestrated by the service gateway. Applications such as an LDAP data store or an XACML administration solution allow you to express who will have access to what and in what fashion, but it is the collection of these (and others), applied in the correct combination at each route the data travels, that extends the irrefutable chain of trust from the aforementioned compliant data center and physical computing assets, through the hypervisor and into the application layer. Policy administration solutions will answer who is allowed to do what, complete with point-in-time states, while the service gateway will produce searchable audit artifacts from these operations to enable near real time visibility into who did what and when. Because all communication between logical application tiers occurs over XML via services, the application data payload itself becomes subject to overarching 'Policy', which can redact for de-classification or re-route based on content in order to provide more human-centric dissemination of information.
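
To show what content-aware policy over an XML payload could look like, here is an illustrative sketch that drops elements marked above a recipient's clearance before dissemination. The element and attribute names are hypothetical, not any particular schema.

# Illustrative sketch only: redaction of an XML payload in transit based on a
# classification attribute and the recipient's clearance.
import xml.etree.ElementTree as ET

LEVELS = {"U": 0, "S": 1, "TS": 2}

def redact(xml_text, recipient_clearance):
    """Return the payload with any element marked above the recipient's
    clearance removed before dissemination."""
    root = ET.fromstring(xml_text)
    limit = LEVELS[recipient_clearance]
    to_remove = []
    for parent in root.iter():
        for child in parent:
            marking = child.get("classification")
            if marking is not None and LEVELS.get(marking, 0) > limit:
                to_remove.append((parent, child))
    for parent, child in to_remove:
        parent.remove(child)
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    doc = """<report>
      <summary classification="U">Quarterly status</summary>
      <detail classification="TS">Sensitive finding</detail>
    </report>"""
    print(redact(doc, "S"))   # the TS detail element is dropped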

Conclusion
Establishing the necessary level of control for Trust will be the barrier to moving applications to a cloud environment. Leveraging a service gateway to orchestrate your cloud delivers a number of disruptive benefits:

1. The security to run applications anywhere in the compliant cloud infrastructure in a multi-tenant fashion while maintaining policy enforcement will be the key to realizing the power usage efficiency promised by the cloud

2. Continuity of Operations, Disaster Recovery and Failover also become intrinsic to the solution

3. Due to the repeatable architectural concepts described herein, cloud provider hosting becomes a more commoditized procurement process based on well understood physical access controls

4. Configuration management of cloud applications becomes a process of delivering signed, trusted iterations of virtual machines to perform within virtual applications

5. Leveraging existing SOAs such as Identity and Access Management, other APIs from packaged enterprise application suites or custom-built business logic preserves your existing investments while also offering those services in the cloud

6. Open Source software applications, once considered a security risk, are now viable solutions by leveraging the highly available, self-healing, unmanaged, ‘zero touch’ nature of the virtualized ‘middle tier’ used to provide cloud services

7. A productive application stack for modernizing all legacy investments, including SOA middleware components that can remain as enterprise-located assets which, over time, will require diminishing levels of costly, proprietary enhancements

8. Assuring the information lifecycle for protection of sensitive data where it matters allows for more freedom in consuming public internet data in the presentation tier, which will be demanded for rich internet applications

9. Designing transparency into the architecture allows for well understood lines of sight along the axes of Trust relative to parties involved to achieve desired compliance visibility while simplifying the effort needed to produce attestation

Beyond these benefits, cloud orchestration can provide ‘Trust as a Service’ to stakeholders and enable the promised agility of the cloud to improve service levels where complex security and audit capabilities are required. All of this while bringing capital and operational expenditures to a predictable, achievable price point allowing you to focus on new ways to deliver value.

Tuesday, May 19, 2009

Approaching a Cloud Computing Model with SOA and Virtualization

There is a lot of press given these days to 'cloud computing' that is attractive to many in industry, especially the IT components of those industries. Some of the obvious values that cloud computing purportedly provides are not readily accessible to customers with more stringent security and privacy concerns, such as those required within the Federal Government, due to the globally virtual nature of the cloud. This white paper will address the impact of the cloud computing concept on the portions of the Federal Government that have already engaged in providing shared infrastructure services such as those typically planned and provisioned for an enterprise SOA.

Cloud computing isn't an entirely new concept; in fact it could be considered an amalgamation of several computing patterns that have come into vogue and matured to become prevalent in the IT mainstream. Some of these concepts are grid, virtualization, clustering, and on-demand computing. Unless you've been under a rock over the last decade you've undoubtedly been inundated with vendor speak on these items, and you have likely taken a stab at leveraging them for the value they ostensibly provide. In those cases the likely cost benefit fell into two main categories: one, the physical footprint required to provide the requisite compute power (size and/or cost), and two, being able to manage many resources as one.

When looking at the recent adoption of these provisioning patterns it is also important to understand the larger scope of what has been successful in the Federal government to date. There have been shared backbones for supercomputing applications at NASA, the Dept of Energy and DoD/DARPA for many years, as well as recently established grids such as the HHS/NCI National Cancer Grid, which has outreach to research institutions outside the Federal Government. There has been an uptick in shared-service Centers of Excellence for booking travel (GovTrip, Defense Travel Service and FedTraveler) and Human Resources/Payroll (Dept of Interior National Business Center, USDA National Finance Center). Enterprise security (Dept of Defense Net-Centric Enterprise Services/NCES), which provides a common access method to facilities and systems in the form of a CAC (Common Access Card), has also begun to take hold.

As was mentioned previously, the technology patterns that constitute a 'cloud' are mature and in use in many places today. Clustering has been around on midrange servers (Unix, Linux, Windows) for a while and has even become part of the base operating system, although third parties like Veritas still exist with some compelling value-adds. Generally, a cluster is a communicating group of computers that offers load balancing of processes as well as availability in the case of failure, with the group appearing as a single entity to the outside world. Another mainstream pattern for distributing compute power across a set of associated nodes is the Grid. Grids take clustering a step further: they have a way to digest a workload and decompose it to run in parallel across grouped resources. An example of mainstream Grid processing is Oracle's 11g database, where the 'g' stands for Grid.
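
The cluster-versus-grid distinction can be sketched in a few lines: a grid-style scheduler decomposes one workload into chunks, runs them in parallel across a pool of workers and reassembles a single result. In this toy example, threads on one machine stand in for grid nodes.

# Illustrative sketch only: decomposing one workload across a pool of
# 'nodes' and reassembling a single result, grid-style.
from concurrent.futures import ThreadPoolExecutor

def chunk(data, n_nodes):
    """Decompose the workload into one slice per 'node'."""
    size = max(1, len(data) // n_nodes)
    return [data[i:i + size] for i in range(0, len(data), size)]

def node_work(slice_):
    # Each node computes a partial aggregate over its slice.
    return sum(x * x for x in slice_)

if __name__ == "__main__":
    workload = list(range(1_000))
    with ThreadPoolExecutor(max_workers=4) as grid:
        partials = grid.map(node_work, chunk(workload, n_nodes=4))
    print("sum of squares:", sum(partials))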

The other piece of the cloud puzzle that has seen significant uptake is virtualization. Virtualization is the act of hosting many servers on a single piece of infrastructure; those servers may or may not be members of the same cluster or grid. Oracle has begun to offer a Xen-based hypervisor for virtualizing Linux on Intel/AMD-based servers (Oracle VM), and IBM offers z/VM for virtualizing Linux on System z mainframes. This, coupled with Oracle's recent acquisition of Sun, means that virtualizing Linux is likely to receive the same R&D attention that LPARs (IBM), vPars (HP) and Containers/Zones (Sun) have long received on the UNIX side, where they provided excellent manageability of large SMP servers.

Springing from the collection of these concepts are offerings like On Demand computing and Software, Platform or Infrastructure as a Service, which have recently come of age. Given the distributed nature of the resources employed to provide such services, even behind data center firewalls, SOA has been a large part of realizing any successful foray into these offerings. What has been challenging about offering these kinds of services are the very items that cloud computing seeks to ameliorate. Examples of the challenges to date have been provisioning compute resources 'just in time', being able to scale when needed under an agreed fee schedule, and defining the support model when platform, infrastructure or software is offered as a service. Some have been able to close these gaps by offering software developer tools at an appropriate abstraction layer in order to maintain control over the other layers in the infrastructure.

Perhaps because cloud offerings are still somewhat in their infancy, entire clouds have been unavailable for hours at a time, which would violate most SLAs for Federal Government systems. Beyond this difficult fact, the privacy and security required for Federal Government systems are really at the root of the challenge of offering a cloud computing model for these systems. The ability to manage the total infrastructure, or fabric, that runs all of the components necessary to effectively provide a cloud infrastructure is the next step in realizing cloud computing capabilities. The fabric could include a SAN and its controllers, 10Gb Ethernet or InfiniBand, MPLS with QoS and latency requirements, VPNs, routers and firewalls, blades and their operating systems, as well as the application software deployed on them. To effectively manage resources in a cloud you must have all of these items defined to a level where they can be specified and provisioned at a moment's notice, perhaps from a trigger in an actively managed infrastructure such as a CPU threshold being met.
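
As a sketch of that trigger-driven provisioning, the following toy loop watches CPU readings and 'provisions' another node once a threshold has been breached for several consecutive samples. The provision_node call is a hypothetical placeholder, not any fabric's real API.

# Illustrative sketch only: a threshold trigger that fires provisioning after
# sustained CPU pressure. provision_node is a placeholder action.
CPU_THRESHOLD = 80.0      # percent
SUSTAINED_SAMPLES = 3     # require N consecutive breaches before acting

def provision_node(cluster):
    print(f"provisioning additional node into {cluster}")   # placeholder action

def watch(cluster, cpu_samples):
    """Scan a stream of CPU readings and fire provisioning once the threshold
    has been breached for SUSTAINED_SAMPLES consecutive readings."""
    streak = 0
    for reading in cpu_samples:
        streak = streak + 1 if reading >= CPU_THRESHOLD else 0
        if streak == SUSTAINED_SAMPLES:
            provision_node(cluster)
            streak = 0   # reset after acting

if __name__ == "__main__":
    watch("soa-cluster-1", [72.0, 81.5, 86.0, 90.2, 75.0, 88.0])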

SOA applications have been reduced to a footprint that is easily configurable at deploy time through provisioning capabilities like the Open Virtualization Format and the WebLogic Scripting Tool, while also being able to subsist on just a few form factors of blade servers, making it possible to create a virtual 'appliance' from a group of virtual machines. Once these variables have been identified and their values managed in lists, such as the IP addresses and TCP ports of cluster controllers, SOA process definitions and connections, etc., it becomes a somewhat trivial task to introduce compute power to the cluster. However, identifying the matrix of dependencies in aggregate is a non-trivial task given that grids may contain clusters, clouds may sit on top of grids, and grids may sit on top of clouds. In the end, this distillation of the interface data is what can allow for a telecommunications-like provisioning system for compute resources. Coupled with an effective blanket purchasing model that allows you to accurately forecast and stock these types of physical resources, or provision them as services via partners, this puts you in a position to actively manage your private infrastructure the same way Amazon or EMC does.
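
Here is an illustrative sketch of deploy-time configuration from managed parameter lists, in the spirit of OVF properties or a WLST script: substitute the values an inventory system maintains into a template so one more VM can join the cluster. The template keys and values are hypothetical.

# Illustrative sketch only: rendering deploy-time configuration for a new
# cluster member from centrally managed parameter lists.
from string import Template

VM_TEMPLATE = Template(
    "hostname=$hostname\n"
    "cluster_controller=$controller_ip:$controller_port\n"
    "service_endpoint=$endpoint\n"
)

MANAGED_VALUES = {            # values an inventory system would maintain
    "controller_ip": "10.1.2.10",
    "controller_port": "7001",
    "endpoint": "http://esb.internal/orders",
}

def render_new_member(hostname):
    """Produce the deploy-time config needed for one more VM to join the cluster."""
    return VM_TEMPLATE.substitute(MANAGED_VALUES, hostname=hostname)

if __name__ == "__main__":
    print(render_new_member("soa-node-07"))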

Many Federal Government agencies have invested in multiple data center locations for disaster recovery and have pursued ambitious Enterprise-level SOA projects. The coupling of the cloud paradigms covered in this document with the understanding that even cloud computing and its provisioning can be offered through an SOA interface is enticing. As the annual budget outlays for multiple data centers continue to grow, more pressure is exerted by OMB and agency executives to leverage them for more than simply an equally performing warm failover, moving instead to an active-active role that leverages the total investment. Cloud computing not only offers the ability to actively manage the entire infrastructure with methods that help achieve this goal but also the capability to re-purpose compute power as needed. The possibilities here include performing data warehouse aggregations in the evening hours or perhaps even offering compute power as a dynamically shared service in a multi-tenant model to other agencies, whether they are looking for SOA or for more specialized grid computing applications.

Service Oriented Authorization: Information Assurance for SOA

In an enterprise SOA, resources become more centralized so that they can be shared, and that causes growing pains as the need for governance becomes real. One of the first concerns from management and business owners is whether what they are offering within the scope of the SOA can be controlled with policies they establish and audited for adherence to those policies. Historically, when application silos were built, tight control was available because most authorization for performing tasks within the application was a function of either the application container or the application logic itself. While network security, encryption, digital signatures, and strong authentication can be enabled consistently for SOA assets, what is challenging many SOA implementations today is how to adequately enable application authorization in an SOA context with coherent management capabilities. This white paper will make the case for treating the move to an enterprise SOA as the perfect opportunity to enhance information assurance for the enterprise, rather than allowing it to manifest as actual or even perceived gaps in controlling access to both SOA and legacy applications.

As the IT infrastructure has evolved over the last decades, many security, authentication, and authorization models have been introduced, from ACF2, Top Secret and RACF on the mainframe to LDAP and Active Directory for UNIX and Windows. Given what these services provide to the overall effectiveness and risk management of applications, they are certainly an intrinsic part of the fabric that makes applications available to diverse user populations. Most of these offerings sprang from vendors and generally support an open standard (LDAP) for storing user accounts and passwords in a variety of contexts within an LDAP 'tree' that is queried at logon to yield what's known as an ACL (Access Control List). An ACL is essentially a 'key' in the form of the total set of resources the user will have access to, resources that are also protected within the LDAP directory.
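
The ACL-at-logon pattern can be sketched with an in-memory stand-in for the directory: resolve a user's group memberships to the union of resources those groups grant. This is illustrative only, not an LDAP client, and the users, groups and resources are hypothetical.

# Illustrative sketch only: resolving a user's group memberships to the set
# of resources the user may touch, i.e. the 'key' handed back at logon.
DIRECTORY = {
    "users": {"asmith": {"groups": ["finance", "staff"]}},
    "groups": {
        "finance": {"resources": ["gl-app", "payroll-reports"]},
        "staff":   {"resources": ["intranet"]},
    },
}

def acl_for(user_id):
    """Return the union of resources granted through the user's groups."""
    user = DIRECTORY["users"].get(user_id)
    if user is None:
        return set()
    acl = set()
    for group in user["groups"]:
        acl.update(DIRECTORY["groups"].get(group, {}).get("resources", []))
    return acl

if __name__ == "__main__":
    print(sorted(acl_for("asmith")))   # ['gl-app', 'intranet', 'payroll-reports']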

The challenge with this path of evolution is that, while given a standard repository in which to store information, many vendors have chosen either to extend the repository to suit their own needs or to remain with a proprietary security mechanism that likely originated before the advent of LDAP. While it would be ideal to believe that all applications conceived after LDAP actually leverage it to store security information, this is not the reality. Many applications are able to leverage LDAP for user authentication, so you do not need to manage a separate user database for the application's authentication; however, the applications that handle authorization of users in this same fashion are an extreme minority. Microsoft Windows leverages Active Directory as a method of Single Sign-On to enterprise resources, but even it does not provide automated access to much beyond shared network file systems and printers. In fact, one of the only applications entirely integrated with Active Directory is Microsoft's flagship enterprise email server, Exchange.

While IT shops began rolling their own authorization and, in many cases, authentication mechanisms, the software vendors were busy on an acquisition spree around an application suite pattern known as 'Identity Management'. The vendors that have an LDAP offering of their own provided customizations to LDAP in order to support single sign-on within their own group of applications. They soon discovered, as they began to acquire more vendors and their applications, that an external, homogenized mechanism for providing 'security as a service' within their own suites was becoming of paramount importance. These Identity Management suites consist mainly of identity provisioning, the act of creating user accounts on systems, and access control through groups or roles, protecting access from a single authority to resources such as URLs and APIs or lower-level assets like SQL databases and message queues.

The industry eventually came up with a more robust XML-based standard that could be embedded in SOA requests called SAML (Security Assertion Markup Language), which facilitates the trust relationship between these security directories. This standard allows you to federate identity, authenticate, and authorize across applications that trust the provider of SAML tokens. The authorization mechanism of SAML, however, is limited: it answers whether or not someone is associated with a particular thing and provides information on whether someone has access to a particular resource after authentication. These items are of great use in constructing an audit trail of who did what and when, but in the end they fall short of enterprise policy compliance audit needs.
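
To make the scope and the limits of SAML concrete, here is a toy sketch: a consumer accepts an assertion only from a trusted issuer and can read the roles it carries, but nothing here evaluates a fine-grained policy about what a role may do to a specific object. The issuer and attribute names are hypothetical, and the in-memory assertion stands in for signed XML from an identity provider.

# Illustrative sketch only: what a consumer can and cannot learn from a
# SAML-style assertion.
from dataclasses import dataclass, field

@dataclass
class SamlAssertion:
    issuer: str
    subject: str
    attributes: dict = field(default_factory=dict)   # e.g. roles, groups

TRUSTED_ISSUERS = {"https://idp.agency.example"}      # hypothetical IdP

def accept_assertion(assertion):
    """Federation in miniature: trust the attributes only if the issuer is trusted."""
    return assertion.issuer in TRUSTED_ISSUERS

def has_role(assertion, role):
    """The assertion says the subject carries a role; it does not decide what
    that role may do to a specific object."""
    return role in assertion.attributes.get("roles", [])

if __name__ == "__main__":
    a = SamlAssertion("https://idp.agency.example", "asmith", {"roles": ["analyst"]})
    print(accept_assertion(a), has_role(a, "analyst"))   # True True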

The next step in creating standards to make service-oriented authorization a reality for SOA (Service Oriented Architecture) was the formation of XACML (eXtensible Access Control Markup Language), which provides a way to hold rules for making authorization decisions. These authorization decisions, essentially a yes or no answer based on who is asking to perform what action on which object, are invoked by a PEP (Policy Enforcement Point, something that protects a resource) when a request is made of that resource. The rules for evaluating the requested access are executed by a PDP (Policy Decision Point) that has access to the rules maintained by a PAP (Policy Administration Point). This mechanism provides much the same answer that an Active Directory request would in the form of an ACL, but it externalizes the decision itself by analyzing what may be a resolution of intersecting hierarchies or authorities. This is critical for centralizing this behavior and getting it out of the hands of a diverse developer population of SOA services.
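
Here is the PEP/PDP/PAP interaction reduced to a few Python functions, purely as an illustration; a real deployment would express the rules in XACML and evaluate them in a policy engine. The roles, actions and resources are hypothetical.

# Illustrative sketch only: a toy PEP asking a toy PDP for a decision over
# rules a PAP would administer. Default-deny when nothing matches.
POLICY_STORE = [   # what the PAP administers and the PDP reads
    {"role": "analyst", "action": "read",  "resource": "case-file", "effect": "Permit"},
    {"role": "analyst", "action": "write", "resource": "case-file", "effect": "Deny"},
]

def pdp_decide(subject_roles, action, resource):
    """Policy Decision Point: evaluate the request against stored rules."""
    for rule in POLICY_STORE:
        if rule["role"] in subject_roles and rule["action"] == action \
                and rule["resource"] == resource:
            return rule["effect"]
    return "Deny"

def pep_guard(subject_roles, action, resource):
    """Policy Enforcement Point: the service-side gate that asks the PDP and
    enforces its answer before touching the resource."""
    decision = pdp_decide(subject_roles, action, resource)
    if decision != "Permit":
        raise PermissionError(f"{action} on {resource} denied")
    return f"{action} on {resource} permitted"

if __name__ == "__main__":
    print(pep_guard(["analyst"], "read", "case-file"))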

As was mentioned earlier, one of the barriers to sharing in SOA has been the ability to effectively define, govern, and audit these types of policy. This holds even truer when we are dealing with Federal Agencies tasked with providing national security. You need look no further than a recent attempt to share and federate information across these agencies, called Railhead, to find the challenges of dealing with sensitive data and chain-of-authority operations on shared services, even ones as simple as search applications. When XACML is the default method for securing your SOA as well as your Portal and Web 2.0 applications, the benefits of a single central manager for policy overcome many of these challenges. In fact, by coupling this layer with secure database mechanisms such as Oracle's Database Vault, Virtual Private Database, and Label Security, you can increase DCID Protection Level compliance and defend against becoming a privacy concern for the public, even where sensitive documents subject to redaction are the SOA payload.

Enterprise logging mechanisms such as ArcSight provide collection and reporting across a wide range of enterprise application logs. Logging with this homogenized approach facilitates the things compliance auditing is most concerned with, such as easier preparation for insight into who knew or did what, when and how, which are key tenets of information assurance built on top of cryptographic non-repudiation. There are many active risk management solutions coming from industry that essentially put network intrusion detection into the application layer by leveraging this type of usage, policy, and log data. Another beneficial application of the data coming from this type of solution is the ability to actively mine and manage roles, discovering which tasks may be better suited to different portal regions or which human resources may be augmented by offloading or automating certain tasks.

At the end of the day, for SOA to be a success in the federal government, the demand placed on exposing and sharing such sensitive data under policies that may change over time requires a homogeneous way to define and audit those policies as well as a consistent model for consumers. SAML and XACML provide those in a vendor-neutral fashion and inherently satisfy other requirements, such as threat analysis and role management, that have otherwise seemed unapproachable. If SOA is the façade for your legacy applications, then creating a service-oriented authorization layer ensures access to these systems is performed in a secure, manageable and, perhaps most importantly, readily auditable fashion.

Wednesday, February 4, 2009

Why Technology is Integral to Legislation

With the current economic indicators and overall malaise we've found ourselves in, I thought I would use the opportunity to throw out a novel idea. That idea is centered on the need to understand what our government is capable of before we get around to spending hundreds of billions of dollars to fix a problem. Now, this is not a political blog, as I think there are enough of those that seek to place blame, and there is plenty to go around. What I'm talking about is the rest of our government that has to implement these grand ideas and somehow try to show the results. My aim is to keep an open mind, avoid groupthink and look at solutions not only to this problem, so that we do not let another 'bubble' catch us by surprise, but also to how things should be done from the top down in regulating our financial markets.



The title of this blog entry has to do with technology and legislation, or policy, but before we get too deep into that I will set the stage by discussing the state of technology as used in the Federal Government today. I've blogged before about how you can use frameworks for an effective BPM-based SOA solution around Governance, Risk and Compliance, and I believe that applies to this issue. The Federal government has done a good job of providing a defining schema (an XML-based data model) for its budgeting process, and it works quite well (I have programmed Federal budgeting systems with it, so I can attest), but rarely is this schema used other than on a yearly basis to make programs and projects appear most valuable per the metrics supported in the system. This becomes mainly a black art of spreadsheet 'magic' to position the way spending will benefit the citizen, war fighter or whatever missions have the most visibility and therefore the higher spending. That is a framework for how the finances of the government are managed as a portfolio. What we are attempting to address here is the financial and operational data used to regulate our nation's markets by agencies like the Federal Housing Administration, Federal Reserve, Treasury, FDIC, SEC/CFTC, etc.


We'll begin with some discussion of how these and other parts of the government interact to provide oversight of the activities within the private business community that affect our economy. While these interests do have some combined oversight, and in one case, OFHEO (now the FHFA), even included Fannie Mae and Freddie Mac, it's obvious that the ties that bind them have been woefully inadequate to predict the overall effect of the mortgage industry on the health of Wall Street, the banks and therefore the overall economy. There are programs intending to tie them together, such as the FFIEC and the Shared National Credit program. I believe the SNC had the best of intentions, as outlined in a 2007 report from the OCC covering some of the financial issues facing the banking system and the economy as a whole. During the last couple of years there was a resurrection of the modernization of the Shared National Credit program, followed by a proposal from Secretary Paulson for a complete restructuring of many of the players involved.


These items are all positive, even if disruptive, but we are up against complexities in this crisis that our government just isn't designed to handle. This blog entry isn't about policy or even placing blame but rather about what I've seen that works and what we should be looking at to institute the best mechanisms going forward, so that the government is able to handle these complexities, seen or unseen, in the future. After looking longer at policies and proposals I'm more prone to believe suggestions such as this and this. As you look at the previous links to the LA Times article and the white paper, one theme is clear: new institutions are needed not just for oversight and enforcement but potentially for actually operating some of the core functions, just as Ginnie Mae has been forced into doing in light of the Fannie Mae and Freddie Mac implosion.


If you look at the documents I referenced earlier from the OCC and the Risk Management Association, one of the themes that runs through them is the incorporation of Basel II or a similar framework to measure probability of default, exposure at default, etc. as a consistent baseline for understanding the way each institution would handle those calculations. FDIC was averse to Basel II for a while due to the effect of capital requirements that would be brought to bear on lending institutions, which it saw as unnecessarily burdensome (shown here on slide 37). As one who has an innate affection for frameworks, I will present one here as pretext to the larger argument I'm trying to make, and that is XBRL (which has some additional explanation here).


FDIC has not only since come around to Basel II but has gone to some lengths to look at XBRL as a solution for standardizing the way financial data is transmitted. The SEC has done some things with XBRL in regard to EDGAR, and you can see here that this is starting to be enriched as it pertains to more diverse banking paradigms, in the case of the mutual fund taxonomy for example. I've done work with the SEC around options using a framework called FIXML, which serves its purpose well. This suggests that a single framework isn't necessarily the answer, just as Basel II isn't necessarily the silver bullet either. Take a look at these two postings from The Institutional Risk Analyst in 2006 to see XBRL as it pertains to Basel II within the Federal Government:


Here's an excerpt from the first:


IRA’s core mission is to develop and field benchmarking analytics. As a developer of computer enabled data mining tools, we strongly support the advent of publicly available, well-structured or “interactive” data. In the past we have lauded the FDIC’s modernization effort, which now has all FDIC-insured depository institutions submitting quarterly financial reports using eXtensible Business Reporting Language or XBRL. The transparency, completeness, consistency and quality of the FDIC’s bank information pipeline, which is used in our analysis engines to produce uniform benchmarks for Basel II, enables IRA’s “Basel II by the Numbers” report series to serve as a canvas upon which to demonstrate the power of “distilling” structured data.


And one from the second:


Fact is, a growing number of senior people in government are pondering the use of XML-based technology solutions to address the issues like those raised by the Corrigan Group, in particular the issue of gathering sufficient financial statement data about hedge funds and other lightly regulated entities to understand counterparty risk. And the FDIC's use of XBRL for gathering bank data is only one example.


One of the items that starts to emerge here is not only how to effectively rate complex banking institutions like hedge funds; looking back at the OCC document you also start to see concerns about how to regulate traditionally depository institutions like Bank of America when acquisitions such as Countrywide begin to conglomerate (under Horizontal Reviews of Large Banks in the OCC document). Moving into 2007 you start to see the sobering writing on the wall, as seen here, where it became more clearly understood how tied the performance of credit derivatives like credit default swaps (CDSs) and Collateralized Debt Obligations (CDOs) was to the real estate market, specifically sub-prime and speculative mortgages. If you are not up to speed on how this meltdown occurred, here is a crude animation of the 'evolution' of this problem.


When you take this to the macro level, where the government should be managing Shared National Credit risk, you find a lag problem: indicators like those you see from the Bureau of Labor Statistics are simply a good indicator of what has already happened, as are the economists' data coming from places like HUD. They are not, however, a good indicator of what is to come when what is coming is unique and, as a pattern, somewhat unidentifiable. To be able to effectively spot a contagion you need the most accurate data in a format you can consistently retrieve and integrate for predictive analytics. There are great data mining operations going on in all of these institutions, and there are vendors like UBMatrix that provide tools that XBRL solutions like the FFIEC Call Report can be built on.
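
As a small illustration of why a consistent, structured format matters for this kind of analysis, the sketch below computes a uniform capital ratio across filings that share one schema. The field names and figures are hypothetical and do not come from an actual XBRL taxonomy.

# Illustrative sketch only: once every institution's filing maps to the same
# fields, cross-institution ratios and trends can be computed uniformly.
FILINGS = [   # one normalized record per institution per quarter
    {"bank": "Bank A", "quarter": "2008Q3", "tier1_capital": 42.0, "risk_weighted_assets": 610.0},
    {"bank": "Bank A", "quarter": "2008Q4", "tier1_capital": 38.5, "risk_weighted_assets": 655.0},
    {"bank": "Bank B", "quarter": "2008Q4", "tier1_capital": 21.0, "risk_weighted_assets": 240.0},
]

def capital_ratios(filings):
    """Compute a uniform tier-1 capital ratio for every filing, the kind of
    benchmark that only works when the inputs share one schema."""
    return [
        {"bank": f["bank"], "quarter": f["quarter"],
         "tier1_ratio": round(f["tier1_capital"] / f["risk_weighted_assets"], 4)}
        for f in filings
    ]

if __name__ == "__main__":
    for row in capital_ratios(FILINGS):
        print(row)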


Going back to the first posting from The Institutional Risk Analyst, I believe that the major vendors in this space, like IBM, Oracle, Microsoft, Fujitsu, etc., coupled with the advances in storage mechanisms for XML, will mean that the following concern:


We rub our worry beads pondering the anthropology of innovation, each component developed piecemeal and each maturing to serve the interactive data space. Not unexpectedly, we see evidence of classic early adoption myopia -- competing solutions ignoring each other’s value, while pushing, at times aimlessly, in the hope of owning as much of the interactive data real estate as possible. We know from experience that the “one wrench does it all” approach hurts rather than helps the adoption of interactive data as a resource to the financial community. We believe there needs to be more context as to what functional purpose a technology has to each step in the value pipeline – collection, validation, storage, distillation & dissemination – over which data travels from source to user.


can and will be somewhat ameliorated by methods for handling schema evolution, coupled with the XBRL organization maintaining the technology artifacts that represent the line of business involved.


And from the second posting from The Institutional Risk Analyst related to risk modeling:


To us, the chief obstacles preventing regulators and risk managers from understanding the nature of the next systemic tsunamis are 1) over-reliance on statistical modeling methods and 2) the use of derivatives to shift and multiply risk. Of note, continued reliance on VaR models and Monte Carlo simulations is enshrined in the Basel II proposal, the pending rule revision on CSFTs and the SNC proposal. All share an explicit and common reliance on statistical methods for estimating the probability of a default or P(D), for example. These ratings, in turn, depend heavily upon stability in the assumptions about the likely size and frequency of risk events. None of these proposed rules focus great attention or resources on assessing specific obligor behavior.


A new XBRL-based SOA underpinning this new framework adds discrete event simulation capabilities, giving us the ability to use computing models to play 'games' the way the Department of Defense does, which I've blogged about here. In addition, it gives statisticians and economists the ability to use this data in aggregate to measure true national credit and risk factors more accurately.


Another from the second posting from The Institutional Risk Analyst related to oversight of the risk calculations:


Thus the urgency in some corners of Washington regarding revisions to SNC, including a quarterly reporting schedule and enhanced disclosure of counterparty financial data. Remember that one of the goals of the SNC enhancements is to gather private obligor P(D) ratings by banks and to aggregate same to build a composite rating system for regulators to use to assess counterparty risk. That is, the creation of a privileged data rating matrix which could be used to assess the efficacy of both bank internal ratings and third party agency P(D) ratings alike. More on this and the effect of derivatives on visible bank loan default rates in a future comment.


Even though some say SOA is dead, I know the platform is very much alive with products like this and this, which I worked on while at Oracle and which are the underpinnings of Basel II solutions such as this. While Basel II isn't the silver bullet here, it is being recommended that it should stick around. Basel III won't necessarily be the answer either, but what we have is a method to surface the data artifacts of XBRL into processes (including business intelligence for items like risk calculations) that are easily mapped and understood at larger and larger scopes. That is really the beauty of these XML-based frameworks, and I've had the pleasure to implement others like AIXM, HL7 v3 and NIEM, which support native message types and processes, for example, airlines to the FAA or doctors to the FDA (and all applicable points in between). The resulting instances of these items become instantly transparent and ease the need to harmonize them for understanding in the process of oversight.


Back to the last paragraph of the second IRA posting which begins to delve into policy:


Bankers, after all, are not very good at understanding future risks, no matter how many ERM consultants they hire, default risk software implementations they direct, or meetings they attend at the Federal Reserve Bank of New York. Even making accurate observations about the present day risk events seems to be a challenge. Witness the fact that commercial bankers as a group managed to direct more than $2 out of every $3 in political contributions this year to Republican members of Congress, even as the GOP looks ready to lose control over the House and perhaps even the Senate. When Barney Frank (D-MA) is Chairman of the House Committee on Financial Services, perhaps the industry will take notice of this operational risk event and adjust accordingly.


Obviously this article is from 2006 and we've since moved back to a Democratic-controlled Congress and White House. In fact, the gentleman in charge of the Federal Reserve Bank of New York at that time is now the new Secretary of the Treasury. Tim Geithner had this to say in 2006:


"Credit derivatives have contributed to dramatic changes in the process of credit intermediation, and the benefits of these changes seem compelling. They have made possible substantial improvements in the way credit risk is managed and facilitated a broad distribution of risk outside the banking system. By spreading risk more widely, by making it easier to purchase and sell protection against credit risk and to actively trade credit risk, and by facilitating the participation of a large and very diverse pool of non-bank financial institutions in the business of credit, these changes probably improve the overall efficiency and resiliency of financial markets. With the advent of credit derivatives, concentrations of credit risk are made easier to mitigate, and diversification made easier to achieve. Credit losses, whether from specific, individual defaults or the more widespread distress that accompanies economic recessions, will be diffused more broadly across institutions with different risk appetite and tolerance, and across geographic borders. Our experience since the introduction of these new instruments—a period that includes a major asset price shock and a global recession—seems to justify the essentially positive judgment we have about the likely benefits of ongoing growth in these markets."


While trying not to place blame on the current state of legislation or the operation of government, as 'it is what it is', to put it bluntly: there is no possibility that you can prescribe legislation, hope to take its goals and objectives (measured semi-annually by OMB), turn them over to an agency or agencies whose top officials may change every 4 years, then expect their CIOs and others to let competitive bidding go to the usual suspects around the beltway while expecting different results. In fact, quite the opposite: we've compounded issues we can't fully understand because of a lack of transparency, not just in government and its oversight of industry but in the overarching process models we have for doing business (risk models, etc.) and how they are audited by the government.


At the end of the day, policy makers do things that sound appropriate, and Sarbanes-Oxley, passed to combat the abuses of Enron, WorldCom and others, is a good example. The unintended consequences, sometimes in the form of a false sense of security, are often the ones that end up biting you the worst. The problem as I see it is that the institutions involved in the current crisis deal in finance specifically, not in other lines of business that yield financial results. It's not that these companies weren't subjected to the same policies, only that valuation was at the root of this crisis. There is blame to go around here, from the housing policy that said banks should lend to the unqualified, including the minions who became real estate speculators as a second job, to the financial institutions that packaged, re-packaged and sold this debt. Since these complex financial instruments are the backbone of this contagion, it's virtually impossible to 'unwind' them at this point, and most of them are at some point tied to mortgages. Dealing with this part of the problem could allow for stabilization of the situation to a certain extent.


Looking at what's been done on housing policy thus far, I don't see anything wrong with a forced stoppage of foreclosures, although after having worked at FHA for the better part of 2008 I can tell you that hardly anyone even remembers Hope for Homeowners or its revisions for 'flexibility'. It's not to say that these things were and are without noble intentions, but if we look back in history we see that HUD has shaped homeownership policy, at times to the detriment of the very banks in trouble today, and FDIC has been in receivership of these banks as well (IndyMac comes to mind as a good example of an institution straddling that duality). If we look at the results of Hope for Homeowners, we see that while the legislation targeted 400,000 homeowners, only 25 have actually leveraged the relief offered in the legislation. Of course, one of the unintended consequences was that FHA was able to hire many employees with the $25 million provided for implementation. This is significant because HUD and its largest program, FHA, have no budget for shared IT modernization: the entire pot (~$50 million per year) goes to maintain the ~50 mainframe applications running the systems there, which take 18 months and many millions more for the simplest of changes to support new operational goals. Looking at the future and what's happening with Federal Student Aid, which like HUD doesn't even own its own data (indeed YOUR own data), and Sallie Mae, there is another wave of this economic tsunami headed our way, not to mention the additional Adjustable Rate Mortgages that are about to reset, hopefully at a reasonable enough rate to keep qualified homeowners in their homes, or with some subsidies to keep potentially unqualified ones there as well.


Given what is happening to the banking industry at large, due mostly to mortgage lending and securities derived from mortgages, it's tough to make an argument against nationalization, or against making Bank of America the real ‘Bank of America’, in lieu of continuing to feed these institutions money and turning them into 'zombies' as seen in this paper. With regulation having commoditized strictly depository banking, much as with the local incumbent telecom companies, serving up a local telephone line or a checking account isn’t viable as a growth business. It could be time to create some fresh banks, seeing as the Federal Reserve Board, Treasury and FDIC are really the mother of all banks anyway. Let the bad performers die, let the government use these funds to start a shadow banking and mortgage underwriting system, use new technology to do it right this time, and turn those entities back into commercial ones after the bad ones get a valuation and/or simply die. I find it hard to believe that anyone would care whether they banked with Wells Fargo or some government version of a depository institution, but they would certainly care if their bank was insolvent, as most of them are today, yet kept getting ongoing support when it should be allowed to fail. The other financial operations that deal in equities, insurance, risk and other financial sub-sectors would be in a position, as many like JP Morgan are now, to perform many levels of financial services, including acquisition of insolvent depository institutions like Washington Mutual.


When you really look at this problem you start to understand that people, and the companies they run, when left to their own devices will end up in conflicts of interest without consistent, thorough and timely oversight. Who ‘polices the police’, as they say? Additional oversight from our government agencies, their respective Offices of Inspector General and the Government Accountability Office will just never be enough. With the new paradigm presented in this blog encoded in its DNA, the government would have the ability to re-organize its enforcement staffs into a cohesive model that fits the institutions they regulate, along with the flexibility to morph as those institutions are likely to in the Brave New World we are facing. This frees up capitalism to go on its merry way to recovery, even if the depository side of banking and mortgages in the form of Freddie, Fannie and Ginnie all need to stay ‘governmentized’ for a while until the free market is able to sort out the mess the last debacle leaves behind. Using techniques like this we can make sure these entities are spun off for good and, perhaps most importantly, no longer considered GSEs, all while giving them proper policy oversight.


At some point the right solution will be realized, perhaps when we come up with a price index and allow all homeowners (those who were rightfully financed in the first place) to refinance to a 10 year adjustable or 30 year fixed product at the adjusted home value. Before you dismiss the idea, ask what else would stop someone with good credit from moving down the street to a nicer house priced at less than what they owe on their current mortgage. This would allow the bank and homeowner to share any increase in value over the coming years up to the original value of the mortgage, at which point the homeowner would receive the additional equity, or perhaps some tapering share of it. Interest rates would remain low for some time to allow for these loans, and the 10 and 30 year products would hopefully keep homeowners outside the time horizon of the huge interest rate hikes that will undoubtedly occur to fight inflation. Homebuying would be tough for a few years while interest rates are going up, but the banks would have sound balance sheets, and at least the CDOs could be unwound and credit default swaps absorbed. At some point all would return to homeostasis.


What we need is the ability not only to found a 'language' around these goals, objectives and measures but also levels of process models that ensure how they will be carried out. The main components can be put into a process model that decomposes to another level, and eventually into the implementation of the systems that facilitate the negotiation of complex instruments by presenting counterparty risk in aggregate each time they are bought and sold. More importantly, oversight and measures of efficiency for whatever the government may be doing to bail these institutions out would be immediately available. A simple diagram of how these levels of complexity and volume decompose is shown here:


Effectively this would make multiple iterations of the Troubled Asset Relief Program (TARP) not only inherently transparent but also conducted on a transactional basis from the funds set aside to perform the duties assigned by the legislative policy. Anyone who believes that TARP, a National ID Card or an electronic medical record maintained by the government can be devised, funded, implemented, managed and reported on with adequate oversight, accomplishing the goals originally intended without instigating other, possibly worse, side effects, is not being realistic, or needs to be educated as to why it’s impossible to just let ‘the smart people at IBM take care of it’. At some point, while we may stop foreclosures or even subsidize mortgage payments, it will not stop what has devolved into the end of a game of musical chairs where someone has taken all of the chairs. Whatever the solution, we are all in this together: homeowners, banks and government. The solution should allow all 3 to participate and have visibility into results on a real time basis to rebuild the trust within our capitalist society. Otherwise government will spend more money and not accomplish the desired results, and banks will foreclose on more homes and commercial properties as their capital levels are fortified by the government while waiting for an uptick in housing to sell off foreclosed inventory. The problem there is that the new homeowners won't exist, as there won't be an economy with jobs to support any new homeowners. We'd better get the smart people on this and allow them to participate in how we solve it, implementing technology at every step in the process from legislation forward to ensure success. We don’t have the money available in the whole world to keep feeding this problem as it exists now. Otherwise we had better be prepared to understand that (especially without techniques such as those espoused here in this blog) there will be more Orwellian debacles yet to come and, perhaps most importantly, we won’t see the full impact of their aggregate perils until it’s too late.


In conclusion, I'm essentially sounding the alarm that while things coming out of Congress can be debated to great end about their intentions or fairness, they cannot be measured ahead of time for their efficiency in addressing the problem(s) at hand, and periodic measurements of aggregated efficiency, which could be construed as ‘effectiveness’, just aren't agile enough. There isn’t the ammunition left to keep firing $1 trillion birdshot with the double barrel sawed-off we call the Treasury and Federal Reserve to clean up this mess. What we need is a fresh start with a few well placed 7mm sniper rounds to solve some of these systemic issues. I'm not suggesting we throw caution to the wind and adopt some Isaac Asimov state of machine rule, nor am I suggesting that I should be the next ruler of the free world because I understand how these systems work and, more importantly, how they should work to support new initiatives. I'm not sure how the rest of the world feels about a technocracy, but it's obvious our Federal Government is far from that at this point. Keep in mind that IT spending for the entire Federal Government is only around $78 billion, roughly 10% of the new stimulus bill just passed by Congress. What I'm saying is that in a world where we are more and more dependent on technology, we cannot let the inefficiencies of government permeate the implementation of the new programs, especially the IT that is mainly responsible for 'making the trains run on time', as it were. We need a new era of the President's Management Agenda; a Federal CTO who oversees FOIA and the like is going to fall way short of enabling technology that can support the goals of legislation and of mitigating the risks (doing away with them in an ideal world) of the unintended consequences, which requires a framework that provides a ‘line of sight’ when tweaking policy along with automatic, instant transparency, neither of which would otherwise be provided.

Thursday, January 29, 2009

Searching Non-Textual, Unstructured Geospatial Images and Video

Another problem that I looked into last year, and one that will likely remain a challenge for years to come, is one I will discuss in the context of the DARPA research project called VIRAT (the others are similar but can't be discussed). The problem, in a nutshell, is that with increasing sources of video being captured for national security and otherwise, how is it possible to create digitized events from that video when the number of human eyes available to pore over this ever increasing amount of video has been exhausted? The thought occurred to me to take a stab at this one using a database-centric approach with an offloaded, clustered object cache.


When I initially looked at this situation and began to decompose it, I saw the reverse of a task I used to do way back in my days of 3DStudio and Maya animation productions. The task of rendering video from a model went like this: you would enlist 'slaves', which for me were all the PCs in the entire office (486s at the time); running a non-interactive version of the software, these slaves would receive frames to be rendered from the master, where the vector model, overlays and effects were merged, and return them to the master, which would assemble them into video. Now the obvious missing piece here is that you don't know the model you are trying to derive when approaching the problem as it's described here.
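For anyone who never ran one of these farms, here is a minimal sketch of that master/slave pattern. It is only an illustration: Python's multiprocessing stands in for the non-interactive render slaves, and render_frame() is a hypothetical placeholder rather than any real renderer.

```python
# Minimal sketch of the master/slave render-farm pattern described above.
# multiprocessing stands in for the non-interactive render slaves;
# render_frame() is a hypothetical placeholder for the actual renderer.
from multiprocessing import Pool

def render_frame(frame_number):
    """Pretend to rasterize one frame of the shared scene model."""
    # In the real farm, the slave loads the vector model, overlays and
    # effects, renders the frame, and returns the pixels to the master.
    return (frame_number, f"pixels-for-frame-{frame_number}")

def master(total_frames, slave_count=4):
    """Dispatch frames to the slaves and reassemble them in order."""
    with Pool(processes=slave_count) as pool:
        rendered = pool.map(render_frame, range(total_frames))
    # The master stitches the returned frames back into a video sequence.
    return [pixels for _, pixels in sorted(rendered)]

if __name__ == "__main__":
    video = master(total_frames=8)
    print(f"assembled {len(video)} frames")
```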


There are vendors out there like Eptascape who have products that perform 'video surveillance analytics' using 'computer vision'. This particular product garners mention here in that it uses MPEG-7 (the Multimedia Content Description Interface) metadata for event processing. The challenge is that these products are designed for fixed or limited field of view cameras and simple motion detection, flagging 'object descriptors' that aren't configured to be ignored. These would be items like texture, color, centroid, bounding box and 'shape'. With the problem we're looking at there are many factors that technologies such as this haven't likely addressed, such as thousands of square miles of footage, moving cameras that may have the earth's curvature or atmospheric conditions to account for, and in general a variable field of vision due to the cameras' own and their targets' varying global positions. There are also high performance computing solutions out there like this that may in fact employ similar bundles of technologies to those presented in this blog entry, and SGI is a pioneer in this subject area, but the cost is likely somewhat prohibitive. The solution here attempts to show how general use software products and open source frameworks can be made to solve this very specific need.



So really what we need is a good format to store shapes extracted from frame samples, and one that's been around for quite some time has grown to be known as X3D. A good thought about how to abstract and semantically extend MPEG-7 into X3D is located here, while a good reference for how to use RDF for such matters is located here. Getting a framework to use in an SOA and for complex event processing was something that I had looked into before with PEO STRI (the war games folks), who were trying to get real time data in from the battlefield to achieve a live, battle-enhanced simulation. In this solution, having a catalogue of known geometries that can be infused into an offloaded clustered object cache like Coherence for event detection is the idea for the end result, but how do we generate the geometries from the images?
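Before getting to that question, here is a rough sketch of what the catalogue-and-cache comparison could look like. It is only an illustration: a plain Python dict stands in for an offloaded clustered object cache like Coherence, the descriptors echo the MPEG-7 style attributes mentioned above, and every label, dimension and tolerance is an assumption rather than any product API.

```python
# Rough sketch only: a dict stands in for a clustered object cache such as
# Coherence, and the similarity test is a deliberately naive placeholder.
# All labels, dimensions and tolerances are illustrative assumptions.

# The catalogue maps a label to a known geometry signature (a bounding box
# footprint plus a centroid height, echoing MPEG-7 style descriptors).
catalogue = {
    "truck": {"bbox": (6.0, 2.5), "centroid_height": 1.5},
    "sedan": {"bbox": (4.5, 1.8), "centroid_height": 0.8},
}

def matches(observed, known, tolerance=0.25):
    """True if an observed bounding box is within tolerance of a known one."""
    dw = abs(observed["bbox"][0] - known["bbox"][0]) / known["bbox"][0]
    dh = abs(observed["bbox"][1] - known["bbox"][1]) / known["bbox"][1]
    return dw <= tolerance and dh <= tolerance

def detect_events(frame_descriptors):
    """Compare descriptors extracted from one frame against the catalogue."""
    events = []
    for descriptor in frame_descriptors:
        for label, known in catalogue.items():
            if matches(descriptor, known):
                events.append({"label": label, "descriptor": descriptor})
    return events

sample_frame = [{"bbox": (4.4, 1.9), "centroid_height": 0.9}]
print(detect_events(sample_frame))   # expect a "sedan" event
```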



I will refer to Oracle's Data Mining and its orthogonal partitioning clustering (O-Cluster), a density-based method developed to handle large, high-dimensional databases, as a solution for deriving geometries from hyperspectral or image data. These geometries can be used as a baseline for comparison against current feeds to trap events such as those sought in military or law enforcement operations. Much of this type of processing is available and actively used in aerial platforms such as ARCHER, used in search and rescue, Homeland Security work, etc. Extracting geometries from images builds on a technology that has been around for a long time called stereophotogrammetry.
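O-Cluster itself is an Oracle implementation, so purely as an illustrative stand-in the sketch below uses a generic density-based method (DBSCAN from scikit-learn) to group the foreground pixels of a toy grayscale frame into blobs and derive crude bounding-box geometries. The thresholds and the tiny synthetic image are assumptions for illustration, not anything from the actual project.

```python
# Illustrative stand-in only: DBSCAN (scikit-learn) replaces O-Cluster as a
# generic density-based method to group foreground pixels into blobs and
# derive crude bounding-box geometries. Threshold values are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

def extract_geometries(image, intensity_threshold=0.5):
    """Cluster bright pixels of a 2-D grayscale array into blob geometries."""
    ys, xs = np.nonzero(image > intensity_threshold)
    points = np.column_stack([xs, ys])
    if len(points) == 0:
        return []
    labels = DBSCAN(eps=2.0, min_samples=4).fit_predict(points)
    geometries = []
    for label in set(labels) - {-1}:          # -1 marks noise points
        blob = points[labels == label]
        x0, y0 = blob.min(axis=0)
        x1, y1 = blob.max(axis=0)
        geometries.append({"bbox": (int(x0), int(y0), int(x1), int(y1)),
                           "centroid": tuple(blob.mean(axis=0))})
    return geometries

# A toy 20x20 frame with one bright rectangular "object".
frame = np.zeros((20, 20))
frame[5:10, 8:14] = 1.0
print(extract_geometries(frame))
```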



Even for a semi-static baseline to compare against, this is a massive amount of data and processing we are talking about. We are also talking about taking in orders of magnitude more data for event identification, which may be of a mission critical nature and demand more immediate results, and as mentioned before, there are only so many human eyes available for such work. Where does that leave us in the analysis of this problem? Outside of some promising new directions in computing technology such as this and this, what we are looking for are the identifying criteria that constitute an 'event' within physical observations that have been digitized in some fashion. Given that these criteria are potentially unknown, an event must be identified as different from some baseline yet not fall into the category of something that doesn't apply, such as an animal moving across a sensor area when we are looking for a vehicle but still want to be flagged for other 'suspicious' activity.


This reality illuminates the need to fuse data into a package that supports processing millions of frames per second in order to stitch the aggregate information together into a phenomenology of sorts. With the baseline data captured into some kind of Triangulated Irregular Network, perhaps derived from a Digital Elevation Model, what is needed is a capture of data that facilitates quick matching and processing of the variations that constitute an event. LiDAR has emerged as a method not only to retrieve elevation on the order of millions of points per second over many square miles but also to measure differences over small windows of time to ascertain phenomena like speed, which can be used to determine, for instance, the class of vehicle that can attain such speed. This is similar in spirit to the sonobuoys used by the Coast Guard and other sea-based interdiction units. Here is the rudimentary schematic of LiDAR from Wikipedia:






Since the scope of this blog is primarily geared towards computing solutions, I will offer this as a good start for understanding the Spatial component of the Oracle Database as it relates to this type of data. As a small addendum to the analysis, I will also add that Oracle 11g now supports TIN and DEM as stored object types, as the slides in the previous link are based on Oracle 10g. While I won't get into the SQL used to process events coming from all of this data, I will say it is nice that the work done in the past with a language such as AutoLISP, which I used to write utilities for merging surfaces and the like while prepping virtual worlds for 3D animations, still carries over to this domain. Given so much data, and said data being enriched with descriptive metadata, the possibilities for visualizing results are approaching the status of science fiction. Take this example of LiDAR points that render an image and then take a look at the site photosynth.net from Microsoft. Some of the examples on this site are of wide open spaces, some street level and some aerial, but you get the idea that another angle for fusing this data is stitching it into a panorama or perhaps even a photorealistic 3D world.
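To make the earlier speed-from-LiDAR idea a little more concrete, here is a back-of-the-envelope sketch: if the same cluster of returns can be matched across two sweeps a known interval apart, the displacement of its centroid gives a rough ground speed, which in turn hints at the class of vehicle. This uses plain NumPy rather than Oracle Spatial's TIN or DEM types, and the coordinates are invented purely for illustration.

```python
# Back-of-the-envelope sketch of the speed idea: the displacement of a
# tracked cluster's centroid between two LiDAR sweeps, divided by the
# time between sweeps, bounds the object's ground speed. Coordinates are
# assumed to be metric and are made up for illustration.
import numpy as np

def estimate_speed(points_t0, points_t1, dt_seconds):
    """Estimate speed (m/s) from the centroid shift of a tracked cluster."""
    c0 = np.asarray(points_t0).mean(axis=0)
    c1 = np.asarray(points_t1).mean(axis=0)
    displacement = np.linalg.norm(c1 - c0)     # metres, assuming metric coords
    return displacement / dt_seconds

sweep_a = [(100.0, 200.0), (101.0, 201.0), (102.0, 200.5)]
sweep_b = [(118.0, 200.2), (119.0, 201.1), (120.0, 200.6)]
speed = estimate_speed(sweep_a, sweep_b, dt_seconds=1.0)
print(f"~{speed:.1f} m/s, roughly {speed * 3.6:.0f} km/h")  # coarse vehicle-class hint
```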



In the final analysis, the tools are available off the shelf to acquire, harmonize and associate these types of data in order to compare differences that constitute 'events' to be presented for further review by humans. Any progress against this problem goes a long way, as it stands now the entire body of data is subject to human review. 'Greedy' flagging of too much is the acceptable direction for error at this time, and like other systems this one will become smarter over time as anomalies are more accurately discerned and, likewise, targets and their possible permutations, as presented in the various or combined media types, are more readily identified for proactive routing of event data.

Tuesday, January 27, 2009

A Different Search Paradigm

The first exposure I had to solving the problem I'm going to present in this blog was at the National Science Foundation, which is charged with funding grants for researching an ever evolving set of sciences, generally those outside of medicine, as that is the work of Health and Human Services. Their version of this problem, in a practical sense, is that they receive any number of proposals written to apply for money from these grants. Generally the grants are structured to solicit the solving of a problem and not so much how to solve it. This means that scientists from many fields will take a stab at proposing a solution to get the grant money and carry out their research.

NSF's biggest expenditure outside of the grants themselves is bringing in experts in the field of study who can judge the merit of these proposals. They convene a panel to collectively decide who should receive the grant. Now while that sounds simple enough, the minutiae of each approach and the science it entails pose a large problem, which is simply to categorize each submission so that it can be reviewed by the appropriate experts. This 'pre-processing' task alone accounts for a majority of the operating budget at NSF and consumes time from some of the brightest and best people available.

The initial solution proposed to alleviate this heavy lifting was a mixture of text mining and RDF based on work done with MEDLINE and HHS, as seen here and here. This back end was to be coupled with tools for modeling taxonomy and ontology such as Protege and TopBraid, and a middle layer for visualizing results called Siderean Seamark. While this approach seemed logical, the problem at NSF is that there is no corpus like MEDLINE available for 99% of the sciences documented, a list which keeps evolving. In fact you can find a shallow taxonomy of these fields and their children in directories on the internet, but not the concepts that must be represented to accurately 'bucket' the research proposals. Therefore the initial data to prime the system for the desired result just didn't exist.

Fast forward to another project, which was looking at a search solution for the data collected during the Bush administration. This was purported to be in the petabyte range and consisted mostly of email and attachments as well as policy and official releases. While the choice of OutsideIn to parse the data (which is what Google uses for the deep web) was fairly obvious, what concerned the National Archives was that the Clinton library was only a few terabytes and already performing poorly. Now while the hardware running that data set may have been slightly anemic, we were dealing with several orders of magnitude more here, where a machine of the required size may not even exist. Of course the alternate solution is the Google one, where a data center of commodity blades is used to process searches. Google, along with Microsoft FAST, were competitors in this technology evaluation.


After looking a bit more closely at the alternatives, at the scalability of inverted indexes, and at hardware vendor analysis from IBM leveraging Nutch and Lucene as well as distributed file systems like Hadoop (the open source take on the approach Google and others pioneered), I came to the conclusion that out of the box technologies (at least from Oracle) had only one chance to compete with the more established technologies, and it wasn't going to be the Text index shown here:

While this was a good design, it didn't seem to scale up or out efficiently even when put on hardware like this, which supports 128TB of RAM, or about an eighth of the size of the content itself. In reality it's the size of this type of index that proves cumbersome, so we attempted a new approach to get away from an index that is a high percentage (sometimes half) of the size of the content itself.

At the root of the solution was the idea that we would use OutsideIn to parse all email, HTML, XML and binary document content into a homogeneous XML that could be stored in XML DB and merged with an XSLT stylesheet to produce 'snippets' when those entries were tagged in a search. The real crux of the performance issue was addressed at crawl and parse time by stripping away the Wordlist, Stoplist and Lexer from the 'index' structure that would be queried in searches. This very important part of any full text index was abstracted into a flat table of the entire English language as extracted from WordNet. There was also a semantic version of WordNet that would allow for expanding search terms. Since this was an all-English corpus the approach was linguistically valid, and there are any number of off the shelf solutions that translate end results into other languages; searching in other languages was not intended to be supported in any case.
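A toy sketch of the flat-dictionary idea follows: every word maps to a small integer drawn from a WordNet-style word list, and each crawled document is reduced to (doc_id, position, word_id) rows instead of a classic full text index. This is plain Python, not the OutsideIn/XML DB pipeline itself, and the tokenizer is a deliberate simplification.

```python
# Toy sketch of the flat-dictionary approach: a word list maps each word to
# a small integer id, and every document becomes (doc_id, position, word_id)
# rows rather than a classic full-text index structure.
import re

dictionary = {}          # word -> small integer id (the "WordNet" table)
postings = []            # rows of (doc_id, position, word_id)

def word_id(word):
    """Return the integer id for a word, adding new words as they appear."""
    return dictionary.setdefault(word.lower(), len(dictionary) + 1)

def ingest(doc_id, text):
    """Reduce a document to ordinal-position integer rows."""
    for position, word in enumerate(re.findall(r"[A-Za-z']+", text), start=1):
        postings.append((doc_id, position, word_id(word)))

ingest(1, "The quick brown fox jumps over the lazy dog")
ingest(2, "A lazy afternoon for the brown dog")
print(len(dictionary), "distinct words,", len(postings), "posting rows")
```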


As new words and acronyms were presented by subsequent crawls they would be added to the relational table as well as the semantic table. Support for white space, punctuation and other noise such as special characters was added for use in exact quote matches. Of importance here is the realization that the English language base is only ~1 million words, and even if it doubles or triples with slang, acronyms, etc., it is still a relatively small data field when encoded as a 3 byte integer. In addition there was an XML CLOB to support the retrieval of snippets. The primary key columns contained a number for the document id and one for each discrete instance of a word in each document. The latter would serve as the ordinal position for retrieving words quoted in the search criteria, receive its own index, and act as a sequence that resets for each document. The total in-line record size of these primitive types would be 3+4+4=11 bytes, so tens of billions of total words would only yield roughly a terabyte of integer storage and could be managed on TimesTen. We now had petabytes of content searchable through terabytes worth of integers and indexes. The cardinality of the dictionary word id to its foreign key in the table would be supported via a bitmap index or bitmap join index. The issues left to conquer would be the maintenance of that index on subsequent crawls (therefore inserts into the table) and how long it would take to create a multi-terabyte index of this type, since re-creation is the recommended approach given that maintaining the index in place would be prohibitive and yield fragmentation. Partitioning and its globally partitioned indexes cannot be used, as they do not support the bitmap index type at this point.
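The sizing claim above works out roughly like this (the row layout and word count are the figures quoted in the text, treated here as assumptions rather than measurements):

```python
# Rough arithmetic behind the sizing claim: a 3-byte word id plus a 4-byte
# document id plus a 4-byte ordinal position gives an 11-byte row, so tens
# of billions of word occurrences land in the sub-terabyte range before
# index overhead is added.
BYTES_PER_ROW = 3 + 4 + 4                      # word id + doc id + position
word_occurrences = 50_000_000_000              # "10s of billions of total words"
raw_bytes = BYTES_PER_ROW * word_occurrences
print(f"{raw_bytes / 1024**4:.2f} TiB of raw row data")   # ~0.50 TiB
```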


The real value of the solution for the President Bush library is that it would run on a traditional scale-up, shared global memory system, which would save data center costs as well as labor costs, since maintenance of the system would rely on the standard off the shelf components used in many places today. As an SQL based system, queries would be expanded with a semantic query against the RDF version of WordNet, 3 byte integers would be retrieved from the relational representation of the WordNet data and used in the main query against the bitmap join index, resulting in a 'greedy' get of all of the 'hits' where that word or a synonym, etc. occurs in the body of crawled data. This inner query of a correlated subquery would move the pertinent records out to another cache area, where they could be further filtered semantically based on the search terms and used to retrieve their XML representation, which gets merged with an XSLT to produce snippets of the results.
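A compressed, in-memory caricature of that query path: expand the search term through a synonym map (standing in for the RDF version of WordNet), translate the expanded terms to their integer ids, and pull the matching rows. The real design would express this as SQL over a bitmap join index with snippets produced from the XML via XSLT; every structure below is an illustrative assumption.

```python
# In-memory caricature of the query path: semantic expansion, then lookup
# of integer word ids, then a scan of the (doc_id, position, word_id) rows.
dictionary = {"car": 17, "automobile": 18, "truck": 19}
synonyms = {"car": {"automobile"}}                     # toy WordNet expansion
postings = [                                           # (doc_id, position, word_id)
    (1, 4, 17), (1, 9, 19),
    (2, 2, 18),
]

def search(term):
    """Expand a term semantically, then fetch doc hits by integer id."""
    terms = {term} | synonyms.get(term, set())
    ids = {dictionary[t] for t in terms if t in dictionary}
    return sorted({doc_id for doc_id, _, wid in postings if wid in ids})

print(search("car"))    # [1, 2] -- doc 2 matches via the synonym "automobile"
```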


It's not hard to see that clickstream or other 'ranking' mechanisms for search results would be an easy bolt-on to this simplest of data representations. I should credit some documentation that I used to validate this approach here, and some insight on the binary state of the bitmap index here and here. Since so much of what gets stored in this situation is redundant, Advanced Compression techniques can be used against the document base and XML representation for significant savings. So what I'm getting at here is that all of the tools are available on standard platforms, with standard off the shelf software, to build your own mini Google as it were; albeit constrained to a single language, it opens up a world of possibilities.
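As a sketch of how simple that bolt-on could be (the click tally and ordering rule here are invented purely for illustration):

```python
# One way the ranking "bolt-on" could look: keep a per-document click tally
# alongside the integer postings and order search hits by it.
from collections import Counter

clicks = Counter()                 # doc_id -> number of result clicks

def record_click(doc_id):
    clicks[doc_id] += 1

def rank(hit_doc_ids):
    """Order raw hits by accumulated click count, most clicked first."""
    return sorted(hit_doc_ids, key=lambda d: clicks[d], reverse=True)

record_click(2); record_click(2); record_click(1)
print(rank([1, 2, 3]))             # [2, 1, 3]
```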


Let me tie all of this together and talk about why it's important that more mainstream computing power be used for Web 3.0, rather than leaving it in the hands of the few who can fund their own data centers for 'clouds' or get multi million dollar grants to author a more single purpose system. There are many folks doing practical things with mainstream technology and I'll refer you to two such folks here and here. I find it interesting that the lead of the Web 3.0 entry in Wikipedia says "it refers to aspects of the Internet which, though potentially possible, are not technically or practically feasible at this time", and that is what I think can be changed here. An old colleague from Oracle makes my case here. Although Oracle doesn't have a research division per se in the vein of a Microsoft or IBM, what they do have are technologies that are practical, well thought out and (generally) ready for use without a PhD to understand them.

Back to the original National Science Foundation problem and the understanding that you start with essentially an infant system that doesn't understand much of anything except how to answer a search query. But using collective intelligence from information browsers you can begin to build an understanding of the relative bonds of the information and through experts tagging results of their interpretations build an 'understanding' of the underlying data. You've taken natural language processing in its classical sense somewhat out of the picture and let the users interacting with the system render the interpretive results including language.

Now imagine if experts in the subject matters were able to infuse their knowledge via OWL ontologies. There is a good book I've read on this area of research called Data Mining with Ontologies: Implementations, Findings and Frameworks which really begins to show how semantic content can not only be used to help enhance search queries and results navigation but in fact control the way in which bodies of data are mined for intelligence. Powerful huh? Now your browsing history and favorites can be made into a semantic package that gives you some context as well when you interface with this source and Google Desktop and others have seen this vision as well. One could make the case that it will be startups like 33Across and Peer39 that will actually monetize Facebook and other social networking sites.

Not surprisingly, Oracle chose to put the semantic component inside their Spatial extension to the database. When you look at the Wikipedia Web 3.0 entry under Other Potential Research you see tremendous opportunity inside this data structure format, as the data itself becomes dimensionally navigable. For one of the best explanations of this complex paradigm I refer you to a publication called The Geometry of Information Retrieval, which is a brilliantly thought out explanation of the 'existentialism' of the information itself. Mathematical tools like the Cauchy-Schwarz inequality shed light on how to use data mining techniques such as probability and variances to bucket data into correlated, informational assets.
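For reference, the inequality being invoked, in the inner product form relevant to vector-space retrieval (the tie to 'bucketing' data is that it bounds correlations and cosine similarities, so variance-based measures stay normalized):

```latex
% Cauchy-Schwarz in inner product form: the similarity of two document or
% query vectors is bounded by the product of their norms, which is what
% keeps cosine similarity (and correlation) within [-1, 1].
\[
  |\langle x, y \rangle| \le \|x\|\,\|y\|
  \quad\Longrightarrow\quad
  -1 \le \frac{\langle x, y \rangle}{\|x\|\,\|y\|} \le 1
\]
```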

I believe Second Life, together with the quote below from Sir Tim Berners-Lee:

"I think maybe when you've got an overlay of scalable vector graphics—everything rippling and folding and looking misty—on Web 2.0 and access to a semantic Web integrated across a huge space of data, you'll have access to an unbelievable data resource."

capture the power of what we're dealing with, which is not only a distributed model of collective intelligence that learns over time through interaction and data acquisition but an interface that allows users to immerse themselves inside a navigable information space born from the very cognitive representation of the knowledge they seek. The real power comes in the future (hopefully in my lifetime) when it is understood how to translate this knowledge base into any language and other human interface devices, realizing the socio-technological research talked about in the other area of Potential Research shown on the Wikipedia Web 3.0 entry. It gives hope that the transfer of this knowledge, and therefore enlightenment and understanding, would be much easier for all to achieve.

As my hacker ethic would have me believe, all of this should and will be accomplished by the masses, as they will be the recipients of it, and it will not succeed without the proper input, where the 'Internet', Web 3.0 and beyond, remains an intangible, untaxed and unowned amorphous entity that can be used for the greater good.