Thursday, January 29, 2009

Searching Non-Textual, Unstructured Geospatial Images and Video

Another problem that I looked into last year, and one that will likely remain a challenge for years to come, is one I will discuss in the context of the DARPA research project called VIRAT (the others are similar but can't be discussed). The problem, in a nutshell: with ever more video being captured for national security and other purposes, how do we derive digitized events from that video when the number of human eyes available to pore over this ever increasing volume has long since been exhausted? The thought occurred to me to take a stab at this one using a database-centric approach with an offloaded, clustered object cache.


When I initially looked at this situation and began to decompose it, I saw the reverse of a task I used to do way back in my days of 3D Studio and Maya animation productions. Rendering video from a model went like this: you would enlist 'slaves', which for me were all the PCs in the entire office (486s at the time), each running a non-interactive version of the software. These slaves would receive frames to render from the master, where the vector model, overlays and effects had been merged, and return the results to the master, which assembled them into video. The obvious missing piece here is that you don't know the model you are trying to derive when approaching the problem as it's described here.
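To make the analogy concrete, here is a minimal sketch of that master/slave pattern run in reverse: captured frames farmed out to workers for analysis, rather than model frames farmed out for rendering. The Frame type and the analyze() stub are hypothetical placeholders, not any particular product's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Render-farm pattern in reverse: a master hands captured frames to worker
// threads for feature extraction instead of handing out frames to render.
public class FrameFarm {

    static class Frame {
        final long id;
        final byte[] pixels;
        Frame(long id, byte[] pixels) { this.id = id; this.pixels = pixels; }
    }

    // Stand-in for real feature extraction on one frame.
    static String analyze(Frame f) {
        return "frame " + f.id + ": " + f.pixels.length + " bytes scanned";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(8);
        List<Future<String>> results = new ArrayList<Future<String>>();
        for (long id = 0; id < 100; id++) {
            final Frame frame = new Frame(id, new byte[640 * 480]);
            results.add(workers.submit(new Callable<String>() {
                public String call() { return analyze(frame); }
            }));
        }
        for (Future<String> r : results) {
            System.out.println(r.get()); // master reassembles the results
        }
        workers.shutdown();
    }
}
```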


There are vendors out there like Eptascape whose products perform 'video surveillance analytics' using computer vision. That particular product merits mention here because it uses MPEG-7 description data for event processing. The challenge is that such products are designed for fixed or limited field-of-view cameras and simple motion detection, flagging 'object descriptors' that aren't configured to be ignored: items like texture, color, centroid, bounding box and shape. The problem we're looking at has many factors that such technologies likely haven't addressed: thousands of square miles of footage; moving cameras that may have to account for the earth's curvature or atmospheric conditions; and, in general, a variable field of vision due to the camera's own and its targets' varying global positions. There are also high-performance computing solutions out there like this that may in fact employ bundles of technologies similar to those presented in this blog entry, and SGI is a pioneer in this subject area, but the cost is likely somewhat prohibitive. This entry attempts to portray how general-use software products and open source frameworks can be made to solve this very specific need.
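As a toy illustration of two of those descriptors, here is a sketch that computes the centroid and bounding box of a detected object from a binary foreground mask. It is illustrative only, not drawn from Eptascape's product or the MPEG-7 reference code.

```java
// Compute bounding box and centroid of the foreground (1) pixels in a mask.
public class ObjectDescriptors {

    public static void main(String[] args) {
        int[][] mask = {                 // tiny sample mask; 1 = object pixel
            {0, 0, 0, 0, 0},
            {0, 1, 1, 0, 0},
            {0, 1, 1, 1, 0},
            {0, 0, 1, 0, 0},
        };
        int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE;
        int maxX = -1, maxY = -1;
        long sumX = 0, sumY = 0, count = 0;

        for (int y = 0; y < mask.length; y++) {
            for (int x = 0; x < mask[y].length; x++) {
                if (mask[y][x] == 1) {
                    minX = Math.min(minX, x); maxX = Math.max(maxX, x);
                    minY = Math.min(minY, y); maxY = Math.max(maxY, y);
                    sumX += x; sumY += y; count++;
                }
            }
        }
        System.out.printf("bounding box: (%d,%d)-(%d,%d)%n", minX, minY, maxX, maxY);
        System.out.printf("centroid: (%.2f, %.2f)%n",
                (double) sumX / count, (double) sumY / count);
    }
}
```

A fixed-camera product only has to watch descriptors like these drift between frames; the hard part in our problem is that the frame of reference itself is moving.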



So really what we need is a good format to store shapes extracted from frame samples, and one has been around for quite some time: X3D. A good thought on how to abstract and semantically extend MPEG-7 into X3D is located here, while a good reference for how to use RDF for such matters is located here. Getting a framework to use in an SOA and for complex event processing was something I had looked into before with PEO STRI (the war games folks), who were trying to get real-time data in from the battlefield to achieve a live, battle-enhanced simulation. In this solution, the end result is a catalogue of known geometries that can be infused into an offloaded, clustered object cache like Coherence for event detection, but how do we generate the geometries from the images?
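To sketch the cache side of that idea before answering the question: known geometries live in a Coherence NamedCache, and a MapListener fires as candidate detections arrive for comparison. The cache names, keys and placeholder values here are hypothetical; the calls themselves (CacheFactory.getCache, put, addMapListener) are the standard Coherence API as I know it.

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.MapEvent;
import com.tangosol.util.MapListener;

// Catalogue of known geometries in one cache; live detections land in another
// and trigger a listener that would score them against the catalogue.
public class GeometryCatalogue {

    public static void main(String[] args) {
        NamedCache known = CacheFactory.getCache("known-geometries");
        known.put("vehicle/truck/5t", "x3d:...");   // placeholder X3D payload

        NamedCache detections = CacheFactory.getCache("live-detections");
        detections.addMapListener(new MapListener() {
            public void entryInserted(MapEvent e) {
                // A real matcher would compare the incoming shape against
                // the known-geometries catalogue here.
                System.out.println("candidate event: " + e.getKey());
            }
            public void entryUpdated(MapEvent e) {}
            public void entryDeleted(MapEvent e) {}
        });

        detections.put("frame-4471/obj-2", "x3d:...");  // simulated detection
    }
}
```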



For deriving geometries from hyperspectral or image data I will refer to Oracle Data Mining and its orthogonal partitioning clustering (O-Cluster), a density-based method developed to handle large, high-dimensional databases. These geometries can be used as a baseline for comparison against current feeds to trap events such as those sought in military or law enforcement operations. Much of this type of processing is available and actively used in aerial devices such as ARCHER, used in search and rescue, Homeland Security, etc. Extracting geometries from images is based on a technology that has been around for a long time called stereophotogrammetry.
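In the spirit of O-Cluster, though far simpler, here is a toy density sketch: sample points from an image are binned into a histogram grid and dense cells are kept as candidate geometry regions. This is a stand-in for illustration; Oracle's actual algorithm partitions along low-density valleys in high-dimensional histograms rather than thresholding a fixed grid.

```java
// Bin 2-D samples into a density grid and keep cells above a threshold.
public class DensityBins {

    public static void main(String[] args) {
        double[][] points = {              // hypothetical (x, y) samples in [0,1)
            {0.10, 0.12}, {0.11, 0.14}, {0.12, 0.11},
            {0.80, 0.82}, {0.81, 0.85}, {0.83, 0.80}, {0.82, 0.83},
            {0.50, 0.10},                  // isolated sample: treated as noise
        };
        int grid = 4;                      // 4x4 density histogram
        int[][] counts = new int[grid][grid];
        for (double[] p : points) {
            int cx = Math.min(grid - 1, (int) (p[0] * grid));
            int cy = Math.min(grid - 1, (int) (p[1] * grid));
            counts[cy][cx]++;
        }
        int threshold = 3;                 // minimum samples for a "dense" cell
        for (int y = 0; y < grid; y++)
            for (int x = 0; x < grid; x++)
                if (counts[y][x] >= threshold)
                    System.out.printf("dense cell (%d,%d): %d samples%n",
                            x, y, counts[y][x]);
    }
}
```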



Even for a semi-static baseline to compare against, this is a massive amount of data and processing we are talking about. We are also talking about taking in orders of magnitude more data for event identification, which may be of a mission-critical nature and demand more immediate results; as mentioned before, there are only so many human eyes available for such work. Where does that leave us in the analysis of this problem? Outside of some promising new directions in computing technology such as this and this, what we are looking for is the criteria that constitute an 'event' within physical observations that have been digitized in some fashion. Given that these criteria are potentially unknown, an event must be identified as a deviation from some baseline, yet not fall into a category that doesn't apply, such as an animal moving across a sensor area when we are looking for a vehicle, while we still want to be flagged for other 'suspicious' activity.
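That flagging rule can be stated in a few lines: anything deviating from the baseline is an event unless its classified type is on an explicit ignore list. The types and the classify() stub below are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Baseline-deviation flagging with an ignore list for known non-events.
public class EventFilter {

    static final Set<String> IGNORE = new HashSet<String>(
            Arrays.asList("animal", "vegetation-sway"));

    // Stand-in for a real classifier over an observed deviation.
    static String classify(double deviation) {
        return deviation > 0.9 ? "vehicle" : "animal";
    }

    static boolean isEvent(double deviation, double baselineTolerance) {
        if (deviation <= baselineTolerance) return false;  // within baseline
        String type = classify(deviation);
        return !IGNORE.contains(type);  // deviates, and not an ignorable class
    }

    public static void main(String[] args) {
        System.out.println(isEvent(0.95, 0.2));  // vehicle  -> true
        System.out.println(isEvent(0.50, 0.2));  // animal   -> ignored, false
        System.out.println(isEvent(0.10, 0.2));  // baseline -> false
    }
}
```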


This reality illuminates the need to fuse the data into a package that supports processing millions of frames per second in order to stitch the aggregate information into a phenomenology of sorts. With the baseline data captured in some kind of Triangulated Irregular Network (TIN), perhaps derived from a Digital Elevation Model (DEM), what is needed is a capture of data that facilitates quick matching and processing for the variations that constitute an event. LiDAR has emerged as a method to not only retrieve elevation on the order of millions of points per second over many square miles, but also to measure differences across small windows of time and so ascertain phenomena like speed, which can be used to determine, for instance, the class of vehicle capable of such speed (see the sketch after the figure below). This is similar in spirit to the sonobuoys used by the Coast Guard and other sea-based interdiction units. Here is the rudimentary schematic of LiDAR from Wikipedia:
[Figure: LiDAR schematic, from Wikipedia]
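Here is a toy version of that speed measurement: the same target is ranged in two closely spaced sweeps, and displacement over the time window yields velocity. The coordinates and timestamps are invented sample data.

```java
// Estimate target speed from the same matched return in two LiDAR sweeps.
public class LidarSpeed {

    public static void main(String[] args) {
        // (x, y) position of a matched return in metres, at two sweep times.
        double[] p1 = {1200.0, 3540.0};  double t1 = 0.00;   // seconds
        double[] p2 = {1216.5, 3548.2};  double t2 = 0.50;

        double dx = p2[0] - p1[0], dy = p2[1] - p1[1];
        double metresPerSecond = Math.sqrt(dx * dx + dy * dy) / (t2 - t1);
        double kmPerHour = metresPerSecond * 3.6;

        System.out.printf("target speed: %.1f m/s (%.0f km/h)%n",
                metresPerSecond, kmPerHour);
        // ~132 km/h here: fast enough to rule out a pedestrian and most
        // off-road traffic, narrowing the class of vehicle.
    }
}
```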
Since the scope of this blog is primarily geared towards computing solutions, I will offer this as a good start for understanding the Spatial component of the Oracle Database as it relates to this type of data. As a small addendum to the analysis: the slides in the previous link are based on Oracle 10g, and Oracle 11g now supports TIN and DEM as stored object types. While I won't get into the SQL used to process events coming from all of this data, it is nice to see echoes of work done in the past with a language such as AutoLISP, which I used to provide utilities for merging surfaces and the like while prepping virtual worlds for 3D animations. Given so much data, enriched with descriptive metadata, the possibilities for visualizing results are reaching the status of science fiction. Take this example of LiDAR points that render an image, and then take a look at the site photosynth.net from Microsoft. Some of the examples on that site are of wide open spaces, some street level and some aerial, but you get the idea that another angle for fusing this data is stitching it into a panorama or perhaps even a photorealistic 3D world.
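Returning to the Spatial component for a moment, here is a hedged JDBC sketch of pulling flagged detections that fall within an area of interest. The detections and areas_of_interest tables (and their columns) are hypothetical; SDO_ANYINTERACT is a standard Oracle Spatial operator.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Query detections whose geometry interacts with a named area of interest.
public class SpatialEvents {

    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//host:1521/orcl", "user", "pass");

        String sql =
            "SELECT d.id, d.captured_at " +
            "FROM detections d, areas_of_interest a " +
            "WHERE a.name = ? " +
            "AND SDO_ANYINTERACT(d.geom, a.aoi_geom) = 'TRUE'";

        PreparedStatement ps = conn.prepareStatement(sql);
        ps.setString(1, "sector-7");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println("event " + rs.getLong(1)
                    + " at " + rs.getTimestamp(2));
        }
        rs.close(); ps.close(); conn.close();
    }
}
```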



In the final analysis, the tools are available off the shelf to acquire, harmonize and associate these types of data in order to compare differences in them and so constitute 'events' that should be presented for further review by humans. Any progress against this problem goes a long way, as it stands now the entire body of data is subject to review. 'Greedy' flagging of too much is the acceptable direction for error at this time; like other systems, this one will become smarter over time as anomalies are more accurately discerned and, likewise, as targets and their possible permutations across the various or combined media types are more readily identified for proactive routing of event data.
