Saturday, June 30, 2012

Topology Dictionary for 3D Video Understanding

By Tony Tung and Takashi Matsuyama

This paper presents a novel approach that achieves 3D video understanding. 3D video consists of a stream of 3D models of subjects in motion. The acquisition of long sequences requires large storage space (2 GB for 1 min). Moreover, it is tedious to browse data sets and extract meaningful information. We propose the topology dictionary to encode and describe 3D video content. The model consists of a topology-based shape descriptor dictionary which can be generated from either extracted patterns or training sequences. The model relies on 1) topology description and classification using Reeb graphs, and 2) a Markov motion graph to represent topology change states. We show that the use of Reeb graphs as the high-level topology descriptor is relevant. It allows the dictionary to automatically model complex sequences, whereas other strategies would require prior knowledge on the shape and topology of the captured subjects. Our approach serves to encode 3D video sequences, and can be applied for content-based description and summarization of 3D video sequences. Furthermore, topology class labeling during a learning process enables the system to perform content-based event recognition. Experiments were carried out on various 3D videos. We showcase an application for 3D video progressive summarization using the topology dictionary.


Friday, June 29, 2012

iSpy turns your PC into a full security and surveillance system

iSpy Connect

iSpy uses your cameras, webcams, IP cams, and microphones to detect and record movement or sound. Captured media is compressed to Flash video or MP4 and streamed securely over the web and local network. iSpy can run on multiple computers simultaneously and has full email, SMS, and MMS alerting functions as well as remote viewing.

An Adaptive Video Retrieval System Based On Recent Studies On User Intentions While Watching Videos Online


By Christoph Lagger, Mathias Lux, and Oge Marques

We have developed a prototype of an adaptive video retrieval system that leverages the knowledge of users’ intentions (and their relationship to video genres/categories) uncovered by our recent studies on “user intentions while watching videos online” [1] to provide better search results and a user interface adapted to the intentions and needs of its users. The goal behind the development of this prototype is to provide a better solution to the problem of including the user’s context into video retrieval than the one offered by baseline video retrieval interfaces (e.g., YouTube). The prototype is called “You(r)Intent.” 

As shown in the block diagram of Figure 1, the prototype consists of three main blocks: (i) The user interface, which includes a text input box for typing a query and four buttons to communicate the intention to the system, namely:

  • to learn something,
  • to be entertained,
  • to get informed, or
  • to solve a task;

(ii) A ruleset, derived from the results of our studies [1], so that videos whose categories provide higher correlation with the user’s intention are ranked higher in the search results; and (iii) A collection of sources of video content, e.g., Vimeo, YouTube, or Khan Academy.
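Block (ii) above can be sketched as a simple weighted re-ranking rule. The following is a minimal illustrative sketch; the intention names and correlation weights are made-up placeholders, not the values measured in the authors' study [1].

```python
# Hypothetical intention-to-category correlation weights (illustrative only).
INTENTION_WEIGHTS = {
    "learn": {"How-to & Style": 0.9, "Science & Technology": 0.7, "Entertainment": 0.2},
    "entertainment": {"Entertainment": 0.9, "Comedy": 0.8, "How-to & Style": 0.3},
}

def rank_results(videos, intention):
    """Rank (title, category) pairs by how well the category matches the intention."""
    weights = INTENTION_WEIGHTS.get(intention, {})
    return sorted(videos, key=lambda v: weights.get(v[1], 0.0), reverse=True)

results = [
    ("Apollo 11 moonwalk", "Science & Technology"),
    ("Moonwalk dance tutorial", "How-to & Style"),
    ("Moonwalk fails", "Entertainment"),
]
print(rank_results(results, "learn")[0][0])  # the dance tutorial ranks first
```

With the "learn" intention selected, the "How-to & Style" result rises to the top, mirroring the behavior described in the example scenario below.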

Figure 1: Block diagram of the prototype for an adaptive video retrieval system.

For better understanding, we have outlined a simple example scenario. Let us assume a user wants to learn about a specific topic and she types the query “moonwalk.” Since the user has a “learning intention” in mind, she clicks on the “I want to learn something” button. Our pattern-based ruleset then optimizes the search to certain sources and categories. In this example case, YouTube and Khan Academy might be used as video sources and the videos will be ranked by categories from the strongest to the weakest correlation to the user’s intention [1], leading to a result screen whose screenshot appears in Figure 2. Notice how three clips from the “How-to & Style” category (all of which are related to the moonwalk dance popularized by Michael Jackson) appear at the top of the screen, with the second-highest-ranked category (“Science & Technology”) appearing in a second block, mostly containing NASA footage from historical Apollo-era moon explorations. By prioritizing the categories that are most strongly correlated to the user’s intentions and adopting a visually pleasant layout that shows them in easily distinguishable blocks, we circumvent the ambiguity caused by the query term and provide an intuitive way to navigate to the desired result.

Figure 2: Result screen of the prototype when the user queries for videos containing the “moonwalk” keyword and expresses the intention to learn something.

After finishing the development of the first version of the prototype, we performed a user survey, asking users to perform specific video retrieval tasks, and to report on how easily they carried out those tasks and how satisfied they were with the results. Evaluation methods used for this survey included: observation of the interviewee, analysis of mouse tracking heat maps, activity logging, and semi-structured interviews with the participants. Overall, participants solved each task somewhat easily and were very satisfied while working with the prototype. Moreover, the position at which a video of interest (which would solve the task at hand) appears was considered satisfactory throughout all tests. 


[1] C. Lagger, M. Lux, and O. Marques, “What Makes People Watch Online Videos: An Exploratory Study,” ACM Computers in Entertainment, 2012 (submitted).

Wednesday, June 27, 2012

Janken (rock-paper-scissors) Robot with 100% winning rate

The purpose of this study is to develop a janken (rock-paper-scissors) robot system with 100% winning rate as one example of human-machine cooperation systems.

Recognition of the human hand is performed in 1 ms with a high-speed vision system, which identifies both the position and the shape of the hand. The wrist joint angle of the robot hand is controlled based on the position of the human hand. The vision system recognizes rock, paper, or scissors from the shape of the human hand, and the robot hand then plays the move that beats it, all within 1 ms.
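Once the hand shape is classified, the decision step itself is trivial; the engineering challenge is doing the recognition fast enough. The decision can be sketched as a lookup:

```python
# The move that beats each recognized human hand shape.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def winning_move(human_hand: str) -> str:
    """Return the move that defeats the recognized human hand shape."""
    return BEATS[human_hand]

print(winning_move("rock"))  # paper
```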

This technology demonstrates the possibility of cooperative control within a few milliseconds. It can be applied, without perceptible time delay, to motion support for humans and to cooperative work between humans and robots.


Saturday, June 23, 2012

Predicting events in videos, before they happen. CVPR 2012 Best Paper


Intelligence is all about making inferences given observations, but somewhere in the history of Computer Vision, we (as a community) have put too much emphasis on classification tasks.  What many researchers in the field (unfortunately this includes myself) focus on is extracting semantic meaning from images, image collections, and videos.  Whether the output is a scene category label, an object identity and location, or an action category, the way we proceed is relatively straightforward:

  • Extract some measurements from the image (we call them "features", and SIFT and HOG are two very popular such features)
  • Feed those features into a machine learning algorithm which predicts the category these features belong to.  Some popular choices of algorithms are Neural Networks, SVMs, decision trees, boosted decision stumps, etc.
  • Evaluate our features on a standard dataset (such as Caltech-256, PASCAL VOC, ImageNet, LabelMe, etc)
  • Publish (or, as is commonly known in academic circles: publish-or-perish)

While action recognition has only become popular in the last five years, it still adheres to the generic machine vision pipeline.  But let's consider a scenario where adhering to this template can have disastrous consequences.  Let's ask ourselves the following question:

Q: Why did the robot cross the road?

A: The robot didn't cross the road -- he was obliterated by a car. This is because in order to make decisions in the world you can't just wait until all observations happened.  To build a robot that can cross the road, you need to be able to predict things before they happen! (Alternate answer: The robot died because he wasn't using Minh's early-event detection framework, the topic of today's blog post.)

This year's Best Student Paper winner at CVPR has given us a flavor of something more, something beyond the traditional action recognition pipeline, aka "early event detection."  Simply put, the goal is to detect an action before it completes.  Minh's research is rather exciting and opens up room for a new paradigm in recognition.  If we want intelligent machines roaming the world around us (and every CMU Robotics PhD student knows that this is really what vision is all about), then recognition after an action has happened will not enable our robots to do much beyond passive observation.  Prediction (and not classification) is the killer app of computer vision, because classification assumes you are given the data, while prediction assumes there is an intent to act on and interpret the future.

While Minh's work focused on simpler tasks such as facial expression recognition, gesture recognition, and human activity recognition, I believe these ideas will help make machines more intelligent and more suitable for performing actions in the real world.

Disgust detection example from CVPR 2012 paper

To give the vision hackers a few more details, this framework uses Structural SVMs (NOTE: trending topic at CVPR) and is able to estimate the probability of an action happening before it actually finishes.  This is something which we, humans, seem to do all the time but has been somehow neglected by machine vision researchers.
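To convey just the flavor of the idea (not the paper's actual max-margin formulation, which learns the model with a Structured Output SVM), here is a toy sketch: score every growing prefix of an observation sequence with a hand-set linear model, and fire as soon as the score clears a threshold, before the event has finished.

```python
# Toy early-detection sketch: the weights and threshold are illustrative
# assumptions, not learned as in the CVPR 2012 paper.
def detect_early(frames, weights, threshold):
    """Return the index of the first frame at which the cumulative prefix
    score reaches the threshold, or None if the event is never detected."""
    score = 0.0
    for t, frame in enumerate(frames):
        score += sum(w * x for w, x in zip(weights, frame))
        if score >= threshold:
            return t  # detected at frame t, possibly mid-event
    return None

frames = [[0.1, 0.0], [0.4, 0.2], [0.9, 0.8], [1.0, 1.0]]  # per-frame features
weights = [1.0, 1.0]
print(detect_early(frames, weights, threshold=1.5))  # fires at frame 2 of 4
```

The point is that the detector commits before seeing the whole sequence, which is exactly what a classification-only pipeline cannot do.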

Max-Margin Early Event Detectors.
Hoai, Minh & De la Torre, Fernando
CVPR 2012

The need for early detection of temporal events from sequential data arises in a wide spectrum of applications ranging from human-robot interaction to video security. While temporal event detection has been extensively studied, early detection is a relatively unexplored problem. This paper proposes a maximum-margin framework for training temporal event detectors to recognize partial events, enabling early detection. Our method is based on Structured Output SVM, but extends it to accommodate sequential data. Experiments on datasets of varying complexity, for detecting facial expressions, hand gestures, and human activities, demonstrate the benefits of our approach. To the best of our knowledge, this is the first paper in the literature of computer vision that proposes a learning formulation for early event detection.
Early Event Detector Project Page (code available on website)
Minh gave an excellent, enthusiastic, and entertaining presentation during day 3 of CVPR 2012; it was definitely one of the highlights of that day. He received his PhD from CMU's Robotics Institute (like me, yippee!) and is currently a postdoctoral research scholar in Andrew Zisserman's group in Oxford.  Let's all congratulate Minh for all his hard work.


Saturday, June 16, 2012

Is that smile real or fake?

Do you smile when you're frustrated? Most people think they don't — but they actually do, a new study from MIT has found. What's more, it turns out that computers programmed with the latest information from this research do a better job of differentiating smiles of delight and frustration than human observers do.
The research could pave the way for computers that better assess the emotional states of their users and respond accordingly. It could also help train those who have difficulty interpreting expressions, such as people with autism, to more accurately gauge the expressions they see.

Read more about this story here

Related publication:

M. E. Hoque, R. W. Picard, Acted vs. natural frustration and delight: Many people smile in natural frustration, 9th IEEE International Conference on Automatic Face and Gesture Recognition (FG'11), Santa Barbara, CA, USA, March 2011. (PDF: 1693 KB)


Sunday, June 10, 2012

Driving without a Blind Spot May Be Closer Than It Appears

A side-by-side comparison of a standard flat driver's side mirror with the mirror Hicks designed, which has a much wider field of view and minimal image distortion.

A side mirror that eliminates the dangerous “blind spot” for drivers has now received a U.S. patent. The subtly curved mirror, invented by Drexel University mathematics professor Dr. R. Andrew Hicks, dramatically increases the field of view with minimal distortion.

Traditional flat mirrors on the driver’s side of a vehicle give drivers an accurate sense of the distance of cars behind them but have a very narrow field of view. As a result, there is a region of space behind the car, known as the blind spot, that drivers can’t see via either the side or rear-view mirror. It's not hard to make a curved mirror that gives a wider field of view – no blind spot – but at the cost of visual distortion and making objects appear smaller and farther away.

Hicks’s driver’s side mirror has a field of view of about 45 degrees, compared to 15 to 17 degrees of view in a flat driver’s side mirror. Unlike in simple curved mirrors that can squash the perceived shape of objects and make straight lines appear curved, in Hicks’s mirror the visual distortions of shapes and straight lines are barely detectable.

Hicks, a professor in Drexel’s College of Arts and Sciences, designed his mirror using a mathematical algorithm that precisely controls the angle of light bouncing off of the curving mirror.

“Imagine that the mirror’s surface is made of many smaller mirrors turned to different angles, like a disco ball,” Hicks said. “The algorithm is a set of calculations to manipulate the direction of each face of the metaphorical disco ball so that each ray of light bouncing off the mirror shows the driver a wide, but not-too-distorted, picture of the scene behind him.” 

Hicks noted that, in reality, the mirror does not look like a disco ball up close. There are tens of thousands of such calculations to produce a mirror that has a smooth, nonuniform curve.
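The geometric principle behind the disco-ball metaphor is the law of reflection: the direction of a reflected ray is fully determined by the local surface normal, r = d − 2(d · n)n. The sketch below only illustrates that single reflection formula; Hicks's actual algorithm, which chooses the normals across the whole surface to produce a wide, low-distortion view, is not reproduced here.

```python
# Reflect a ray direction d off a surface with unit normal n:
#   r = d - 2 (d . n) n
def reflect(d, n):
    """Return the reflection of direction vector d about unit normal n."""
    dot = sum(di * ni for di, ni in zip(d, n))
    return [di - 2 * dot * ni for di, ni in zip(d, n)]

# A ray traveling straight down onto a flat horizontal mirror bounces straight up.
print(reflect([0.0, -1.0], [0.0, 1.0]))  # [0.0, 1.0]
```

Tilting the normal (one "face of the disco ball") redirects the reflected ray, which is exactly the degree of freedom the algorithm manipulates tens of thousands of times over the mirror surface.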

Hicks first described the method used to develop this mirror in Optics Letters in 2008.

In the United States, regulations dictate that cars coming off of the assembly line must have a flat mirror on the driver’s side. Curved mirrors are allowed for cars’ passenger-side mirrors only if they include the phrase “Objects in mirror are closer than they appear.”

Because of these regulations, Hicks’s mirrors will not be installed on new cars sold in the U.S. any time soon. The mirror may be manufactured and sold as an aftermarket product that drivers and mechanics can install on cars after purchase. Some countries in Europe and Asia do allow slightly curved mirrors on new cars. Hicks has received interest from investors and manufacturers who may pursue opportunities to license and produce the mirror.

The U.S. patent, “Wide angle substantially non-distorting mirror” (United States Patent 8180606) was awarded to Drexel University on May 15, 2012.


Monday, June 4, 2012

An Egg-Boiling Fuzzy Logic Robot


Fuzzy Logic is a Computational Intelligence methodology, suitable for representing knowledge and deciding upon actions. In this video, we present the fundamental aspects of Fuzzy Logic as used in a fictional robotic household appliance. Specifically, this video presents the engineering process of designing a machine that decides for how many minutes to boil an egg. In practice, achieving a desired level of taste, e.g. soft-boiled, depends on various parameters, such as the egg weight, the altitude, and the initial egg temperature. In this video, the altitude and initial egg temperature are considered known.
The fuzzy logic system presented measures the crisp egg weight and, using two membership functions, computes the fuzzy values for "Small" and "Large" egg sizes. Two fuzzy rules are considered: if the egg size is "Small" ("Large"), then boil for "Less" ("More") than 5 minutes. Next, the "Less" and "More" fuzzy values are inferred; finally, the proposed system makes a balanced decision between the two fuzzy values, in analogy to the centre-of-gravity method, to compute the actual boiling time.
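The whole pipeline can be sketched in a few lines. The membership-function breakpoints (40 g and 70 g) and the boiling times attached to "Less" and "More" (4 and 7 minutes) are illustrative assumptions, not values from the video.

```python
# Minimal sketch of the fuzzy egg-boiler. Breakpoints and times are assumed.
def mu_small(weight_g):
    """Membership in "Small": 1 below 40 g, 0 above 70 g, linear in between."""
    return max(0.0, min(1.0, (70.0 - weight_g) / 30.0))

def mu_large(weight_g):
    """Membership in "Large" is the complement of "Small" here."""
    return 1.0 - mu_small(weight_g)

def boiling_time(weight_g, t_less=4.0, t_more=7.0):
    """Balance the two rule outputs by their memberships
    (a centre-of-gravity-style weighted average)."""
    s, l = mu_small(weight_g), mu_large(weight_g)
    return (s * t_less + l * t_more) / (s + l)

print(boiling_time(55.0))  # a mid-weight egg gets a time between 4 and 7 min
```

A 55 g egg is half "Small" and half "Large", so the defuzzified decision lands midway between the two rule outputs.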

Playing Card Recognition Using AForge.Net Framework

Article from Codeproject


Playing card recognition systems can be coupled with a robotic system which acts like a dealer or a human player in a card game, such as blackjack. Implementing this kind of application is also a good example for learning computer vision and pattern recognition.

This article involves binarization, edge detection, affine transformation, blob processing, and template matching algorithms which are implemented in AForge .NET Framework.

Note that this system is based on Anglo-American card decks and may not work for other decks. However, the article describes general methods for the detection and recognition of cards, so the recognition algorithm can be adapted to the features of the deck in use.

Here’s a quick video demonstration.

Card Detection

We need to detect the card objects in the image before we can proceed with recognition. To aid detection, we first apply a sequence of image filters.

As a first step, we apply grayscaling, which converts the color image to an 8-bit grayscale image. This conversion is required before binarization can be applied.

After converting the color image to grayscale, we binarize it. Binarization (thresholding) is the process of converting a grayscale image to a black-and-white image. In this article, Otsu's method is used for global thresholding.
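For readers curious what the AForge.NET OtsuThreshold filter computes internally, here is a minimal pure-Python sketch of Otsu's method: pick the gray level that maximizes the between-class variance of the histogram.

```python
# Pure-Python sketch of Otsu's global threshold over an 8-bit histogram.
def otsu_threshold(pixels, levels=256):
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_b = 0.0               # cumulative sum of gray values in the background class
    w_b = 0                   # background pixel count
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b     # foreground pixel count
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b                 # background mean
        m_f = (sum_all - sum_b) / w_f     # foreground mean
        var_between = w_b * w_f * (m_b - m_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Two well-separated pixel populations: the threshold falls at the lower one.
print(otsu_threshold([10] * 50 + [200] * 50))  # 10
```

In the article itself, this computation is handled entirely by the library filter shown next.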


FiltersSequence seq = new FiltersSequence();
seq.Add(Grayscale.CommonAlgorithms.BT709); // first, the grayscaling filter (BT.709 weights)
seq.Add(new OtsuThreshold());              // then the binarization (thresholding) filter
Bitmap temp = seq.Apply(source);           // Apply returns a new bitmap, so the original image is preserved


Now that we have a binary image, we can proceed with blob processing to detect the cards in the image. For blob processing, we use the AForge.NET BlobCounter class. The class counts and extracts standalone objects in images using a connected-components labeling algorithm.

//Extract blobs whose width and height are between 150 and 350 pixels
BlobCounter extractor = new BlobCounter();
extractor.FilterBlobs = true;
extractor.MinWidth = extractor.MinHeight = 150;
extractor.MaxWidth = extractor.MaxHeight = 350;

After executing the code above, the BlobCounter class filters out (removes) blobs whose width or height is not within [150, 350] pixels. This helps us discriminate cards from any other objects in the image. These filter values can be changed according to the test environment; for example, if the distance between the camera and the table increases, the cards will appear smaller in the image, and the min/max width and height values should be adjusted accordingly.

Now we can get information (edge points, rectangles, center point, area, fullness, etc.) about all blobs by calling extractor.GetObjectsInformation(). However, we only need the edge points of each blob to find the corner points of its quadrilateral. To find the corners, we invoke the PointsCloud.FindQuadrilateralCorners function with the list of edge points.

Read More

A Virtual Opinion

John R. Smith, "A Virtual Opinion," IEEE Multimedia, vol. 19, no. 2, pp. 2-3, April-June 2012, doi:10.1109/MMUL.2012.18

Social media provides new opportunities for sharing health-related data online. Although crowdsourcing medical diagnoses is not yet the trend, people are using social media to seek answers and better understand treatments and outcomes as doctors, experts, and patients converge online.

The idea of crowdsourcing medical diagnosis is crazy, isn’t it? I mean, how could anyone consider putting something as important as their health in the hands of strangers with unknown credentials? Yet, as patients increasingly become the keepers of their own personal electronic medical records, which include all kinds of multimedia data (radiological images, doctor’s notes, and test results), they have the ability to do just that. Beyond the assortment of family doctors, general practitioners, and specialists, and sequences of first-, second-, and higher-order opinions, the crowd too can have a role.

Read the Article

Friday, June 1, 2012

Special Session on "Secure Retrieval and Dissemination of Information (text and image) in Distributed and Wireless specific purpose Environments 2012"

The 16th Panhellenic Conference on Informatics
5-7 October, 2012
Piraeus, Greece

As mobile devices are continuously enhanced with more resources, wireless infrastructures provide support to a growing number of specific-purpose environments. Advances in sensor technology, wireless environments, Information Retrieval, personalization, and Content-Based Image Retrieval introduce new possibilities in various sectors, realizing anytime-anywhere access to multimedia information. This Special Session investigates Secure Retrieval and Dissemination of Information (text and image) in Distributed and Wireless specific purpose Environments (SECRET_DIDWE).

The SECRET_DIDWE session aims at providing researchers and professionals with an insight on:

1. Wireless architectures to enable authorized users to access sensitive information in a secure and transparent manner.

2. Policy-based architectures utilizing wireless sensor devices, advanced network topologies and software agents. Applications related to remote monitoring of patients, elderly people, etc.

3. Evaluation and integration of Information Retrieval and personalization techniques: Text Retrieval and Content-Based Image Retrieval techniques. Classification based on various techniques, e.g. neural network techniques and fuzzy techniques, and their applications.

Paper contributions from the industry, government, business, academia and research are expected.


Topics of interest include, but are not limited to the following:

  • Security issues in IT applications. Benchmarks and evaluation of the applications
  • Specific purpose Wireless architectures
  • Agent based architectures. Transparent Information transfer using intelligent agents.
  • Authentication and information access and retrieval in ad-hoc networks and self-organized networks
  • Policies and policy-based architectures
  • Novel methods for Text Retrieval, personalization, and Content Based Image Retrieval and application in specific purpose environments
  • Integrated techniques for extracting information content. Classification.
  • Transparent and secure communication in distributed environments
  • Advanced remote medical treatment services through pervasive environments- Remote monitoring of patients and elderly people
  • Neural network techniques and their applications, e.g. SVM-based systems that support diagnosis


Paper submission:

Authors are invited to submit original manuscripts, in English, limited in length to six (6) pages. The required format is IEEE double-column (available in Word and LaTeX formats). Instructions for paper submission are available on the conference site.

All submitted papers will undergo a peer review process, coordinated by the Special Session Chairs. Authors are invited to send their manuscripts electronically, in PostScript or PDF format, to the Special Session Chairs by e-mail, by June 8, 2012.

The PCI 2012 proceedings will be published by the IEEE Computer Society, Conference Publishing Services (CPS), and distributed at the conference (pending approval). IEEE CPS arranges for indexing through Thomson ISI, IEE (INSPEC), EI (Compendex), and other indexing services, and archives the publication in IEEE Xplore and the IEEE Computer Society Digital Library (CSDL).