Thursday, October 8, 2015
Computers can recognise a complication of diabetes that can lead to blindness
ARTIFICIAL intelligence (AI) can sometimes be put to rather whimsical uses. In 2012 Google announced that one of its computers, after watching thousands of hours of YouTube videos, had trained itself to identify cats. Earlier this year a secretive AI firm called DeepMind, bought by Google in 2014, reported in Nature that it had managed to train a computer to play a series of classic video games, often better than a human could, using nothing more than the games’ on-screen graphics.
But the point of such diversions is to illustrate that, thanks to a newish approach going by the name of "deep learning", computers increasingly possess the pattern-recognition skills—identifying faces, interpreting pictures, listening to speech and the like—that were long thought to be the preserve of humans. Researchers, from startups to giant corporations, are now planning to put AI to work to solve more serious problems.
One such organisation is the California HealthCare Foundation (CHCF). The disease in the charity’s sights is diabetic retinopathy, one of the many long-term complications of diabetes. It is caused by damage to the tiny blood vessels that supply the retina. Untreated, it can lead to total loss of vision. Around 80% of diabetics will develop retinal damage after a decade; in rich countries it is one of the leading causes of blindness in the young and middle-aged. Much of the damage can be prevented with laser treatment, drugs or surgery if caught early, but there are few symptoms at first. The best bet is therefore to offer frequent check-ups to diabetics, with trained doctors examining their retinas for subtle but worrying changes.
But diabetes is common and doctors are busy. Inspired by recent advances in AI, the CHCF began wondering if computers might be able to do the job of examining retinas more cheaply and quickly.
Being medics, rather than AI researchers, the CHCF turned for help to a website called Kaggle, which organises competitions for statisticians and data scientists. (It was founded by Anthony Goldbloom, who once worked as an intern at The Economist.) The CHCF uploaded a trove of thousands of images of retinas, both diseased and healthy, stumped up the cash for a $100,000 prize, and let Kaggle’s members—who range from graduate students to teams working for AI companies—get to grips with the problem.
Wednesday, October 7, 2015
Text of ISO/IEC CD 15938-14 Reference software, conformance and usage guidelines for compact descriptors for visual search
This part of the MPEG-7 standard provides the reference software, specifies the conformance testing, and gives usage guidelines for ISO/IEC 15938-13: Compact descriptors for visual search (CDVS). CDVS specifies an image description tool designed to enable efficient and interoperable visual search applications, allowing visual content matching in images. Visual content matching includes matching of views of objects, landmarks, and printed documents, while being robust to partial occlusions as well as changes in viewpoint, camera parameters, and lighting conditions. This document is a Committee Draft (CD) text for ballot consideration and comment for ISO/IEC 15938-14: Reference software, conformance and usage guidelines for compact descriptors for visual search.
Files: w15371.zip
Objects2action: Classifying and localizing actions without any video example
The ICCV 2015 paper Objects2action: Classifying and localizing actions without any video example by Mihir Jain, Jan van Gemert, Thomas Mensink and Cees Snoek is now available. The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches, the authors do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow transfer from seen classes to unseen classes. The key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. The semantic embedding has three main characteristics to accommodate the specifics of actions. First, the authors propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, they incorporate the automated selection of the most responsive objects per action. And finally, they demonstrate how to extend the zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of the approach.
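As a loose illustration of this idea (not the authors' code), the sketch below scores actions for a video from its object-classifier responses via affinities in a shared embedding space; the embed function and the exact selection and weighting details are assumptions made for the sake of the example:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def zero_shot_action_scores(object_scores, object_labels, action_labels, embed, top_objects=5):
    """object_scores: numpy array of per-video responses of pre-trained object classifiers.
    embed: hypothetical helper mapping a (multi-word) label to a word-embedding vector,
    e.g. by averaging skip-gram word vectors."""
    object_vecs = [embed(o) for o in object_labels]
    scores = {}
    for action in action_labels:
        a_vec = embed(action)
        # affinity between the action and every object, measured in the embedding space
        affinities = np.array([cosine(a_vec, o_vec) for o_vec in object_vecs])
        # keep only the most affine (assumed "most responsive") objects for this action
        top = np.argsort(affinities)[-top_objects:]
        # convex combination of the video's object responses, weighted by affinity
        weights = affinities[top] / (affinities[top].sum() + 1e-12)
        scores[action] = float(np.dot(weights, object_scores[top]))
    return scores
```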
Article from http://www.ceessnoek.info/
A search runtime analysis of LIRE on 500k images
Article from http://www.semanticmetadata.net/
Runtime for search in LIRE depends heavily on the method used for indexing and search. There are two main ways to store the data, two strategies for linear search, and of course approximate indexing. The two storage strategies are (i) storing the actual feature vector in a Lucene text field and (ii) using the Lucene DocValues data format. While the former allows for easy access, more flexibility and compression, the latter is much faster when accessing raw byte[] data. Linear search then needs to open each and every document and compare the query vector to the one stored in the document. For linear search on Lucene text fields, caching boosts performance, so the byte[] data of the feature vectors is read once from the index and kept in memory. For the DocValues storage format, access is fast enough to allow linear search directly. With approximate indexing, a query string is run against the inverted index and only the first k best-matching candidates are re-ranked by linear search to find the n << k actual results [1]. So first a text search is done, then a linear search over far fewer images. In our tests we used k=500 and n=10.
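The two-stage procedure can be sketched in a few lines of Python (an illustration of the idea rather than LIRE's Java API; candidate_index and feature_store are hypothetical stand-ins for the inverted index and the stored feature data):

```python
import numpy as np

def approximate_search(query_vector, candidate_index, feature_store, k=500, n=10):
    """candidate_index: callable returning the ids of the k best text-query matches.
    feature_store: dict mapping image id -> feature vector (numpy array), i.e. the
    raw data LIRE keeps in a Lucene text field or in DocValues."""
    candidate_ids = candidate_index(query_vector, k)          # cheap, approximate step
    dists = [(np.linalg.norm(feature_store[i] - query_vector), i)
             for i in candidate_ids]                          # exact, but only over k items
    return [i for _, i in sorted(dists)[:n]]                  # the n << k final results
```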
Tests on 499,207 images show that at this order of magnitude approximate search already outperforms linear search. The following numbers are given in ms per search. Note that the average value per search differs for different numbers of test runs due to the context of the runs, i.e. the state of the Java VM, OS processes, file system caches, etc., but the trend is clear.
(*) Start-up latency when filling the cache was 6.098 seconds
(**) Recall with 10 results over ten runs was 0.76; over 100 runs recall was 0.72
In conclusion, with nearly 500,000 images the DocValues approach might be the best choice, as approximate indexing loses around 25% of the results while not improving runtime performance all that much. Further optimizations could include, for instance, query bundling or index splitting combined with multithreading.
[1] Gennaro, Claudio, et al. “An approach to content-based image retrieval based on the Lucene search engine library.” Research and Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2010. 55-66.
What, Where and How? Introducing pose manifolds for industrial object manipulation
In this paper we propose a novel method for object grasping that aims to unify robot vision techniques for efficiently accomplishing the demanding task of autonomous object manipulation. Through ontological concepts, we establish three mutually complementary processes that lead to an integrated grasping system able to answer the conjunctive queries "What?", "Where?" and "How?". For each query, the appropriate module provides the necessary output based on ontological formalities. The "What?" is handled by a state-of-the-art object recognition framework. A novel 6-DoF object pose estimation technique, which entails a bunch-based architecture and a manifold modeling method, answers the "Where?". Last, "How?" is addressed by an ontology-based semantic categorization enabling a sufficient mapping between visual stimuli and motor commands.
http://www.sciencedirect.com/science/article/pii/S0957417415004418
SIMPLE Descriptors
SIMPLE [Searching Images with Mpeg-7 (& Mpeg-7 like) Powered Localized dEscriptors] began as a collection of four descriptors [Simple-SCD, Simple-CLD, Simple-EHD and Simple-CEDD (or LoCATe)]. The main idea behind SIMPLE is to utilize global descriptors as local ones. To do this, the SURF detector is employed to define regions of interest on an image, and instead of the SURF descriptor, one of the MPEG-7 SCD, MPEG-7 CLD, MPEG-7 EHD or CEDD descriptors is used to extract the features of those image patches. Finally, the Bag-of-Visual-Words framework is used to test the performance of those descriptors in CBIR tasks. More recently, SIMPLE was extended from a collection of descriptors to a scheme (a combination of a detector and a global descriptor). Tests have also been carried out with other detectors [the SIFT detector and two random image-patch generators (the random generator has produced the best results and is the preferred choice)], and the performance of the scheme with more global descriptors is currently being tested.
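As a rough sketch of the idea (not the authors' reference implementation, which ships in C#, Java and MATLAB), the snippet below uses OpenCV's SURF detector to pick salient patches and a plain HSV histogram as a stand-in for a global descriptor such as CEDD:

```python
import cv2
import numpy as np

def simple_like_features(image_path, max_keypoints=100, patch_size=40):
    """Detect salient patches with SURF, then describe each patch with a
    *global* descriptor (here a coarse HSV histogram stands in for CEDD)."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # detector only, no SURF descriptor
    keypoints = surf.detect(gray, None)[:max_keypoints]
    features = []
    r = patch_size // 2
    for kp in keypoints:
        x, y = int(kp.pt[0]), int(kp.pt[1])
        patch = img[max(0, y - r):y + r, max(0, x - r):x + r]
        if patch.size == 0:
            continue
        hsv = cv2.cvtColor(patch, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [8, 4], [0, 180, 0, 256])
        features.append(cv2.normalize(hist, None).flatten())
    return np.array(features)

# The per-patch vectors would then be quantised against a visual codebook
# (e.g. k-means) and aggregated into a Bag-of-Visual-Words vector per image.
```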
Searching Images with MPEG-7 (& MPEG-7 Like) Powered Localized dEscriptors (SIMPLE)
A set of local image descriptors specifically designed for image retrieval tasks
Image retrieval problems were first confronted with algorithms that tried to extract the visual properties of an image in a global manner, following the human instinct of evaluating an image's content as a whole. Experimenting with retrieval systems and evaluating their results, especially on cluttered images and images where objects appear partially occluded, showed that correctly ranked results owe more to the salient regions of an image than to the overall depiction. Thus, representing an image by its points of interest proved to be a more robust solution. SIMPLE descriptors emphasize and incorporate the characteristics that allow a more abstract but retrieval-friendly description of the image's salient patches.
Experiments were conducted on two well-known benchmarking databases. Initially, experiments were performed on the UKBench database, which consists of 10,200 images separated into 2,550 groups of four images each. Each group includes images of a single object captured from different viewpoints and under different lighting conditions. The first image of every object is used as a query image; to evaluate the approach, the first 250 query images were selected. The search was executed over all 10,200 images. Since each ground truth includes only four images, the P@4 measure was used to evaluate the early positions.
Subsequently, experiments were performed on the UCID database. This database consists of 1,338 images on a variety of topics, including natural scenes and man-made objects, both indoors and outdoors. All UCID images were subjected to manual relevance assessments against 262 selected query images.
In the tables that illustrate the results, wherever the BOVW model is employed, only the best result achieved by each descriptor with every codebook size is presented. In other words, for each local feature and each codebook size, the experiment was repeated for all 8 weighting schemes but only the best result is listed in the tables. Next to each result, the weighting scheme with which it was achieved is noted, using the SMART (System for the Mechanical Analysis and Retrieval of Text) notation.
Experimental results of all 16 SIMPLE descriptors on the UKBench and UCID datasets. MAP results in bold mark performances that surpass the baseline; grey-shaded results mark the highest performance achieved per detector.
Read more and download the open source implementation of the SIMPLE descriptors (C#, Java and MATLAB)
Tuesday, October 6, 2015
HOW TO BUILD A MACHINE LEARNING MODEL WITH THE GOOGLE PREDICTION API
While not widely understood, machine learning has been easily accessible since the Google Prediction API was released in 2011. It has applications in a wide variety of fields, and this tutorial by Alex Casalboni on the Cloud Academy blog is a useful place to start learning how to build a machine learning model with the Google Prediction API.
The API offers a RESTful interface for training a machine learning model and is considered a "black box" because users have little access to its internal configuration. This leaves users with only the choice between "classification" and "regression", or the option of applying a PMML (Predictive Model Markup Language) file with weighting parameters for categorical models.
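As an illustration, training a model with the v1.6 Python client boils down to a single trainedmodels().insert() call in which modelType selects between classification and regression (a minimal sketch; the project, bucket and model names are placeholders, and the training CSV is assumed to already sit in Cloud Storage):

```python
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

# Placeholder names; replace with your own project, bucket and model id.
PROJECT, MODEL_ID = 'my-project', 'har-model'

credentials = GoogleCredentials.get_application_default()
service = discovery.build('prediction', 'v1.6', credentials=credentials)

# Kick off training; the Prediction API reads the CSV straight from Cloud Storage.
service.trainedmodels().insert(project=PROJECT, body={
    'id': MODEL_ID,
    'storageDataLocation': 'my-bucket/har_training.csv',
    'modelType': 'classification',  # or 'regression'
}).execute()
```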
The tutorial begins with some brief definitions and then explains how to upload your dataset to Google Cloud Storage, as required by the Google Prediction API. Since the API does not provide a user-friendly web interface, the tutorial switches to Python scripts, using an API call to obtain the modelDescription field, which contains a confusionMatrix structure that shows how the model behaves.
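Along those lines, the confusion matrix can be pulled out of the model description roughly as follows (a sketch assuming the v1.6 client's analyze() call is the one exposing modelDescription; the tutorial's exact script may differ):

```python
# Once training has finished, inspect how the model behaves on the training data.
analysis = service.trainedmodels().analyze(project=PROJECT, id=MODEL_ID).execute()
confusion_matrix = analysis['modelDescription']['confusionMatrix']
for true_label, row in confusion_matrix.items():
    print(true_label, row)  # how often each true class was predicted as each label
```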
The dataset is later split into two smaller sets: one to train the model and a second to evaluate it. Users are then shown how to generate new predictions via an API call that returns two values: the predicted activity and a reliability measure for each class.
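A prediction request then looks roughly like this (again a sketch; csvInstance holds a feature row in the same column order as the training data, minus the label, and the values shown are made up):

```python
# Ask the trained model to classify a new feature row.
result = service.trainedmodels().predict(project=PROJECT, id=MODEL_ID, body={
    'input': {'csvInstance': [0.28, -0.02, -0.12]}  # truncated example features
}).execute()

print(result['outputLabel'])   # the predicted activity
print(result['outputMulti'])   # per-class labels with their reliability scores
```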
The open dataset used here comes from the UCI Machine Learning Repository and is used to train a multi-class model for HAR (Human Activity Recognition). It was collected from smartphone accelerometer and gyroscope data and manually labelled, with each record assigned one of six activities (walking, sitting, walking upstairs, lying down, etc.). Once trained as described in the tutorial, the model can associate sensor data with the different activities, as would be needed in activity-tracking devices or healthcare monitoring.