Pages

Thursday, August 6, 2009

New Robust OCR dataset

Article From Machine Learning, etc

I've collected this dataset for a project that involves automatically reading bibs in pictures of marathons and other races. This dataset is larger than robust-reading dataset of ICDAR 2003 competition with about 20k digits and more uniform because it's digits-only. I believe it is more challenging than the MNIST digit recognition dataset.
I'm now making it publicly available in hopes of stimulating progress on the task of robust OCR. Use it freely, with only requirement that if you are able to exceed 80% accuracy, you have to let me know ;)
The dataset file contains raw data (images), as well as Weka-format ARFF file for simple set of features.
For completeness I include matlab script used to for initial pre-processing and feature extraction, Python script to convert space-separated output into ARFF format. Check "readme.txt" for more details. <Yaroslav Bulatov>
Dataset

No comments: