ICDAR 2013 HIP WORKSHOP FAMILYSEARCH COMPETITION


The 2013 FamilySearch Competition is the first in a series of challenges relating to handwriting recognition within historical documents. The evaluation takes place in the spring and summer of 2013 and culminates at the HIP pre-conference workshop of ICDAR 2013 in Washington, D.C., in August 2013.

Abstract

FamilySearch International (FSI) (https://familysearch.org/), a non-profit organization based in Utah in the United States, promotes genealogical research around the world. It gathers and publishes large collections of historical documents on the Internet.

The competition provides participants with a large number of handwritten Mexican marriage records (based on a pre-printed form; see the appendix for an example). Because FSI often collects and digitizes image collections several years before they can be transcribed, it frequently wishes to create a chronological and/or geographic browse structure that lets patrons navigate the images online without a search-by-name option.

Participants will be asked to group a scrambled collection of these records by the contents of certain sub-regions of the document. These sub-regions contain geographic and chronological information. Specifically, the competition requires documents to be grouped and discriminated based on the contents of the four labeled red regions of interest (ROIs) overlaid on the Mexican Marriage Record shown in the Appendix.

Competition participants will be evaluated based on correct classification of these images relative to ground truth created by human experts. Classification correctness will be measured using the B-cubed algorithm described below.

First and second place winners will receive a cash prize of $2,000 and $1,000, respectively, and an additional $500 prize may be awarded at judges’ discretion to an entry with outstanding technical merit.

Background

FamilySearch International (FSI) (https://familysearch.org/) is a non-profit organization dedicated to helping people all over the world find and connect with their ancestors. It has gathered and published historical records of all kinds, including census returns, civil registration, probate and land records, military documents, church books, compiled genealogies, local histories, and much more.

Ideally these records are transcribed so that researchers can find their ancestors with a straightforward name search. This transcription activity is performed by human experts, however, and so is extremely expensive. The organization cannot create transcriptions of historical records as fast as it can digitize and publish them.

It is for this reason that FSI publishes “browse-only” collections of digitized images online. Each such collection is currently reviewed by genealogical experts to find the natural structure in the document flow (e.g., alphabetical order of case files, a geographic structure to the way documents are preserved, year markers as a church book was recorded chronologically through time – any structural or contextual division of historical records that helps researchers to restrict their browsing to a portion of the records rather than examining the entire corpus from beginning to end).

Once these browse structures are created, patrons use the resulting hierarchy to navigate through large un-transcribed image sets to find just those groups of historical documents that are of interest to them.

While the process of creating these browse structures is much less expensive than transcription by human operators, it is still time consuming. Operators assigned to the task generally do not review every image; instead, they view large collections of thumbnail images and look for obvious changes in the image stream (e.g., a separator page between sections, the beginning of a new bound volume, a new page format, etc.). See Figure 1 for a typical operator view of thumbnail images in a group of historical images.

FSI would like to increase the speed at which browse structures can be created for un-transcribed document collections, and also to create more granular browse structures than are currently feasible with a limited number of human experts.

Figure 1: Thumbnail view of several images from the training corpus

Competition Parameters

FSI is interested in algorithms, methods, and approaches to this problem that automate or significantly aid human operators in identifying changes in document structure and content so that resulting browse structures can be created more quickly and more granularly.

Researchers will be provided with a few representative samples of 20th century Mexican marriage records. It is important to note that in both the training and test sets, each image may vary from its neighbors slightly in terms of scale, pan, and skew. These representative samples will contain region-of-interest (ROI) polygons that indicate valuable document content that FSI wishes to use to create browse structures.

These ROIs correspond to four key pieces of information on the images:

  • month of marriage
  • year of marriage
  • “origin,” i.e., hometown of groom
  • “origin,” i.e., hometown of bride

Competition participants will be provided with a training set of 10,000 such digital images. They will also be provided with ground truth in the form of a text file that indicates the correct grouping of the 10K images for each of the four ROIs.
The task of the participant will be to group the images of the test set correctly according to the content of these ROIs. Each image will be grouped four times, once per ROI. It is not necessary to recognize the content of the ROIs or to assign semantic value to them; participants need only assign images correctly to groups, where all images in a group share the same value for the corresponding ROI.

Algorithms, processes, or approaches developed using the training set will be applied to the test set. The test set consists of 20,000 images, an undisclosed 50% of which will be scored against ground truth to determine competition winners. The test set images are in no particular order and will have scrambled filenames.

Even partial success in the competition represents potentially thousands of hours of effort saved on the part of human operators. More importantly, genealogical researchers, who use this content at no charge on the Internet, will have a much easier task in locating their ancestors in the musty records of the past.

Related Previous Competitions

Book Structure Extraction Competition (ICDAR 2009, ICDAR 2011)

Page Segmentation Competition (ICDAR 2009)

Historical Document Layout Analysis (ICDAR 2011)

Competition Timeline

  • 10K training images and data will be made available: 01 June 2013
  • Test images will be made available: 01 Aug 2013
  • Evaluation results due: 11:59pm MST on Wednesday, 07 Aug 2013
  • Supporting documentation* due: 11:59pm MST on Monday, 12 Aug 2013
  • Competition winners notified: 15 Aug 2013
  • Prizes awarded: 24 Aug 2013 @ HIP2013 in Washington, D.C.

*A short paper describing the approach taken in the submission; see Evaluation Delivery Specifications, below.

First and second place winners will receive a cash prize of $2,000 and $1,000, respectively, and an additional $500 prize may be awarded at judges’ discretion to an entry with outstanding technical merit.** These prizes will be awarded at the HIP2013 pre-conference workshop @ ICDAR 2013 in Washington, D.C.

**Note that under the B-cubed score (see Scoring), a system that does no merging (sometimes referred to as a "shatter-all" baseline), or one that merges all documents into a single cluster, may receive a fair to moderate score. The judges will report the results of these two baselines as part of the overall collection of results for the evaluation. If no system outperforms these simple baselines by 10%, the judges reserve the right to award no prize money.

Evaluation Delivery Specifications

Participants will deliver a text file to the FSI point-of-contact (see below) by the evaluation results date. Formatting instructions for the text file are as follows:

  • Each line of the text file relates to one image in the test set;
  • Each line contains five pieces of information, separated by <TAB> characters: image filename, groupID1, groupID2, groupID3, groupID4

(Each groupIDn is an alpha-numeric identifier the system uses to label the classes it finds for a particular field. For example, for the "month" field we expect roughly twelve separate categories. A system might determine that there appears to be a category "Abril" and label all documents it assigns to that category with a groupID of "2". It might find a separate cluster of documents, perhaps a "Diciembre", which it labels with a groupID of "3", and so forth. The alpha-numeric groupID is merely a system-provided label for each set; it need not carry any semantic meaning.)
Thus, the first few lines of a sample text file might look like this:

image filename   Month of marriage   Hometown groom   Year of marriage   Hometown bride
0000001.jpg      2                   34               13                 7
0000002.jpg      3                   117              13                 55
0000003.jpg      11                  432              2                  181
0000004.jpg      5                   536              7                  203
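
As a concrete illustration, here is a minimal Python sketch that writes results in this format. The filenames and group IDs are invented for the example; only the <TAB>-separated five-column layout (filename plus one groupID per ROI, in the column order shown above) comes from the specification.

    # Minimal sketch: write clustering results in the competition's
    # tab-separated format. The group IDs below are hypothetical; each
    # tuple holds one system-assigned ID per ROI, in the column order
    # of the sample above (month, groom hometown, year, bride hometown).
    results = {
        "0000001.jpg": ("2", "34", "13", "7"),
        "0000002.jpg": ("3", "117", "13", "55"),
    }

    with open("results.txt", "w") as out:
        for filename in sorted(results):
            out.write(filename + "\t" + "\t".join(results[filename]) + "\n")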

Additionally, competition participants are asked to submit a short paper (i.e., no longer than five pages) describing their work by Monday, 12 August 2013. The paper will help judges to weigh technical merit in the submission. Participants are free to publish the paper anywhere.

Judging and Points of Contact

Alan Cannaday, a member of the research staff at FamilySearch International, is the point of contact for questions regarding the competition. All questions and answers will be distributed to registered competition participants via an e-mail list that all participants will be asked to join. Alan may be reached at acannaday@familysearch.org; please use the subject heading “ICDAR 2013 HIP Workshop FamilySearch Competition” in communications with Alan.

Judging of the competition will be performed by an evaluation committee comprised of individuals appointed by FamilySearch.

Scoring

When a system proposes its clustered results, it may present a collection of clusters different from that of the truth data. For example, suppose there were only 10 documents (represented as d1, d2, ..., d10) to be clustered, and that for one of the fields of interest the documents should have been grouped as follows: {d1, d5, d10}, {d2}, {d3, d9}, {d4, d6, d7}, {d8}. Suppose, too, that the system hypothesizes that the documents should be clustered as {d1, d5}, {d2, d10}, {d3}, {d4, d6, d7, d8, d9} for that field. An evaluation must have a score that can account for this kind of partial set merging and splitting. A frequently used scoring technique for sets is the B-cubed method (Bagga and Baldwin, 1998). It is simple to describe and implement, and we use it as the basis for this competition's score.

The B-cubed algorithm begins by computing a precision and recall for each document using the intersection of the document's hypothesized and truth sets. It then averages the document-wise precisions and recalls to form an average precision and average recall for the collection. Since we would like a single number for our results, we compute a final F-score from the average precision and average recall (where the F-score is defined as F = 2PR/(P + R), the harmonic mean of precision and recall); this is the exact metric we will use to evaluate systems for a given field.

If Ta is the truth set for document a, Ha is the hypothesis set for document a, and |·| denotes the number of elements in a set (its cardinality), then the document-wise precision is defined as P = |Ta ∩ Ha| / |Ha| and the document-wise recall as R = |Ta ∩ Ha| / |Ta|.

For example, suppose that the document we use for "a" is d1.  In the truth set listed above, d1 should have been in the set {d1,d5,d10}.  Yet it was hypothesized to be in a set {d1,d5}.  Its precision is 2/2 (100%) because all the documents in its hypothesis set are correct.  However, its recall is 2/3 because only two of the three documents of its truth set are attested to in its hypothesis set.  For the hypotheses of the 10 documents mentioned before, we would obtain the document-wise precisions and recalls indicated in Table 1.

Table 1: Document-wise precision/recall information of Example Scenario

Entry   Prec.   Recall        Entry   Prec.   Recall
d1      2/2     2/3           d6      3/5     3/3
d2      1/2     1/1           d7      3/5     3/3
d3      1/1     1/2           d8      1/5     1/1
d4      3/5     3/3           d9      1/5     1/2
d5      2/2     2/3           d10     1/2     1/3

In this example, we average the precision columns and the recall columns to form the average precision and average recall. The average precision is 62/100 and the average recall is 46/60. The final score, the overall F-score for this situation, is 0.686.
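
For concreteness, a short Python sketch of this computation (not the official scoring code) reproduces the numbers above from the example truth and hypothesis clusterings:

    from fractions import Fraction

    # Example clusterings from the text: each inner set is one cluster.
    truth = [{"d1", "d5", "d10"}, {"d2"}, {"d3", "d9"},
             {"d4", "d6", "d7"}, {"d8"}]
    hypothesis = [{"d1", "d5"}, {"d2", "d10"}, {"d3"},
                  {"d4", "d6", "d7", "d8", "d9"}]

    def cluster_of(doc, clustering):
        """Return the cluster that contains doc."""
        return next(c for c in clustering if doc in c)

    docs = sorted(set().union(*truth))
    precisions, recalls = [], []
    for d in docs:
        T, H = cluster_of(d, truth), cluster_of(d, hypothesis)
        overlap = len(T & H)
        precisions.append(Fraction(overlap, len(H)))  # P = |T ∩ H| / |H|
        recalls.append(Fraction(overlap, len(T)))     # R = |T ∩ H| / |T|

    P = sum(precisions) / len(docs)   # 62/100
    R = sum(recalls) / len(docs)      # 46/60
    F = 2 * P * R / (P + R)           # ≈ 0.686
    print(float(P), float(R), float(F))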

Weighting of Results, Determination of Winner

Of the four groups into which each image must be classified, some are more genealogically important than others.

Group   Content                              Scoring Weight
1       month of marriage                    2
2       year of marriage                     3
3       “origin,” i.e., hometown of groom    5
4       “origin,” i.e., hometown of bride    5

Therefore the score for each group will be weighted according to the above table, where weight increases with importance. Weighted scores will be summed to produce an overall composite score. The highest composite score receives first prize; the second highest receives second prize. In the event that participants tie for first and/or second place, the corresponding prize money will be evenly distributed among the tied participants.
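
As a sketch of how such a composite might be computed, assuming one B-cubed F-score per ROI (the per-field F-scores below are invented for illustration; only the weights and the summing rule come from the text):

    # Weights from the table above; the F-scores per field are
    # hypothetical examples, not real results.
    weights = {"month": 2, "year": 3, "groom_origin": 5, "bride_origin": 5}
    f_scores = {"month": 0.91, "year": 0.88,
                "groom_origin": 0.74, "bride_origin": 0.70}

    # Weighted scores are summed to produce the composite score.
    composite = sum(weights[field] * f_scores[field] for field in weights)
    print(round(composite, 2))  # 11.66 for these example values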

Appendix A: Sample Page of Mexican Marriage Records with Chronological and Geographic ROIs Outlined in Red
