Updated on Fri, 19 Aug, 2022
This is a set of results from the AWS Textract service, demonstrating its ability to handle handwritten text. I processed about 5000 handwritten pages from the 1920s-40s, mostly letters or rough notes, and then randomly selected 50 pages for this demonstration. I manually prepared a ground truth file for each sample page and used fastwer to calculate the Character Error Rate (CER) and Word Error Rate (WER). In the table below, the sample pages are sorted by CER from best to worst. Click a sample image to see the page image and the text returned by Textract.
The Python script used for this testing is available in the main branch.
Notes on method
- I removed the text of any printed letterheads from the OCR and the ground truth, so as to evaluate only the handwritten text
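
For readers unfamiliar with the two metrics, here is a minimal sketch of how CER and WER can be computed from an edit distance. It is not fastwer's code and not the script used for these results: the function names are my own, and details such as whitespace handling are assumptions, so its numbers may differ slightly from what fastwer reports.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if len(ref_units) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    # Assumption: every character counts, including spaces.
    return error_rate(reference, hypothesis)


def wer(reference, hypothesis):
    # Assumption: words are whatever str.split() produces.
    return error_rate(reference.split(), hypothesis.split())


ground_truth = "the cat sat"
ocr_output = "the cat sit"
print(round(cer(ground_truth, ocr_output), 1))  # one wrong character out of 11
print(round(wer(ground_truth, ocr_output), 1))  # one wrong word out of 3
```

A single wrong character makes the whole word wrong, which is why the WER values in the table are so much higher than the CER values.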
Observations
- wide line spacing is good
- it’s remarkably good at ignoring printed lines or graph patterns
- sloped lines are very bad (see corr.1930-32_C_3611_002, the last sample): Textract doesn’t do well at following the slope. I am using order_blocks_by_geo, which sorts the blocks by their y coordinates and fixes the fairly common problem of getting two lines out of order, but with sloped lines it ends up mingling the words of two consecutive lines. If you want the words and don’t care about the order, the results are usable.
- when it’s good, it is quite good, probably because the handwriting on the page closely matches the hands Textract was trained on
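
The sloped-line problem can be illustrated with a small sketch. This is not the implementation of order_blocks_by_geo: the `Word` tuple, the `order_by_top` function, and all coordinates are hypothetical, chosen only to show why a purely y-based ordering works for level lines and mingles words when the lines slope.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    left: float  # x of the word's left edge, as a fraction of page width
    top: float   # y of the word's top edge, as a fraction of page height


def order_by_top(words):
    """Order words by y coordinate first, then by x (purely geometric)."""
    return [w.text for w in sorted(words, key=lambda w: (w.top, w.left))]


# Two level lines: every word in a line shares the same top coordinate.
level = [
    Word("Dear", 0.10, 0.10), Word("Sir", 0.30, 0.10), Word("thank", 0.50, 0.10),
    Word("you", 0.10, 0.15), Word("for", 0.30, 0.15), Word("writing", 0.50, 0.15),
]

# The same two lines sloping downward to the right: the end of the first
# line sits lower on the page than the start of the second.
sloped = [
    Word("Dear", 0.10, 0.10), Word("Sir", 0.30, 0.13), Word("thank", 0.50, 0.16),
    Word("you", 0.10, 0.15), Word("for", 0.30, 0.18), Word("writing", 0.50, 0.21),
]

print(order_by_top(level))   # reading order is preserved
print(order_by_top(sloped))  # "you" lands in the middle of the first line
```

All the words are still present in the sloped case, only their order is wrong, which matches the observation that the results remain usable if order does not matter.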
Table
| File | Sample (click for OCR and page image) | CER (%) | WER (%) |
|---|---|---|---|
| corr.1924-27_W_711_001 |  | 1.3 | 6.9 |
| corr.1937-38_B_8569_001 |  | 3.7 | 17.1 |
| corr.1924-27_H_570_002 |  | 6.1 | 16.3 |
| misc_7_2288_001 |  | 8.0 | 28.6 |
| corr.1933-40_B_4902_007 |  | 8.5 | 38.4 |
| misc_8_2311_002 |  | 8.9 | 19.0 |
| topics_186_14618_001 |  | 10.0 | 31.2 |
| I-D_Harvard2A2B-lecture-25_15870_003 |  | 10.3 | 31.2 |
| corr.1931-33_P_7705_001 |  | 10.4 | 36.0 |
| misc_43_9198_002 |  | 11.4 | 27.1 |
| I.A_LibComm_15017_001 |  | 11.9 | 28.0 |
| corr.1927-29_W_2070_003 |  | 12.1 | 36.5 |
| misc_31_7242_001 |  | 12.8 | 34.2 |
| corr.1931-33_N_7691_002 |  | 13.1 | 37.6 |
| corr.1927-29_B_953_003 |  | 13.3 | 33.0 |
| ww_II.X_14363_001 |  | 14.0 | 37.9 |
| misc_8_2311_007 |  | 14.3 | 25.9 |
| corr.1927-29_B_898_001 |  | 17.0 | 40.3 |
| misc_65_2495_001 |  | 20.0 | 29.8 |
| corr.1927-29_W_2145_003 |  | 20.2 | 47.5 |
| corr.1937-38_R_8652_001 |  | 21.0 | 44.3 |
| corr.1927-29_C_1091_001 |  | 21.8 | 62.5 |
| corr.1927-29_C_1103_002 |  | 22.9 | 47.7 |
| corr.1927-29_D_1279_005 |  | 23.5 | 60.0 |
| topics_126_14589_002 |  | 24.0 | 49.6 |
| I-D_Harvard2A2B-lecture-25_15876_001 |  | 25.4 | 56.1 |
| corr.1930-32_W_4765_001 |  | 25.9 | 60.7 |
| corr.1938-43_R_8234_001 |  | 26.7 | 56.8 |
| I.A_ReadLists_14723_005 |  | 28.4 | 65.8 |
| IV-C-1_SlavicL_14678_006 |  | 28.8 | 75.0 |
| corr.1927-29_W_2142_003 |  | 28.9 | 60.5 |
| CWRU_3HB5-2-10_13509_001 |  | 29.7 | 59.8 |
| books_diaries_13207_008 |  | 30.0 | 69.2 |
| I-D_Harvard2A2B-lecture-03_15716_001 |  | 30.4 | 76.0 |
| corr.1924-27_W_733_004 |  | 31.3 | 60.1 |
| I.A_LibCatsLowCost_14979_001 |  | 33.3 | 80.0 |
| corr.1927-29_W_2094_001 |  | 34.3 | 51.9 |
| hgww_1921_8773_002 |  | 36.7 | 70.6 |
| corr.1927-29_L_1615_001 |  | 37.0 | 65.6 |
| misc_64_14529_018 |  | 37.0 | 67.1 |
| hgww_1919_120_014 |  | 39.0 | 69.0 |
| ww_I.W_14452_001 |  | 39.0 | 70.0 |
| CKB_19_13598_001 |  | 39.7 | 84.6 |
| I.A_HistDeptPol_14863_001 |  | 39.8 | 85.3 |
| misc_37_15484_003 |  | 41.2 | 85.7 |
| books_diaries_13207_023 |  | 43.1 | 94.7 |
| books_diaries_13206_100 |  | 46.8 | 80.5 |
| misc_41_7396_002 |  | 53.3 | 100.0 |
| corr.1930-32_C_3611_002 |  | 61.5 | 83.2 |