Updated on Fri, 19 Aug, 2022

This is a set of results from the AWS Textract service, demonstrating its ability to handle handwritten text. I processed about 5000 handwritten pages from the 1920s-40s, mostly letters or rough notes, and then randomly selected 50 pages for this demonstration. I manually prepared a ground truth file for each sample page and used fastwer to calculate the Character Error Rate (CER) and Word Error Rate (WER). The table below lists the sample pages, sorted by CER from best to worst. Click the sample image to see the text returned by Textract and the page image.
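For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of how CER and WER are commonly defined: the edit (Levenshtein) distance between the OCR output and the ground truth, divided by the length of the ground truth, expressed as a percentage. This is not the script used for these results and it is not fastwer's code; the function names and the sample strings are made up for illustration, and fastwer may tokenize or normalize text differently, so its numbers could differ slightly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if len(ref_units) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    # Character level: compare the strings character by character.
    return error_rate(reference, hypothesis)


def wer(reference, hypothesis):
    # Word level: compare whitespace-separated tokens.
    return error_rate(reference.split(), hypothesis.split())


# Hypothetical ground truth and OCR output (not taken from the samples).
reference = "Dear Mr. White, thank you for your letter"
hypothesis = "Dear Mr. Whife, thank you for you letter"

print(f"CER: {cer(reference, hypothesis):.1f}")
print(f"WER: {wer(reference, hypothesis):.1f}")
```

Two small character errors here damage two of the eight words, which is why WER is usually much higher than CER for the same page.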

The Python script used in this testing is available in the main branch.

Notes on method

  • I removed the text of any printed letterheads from the OCR and the ground truth, so as to evaluate only the handwritten text

Observations

  • Wide line spacing helps.
  • It’s remarkably good at ignoring printed lines or graph patterns.
  • Sloped lines are very bad (see corr.1930-32_C_3611_002, the last sample): Textract doesn’t do well at following the slope. I am using order_blocks_by_geo, which sorts the blocks by their y coordinates and fixes the fairly common problem of getting two lines out of order, but with sloped lines it ends up mingling the words of two consecutive lines. If you want the words and don’t care about the order, the results are usable.
  • When it’s good, it is quite good, probably because the handwriting in the image is a good match for the hands Textract was trained on.
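The sloped-line problem can be illustrated with a small sketch. It sorts word blocks by the top of their bounding box, which is a simplified stand-in for what sorting by y coordinate does; it is not the actual order_blocks_by_geo implementation. The dictionaries follow the shape of Textract blocks (Geometry, BoundingBox, Top, Left), but the helper names, words and coordinates are invented for illustration.

```python
def word(text, top, left):
    """Build a minimal WORD block shaped like a Textract block (made-up data)."""
    return {
        "BlockType": "WORD",
        "Text": text,
        "Geometry": {"BoundingBox": {"Top": top, "Left": left}},
    }


def order_by_top(blocks):
    """Sort blocks top to bottom, then left to right (simplified y-sort)."""
    def key(block):
        box = block["Geometry"]["BoundingBox"]
        return (box["Top"], box["Left"])
    return sorted(blocks, key=key)


def text_of(blocks):
    return " ".join(b["Text"] for b in blocks)


# Level lines: every word on a line shares the same Top, so the sort works.
level = [
    word("I", 0.14, 0.1), word("am", 0.14, 0.3), word("writing", 0.14, 0.5),
    word("Dear", 0.10, 0.1), word("Mr.", 0.10, 0.3), word("White", 0.10, 0.5),
]

# Sloped lines: each word sits a little lower than the one before it, so the
# end of the first line falls below the start of the second line.
sloped = [
    word("Dear", 0.10, 0.1), word("Mr.", 0.13, 0.3), word("White", 0.16, 0.5),
    word("I", 0.14, 0.1), word("am", 0.17, 0.3), word("writing", 0.20, 0.5),
]

print(text_of(order_by_top(level)))
print(text_of(order_by_top(sloped)))
```

In the level case the two lines come out in reading order. In the sloped case the words of the two lines interleave, although every word is still present, which matches the observation that the output is usable if word order does not matter.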

Table

File Sample (click for OCR and page image) CER (%) WER (%)
corr.1924-27_W_711_001 1.3 6.9
corr.1937-38_B_8569_001 3.7 17.1
corr.1924-27_H_570_002 6.1 16.3
misc_7_2288_001 8.0 28.6
corr.1933-40_B_4902_007 8.5 38.4
misc_8_2311_002 8.9 19.0
topics_186_14618_001 10.0 31.2
I-D_Harvard2A2B-lecture-25_15870_003 10.3 31.2
corr.1931-33_P_7705_001 10.4 36.0
misc_43_9198_002 11.4 27.1
I.A_LibComm_15017_001 11.9 28.0
corr.1927-29_W_2070_003 12.1 36.5
misc_31_7242_001 12.8 34.2
corr.1931-33_N_7691_002 13.1 37.6
corr.1927-29_B_953_003 13.3 33.0
ww_II.X_14363_001 14.0 37.9
misc_8_2311_007 14.3 25.9
corr.1927-29_B_898_001 17.0 40.3
misc_65_2495_001 20.0 29.8
corr.1927-29_W_2145_003 20.2 47.5
corr.1937-38_R_8652_001 21.0 44.3
corr.1927-29_C_1091_001 21.8 62.5
corr.1927-29_C_1103_002 22.9 47.7
corr.1927-29_D_1279_005 23.5 60.0
topics_126_14589_002 24.0 49.6
I-D_Harvard2A2B-lecture-25_15876_001 25.4 56.1
corr.1930-32_W_4765_001 25.9 60.7
corr.1938-43_R_8234_001 26.7 56.8
I.A_ReadLists_14723_005 28.4 65.8
IV-C-1_SlavicL_14678_006 28.8 75.0
corr.1927-29_W_2142_003 28.9 60.5
CWRU_3HB5-2-10_13509_001 29.7 59.8
books_diaries_13207_008 30.0 69.2
I-D_Harvard2A2B-lecture-03_15716_001 30.4 76.0
corr.1924-27_W_733_004 31.3 60.1
I.A_LibCatsLowCost_14979_001 33.3 80.0
corr.1927-29_W_2094_001 34.3 51.9
hgww_1921_8773_002 36.7 70.6
corr.1927-29_L_1615_001 37.0 65.6
misc_64_14529_018 37.0 67.1
hgww_1919_120_014 39.0 69.0
ww_I.W_14452_001 39.0 70.0
CKB_19_13598_001 39.7 84.6
I.A_HistDeptPol_14863_001 39.8 85.3
misc_37_15484_003 41.2 85.7
books_diaries_13207_023 43.1 94.7
books_diaries_13206_100 46.8 80.5
misc_41_7396_002 53.3 100.0
corr.1930-32_C_3611_002 61.5 83.2