OCR Errors in Business License Review: Small Mistakes, Large Consequences
Why entity matching should not rely on raw OCR output alone.
Business license review often begins with OCR, especially when a buyer receives a screenshot or a low-resolution image from a supplier. OCR is helpful, but a single-character error in a name or registration number can point the check at a different entity, or at no record at all. The risk is highest when the legal name is in Chinese, the registration number is long, or the scan has stamps and compression artifacts.
A robust workflow treats OCR output as a draft. The analyst should compare the extracted company name and registration code with the image, then use those fields for public record checks. If the supplier provides an English name, it should be mapped back to the Chinese legal name rather than accepted as equivalent.
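One way to enforce the "OCR as draft" rule is to make unreviewed fields unusable for lookups. The following is a minimal sketch, not a reference implementation: the class name DraftField, the function fields_for_lookup, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DraftField:
    """One OCR-extracted field that stays a draft until a reviewer confirms it."""
    label: str
    ocr_text: str
    # Set only after the reviewer has compared the text with the image.
    confirmed_text: Optional[str] = None

    def confirm(self, reviewer_text: str) -> None:
        """Record the value the reviewer read from the image."""
        self.confirmed_text = reviewer_text.strip()

    @property
    def was_corrected(self) -> bool:
        """True when the reviewer's value differs from the OCR draft."""
        return self.confirmed_text is not None and self.confirmed_text != self.ocr_text


def fields_for_lookup(fields: List[DraftField]) -> Dict[str, str]:
    """Return confirmed values only; refuse to proceed on unreviewed drafts."""
    pending = [f.label for f in fields if f.confirmed_text is None]
    if pending:
        raise ValueError(f"unreviewed OCR fields: {pending}")
    return {f.label: f.confirmed_text for f in fields}


# Made-up sample values: the OCR draft contains a letter O where the image shows a zero.
name = DraftField("company_name", "示例贸易有限公司")
code = DraftField("registration_code", "9131O000TEST000001")
name.confirm("示例贸易有限公司")
code.confirm("91310000TEST000001")
lookup_fields = fields_for_lookup([name, code])
```

Raising an error on pending fields is a deliberate choice in this sketch: a public record check cannot silently run on text nobody has compared with the image.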
Layout also matters. Some documents contain historical names, branch names, issuer names, or shareholder names alongside the legal name. A model that does not understand the document template may extract one of these as the main entity. That is why document type classification and field labels are as important as raw text recognition.
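The idea of selecting the entity by field label rather than by position can be sketched as follows. This assumes an upstream layout-aware step that already yields (label, text) pairs; the label set MAIN_NAME_LABELS and the function pick_main_entity are hypothetical, and a real template would define its own labels.

```python
from typing import List, Optional, Tuple

# Hypothetical labels that mark the licensed entity's own name on a template.
MAIN_NAME_LABELS = {"名称", "企业名称"}


def pick_main_entity(labeled_lines: List[Tuple[str, str]]) -> Optional[str]:
    """Return the entity name only when exactly one line carries a main-name label.

    Names under other labels (legal representative, issuing authority, and so on)
    are ignored. Zero or multiple candidates return None so the document goes
    to manual review instead of being guessed.
    """
    candidates = [text for label, text in labeled_lines if label in MAIN_NAME_LABELS]
    if len(candidates) != 1:
        return None
    return candidates[0]


# Made-up sample lines: three name-like strings, only one is the licensed entity.
sample = [
    ("名称", "示例电子有限公司"),
    ("法定代表人", "张三"),
    ("登记机关", "某市市场监督管理局"),
]
main_entity = pick_main_entity(sample)
```

Returning None on ambiguity keeps the extraction step from choosing between a current and a historical name on its own.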
A second problem is over-normalization. Systems that remove punctuation, translate names loosely, or collapse similar addresses may hide the difference between a factory, a trading company, and a related sales office. For supplier verification, conservative matching is safer than optimistic matching: an uncertain match should go to a reviewer rather than be accepted automatically.
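The contrast between conservative and optimistic matching can be illustrated with two comparison keys. This is a sketch under stated assumptions: it treats full-width and half-width character variants as equivalent (an assumption that should be checked against the registry in use), the function names are hypothetical, and the company names are made up.

```python
import unicodedata


def conservative_key(name: str) -> str:
    """Fold only character-width variants and surrounding whitespace."""
    return unicodedata.normalize("NFKC", name).strip()


def aggressive_key(name: str) -> str:
    """An over-normalizing key, shown only as a counterexample."""
    key = conservative_key(name)
    for token in ("(", ")", "分公司", "销售"):
        key = key.replace(token, "")
    return key


def match_decision(ocr_name: str, registry_name: str) -> str:
    """Return 'match' only on an exact conservative match, otherwise 'review'."""
    if conservative_key(ocr_name) == conservative_key(registry_name):
        return "match"
    return "review"


# Made-up names: a company and its sales branch are different entities.
head_office = "示例电子有限公司"
sales_branch = "示例电子有限公司销售分公司"

# The aggressive key collapses the two names into one.
collapsed = aggressive_key(head_office) == aggressive_key(sales_branch)
# The conservative decision keeps them apart and routes the pair to a reviewer.
decision = match_decision(head_office, sales_branch)
```

In this sketch a non-match never means "reject". It means a person looks at the pair, which is the conservative default the paragraph above argues for.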
The best use of OCR is to speed up capture while preserving a human review gate. Store the original image, the extracted fields, the reviewer's corrections, and the final entity selected for verification. Over time, those corrections show where the OCR step fails most often and become training data for a better workflow.
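A stored review record might look like the sketch below. The field names and the function build_review_record are hypothetical, and the sample values are made up. The sketch assumes the original image is kept in separate storage and uses a SHA-256 hash only to tie the record to that file.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict


def build_review_record(
    image_bytes: bytes,
    ocr_fields: Dict[str, str],
    reviewed_fields: Dict[str, str],
    final_entity: str,
) -> dict:
    """Assemble one audit record for a reviewed business license image."""
    # Keep only the fields where the reviewer's value differs from the OCR draft.
    corrections = {
        key: {"ocr": ocr_fields.get(key), "reviewed": value}
        for key, value in reviewed_fields.items()
        if ocr_fields.get(key) != value
    }
    return {
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "ocr_fields": ocr_fields,
        "reviewed_fields": reviewed_fields,
        "corrections": corrections,
        "final_entity": final_entity,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Made-up sample: the reviewer fixed one character in the registration code.
record = build_review_record(
    image_bytes=b"placeholder image bytes",
    ocr_fields={"company_name": "示例贸易有限公司", "registration_code": "9131O000TEST000001"},
    reviewed_fields={"company_name": "示例贸易有限公司", "registration_code": "91310000TEST000001"},
    final_entity="示例贸易有限公司",
)
serialized = json.dumps(record, ensure_ascii=False, indent=2)
```

Keeping the OCR draft next to the reviewed value, rather than overwriting it, is what makes the corrections reusable later as labeled examples.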
Working checklist
- Treat OCR as draft text.
- Verify Chinese legal names visually.
- Do not over-normalize entity names.
- Preserve reviewer corrections.
- Flag low-resolution documents for replacement.