DocUNet: Unwarping Document Images

Questions and Answers

What is a common use for capturing document images?

  • Analyzing weather patterns
  • Designing architectural blueprints
  • Creating digital art
  • Digitizing and recording physical documents (correct)

Digitally flattening a document image is often desired to make text recognition easier.

True

What type of network is the stacked U-Net based on?

  • Recurrent Neural Networks (RNNs)
  • Convolutional Neural Networks (CNNs) (correct)
  • Feedforward Neural Networks
  • Generative Adversarial Networks (GANs)

The network predicts a mapping field that moves a pixel in the distorted source image to the ______ image.

rectified

What inspires the use of U-Net in the network structure?

Semantic Segmentation

It is easy to obtain large-scale real-world data with ground truth deformation for training the network.

False

The network is trained on a ______ dataset with various data augmentations.

synthetic

What is the purpose of the data augmentations used in training the network?

To improve its generalization ability

Document digitization is an unimportant means to preserve existing printed documents.

False

What devices have traditionally been used to digitize documents?

Flat-bed scanners

A common problem when taking document images is that the document sheets may be ______, folded, or crumpled.

curved

The network is trained in an end-to-end manner to predict the backward mapping that can distort the document.

False

What is the primary purpose of synthesizing images of curved or folded paper documents?

To generate training data

The paper documents are captured by the ______ camera.

mobile

The benchmark contains only the original photos of the paper documents.

False

Match each evaluation metric with its full name

MS-SSIM = Multi-Scale Structural Similarity; LD = Local Distortion

Models trained on synthetic data may not generalize well to real data if there is a big ______ between synthetic and real data.

gap

Similar to semantic segmentation, they design their network to enforce ______ supervision.

pixel-wise

The authors created a comprehensive benchmark that captures different types of distortions.

True

The paper presents the first end-to-end, learning-based approach for document image ______.

unwarping

Flashcards

Stacked U-Net for Document Unwarping

A stacked U-Net is proposed to predict the forward mapping from a distorted document image to its rectified version.

Synthetic Document Dataset

A dataset of approximately 100,000 images, created by warping non-distorted document images, is used to train the network and improve its generalization ability.

Goal of Document Image Unwarping

Digitally flatten document images to make text recognition easier when physical document sheets are folded or curved.

CNNs for Image Recovery

CNNs can be used for end-to-end image recovery, offering efficiency in the testing phase compared to optimization-based methods.

2D Image Warping

The task is formulated as seeking the 2D image warping that can rectify a distorted document image, predicting a mapping field.

Synthesize Training Images

A technique to synthesize images of curved or folded paper documents is created for training data.

Perturbed Mesh Generation

A dataset of flat digital documents, including papers, books, and magazine pages, is warped by imposing a mesh and applying random vertex deformations.

Data Augmentation

Synthetic training images are augmented with various transformations, background textures, and color variations so that models trained on them generalize better to real data.

Network Architecture Goal

The network is designed to enforce supervision at the pixel level; U-Net is selected as the base model.

Shift Invariant Loss

A shift-invariant loss does not care about the absolute value of y_i in F; it enforces that the difference between y_i and y_j should be close to that between y_i* and y_j*.

Study Notes

  • Presents DocUNet, a learning-based method for unwarping document images, making text recognition easier by digitally flattening folded or curved sheets

Overview

  • The ubiquity of mobile cameras enables easy digitizing and recording of physical documents
  • Digitally flattening document images is desirable to improve text recognition
  • DocUNet uses a stacked U-Net to predict the forward mapping from distorted to rectified images
  • A synthetic dataset of approximately 100,000 images was created by warping non-distorted document images
  • Data augmentation techniques improved the network's generalization ability
  • A benchmark dataset was created to evaluate performance in real-world conditions
  • The model was evaluated quantitatively and qualitatively, and was compared against previous methods

Introduction

  • Document digitization helps preserve and provide access to printed documents
  • Mobile cameras have made capturing physical documents easier than traditional flat-bed scanners
  • Images undergo text detection and recognition for content analysis and information extraction
  • Document sheets are often not in ideal scanning condition: curved, folded, or crumpled, with complex backgrounds, etc.
  • These factors cause problems for automatic document image analysis, so digital flattening is desirable

Existing Approaches

  • Some systems utilize stereo cameras or structured light projectors, which need calibrated hardware to measure document distortion
  • Some reconstruct the 3D shape of document sheets using multi-view images, avoiding extra hardware
  • Some recover the rectified document by analyzing a single image, using hand-crafted low-level features like illumination/shading and text-lines

DocUNet Approach

  • The first end-to-end learning-based approach to directly predict document distortion
  • Relies on CNNs for end-to-end image recovery, which are efficient in the testing phase
  • Data-driven method can generalize to text, figures, and handwriting with training data
  • Formulates the task as finding a 2D image warping that rectifies a distorted document image
  • U-Net inspires the network structure because task shares commonalities with semantic segmentation
  • Network assigns a 2D vector to each pixel, similar to assigning a class label in semantic segmentation
  • A novel loss function drives the network to regress the coordinate (x, y) in the result image for each pixel in the source image
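
To make the mapping-field idea concrete, here is a minimal numpy sketch (the function name, argument layout, and the use of negative values to mark background pixels are illustrative assumptions of this summary, not the paper's code) that scatters each foreground pixel of the distorted photo to its predicted (x, y) position in the rectified image:

```python
import numpy as np

def apply_forward_mapping(src_img, flow, out_h, out_w):
    """Scatter each source pixel to the (x, y) position predicted for it.

    src_img: (H, W, 3) distorted photo.
    flow:    (H, W, 2) per-pixel target coordinates in the rectified image;
             background pixels carry negative values and are skipped.
    """
    out = np.zeros((out_h, out_w, 3), dtype=src_img.dtype)
    ys, xs = np.nonzero((flow[..., 0] >= 0) & (flow[..., 1] >= 0))
    tx = np.clip(flow[ys, xs, 0].round().astype(int), 0, out_w - 1)
    ty = np.clip(flow[ys, xs, 1].round().astype(int), 0, out_h - 1)
    out[ty, tx] = src_img[ys, xs]
    return out
```

A naive scatter like this leaves holes, so in practice the predicted forward map is interpolated or inverted into a backward map before resampling the final rectified image.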

Data and Benchmarking

  • A large number of document images, distorted to various degrees and paired with their rectifications, are needed to train the network
  • A synthetic dataset of 100K images was created by randomly warping perfectly flat document images
  • The perturbed image is the network input, and the mesh used to generate it defines the inverse deformation the network should recover
  • A benchmark of 130 images with variations in document type, distortion degree/type and capture conditions was created to remedy existing evaluation limitations

Contributions

  • End-to-end learning approach using a stacked U-Net with intermediate supervision
  • A technique to synthesize images of curved/folded paper documents, creating a training dataset of ~100K images
  • A diverse evaluation benchmark dataset with ground truth

Related Work

  • Rectifying documents has been studied using 3D shape reconstruction and shape from low-level features

3D Reconstruction

  • Brown and Seales used a visible light projector-camera system
  • Zhang et al. used a range/depth sensor and considered paper's physical properties
  • Meng et al. used structured laser beams
  • Ulges et al. calculated disparity maps via image patch matching
  • Yamashita et al. parameterized shape with Non-Uniform Rational B-Splines (NURBS)
  • Tsoi and Brown composed boundary information from multi-view images
  • Koo et al. used two uncalibrated images and SIFT matching

Shape from Low-Level Features

  • Incorporates illumination/shading and text lines
  • Wada et al. used Shape from Shading (SfS)
  • Courteille et al. applied SfS to camera-captured documents
  • Zhang et al. used a robust SfS system
  • Cao et al. modeled the curved document on a cylinder
  • Liang et al. modeled the document as a developable surface
  • Tian and Narasimhan optimized over textlines and character strokes

Distorted Image Synthesis in 2D

  • Training images are synthesized by manipulating a 2D mesh to warp rectified images into distortions, while neglecting physical modeling to prioritize speed and ease
  • Guidelines followed when creating the distortion maps:
    • Real paper is locally rigid: deformation at a point propagates spatially rather than compressing or expanding the sheet
    • Two kinds of distortions, folds and curves, generate creases and paper curls, respectively

Perturbed Mesh Generation

  • Given an image I, impose an m x n mesh M on it to provide control points for warping
  • A random vertex p is selected on M as the initial deformation point; its deformation vector v (direction and strength) is also randomly generated
  • Based on guideline 1, v is propagated to the other vertices with a weight w: each vertex of the distorted mesh is its original position plus w·v
  • To define w, note that p and v define a straight line; first compute the normalized distance d between each vertex and this line, then define w as a function of d
  • Based on guideline 2, a different weighting function is used for each distortion type:
  • For folds: w = α / (d + α)
  • For curves: w = 1 - d^α
  • Here α controls the extent of the deformation propagation
  • A larger α pushes w toward 1, so all vertices share nearly the same deformation as p and the deformation becomes more global
  • A small α limits the deformation to the local area around p (see the sketch below)
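
A minimal numpy sketch of this recipe (the function name, the unit-square mesh, and the random ranges are illustrative assumptions; only the two weighting functions come from the notes above):

```python
import numpy as np

def perturb_mesh(m, n, distortion="fold", alpha=0.5, rng=None):
    """Apply one random fold/curve perturbation to an m x n mesh."""
    rng = np.random.default_rng() if rng is None else rng
    # Mesh vertices on the unit square, shape (m*n, 2).
    xs, ys = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, m))
    verts = np.stack([xs.ravel(), ys.ravel()], axis=1)

    p = verts[rng.integers(len(verts))]           # random deformation point
    v = rng.uniform(-0.15, 0.15, size=2)          # random direction and strength

    # Normalized distance from every vertex to the line through p along v.
    diff = verts - p
    d = np.abs(diff[:, 0] * v[1] - diff[:, 1] * v[0]) / (np.linalg.norm(v) + 1e-8)
    d = d / (d.max() + 1e-8)

    if distortion == "fold":
        w = alpha / (d + alpha)       # sharp crease localized around the line
    else:  # "curve"
        w = 1.0 - d ** alpha          # smooth, more global curl

    return verts + w[:, None] * v     # perturbed vertex positions
```

Several such perturbations can be applied in sequence to build up compound folds and curls.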

Perturbed Image Generation

  • Perturbed mesh provides a sparse deformation field and is interpolated linearly to build a dense warping map at pixel level
  • The warped image is then generated by applying the warping map to the original image; 100K images were synthesized with up to 19 synthetic distortions (see the sketch below)
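
One way to implement this step with scipy (a sketch under the assumption that the mesh is given in pixel coordinates; the paper's exact interpolation and ground-truth bookkeeping are not reproduced):

```python
import numpy as np
from scipy.interpolate import griddata
from scipy.ndimage import map_coordinates

def render_warped(img, orig_verts, pert_verts):
    """Densify the sparse mesh deformation and resample the flat image.

    orig_verts / pert_verts: (K, 2) vertex positions in (x, y) pixel
    coordinates before and after perturbation.
    """
    h, w = img.shape[:2]
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))

    # For every output pixel, interpolate where it should sample from in the
    # original flat image (a backward map built from the mesh pair).
    src_x = griddata(pert_verts, orig_verts[:, 0], (gx, gy), method="linear")
    src_y = griddata(pert_verts, orig_verts[:, 1], (gx, gy), method="linear")
    src_x = np.nan_to_num(src_x, nan=-1.0)   # outside the mesh -> background
    src_y = np.nan_to_num(src_y, nan=-1.0)

    # Bilinear resampling of each channel; out-of-range coordinates become 0.
    warped = np.stack(
        [map_coordinates(img[..., c], [src_y, src_x], order=1,
                         mode="constant", cval=0.0)
         for c in range(img.shape[2])], axis=-1)
    return warped
```

In this construction, (src_x, src_y) doubles as the ground-truth forward mapping for the warped image, since it gives each warped pixel's coordinates in the flat document.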

Data Augmentation

  • Models trained on synthetic data may not generalize well to real data due to the gap between real and synthetic data
  • The problem can be eased by domain adaptation using GANs, but large-scale real-world data is not available
  • Synthesized images are therefore augmented with background textures from DTD, jitter in the HSV color space to magnify paper and illumination color variations, and viewpoint changes via projective transforms (a sketch follows)
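
A sketch of one such augmentation pass using OpenCV (the jitter ranges, the pure-black background assumption, and the helper name are illustrative choices, not values from the paper):

```python
import cv2
import numpy as np

def augment(img, bg_texture, rng=None):
    """One possible augmentation pass: HSV jitter, a background texture,
    and a random projective (viewpoint) warp."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]

    # HSV jitter to imitate paper / illumination color variations.
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1:] *= rng.uniform(0.7, 1.3, size=2)        # saturation and value
    hsv[..., 0] = (hsv[..., 0] + rng.uniform(-10, 10)) % 180
    img = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)

    # Composite onto a background texture (e.g. a crop from DTD).
    bg = cv2.resize(bg_texture, (w, h))
    mask = img.sum(axis=2, keepdims=True) > 0            # assumes pure black marks background
    img = np.where(mask, img, bg)

    # Random viewpoint change with a projective transform.
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = (src + rng.uniform(-0.05, 0.05, size=(4, 2)) * [w, h]).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, H, (w, h))
```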

Network Architecture

  • Architecture enforces pixel-wise supervision
  • U-Net is base model choice due to simplicity and effectiveness in semantic segmentation tasks
  • The fully convolutional network contains downsampling layers followed by upsampling layers, with feature maps concatenated between corresponding levels (a minimal sketch follows)
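
For concreteness, a deliberately tiny PyTorch U-Net showing the down/up/skip-concatenation pattern with a 2-channel per-pixel output; the actual DocUNet is deeper, so treat this only as an illustration:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A small U-Net-style sketch: downsample, upsample, skip concatenation."""
    def __init__(self, in_ch=3, out_ch=2, base=32):
        super().__init__()
        block = lambda ci, co: nn.Sequential(
            nn.Conv2d(ci, co, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(co, co, 3, padding=1), nn.ReLU(inplace=True))
        self.enc1 = block(in_ch, base)
        self.enc2 = block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bott = block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = block(base * 2, base)
        self.head = nn.Conv2d(base, out_ch, 1)    # per-pixel (x, y) regression

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bott(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)
```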

Training Details

  • The output of a single U-Net may not be satisfactory and is refined by stacking another U-Net at the output of the first as a refiner
  • The network has one layer that converts the deconvolutional features into the final (x, y) output, and it splits after the first deconvolution layer
  • The deconvolutional features of the first U-Net and the intermediate prediction y1 are concatenated together as the input of the second U-Net
  • Second U-Net gives a refined prediction which is used as the network output
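
A sketch of that stacking logic, assuming a U-Net class like the one above but modified so its forward returns its last deconvolutional feature map alongside its prediction (that interface, and the names used here, are assumptions of this sketch rather than the paper's code):

```python
import torch
import torch.nn as nn

class StackedRefiner(nn.Module):
    """Second U-Net consumes the first U-Net's last deconvolutional features
    together with its intermediate prediction y1 and outputs a refined y2;
    both predictions are supervised (intermediate supervision)."""
    def __init__(self, unet_cls, feat_ch=32):
        super().__init__()
        self.unet1 = unet_cls(in_ch=3, out_ch=2)
        self.unet2 = unet_cls(in_ch=feat_ch + 2, out_ch=2)

    def forward(self, img):
        feats, y1 = self.unet1(img)                     # intermediate prediction
        _, y2 = self.unet2(torch.cat([feats, y1], dim=1))
        return y1, y2                                   # loss = L(y1) + L(y2)
```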

Loss Function

  • Defined as a combination of an element-wise loss (Le) and a shift-invariant loss (Ls)
  • Le = 1/n * sum((yi - yi*)^2)
  • Ls = 1/(2n^2) * sum(((yi - yj) - (yi* - yj*))^2)
  • where n is the number of elements in F, yi is the predicted value at index i, and yi* is the corresponding ground truth value
  • Letting di = yi - yi*, Ls = 1/(2n^2) * sum((di - dj)^2)
  • The first term is element-wise; the second decreases the loss when the difference between two elements is similar to the corresponding difference in the ground truth
  • The second term is also known as the scale-invariant error, and L1 loss is found to work better than L2 loss
  • The loss is rewritten as Lf = (1/n) * sum(|di|) - λ * |(1/n) * sum(di)|
  • where λ controls the strength of the second term and is set to 0.1
  • Elements in F corresponding to the background are assigned a constant negative value (as described in Section 3); a sketch of the loss follows
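
The rewritten loss is straightforward to implement; below is a PyTorch sketch, where masking out the background elements is this sketch's own simplification rather than necessarily the paper's handling:

```python
import torch

def docunet_style_loss(pred, gt, lam=0.1):
    """L1 + shift-invariant loss from the notes:
    L_f = (1/n) * sum(|d_i|) - lam * |(1/n) * sum(d_i)|,  with d_i = y_i - y_i*.

    pred, gt: (N, 2, H, W) mapping fields. Background elements of the ground
    truth carry a constant negative value and are masked out here.
    """
    mask = (gt >= 0).float()
    d = (pred - gt) * mask
    n = mask.sum().clamp(min=1.0)
    l1_term = d.abs().sum() / n
    shift_term = (d.sum() / n).abs()
    return l1_term - lam * shift_term
```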

Experiments

  • The benchmark images were captured using mobile cameras
  • Selected document types include receipts, music sheets, academic papers, and books, mostly containing a mix of text and figures
  • The original documents were warped and folded
  • Lighting conditions included sunlight, indoor lights, and the cellphone's built-in flash
  • Results indicate that the stacked U-Net is the best architecture, achieving 0.41 MS-SSIM and 14.08 LD
