Questions and Answers
What is a common use for capturing document images?
- Analyzing weather patterns
- Designing architectural blueprints
- Creating digital art
- Digitizing and recording physical documents (correct)
Digitally flattening a document image is often desired to make text recognition easier.
True (A)
What type of network is the stacked U-Net based on?
- Recurrent Neural Networks (RNNs)
- Convolutional Neural Networks (CNNs) (correct)
- Feedforward Neural Networks
- Generative Adversarial Networks (GANs)
The network predicts a mapping field that moves a pixel in the distorted source image to the ______ image.
What inspires the use of U-Net in the network structure?
It is easy to obtain large-scale real-world data with ground truth deformation for training the network.
The network is trained on a ______ dataset with various data augmentations.
What is the purpose of the data augmentations used in training the network?
Document digitization is an unimportant means to preserve existing printed documents.
What devices have traditionally been used to digitize documents?
A common problem when taking document images is that the document sheets may be ______, folded, or crumpled.
The network is trained in an end-to-end manner to predict the backward mapping that can distort the document.
What is the primary purpose of synthesizing images of curved or folded paper documents?
The paper documents are captured by the ______ camera.
The benchmark contains only the original photos of the paper documents.
Match the component with its goal
Models trained on synthetic data may not generalize well to real data if there is a big difference in [blank] data.
Similar to semantic segmentation, they design their network to enforce [blank] supervision.
The authors created a comprehensive benchmark that captures different types of distortions.
What is the first end-to-end, learning-based approach for document image unwarping?
Flashcards
Stacked U-Net for Document Unwarping
A stacked U-Net is proposed to predict the forward mapping from a distorted document image to its rectified version.
Synthetic Document Dataset
A dataset of approximately 100K images, created by warping non-distorted document images, is used to train the network and improve its generalization ability.
Goal of Document Image Unwarping
Digitally flatten document images to make text recognition easier when physical document sheets are folded or curved.
CNNs for Image Recovery
2D Image Warping
Synthesize Training Images
Perturbed Mesh Generation
Data Augmentation
Network Architecture Goal
Shift Invariant Loss
Study Notes
- Presents DocUNet, a learning-based method for unwarping document images, making text recognition easier by digitally flattening folded or curved sheets
Overview
- The ubiquity of mobile cameras makes it easy to digitize and record physical documents
- Digitally flattening document images is desirable to improve text recognition
- DocUNet uses a stacked U-Net to predict the forward mapping from distorted to rectified images
- A synthetic dataset of approximately 100,000 images was created by warping non-distorted document images
- Data augmentation techniques improved the network's generalization ability
- A benchmark dataset was created to evaluate performance in real-world conditions
- The model was evaluated quantitatively and qualitatively, and was compared against previous methods
Introduction
- Document digitization helps preserve and provide access to printed documents
- Mobile cameras have made capturing physical documents easier than using traditional flat-bed scanners
- Images undergo text detection and recognition for content analysis and information extraction
- Document sheets are often not in ideal scanning condition: they may be curved, folded, or crumpled, or appear against complex backgrounds
- These factors cause problems for automatic document image analysis, so digital flattening is desirable
Existing Approaches
- Some systems utilize stereo cameras or structured light projectors, which require calibrated hardware, to measure document distortion
- Some reconstruct the 3D shape of document sheets using multi-view images, avoiding extra hardware
- Some recover the rectified document by analyzing a single image, using hand-crafted low-level features like illumination/shading and text-lines
DocUNet Approach
- The first end-to-end learning-based approach to directly predict document distortion
- Relies on CNNs for end-to-end image recovery, efficient in testing phase
- Data-driven method can generalize to text, figures, and handwriting with training data
- Formulates the task as finding a 2D image warping to rectify distorted document image
- U-Net inspires the network structure because task shares commonalities with semantic segmentation
- Network assigns a 2D vector to each pixel, similar to assigning a class label in semantic segmentation
- A novel loss function drives the network to regress the coordinate (x, y) in the result image for each pixel in the source image
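The per-pixel regression idea above can be illustrated with a short NumPy sketch (a hypothetical example, not the paper's code): the forward mapping gives each source pixel a predicted (x, y) coordinate in the result image, and pixels with negative coordinates are treated as background.

```python
import numpy as np

def apply_forward_mapping(src, fmap, out_shape):
    """Scatter each source pixel to the (x, y) target coordinate
    predicted for it; unmapped target pixels stay zero."""
    out = np.zeros(out_shape + src.shape[2:], dtype=src.dtype)
    h, w = src.shape[:2]
    for i in range(h):
        for j in range(w):
            x, y = fmap[i, j]
            # Negative coordinates mark background pixels (no mapping).
            if x >= 0 and y >= 0:
                out[int(y), int(x)] = src[i, j]
    return out

# Sanity check with the identity mapping: every pixel maps to itself.
src = np.arange(16, dtype=float).reshape(4, 4)
xs, ys = np.meshgrid(np.arange(4), np.arange(4))
fmap = np.stack([xs, ys], axis=-1)   # fmap[i, j] = (x, y) = (j, i)
out = apply_forward_mapping(src, fmap, (4, 4))
```

With the identity mapping the output equals the input; a learned mapping would instead move each pixel to its rectified position.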
Data and Benchmarking
- A large number of document images distorted in various degrees with corresponding rectifications are needed to train network
- A synthetic dataset of 100K images was created by randomly warping perfectly flat document images
- The perturbed image is the network input, and the inverse of the perturbation mesh is the deformation the network must recover
- A benchmark of 130 images with variations in document type, distortion degree/type and capture conditions was created to remedy existing evaluation limitations
Contributions
- End-to-end learning approach using a stacked U-Net with intermediate supervision
- A technique to synthesize images of curved/folded paper documents, creating a training dataset of ~100K images
- A diverse evaluation benchmark dataset with ground truth
Related Work
- Rectifying documents has been studied using 3D shape reconstruction and shape from low-level features
3D Reconstruction
- Brown and Seales used a visible light projector-camera system
- Zhang et al. used a range/depth sensor and considered paper's physical properties
- Meng et al. used structured laser beams
- Ulges et al. calculated disparity maps via image patch matching
- Yamashita et al. parameterized shape with Non-Uniform Rational B-Splines (NURBS)
- Tsoi and Brown composed boundary information from multi-view images
- Koo et al. used two uncalibrated images and SIFT matching
Shape from Low-Level Features
- Incorporates illumination/shading and text lines
- Wada et al. used Shape from Shading (SfS)
- Courteille et al. used camera
- Zhang et al. used a robust SfS system
- Cao et al. modeled the curved document on a cylinder
- Liang et al. used developable surface
- Tian and Narasimhan optimized over textlines and character strokes
Distorted Image Synthesis in 2D
- Training images are synthesized by manipulating a 2D mesh to warp rectified images into distortions, while neglecting physical modeling to prioritize speed and ease
- Guidelines followed when creating the distortion maps:
- Real paper is locally rigid: it does not compress or expand, so a deformation at one point propagates spatially
- Two kinds of distortions are modeled: folds and curves, which generate creases and paper curls, respectively
Perturbed Mesh Generation
- Given an image I, impose an m x n mesh M on it to provide control points for warping
- A random vertex p on M is selected as the initial deformation point, with a randomly generated vector v giving the deformation's direction and strength
- Following guideline 1, v is propagated to the other vertices with weights w: each vertex v_i on the distorted mesh is computed as v_i + w_i * v
- To define w, note that p and v define a straight line; first compute the normalized distance d between each vertex and this line, then define w as a function of d
- Following guideline 2, a different weight function is used for each distortion type:
  - For folds: w = alpha / (d + alpha)
  - For curves: w = 1 - d^alpha
- Here alpha controls the extent of the deformation propagation
- A larger alpha pushes w toward 1, so all vertices share nearly the same deformation as p and the deformation becomes more global
- A small alpha limits the deformation to the local area around p
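The mesh-perturbation procedure above can be sketched in NumPy as follows (a minimal illustration of the weighting scheme; the function name and parameter values are hypothetical, not from the paper):

```python
import numpy as np

def perturb_mesh(verts, p, v, alpha=2.0, kind="fold"):
    """Propagate a deformation vector v applied at point p to all mesh
    vertices, weighted by distance to the line through p along v."""
    # Unit normal of the line defined by p and direction v.
    n = np.array([-v[1], v[0]], dtype=float)
    n /= np.linalg.norm(n)
    # Normalized perpendicular distance of each vertex to that line.
    d = np.abs((verts - p) @ n)
    d = d / (d.max() + 1e-8)          # d in [0, 1]
    if kind == "fold":
        w = alpha / (d + alpha)       # fold: w = alpha / (d + alpha)
    else:
        w = 1.0 - d ** alpha          # curve: w = 1 - d^alpha
    return verts + w[:, None] * v     # each vertex moves by w_i * v

# A flat 5x5 mesh perturbed by a fold centered at (0.5, 0.5).
xs, ys = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
verts = np.stack([xs.ravel(), ys.ravel()], axis=1)
warped = perturb_mesh(verts, p=np.array([0.5, 0.5]),
                      v=np.array([0.1, 0.05]), kind="fold")
```

Every vertex moves along v by at most |v|, with vertices near the fold line moving the most, matching the locally-rigid propagation guideline.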
Perturbed Image Generation
- The perturbed mesh provides a sparse deformation field, which is interpolated linearly to build a dense, pixel-level warping map
- The warped image is then generated by applying the warping map to the original image; 100K images are synthesized, each with up to 19 synthetic distortions
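The sparse-to-dense interpolation step can be sketched with plain bilinear interpolation in NumPy (an illustrative stand-in for the linear interpolation the notes describe; names are hypothetical):

```python
import numpy as np

def densify(mesh_disp, out_h, out_w):
    """Bilinearly interpolate an (m, n, 2) mesh displacement field
    into a dense (out_h, out_w, 2) per-pixel warping map."""
    m, n, _ = mesh_disp.shape
    # Pixel coordinates expressed in mesh-cell units.
    ys = np.linspace(0, m - 1, out_h)
    xs = np.linspace(0, n - 1, out_w)
    y0 = np.floor(ys).astype(int).clip(0, m - 2)
    x0 = np.floor(xs).astype(int).clip(0, n - 2)
    fy = (ys - y0)[:, None, None]      # fractional offsets in y
    fx = (xs - x0)[None, :, None]      # fractional offsets in x
    d00 = mesh_disp[y0][:, x0]         # four surrounding mesh corners
    d01 = mesh_disp[y0][:, x0 + 1]
    d10 = mesh_disp[y0 + 1][:, x0]
    d11 = mesh_disp[y0 + 1][:, x0 + 1]
    return ((1 - fy) * (1 - fx) * d00 + (1 - fy) * fx * d01
            + fy * (1 - fx) * d10 + fy * fx * d11)

# A 3x3 mesh displacement densified to a 16x16 per-pixel map.
mesh = np.random.randn(3, 3, 2) * 0.1
dense = densify(mesh, 16, 16)
```

The dense map agrees with the mesh at the control points and varies smoothly between them, which is what the pixel-level warp needs.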
Data Augmentation
- Models trained on synthetic data may not generalize well to real data because of the gap between the two domains
- This problem can be eased by domain adaptation using GANs, but large-scale real-world data is not available here
- Synthesized images are therefore augmented with background textures from the Describable Textures Dataset (DTD), jitter in HSV color space to magnify paper and illumination color variations, and viewpoint changes via projective transforms
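The HSV jitter augmentation can be sketched with Python's standard-library `colorsys` module (a hypothetical illustration of the idea; the jitter ranges are assumptions, not values from the paper):

```python
import colorsys
import random

def hsv_jitter(rgb_pixels, hue_delta=0.05, sat_scale=0.2, val_scale=0.2):
    """Jitter a list of (r, g, b) pixels (floats in [0, 1]) in HSV space,
    simulating paper-color and illumination variations."""
    dh = random.uniform(-hue_delta, hue_delta)      # hue shift
    ds = 1.0 + random.uniform(-sat_scale, sat_scale)  # saturation scale
    dv = 1.0 + random.uniform(-val_scale, val_scale)  # value scale
    out = []
    for r, g, b in rgb_pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        h = (h + dh) % 1.0
        s = min(max(s * ds, 0.0), 1.0)
        v = min(max(v * dv, 0.0), 1.0)
        out.append(colorsys.hsv_to_rgb(h, s, v))
    return out

# Jitter a near-white "paper" pixel and a dark "ink" pixel.
jittered = hsv_jitter([(0.9, 0.9, 0.85), (0.1, 0.1, 0.1)])
```

In practice this would be vectorized over whole images; the per-pixel loop is kept only to make the HSV transform explicit.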
Network Architecture
- Architecture enforces pixel-wise supervision
- U-Net is base model choice due to simplicity and effectiveness in semantic segmentation tasks
- Fully Convolutional Network contains downsampling layers followed by upsampling layers with feature maps concatenated between them
Training Details
- The output of a single U-Net may not be satisfactory, so a second U-Net is stacked on the output of the first as a refiner
- One layer converts the deconvolutional features into the final (x, y) output; the network splits after the first deconvolution layer
- The deconvolutional features of the first U-Net and its intermediate prediction y1 are concatenated as the input to the second U-Net
- The second U-Net gives a refined prediction, which is used as the network output
Loss Function
- Defined as a combination of an element-wise loss (Le) and a shift-invariant loss (Ls)
- Le = (1/n) * sum_i (yi - yi*)^2
- Ls = (1/(2n^2)) * sum_{i,j} ((yi - yj) - (yi* - yj*))^2
- Here n is the number of elements in the mapping F, yi is the predicted value at index i, and yi* is the corresponding ground-truth value
- With di = yi - yi*, the shift-invariant term simplifies to Ls = (1/(2n^2)) * sum_{i,j} (di - dj)^2
- The first term is element-wise; the second term lowers the loss when the distance between two predicted elements matches the corresponding distance in the ground truth
- The second term is also known as the Scale-Invariant Error; empirically, L1 loss works better than L2 loss here
- The final loss is therefore rewritten as Lf = (1/n) * sum_i |di| - lambda * |(1/n) * sum_i di|
- lambda controls the strength of the second term and is set to 0.1
- Elements of F corresponding to the background are set to a constant negative value (as described in Section 3)
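The final loss Lf is simple enough to write out directly; a minimal NumPy sketch (the function name and test values are illustrative, not from the paper):

```python
import numpy as np

def docunet_loss(pred, gt, lam=0.1):
    """Lf = (1/n) * sum|d_i| - lam * |(1/n) * sum d_i|,
    where d_i = pred_i - gt_i is the per-element prediction error."""
    d = (pred - gt).ravel()
    return np.abs(d).mean() - lam * np.abs(d.mean())

# Errors of +/-0.1 and +/-0.2 that cancel on average: the shift term
# vanishes and only the mean absolute error remains.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
gt = np.array([[1.1, 1.9], [3.2, 3.8]])
loss = docunet_loss(pred, gt)   # 0.15
```

Note how a constant offset added to every prediction increases the first term but is partially forgiven by the second, which is the shift-invariance the loss is designed for.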
Experiments Environment
- The benchmark images were captured using mobile cameras
- Selected document types include receipts, music sheets, academic papers, and books, mostly containing a mix of text and figures
- The original documents were warped and folded
- Illumination conditions included sunlight, indoor lights, and the cellphone's built-in flash
- Results indicate the stacked U-Net is the best architecture, achieving 0.41 MS-SSIM and 14.08 Local Distortion (LD)