Improved page segmentation

The segm module is implemented in C++.

Segmentation Algorithm

Open an image file as a grayscale numpy array
Find all components of connected pixels
For every connected component, find its bounding rectangle
Eliminate all rectangles contained inside other rectangles
Join all intersecting rectangles
Make a histogram of rectangle heights
The height with the highest frequency is the text height
Mark all components whose height or width is greater than 5 times the text height as images
For every rectangle, find its left neighbor: the nearest rectangle to the left that intersects the vertical interval from the rectangle's y coordinate to y + height
Create a graph that connects all components with their right neighbors
Find the connected components of that graph
Join intersecting components
The resulting components are text lines
Sort the lines and the symbols inside them
Calculate the average area of the connected pixel components for every text line. All inter-symbol gaps larger than a threshold based on that average are inter-word gaps. Use the inter-word gaps to split the text line into words.
Calculate the baseline height for every text line using the histogram of lower y coordinates: the most frequent value is the baseline. Then calculate the baseline shift for every symbol.
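
Several of the steps above work on plain bounding rectangles and can be illustrated without the C++ module. The following is a minimal pure-Python sketch, not the code in segm: the `Rect` type and the helpers `drop_contained`, `text_height`, `is_image`, and `baseline_shifts` are hypothetical names introduced here. It assumes that histogram ties go to the value seen first and that a positive shift means the symbol's bottom edge lies below the baseline (y grows downward).

```python
from collections import Counter
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding rectangle (hypothetical helper type)."""
    x: int
    y: int
    width: int
    height: int


def contains(outer, inner):
    """Return True if `inner` lies entirely inside `outer`."""
    return (outer.x <= inner.x
            and outer.y <= inner.y
            and inner.x + inner.width <= outer.x + outer.width
            and inner.y + inner.height <= outer.y + outer.height)


def drop_contained(rects):
    """Eliminate rectangles contained inside another rectangle."""
    unique = list(dict.fromkeys(rects))  # collapse exact duplicates, keep order
    return [r for r in unique
            if not any(o != r and contains(o, r) for o in unique)]


def most_frequent(values):
    """Peak of the histogram; ties go to the value seen first (an assumption)."""
    return Counter(values).most_common(1)[0][0]


def text_height(rects):
    """The most frequent rectangle height is taken as the text height."""
    return most_frequent(r.height for r in rects)


def is_image(rect, text_h, factor=5):
    """A component much taller or wider than the text height is an image."""
    return rect.height > factor * text_h or rect.width > factor * text_h


def baseline_shifts(line):
    """Baseline = most frequent bottom edge; shift = bottom edge minus baseline."""
    base = most_frequent(r.y + r.height for r in line)
    return base, [r.y + r.height - base for r in line]


rects = [Rect(0, 0, 10, 12), Rect(2, 2, 3, 3), Rect(12, 0, 9, 12),
         Rect(24, 4, 8, 14), Rect(40, 0, 100, 80)]
kept = drop_contained(rects)                  # Rect(2, 2, 3, 3) is nested and dropped
th = text_height(kept)
symbols = [r for r in kept if not is_image(r, th)]
base, shifts = baseline_shifts(symbols)
print(th, base, shifts)                       # prints: 12 12 [0, 0, 6]
```

In this toy input the 100 x 80 rectangle exceeds 5 times the text height and is treated as an image, while the third symbol hangs 6 pixels below the baseline shared by the first two.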

Python extension build instructions

sudo apt install libboost-dev libleptonica-dev libflann-dev libopencv-dev liblz4-dev cmake
pip install -r requirements.txt
python setup.py install

from segm import join_rects
import matplotlib.pyplot as plt
import cv2

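def segment_image(filename):
    # Grayscale copy for segmentation, color copy for drawing
    img = cv2.imread(filename, cv2.IMREAD_GRAYSCALE)
    orig = cv2.imread(filename)
    # Outline every joined rectangle in green
    for r in join_rects(img):
        cv2.rectangle(orig, (r.x, r.y), (r.x + r.width, r.y + r.height), (0, 255, 0), 2)

    return orig

filename = 'libsegm/vd_p122.png'

img = segment_image(filename)

plt.rcParams['figure.figsize'] = [20, 20]
# OpenCV loads images as BGR, while matplotlib expects RGB
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))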

    
    import cv2
    from segmentation import find_ordered_glyphs
    
    filename = 'vd_p214.png'
    orig = cv2.imread(filename)
    glyphs = find_ordered_glyphs(filename)
    font = cv2.FONT_HERSHEY_SIMPLEX
    # Outline each glyph and label it with its 1-based position in the returned order
    for counter, gl in enumerate(glyphs, start=1):
        cv2.rectangle(orig, (gl.x, gl.y), (gl.x + gl.width, gl.y + gl.height), (255, 0, 0), 2)
        cv2.putText(orig, str(counter), (gl.x, gl.y), font, 1, (255, 0, 0), 2, cv2.LINE_AA)
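
The labels drawn in this example follow the order in which the glyphs are returned. As a rough illustration of how the line-grouping and sorting steps of the segmentation algorithm can produce such an order, here is a pure-Python sketch. It is not the code behind `find_ordered_glyphs`: `Rect` and `group_into_lines` are hypothetical names, "to the left" is assumed to mean that the neighbor's right edge does not pass the rectangle's left edge, and a union-find structure stands in for the explicit graph and its connected components.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding rectangle (hypothetical helper type)."""
    x: int
    y: int
    width: int
    height: int


def group_into_lines(rects):
    """Link each rectangle to its nearest left neighbor, then group and sort."""
    parent = list(range(len(rects)))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, r in enumerate(rects):
        best = None
        for j, o in enumerate(rects):
            if j == i:
                continue
            to_left = o.x + o.width <= r.x
            # Vertical extents [y, y + height] of the two rectangles intersect
            overlaps = o.y < r.y + r.height and r.y < o.y + o.height
            if to_left and overlaps:
                if best is None or o.x + o.width > rects[best].x + rects[best].width:
                    best = j
        if best is not None:
            parent[find(i)] = find(best)  # edge between r and its left neighbor

    # Each union-find class is one connected component, i.e. one text line
    lines = {}
    for i, r in enumerate(rects):
        lines.setdefault(find(i), []).append(r)

    # Symbols left to right inside a line, lines top to bottom
    ordered = [sorted(line, key=lambda r: r.x) for line in lines.values()]
    ordered.sort(key=lambda line: min(r.y for r in line))
    return ordered


rects = [Rect(30, 0, 10, 12), Rect(0, 1, 10, 12), Rect(15, 0, 10, 12),
         Rect(0, 40, 10, 12), Rect(14, 41, 10, 12)]
for number, line in enumerate(group_into_lines(rects), start=1):
    print(number, [r.x for r in line])
# prints:
# 1 [0, 15, 30]
# 2 [0, 14]
```

Linking every rectangle to its left neighbor yields the same undirected edges as connecting rectangles with their right neighbors, so the resulting components are the same either way.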
