For every connected component of pixels find bounding rectangle
Eliminate all rectangles contained inside other rectangles
Join all intersecting rectangles
Make a histogram of rectangle heights
The height with the highest frequency is the text height
Mark all components whose height or width is greater than 5 times the text height as images
For every rectangle find its left neighbor: the nearest rectangle to the left that intersects the vertical interval from the component's y coordinate to y + height
Create a graph, connect all components with their right neighbors
Find connected components in that graph
Join intersecting components
Those components are text lines
Sort lines and symbols inside them
Calculate the average area of the connected pixel components for every text line, let us call it . All the intersymbol gaps bigger than will be the interword gaps. Use the interword gaps to split the text line into words.
Calculate the baseline height for every text line using the histogram of lower y coordinates. The most frequently occurring value is the baseline. Calculate the baseline shift for every symbol.
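
Several of the steps above (text height from the height histogram, image marking, line grouping through left neighbors, and baseline shift) can be sketched in plain Python. This is only an illustrative sketch, not the project's actual code: it assumes rectangles are already extracted as `(x, y, w, h)` tuples with y growing downward, and the function names (`text_height`, `split_images`, `left_neighbor`, `group_lines`, `baseline_shifts`) are hypothetical. The rectangle containment and merging steps and the word splitting step are left out.

```python
from collections import Counter

def text_height(rects):
    """Mode of the rectangle height histogram."""
    return Counter(h for _, _, _, h in rects).most_common(1)[0][0]

def split_images(rects, factor=5):
    """Separate symbol candidates from components much larger than the text."""
    limit = factor * text_height(rects)
    symbols = [r for r in rects if r[2] <= limit and r[3] <= limit]
    images = [r for r in rects if r[2] > limit or r[3] > limit]
    return symbols, images

def left_neighbor(i, rects):
    """Index of the nearest rectangle lying to the left of rects[i] whose
    vertical extent overlaps [y, y + h], or None if there is none."""
    x, y, w, h = rects[i]
    best = None
    for j, (x2, y2, w2, h2) in enumerate(rects):
        if j == i:
            continue
        lies_left = x2 + w2 <= x
        overlaps_vertically = y2 < y + h and y < y2 + h2
        if lies_left and overlaps_vertically:
            if best is None or x2 + w2 > rects[best][0] + rects[best][2]:
                best = j
    return best

def group_lines(rects):
    """Connected components of the neighbor graph, found with union-find."""
    parent = list(range(len(rects)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i in range(len(rects)):
        j = left_neighbor(i, rects)
        if j is not None:
            parent[find(i)] = find(j)

    lines = {}
    for i, r in enumerate(rects):
        lines.setdefault(find(i), []).append(r)
    # Symbols are sorted left to right, lines top to bottom.
    result = [sorted(line) for line in lines.values()]
    result.sort(key=lambda line: min(r[1] for r in line))
    return result

def baseline_shifts(line):
    """Offset of every symbol's lower edge from the most frequent lower edge."""
    baseline = Counter(y + h for _, y, _, h in line).most_common(1)[0][0]
    return [y + h - baseline for _, y, _, h in line]

# Made-up sample: two text lines (one symbol with a descender) and one image.
sample = [(0, 0, 8, 10), (10, 0, 8, 10), (20, 0, 8, 13),
          (0, 30, 8, 10), (10, 30, 8, 10), (40, 0, 100, 80)]
symbols, images = split_images(sample)
lines = group_lines(symbols)
```

With this sample, `lines` holds two text lines, and `baseline_shifts(lines[0])` reports a nonzero shift only for the symbol with the descender. The left neighbor search here is quadratic in the number of rectangles; a spatial index would be a natural replacement for large pages.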