Monday, June 3, 2019

Optical Character Recognition (OCR)

INTRODUCTION

1.1. Optical Character Recognition

Optical Character Recognition (OCR) is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text. OCR is a field of research in pattern recognition, artificial intelligence and machine vision. An OCR system enables you to take a book or a magazine article, feed it directly into an electronic computer file, and then edit the file using a word processor.

All OCR systems include an optical scanner for reading text and sophisticated software for analyzing images. Most OCR systems use a combination of hardware (specialized circuit boards) and software to recognize characters, although some inexpensive systems do it entirely through software. Advanced OCR systems for Roman script can read text in a large variety of fonts, but they still have difficulty with handwritten text.

1.2. History Of Optical Character Recognition

To understand the phenomena described in the above section, we have to look at the history of OCR [3, 4, 6], its development, recognition methods, computer technologies, and the differences between humans and machines [1, 2, 5, 7, 8]. It is always fascinating to find ways of enabling a computer to mimic human functions, like the ability to read, to write, to see things, and so on. OCR research and development can be traced back to the early 1950s, when scientists tried to capture the images of characters and texts, first by mechanical and optical means using rotating disks and photomultipliers, then with flying spot scanners with a cathode ray tube lens, followed by photocells and arrays of them. At first, the scanning operation was slow, and only one line of characters could be digitized at a time by moving the scanner or the paper medium.
Subsequently, drum and flatbed scanners were invented, which extended scanning to the full page. Then, advances in digital integrated circuits brought photo arrays with higher density, faster transports for documents, and higher speed in scanning and digital conversion. These important improvements greatly accelerated the speed of character recognition, reduced the cost, and opened up the possibility of processing a great variety of forms and documents. Throughout the 1960s and 1970s, new OCR applications sprang up in retail businesses, banks, hospitals, post offices, insurance, railroad, and aircraft companies, newspaper publishers, and many other industries [3, 4].

In parallel with these advances in hardware development, intensive research on character recognition was taking place in the research laboratories of both academic and industrial sectors [6, 7]. Because both recognition techniques and computers were not that powerful in the early days (1960s), OCR machines tended to make lots of errors when the print quality was poor, caused either by wide variations in type fonts and roughness of the paper surface or by the cotton ribbons of the typewriters [5]. To make OCR work efficiently and economically, there was a big push from OCR manufacturers and suppliers toward the standardization of print fonts, paper, and ink qualities for OCR applications. New fonts such as OCR-A and OCR-B were designed in the 1970s by the American National Standards Institute (ANSI) and the European Computer Manufacturers Association (ECMA), respectively. These special fonts were quickly adopted by the International Organization for Standardization (ISO) to facilitate the recognition process [3, 4, 6, 7]. As a result, very high recognition rates became achievable at high speed and at reasonable cost. Such accomplishments also brought better printing quality of data and paper for practical applications.
In fact, they completely revolutionized the data input industry [6] and eliminated the jobs of thousands of keypunch operators who were doing the mundane work of keying data into the computer.

1.3. Common Steps Of OCR Processing

The conversion of documents into electronic form, usually referred to as digitization, is undertaken in several steps. The process of scanning a document and representing the scanned image for further processing is called the pre-processing or imaging stage. The process of manipulating the scanned image of a document to produce searchable text is called the OCR processing stage.

1.3.1. The Imaging Stage

The imaging process involves scanning the document and storing it as an image. The most popular image format used for this purpose is the Tagged Image File Format (TIFF). The resolution (number of dots per inch, dpi) determines the accuracy rate of the OCR process.

1.3.2. The OCR Process

The major steps of the OCR processing stage are shown below.

1.3.3. Distinguishing Between Text And Images (Segmentation)

In this step, the text and image blocks of the scanned image are recognized. The boundaries of each image are analyzed in order to identify the text.

1.3.4. Character Recognition (Feature Extraction)

This step involves recognizing a character using a process known as feature extraction. OCR tools store rules about the characters of a given script using a method known as the learning process. A character is then identified by analyzing its shape and comparing its features against a set of rules, stored in the OCR engine, that distinguishes each character.

1.3.5. Recognition Of Character

Following the character identification process, character recognition is performed by comparing the string of characters against an existing dictionary of words. Additional processes such as spell-checking are performed in this step.

1.3.6. Output Formatting

The final step involves storing the output in one of the industry standard formats such as RTF, PDF, Word and plain Unicode text.

1.4. Pattern Recognition

Pattern recognition (also known as classification or pattern classification) is a field within the area of artificial intelligence and can be defined as the act of taking in raw data and taking an action based on the category of the data. It uses methods from statistics, machine learning and other areas.

Typical applications of pattern recognition are:
Automatic speech recognition.
Classification of text into several categories (e.g. spam/non-spam email messages).
The automatic recognition of handwritten postal codes on postal envelopes.
The automatic recognition of images of human faces, etc.

The last two examples form the subtopic image analysis of pattern recognition, which deals with digital images as input to pattern recognition systems.

Some common techniques for pattern recognition include:
Neural Networks (NN)
Hidden Markov Models (HMM)
Bayesian Networks (BN)

The application domains of pattern recognition include:
Computer Vision
Machine Vision
Medical Image Analysis
Optical Character Recognition
Credit Scoring

1.5. Applications Of Pattern Recognition

Pattern recognition has many useful applications.
Some of them are outlined below.
As a telecommunication aid for the deaf, in airline reservation, in postal departments for postal address reading (both handwritten and printed postal codes/addresses), and for medical diagnosis.
For use in customer billing, as in telephone exchange billing systems, in order data logging, in automatic fingerprint identification, and as an automatic inspection system.
In automated cartography, metallurgical industries, computer-assisted forensic linguistic systems, electronic mail, information units and libraries, and for facsimile.
For direct processing of documents: as a multipurpose document reader for large scale data processing, as a microfilm reader data input system, for high speed data entry, for changing text/graphics into a computer readable form, and as an electronic page reader to handle large volumes of mail.

1.6. Scope Of This Work

The project is designed to classify and identify a scanned image containing Arabic characters using a two-step approach. In the first step the Arabic text image is preprocessed, and in the second step its features are extracted. Throughout the work it is assumed that there is no noise in the image and that the image is perfectly scanned with no deviation from its original angle (no skewing).

1.7. Objectives And Applications Of This Work

Arabic Optical Character Recognition can open a new way of realizing the dream of a natural mode of communication between man and machine in this part of the world. It will expand and multiply already available knowledge to new horizons. Centuries-old rare manuscripts in Arabic, Urdu and Persian will become available to the common man.

The ultimate goal of character recognition is to simulate the human reading capabilities.
Character recognition systems can contribute immensely to the advancement of the automation process and can improve the interaction between man and machine in many applications, including office automation, check verification and a large variety of banking, business and data entry applications, library archives, document identification, e-book production, invoice and shipping receipt processing, subscription collections, questionnaire processing, exam paper processing and many other applications [9], besides online address and signboard reading.

1.8. Thesis Organization

The remaining part of this thesis is divided into four chapters. Chapter 2 reviews the literature. Chapter 3 describes Arabic script, its peculiarities and problems. Chapter 4 covers the development of Arabic character recognition, and Chapter 5 presents conclusions and future directions.

Chapter 2
REVIEW OF LITERATURE

2.1. Optical Character Recognition

Since the beginning of writing as a form of communication, paper has prevailed as the medium for writing. Electronic media are replacing paper over time. Because they save space and are fast to access, electronic media are constantly gaining popularity. The convenience of paper, its widespread use for communication and archiving, and the amount of information already on paper press for quick and accurate methods to automatically read that information and convert it into electronic form [Albadr95].

The potential application areas of automatic reading machines are numerous. One of the earliest, and most successful, applications is sorting checks in banks, as the volume of checks that circulates daily has proven to be too large for manual entry. Other applications are detailed in Section 2.3 [Govindan90, Mantas86].

The machine imitation of human reading (i.e. optical character recognition) has been the subject of extensive research for more than five decades.
Character recognition is a pattern recognition application whose ultimate aim is to simulate the human reading capabilities for both machine-printed and handwritten cursive text. The currently available systems may read faster than humans, but they cannot reliably read such a wide variety of text, nor can they take context into account. A great amount of further effort is required to, at least, narrow the gap between human and machine reading capabilities. The practical importance of OCR applications, as well as the interesting nature of the OCR problem, has led to great research interest and measurable advances in this field. Commercial OCR systems for Latin characters are now commonly available on personal computers, achieving recognition rates above 99% [McClelland91, Welch93]. Further, systems on the market can now read a variety of writing styles (e.g., hand-written, printed omni-font) and character sets including Chinese, Japanese, Korean, Cyrillic, and Arabic.

Since the 1950s, researchers have carried out extensive work and published many papers on character recognition. Nearly all of the published work on OCR has been on Latin, Japanese or Chinese characters. This work started in the mid-1940s for Latin and in the middle of the 1960s for Chinese and Japanese. The following are useful surveys and reviews on Latin character recognition. Reference may be made to [Mori92] for a historical review of OCR research and development. The survey of [Govindan90] includes surveys of other languages. [Mantas86] gives an overview of character recognition methodologies, [Impedovo91] covers commercial OCR systems, [Tian91] machine-printed OCR, and [Tappert90, Wakahara92] on-line handwriting recognition. [Suen80] surveys automatic recognition of hand-printed characters (viz. numerals, alphanumeric, FORTRAN, and Katakana), while [Nouboud90] reviewed the recognition of hand-printed (non-cursive) characters and conducted beta tests on a commercial system. [Bozinovic89, Simon92] surveyed off-line cursive word recognition, Jain et al. [Jain2000] reviewed statistical pattern recognition methods, and [Plamondon2000] gives a comprehensive survey of online and offline handwriting recognition. Two bibliographies of the fields of OCR and document analysis appeared in [Jenkins93, Kasturi92]. [Stallings76] and [Mori84] produced surveys on recognition of Chinese machine-printed and hand-printed characters, respectively, and Liu et al. [Liu2004] addressed the state of the art of online recognition of Chinese characters.

2.2. General Review Of Arabic Character Recognition

Although almost one billion people worldwide, in several different languages, use Arabic characters for writing (Arabic, Persian, and Urdu are the most noted examples), Arabic character recognition has not been researched as thoroughly as Latin, Japanese, or Chinese. The first published work on Arabic character recognition may be traced back to 1975, to the master's thesis of Nazif [Nazif75]. In his thesis a system for the recognition of printed Arabic characters was developed, based on extracting strokes that he called radicals (20 radicals are used) and their positions. He used correlation between the templates of the radicals and the character image. A segmentation phase was included to segment the cursive text. Years later Badi and Shimura [Badi78, Badi80] and Nouh [Nouh80] worked on printed Arabic characters, and Amin [Amin80] on hand-written Arabic characters. Surveys on Arabic optical text recognition (AOTR) can be found in [Amin85a, Amin98, Shoukry89, Jambi91, Albadr95, Nabawi2000, Ahmed94].

On-line systems are restricted to recognizing hand-written text.
Some systems recognize isolated characters [Ali89, Amin80, Amin85b, Amin87, ElSheikh89, ElSheikh90b, ElWakil87, ElWakil89, Saadallah85] and hand-written mathematical formulas [ElSheikh90c, Amin91b], while others recognize cursive words [Badi78, Badi80, Badi82, Amin82a, Amin82b, Shaheen90, AlEmami90]. Since the segmentation problem in Arabic is non-trivial, the latter systems deal with a much harder problem.

While several off-line systems use video cameras to digitize pages of text (e.g., [Abbas86, Goraine92, Amin86, HajHassan85, HajHassan90, Nouh80, Nouh87, Nouh89, Sarfraz2003, Sarfraz2004]), the trend now is to use scanners with resolutions ranging from 200 to 400 dots per inch (e.g., [AbdelAzim89c, AbdelAzim90a, AlYousefi88, Amin91a, Bouhlila89, ElDabi90, ElSheikh88a, Ramsis88, Sarfraz2003a, Sarfraz2003b, Zidouri2002, Zidouri2005]). Scanners introduce less noise to an image, are less expensive, and are more convenient to use for character recognition, especially when coupled with automatic document feeders, automatic binarization, and image enhancement.

Among the off-line systems that recognize hand-written isolated characters are [Abuhaiba90, AlYousefi90, AlTikriti85, ElDesouky92, Hyder88]. [Abbas86, AbdelAzim89b, Goneid92] recognize hand-written Arabic (Hindi) numerals, and [Badi80, Badi82, Goraine92, Jambi92, Zahour91] recognize hand-written words. The majority of off-line systems recognize typewritten cursive words [AbdelAzim89c, AbdelAzim90a, Bouhlila89, ElDabi90, Amin86, ElKhaly90, ElSheikh88b, Goraine89, Khella92, Margner92, Nazif75, Nouh87, Ramsis88, Tolba89, Tolba90, ElRamly89c, HajHassan90, HajHassan91], while [ElShiekh88a, Mahdi89, Mahmoud94, Nouh80, Nouh89, NurulUla88, Fayek92, Sarfraz2005d, Zidouri2005] recognize only typewritten isolated characters. The systems of [Abdelazim90b, AlBadr92, ElGowely90, Kurdy92, Fakir93] are intended to recognize typeset words. One of the systems [Abdelazim89a] recognizes bilingual (Arabic/Latin) typewritten words.
Examples of systems for the recognition of other languages that use Arabic script are [Parhami81, Yalabik88, Hyder88], which are designed for the recognition of Persian, Ottoman (Old Turkish), and Urdu, respectively.

2.3. Applications Of Optical Character Recognition

Optical character recognition technology has many practical applications that are independent of the treated language. The following are some of these applications.
Financial business applications: for sorting bank checks, since the number of checks per day has been far too large for manual sorting.
Commercial data processing: for entering data into commercial data processing files, for example entering the names and addresses of mail order customers into a database. In addition, it can be used as a work sheet reader for payroll accounting.
In postal departments: for postal address reading and sorting, and as a reader for handwritten and printed postal codes.
In the newspaper industry: high-quality typescript may be read by recognition equipment into a computer typesetting system to avoid typing errors that would be introduced by re-keying the text on computer peripheral equipment.
Use by the blind: as a reading aid using photo sensors and tactile simulators, and as a sensory aid with sound output. Additionally, it can be used for reading text sheets and reproducing Braille originals.
In facsimile transmission: this process involves transmission of pictorial data over communication channels. In practice, the pictorial data is mainly text. Instead of transmitting characters in their pictorial representation, a character recognition system could be used to recognize each character and then transmit its text code.

Finally, it is worth noting that the major potential application for automatic character recognition is as a general data entry method for automating the work of an ordinary office typist.

2.4. Development Of New OCR Techniques

As OCR research and development advanced, demands on handwriting recognition also increased, because a lot of data (such as addresses written on envelopes; amounts written on checks; names, addresses, identity numbers, and dollar values written on invoices and forms) were written by hand and had to be entered into the computer for processing. But early OCR techniques were based mostly on template matching, simple line and geometric features, stroke detection, and the extraction of their derivatives. Such techniques were not sophisticated enough for practical recognition of data handwritten on forms or documents. To cope with this, the Standards Committees in the United States, Canada, Japan, and some countries in Europe designed handprint models in the 1970s and 1980s for people to write in boxes [7]. Characters written in such specified shapes did not vary too much in style, and they could be recognized more easily by OCR machines, especially when the data were entered by controlled groups of people; for example, employees of the same company were asked to write their data like the advocated models. Sometimes writers were asked to follow certain additional instructions to enhance the quality of their samples, for example, write big, close the loops, use simple shapes, do not link characters, and so on. With such constraints, OCR recognition of handprints was able to flourish for a number of years.

2.5. Recent Trends And Movements

As the years of intensive research and development went by, and with the birth of several new conferences and workshops such as IWFHR (International Workshop on Frontiers in Handwriting Recognition) [1], ICDAR (International Conference on Document Analysis and Recognition) [2], and others [13], recognition techniques advanced rapidly. Moreover, computers became much more powerful than before.
People could write the way they normally did, characters no longer had to be written like specified models, and the subject of unconstrained handwriting recognition gained considerable momentum and grew quickly. By now, many new algorithms and techniques in pre-processing, feature extraction, and powerful classification methods have been developed [8, 9].

Chapter 3
ARABIC: A CURSIVE SCRIPT

3.1. Arabic

Arabic is a Semitic language used as the principal language in most Arab countries. Arabic is spoken by 234 million people [9] and is important in the culture of many more. While spoken Arabic varies across regions, written Arabic, sometimes called Modern Standard Arabic (MSA), is a standardized version used for official communication across the Arab world [9]. The characters of Arabic script and similar characters are used by a much higher percentage of the world's population to write languages such as Arabic, Farsi (Persian) and Urdu. Thus the ability to automate the interpretation of written Arabic would have widespread benefits.

Urdu is commonly written in the calligraphic Nastaliq script, whereas Arabic is more commonly written in Naskh. Usually, bare transliterations of Arabic into Roman letters omit many phonemic elements that have no equivalent in English or other languages commonly written in the Roman alphabet. The National Language Authority of Pakistan has developed a number of systems with specific notations to signify non-English sounds, but these can only be properly read by someone already familiar with Urdu, Persian, or Arabic for letters such as ? ? ? ? or ? and Hindi for letters.
Most Arabic characters, when joined, form an angle of about 45 degrees to the horizontal line. Because of this, reading Arabic script is faster than reading Roman script, but on the other hand it is harder for novice readers and for machines to identify a word or to segment one character from the rest.

Unlike the English script, there are no capital or small characters in Arabic script, but the last character of a word can be considered a capital character, as in many cases it presents the full form of the character, while the characters at the initial and middle positions are considered small. Every character has an independent shape besides its different joining forms, but some letters of the alphabet, like the characters making up the word Urdu (? ? ? ?) or those of a similar category, are not joinable and cannot be connected. The Arabic alphabet uses consonant letters, vowels, diacritic marks, numerals, punctuation and a few superscript signs.

The graphical representation of each letter has more than one form depending on its position and context in the word. In general each letter has four forms, that is, beginning, middle, final and stand-alone, as shown in Table 3.1.

3.2. Arabic Letters

The Arabic alphabet contains 28 letters. Each has between two and four shapes, and the choice of which shape to use depends on the position of the letter within its word or sub-word. The shapes correspond to the four positions: beginning of a (sub-)word, middle of a (sub-)word, end of a (sub-)word, and in isolation. Table 3.1 shows each shape for each letter. For letters without distinct initial shapes, the initial shape is simply the isolated shape and the medial shape is the final shape.

Some letters have descenders or ascenders, which are portions that extend below the primary line on which the letters sit or above the height of most letters. There is no upper or lower case, only one case. Arabic script is written from right to left, and letters within a word are usually joined even in machine print.
Letter shapes, and whether or not to connect, depend on the letter and its neighbors. Letters are connected at the same relative height. The baseline is the line at the height at which letters are connected, and it is analogous to the line on which an English word sits. Letters lie wholly above it except for descenders and some markings. There is no connection between separate words, so word boundaries are always represented by a space. Six letters, however, can be connected only on one side. When they occur in the middle of a word, the word is divided into multiple sub-words separated by space.

A ligature is a character formed by combining two or more letters in an accepted manner. Arabic has numerous standard ligatures, which are exceptions to the above rules for joining letters. The most common is laam-alif, the combination of laam and alif; others include yaa-meem.

3.3. Problems Of Arabic Script

Despite a large character set, Arabic has only a small set of characters that are easily distinguishable from one another. The remaining characters differ from these characters by dots or symbols above or below the shapes [19]. Table 3.2 shows groups of similar characters and their derived forms.

As shown in Table 3.2, only 21 different groups exist out of the 32-character set. This complicates the recognition phase for Arabic characters. Further study of the other forms (initial, middle and final) of these characters reveals that ein ( ) is similar to hamza (?), wow (?) might be confused with (?), ze (?) resembles noon ( ), and mem (?) can be confused with the middle form of ein ( ) and with the stand-alone goal-he (?).

A key distinction between Latin scripts and Arabic script is the fact that many letters differ only by one or more dots while the primary stroke is exactly the same [19].

3.4. Other Problems In Arabic OCR

All Muslims (a large fraction of the people on the earth) can read Arabic because it is the language of Al-Quran, the holy book of Muslims.
Even so, Arabic script recognition has not received enough interest from researchers. Little research progress has been achieved compared to that on Latin and Chinese. The solutions available in the market are still far from perfect [11, 14]. Several reasons have led to this result:
Lack of financial support and of platforms from any government, even in countries where Arabic is the official language.
Lack of adequate support in terms of journals, books, etc., and lack of interaction between researchers in this field.
Lack of general supporting utilities like Arabic text databases, dictionaries, programming tools, and supporting staff.
Late start of Arabic text recognition (first publication in 1975, compared with the 1940s in the case of Latin character recognition).
The research carried out on the Arabic language is typically scattered and done outside the Arab world.
No specialized conferences or symposia have been held so far.
Algorithms developed for other language scripts are not applicable to Arabic.

3.5. Characteristics Of Arabic Characters

The calligraphic nature of the Arabic script distinguishes it from other scripts in several ways. For example:
Arabic text is written from right to left.
No upper or lower cases exist in Arabic, but sometimes the last character of a word is considered upper case because it always remains in its full form.
Arabic has 28 basic characters, of which 16 have from one to three dots. Those dots differentiate between otherwise similar characters. Additionally, three characters can have a zigzag-like stroke. The dots and this stroke are called secondaries, and they are located above the character's primary part as in ALEF (?), or below as in BAA (?), or in the middle as in JEEM (?).
Written Arabic text is cursive in both machine-printed and hand-written text. Within a word, some characters connect to the preceding and/or following characters, and some do not connect.
The connectivity of characters results in a word having one or more connected components. We will refer to each connected piece of a word as a sub-word.
The shape of an Arabic character depends on its position in the word; a character might have up to four different shapes depending on it being isolated, connected from the right (beginning form), connected from the left (ending form), or connected from both sides (middle form).
A distinguishing feature of Arabic writing is the presence of a baseline. The baseline is a horizontal line that runs through the connected portions of text (i.e. where the characters' connection segments are located). The baseline has the highest number of text pixels. (See Figure 3.2.)
Characters in a word may overlap vertically (even without touching).
Arabic characters do not have a fixed size (height and width). The character size varies according to its position in the word.
Characters in a word can have diacritics. These diacritics are written as strokes, placed either on top of, or below, the characters. A different diacritic on a character may change the meaning of a word. Readers of Arabic are accustomed to reading un-diacritized text by deducing the meaning from context.
A number of characters can combine vertically to form a ligature, especially in typeset and handwritten text.
Arabic words may consist of one or more sub-words. Each sub-word may have one or more characters, because some Arabic characters are not connectable to others from the left side. As an example, the word Ketab ( ) consists of two sub-words: Keta ( ), which consists of three characters, and BAA ( ?), which is a single character.
There are only three characters that represent vowels, ? , ? or ? .
However, there are other, shorter vowels represented by diacritics in the form of overscores or underscores, although the use of overscores and underscores in Arabic is less common.
Dots may appear as two separated dots, touching dots, a hat, or a stroke.
Another style of Arabic handwriting is the artistic or decorative calligraphy, which is usually full of overlapping, making the recognition process even more difficult, for humans as well as for computers.

3.6. Summary

The difficulties of Arabic script include the cursive nature of its writing, the right-to-left direction of writing, the change of form and shape when a character is placed at different locations in a word, loops, half-closed characters, and dots above or below a character. The National Language Authority defined a 32-character set, but it has 21 working characters besides numerals and diacritics.

Chapter 4
ARABIC CHARACTER RECOGNITION

4.1. Phases Of Arabic Character Recognition

In an offline character recognition system, the user scans a particular script, runs the OCR and gets the document saved in a file format of his choice. The transformation of the text from the scanning phase to the final document involves a number of phases that are transparent to the user. The proposed system can be implemented in the following steps:
Image acquisition
Digitization
Preprocessing
Feature extraction
Recognition

Figure 4.1 shows the componen
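
As a rough illustration of the digitization and preprocessing phases listed in Section 4.1, the sketch below thresholds a grayscale image into a binary one and then locates the baseline as the row with the most text pixels, following the description of the baseline in Section 3.5. It is not the implementation used in this work: the function names, the representation of an image as a list of pixel rows, and the fixed threshold of 128 are all assumptions made for the example.

```python
from typing import List

# Assumed representation: a grayscale image is a list of rows,
# each pixel an int from 0 (black) to 255 (white).
Image = List[List[int]]


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Digitization: mark dark pixels as text (1) and the rest as background (0).

    The fixed threshold is an assumption for illustration only.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def horizontal_projection(binary: Image) -> List[int]:
    """Count the text pixels in every row of a binary image."""
    return [sum(row) for row in binary]


def baseline_row(binary: Image) -> int:
    """Return the index of the row with the most text pixels.

    If several rows tie, the topmost one is returned.
    """
    profile = horizontal_projection(binary)
    return max(range(len(profile)), key=profile.__getitem__)


if __name__ == "__main__":
    W, B = 255, 0  # white background, black ink
    page = [
        [W, W, B, W, W, W],
        [W, B, B, W, B, W],
        [B, B, B, B, B, W],  # densest row: the connecting stroke
        [W, W, B, W, W, W],
    ]
    binary = binarize(page)
    print(horizontal_projection(binary))  # [1, 3, 5, 1]
    print(baseline_row(binary))           # 2
```

On a real scanned line the same projection profile would be computed on a much larger image, and the noise-free, skew-free assumption stated in Section 1.6 is what makes a single global maximum a reasonable baseline estimate.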
