寶寶啾與大寶寶日誌: tesseract-ocr Text Recognition

Thursday, November 20, 2014

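First of all, thanks to Google for opening up such a fun and easy-to-use tool.

Searching Google in Traditional Chinese turns up little activity around it in Taiwan, which I found a bit disappointing.

Still, there is plenty of material on tesseract-ocr and its applications.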

Environment:

OS: Windows 7 64-bit

VC++ 2008

1. First, download tesseract-ocr-setup-3.02.02.exe from this site and install it:

https://code.google.com/p/tesseract-ocr/downloads/list

2.-15. (Screenshot-only steps; the images are not reproduced here.)

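16. Try out the executable that ships with tesseract-ocr

17. Check the version

18. Test the English language pack

19. Test the Chinese language pack

20. Test the Chinese language pack on an image that mixes English and Chinese. To my surprise, it worked.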

21.

API reference:

http://zdenop.github.io/tesseract-doc/index.html

A simple program:

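#include <baseapi.h>
#include <allheaders.h>
#include <windows.h>   // MultiByteToWideChar, WideCharToMultiByte
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>
using namespace std;
string UTF8ToBig5(const std::string& strUTF8);

int main(void){
    tesseract::TessBaseAPI api;
    // Load the Traditional Chinese pack; Init returns non-zero on failure.
    if (api.Init("", "chi_tra", tesseract::OEM_DEFAULT) != 0) {
        cerr << "Could not initialize tesseract" << endl;
        return 1;
    }
    // Mode 7 is PSM_SINGLE_LINE: treat the image as a single line of text.
    api.SetPageSegMode(static_cast<tesseract::PageSegMode>(7));
    api.SetOutputName("out");
    char image[256] = "4.tif";
    // Confirm that Leptonica can read the image before running OCR.
    PIX* pixs = pixRead(image);
    if (pixs == NULL) {
        cerr << "Could not read " << image << endl;
        return 1;
    }
    pixDestroy(&pixs);  // ProcessPages below opens the file by name itself
    STRING text_out;
    api.ProcessPages(image, NULL, 0, &text_out);
    // Save the raw UTF-8 result to a file.
    FILE* output = fopen("test.txt", "wb");
    if (output != NULL) {
        fwrite(text_out.string(), 1, text_out.length(), output);
        fclose(output);
    }
    // Convert to the console's code page before printing.
    string Text01 = UTF8ToBig5(text_out.string());
    cout << "Output text: " << Text01 << endl;
    system("pause");
    return 0;
}

// Converts UTF-8 to the system ANSI code page (CP_ACP), which is Big5
// (code page 950) on a Traditional Chinese Windows installation.
string UTF8ToBig5(const std::string& strUTF8)
{
    // UTF-8 -> UTF-16. With -1 as the input length, the returned size
    // includes the terminating null.
    int len = MultiByteToWideChar(CP_UTF8, 0, strUTF8.c_str(), -1, NULL, 0);
    wchar_t* wszBig5 = new wchar_t[len + 1];
    memset(wszBig5, 0, (len + 1) * sizeof(wchar_t));
    MultiByteToWideChar(CP_UTF8, 0, strUTF8.c_str(), -1, wszBig5, len);
    // UTF-16 -> ANSI code page.
    len = WideCharToMultiByte(CP_ACP, 0, wszBig5, -1, NULL, 0, NULL, NULL);
    char* szBig5 = new char[len + 1];
    memset(szBig5, 0, len + 1);
    WideCharToMultiByte(CP_ACP, 0, wszBig5, -1, szBig5, len, NULL, NULL);
    std::string strTemp(szBig5);
    delete[] szBig5;
    delete[] wszBig5;
    return strTemp;
}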
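
The sample passes 7 to SetPageSegMode. In tesseract 3.02's PageSegMode enum that value is PSM_SINGLE_LINE, so the whole image is treated as one line of text, and an image with two or more lines falls outside what that mode expects. Below is a minimal sketch of picking a mode from the expected number of lines. The helper name choosePageSegMode and the constant names are hypothetical, and the numeric values are assumed to match the 3.02 enum (3 = PSM_AUTO, 6 = PSM_SINGLE_BLOCK, 7 = PSM_SINGLE_LINE); other versions may number them differently.

```cpp
// Numeric values assumed to mirror tesseract 3.02's PageSegMode enum.
const int kPsmAuto = 3;         // PSM_AUTO: fully automatic page segmentation
const int kPsmSingleBlock = 6;  // PSM_SINGLE_BLOCK: one uniform block of text
const int kPsmSingleLine = 7;   // PSM_SINGLE_LINE: one line of text

// Hypothetical helper: choose a page segmentation mode from the number of
// text lines expected in the image. Pass 0 (or a negative value) when the
// line count is unknown.
int choosePageSegMode(int expectedLines) {
    if (expectedLines == 1) {
        return kPsmSingleLine;
    }
    if (expectedLines > 1) {
        return kPsmSingleBlock;
    }
    return kPsmAuto;
}
```

In the program above, the result could stand in for the literal 7, for example api.SetPageSegMode(static_cast<tesseract::PageSegMode>(choosePageSegMode(2))); for a two-line image.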

12 comments:

Unknown, September 4, 2015, 10:10 PM
Hello, I am also writing recognition software with tesseract. Does tesseract only work with VS2008? I cannot get it to run under VS2010.
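Yen-Chu, September 5, 2015, 8:26 AM
I haven't tried VS2010.

In principle, it should work as long as the files are included correctly and the project settings are right.

Here are some links for reference: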
http://stackoverflow.com/questions/8153569/tesseract-ocr-how-to-includ-baseapi-h

http://tpgit.github.io/UnOfficialLeptDocs/vs2008/vs2010-notes.html
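Unknown, September 9, 2015, 3:06 AM
main.obj : error LNK2019: unresolved external symbol _pixRead referenced in function _main
Sorry to bother you again. After finishing the related settings I get the message above. I have read the related articles online but still cannot solve it. Do you know how to fix this?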
Yen-Chu, September 10, 2015, 7:29 AM
Take a look at this post on coka's blog:
http://coka.popcat.net/2011/11/error-lnk2019-unresolved-external-symbol.html

My guess is that the function name could not be found in the third-party library files.
linnil000123, January 11, 2016, 3:57 AM
1>main.obj : error LNK2019: unresolved external symbol "public: virtual __thiscall tesseract::TessBaseAPI::~TessBaseAPI(void)" (??1TessBaseAPI@tesseract@@UAE@XZ) referenced in function _main
1>main.obj : error LNK2019: unresolved external symbol "public: __thiscall STRING::~STRING(void)" (??1STRING@@QAE@XZ) referenced in function _main
1>main.obj : error LNK2019: unresolved external symbol "public: char const * __thiscall STRING::string(void)const " (?string@STRING@@QBEPBDXZ) referenced in function _main
1>main.obj : error LNK2019: unresolved external symbol "public: int __thiscall STRING::length(void)const " (?length@STRING@@QBEHXZ) referenced in function _main
1>main.obj : error LNK2019: unresolved external symbol "public: bool __thiscall tesseract::TessBaseAPI::ProcessPages(char const *,char const *,int,class STRING *)" (?ProcessPages@TessBaseAPI@tesseract@@QAE_NPBD0HPAVSTRING@@@Z) referenced in function _main
1>main.obj : error LNK2019: unresolved external symbol "public: __thiscall STRING::STRING(void)" (??0STRING@@QAE@XZ) referenced in function _main
1>main.obj : error LNK2019: unresolved external symbol _pixRead referenced in function _main
1>main.obj : error LNK2019: unresolved external symbol "public: void __thiscall tesseract::TessBaseAPI::SetOutputName(char const *)" (?SetOutputName@TessBaseAPI@tesseract@@QAEXPBD@Z) referenced in function _main
1>main.obj : error LNK2019: unresolved external symbol "public: void __thiscall tesseract::TessBaseAPI::SetPageSegMode(enum tesseract::PageSegMode)" (?SetPageSegMode@TessBaseAPI@tesseract@@QAEXW4PageSegMode@2@@Z) referenced in function _main
1>main.obj : error LNK2019: unresolved external symbol "public: __thiscall tesseract::TessBaseAPI::TessBaseAPI(void)" (??0TessBaseAPI@tesseract@@QAE@XZ) referenced in function _main
1>main.obj : error LNK2019: unresolved external symbol "public: int __thiscall tesseract::TessBaseAPI::Init(char const *,char const *,enum tesseract::OcrEngineMode,char * *,int,class GenericVector<class STRING> const *,class GenericVector<class STRING> const *,bool)" (?Init@TessBaseAPI@tesseract@@QAEHPBD0W4OcrEngineMode@2@PAPADHPBV?$GenericVector@VSTRING@@@@3_N@Z) referenced in function "public: int __thiscall tesseract::TessBaseAPI::Init(char const *,char const *,enum tesseract::OcrEngineMode)" (?Init@TessBaseAPI@tesseract@@QAEHPBD0W4OcrEngineMode@2@@Z)
1>MSVCRTD.lib(crtexew.obj) : error LNK2019: unresolved external symbol _WinMain@16 referenced in function ___tmainCRTStartup

I get this when building with VS2010. What went wrong? I followed all of the steps above.
linnil000123, January 12, 2016, 4:10 AM
Why is an image with two or more lines of Chinese recognized incorrectly? The same happens with English.
Yen-Chu, January 12, 2016, 8:42 PM
Could you provide the detailed steps and the error output?
linnil000123, January 13, 2016, 1:16 AM
I cropped an image containing these two lines:
民國100年10月10日
A1001010BCDEFG
and the result came out as:

ˊ葷葷玉宣)宣盲)寞]羃羃I羃 l0 日
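linnil000123, January 19, 2016, 6:03 PM
Can the program use several languages at the same time?
api.Init("", "chi_tra", tesseract::OEM_DEFAULT); selects the Chinese pack
api.Init("", "eng", tesseract::OEM_DEFAULT); selects the English pack
How should the code be written to use both together?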
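Yen-Chu, January 20, 2016, 6:06 PM
As I recall, the Chinese pack can recognize English too!

See the second-to-last image in my post.

If you want to mix in other languages, it is best to train the two languages into a single pack!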
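
On the question of loading several languages at once: the language argument of Init in tesseract 3.02 can also name more than one language, with the codes joined by `+` (for example "chi_tra+eng"), provided the traineddata file for each code is installed. The following is a minimal sketch of building that string; the helper name joinLanguages is hypothetical and not part of tesseract.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: joins language codes such as "chi_tra" and "eng"
// into the "chi_tra+eng" form. Empty codes are skipped.
std::string joinLanguages(const std::vector<std::string>& codes) {
    std::string joined;
    for (std::size_t i = 0; i < codes.size(); ++i) {
        if (codes[i].empty()) {
            continue;
        }
        if (!joined.empty()) {
            joined += '+';
        }
        joined += codes[i];
    }
    return joined;
}
```

The result would go in as the second argument, for example api.Init("", joinLanguages(langs).c_str(), tesseract::OEM_DEFAULT); where langs holds "chi_tra" and "eng".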
