Optical Character Recognition (OCR) is not a new topic for the Adatis audience. Some Adati, such as Kalina Ivanova (link1, link2) and Francesco Sbrescia (link3), have already explored it from the perspective of Azure Cognitive Services and Azure Data Lake. In my first blog, I would like to approach it from a different angle: using Tesseract and Databricks.
Before extracting any information from a picture, we need to read it first. In our example, we will use this menu:
Let's read the menu in Databricks:
menu = spark.read.format("image").load("/FileStore/tables/images/menu.jpg")
display(menu)
For the OCR itself, we will use Google's Tesseract library. You can install it from here. The installation is simple: you just need to follow the instructions. After the installation, it is very important to add the installation folder to your PATH environment variable. You can do this from the Advanced System Properties:
As a next step, we locate the folder on our PC from which we imported the image into Databricks, and open the Command Prompt there:
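Before moving on, it can help to confirm that the PATH change has taken effect. The following is a minimal sketch, assuming Python is available on the same PC; the helper name `find_tesseract` is hypothetical and only wraps the standard library's `shutil.which`.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


location = find_tesseract()
if location is None:
    # Most likely the installation folder has not been added to PATH yet.
    print("tesseract was not found on PATH")
else:
    print(f"tesseract found at {location}")
```

If the check reports that Tesseract was not found, reopening the Command Prompt after editing the environment variable is usually worth trying first, since already open windows keep the old PATH.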
After we have opened the Command Prompt in the directory of the image, we type the following command:
tesseract menu.jpg menu
This command creates a txt file from our image. Compare the folder before and after running it:
Before:
After:
We can now see a txt file with the same name next to the image.
Let's check whether Tesseract has done its job well. We will import the txt file into Databricks and read it:
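If you would rather script this step than type it by hand, the same command can be launched from Python. This is only a sketch under the assumption that Tesseract is already on PATH; the function names `build_tesseract_command` and `run_ocr` are hypothetical, and the file names mirror the `menu.jpg` example above.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Build the same command as typed above: tesseract <image> <output base>."""
    return ["tesseract", str(image_path), str(output_base)]


def run_ocr(image_path, output_base):
    """Run Tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract was not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    # Tesseract appends ".txt" to the output base on its own.
    return Path(f"{output_base}.txt")


# Show the command without executing it.
print(build_tesseract_command("menu.jpg", "menu"))
```

Calling `run_ocr("menu.jpg", "menu")` from the image folder should then produce `menu.txt`, just like the manual command.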
We see that the menu content has been read quite well.
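Reading the text file in a Databricks notebook could look like the sketch below. It assumes the file was uploaded next to the image as `/FileStore/tables/images/menu.txt` (an assumed path, not confirmed by the steps above) and that the notebook provides the usual `spark` session; the helper name `read_ocr_lines` is hypothetical.

```python
def read_ocr_lines(spark, path):
    """Read an OCR text file through Spark and return its non-empty lines."""
    # spark.read.text yields one row per line, in a single column named "value".
    df = spark.read.text(path)
    return [row.value for row in df.collect() if row.value.strip()]


# In a Databricks notebook, where `spark` is predefined:
# lines = read_ocr_lines(spark, "/FileStore/tables/images/menu.txt")
# for line in lines:
#     print(line)
```

Collecting to the driver is reasonable here because a scanned menu is tiny; for large OCR outputs it would be better to keep working with the DataFrame.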
Of course, there are many other ways to perform OCR, for example using Spark OCR and running the whole process in the cloud, without relying on the on-premises PC environment. Generally, I prefer this hybrid approach because it is simple: it takes fewer resources (prerequisite files, license keys) and less code to accomplish.
Thank you for your attention. This was my first blog. I hope you find it useful. Expect more blogs on different topics soon.