In my previous blog about Azure Purview I showed how Microsoft's new cataloguing technology on the Azure platform makes it possible to discover what data assets you have (in Azure and beyond) and to understand what you are looking at (data classifications, sensitivity, business glossary, quality state, ownership, etc.). Although Purview accesses the data content when classifying the data, it is still fundamentally a metadata cataloguing tool. In other words, you can discover where credit card information is stored within your data ecosystem, but not where to find credit card number 1234 5678 8765 4321. This does not make Azure Purview a substandard tool. Far from it: it is a very important tool within the Data Governance space and, as my previous article explains, it fills a major gap in that area within Azure.
For content discovery, we have to look at something different: Azure Cognitive Search, or search as a service.
In this article, we will look at what Azure Cognitive Search is and why it is useful in the Data Analytics domain, and then take it for a quick test drive.
What is Azure Cognitive Search?
Azure Cognitive Search (formerly 'Azure Search') is a cloud search service that allows for the building of a rich search experience over content. As my interest lies with data and analytics, I am hoping to test Azure Cognitive Search within the context of "a rich search experience over data".
Why is Azure Cognitive Search useful in the data analytics domain?
As explained earlier, understanding what data assets you have, and the various aspects of that data, is imperative in modern data analytics practice, where an increasing number of people across the organisation (often in non-technical roles) take part in the data lifecycle, and where there is a huge demand for quicker delivery of data and data solutions to data workers. But adopting a modern data analytics approach also means that more data moves into your data ecosystem more quickly. Humans cannot keep up with manual cataloguing of that data (this is where Azure Purview helps) and, similarly, discovering content will be increasingly hard without some help from Artificial Intelligence.
This is different from the past, where data warehouses were delivered after extensive business requirements gathering, solution design, and linear development. This rigid regime ensured that all data elements were documented and that the structure was known and well understood by the designers and developers, who were often also the main users of the data warehouse for report development. This is of course no longer the case, as traditional data warehousing cannot cope with the volume and variety of data, the nature of the users, or the speed with which data solutions must be delivered. Humans can no longer be expected to hold all the knowledge about the metadata and content of the plethora of data held in the data ecosystem. This is why tools such as Azure Purview, for the metadata part, and Azure Cognitive Search, for the content part, are so important.
How does Azure Cognitive Search work?
Azure Cognitive Search relies on indexing and querying:
"Indexing is an intake process that loads content into your search service and makes it searchable. Internally, inbound text is processed into tokens and stored in inverted indexes for fast scans.
Querying can occur once an index is populated with searchable text...all query execution is over a search index that you create, own, and store in your service." - Introduction to Azure Cognitive Search - Azure Cognitive Search | Microsoft Docs
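To make the quoted description a little more concrete, here is a toy sketch of the idea in Python. It is not how Azure Cognitive Search is implemented: the tokenizer, the scoring (a simple count of matched tokens) and the sample rows are all assumptions made for illustration only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Indexing: map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Querying: return (doc_id, matched_token_count) pairs, best match first."""
    hits = defaultdict(int)
    for token in set(tokenize(query)):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Made-up rows, not taken from the sample dataset used in this article.
documents = {
    1: "Ann Buckley 111-22-3333",
    2: "Bob Smith 111-44-5555",
    3: "Cara Jones 999-88-7777",
}
index = build_inverted_index(documents)
print(search(index, "111-22-3333"))  # prints [(1, 3), (2, 1)]
```

The query never scans the documents themselves, only the token lookup table, which is why the index has to be populated before any search can run. Note also that a document matching only some of the query tokens still comes back, just with a lower score.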
Let's give it a simple test drive
Preamble
In this simple test drive I will:
Load the data to my Data Lake,
Provision Azure Search,
Create an index over my data in my Data Lake,
Add enhancements and Cognitive Services,
Run the index,
Test the search results,
Create a sample search app
Step 1 - load the data
I loaded a sample dataset, Staff list, in CSV format to my Azure Data Lake Gen2. The list is available from the reference at the end of this article.
As a quick side note, the sample data contains the weight of each staff member. I am not sure how appropriate that would be in real life, as I am sure it breaks a few privacy and ethical rules.
Step 2 - Provision Azure Cognitive Search
Next I created an instance of Azure Cognitive Search using the Free tier "F" and kept all the defaults to keep it simple.
Step 3 - Create an index over my data
Next I created an index over my newly loaded Staff list data.
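I created the index through the portal wizard, but for readers who prefer to see it written down, an index is essentially a named list of typed fields with exactly one key field. The sketch below builds such a definition as a Python dictionary in the shape I understand the REST API to expect. The index name "staff-index" and the column names other than EmpID and SSN are hypothetical, and treating every column as a searchable string is a simplifying assumption.

```python
import json


def text_field(name, key=False):
    """Describe one searchable, retrievable string field."""
    return {
        "name": name,
        "type": "Edm.String",
        "key": key,
        "searchable": True,
        "retrievable": True,
    }


def build_index_definition(index_name, column_names, key_column):
    """Build an index definition with one string field per CSV column."""
    if key_column not in column_names:
        raise ValueError(f"key column {key_column!r} is not in the column list")
    return {
        "name": index_name,
        "fields": [text_field(c, key=(c == key_column)) for c in column_names],
    }


# FirstName and LastName are guesses at the headers of the sample file.
definition = build_index_definition(
    "staff-index", ["EmpID", "FirstName", "LastName", "SSN"], key_column="EmpID"
)
print(json.dumps(definition, indent=2))
```

In a real index you would probably not want every column searchable, and numeric columns would use a numeric type rather than a string.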
At this stage it is important to note which formats are supported in Azure Cognitive Search:
PDF
Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 WORD XML)
Open Document formats: ODT, ODS, ODP
HTML
XML
KML (XML for geographic representations)
ZIP
GZ
EPUB
EML
RTF
Plain text files
JSON
CSV
A negative here, in my opinion, is that Parquet is not supported (at least not yet, as at the date of writing this article). Parquet is increasingly pervasive as a file type in stores such as Azure Data Lake Gen2, so it would be good if Microsoft included it as a supported format for Azure Cognitive Search.
Step 3.1 - provision an Azure Search service.
Step 3.2 - import the data.
Step 3.3 - Add AI to enhance the search results, by attaching an Azure Cognitive Service (created during the attachment process).
Step 3.4 - Add a skillset to enhance the search results - I selected the EmpID from my dataset and selected other enhancements as shown below.
Step 3.5 - create and schedule the indexer by selecting the fields from my dataset that I want to be indexed and searchable (names, social security numbers/SSN, etc.)
Step 3.6 - run and monitor the indexer - it may take a few minutes for the indexer to run, but the status should change to Success after a successful run.
Step 3.7 - I can now test the index by running a sample search on the SSN "222-11-7603". The search returns the correct record with the highest score, but it also returns other records that match part of the search phrase, with lower scores (similar to how a search engine will always return more than one result).
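The indexer step above was done in the portal, but the same settings can be written down as a definition. The sketch below shows roughly what an indexer definition for a CSV file looks like, plus a small helper that reads the last-run status. The property names are the ones I understand the REST API to use; the names "staff-indexer", "staff-datasource" and "staff-index" and the two-hour schedule are hypothetical, and nothing here calls the service.

```python
def build_indexer_definition(name, data_source_name, index_name, interval="PT2H"):
    """Build an indexer definition for a CSV source with a header row."""
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
        # ISO 8601 duration; "PT2H" means run every two hours (an assumption).
        "schedule": {"interval": interval},
        "parameters": {
            "configuration": {
                # Treat each CSV row as one search document.
                "parsingMode": "delimitedText",
                "firstLineContainsHeaders": True,
            }
        },
    }


def last_run_succeeded(status_response):
    """Return True when the most recent indexer run reported success."""
    last_result = status_response.get("lastResult") or {}
    return last_result.get("status") == "success"


indexer = build_indexer_definition("staff-indexer", "staff-datasource", "staff-index")

# A hand-written example of the part of a status response the helper reads.
sample_status = {"status": "running", "lastResult": {"status": "success"}}
print(last_run_succeeded(sample_status))  # prints True
```

The "Success" shown in the portal corresponds to the status of the last run, which is why the helper looks at lastResult rather than the top-level status of the indexer itself.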
Step 4 - How to use this search index?
The index can be used in one of three ways:
Option 1 - through an Azure Search SDK
Option 2 - by calling the REST API directly from your own app
Option 3 - generate your own UI by using the demo app function via the Index menu on the Cognitive Search service. My demo app is shown below where I used a simple wildcard search for "Buck*", with the search results correctly returned.
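As a rough illustration of the REST option, the sketch below builds the same "Buck*" wildcard query as an HTTP request using only the Python standard library. The service name, index name and key are placeholders, and the api-version value is an assumption that you should check against the current documentation. The request is only constructed here; run_search would send it and needs a real service and query key.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"  # assumed version; check the current docs


def build_search_request(service_name, index_name, api_key, search_text, top=10):
    """Build (but do not send) a POST request against the index's search endpoint."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{index_name}/docs/search?api-version={API_VERSION}"
    )
    body = {"search": search_text, "queryType": "simple", "count": True, "top": top}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def run_search(request):
    """Send the request and return (score, document) pairs from the response."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [(doc.get("@search.score"), doc) for doc in payload.get("value", [])]


# Placeholder names; replace with your own service, index and query key.
request = build_search_request("my-search-service", "staff-index", "<query-key>", "Buck*")
print(request.full_url)
```

A query key (rather than an admin key) is enough for searching, which is the safer choice for a client app.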
What is the benefit of such a 'search as a service' capability?
In my previous article related to search and artificial intelligence, Azure Purview, does it fill the data governance blind spot for Microsoft? (makingmeaning.info), I described how we can now effectively use artificial intelligence to catalogue our data's metadata, plus some other capabilities crucial to modern pragmatic data governance.
Azure Cognitive Search, as described here, goes further: it allows us to build intelligent search-as-a-service solutions that go beyond metadata and search the actual content of our data, and to do so in a contextual manner, in the same way modern search engines do.
The biggest advantage of being able to search both metadata (Purview) and content (Azure Cognitive Search) is that it aligns completely with modern data practice, where more data of greater variety is stored, and so the need to discover and understand that data is more important than ever.
References
Disclaimer
The views expressed on this post are mine and do not necessarily reflect the views of any organisation I am associated with.