Until recently, the history of computing was all about training people to think and work in ways that were convenient to computers.
But search has flipped that on its head. Now, we’re living in a search-driven world where our customers and colleagues expect to use natural language when they’re working with data. And the tools to meet that expectation are now pretty much universally available.
Having the right tools, though, is just one step in a journey that includes understanding and preparing your data, creating the right search architecture for your needs, and optimizing performance.
This isn't just theory. In this article, I'll share lessons from two complementary examples: a months-long project where we prepared a large media company's data for search, and a hands-on GitHub repository I've created that demonstrates these same principles using a manageable books dataset from Kaggle. This way, you can see these concepts applied at both enterprise scale and in a way you can experiment with yourself.
Let's start by looking at why data preparation is so crucial for search in the first place.
Real data isn’t neatly organized into rows and columns or confined to a structured JSON document. It’s messy, unstructured, and highly varied. It could be anything. In a single project, you could easily find a 35-minute documentary from the 1970s alongside time-series data from player-tracking software analyzing every sprint, pass, and tackle in a football game.
Even something seemingly straightforward like a book catalog will have its quirks—inconsistent author formatting, missing subtitles, or descriptions that range from a single sentence to multi-paragraph summaries.
But if we want to meet users’ expectations of interacting with computers in a more natural way, we first need to understand that data and prepare it for search. That’s essentially a process of giving computers more context.
To get our data into the right state, we need to go through four phases: transformation, exploration, normalization, and selection.
Let’s look at what each one involves.
Because data can be so varied, the first thing we need to do is get it into a format that tools like Elasticsearch can work with. Let me give you an illustration: in the media company project, we were dealing with around 2 million assets spanning decades of content in multiple formats. Video, audio, text, and more.
To create searchable indexes, we need text. So, the first step is all about converting that non-text content into text, for example by generating transcripts from video and audio.
If you’ve ever worked on data transformation, you know how crucial this stage is. Mistakes here don’t just cause headaches—they ripple through everything that follows. Strip timestamps from video transcripts, for example, and you lose the ability to link search results to precise moments.
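To make that concrete, here’s a minimal sketch in Python of a transcript document that keeps its timestamps, and what gets lost if you flatten it. The field names and values here are made up for illustration, not taken from either project:

```python
# Hypothetical transcript document: each segment keeps its start/end times,
# so a search hit can still be linked back to a precise moment in the video.
transcript_doc = {
    "asset_id": "doc-1974-0042",       # illustrative ID
    "title": "A 1970s documentary",    # illustrative title
    "segments": [
        {"start": 0.0, "end": 12.4, "text": "In the opening scene..."},
        {"start": 12.4, "end": 31.9, "text": "The narrator explains..."},
    ],
}

# Flattening the segments is tempting and still searchable, but the link
# between a matched phrase and its timestamp is gone for good.
flattened = " ".join(segment["text"] for segment in transcript_doc["segments"])
```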
But it’s not just about mistakes in how you transform the data. It’s also down to what the data can give you.
First, there's metadata. In our media project, we were working in an "apples and oranges" environment. There was that 35-minute documentary I mentioned earlier but also 15-second video clips recorded a week ago. The older content typically had minimal descriptions (e.g. "Learn about biology"), while newer content came with extensive metadata.
Second, longer content naturally gives you more to work with and that can skew relevance algorithms. A 40-minute transcript will mention key terms more frequently than a 30-second clip, potentially making it appear more relevant for those terms even when it's not.
Knowing that those issues exist means you’re ready to solve them. In this particular project, we tackled these challenges in several ways. We built machine learning algorithms that generated detailed transcripts from videos. These algorithms then extracted the top 15 relevant keywords and created standardized three-paragraph descriptions for each asset. This approach normalized content across eras and formats, so we had a consistent baseline of searchable text regardless of what metadata originally came with an asset.
This shows how transformation is about more than just converting data from one format to another. You can think of it as creating a level playing field where all of your data has the best chance of being the search result that someone needed.
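As a rough illustration of the end result, here’s what a transformed asset might look like once the transcript, keywords, and standardized description are in place. The field names and values are mine, not the client’s actual schema:

```python
# A hedged sketch of a transformed asset, ready for indexing.
normalized_asset = {
    "asset_id": "clip-2024-0815",      # illustrative ID
    "format": "video",
    "duration_seconds": 15,
    "transcript": "Full machine-generated transcript goes here...",
    "keywords": ["biology", "cells", "microscope"],  # top keywords, capped at 15
    "description": (
        "Paragraph one: what the asset is. "
        "Paragraph two: the key topics it covers. "
        "Paragraph three: context, era, and related material."
    ),
}
```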
Even in a simpler, text-based dataset, transformation helps to make the data searchable. Let's look at how this applies to our books example.
There, we face a different transformation challenge. The data already exists in text format, but that doesn’t mean it’s ready for search. Across thousands of book entries, seemingly minor formatting inconsistencies translate into stumbling blocks for search. For example, where a book has more than one author, there are different ways of recording that. Sometimes there’s a single string of authors separated by semicolons, which makes it nearly impossible to find books by one co-author without already knowing who they collaborated with.
Transforming these semicolon-separated strings into proper arrays gives each author equal footing in the search landscape. It's a small change with outsized impact: a user searching for books by Jacquelyn Reinach can now discover "Rest, Rabbit, Rest" even without knowing she wrote it with Richard Hefter.
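In code, the fix is about as simple as transformations get. Here’s a minimal sketch in Python; the function name is mine, and the repo may structure this differently:

```python
def split_authors(raw_authors: str) -> list[str]:
    """Turn a semicolon-separated author string into a list of individual authors."""
    return [name.strip() for name in raw_authors.split(";") if name.strip()]

print(split_authors("Jacquelyn Reinach;Richard Hefter"))
# ['Jacquelyn Reinach', 'Richard Hefter']
```

Indexed as an array, each name matches on its own, which is exactly what makes the co-author search work.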
Notice how the same principle applies at both scales. At the media company, we were using machine learning to generate consistent metadata across millions of assets. In our books repository, we're using simpler string parsing to standardize author information. The technical approaches differ based on volume and complexity, but the goal remains the same: get to a point where all content has an equal chance of discovery.
Part of the beauty of a new dataset lies in allowing it to speak for itself. At the transformation stage, we’re still imposing some of our own ideas about what the data is. But to make search useful we need to uncover what is in the data, rather than what we think might be there.
And so, exploration starts with stripping away your preconceived notions about what the data might be. You can almost think of it like arriving at an unfamiliar library where all the books have been shelved but you don’t know the organizing system. And even the labels you do find can’t be trusted to mean what they seem to.
The books dataset gives us a nice illustration of the problem. Take a look at the descriptions of two books:
Notice the difference? Not only is one much longer than the other but the description of ‘Tis arguably includes two different types of data: accolades followed by the actual description. Other books might not have a description at all.
To work around this, we need to find footholds. These are the natural features in the data that give us a way to navigate it. You can find them through diagnostic querying, such as aggregation queries that reveal the shape and characteristics of your collection: how many distinct values a field holds, how often a field is empty, and how much its values vary from one record to the next.
When exploring the media company's 2 million assets, this approach allowed us to see, for example, that some fields contained nearly identical information across thousands of assets while others varied wildly. By letting the data speak for itself, we discovered natural divisions that would make search more intuitive. For instance, we found fields with a limited set of repeating values—perfect candidates for filters that would help users quickly narrow down results.
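If you want to try this kind of diagnostic querying yourself, an aggregation like the one below is a good starting point. It’s a sketch using the Python Elasticsearch client; the index and field names are illustrative rather than taken from either project:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # point at your own cluster

# Three diagnostics in one request: which values repeat (filter candidates),
# how many distinct values a field holds, and how often a field is missing.
response = es.search(
    index="books",
    size=0,  # we only care about the aggregations, not the hits
    aggs={
        "top_categories": {"terms": {"field": "categories.keyword", "size": 20}},
        "distinct_categories": {"cardinality": {"field": "categories.keyword"}},
        "missing_descriptions": {"missing": {"field": "description"}},
    },
)

for bucket in response["aggregations"]["top_categories"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```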
While the variety in data can be its strength, we need to impose some sort of order if we want our search to be useful. We do that through two related processes: normalization and standardization.
Normalization is where we start to strip away unnecessary variety so that we can make useful comparisons across the whole dataset.
In our media project, that meant balancing content depth across eras, while our books dataset presents different normalization challenges.
The books dataset has an interesting quirk: ratings are recorded to two decimal places. ‘Tis gets a rating of 3.68, while Dr. Seuss’s Adventure Stories gets 4.35. But that’s not really how people think about book quality. If a friend says they found a book to be pretty good, you might equate that with three stars, but you’d struggle to argue why one book deserved a rating of 3.68 rather than 3.75.
So, here we can use the normalization step to turn those continuous values (0.00-5.00) into star ratings.
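Here’s one possible way to do that, sketched in Python. The repo may bucket ratings differently; this version snaps the two-decimal averages to the nearest half star:

```python
def to_star_rating(average_rating: float) -> float:
    """Snap a 0.00-5.00 average rating to the nearest half star."""
    return min(5.0, max(0.0, round(average_rating * 2) / 2))

print(to_star_rating(3.68))  # 3.5
print(to_star_rating(4.35))  # 4.5
```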
Standardization then takes the normalized data and imposes a consistent structure. Remember the idea of footholds in the data? This is where we not only map out where the footholds are but we also enhance them: chiseling some deeper, building others up, and adding new ones where needed.
In data terms, this means creating consistent metadata structures, filling in gaps for older content, and ensuring that every asset is equally discoverable through the same search approaches.
If you look at the books example repo, you’ll see that we’ve created a standardized full_title that combines the title and subtitle fields. This recognizes that, in practice, there isn’t really any difference between a book with a subtitle and one without. Instead, the subtitle provides additional context that’s going to be useful to people when they search by title. So, we now have a standardized title format whether or not a book has a subtitle.
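A minimal sketch of that standardization, assuming title and subtitle fields like those in the books dataset (the repo’s exact implementation may differ):

```python
def build_full_title(title: str, subtitle: str | None) -> str:
    """Combine a title and optional subtitle into one standardized full_title."""
    if subtitle and subtitle.strip():
        return f"{title.strip()}: {subtitle.strip()}"
    return title.strip()

print(build_full_title("Oh, the Places You'll Go!", None))
# Oh, the Places You'll Go!
print(build_full_title("'Tis", "A Memoir"))
# 'Tis: A Memoir
```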
In the final stage, we select which data will appear in our search indexes. This is where we must be careful not to fall into a common trap.
Search systems like Elasticsearch look an awful lot like databases. They store, index, and retrieve data. But thinking of them in that way can be a problem, because search indexes are fundamentally different tools with different purposes. If you think of search as a presentation layer rather than as a data store, you can focus on creating indexes that serve search rather than overstuffing them with fields that might actually make it harder to get good results.
With that understanding, selection becomes about making deliberate choices.
It starts by including only fields that directly impact search. If the field won't help someone find what they're looking for, why include it?
In the books dataset, this principle becomes immediately apparent. Take a look at those ISBN numbers—both ISBN-10 and ISBN-13 versions. Perfect unique identifiers for a database? Absolutely. But most readers don't have them at their fingertips while searching. Similarly, thumbnail URLs serve our display layer well but can introduce noise into search. Imagine someone searching for Zoom etiquette tips and being met with Dr. Seuss's 'Oh, the Places You'll Go!' simply because its thumbnail URL contains the word 'zoom'. By stripping these fields from our search index, we're not losing data—we're gaining clarity.
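One simple way to enforce that discipline is to build each indexed document from an explicit allow-list of search-relevant fields. This is a sketch with hypothetical field names and values, not the repo’s exact code:

```python
# Only fields that help someone find a book make it into the search index.
SEARCH_FIELDS = ["full_title", "authors", "categories", "description", "star_rating"]

def to_search_doc(record: dict) -> dict:
    """Project a full book record down to its search-relevant fields."""
    return {field: record[field] for field in SEARCH_FIELDS if field in record}

book_record = {
    "full_title": "Oh, the Places You'll Go!",
    "authors": ["Dr. Seuss"],
    "isbn13": "9780679805274",                                       # stays in the database
    "thumbnail": "https://example.com/covers/zoom/ohtheplaces.jpg",  # display-only
    "description": "A send-off for life's big transitions.",
}
print(to_search_doc(book_record))  # no ISBNs, no thumbnail URL
```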
We should also be especially thoughtful about relationships between items. Search engines generally prefer flat, self-contained documents. When you need to represent relationships, it can be better to design this at the application layer rather than trying to nest complex structures in your search index.
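To make that concrete, here’s one way it could look: the indexed document stays flat and carries only a lightweight reference, and the application resolves the relationship after the search returns. All names and values here are illustrative:

```python
# Flat, self-contained search document with a reference instead of nesting.
book_doc = {
    "book_id": "b-1021",
    "full_title": "Rest, Rabbit, Rest",
    "authors": ["Jacquelyn Reinach", "Richard Hefter"],
    "series_id": "sweet-pickles",  # reference only; series details live elsewhere
}

def enrich_with_series(hit: dict, series_by_id: dict) -> dict:
    """Application-layer join: attach series details to a search hit."""
    series = series_by_id.get(hit.get("series_id"))
    return {**hit, "series": series} if series else hit

# Illustrative lookup data held by the application, not the search index.
series_by_id = {"sweet-pickles": {"name": "Sweet Pickles"}}
print(enrich_with_series(book_doc, series_by_id))
```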
Finally, remember that search is about human behavior, which constantly evolves. Ten years ago, users carefully constructed search queries with specific syntax. Today, they type loosely formed thoughts and expect systems to understand. There’s a balance to strike here: you don’t want to get bogged down in over-optimizing for the far-off future, but it’s worth looking at how things are changing today and where that might lead in a year or two.
Ultimately, the goal isn't to include everything. What we’re building is a way to create the shortest path between someone's question and the answer they're looking for.
Both our example scenarios—processing 2 million media assets spanning decades and organizing a few thousand book records—show that effective search requires deliberate data preparation. Each phase—transformation, exploration, normalization, and selection—creates clearer paths between questions and answers.
The media project demonstrated how these principles scale to enterprise level, where even small inefficiencies multiply across petabytes of data. We saw how machine learning can generate consistent metadata and how smart partial indexing can reduce both costs and errors.
What connects both scenarios is their human-centered approach. We've moved beyond making people adapt to computers. Our search systems must now adapt to human language and behavior.
At Pelotech, we work daily with a range of clients, from mid-sized businesses to Fortune 500 companies, helping them navigate cloud-native projects such as search.
If you're looking to make your data more discoverable and valuable through search, we'd love to be part of that conversation. Reach out, and let's explore how we can help transform your data through intuitive search.