Transforming Data for Search

Key lessons in optimizing search systems, from enterprise to personal scale: discover how structuring data, refining metadata, and choosing the right fields can transform search accuracy.

Until recently, the history of computing was all about training people to think and work in ways that were convenient to computers.

But search has flipped that on its head. Now, we’re living in a search-driven world where our customers and colleagues expect to use natural language when they’re working with data. And the tools to meet that expectation are now pretty much universally available.

Having the right tools, though, is just one step in a journey that includes understanding and preparing your data, creating the right search architecture for your needs, and optimizing performance. 

This isn't just theory. In this article, I'll share lessons from two complementary examples: a months-long project where we prepared a large media company's data for search, and a hands-on GitHub repository I've created that demonstrates these same principles using a manageable books dataset from Kaggle. This way, you can see these concepts applied at both enterprise scale and in a way you can experiment with yourself.

Let's start by looking at why data preparation is so crucial for search in the first place.

Why we need to prepare data for search

Real data isn’t neatly organized into rows and columns or confined to a structured JSON document. It’s messy, unstructured, and highly varied. It could be anything. In a single project you could feasibly find a 35-minute documentary from the 1970s alongside time-series data from player-tracking software analyzing every sprint, pass, and tackle in a football game. 

Even something seemingly straightforward like a book catalog will have its quirks—inconsistent author formatting, missing subtitles, or descriptions that range from a single sentence to multi-paragraph summaries.

But if we want to meet users’ expectations of interacting with computers in a more natural way, we first need to understand that data and prepare it for search. That’s essentially a process of giving computers more context.

To get our data into the right state, we need to go through four phases:

  1. Transformation
  2. Exploration
  3. Normalization and standardization
  4. Selection

Let’s look at what each one involves.

Transforming your data

Because data can be so varied, the first thing we need to do is get it into a format that tools like Elasticsearch can work with. Let me give you an illustration: in the media company project, we were dealing with around 2 million assets spanning decades of content in multiple formats: video, audio, text, and more.

To create searchable indexes, we need text. So, the first step is all about converting that non-text content into text:

  • Transcribing video and audio
  • Extracting metadata from image content
  • Transforming time-series data into summary statistics or events
  • Converting PDFs or other document formats into plain text (see the sketch below)
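To make that last step concrete, here’s a minimal sketch of PDF-to-text conversion. I’m assuming the open-source pypdf library here; the media project used its own pipeline, so treat this as an illustration rather than the implementation:

```python
from pypdf import PdfReader

def pdf_to_text(path: str) -> str:
    # Pull the raw text out of each page so it can be indexed.
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```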

If you’ve ever worked on data transformation, you know how crucial this stage is. Mistakes here don’t just cause headaches—they ripple through everything that follows. Strip timestamps from video transcripts, for example, and you lose the ability to link search results to precise moments.

But it’s not just about mistakes in how you transform the data. It’s also about what the data itself can give you.

First, there's metadata. In our media project, we were working in an "apples and oranges" environment. There was that 35-minute documentary I mentioned earlier but also 15-second video clips recorded a week ago. The older content typically had minimal descriptions (e.g. "Learn about biology"), while newer content came with extensive metadata.

Second, longer content naturally gives you more to work with and that can skew relevance algorithms. A 40-minute transcript will mention key terms more frequently than a 30-second clip, potentially making it appear more relevant for those terms even when it's not.

Knowing that those issues exist means you’re ready to solve them. In this particular project, we tackled these challenges in several ways. We built machine learning algorithms that generated detailed transcripts from videos, then extracted the 15 most relevant keywords and created standardized three-paragraph descriptions for each asset. This approach normalized content across eras and formats, so we had a consistent baseline of searchable text regardless of what metadata originally came with an asset.
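The models we built were custom to that project, but you can get a feel for the keyword-extraction step with a much simpler stand-in. Here’s a sketch using scikit-learn’s TF-IDF, assuming you have a corpus of transcripts to weigh each document against:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(transcript: str, corpus: list[str], k: int = 15) -> list[str]:
    # Weight each term in the transcript against the wider corpus,
    # then keep the k highest-scoring terms.
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(corpus + [transcript])
    scores = matrix[len(corpus)].toarray()[0]
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(scores, terms), reverse=True)
    return [term for score, term in ranked[:k] if score > 0]
```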

This shows how transformation is about more than just converting data from one format to another. You can think of it as creating a level playing field where all of your data has the best chance of being the search result that someone needed.

Even in a simpler, text-based dataset, transformation helps to make the data searchable. Let's look at how this applies to our books example.

There, we face a different transformation challenge. The data already exists in text format, but that doesn’t mean it’s ready for search. Across thousands of book entries, seemingly minor formatting inconsistencies become stumbling blocks for search. For example, where a book has more than one author, that fact is recorded in different ways. Sometimes the authors appear as a single string separated by semicolons, making it nearly impossible to find books by one co-author without knowing their collaborators.

Transforming these semicolon-separated strings into proper arrays gives each author equal footing in the search landscape. It's a small change with outsized impact: a user searching for books by Jacquelyn Reinach can now discover "Rest, Rabbit, Rest" even without knowing she wrote it with Richard Hefter. 
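In code, that transformation is only a few lines. A minimal sketch (the function name is mine, not the repo’s):

```python
def split_authors(raw: str) -> list[str]:
    # "Jacquelyn Reinach;Richard Hefter" -> ["Jacquelyn Reinach", "Richard Hefter"]
    return [name.strip() for name in raw.split(";") if name.strip()]
```

Indexed as an array, each name can be matched in its own right instead of as a fragment of a longer string.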

Notice how the same principle applies at both scales. At the media company, we were using machine learning to generate consistent metadata across millions of assets. In our books repository, we're using simpler string parsing to standardize author information. The technical approaches differ based on volume and complexity, but the goal remains the same: get to a point where all content has an equal chance of discovery.

Exploration

Part of the beauty of a new dataset lies in allowing it to speak for itself. At the transformation stage, we’re still imposing some of our own ideas about what the data is. But to make search useful we need to uncover what is in the data, rather than what we think might be there.

And so exploration starts with stripping away all your preconceived notions about what the data might be. You can think of it like arriving at an unfamiliar library where all the books have been shelved but you don’t know the organizing system, and you can’t even trust that the labels mean what they seem to.

The books dataset gives us a nice illustration of the problem. Take a look at the descriptions of two books:

  • ‘Tis (Frank McCourt): "FROM THE PULIZER PRIZE-WINNING AUTHOR OF THE #1 “NEW YORK TIMES” BESTSELLER “ANGELA'S ASHES” Frank McCourt's glorious childhood memoir, “Angela's Ashes,” has been loved and celebrated by readers everywhere. It won the National Book Critics Circle Award, the “Los Angeles Times” Book Award and the Pulitzer Prize. Rarely has a book so swiftly found its place on the literary landscape. And now we have “'Tis,” the story of Frank's American journey from impoverished immigrant to brilliant teacher and raconteur. Frank lands in New York at age nineteen and gets a job at the Biltmore Hotel, where he immediately encounters the vivid hierarchies of this “classless country,” and then is drafted into the army and is sent to Germany to train dogs and type reports. It is Frank's incomparable voice that renders these experiences spellbinding. When Frank returns to America in 1953, he works on the docks, always resisting what everyone tells him. He knows that he should be getting an education, and though he left school at fourteen, he talks his way into New York University. There, he falls in love with the quintessential Yankee and tries to live his dream. But it is not until he starts to teach that Frank finds his place in the world."
  • Rest, Rabbit, Rest (Jacquelyn Reinach and Richard Hefter): "Rabbit's schedule keeps him so busy his friends have to trick him into resting."

Notice the difference? Not only is one much longer than the other, but the description of ‘Tis arguably includes two different types of data: accolades followed by the actual description. Other books might not have a description at all.

To work around this, we need to find footholds. These are the natural features in the data that give us a way to navigate it. You can find them through diagnostic querying, such as using aggregation queries to understand the shape and characteristics of your collection (see the sketch after this list), including:

  • Analyzing the distribution of values (min, max, median) for each field.
  • Identifying which fields have discrete values versus continuous ranges.
  • Finding patterns of repetition that might indicate enumerations.
  • Looking for outliers and anomalies that could affect search performance.
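Here’s what that diagnostic querying might look like against our books data, using the official Elasticsearch Python client. The index and field names are assumptions based on the examples in this article:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="books",
    size=0,  # we only want the aggregations, not the documents
    aggs={
        # Distribution of values: min, max, and mean of the ratings field.
        "rating_stats": {"stats": {"field": "average_rating"}},
        # The median, via a percentiles aggregation.
        "rating_median": {"percentiles": {"field": "average_rating", "percents": [50]}},
        # Repeating values that might indicate an enumeration.
        "top_categories": {"terms": {"field": "categories.keyword", "size": 10}},
    },
)

print(resp["aggregations"]["rating_stats"])
print(resp["aggregations"]["top_categories"]["buckets"])
```

A field whose terms aggregation returns a handful of buckets covering most documents is a strong candidate for a filter.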

When exploring the media company's 2 million assets, this approach allowed us to see, for example, that some fields contained nearly identical information across thousands of assets while others varied wildly. By letting the data speak for itself, we discovered natural divisions that would make search more intuitive. For instance, we found fields with a limited set of repeating values—perfect candidates for filters that would help users quickly narrow down results.

Normalization and standardization

While the variety in data can be its strength, we need to impose some sort of order if we want our search to be useful. We do that through two related processes: normalization and standardization.

Normalization is where we start to strip away unnecessary variety so that we can make useful comparisons. To do that, we need to:

  • Remove white space and other noise.
  • Make sure numeric data uses the same units: for example, if there’s a mix of imperial and metric measurements then we need to pick a system and convert everything to use it.
  • Balance the length and depth of metadata across different pieces and types of content.

In our media project, we focused on balancing content depth across eras, while our books dataset presents different normalization challenges.

The books dataset has an interesting quirk in that ratings have a precision of two decimal places. ‘Tis gets a rating of 3.68, while Dr. Seuss’s Adventure Stories gets 4.35. But that’s not really how people think about book quality. If a friend says they found a book to be pretty good, you might equate that with three stars, but you’d be hard pressed to argue why one book deserved a 3.68 rather than a 3.75.

So, here we can use the normalization step to turn those continuous values (0.00-5.00) into star ratings.
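A minimal sketch of that bucketing, assuming we settle on half-star increments:

```python
def to_star_rating(raw: float) -> float:
    # Map a 0.00-5.00 rating onto the nearest half star.
    return round(raw * 2) / 2

to_star_rating(3.68)  # 3.5 stars
to_star_rating(4.35)  # 4.5 stars
```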

Standardization then takes the normalized data and imposes a consistent structure. Remember the idea of footholds in the data? This is where we not only map out where the footholds are but we also enhance them: chiseling some deeper, building others up, and adding new ones where needed.

In data terms, this means creating consistent metadata structures, filling in gaps for older content, and ensuring that every asset is equally discoverable through the same search approaches.

If you look at the books example repo, you’ll see that we’ve created a standardized full_title that combines the title and subtitle fields. This recognizes that, in practice, there isn’t really any difference between a book with a subtitle and one without. Instead, the subtitle provides additional context that’s going to be useful to people when they search by title. So, we now have a standardized title format whether or not a book has a subtitle.
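The logic behind full_title is deliberately simple. A sketch of the idea (the repo’s actual implementation may differ):

```python
def build_full_title(title: str, subtitle: str | None) -> str:
    # One consistent, searchable title whether or not a subtitle exists.
    return f"{title}: {subtitle}" if subtitle else title

build_full_title("Gödel, Escher, Bach", "An Eternal Golden Braid")
# -> "Gödel, Escher, Bach: An Eternal Golden Braid"
```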

Selection

In the final stage, we select which data will appear in our search indexes. This is where we must be careful not to fall into a common trap.

Search systems like Elasticsearch look an awful lot like databases. They store, index, and retrieve data. But thinking of them that way can be a problem, because search indexes are fundamentally different tools with different purposes. If you think of search as a presentation layer rather than as a data store, you can focus on creating indexes that serve search instead of overstuffing them with fields that actually make it harder to get good results.

With that understanding, selection becomes about making deliberate choices.

It starts by including only fields that directly impact search. If the field won't help someone find what they're looking for, why include it? 

In the books dataset, this principle becomes immediately apparent. Take a look at those ISBN numbers—both ISBN-10 and ISBN-13 versions. Perfect unique identifiers for a database? Absolutely. But most readers don't have them at their fingertips while searching. Similarly, thumbnail URLs serve our display layer well but can introduce noise into search. Imagine someone searching for Zoom etiquette tips and being met with Dr. Seuss's 'Oh, the Places You'll Go!' simply because its thumbnail URL contains the word 'zoom'. By stripping these fields from our search index, we're not losing data—we're gaining clarity.
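In Elasticsearch terms, that selection shows up in the index mapping. Here’s a sketch of a lean books index, with assumed field names, that keeps ISBNs and thumbnail URLs back in the source database where they belong:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="books",
    mappings={
        "properties": {
            "full_title": {"type": "text"},
            "authors": {"type": "keyword"},  # exact matching and filtering
            "description": {"type": "text"},
            "star_rating": {"type": "half_float"},
            # No ISBNs and no thumbnail URLs: they identify and display
            # books, but they don't help anyone find them.
        }
    },
)
```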

We should also be especially thoughtful about relationships between items. Search engines generally prefer flat, self-contained documents. When you need to represent relationships, it can be better to design this at the application layer rather than trying to nest complex structures in your search index. 
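For example, if our books belonged to series, we might flatten the relationship at index time rather than nesting series objects in the index. A hypothetical sketch:

```python
# Hypothetical shapes: a book and its series live in separate tables.
series = {"id": 7, "name": "Sweet Pickles"}
book = {"title": "Rest, Rabbit, Rest", "series_id": 7}

# The application layer denormalizes the relationship into one
# flat, self-contained document before indexing.
doc = {
    "full_title": book["title"],
    "series_name": series["name"],
}
```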

Finally, remember that search is about human behavior, which constantly evolves. Ten years ago, users carefully constructed search queries with specific syntax. Today, they type loosely formed thoughts and expect systems to understand. There’s a balance to strike here. You don’t want to get bogged down in over-optimizing for a far-off future, but you might look at how things are changing today and what that might lead to in a year or two.

Ultimately, the goal isn’t to include everything. What we’re building is the shortest path between someone’s question and the answer they’re looking for.

Data preparation is the root of good search

Both our example scenarios—processing 2 million media assets spanning decades and organizing a few thousand book records—show that effective search requires deliberate data preparation. Each phase—transformation, exploration, normalization, and selection—creates clearer paths between questions and answers.

The media project demonstrated how these principles scale to enterprise level, where even small inefficiencies multiply across petabytes of data. We saw how machine learning can generate consistent metadata and how selective indexing can reduce both costs and errors.

What connects both scenarios is their human-centered approach. We've moved beyond making people adapt to computers. Our search systems must now adapt to human language and behavior.

At Pelotech, we work daily with a range of clients, from mid-sized businesses to Fortune 500 companies, helping them navigate cloud-native projects such as search.

If you're looking to make your data more discoverable and valuable through search, we'd love to be part of that conversation. Reach out, and let's explore how we can help transform your data through intuitive search.
