Unstructured data is information that is not arranged according to a preset data model or schema, and therefore cannot be stored in a traditional relational database management system (RDBMS). Text and multimedia are two common types of unstructured content. Many business documents are unstructured, as are email messages, videos, photos, webpages, and audio files.
From 80% to 90% of data generated and collected by organizations is unstructured, and its volumes are growing rapidly — many times faster than the rate of growth for structured databases.
Unstructured data stores contain a wealth of information that can be used to guide business decisions. However, unstructured data has historically been very difficult to analyze. With the help of AI and machine learning, new software tools are emerging that can search through vast quantities of it to uncover beneficial and actionable business intelligence.
Let’s take structured data first: it’s usually stored in a relational database or RDBMS, and is sometimes referred to as relational data. It can be easily mapped into designated fields — for example, fields for zip codes, phone numbers, and credit cards. Data that conforms to RDBMS structure is easy to search, both with human-defined queries and with software.
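As a minimal sketch of what "designated fields" means in practice, the following uses Python's built-in sqlite3 module. The customers table, its columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory relational database. The table and column names are
# hypothetical, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, zip_code TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, zip_code, phone) VALUES (?, ?, ?)",
    [
        ("Ada", "10001", "212-555-0100"),
        ("Grace", "94105", "415-555-0101"),
        ("Linus", "10001", "212-555-0102"),
    ],
)


def customers_in_zip(connection, zip_code):
    """Return the names of customers in a zip code, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM customers WHERE zip_code = ? ORDER BY name",
        (zip_code,),
    )
    return [name for (name,) in rows]


print(customers_in_zip(conn, "10001"))
```

Because every record has the same fields, a one-line query is enough to find all matching rows. That predictability is what makes structured data easy to search.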
Unstructured data, in contrast, doesn’t fit into these sorts of pre-defined data models. It can’t be stored in an RDBMS. And because it comes in so many formats, it’s a real challenge for conventional software to ingest, process, and analyze. Simple content searches can be undertaken across textual unstructured data with the right tools.
Beyond that, unstructured data lacks the consistent internal structure that typical data mining systems require. As a result, companies have largely been unable to tap into value-laden data like customer interactions, rich media, and social network conversations. Robust tools for doing so are only now being developed and commercialized.
Unstructured data can be created by people or generated by machines.
Here are some examples of the human-generated variety:
Here are some examples of unstructured data generated by machines:
As we’ve already seen, structured data is organized in ways that make for easy searching. Unstructured data — comprising most other types — exists in formats such as audio, video, and social media postings, and is not easy for conventional tools to search.
Comparing the two types should not be thought of as a contest between them. You simply choose one or the other based on the applications you’re interested in. Relational databases handle structured data, while most other kinds of systems can house unstructured data.
Common RDBMS applications using structured data include airline reservation systems, inventory control, sales transactions, and ATM activity. Typical unstructured use cases are media-viewing and editing tools, presentation software, and word processing.
There is also a third category called semi-structured data. While not stored in relational databases, this type of information has some organizing properties, making it easier to parse and analyze. Specifically, semi-structured data contains internal tags and markings that allow for grouping and hierarchies.
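To illustrate how internal tags allow for grouping and hierarchies, here is a small sketch that uses JSON as one example of a semi-structured format. The records, field names, and the group_by_category helper are all hypothetical. Note that the records do not share the same set of fields, yet the tags they do carry are enough to group them.

```python
import json
from collections import defaultdict

# Hypothetical records: each has a nested "product" tag, but the
# remaining fields differ from record to record.
RAW = """
[
  {"type": "review", "product": {"sku": "A1", "category": "audio"}, "text": "Great sound."},
  {"type": "ticket", "product": {"sku": "B2", "category": "video"}, "priority": "high"},
  {"type": "review", "product": {"sku": "C3", "category": "audio"}, "rating": 4}
]
"""


def group_by_category(json_text):
    """Group product SKUs by the nested product.category tag."""
    groups = defaultdict(list)
    for record in json.loads(json_text):
        product = record.get("product", {})
        category = product.get("category", "unknown")
        groups[category].append(product.get("sku"))
    return dict(groups)


print(group_by_category(RAW))
```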
Email is a common example of semi-structured data. While detailed email analysis requires sophisticated tools, its native metadata (sender, recipient, date, subject) allows for basic classification and keyword searches. Semi-structured data makes up only 5% to 10% of total enterprise data, but it has some critical use cases. Common formats and stores include the XML markup language, the versatile JSON data-interchange format, and databases of the NoSQL or non-relational variety. These databases are a good choice for storing information such as text with variable lengths. The most widely used non-relational database, MongoDB, accommodates semi-structured documents by storing them natively as JSON-like documents (encoded internally as BSON, a binary form of JSON).
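The email case can be sketched with Python's standard email package. This is only an illustration of basic classification by metadata plus a naive keyword search. The message, the addresses, and the summarize_email helper are hypothetical.

```python
from email import message_from_string

# A hypothetical message: structured headers followed by free-form text.
RAW_EMAIL = (
    "From: alice@example.com\n"
    "To: support@example.com\n"
    "Subject: Refund request for order 1234\n"
    "Date: Mon, 01 Jan 2024 09:30:00 +0000\n"
    "\n"
    "Hello, I would like a refund because the headphones arrived damaged.\n"
)


def summarize_email(raw, keywords):
    """Extract header metadata and report which keywords occur in the body."""
    msg = message_from_string(raw)
    body = msg.get_payload().lower()  # plain string for a non-multipart message
    found = [k for k in keywords if k.lower() in body]
    return {
        "sender": msg["From"],
        "subject": msg["Subject"],
        "keywords_found": found,
    }


print(summarize_email(RAW_EMAIL, ["refund", "damaged", "invoice"]))
```

The headers are easy to read reliably because they follow a fixed convention. The body remains free text, so anything beyond substring matching calls for more capable tools.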
Unstructured types of data can actually have internal structural elements. They’re considered “unstructured” because their information doesn’t lend itself to the kind of table formatting required by a relational database. As noted earlier, unstructured data can be textual or non-textual (such as audio, video, and images), and generated by people or by machines. Non-relational databases such as MongoDB are the preferred choice for storing many kinds of unstructured data.
Simple content searches can be performed on textual unstructured data. Traditional analytics tools are optimized for highly structured relational data, so they’re of little use for unstructured sources such as rich media, customer interactions, and social media data.
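A simple content search over text can be sketched as a tiny inverted index, as below. This is an assumption about what "simple content search" might look like, not a description of any particular product. The documents and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical snippets of textual unstructured data.
DOCUMENTS = {
    "email_1": "The customer asked about a refund for the damaged speaker.",
    "chat_7": "Shipping was fast and the speaker sounds great!",
    "note_3": "Follow up on the refund policy update.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


INDEX = build_index(DOCUMENTS)
print(search(INDEX, "refund"))
print(search(INDEX, "speaker refund"))
```

This finds exact words only. It knows nothing about synonyms, sentiment, or non-textual content, which is where the limits of simple search on unstructured data show up.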
Big data and unstructured data often go together: IDC estimates that 90% of these extremely large datasets are unstructured. New tools have recently become available to analyze these and other unstructured sources. Powered by AI and machine learning, such platforms operate at near real-time speed and learn from the patterns and insights they uncover. These systems are being applied to large unstructured datasets to enable never-before-possible applications like:
Unstructured data can be stored in a number of ways: in applications, NoSQL (non-relational) databases, data lakes, and data warehouses. Platforms like MongoDB Atlas are especially well-suited for housing, managing, and using unstructured data.