SMRaiz (pronounced “summarize”) is Summary Analytics’s software product. It uses mathematically proven artificial intelligence (AI) techniques that measure the diminishing marginal returns of the information contained in each record of a dataset and rank-order the records accordingly. SMRaiz prioritizes the records carrying the most unique information while relegating more redundant records to the end. This summarization and prioritization brings your big data down to size without loss of fidelity, delivering better insight while reducing time and cost. It works on any kind of data, including customer profiles, health records, web or network logs, biological signals, sensor data, and even images, audio, and video streams. SMRaiz is lightning fast even on the largest datasets, and some of our customers use it with many millions of separate smaller datasets, with high-value datasets, or wherever high speed or low latency is critical.
SMRaiz performs universal summarization of featurized single- or multi-modal datasets. We define summarization as a process that selects, from a (e.g., tabular) dataset, a small subset of data items (e.g., rows) such that the few selected items represent the information contained in the many remaining unselected items. The “mathematically proven techniques” mentioned above come from a rich area of mathematics called submodularity. SMRaiz is based on Summary Analytics’s proprietary Calibrated SubModular (CaSM) summarization processes: quickly and cost-effectively producing a submodular function that accurately represents the information in subsets of data, and then optimizing that function to produce a small, information-rich data subset. Our technology innovation has created a service that can summarize massive amounts of data, of any kind, quickly, easily, and without user expertise or even an awareness of submodularity.
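The CaSM functions themselves are proprietary, but the general flavor of selecting an information-rich subset by maximizing a monotone submodular function can be sketched with a standard facility-location objective and the classic greedy algorithm. Everything below (the choice of facility location, the cosine-similarity measure) is our illustrative assumption, not SMRaiz’s actual method:

```python
import numpy as np

def greedy_facility_location(X, k):
    """Greedily pick k rows of X whose combined coverage of all rows is maximal.

    Uses the facility-location function f(S) = sum_i max_{j in S} sim(i, j),
    a classic monotone submodular function; for such functions the greedy
    algorithm carries a proven (1 - 1/e) approximation guarantee.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # row-normalize
    sim = Xn @ Xn.T                                    # pairwise cosine similarity
    cover = np.zeros(len(X))                           # best similarity to the chosen set
    chosen = []
    for _ in range(k):
        # marginal gain of adding each candidate row j
        gains = np.maximum(sim, cover).sum(axis=1) - cover.sum()
        gains[chosen] = -np.inf                        # never re-pick a chosen row
        j = int(np.argmax(gains))
        chosen.append(j)
        cover = np.maximum(cover, sim[j])
    return chosen

X = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]])
print(greedy_facility_location(X, 2))   # picks one representative per cluster
```

The order in which greedy selects rows also yields the prioritization described above: earlier picks carry more unique information.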
Missing from today’s ML/AI efforts is any consideration of “information efficiency,” which SMRaiz achieves. What do we mean by this? Data and information are commonly confused. People often throw more data at a problem, assuming that more data is always better, but what they really mean is that more information is better. Data is the raw material, delivered as a collection of records, and is only a carrier of information. This raw material usually includes content irrelevant to the question being asked, as well as significant redundancy, even among unique records, especially as dataset sizes increase. That is, a dataset often contains more records than necessary to represent a given amount of information, leading to information inefficiency. SMRaiz quickly summarizes and prioritizes sets of data records to provide the most information possible in a given amount of data (i.e., a certain number of records), and it does so without changing the form of the records themselves. SMRaiz thus provides information efficiency: the ability to maximize the ratio of information to data. And it does this in a way that is computationally efficient, and thus easy on your budget.
No matter what the industry or application, SMRaiz can benefit your data analytics efforts. The more data, the more you’ll benefit. That’s why we like to say: Bigger data? Bring it on!
A single run during AI model development or training can take hours, days, or even weeks. Whether you’re using AutoML, NAS, or other model development tools, with SMRaiz you can develop your model on a subset of your data without losing fidelity, dramatically lowering processing costs. Training your model on a SMRaiz summary has similar benefits. And because the summary data is also prioritized, you can do the early development or training on an even smaller set of the highest priority data, and then add more data in the final stages. Interested in the benefits of data augmentation in your training? With SMRaiz, you can afford to add this additional information because we can summarize that data as well.
SMRaiz allows for the creation of an “information cache,” where the smallest, highest-priority, most representative data is available for immediate access. In other words, SMRaiz enables tiered storage: the more redundant records can be stored for less costly offline access or, in some applications, deleted entirely if you choose. Depending on the amount and redundancy of your data, these savings alone can be huge.
The energy consumption of AI training is getting out of control. The computational power required to train state-of-the-art AI models is doubling every 3.4 months, while Moore’s Law continues losing steam, no longer doubling processor performance every 18 months. The cloud processing costs for new AI models can run into the millions of dollars, with concomitant energy consumption. So far, this problem has been addressed with machine learning algorithmic advances, increased parallel compute power, and computing-systems efficiency improvements. These help, but more is needed. SMRaiz provides a new, complementary tool, adding “information efficiency” to the process. In general, processing redundant data means wasted computation and higher costs. Information efficiency means processing data with the redundancy removed, so that we deal efficiently with the information in the data and every unit of computation performed becomes essential.
For many applications, time is of the essence. Slow AI processing can mean keeping your customers waiting or delaying time-critical decision-making. SMRaiz runs fast, often in seconds or at most a few minutes even on the largest datasets, and it runs without the need for specialized, expensive processors. We don’t replace AI training, but by pre-processing your data with SMRaiz to eliminate redundancies, AI training can run orders of magnitude faster!
Data labeling is the bane of the AI/ML community. It’s a great example of the old saying “Garbage in, garbage out”: errors in your data labeling will ruin results with even the most sophisticated algorithms. But data labeling tasks are arduous, expensive, time-consuming, and error-prone, all of which is made worse by the human alert and decision fatigue caused by data redundancy and repetitiveness. By reducing size and eliminating redundancy, as SMRaiz does, human fatigue is reduced, human accuracy and efficiency are increased, annotation costs are lowered, and time is saved. In other words, it is much better to smartly label a summarized, non-redundant subset of data than to blindly and blithely label all of it.
Data labeling is not the only area of data analytics prone to errors due to human alert and decision fatigue. For many applications, ML and AI are used to facilitate human analysts and decision makers. But too many alerts are like the boy who cried wolf. SMRaiz eliminates redundant alerts and prioritizes those that remain, reducing fatigue and thus reducing errors due to fatigue.
AI/ML systems unfortunately suffer from systematic biases ingrained within data. There are many types of bias, gender and racial bias being two of the most insidious, so eliminating bias in AI is a top priority. Bias often exists because certain concepts represented in the data are imbalanced; for example, majorities in reality are often overrepresented in the data as well. SMRaiz can help you measure and identify the bias in your data and, through the process of summarization, bring the underrepresented data to the forefront.
Our product is available three ways. The first is via Amazon’s AWS SageMaker Marketplace. This is ideal for trying us out with minimal effort or for customers with occasional batch jobs which they wish to summarize and prioritize. It’s a simple pay-as-you-go model without the need for a contractual commitment.
SMRaiz is also available in a containerized distribution suitable for on-premises, Virtual Private Cloud (VPC), and Kubernetes deployments. Our Docker image is easy to integrate into your current pipeline, and it even supports operation in environments without a connection to the outside world, for maximum data privacy and security. There are two versions: one is Python command-line based, and the other is a client-server system providing on-the-fly network summarization-as-a-service. For communication between client and server, we use gRPC and Protobuf, which are fast, reliable, and scalable. Our gRPC-based client-server Docker container solution is suitable for integration within microservice environments such as Kubernetes, and allows you to integrate SMRaiz into your standard workflow. This is ideal for our OEM customers who want to embed SMRaiz functionality and performance into their own software product offerings.
The block diagram shown below illustrates how the SMRaiz process works. There are two steps. The first (on the left, highlighted in green) is a feature extraction step, and the second (on the right, highlighted in blue) is a summary calibration step.
Data is input into SMRaiz as a list of fixed-length feature vectors, one per data record. The figure shows this as a matrix representing the data in tabular form, but in general it can be a set of matrices, each with a number of rows equal to the number of initial records. SMRaiz does not currently perform the feature extraction step, so you do this before using SMRaiz. Each row of the matrix of feature vectors in the diagram represents a record, shown in a unique color to indicate a unique record. Once you have processed records into features, those feature vectors are given to SMRaiz via Python numpy (.npy) files, CSV files, or recordio/protobuf files.
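As a minimal sketch of preparing such input, the snippet below builds a small feature matrix (one row per record) and writes it in two of the accepted formats. The feature values and file names are placeholders for whatever your own pipeline produces:

```python
import numpy as np

# Toy featurized dataset: each row is one record's fixed-length feature vector.
features = np.array([
    [0.12, 0.80, 0.05],   # record 0
    [0.10, 0.78, 0.07],   # record 1 (nearly redundant with record 0)
    [0.90, 0.02, 0.55],   # record 2
], dtype=np.float32)

np.save("features.npy", features)                    # numpy .npy input file
np.savetxt("features.csv", features, delimiter=",")  # CSV input file
```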
You set calibration parameters, run a summarization job, and then examine the output using your favorite visualization or analysis tool. Any strategy for visualizing all of the data can immediately be used to visualize a summary, since a summary is just a subset of rows. If you are satisfied with the summary, you are done and can reuse these calibration parameters over and over on similar datasets. Otherwise, you adjust the calibration parameters and try again. The calibration parameters let you select whether you want the most diverse possible summary based on all the information in the records, or whether you want something in particular emphasized in your summary. Analogous to how Photoshop works, you repeat this until a satisfactory result is obtained. This calibration process takes the place of training a submodular function on supervised summarization training data, which SMRaiz does not require. Summarization is fast, however, so the calibration cycle is also fast, opening the door to universal summarization for anyone, without even needing to know you are creating and optimizing submodular functions.
There are other means of reducing dataset size but they are limited in different ways and none have all the advantages of SMRaiz, although some are complementary.
SMRaiz does deduplication, but it does much more. Dedup is just that: replacing multiple identical records with a single copy. SMRaiz does this, but it will also eliminate (or deprioritize) unique records that are redundant not relative to just one other record, but relative to multiple other records considered collectively. For example, it could be that no single record is a duplicate of any other single record, but out of, say, 1000 records, the first 127 of them make the remaining 873 redundant. That is, each of the 873 records is redundant only once we have selected the 127 essential records; with any fewer than the 127 essential records, none of the 873 records is fully redundant. SMRaiz therefore usually reduces the working dataset by orders of magnitude more than pure deduplication does, depending on the dataset’s properties.
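This collective-redundancy idea can be seen at toy scale. In the sketch below (our illustrative simplification, not SMRaiz's actual computation), records are modeled as sets of binary "facts": no record duplicates any other, so deduplication removes nothing, yet two records together cover everything the other three carry.

```python
import numpy as np

# Five unique records over six binary "facts". Records A and B together
# cover every fact, making C, D, and E collectively redundant.
records = np.array([
    [1, 1, 1, 0, 0, 0],   # A
    [0, 0, 0, 1, 1, 1],   # B
    [1, 0, 0, 1, 0, 0],   # C
    [0, 1, 0, 0, 1, 0],   # D
    [0, 0, 1, 0, 0, 1],   # E
], dtype=bool)

def essential_subset(R):
    """Greedily choose records until every fact present in R is covered."""
    union = R.any(axis=0)                       # all facts present anywhere
    covered = np.zeros(R.shape[1], dtype=bool)
    chosen = []
    while not np.array_equal(covered, union):
        gains = [(r & ~covered).sum() for r in R]  # newly covered facts per record
        j = int(np.argmax(gains))
        chosen.append(j)
        covered |= R[j]
    return chosen

print(essential_subset(records))   # → [0, 1]: A and B make C, D, E redundant
```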
Compression changes the form and codebook of the data. You cannot operate on compressed data without first decompressing it back to its original uncompressed (and thus large) form, regardless of whether the compression is lossless or lossy. Compression therefore works great for data storage and communications, and in fact can be used in conjunction with SMRaiz, compressing and then storing what SMRaiz determines is redundant or low priority. Also, compressed data cannot be fed directly into an AI/ML algorithm to save compute time, as SMRaiz summary data can. The bottom line: compressed data is not immediately and directly compatible with standard data-processing workflows, while summarized data is.
Random sampling can reduce the dataset size as much as you’d like, but it is almost certain to miss the corner cases in any diverse dataset that has many small or poorly represented concepts (e.g., long tails): with a random sample, many of the poorly represented concepts will be missing entirely. SMRaiz is designed specifically to find and retain diversity and produce the most representative subset, so as to include all of the information content, even the concepts that are poorly represented in the original data. Put another way: when the data is very imbalanced, a random subset will miss the infrequent concepts and overrepresent the dominant ones. SMRaiz’s mathematically proven approach is consistently better than the hit-or-miss of random sampling.
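The long-tail point is easy to quantify. Under uniform random sampling without replacement, the probability that a sample of size k drawn from n records contains none of the m records of a rare concept is C(n-m, k) / C(n, k); the numbers below are illustrative:

```python
from math import comb

def p_miss(n, m, k):
    """Probability a uniform random sample of size k misses all m rare records."""
    return comb(n - m, k) / comb(n, k)

# 10,000 records, a rare concept with only 20 examples, a 1% random sample:
# the sample misses the concept entirely about 82% of the time.
print(round(p_miss(10_000, 20, 100), 3))
```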
Clustering does not reduce the dataset size you are working with. Its primary goal is to group together data items that are similar to each other in some way. SMRaiz’s primary goal, however, is to choose and prioritize a smaller subset of data that is most representative of the whole. In a SMRaiz summary, the few diverse data items that made it to the summary represent the many data items that didn’t make it into the summary. Once SMRaiz finds these representative summary records, it then enables you to determine a linkage from each non-summary item to all of the summary items representing it. Clustering produces no intrinsic prioritized representative set, is usually much slower to compute (especially when you want to change the number of clusters), usually does not have a mathematical quality guarantee, does not allow you to balance diversity, centrality, and preference, and does not easily allow for making further calibration adjustments to fit your data.
Data distillation algorithms take a dataset and produce a new, smaller dataset in which each record is newly and artificially synthesized. Like compression, this approach changes the form of the data, but unlike compression, data distillation produces new, hybrid, and hopefully syntactically valid data records, so any pre-existing data workflow can still operate directly on the newly synthesized data. Data distillation algorithms can be computationally complex, often as complex as or more complex than the AI algorithms themselves, as they try to fuse the information from many records into a smaller synthesized hybrid subset. Data distillation also might produce very unnatural hybrid records, unlike any in the original dataset, and can sometimes suffer from what we call the “Frankenrecord” problem, where new records might be an unrecognizable patching together of pieces of other records. Data distillation can also be machine-learning-model dependent, which means another large investment whenever the model changes. SMRaiz, by contrast, is a form of extractive summarization, and is therefore extremely fast, simple, scalable to massive data sizes, and never suffers from the Frankenrecord problem, since every record is an original. SMRaiz is, however, complementary to data distillation: a SMRaiz summary would be a great subset on which to perform data distillation, making the distillation process much faster while retaining quality.
Subsets known as “core sets” are those meant to be core, or critical, to retain in order to maintain fidelity in some way to the original data. Often core sets are discussed in specific geometric settings, where we wish to choose a subset of data that maintains a certain specific property of the original (e.g., the diameter of the data in some geometric space). Core sets are usually very specific to a particular problem. We consider a summarization, however, to be synonymous with a core set, since the goals, in general, are the same. SMRaiz, though, works as a universal summarizer that allows you to calibrate the summarization process to fit whatever summarization needs you might have, and to trade off natural concepts such as diversity, centrality, and representativeness.
Sketching usually refers to a broad, general class of algorithms meant to do things like compression, randomized projections, dimensionality reduction, compressed sensing, and fidelity-preserving sparsification of matrices; it even includes core sets. Sketching is often more specific, though: while dimensionality reduction projects the features of each sample down to a lower-dimensional feature space, sketching is more like projecting the samples of each feature down to a lower-dimensional sample space, thereby reducing the size of the data, which is a form of data distillation. Hence, we consider summarization to also be a kind of sketching, but one that retains all of the advantages mentioned above. In SMRaiz’s case, this includes universality, speed, scalability to massive data sizes, fidelity, the ability to calibrate to suit your needs (including balancing redundancy, centrality, and representativeness), and so on.
LINKaiz (pronounced “LINK-ize”) is Summary Analytics’s software product that works together with SMRaiz to show weighted connections between the data included in a SMRaiz summary and data that is pushed aside and not included in the summary. As the summary (which can be thought of as a set of representatives, or delegates) represents all of the information in the whole set, each record not included in the summary (which may be considered a constituent) is well represented by one or more records from the summary. LINKaiz enables users to view all constituent records linked to a delegate, and this is useful to determine what precisely is being represented by a given delegate. Conversely, LINKaiz also enables a user to examine the delegates linked to a given constituent, showing how much of a constituent's information is covered by each delegate.
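LINKaiz's actual linkage computation is part of the product; as a rough stand-in for the concept, the sketch below links each non-summary record (constituent) to its most similar summary record (delegate), using cosine similarity as the assumed weighting for illustration:

```python
import numpy as np

def link_constituents(X, summary_idx):
    """Link each constituent row of X to its most similar delegate row.

    summary_idx: list of row indices chosen for the summary (the delegates).
    Returns a dict mapping each constituent index to its best delegate index.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # row-normalize
    delegates = Xn[summary_idx]
    links = {}
    for i in range(len(X)):
        if i in summary_idx:
            continue                                   # delegates link to themselves
        weights = Xn[i] @ delegates.T                  # similarity to each delegate
        links[i] = summary_idx[int(np.argmax(weights))]
    return links

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(link_constituents(X, [0, 2]))   # → {1: 0, 3: 2}
```

The full similarity row `weights` also supports the converse view described above: how much of one constituent's information each delegate covers.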
There are many applications where understanding the relationship between summary records and their constituents is important. For example, in cybersecurity, once a given delegate is found to be malware, a security analyst can quickly investigate the other records represented by that delegate to see if they, too, are malicious. Another common use is customer analytics. Once you've identified a highly qualified lead in the summary, LINKaiz can help you rank the closest leads that did not make it into the summary but are related to the qualified lead, thereby prioritizing lead follow-up.
SMRaiz and LINKaiz work great together to help you better, and more quickly, understand your data, even if the data is enormous. SMRaiz gives the big picture view while LINKaiz provides an organized way of diving into the details and looking at the complete picture. SMRaiz and LINKaiz allow you to understand the big using only the small.
Copyright © Summary Analytics, Inc.