Images come in many shapes and sizes, and with the ubiquity of smartphones and high-quality cameras in everyone’s pocket, image datasets are larger and richer than ever before. Unfortunately, such datasets contain enormous redundancy, exemplified by the many similar images taken of the same scene. This makes it challenging, time-consuming, and expensive to get a sense of the full extent of the information present within large image datasets. Summary Analytics’s summarization works well on image datasets: a large dataset is reduced to a small one while the information in it is preserved.
One way summarization can help is when looking for a house rental on, say, Airbnb. Typically there are many homes that match your initial query, and each home is shown as a collection of images, with multiple images of the same room. In order to scan through as many homes as quickly as possible, it is useful to see only one high-quality image of each room in each house. This is a problem summarization can solve, since multiple camera angles of the same room constitute redundancy and imbalance to be removed.
Airbnb has provided an open-source dataset of pictures of available rental properties. It consists of 864 house pictures from 356 rooms, with many rooms having multiple images (from differing camera angles) and some rooms having only one. We summarize this data down to a set of 356 images and then count the number of unique rooms. As the plot shows, the SMRaiz summary successfully identifies 331 (out of 356) unique rooms, whereas the best out of 1000 random subsets (each of size 356) identifies only 267 unique rooms (the average number of distinct rooms across those 1000 random subsets is 249.6).
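The comparison against random subsets can be outlined in a few lines. This is a minimal sketch, assuming every image carries a room identifier; the function names `count_unique_rooms` and `random_baseline` and the toy `room_of` mapping are hypothetical illustrations, not part of SMRaiz or of the Airbnb dataset.

```python
import random

def count_unique_rooms(selected, room_of):
    """Number of distinct rooms covered by the selected image ids."""
    return len({room_of[i] for i in selected})

def random_baseline(room_of, k, trials=1000, seed=0):
    """Best and mean unique-room counts over `trials` uniform random subsets of size k."""
    rng = random.Random(seed)
    ids = list(room_of)
    counts = [count_unique_rooms(rng.sample(ids, k), room_of) for _ in range(trials)]
    return max(counts), sum(counts) / len(counts)

# Toy data (hypothetical): 6 images of 3 rooms, with room "a" photographed 4 times.
room_of = {0: "a", 1: "a", 2: "a", 3: "a", 4: "b", 5: "c"}
summary = [0, 4, 5]  # a summary that happens to pick one image per room
best, mean = random_baseline(room_of, k=3, trials=200)
print(count_unique_rooms(summary, room_of), best, round(mean, 2))
```

With the real data, `k` would be 356 and `room_of` would map each of the 864 images to one of the 356 rooms.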
We also show how LINKaiz associates each of the summary items (the delegates) with each of the non-summary items (the constituents). In each case, the delegate is linked to images of the same room taken from a different angle. We also see an instance of a constituent that is better represented by multiple delegates than by just one delegate.
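The delegate-to-constituent association can be pictured with a simple nearest-neighbor stand-in. How LINKaiz actually computes its links is not described here, so the sketch below only assumes that each image has a numeric feature vector and that Euclidean distance is a reasonable notion of closeness; `link_to_delegates` and the toy features are hypothetical.

```python
import math

def link_to_delegates(features, delegates):
    """Map each non-summary item to its nearest delegate by Euclidean distance."""
    delegate_set = set(delegates)
    links = {}
    for item, vec in features.items():
        if item in delegate_set:
            continue  # delegates represent themselves
        links[item] = min(delegates, key=lambda d: math.dist(vec, features[d]))
    return links

# Hypothetical 2-D features for five images; images 0 and 3 are the delegates.
features = {0: (0.0, 0.0), 1: (0.1, 0.2), 2: (0.2, 0.1), 3: (5.0, 5.0), 4: (4.8, 5.1)}
links = link_to_delegates(features, delegates=[0, 3])
print(links)  # {1: 0, 2: 0, 4: 3}
```

This hard assignment does not model a constituent that is best represented by several delegates; keeping the few nearest delegates per constituent would be one way to approximate that case.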
The ImageNet dataset is a large collection of human-annotated photographs widely used in academic research for developing image classification and object localization algorithms at large scale. Its training set consists of 1,281,167 images.
Using SMRaiz, we order the images to prioritize the most representative records and relegate the more redundant records to the end. On the right, you can see the first 400 and the last 400 images according to a SMRaiz order. The first 400 images are diverse in terms of the variety of colors, shapes, and objects, whereas the last 400 consist of more redundant and more mundane images. Such patterns are not revealed when creating an unbiased random sample of 400 images (e.g., uniformly at random).
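To make the idea of a representative-first ordering concrete, here is a minimal sketch using greedy facility-location selection, a standard summarization technique. It is a stand-in chosen for illustration and is not claimed to be what SMRaiz does; the similarity function and the toy points are assumptions.

```python
import math

def greedy_order(points):
    """Order points by greedy facility-location gain: each step adds the point
    that most increases the total similarity of all points to their closest
    already-chosen point."""
    n = len(points)
    # Similarity in (0, 1]: identical points score 1, distant points score near 0.
    sim = [[math.exp(-math.dist(p, q)) for q in points] for p in points]
    covered = [0.0] * n  # best similarity of each point to the chosen set so far
    remaining = set(range(n))
    order = []
    while remaining:
        def gain(j):
            return sum(max(sim[i][j] - covered[i], 0.0) for i in range(n))
        best = max(sorted(remaining), key=gain)
        order.append(best)
        remaining.remove(best)
        covered = [max(covered[i], sim[i][best]) for i in range(n)]
    return order

# Hypothetical features: three near-duplicates around (0, 0) and one distinct point.
points = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05), (3.0, 3.0)]
order = greedy_order(points)
print(order)  # the distinct point (index 3) is ranked second, ahead of the near-duplicates
```

The near-duplicates end up at the tail of the ordering, which mirrors the "last 400" behavior described above. This quadratic-time version is only suitable for small examples.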
Intrusion detection system (IDS) logs are chock-full of bias, redundancy, and imbalance and are a good dataset type for Summary Analytics summarization. Our software, SMRaiz, significantly reduces data size while preserving all the various types of attacks, thus making any subsequent analysis easier, faster, and more cost-effective. Summarization also reduces decision and alert fatigue for human analysts, allowing them to focus only on what is most important. In cybersecurity datasets, the majority of the traffic is benign, with only a small part being malicious. Summarization reduces such imbalance. A summary is diverse and representative of every concept present in the data, and is thus free from the inherent bias present in the full dataset. Therefore, models trained on the summary can obtain high accuracies, potentially even exceeding the accuracy obtained with the full training set, since a summary is balanced.
CSE-CIC-IDS2018 is a public dataset of 16 million labeled network flows (IDS records) covering 15 different categories: 1 benign traffic category and 14 different types of attack scenarios. With a summary size of only 1000 records (more than a 15,000x reduction), all 14 attack types made it into the SMRaiz summary (blue bar plot). By contrast, when records are selected randomly, all 1000 random subsets of size 1000 miss at least 2 of the attack types, with the majority missing 4 or more attack types (gray bar plot). And remember, our software has no knowledge of attacks: this is a fully unsupervised summary of the data.
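One way to see why uniform random subsets tend to miss rare attack types is to compute the chance that a subset of size k contains no record of a class with m members out of N, which is C(N - m, k) / C(N, k). The sketch below evaluates that formula; the class sizes in the loop are hypothetical placeholders and are not the real CSE-CIC-IDS2018 counts.

```python
import math

def prob_class_missed(n_total, n_class, k):
    """Probability that a uniform random subset of size k (drawn without
    replacement) contains no record of a class with n_class members."""
    if k > n_total - n_class:
        return 0.0  # the subset is too large to avoid the class entirely

    def log_comb(n, r):
        # log of the binomial coefficient C(n, r), computed via log-gamma
        return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)

    return math.exp(log_comb(n_total - n_class, k) - log_comb(n_total, k))

# Hypothetical class sizes, NOT the real per-attack counts of the dataset.
N, k = 16_000_000, 1000
for m in (100, 1_000, 100_000):
    print(m, round(prob_class_missed(N, m, k), 4))
```

Under these assumed sizes, a class with only a few hundred records is very likely to be absent from a random sample of 1000, whereas a class with many thousands of records is almost always present.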
Whether the analysis is manual, automated, or a hybrid, it is obviously much faster to go through 1000 records than 16 million! Once an attack has been found in the summary, our LINKaiz product can be used to find the other instances of that attack in the full dataset by automatically determining which records are most closely associated with the attack types found in the summary.
Having good labeled data is the lifeblood of machine learning. Unfortunately, acquiring good labeled data is costly, time-consuming, and error-prone. Surprisingly, it can also be extremely wasteful. Summary Analytics helps to reduce all of these problems. By labeling only a summary rather than all of the data, labeling costs and time are reduced. Also, since a summary is diverse and representative, the labeling process is more efficient.
FashionMNIST is a famous and well-studied academic dataset of 60,000 apparel images. This case study shows how Summary Analytics can produce better subsets to train on than random subsets. The plot shows the test-set accuracy of models trained on subsets that were chosen either randomly (five random subsets) or via a fully unsupervised summarization (using SMRaiz). Importantly, no label information at all was used to produce the summary (hence, unsupervised).
Video data (e.g., vehicle dash-cam footage, security cameras, etc.) is plagued with redundancy, as most of it consists of repetitive images where nothing new happens. Occasionally, important images are encountered containing information needing to be acted upon (e.g., pedestrians, moving vehicles, prowlers in security data, etc.). This autonomous transportation case study is a great example of what Summary Analytics can do with video. Even though no two records (or video frames) are identical, there is a massive amount of redundancy, especially in successive frames if no new objects have entered or exited the field of view. In the examples below, as you watch the video, green flashes show which frames made it into the SMRaiz summary (usually those where unique objects are encountered).
In the first example, we use the well-known KITTI dataset, used for testing vision-based autonomous transportation algorithms. We use a part of this data captured on a highway. As you can see, the summary frames are diverse and capture a representative set of events that happen during the drive, such as traffic signs, ramps, and passing cars.
The NuScenes dataset comprises recordings from 6 camera angles made while a car drives through 1000 scenes (each ~20 sec) in Boston and Singapore. Our demonstration uses data recorded in a rural area. As shown, summarization captures a diverse set of scenes that include people and new objects on the curbside.
The Waymo dataset comprises high-resolution images collected by 5 front-and-side-facing cameras mounted on a Waymo self-driving vehicle. It contains data from 1,000 driving segments, each capturing 20 seconds of continuous driving, corresponding to 200,000 frames at a 10 Hz sampling rate per sensor. This dataset covers dense urban and suburban environments over a wide spectrum of driving conditions (day and night, dawn and dusk, sun and rain).
Here, you can compare the SMRaiz summary of size 375 images (a 25x15 montage) with a random set of the same size. The random set has 19% redundancy (highlighted with yellow borders) where we define redundancy as two or more images being similar to each other. The SMRaiz summary, on the other hand, captures a representative and diverse set of 375 images with 0% redundancy.
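A redundancy percentage of this kind can be computed along the following lines. This is a sketch under stated assumptions: the text does not say how image similarity was measured, so the feature vectors, the Euclidean distance, and the `threshold` value are all hypothetical, as is the function name `redundancy_rate`.

```python
import math

def redundancy_rate(features, threshold):
    """Fraction of items that have at least one other item within `threshold`."""
    n = len(features)
    redundant = 0
    for i in range(n):
        if any(j != i and math.dist(features[i], features[j]) <= threshold
               for j in range(n)):
            redundant += 1
    return redundant / n if n else 0.0

# Hypothetical 2-D features: the first two points of random_set are near-duplicates.
random_set = [(0.0, 0.0), (0.1, 0.0), (4.0, 4.0), (9.0, 1.0)]
summary_set = [(0.0, 0.0), (4.0, 4.0), (9.0, 1.0), (2.0, 8.0)]
print(redundancy_rate(random_set, 0.5), redundancy_rate(summary_set, 0.5))  # 0.5 0.0
```

The measured rate depends strongly on the chosen threshold, so the same threshold should be used for both sets being compared.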
Sensors within large industrial machines and manufacturing plants are increasingly being deployed throughout the world and are producing ever larger collections of vector-valued time series data. Sensors that measure pressure, position, temperature, speed, vibration, and signal intensity (to name only a few) are combined and then used to offer new ways to view the state of the machine. This is useful for monitoring the machine for compliance and safety, operating efficiency, and structural failure prediction. Unfortunately, the core information in such large sensor networks is difficult to extract because the sensor data is large and always expanding. One property of such data that can be exploited is its redundancy: many sensor time series are quite redundant, making them ripe for summarization.
To demonstrate this, we apply summarization to a synthetic vehicle telemetry dataset, although the same capability can be applied to any kind of industrial sensor data. The dataset used was provided by iRacing as a demo dataset as part of a guide to using telemetry from their racing simulator with popular telemetry visualization software. In this case, 46 sensors were captured from the car to measure speed, direction, and location, as well as various vehicle conditions such as oil pressure, fuel consumption, braking, gear shifts, tire pressure, and shock absorber deflection. These were measured while the car was driven around a race track in a high-fidelity racing simulation. The dataset was summarized from 6886 records down to 20 summary records with the goal of quickly finding anomalies to optimize performance.
Results are shown via this interactive graphical demo. The demo consists of two tabs. The first tab allows you to hover over SMRaiz summary items shown as orange marks on a race track, and the corresponding position in each sensor plot will be highlighted. This tab also utilizes LINKaiz so that whenever you hover over a summary (i.e., delegate) point on the track, the corresponding constituent points on the track are also highlighted. The constituents associated with each summary point were produced automatically using LINKaiz, yet we found that these groups of points correspond to unique, meaningful events during the race; small tool-tips pop up with a short description of each event. On the second tab, it is possible to hover over any point on the track (most of which are constituent elements), and the corresponding delegate for each constituent will be highlighted. Overall, this shows how it is possible to summarize, visualize, interpret, and then quickly understand high-dimensional sensor signal data in a cost-effective manner.
Copyright © Summary Analytics, Inc.