1. Executive Summary
The number of connected devices, including the machines, sensors, and cameras that make up the Internet of Things (IoT), continues to grow rapidly. By 2025, we can expect tens of billions of IoT devices producing zettabytes of data, with machine-generated data outstripping data generated by humans. At the same time, data-driven organizations are generating ever-larger volumes of log-like data and relying on analytic databases to load, store, and analyze it at high speed to derive insights and take timely action.
This report focuses on the performance of popular cloud-enabled, enterprise-ready log analytics platforms: Microsoft Azure Data Explorer (part of Azure Synapse Analytics), Google BigQuery, and Snowflake. Due to cost limitations, we could not complete our tests on Elasticsearch or AWS OpenSearch. Microsoft invited GigaOm to measure the performance of the Azure Data Explorer engine and compare it with its leading competitors in the log analytics space. The tests we designed simulate a set of basic scenarios to answer fundamental business questions that an organization from nearly any industry might encounter in its log analytics.
In this report, we tested complex workloads over a volume of 100TB of data at concurrency levels of 1 and 50 users. The testing was conducted using comparable hardware configurations on Microsoft Azure and Google Cloud.
Of course, testing platforms across cloud vendors is challenging. Configurations can favor one cloud vendor over another in feature availability, virtual machine processor generations, memory amounts, optimal input/output storage configurations, network latencies, software and operating system versions, and the testing workload. Our testing demonstrates a slice of potential configurations and workloads.
As the sponsor of the report, Microsoft selected the specific configuration of its platform that it wanted to test. GigaOm then selected the configuration for Snowflake closest to the ADX configuration in terms of CPU and memory. Google BigQuery is offered “as-is” with no configuration choices other than slot commitments (discussed below).
We leave the test’s fairness for the reader to determine. We strongly encourage you to look past marketing messages and discern its value for yourself. We hope this report is informative and helpful in uncovering some of the challenges and nuances involved in platform selection.
The parameters to replicate this test are provided in this document. We developed the testing suite from scratch because no existing benchmark framework met our needs. The testing application and the queries used are part of a code repository on GitHub.
Overall, the test results were insightful and revealed the query execution performance of the three platforms tested. Some of the highlights include:
- Azure Data Explorer (ADX) outperformed Google BigQuery and Snowflake on all 19 test queries with a single user and 18 of 19 with 50 concurrent users.
- ADX completed all 19 queries in under one second with a single user, while the average execution time on BigQuery and Snowflake was 15 and 13 seconds per query, respectively (see note 1 below).
- ADX completed all 19 queries in under 10 seconds with 50 concurrent users.
- BigQuery and Snowflake both had 8 queries (different ones) that did not complete within the two-minute timeout (see note 2 below).
- We found the cost of data ingestion and query execution on ADX to be significantly lower than on Google BigQuery, Snowflake, and Elasticsearch/OpenSearch.
- We found it infeasible to test Elasticsearch and OpenSearch with this workload due to slow data loading speeds, expansion of the on-disk data volume (rather than compression), and the inability to scale hot-tier (SSD) storage independently or cost-effectively. The parameters to replicate this test are provided. You are encouraged to compile your own representative queries, data sets, and data sizes, and to test compatible configurations applicable to your requirements.
Note 1: There were several performance outliers; the geometric mean of query execution time for BigQuery and Snowflake was approximately 4.5 seconds per query.
Note 2: Query completion under concurrency was measured at the 50th percentile, that is, the 25th of the 50 queries to complete (or be canceled).
2. Log Analytics Platforms Tested
For this field test, we analyzed the performance of three popular platforms often used to store and analyze machine-generated data. Note that these platforms are not just for storing, processing, and analyzing log data, but can also be used for other analytical purposes.
Azure Data Explorer (ADX)
Microsoft Azure Data Explorer (ADX) is a fully managed cloud analytics platform created as a unique combination of three common database technologies: column store, text indexing, and data sharding. It was launched in 2018 and is used for ingesting and querying massive telemetry data streams in scenarios such as logs and metrics monitoring and diagnostics, machine learning, and time series analytics. ADX natively supports the Kusto Query Language (KQL) and T-SQL. Unlike BigQuery’s BI Engine and Snowflake’s Search Optimization features, ADX does not charge extra for its query accelerators and optimizers. ADX separates compute and storage, which allows the creation of many clusters over the same dataset, and also provides predictive autoscale, which optimizes cost.
BigQuery
Google BigQuery is a fully-managed, serverless, data-warehouse technology, exclusively offered in Google Cloud, that enables scalable analysis over large datasets. It is a Software as a Service (SaaS) that supports querying using ANSI SQL. It also has some built-in machine-learning functions. It was first offered in 2011 and is based on Dremel, a distributed system developed by Google for interactively querying large datasets.
BigQuery provides compute through “slots.” While you can commit to a certain number of slots for a month or a year for a discount, you can also buy Flex Slots at $4 per 100 slots per hour. There is a separate charge for BigQuery Active Storage.
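For reference, at the flex rate the 2,000 slots we used (see Table 1) would work out to 2,000 ÷ 100 × $4 = $80 per hour. The $55 per hour shown in Table 1 instead reflects commitment pricing; it is consistent with the published 2022 monthly rate of roughly $2,000 per 100 slots amortized over a 730-hour month ($40,000 ÷ 730 ≈ $55), though we offer that reconciliation as our own inference rather than a quoted rate.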
Additionally, BigQuery has an in-memory analysis service called BI Engine. By using BI Engine, you can analyze data stored in BigQuery with faster query response times for certain types of queries. BI Engine automatically accelerates queries by moving data between BigQuery storage and its in-memory storage, so no tuning was required for our queries to utilize the service. At the time of testing, we purchased 100GB of in-memory storage at $30.36 per GB per month (or $0.0416 per GB per hour). The BigQuery optimizer automatically managed BI Engine and decided which queries utilized the service. With a maximum of 100GB, BI Engine likely made an impact only on queries that filtered down to small time ranges.
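At that rate, the 100GB reservation added roughly 100 × $0.0416 ≈ $4.16 per hour to the BigQuery cost while tests ran.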
Snowflake
As a cloud-only, fully managed solution, Snowflake has a clear separation between compute and storage. For Snowflake on Azure, which is what we used for the queries, data is stored in Azure Storage and is cached when queries are executed to bring the data in closer proximity to compute resources. Snowflake essentially offers two configuration “levers”—the size of the warehouse cluster and how many clusters are permitted to spin up to handle concurrency. Snowflake scales by cluster server count in powers of two (1, 2, 4, 8, 16, and so on). If multi-cluster scale-out is enabled, Snowflake will spin up additional clusters to handle multi-user concurrent query workloads and automatically spin them down once demand passes. If it is not enabled, Snowflake queues excess queries until resources free up.
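To make these two levers concrete, here is a minimal sketch using the snowflake-connector-python package (account, user, and warehouse names are placeholders; the parameters shown are standard Snowflake warehouse options, though our tests ran with multi-cluster scale-out left at a single cluster):

```python
import snowflake.connector

# Connection details are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
)
cur = conn.cursor()

# Lever 1: warehouse size (each step up doubles the server count).
# Lever 2: the multi-cluster range; Snowflake spins clusters up and down
# between MIN and MAX as concurrent demand changes (multi-cluster
# warehouses require the Enterprise edition or higher).
cur.execute("""
    ALTER WAREHOUSE test_wh SET
        WAREHOUSE_SIZE = 'XLARGE'
        MIN_CLUSTER_COUNT = 1
        MAX_CLUSTER_COUNT = 4
        SCALING_POLICY = 'STANDARD'
        AUTO_SUSPEND = 60          -- suspend after 60s idle to stop billing
        AUTO_RESUME = TRUE
""")
```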
For Snowflake, you pay only while compute resources are running. We paid $3.00 per credit on the Enterprise tier; an X-Large warehouse consumes 16 credits per hour, which yields the $48 per hour shown in Table 1. Once the compute warehouse suspends, you no longer pay for compute. However, there is a separate charge for data storage.
Elasticsearch/OpenSearch
Elasticsearch is a distributed, open search and analytics engine for many types of data, including textual, numerical, geospatial, structured, and unstructured. Elasticsearch was built on the Apache Lucene project and was first released in 2010 by Elastic. Elasticsearch is the “E” of the ELK Stack, a set of tools for data ingestion, enrichment, storage, analysis, and visualization. (The “L” and “K” of the ELK Stack are Logstash and Kibana, respectively.) Today, the Elastic Stack also includes Beats, a rich collection of lightweight shipping agents for sending data to Elasticsearch. AWS OpenSearch is Amazon’s fully managed service for OpenSearch, a fork of the open-source version of Elasticsearch. In addition to the Elasticsearch features above, OpenSearch can also be loaded with Amazon Kinesis Data Firehose, which we attempted for this test.
We wanted to run our field test on a fully-managed Elasticsearch service because this workload seems like a good fit for the platform. We investigated Elastic Cloud and AWS OpenSearch but encountered difficulties that made running the test infeasible. We will fully disclose the reasons for this in the Results section.
3. Field Test Setup
GigaOm Log Analytics Field Test
The GigaOm Log Analytics Field Test is a workload created to simulate the activities found in complex, machine-generated log environments. The data schema, population, and implementation have been designed to be broadly representative of modern use cases. In this case, the implementation is the instrumented activity of a cloud service processing input files from different sources using different components. The testing exercises typical analytical needs over this dataset. The data generator and set of queries used for this test are located in this repository.
Test Environments
Selecting and sizing the compute and storage for comparison can be challenging, particularly across different cloud vendors’ offerings.
Results may vary across different configurations, and again, you are encouraged to compile your own representative queries, data sets, data sizes, and test-compatible configurations applicable to your requirements. All told, our testing included three different database environments, as shown in Table 1.
Table 1. Configurations Used for Tests *
Platform | ADX | BigQuery | Snowflake |
---|---|---|---|
Cloud | Azure | Google Cloud | Azure |
Version/Release | 1.05230.12040 | May 23, 2022 | 6.17.0 |
Region | East US 2 | US | East US 2 |
Node Count | 9 | 2,000 | 16 |
Instance Type | L16as_v3 | Regular Slots | X-Large |
Cost Per Hour | $32 | $55 | $48 |
Source: GigaOm 2022
* Note: The native JSON type preview in BigQuery was released after our testing.
When conducting field tests like this, we attempt to balance the costs, hardware, and software as much as possible. However, in fully-managed cloud services with very different architectures, this is difficult.
Field Test Data
The testing simulates a workload of machine-generated log files. The log files are comma-separated values generated by a custom application built for this test. The data generator, written in C#, can produce datasets of 1GB and 1TB for quality assurance and 100TB for the actual production test, and it also uploads the data to an Azure Storage blob. The program generates random, human-readable data typical of system logs. The schema of the dataset and the respective data types used for ADX, BigQuery, and Snowflake are shown in Table 2.
Table 2. Field Test Dataset Schema
Field Name | Brief Description | ADX | BigQuery | Snowflake |
---|---|---|---|---|
Timestamp | The date and time the log event occurred | Datetime | Timestamp | Timestamp |
Source | What triggered the log event | String | String | String |
Node | The name of the node where the log event originated | String | String | String |
Level | The level of severity of the event | String | String | String |
Component | The application that threw the log message | String | String | String |
ClientRequestId | The identifier of the client request that triggered the event | String | String | String |
Message | The full text of the log message | String | String | String |
Properties | JSON-encoded readings of certain system components (Optional) | Dynamic | Struct | Object |
Source: GigaOm 2022
The uncompressed dataset totals 100TB; compressed with Gzip, it is roughly 12TB on disk. The data is time based and spans nearly three months, with a total record count of over 404 billion rows. The dataset was designed to simulate a very high volume of log activity, with approximately 250 to 300 million rows for every hour in our Timestamp field.
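The actual generator is a C# application in the repository; purely to make the row shape concrete, here is a minimal Python sketch that emits CSV rows matching the Table 2 schema (the value pools and field contents are illustrative, not those of the production generator):

```python
import csv, json, random, sys, uuid
from datetime import datetime, timedelta

LEVELS = ["Information", "Warning", "Error"]       # illustrative pools
SOURCES = ["IMAGINEFIRST0", "CLIMBSTEADY83"]
COMPONENTS = ["DOWNLOADER", "INGESTOR"]

def rows(start: datetime, count: int):
    """Yield CSV rows in the Table 2 column order."""
    for i in range(count):
        ts = start + timedelta(milliseconds=i * 13)
        # Optional JSON-encoded Properties column.
        props = json.dumps({"rowCount": random.randint(1, 10_000),
                            "duration": "00:00:01.250"})
        yield [ts.isoformat(), random.choice(SOURCES),
               f"node-{random.randint(0, 99)}", random.choice(LEVELS),
               random.choice(COMPONENTS), str(uuid.uuid4()),
               "IngestionCompletionEvent path:/tmp/file.csv", props]

writer = csv.writer(sys.stdout)
for row in rows(datetime(2014, 3, 8), 10):
    writer.writerow(row)
```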
Loading the Dataset
Loading the data into the respective cluster environments was straightforward for ADX, BigQuery, and Snowflake; we describe each below. Note that our 100TB dataset was spread equally across ten Azure Blob Storage containers, which permitted some parallelization of the loading (and allowed us to stay within our Azure Blob Storage per-container storage quotas).
ADX
At the recommendation of Microsoft, we used a program called LightIngest to load ADX. LightIngest is a freely available Windows command-line utility, and we used it to ingest the data directly from our Azure Blob Storage containers. For ADX we kept the default configuration, so there was no need to specify partitioning, clustering, or any other options.
It took LightIngest approximately four hours to load the entire dataset.
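LightIngest itself is a packaged tool, so no code was needed. For readers who prefer to script the same queued-ingestion path, a sketch using the azure-kusto-ingest Python SDK might look roughly like the following (cluster, database, and container names are placeholders, and we did not use this code in the test):

```python
from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.data.data_format import DataFormat
from azure.kusto.ingest import (QueuedIngestClient, IngestionProperties,
                                BlobDescriptor)

# The ingestion endpoint of the cluster (note the "ingest-" prefix).
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://ingest-mycluster.eastus2.kusto.windows.net")
client = QueuedIngestClient(kcsb)

props = IngestionProperties(database="logsdb", table="Logs",
                            data_format=DataFormat.CSV)

# One queued ingestion message per blob; ADX pulls, indexes, and shards
# the data in the background.
blob = BlobDescriptor(
    "https://myaccount.blob.core.windows.net/logs-0/part-000.csv.gz?<SAS>")
client.ingest_from_blob(blob, ingestion_properties=props)
```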
BigQuery
To load BigQuery, we first transferred the data from our Blob Storage to Google Cloud Storage buckets using Google Cloud Storage Transfer service. This process took a couple of hours. Once the data was transferred and verified, we used the GCloud BigQuery command line utility to load the data into BigQuery. Here is a sample of the command we used to load BigQuery:
bq --project_id=our_project_id --location=US load --field_delimiter "," --source_format=CSV --allow_quoted_newlines --time_partitioning_field Timestamp --time_partitioning_type HOUR --clustering_fields ClientRequestId,Source,Level,Component logbenchmark100tb.logs gs://our-bucket/* logs.json
There are a number of configuration details to note in this statement. First, we followed Google’s documentation on partitioning and partitioned the dataset on the Timestamp column, creating a new partition for every hour. Then, we followed Google’s documentation on clustering and clustered the data within those partitions by Client Request ID, Source, Level, and Component (in that order). The logs.json file at the end of the statement delineated the schema of the data (detailed above in Table 2).
BigQuery does not permit CSV data to be loaded directly into a Struct column such as Properties. Thus, the Properties data was first loaded in raw form into a String column. We then updated the schema of our Logs table, adding a Struct column and running an UPDATE that parsed the raw JSON with the JSON_EXTRACT_SCALAR() function to populate the new Struct column. This turned out to perform similarly to a directly loaded Struct column, and retaining the raw column was semantically more correct: the dynamic column in ADX is both dynamic and schemaless, while a Struct is schematized JSON that requires an a priori definition of the column. Accordingly, we modified the test queries to use JSON query functions over that column where appropriate.
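A sketch of that parse step, using the google-cloud-bigquery Python client (the table and column names match those in the appendix queries; the Struct fields shown are a simplified subset, and the actual column types followed Table 2):

```python
from google.cloud import bigquery

client = bigquery.Client(project="gigaom-microsoftadx-2022")

# Populate the added Struct column from the raw JSON String column.
# JSON_EXTRACT_SCALAR returns STRING, so numeric fields are CAST explicitly.
sql = """
UPDATE logs100tb.logs_c SET PropertiesStruct = STRUCT(
    CAST(JSON_EXTRACT_SCALAR(Properties, '$.rowCount') AS INT64) AS rowCount,
    JSON_EXTRACT_SCALAR(Properties, '$.duration') AS duration,
    JSON_EXTRACT_SCALAR(Properties, '$.format') AS format)
WHERE Properties IS NOT NULL
"""
client.query(sql).result()  # blocks until the DML job finishes
```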
All told, it took the GCloud BigQuery command-line utility about ten hours to load, partition, and cluster the data.
Snowflake
To load Snowflake, we created a table clustered by Timestamp, Level, Source, Node, and Component. We used the COPY INTO command to load the data directly from Azure Blob Storage, and we parsed the JSON in the Properties column into a Snowflake OBJECT. Finally, we added Search Optimization to the table and waited for the search structures to be completely built in the background. It took Snowflake about 22 hours to load and cluster the 100TB dataset.
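A condensed sketch of that sequence (stage, table, and credential names are placeholders; the statements are standard Snowflake DDL/DML, shown here via the Python connector as our driver would issue them):

```python
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user",
                                   password="...")
cur = conn.cursor()

# External stage over one of the Azure Blob containers of gzipped CSVs.
cur.execute("""CREATE STAGE logs_stage
    URL = 'azure://myaccount.blob.core.windows.net/logs-0'
    CREDENTIALS = (AZURE_SAS_TOKEN = '<sas>')""")

# Bulk load; Snowflake parallelizes the copy across the warehouse servers.
cur.execute("""COPY INTO logs_c FROM @logs_stage
    FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' COMPRESSION = GZIP)""")

# Search Optimization builds its structures asynchronously in the background.
cur.execute("ALTER TABLE logs_c ADD SEARCH OPTIMIZATION")
```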
Field Test Execution
We executed the log analytics field test using a custom-written Python script that connected to each platform, executed the test queries, and measured the elapsed time. The script handles both single-user and concurrent-user tests.
For the single-user tests, each query was executed serially 20 times in a row, and the median and 95th percentile values of query execution times are reported in the next section.
For the concurrent-user tests, 50 user threads were created, and each submitted the same query simultaneously. The Python driver waited for all 50 users’ queries to complete before moving on to the next query. If the two-minute timeout was reached, any queries still running were canceled. The median and 95th percentile values of query execution times are reported in the next section.
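The actual driver is in the repository; the following is a stripped-down sketch of the concurrent path, with connection plumbing omitted and timeout handling simplified (the execute_query callable stands in for each platform’s client library):

```python
import statistics, time
from concurrent.futures import ThreadPoolExecutor

TIMEOUT_SECONDS = 120

def run_one(execute_query, sql):
    """Run one query and return its wall-clock time, capped at the timeout."""
    start = time.perf_counter()
    try:
        execute_query(sql, timeout=TIMEOUT_SECONDS)  # platform-specific call
        return time.perf_counter() - start
    except TimeoutError:
        return float(TIMEOUT_SECONDS)   # canceled queries count as 120 seconds

def run_concurrent(execute_query, sql, users=50):
    # All users submit the same query at once; wait for every one to finish
    # (or be canceled) before moving on to the next query.
    with ThreadPoolExecutor(max_workers=users) as pool:
        times = list(pool.map(lambda _: run_one(execute_query, sql),
                              range(users)))
    # Median and approximate 95th percentile of the 50 timings.
    return statistics.median(times), statistics.quantiles(times, n=20)[18]
```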
4. Field Test Results
This section analyzes the results of our log benchmark and the queries we executed to return these results. First, a disclaimer: We made a best effort to translate the queries between Azure Data Explorer’s Kusto Query Language (KQL), Google BigQuery’s Standard SQL, and SnowSQL.
Table 3 shows the average execution time with 50 concurrent users over a single run. Note that the values in this table are based on all query executions, including timeouts (marked with an asterisk); where queries timed out, we set the execution time ceiling to 120 seconds. The query syntax can be found in the appendix.
Table 3. 50 Concurrent Users Average Execution Time Over a Single Run
Query | Azure Data Explorer | Google BigQuery | Snowflake |
---|---|---|---|
1 | 5.78 | 22.81 | 50.84 |
2 | 7.53 | 120.00 * | 120.00 * |
3 | 0.38 | 14.68 | 111.40 * |
4 | 0.20 | 46.55 | 86.11 * |
5 | 3.57 | 13.54 | 27.24 |
6 | 2.90 | 120.00 * | 120.00 * |
7 | 1.07 | 7.31 | 8.38 |
8 | 2.12 | 6.04 | 11.88 |
9 | 2.60 | 8.75 | 15.37 |
10 | 3.65 | 120.00 * | 120.00 * |
11 | 8.57 | 35.40 | 37.21 |
12 | 2.51 | 5.53 | 14.41 |
13 | 0.56 | 37.10 | 23.97 |
14 | 1.53 | 1.44 | 5.19 |
15 | 2.27 | 120.00 * | 120.00 * |
16 | 1.27 | 120.00 * | 22.60 |
17 | 2.54 | 120.00 * | 119.63 * |
18 | 2.07 | 120.00 * | 115.76 * |
19 | 3.38 | 120.00 * | 108.50 * |
Sum of Averages | 54.52 | 1,159.15 | 1,238.47 |
Source: GigaOm 2022
Figure 1 shows the total time on task for all queries, based on the average for each query.
Figure 1. Sum of Average Query Runs, All 19 Queries, 50 Concurrent Users
In Table 4, we see the results of our query tests with a single user, with execution times averaged over three consecutive runs. Again, the values in this table are based on all query executions, including timeouts (marked with an asterisk); where queries timed out, we set the execution time ceiling to 120 seconds. The query syntax can be found in the appendix.
Table 4. Single User Average Execution Time Over 3 Consecutive Runs
Query | Azure Data Explorer | Google BigQuery | Snowflake |
---|---|---|---|
1 | 0.23 | 2.06 | 6.20 |
2 | 0.38 | 40.78 | 115.63 * |
3 | 0.06 | 1.75 | 9.55 |
4 | 0.05 | 5.66 | 4.11 |
5 | 0.13 | 2.22 | 4.94 |
6 | 0.14 | 9.35 | 35.53 |
7 | 0.10 | 1.83 | 1.83 |
8 | 0.13 | 0.71 | 1.69 |
9 | 0.14 | 0.92 | 3.64 |
10 | 0.20 | 9.02 | 19.81 |
11 | 0.45 | 1.85 | 3.22 |
12 | 0.18 | 1.11 | 1.37 |
13 | 0.11 | 4.44 | 2.32 |
14 | 0.17 | 0.60 | 1.15 |
15 | 0.15 | 20.31 | 37.56 |
16 | 0.13 | 120.00 * | 7.11 |
17 | 0.13 | 28.65 | 11.69 |
18 | 0.12 | 32.15 | 11.09 |
19 | 0.16 | 25.62 | 8.80 |
Sum of Averages | 3.15 | 309.01 | 287.24 |
Source: GigaOm 2022
Finally, Figure 2 shows the total time on task for all single-user queries, based on the average for each query.
Figure 2. Sum of Average Query Runs, All 19 Queries, 1 User
Issues with Elasticsearch/OpenSearch
The architecture of Elasticsearch made loading 100TB of data into a hot tier very expensive. Elastic Cloud has a number of storage tiers (hot, warm, cold, and frozen), while AWS OpenSearch offers instance storage (locally attached SSDs), Elastic Block Store (EBS) volumes, UltraWarm (separate instances), and cold storage (S3). According to AWS, the minimum storage requirement for OpenSearch is:
Minimum Storage = Source Data * (1 + Number of Replicas) * 1.45
This means that even a single copy of the data with no replicas (not recommended for disaster recovery) would require a minimum of 100TB × (1 + 0) × 1.45 = 145TB of storage for our tests. ADX, BigQuery, and Snowflake all compress data when loading; Elasticsearch actually expands the storage footprint.
Unfortunately, Elastic Cloud does not allow compute and hot storage to be scaled independently, because the hot tier depends on locally attached SSDs that are physically attached to the cloud instance. AWS OpenSearch can scale EBS volumes independently of compute, but EBS is not as performant as locally attached SSDs, which cannot be scaled; only locally attached SSDs provide the I/O throughput our tests require.
For Elasticsearch, to achieve the minimum storage requirement of 145TB in the hot tier, we would need an Azure (Virginia) storage-optimized deployment across three availability zones with a total of 147TB, 4TB of RAM, and 575 vCPU. This deployment costs $170 per hour to run. For AWS OpenSearch, we would need 40 nodes of r6gd.16xlarge.search (each instance has 3.8TB of local SSDs) to achieve 145TB of hot tier storage, costing $244 per hour. We had the option of a smaller cluster (35TB over nine nodes) to keep the most queried data on local SSD storage and put the rest on EBS (110TB). This would bring the compute cost down to $55 per hour, but the EBS storage would cost approximately $21 per hour.
On the surface, this may seem feasible for our testing, which has a run time of no more than 38 minutes per run (19 queries × 2-minute timeout). However, the loading time on Elasticsearch makes the costs very high. Elasticsearch does not have a high-speed, bulk data loading mechanism like ADX, BigQuery, and Snowflake. Loading and clustering the 100TB dataset on BigQuery and Snowflake took less than a day each. On Elastic Cloud, we used 80 instances of Logstash (which carry a non-trivial hourly cost of their own) and could only achieve an indexing rate of 500GB per hour, so the load would have taken approximately eight days (100TB ÷ 0.5TB per hour ≈ 200 hours). This drove the cost of our test to well over $30,000.
For AWS OpenSearch, we attempted to use Kinesis Firehose, which AWS recommends for loading large amounts of data. This achieved a faster indexing rate than Logstash, but the cost of running multiple Kinesis agents plus the cost of our OpenSearch cluster brought the total to nearly $10,000. By comparison, the costs of loading the data and running these tests were well under $5,000 each on Snowflake and BigQuery, and less than $500 on ADX.
5. Conclusion
This report outlines the results of a GigaOm Log Analytics Field Test comparing Azure Data Explorer (ADX), Google BigQuery, and Snowflake. Overall, the results were insightful in revealing the query execution performance of the three platforms tested. Some of the test highlights include:
- Azure Data Explorer (ADX) outperformed Google BigQuery and Snowflake on all 19 test queries with a single user and 18 of 19 with 50 concurrent users.
- ADX completed all 19 queries under one second with a single user, while the average execution time on BigQuery and Snowflake was 15 and 13 seconds per query, respectively.
- ADX completed all 19 queries in under 10 seconds with 50 concurrent users.
- BigQuery and Snowflake both had eight queries (different ones) that did not complete within the two-minute timeout.
Based on our testing and cost information, we found that Azure Data Explorer produced a 97% price-performance advantage over both Snowflake and Google BigQuery. In our tests of 50 concurrent users, Azure Data Explorer produced a $0.48 per hour run rate, compared to BigQuery and Snowflake, which produced rates of $17.71 and $16.51 per hour, respectively.
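These run rates follow directly from Tables 1 and 3, multiplying each platform’s hourly cost by its total 50-concurrent-user execution time:

Run Cost = Cost Per Hour × (Sum of Averages ÷ 3,600 seconds)

ADX: $32 × (54.52 ÷ 3,600) ≈ $0.48; BigQuery: $55 × (1,159.15 ÷ 3,600) ≈ $17.71; Snowflake: $48 × (1,238.47 ÷ 3,600) ≈ $16.51.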
Keep in mind that these tests were configured to get the best from each platform according to publicly documented best practices. Further optimizations may become possible as the offerings evolve or as internal tests point to different configurations. Beyond raw performance, we gathered some interesting takeaways from our study:
- The Azure Data Explorer query language, KQL, was by far the simplest and easiest syntax to write. It has thoughtful, analyst-friendly functions (like arg_max(), which returns values from other columns on the row holding the maximum) that are not found, or not easily written, in SQL.
- BigQuery was an efficient data loader and a decent performer, but it would be an expensive solution for log analytics, where a slot commitment would be required; otherwise, the on-demand rate of $5 per TB of scanned data would be outrageously expensive for these queries.
- Snowflake was easy to use; both loading the data and running the queries were simple, straightforward processes. However, during the 50 concurrent-user tests, it would only run 5 to 8 queries at a time and queue the rest. We even set the Maximum Concurrent Queries system parameter to its maximum allowed value of 32, and this did not change Snowflake’s self-protective behavior. To account for this, we could have enabled up to 10 clusters to handle the concurrent workload; however, the test costs would have been up to 10x as well, and it would have given Snowflake an unfair advantage over the other platforms.
All in all, we found Azure Data Explorer (Kusto) to be a great performer, with the lowest cost, and an easy-to-use, feature-rich solution for log analytics.
6. Disclaimer
Performance is important, but it is only one criterion for a business-critical database platform selection. This test is a point-in-time check into specific performance. There are numerous other factors to consider in selection across factors of Administration, Integration, Workload Management, User Interface, Scalability, Vendor, Reliability, and numerous other criteria. In our experience, performance changes over time and is competitively different for different workloads. Also, a performance leader can hit up against the point of diminishing returns, and viable contenders can quickly close the gap.
GigaOm runs all of its performance tests to strict ethical standards. The report results are the objective findings of the application of queries to the simulations described. The report clearly defines the selected criteria and process used to establish the field test. It also clearly states the data set sizes, the platforms, the queries, etc., that were used. The reader is left to determine how to qualify the information for their individual needs. The report does not make any claim regarding third-party certification. It presents the objective results received from applying the process to the criteria as described in the report. The report strictly measures performance and does not purport to evaluate other factors that potential customers may find relevant when making a purchase decision.
This is a sponsored report. Microsoft chose the competitors, the test, and the Azure Data Explorer configuration. GigaOm chose the most compatible configurations for the other tested platforms and ran the testing workloads. Choosing compatible configurations is subject to judgment. We have attempted to describe our decisions in this paper.
7. Appendix
Table 5 provides a description of the queries used in the test on each of the three platforms. For those wishing to replicate the tests, the query syntax for each follows the table.
Table 5. Summary of Queries
Query | Description |
---|---|
1 | Performs a simple count of terms in a case-insensitive search |
2 | Performs a simple count of a rare found term in a case-insensitive search |
3 | Performs an advanced text search using both “contains” and “starts with” qualifiers over a short time period in a case-insensitive search |
4 | Performs a full text search across multiple columns and fetches a sample |
5 | Finds session logs from a list of Client Request IDs |
6 | Performs an aggregation by a calculated column based on multi-value lookup in a case-sensitive search |
7 | Extracts Exception type and Message from error traces |
8 | Aggregates by a single low cardinality dimension |
9 | Performs an aggregation by two dimensions and gets the top 50 results |
10 | Performs an aggregation by multiple dimensions, including the high cardinality column Client Request ID |
11 | Performs an aggregation over data resulting from a sub-query, resolves the nodes that produce the most errors, and drills into their log distribution per level |
12 | Performs a distinct count of sessions that had warning traces over a time period |
13 | Performs an aggregation of dynamic properties and provides statistics of ingestion per source |
14 | Performs filtering, aggregation, and finds top values based on a dynamic column of semi-structured data |
15 | Finds top activities (by CPU consumption) and then analyzes node performance over 5-minute buckets |
16 | Performs a top-nested drill down into errors with 3 levels: Source (all), Nodes (top 3), and Components (top 3 + others) |
17 | Finds the 10 tables with the highest count of ingested rows for a specific Source |
18 | Performs an aggregation based on the dynamic property, and then drills down into the top 10 tables by row count |
19 | Performs a join on a calculated key and checks whether any file was downloaded and completed by two different activities (i.e., looks for a bug) |
Source: GigaOm 2022
Query 1 – Count the appearance of a common term
This query performs a simple count of terms in a case-insensitive search.
ADX
Logs | where Timestamp between(datetime(2014-03-08) .. 12h) | where Level == 'Warning' and Message has 'enabled' | count
BigQuery
select count(*) from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08' and timestamp_add(timestamp '2014-03-08', interval 12 hour) and Level = 'Warning' and regexp_contains(Message, r"(?i)(\b|_)enabled(\b|_)")
Snowflake
select count(*) from logs_c where Timestamp between timestamp '2014-03-08' and timestampadd(hour, 12, to_date('2014-03-08')) and Collate(Level, '') = 'Warning' and regexp_like(Message, '.*enabled.*', 'is');
Query 2 – Count the appearance of a rare term
This query performs a simple count of a rare found term in a case-insensitive search.
ADX
Logs | where Level == 'Error' | where Message has 'safeArrayRankMismatch' | count
BigQuery
select count(*) from gigaom-microsoftadx-2022.logs100tb.logs where Level = 'Error' and regexp_contains(Message, r"(?i)(\b|_)safeArrayRankMismatch(\b|_)");
Snowflake
select count(*) from logs_c where Level = 'Error' and regexp_like(Message, '.*SafeArrayRankMismatch.*', 'is');
Query 3 – Advanced text search using “contains” and “starts with”
This query performs an advanced text search using both “contains” and “starts with” qualifiers over a short time period in a case-insensitive search.
ADX
Logs | where Timestamp between(datetime(2014-03-08 03:00:00) .. 1h) | where Source startswith 'im' and Message contains 'PONSE' | summarize Count=count() by Component
BigQuery
select Component, count(*) from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08 03:00:00' and timestamp_add(timestamp '2014-03-08 03:00:00', interval 1 hour) and starts_with(lower(Source), 'im') and lower(Message) like '%ponse%' group by Component ;
Snowflake
select Component, count(*) from logs_c where Timestamp between to_timestamp('2014-03-08 03:00:00') and timestampadd(hour, 1, to_timestamp('2014-03-08 03:00:00')) and startswith(Source, collate('im', 'en-ci')) and contains(lower(Message), 'ponse') group by Component order by Component;
Query 4 – Full text search across multiple columns and fetch a sample
This query performs a full text search across multiple columns and fetches a sample.
ADX
Logs | where Timestamp between(datetime(2014-03-08 03:00:00) .. 1h) | where * has 'Internal' | top 1000 by Timestamp
BigQuery
select * from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08 03:00:00' and timestamp_add(timestamp '2014-03-08 03:00:00', interval 1 hour) and ( regexp_contains(Source, r"(?i)(\b|_)Internal(\b|_)") or regexp_contains(Node, r"(?i)(\b|_)Internal(\b|_)") or regexp_contains(Level, r"(?i)(\b|_)Internal(\b|_)") or regexp_contains(Component, r"(?i)(\b|_)Internal(\b|_)") or regexp_contains(ClientRequestId, r"(?i)(\b|_)Internal(\b|_)") or regexp_contains(Message, r"(?i)(\b|_)Internal(\b|_)") ) order by Timestamp limit 1000;
Snowflake
select * from logs_c where Timestamp between to_timestamp('2014-03-08 03:00:00') and timestampadd(hour, 1,to_timestamp('2014-03-08 03:00:00')) and ( regexp_like(Source,'.*\\bInternal\\b.*', 'is') or regexp_like(Node,'.*\\bInternal\\b.*', 'is') or regexp_like(Level,'.*\\bInternal\\b.*', 'is') or regexp_like(Component,'.*\\bInternal\\b.*', 'is') or regexp_like(ClientRequestId,'.*\\bInternal\\b.*', 'is') or regexp_like(Message,'.*\\bInternal\\b.*', 'is') ) order by Timestamp limit 1000;
Query 5 – Find session logs by Client Request ID
This query finds session logs from a list of Client Request IDs.
ADX
Logs | where ClientRequestId in ( 'd71ab629-ebaf-9380-5fe8-942541387ce5', '6bb29a30-ce0d-1288-36f0-27dbd57d66b0', '1f82e290-a7c4-ac84-7117-52209b3b9c91', 'ecc12181-8c5a-4f87-1ca3-712b4a82c4f0', 'd275a6f0-ba1d-22cf-b06b-6dac508ece4b', 'f0565381-29db-bf73-ca1b-319e80debe1c', '54807a9a-e442-883f-6d8b-186c1c2a1041', 'f1d10647-fc31-dbc3-9e25-67f68a6fe194') | count
BigQuery
select count(*) from gigaom-microsoftadx-2022.logs100tb.logs where ClientRequestId in ( 'd71ab629-ebaf-9380-5fe8-942541387ce5', '6bb29a30-ce0d-1288-36f0-27dbd57d66b0', '1f82e290-a7c4-ac84-7117-52209b3b9c91', 'ecc12181-8c5a-4f87-1ca3-712b4a82c4f0', 'd275a6f0-ba1d-22cf-b06b-6dac508ece4b', 'f0565381-29db-bf73-ca1b-319e80debe1c', '54807a9a-e442-883f-6d8b-186c1c2a1041', 'f1d10647-fc31-dbc3-9e25-67f68a6fe194' )
Snowflake
select count(*) from logs_c where ClientRequestId in ( 'd71ab629-ebaf-9380-5fe8-942541387ce5', '6bb29a30-ce0d-1288-36f0-27dbd57d66b0', '1f82e290-a7c4-ac84-7117-52209b3b9c91', 'ecc12181-8c5a-4f87-1ca3-712b4a82c4f0', 'd275a6f0-ba1d-22cf-b06b-6dac508ece4b', 'f0565381-29db-bf73-ca1b-319e80debe1c', '54807a9a-e442-883f-6d8b-186c1c2a1041', 'f1d10647-fc31-dbc3-9e25-67f68a6fe194' );
Query 6 – Aggregate a calculated column based on a multi-value lookup
This query performs an aggregation by a calculated column based on multi-value lookup in a case-sensitive search.
ADX
Logs | where Timestamp between(datetime(2014-03-08) .. 10d) | where Source in ('IMAGINEFIRST0', 'HAVINGCOLUMN182', 'THEREFORESTORE156', 'HOSTNODES207') | extend LogType = case(Component in ('CLOUDREPORTSERVER', 'COMMON1', 'FABRICINTEGRATOR', 'REQUESTPROTECTION', 'DIRECTORYSERVICE', 'REPORTSERVERSERVICETRACE', 'ACONFIGURATION', 'EXPLORESERVICEWATCHDOG', 'COMMUNICATIONRUNTIME'), 'Security', Component in ('REPORTNETWORKING', 'PUSHDATASERVICETRACE', 'HEALTHSERVICE', 'UTILS', 'PROVIDERSCOMMON'), 'Performance', Component in ('WORKERSERVICECONTENT', 'XMLACOMMON', 'INTEGRATIONDATABASE', 'DATABASEMANAGEMENT'), 'Ingestion', 'Other') | summarize Count=count() by LogType
BigQuery
select case when Component in ('CLOUDREPORTSERVER', 'COMMON1', 'FABRICINTEGRATOR', 'REQUESTPROTECTION', 'DIRECTORYSERVICE', 'REPORTSERVERSERVICETRACE', 'ACONFIGURATION', 'EXPLORESERVICEWATCHDOG', 'COMMUNICATIONRUNTIME') then 'Security' when Component in ('REPORTNETWORKING', 'PUSHDATASERVICETRACE', 'HEALTHSERVICE', 'UTILS', 'PROVIDERSCOMMON') then 'Performance' when Component in ('WORKERSERVICECONTENT', 'XMLACOMMON', 'INTEGRATIONDATABASE', 'DATABASEMANAGEMENT') then 'Ingestion' else 'Other' end as LogType, count(*) from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08' and timestamp_add(timestamp '2014-03-08', interval 10 day) and Source in ('IMAGINEFIRST0', 'HAVINGCOLUMN182', 'THEREFORESTORE156', 'HOSTNODES207') group by LogType;
Snowflake
select case when Component in ('CLOUDREPORTSERVER', 'COMMON1', 'FABRICINTEGRATOR', 'REQUESTPROTECTION', 'DIRECTORYSERVICE', 'REPORTSERVERSERVICETRACE', 'ACONFIGURATION', 'EXPLORESERVICEWATCHDOG', 'COMMUNICATIONRUNTIME') then 'Security' when Component in ('REPORTNETWORKING', 'PUSHDATASERVICETRACE', 'HEALTHSERVICE', 'UTILS', 'PROVIDERSCOMMON') then 'Performance' when Component in ('WORKERSERVICECONTENT', 'XMLACOMMON', 'INTEGRATIONDATABASE', 'DATABASEMANAGEMENT') then 'Ingestion' else 'Other' end as LogType, count(*) from logs_c where Timestamp between to_timestamp('2014-03-08') and timestampadd(day, 10, to_timestamp('2014-03-08')) and Source in ('IMAGINEFIRST0', 'HAVINGCOLUMN182', 'THEREFORESTORE156', 'HOSTNODES207') group by LogType;
Query 7 – Extract exception types and messages from error traces
This query extracts Exception type and Message from error traces.
ADX
Logs | where Timestamp between(datetime(2014-03-08 12:00) .. 1h) | where Level=='Error' and Message startswith 'exception' | parse Message with 'Exception=' ExceptionType ';' * 'Message=' ExceptionMessage ';' * | summarize Count=count() by ExceptionType, ExceptionMessage | top 10 by Count
BigQuery
select regexp_extract(Message, 'Exception=(.*);') as ExceptionType, regexp_extract(Message, 'Message=(.*);') as ExceptionMessage, count(*) as ExceptionCount from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 1 hour) and Level = 'Error' and starts_with(lower(Message), 'exception') group by ExceptionType, ExceptionMessage order by ExceptionCount desc limit 10;
Snowflake
select regexp_substr(Message, 'Exception=(.*);', 1, 1, 'e', 1) as ExceptionType, regexp_substr(Message, 'Message=(.*);', 1, 1, 'e', 1) as ExceptionMessage, count(*) as ExceptionCount from logs_c where Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(hour, 1, to_timestamp('2014-03-08 12:00:00')) and Level = 'Error' and startswith(Message, collate('exception', 'en-ci')) group by ExceptionType, ExceptionMessage order by ExceptionCount desc limit 10;
Query 8 – Aggregate by a single low cardinality dimension
This query aggregates by a single low cardinality dimension.
ADX
Logs | where Timestamp between(datetime(2014-03-08 12:00) .. 6h) | summarize Count=count() by Level
BigQuery
select Level, count(*) as Count from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 6 hour) group by Level;
Snowflake
select Collate(Level, ''), count(*) as Count from logs_c where Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(hour, 6, timestamp '2014-03-08 12:00:00') group by Collate(Level, '');
Query 9 – Aggregate by two dimensions and get top results
This query performs an aggregation by two dimensions and gets the top 50 results.
ADX
Logs | where Timestamp between(datetime(2014-03-08 12:00) .. 6h) | summarize Count=count() by Level, Component | top 50 by Count
BigQuery
select Level, Component, count(*) as Count from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 6 hour) group by Level, Component order by Count desc limit 50;
Snowflake
select Level, Component, count(*) as Count from logs_c where Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(hour, 6, timestamp '2014-03-08 12:00:00') group by Level, Component order by Count desc limit 50;
Query 10 – Aggregate by multiple dimensions including a high cardinality column
This query performs an aggregation by multiple dimensions, including the high cardinality column Client Request ID.
ADX
Logs | where Timestamp between(datetime(2014-03-08) .. 3d) | where Source == 'IMAGINEFIRST0' and Message has 'downloaded' | summarize hint.shufflekey=ClientRequestId Count=count() by Component, Level, Node, ClientRequestId | top 10 by Count
BigQuery
select Component, Level, Node, ClientRequestId, count(*) as Count from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08 00:00:00' and timestamp_add(timestamp '2014-03-08 00:00:00', interval 3 day) and Source = 'IMAGINEFIRST0' and regexp_contains(Message, r"(?i)(\b|_)downloaded(\b|_)") group by Component, Level, Node, ClientRequestId order by Count desc limit 10;
Snowflake
select Component, Level, Node, ClientRequestId, count(*) as Count from logs_c where Timestamp between to_timestamp('2014-03-08 00:00:00') and timestampadd(day, 3, timestamp '2014-03-08 12:00:00') and Source = 'IMAGINEFIRST0' and regexp_like(Message,'.*\\bDownloaded\\b.*', 'is') group by Component, Level, Node, ClientRequestId order by Count desc limit 10;
Query 11 – Resolve nodes that produce most of errors and drill down
This query performs an aggregation over data resulting from a sub-query, resolves the nodes that produce the most errors, and drills into their log distribution per level.
ADX
let top_nodes = Logs | where Timestamp between(datetime(2014-03-08 12:00) .. 6h) | where Level == 'Error' | summarize count() by Node | top 10 by count_ | project Node; Logs | where Timestamp between(datetime(2014-03-08 12:00) .. 6h) | where Node in (top_nodes) | summarize count() by Level, Node
BigQuery
with top_nodes as ( select Node from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 6 hour) and Level = 'Error' group by Node order by count(*) desc limit 10 ) select Level, Node, count(*) as Count from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 6 hour) and Node in (select Node from top_nodes) group by Level, Node;
Snowflake
with top_nodes as ( select Node from logs_c where Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(hour, 6, timestamp '2014-03-08 12:00:00') and Level = 'Error' group by Node order by count(*) desc limit 10 ) select Level, Node, count(*) as Count from logs_c where Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(hour, 6, timestamp '2014-03-08 12:00:00') and Node in (select Node from top_nodes) group by Level, Node;
Query 12 – Distinct count of sessions that had warning traces over a short period of time
This query performs a distinct count of sessions that had warning traces over a time period.
ADX
Logs | where Timestamp between(datetime(2014-03-08 12:00) .. 6h) | where Level == 'Error' | summarize dcount(ClientRequestId) by bin(Timestamp, 1h)
BigQuery
select timestamp_seconds(1*3600 * div(unix_seconds(Timestamp), 1*3600)) as bin, count(distinct ClientRequestId) as dcount from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 6 hour) and Level = 'Error' group by bin order by bin;
Snowflake
select time_slice(Timestamp, 1, 'HOUR') as bin, count(distinct ClientRequestId) as dcount from logs_c where Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(hour, 6, timestamp '2014-03-08 12:00:00') and Level = 'Error' group by bin order by bin;
Query 13 – Aggregate statistics of ingestion per source
This query performs an aggregation of dynamic properties and provides statistics of ingestion per source.
ADX
Logs | where Timestamp between(datetime(2014-03-08 12:00) .. 6h) | where Message startswith 'IngestionCompletionEvent' | where Source in ('IMAGINEFIRST0', 'CLIMBSTEADY83', 'INTERNALFIRST79', 'WORKWITHIN77', 'ADOPTIONCUSTOMERS81', 'FIVENEARLY85', 'WHATABOUT98', 'PUBLICBRAINCHILD89', 'WATCHPREVIEW91', 'LATERYEARS87', 'GUTHRIESSCOTT93', 'THISSTORING16') | where Properties has 'parquet' | summarize MaxRowCount= max(tolong(Properties.rowCount)), percentiles(totimespan(Properties.duration), 50, 90, 95) by Source
BigQuery
with PercentilesTable as ( select Source, max(PropertiesStruct.rowCount) as MaxRowCount, approx_quantiles(time_diff(PropertiesStruct.duration, '00:00:00', microsecond) / 1000000, 100) percentiles from gigaom-microsoftadx-2022.logs100tb.logs_c where Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 6 hour) and starts_with(Message, 'IngestionCompletionEvent') and Source in ('IMAGINEFIRST0', 'CLIMBSTEADY83', 'INTERNALFIRST79', 'WORKWITHIN77', 'ADOPTIONCUSTOMERS81', 'FIVENEARLY85', 'WHATABOUT98', 'PUBLICBRAINCHILD89', 'WATCHPREVIEW91', 'LATERYEARS87', 'GUTHRIESSCOTT93', 'THISSTORING16') and regexp_contains(Properties, r"(?i)(\b|_)parquet(\b|_)") group by Source ) select Source, MaxRowCount, percentiles[offset(50)] as p50, percentiles[offset(90)] as p90, percentiles[offset(95)] as p95 from PercentilesTable order by MaxRowCount desc;
Snowflake
select Source, max(cast(PropertiesStruct:rowCount as int)) as MaxRowCount, max(case when startswith(PropertiesStruct:duration, '1.') then timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:duration, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 + 24*3600 else timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:duration, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 end), percentile_cont(0.50) within group (order by case when startswith(PropertiesStruct:duration, '1.') then timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:duration, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 + 24*3600 else timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:duration, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 end) as percentiles50, percentile_cont(0.90) within group (order by case when startswith(PropertiesStruct:duration, '1.') then timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:duration, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 + 24*3600 else timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:duration, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 end) as percentiles90, percentile_cont(0.95) within group (order by case when startswith(PropertiesStruct:duration, '1.') then timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:duration, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 + 24*3600 else timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:duration, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 end) as percentiles95 from logs_c where Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(hour, 6, timestamp '2014-03-08 12:00:00') and startswith(Message, 'IngestionCompletionEvent') and Source in ('IMAGINEFIRST0', 'CLIMBSTEADY83', 'INTERNALFIRST79', 'WORKWITHIN77', 'ADOPTIONCUSTOMERS81', 'FIVENEARLY85', 'WHATABOUT98', 'PUBLICBRAINCHILD89', 'WATCHPREVIEW91', 'LATERYEARS87', 'GUTHRIESSCOTT93', 'THISSTORING16') and regexp_like(Properties,'.*\\bparquet\\b.*', 'is') group by Source;
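The bulky CASE expressions in the Snowflake version compensate for the lack of a native timespan type: the duration values in Properties are serialized .NET timespans (for example, '1.02:03:04' for 1 day, 2 hours, 3 minutes, 4 seconds), so the query strips a leading '1.' day prefix and adds it back as 24 × 3,600 seconds. A Python sketch of the same conversion, generalized to any day count (a hypothetical helper for illustration, not part of the test harness):

```python
def timespan_to_seconds(ts: str) -> float:
    """Convert a serialized .NET timespan 'd.hh:mm:ss[.fffffff]' to seconds."""
    days = 0
    # A leading 'days.' component (e.g., '1.02:03:04') is split off first;
    # the SQL above handles only the '1.' prefix, which sufficed for this data.
    if "." in ts.split(":")[0]:
        day_part, ts = ts.split(".", 1)
        days = int(day_part)
    hours, minutes, seconds = ts.split(":")
    return days * 86400 + int(hours) * 3600 + int(minutes) * 60 + float(seconds)
```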
Query 14 – Filter and aggregate top values from semi-structured data
This query performs filtering, aggregation, and finds top values based on a dynamic column of semi-structured data.
ADX
Logs | where Timestamp between(datetime(2014-03-08 12:00) .. 1h) | where Component == "DOWNLOADER" | summarize DownloadRate=max(todouble(Properties.compressedSize) / ((totimespan(Properties.downloadDuration) / 1sec))) by Source | top 10 by DownloadRate
BigQuery
select Source, max(PropertiesStruct.compressedSize / time_diff(PropertiesStruct.downloadDuration, '00:00:00', microsecond) / 1000000) as DownloadRate from gigaom-microsoftadx-2022.logs100tb.logs_c where Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 1 hour) and Component = 'DOWNLOADER' group by Source order by DownloadRate desc limit 10;
Snowflake
select Source, max( cast(PropertiesStruct:compressedSize as int) / ( case when startswith(PropertiesStruct:downloadDuration, '1.') then timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:downloadDuration, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 + 24*3600 else timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:downloadDuration, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 end )) as DownloadRate from logs_c where Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(hour, 1, timestamp '2014-03-08 12:00:00') and Component = 'DOWNLOADER' group by Source order by DownloadRate desc limit 10;
Query 15 – Find top activities and get analysis of node performance
This query finds top activities (by CPU consumption) and then analyzes node performance over 5-minute buckets.
ADX
let Data = Logs | where Source == 'IMAGINEFIRST0' | where Timestamp between(datetime(2014-03-08 12:00) .. 5d) | where Message startswith 'IngestionCompletionEvent'; let TopNodesByCPU = Data | summarize MaxCPU = max(totimespan(Properties.cpuTime)) by Node | order by MaxCPU desc, Node desc | take 10 | project Node; Data | where Node in (TopNodesByCPU) | summarize AverageProcessTimeInSeconds=avg(totimespan(Properties.cpuTime)) by bin(Timestamp, 5m), Node
BigQuery
with TopNodesByCPU as ( select Node, time_diff(PropertiesStruct.cpuTime, '00:00:00', microsecond) / 1000000 as cpuTime from gigaom-microsoftadx-2022.logs100tb.logs_c where Source = 'IMAGINEFIRST0' and Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 5 day) and starts_with(lower(Message), 'ingestioncompletionevent') group by Node, cpuTime order by cpuTime desc, Node desc limit 10 ) select Node, avg(time_diff(PropertiesStruct.cpuTime, '00:00:00', microsecond) / 1000000), timestamp_seconds(5*60 * div(unix_seconds(Timestamp), 5*60)) as bin from gigaom-microsoftadx-2022.logs100tb.logs_c where Node in (select Node from TopNodesByCPU) and Source = 'IMAGINEFIRST0' and Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 5 day) and starts_with(lower(Message), 'ingestioncompletionevent') group by Node, bin order by bin ;
Snowflake
with TopNodesByCPU as ( select Node, case when startswith(PropertiesStruct:cpuTime, '1.') then timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:cpuTime, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 + 24*3600 else timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:cpuTime, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 end as cpuTime from logs_c where Source = 'IMAGINEFIRST0' and Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(day, 5, timestamp '2014-03-08 12:00:00') and startswith(lower(Message), 'ingestioncompletionevent') group by Node, cpuTime order by cpuTime desc, Node desc limit 10 ) select Node, avg(case when startswith(PropertiesStruct:cpuTime, '1.') then timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:cpuTime, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 + 24*3600 else timediff(microsecond, to_time('00:00:00'), to_time(regexp_replace(PropertiesStruct:cpuTime, '^1\\.(.*)', '\\1', 1, 1, 'm'))) / 1000000 end), time_slice(Timestamp, 5, 'MINUTE', 'START') as bin from logs_c where Node in (select Node from TopNodesByCPU) and Source = 'IMAGINEFIRST0' and Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(day, 5, timestamp '2014-03-08 12:00:00') and startswith(lower(Message), 'ingestioncompletionevent') group by Node, bin order by bin ;
Query 16 – Top nested drill down into 3 levels of errors
This query performs a top-nested drill down into errors with 3 levels: Source (all), Nodes (top 3), and Components (top 3 + others).
ADX
Logs | where Timestamp between (datetime(2014-03-08) .. 3h) | where Level in ("Error") | summarize (FirstErrorTime, FirstErrorComponent)=arg_min(Timestamp, Component), (LastErrorTime, LastErrorComponent)=arg_max(Timestamp, Component), Errors=count() by Source | top 5 by Errors
BigQuery
with ArgMax as ( select Source, min(Timestamp) as MinTS, max(Timestamp) as MaxTS, count(*) as Errors from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08' and timestamp_add(timestamp '2014-03-08', interval 3 hour) and Level = 'Error' group by Source ) select ArgMax.Source, MinTS as FirstErrorTime, FirstComp.Component as FirstErrorComponent, MaxTS as LastErrorTime, LastComp.Component as LastErrorComponent, Errors from ArgMax left join gigaom-microsoftadx-2022.logs100tb.logs FirstComp on FirstComp.Source = ArgMax.Source and FirstComp.Timestamp = ArgMax.MinTS left join gigaom-microsoftadx-2022.logs100tb.logs LastComp on LastComp.Source = ArgMax.Source and LastComp.Timestamp = ArgMax.MaxTS order by Errors desc limit 5 ;
Snowflake
with ArgMax as ( select Source, min(Timestamp) as MinTS, max(Timestamp) as MaxTS, count(*) as Errors from logs_c where Timestamp between to_timestamp('2014-03-08 00:00:00') and timestampadd(hour, 3, timestamp '2014-03-08 00:00:00') and Level = 'Error' group by Source ) select ArgMax.Source, MinTS as FirstErrorTime, FirstComp.Component as FirstErrorComponent, MaxTS as LastErrorTime, LastComp.Component as LastErrorComponent, Errors from ArgMax left join logs_c FirstComp on FirstComp.Source = ArgMax.Source and FirstComp.Timestamp = ArgMax.MinTS left join logs_c LastComp on LastComp.Source = ArgMax.Source and LastComp.Timestamp = ArgMax.MaxTS order by Errors desc limit 5 ;
Query 17 – Find top count of ingested rows for a specific source
This query finds the 10 tables with the highest count of ingested rows for a specific Source.
ADX
let Data = Logs | where Timestamp between(datetime(2014-03-08 12:00) .. 3d) | where Source == 'IMAGINEFIRST0'; Data | where Message startswith '$$IngestionCommand' | parse Message with '$$IngestionCommand table=' Table ' ' * | distinct hint.shufflekey=ClientRequestId ClientRequestId, Table | join kind=inner hint.shufflekey=ClientRequestId ( Data | where Message startswith 'IngestionCompletionEvent' ) on ClientRequestId | summarize TotalRows = sum(tolong(Properties.rowCount)) by Table | top 10 by TotalRows
BigQuery
with Data as ( select * from gigaom-microsoftadx-2022.logs100tb.logs_c where Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 3 day) and Source = 'IMAGINEFIRST0' ), Data2 as ( select distinct regexp_extract(Message, 'table=(.*) ') as icTable, ClientRequestId from Data where starts_with(lower(Message), '$$ingestioncommand') ) select icTable, sum(l.PropertiesStruct.rowCount) as TotalRows from Data l inner join Data2 d on l.ClientRequestId = d.ClientRequestId where starts_with(l.Message, 'IngestionCompletionEvent') group by icTable order by TotalRows desc limit 10 ;
Snowflake
with Data as ( select * from logs_c where Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(day, 3, timestamp '2014-03-08 12:00:00') and Source = 'IMAGINEFIRST0' ), Data2 as ( select distinct regexp_substr(Message, 'table=\\w+') as icTable, ClientRequestId from Data where startswith(lower(Message), '$$ingestioncommand') ) select icTable, sum(cast(l.PropertiesStruct:rowCount as int)) as TotalRows from Data l inner join Data2 d on l.ClientRequestId = d.ClientRequestId where startswith(l.Message, 'IngestionCompletionEvent') group by icTable order by TotalRows desc limit 10 ;
Query 18 – Aggregate a dynamic property and drill-down
This query performs an aggregation based on the dynamic property, and then drills down into the top 10 tables by row count.
ADX
let Data = Logs | where Timestamp between(datetime(2014-03-08 12:00) .. 3d) | where Source == 'IMAGINEFIRST0'; Data | where Message startswith '$$IngestionCommand' | parse Message with '$$IngestionCommand table=' Table ' ' * | distinct hint.shufflekey=ClientRequestId ClientRequestId, Table | join kind=inner hint.shufflekey=ClientRequestId ( Data | where Message startswith 'IngestionCompletionEvent' ) on ClientRequestId | extend Format=tostring(Properties.format), RowCount=tolong(Properties.rowCount) | top-nested of Format by max(1), top-nested 10 of Table by Rows=sum(RowCount) | project Format, Table, Rows
BigQuery
with Data as ( select * from gigaom-microsoftadx-2022.logs100tb.logs_c where Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 3 day) and Source = 'IMAGINEFIRST0' ), Data2 as ( select distinct regexp_extract(Message, 'table=(.*) ') as icTable, ClientRequestId from Data where starts_with(lower(Message), '$$ingestioncommand') ) select Format, icTable, RowCount from ( select *, row_number() over (partition by Format order by RowCount desc) as rownum from ( select l.PropertiesStruct.format as Format, icTable, sum(l.PropertiesStruct.rowCount) as RowCount from Data l inner join Data2 d on l.ClientRequestId = d.ClientRequestId where starts_with(l.Message, 'IngestionCompletionEvent') group by Format, icTable ) ) where rownum <= 10 ;
Snowflake
with Data as ( select * from logs_c where Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(day, 3, timestamp '2014-03-08 12:00:00') and Source = 'IMAGINEFIRST0' ), Data2 as ( select distinct regexp_substr(Message, 'table=(.*) ') as icTable, ClientRequestId from Data where startswith(lower(Message), '$$ingestioncommand') ) select Format, icTable, RowCount from ( select *, row_number() over (partition by Format order by RowCount desc) as rownum from ( select l.PropertiesStruct:format as Format, icTable, sum(cast(l.PropertiesStruct:rowCount as int)) as RowCount from Data l inner join Data2 d on l.ClientRequestId = d.ClientRequestId where startswith(l.Message, 'IngestionCompletionEvent') group by Format, icTable ) ) where rownum <= 10 ;
Query 19 – Join on a calculated key and look for a bug
This query performs a join on a calculated key and checks whether any file was downloaded and completed by two different activities (i.e., looks for a bug).
ADX
let Data = Logs | where Timestamp between(datetime(2014-03-08 12:00) .. 3d) | where Source in ('IMAGINEFIRST0') ; let Downloading = Data | where Message startswith 'Downloading file path:' | parse Message with * 'path:' path | project DownloadClientRequestId = ClientRequestId, path; let IngestionCompletion = Data | where Message startswith 'IngestionCompletionEvent' | parse Message with * 'path:' path | project CompleteClientRequestId = ClientRequestId, path; Downloading | join hint.shufflekey=path kind=inner (IngestionCompletion) on path | where DownloadClientRequestId != CompleteClientRequestId | count
BigQuery
with Data as ( select * from gigaom-microsoftadx-2022.logs100tb.logs where Timestamp between timestamp '2014-03-08 12:00:00' and timestamp_add(timestamp '2014-03-08 12:00:00', interval 3 day) and Source in ('IMAGINEFIRST0') ), Downloading as ( select regexp_extract(Message, 'path:(.*)') as Path, ClientRequestId as DownloadClientRequestId from Data where starts_with(Message, 'Downloading file path:') ), IngestionCompletion as ( select regexp_extract(Message, 'path:(.*)') as Path, ClientRequestId as CompleteClientRequestId from Data where starts_with(Message, 'IngestionCompletionEvent') ) select count(*) from Downloading d inner join IngestionCompletion ic on d.Path = ic.path where DownloadClientRequestId <> CompleteClientRequestId ;
Snowflake
with Data as ( select * from logs_c where Timestamp between to_timestamp('2014-03-08 12:00:00') and timestampadd(day, 3, timestamp '2014-03-08 12:00:00') and Source in ('IMAGINEFIRST0') ), Downloading as ( select regexp_substr(Message, 'path:(.*)') as Path, ClientRequestId as DownloadClientRequestId from Data where startswith(Message, 'Downloading file path:') ), IngestionCompletion as ( select regexp_substr(Message, 'path:(.*)') as Path, ClientRequestId as CompleteClientRequestId from Data where startswith(Message, 'IngestionCompletionEvent') ) select count(*) from Downloading d inner join IngestionCompletion ic on d.Path = ic.path where DownloadClientRequestId <> CompleteClientRequestId ;
8. About Microsoft
Microsoft (Nasdaq “MSFT” @microsoft) enables digital transformation for the era of an intelligent cloud and an intelligent edge. Its mission is to empower every person and every organization on the planet to achieve more.
Microsoft offers Azure Data Explorer (ADX). To learn more about ADX visit https://azure.microsoft.com/en-us/services/data-explorer/.
9. About William McKnight
William McKnight is a former Fortune 50 technology executive and database engineer. An Ernst & Young Entrepreneur of the Year finalist and frequent best practices judge, he helps enterprise clients with action plans, architectures, strategies, and technology tools to manage information.
Currently, William is an analyst for GigaOm Research who takes corporate information and turns it into a bottom-line-enhancing asset. He has worked with Dong Energy, France Telecom, Pfizer, Samba Bank, ScotiaBank, Teva Pharmaceuticals, and Verizon, among many others. William focuses on delivering business value and solving business problems utilizing proven approaches in information management.
10. About Jake Dolezal
Jake Dolezal is a contributing analyst at GigaOm. He has two decades of experience in the information management field, with expertise in analytics, data warehousing, master data management, data governance, business intelligence, statistics, data modeling and integration, and visualization. Jake has solved technical problems across a broad range of industries, including healthcare, education, government, manufacturing, engineering, hospitality, and restaurants. He has a doctorate in information management from Syracuse University.
11. About GigaOm
GigaOm provides technical, operational, and business advice for IT’s strategic digital enterprise and business initiatives. Enterprise business leaders, CIOs, and technology organizations partner with GigaOm for practical, actionable, strategic, and visionary advice for modernizing and transforming their business. GigaOm’s advice empowers enterprises to successfully compete in an increasingly complicated business atmosphere that requires a solid understanding of constantly changing customer demands.
GigaOm works directly with enterprises both inside and outside of the IT organization to apply proven research and methodologies designed to avoid pitfalls and roadblocks while balancing risk and innovation. Research methodologies include but are not limited to adoption and benchmarking surveys, use cases, interviews, ROI/TCO, market landscapes, strategic trends, and technical benchmarks. Our analysts possess 20+ years of experience advising a spectrum of clients from early adopters to mainstream enterprises.
GigaOm’s perspective is that of the unbiased enterprise practitioner. Through this perspective, GigaOm connects with engaged and loyal subscribers on a deep and meaningful level.
12. Copyright
© Knowingly, Inc. 2022 "Log and Telemetry Analytics Performance Benchmark" is a trademark of Knowingly, Inc. For permission to reproduce this report, please contact sales@gigaom.com.