Under the Hood
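Mixpanel's analytics UI is powered by an in-house database called Arb, purpose-built to ingest, store, and query trillions of events in real time. This page covers the core aspects of our design, the pain it removes for our users, and how it compares to other systems.
Event-Centric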
Mixpanel is built for ingesting, storing and querying events. Each event has a name, a timestamp, a unique identifier, a distinct_id that identifies the entity that performed the event, and a JSON blob of properties.
{
  "event": "Signed up",
  "properties": {
    "time": 1618716477000,
    "distinct_id": "[email protected]",
    "$insert_id": "29fc2962-6d9c-455d-95ad-95b84f09b9e4",
    "Referred by": "Friend",
    "URL": "www.jy710.com/signup"
  }
}
Events are a simple yet powerful way to collect and store data, for the following reasons:
- Events map cleanly to real-world actions. When something happens at a point in time to a user, you can track it with all the context you know about that event.
- Events are granular. Any question about user engagement, conversion, or retention can be modeled as an aggregation over a user's event stream. The event data model makes no assumption about the queries it might receive, so it serves as a flexible foundation to power arbitrary queries. Events can be summarized by any property (to form a metric) or segmented by any property (to drill down into a metric) completely on-the-fly.
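- Events are immutable and append-only*. When something happens in the real world, it never "un-happens". Immutability makes it possible to design event-native systems that are both performant and cost-effective.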
- Events are flexible. They can model actions that go beyond user activity: support tickets, pull requests, Slack messages, payments, CDC from a database, etc.
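As a rough illustration of the "aggregation over an event stream" idea in the list above, here is a minimal Python sketch. The sample events, the `amount` property, and both helper names are hypothetical and chosen only for this example; this is not how Arb computes metrics.

```python
from collections import Counter, defaultdict

# Hypothetical sample events, shaped like the event described above.
events = [
    {"event": "Signed up", "properties": {"distinct_id": "u1", "time": 1, "Referred by": "Friend"}},
    {"event": "Signed up", "properties": {"distinct_id": "u2", "time": 2, "Referred by": "Ad"}},
    {"event": "Purchase", "properties": {"distinct_id": "u1", "time": 3, "amount": 20}},
    {"event": "Purchase", "properties": {"distinct_id": "u1", "time": 4, "amount": 5}},
]


def segment_count(events, event_name, prop):
    """Count events named `event_name`, segmented by the property `prop`."""
    counts = Counter()
    for e in events:
        if e["event"] == event_name:
            counts[e["properties"].get(prop)] += 1
    return dict(counts)


def sum_by_user(events, event_name, prop):
    """Summarize a numeric property per distinct_id to form a metric."""
    totals = defaultdict(float)
    for e in events:
        if e["event"] == event_name:
            totals[e["properties"]["distinct_id"]] += e["properties"].get(prop, 0)
    return dict(totals)
```

Neither function needs a pre-declared schema or a precomputed rollup: any property present on the raw events can be used for segmentation or summarization at query time.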
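What this eliminates: Pre-computation, rollups, or indexing.
*The exception is compliance with privacy laws like GDPR, which we handle specially.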
Optimized for Interactive User Joins
Mixpanel's UI is built for interactive exploration of event-based metrics. We must respond to queries within seconds to make data exploration delightful at scale.
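Our query engine employs the usual techniques for CPU performance: columnar storage, dictionary encoding, a query optimizer, and a purpose-built query engine written in C/C++.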
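For readers unfamiliar with dictionary encoding, the following Python sketch shows the general technique on a single column. It is a textbook illustration with hypothetical function names and sample data, not Arb's actual storage format.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = []   # code -> value
    index = {}        # value -> code
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def count_equal(dictionary, codes, target):
    """Filter on the encoded column by comparing integers instead of strings."""
    if target not in dictionary:
        return 0
    code = dictionary.index(target)
    return sum(1 for c in codes if c == code)


# Hypothetical column of property values.
browser_column = ["Chrome", "Safari", "Chrome", "Firefox", "Chrome"]
dictionary, codes = dictionary_encode(browser_column)
```

Repeated strings are stored once, and filters operate on compact integer codes, which is one reason the technique pairs well with columnar storage.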
We also partition events by their distinct_id, the property that identifies the actor that performed the event, and by time. This, combined with our in-memory query engine, enables behavioral queries (funnels, flows, retention) to be computed with high parallelism, no shuffling, and no expensive fact-on-fact joins, leading to low query latency.
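To see why partitioning by distinct_id avoids shuffling, consider this simplified Python sketch of a funnel. Events are flattened to top-level `distinct_id` and `time` keys for brevity, and the CRC32-based shard assignment, the shard count, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict
from zlib import crc32


def shard_for(distinct_id, num_shards):
    """Route every event for a given user to the same shard."""
    return crc32(distinct_id.encode("utf-8")) % num_shards


def partition(events, num_shards):
    """Group events by shard, then by user, ordered by time within each user."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for e in events:
        shard = shards[shard_for(e["distinct_id"], num_shards)]
        shard[e["distinct_id"]].append(e)
    for shard in shards:
        for user_events in shard.values():
            user_events.sort(key=lambda e: e["time"])
    return shards


def funnel_on_shard(shard, steps):
    """Count the users on one shard who reach each step, in order."""
    reached = [0] * len(steps)
    for user_events in shard.values():
        step = 0
        for e in user_events:
            if step < len(steps) and e["event"] == steps[step]:
                reached[step] += 1
                step += 1
    return reached


def funnel(events, steps, num_shards=4):
    """Each shard is processed independently; results are summed at the end."""
    per_shard = [funnel_on_shard(s, steps) for s in partition(events, num_shards)]
    return [sum(column) for column in zip(*per_shard)]
```

Because all of a user's events live on one shard, each shard can evaluate the funnel for its own users without exchanging data with other shards, and the final step is a simple sum of per-shard counts.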
Finally, we benefit from cloud economics: 1 CPU for 100 seconds costs the same as 100 CPUs for 1 second, but the latter can respond to queries 100x faster. Multitenancy makes this approach possible at scale and enables fast queries over billions of events.
What this eliminates: Data sampling, fact-to-fact joins, and manually refreshing dashboards.
Real-Time
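Events are available for analysis in Mixpanel within seconds of reaching our ingestion servers. Arb uses a Lambda architecture: it collects the most recent events in a row-based format while storing historical events in a time-partitioned, columnar format. This enables both fast real-time analysis and efficient historical analysis.
What this eliminates: Waiting for periodic ETL jobs or caches to populate.
Schema-on-Read
Events contain an arbitrary set of JSON properties. The properties associated with a given event type are generally stable, but they may change occasionally as new features are added to the product being tracked. Systems that enforce a schema at collection time require some form of schema upgrade, which can be time-consuming, especially when events are collected from client devices. That said, schemas are useful for providing features like property autocomplete to the people performing analysis.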
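The two-layer layout described above can be sketched as a toy Python class. The class name, the per-day partitioning, and the `compact` step are all assumptions for illustration; the source does not describe Arb's internals at this level of detail.

```python
class EventStore:
    """Toy store: a row-oriented buffer for recent events plus
    per-day columnar partitions for historical events."""

    def __init__(self):
        self.recent_rows = []   # list of event dicts, append-only
        self.partitions = {}    # day -> {"event": [...], "time": [...]}

    def ingest(self, event):
        # New events are queryable as soon as they are appended.
        self.recent_rows.append(event)

    def compact(self):
        """Move buffered rows into time-partitioned columnar storage."""
        for row in self.recent_rows:
            day = row["time"] // 86_400  # time in seconds; one partition per day
            part = self.partitions.setdefault(day, {"event": [], "time": []})
            part["event"].append(row["event"])
            part["time"].append(row["time"])
        self.recent_rows = []

    def count(self, event_name):
        """Answer a query by combining the historical and recent layers."""
        historical = sum(
            part["event"].count(event_name) for part in self.partitions.values()
        )
        recent = sum(1 for row in self.recent_rows if row["event"] == event_name)
        return historical + recent
```

A query always reads both layers, so an event counts toward results immediately after `ingest`, whether or not it has been compacted yet.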
Mixpanel solves both problems with schema-on-read. Events are ingested and stored with arbitrary JSON and we infer schemas in real-time to power the autocomplete menus in our UI. Our schema inference also accounts for recency so that stale schemas naturally age out.
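One way to picture schema inference with recency is the Python sketch below. The function name, the `now` and `max_age_seconds` parameters, and the "latest observed type wins" rule are assumptions for illustration, not a description of Mixpanel's actual inference logic.

```python
def infer_schema(events, now, max_age_seconds):
    """Return {event_name: {property: type_name}} using only recent sightings."""
    last_seen = {}  # (event_name, property) -> (time, type_name)
    for e in events:
        seen_at = e["properties"]["time"]
        for prop, value in e["properties"].items():
            key = (e["event"], prop)
            if key not in last_seen or seen_at >= last_seen[key][0]:
                last_seen[key] = (seen_at, type(value).__name__)

    schema = {}
    for (event_name, prop), (seen_at, type_name) in last_seen.items():
        if now - seen_at <= max_age_seconds:  # stale properties age out
            schema.setdefault(event_name, {})[prop] = type_name
    return schema
```

Nothing is declared up front: a property appears in the inferred schema once an event carries it, and disappears again if no recent event does.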
What this eliminates: Schema migrations.
Star Schema
Mixpanel's data model is fundamentally a star schema: events are facts and user profiles/lookup tables are dimensions. Events are typically streamed in from client devices and server logs, while dimensional data is periodically loaded from a system of record and provides enrichment to the events for analysis.
Arb's query and storage engine can run star-schema joins on the fly. This means events and dimensions can be loaded at any time without any coordination, rather than needing to be joined at ingestion. This query-time approach also enables backfills of events and dimensions to be done retroactively.
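A query-time join between events and a dimension table can be sketched in a few lines of Python. The `profiles` mapping, the `plan` attribute, and the function name are hypothetical; the point is only that enrichment happens when the query runs, not when the event is ingested.

```python
from collections import Counter


def count_by_dimension(events, profiles, event_name, attribute):
    """Join events (facts) to profiles (a dimension) at query time."""
    counts = Counter()
    for e in events:
        if e["event"] != event_name:
            continue
        profile = profiles.get(e["distinct_id"], {})
        counts[profile.get(attribute, "(unknown)")] += 1
    return dict(counts)


# Hypothetical data: events arrive first, profiles are loaded independently.
events = [
    {"event": "Purchase", "distinct_id": "u1"},
    {"event": "Purchase", "distinct_id": "u2"},
    {"event": "Purchase", "distinct_id": "u1"},
]
profiles = {"u1": {"plan": "pro"}}
before_backfill = count_by_dimension(events, profiles, "Purchase", "plan")

# A later profile load retroactively changes the result for old events.
profiles["u2"] = {"plan": "free"}
after_backfill = count_by_dimension(events, profiles, "Purchase", "plan")
```

The stored events are never rewritten; only the dimension table changes, which is why backfills need no coordination with event ingestion.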
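What this eliminates: Ingestion-time enrichment. Coordination between streaming and batch systems.
Idempotent
Mixpanel's ingestion pipeline is idempotent, which means events that are accidentally sent multiple times do not affect analysis. This simplifies integrating Mixpanel into your own streaming or batch data pipelines, because it provides exactly-once ingestion end to end. Unlike other systems, which deduplicate within a short time window at ingestion, we use a novel query-time approach to deduplication, outlined in our engineering blog. This enables us to detect duplicates even if they arrive months later.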
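As a simplified picture of query-time deduplication, the Python sketch below drops repeated events while scanning. The choice of deduplication key (event name, distinct_id, time, and $insert_id) and the function name are assumptions for this example; the actual mechanism is the one described in the engineering blog and is not reproduced here.

```python
def dedupe_at_query_time(events):
    """Yield each logical event once, however many copies were stored."""
    seen = set()
    unique = []
    for e in events:
        p = e["properties"]
        # Assumed dedup key; the real key may differ.
        key = (e["event"], p["distinct_id"], p["time"], p["$insert_id"])
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique
```

Because duplicates are resolved when data is read rather than when it is written, a retry that arrives long after the original event is handled the same way as one that arrives a second later.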
What this eliminates: Keeping state about what you have already sent to Mixpanel.
Cloud-Native
Mixpanel is a fully managed cloud application, maintained by the Mixpanel team and deployed on Google Cloud. Like other cloud-native databases, it decouples compute from storage, which reduces costs at high-scale.
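Being on Google Cloud also lets us use cloud primitives to ship features faster, scale seamlessly as load increases, and take advantage of the enterprise-grade security that Google provides.
What this eliminates: Server maintenance, upgrades, and capacity provisioning.
Open APIs
There are many ways to integrate with Mixpanel, but all are based on our JSON-over-HTTP APIs. We believe it should be easy to bring data into or out of Mixpanel with whatever tools you already use, whether it's a CDP, a data pipeline, or a simple cURL.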
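To make "JSON-over-HTTP" concrete, here is a small Python sketch that builds (but does not send) a POST request carrying a batch of events. The endpoint URL is a deliberate placeholder and the function name is hypothetical; consult the Mixpanel API reference for the real endpoints, authentication, and parameters.

```python
import json
import urllib.request

# Placeholder endpoint, NOT a real Mixpanel URL.
HYPOTHETICAL_ENDPOINT = "https://example.invalid/import"


def build_import_request(events, endpoint=HYPOTHETICAL_ENDPOINT):
    """Serialize events as a JSON array and wrap them in an HTTP POST."""
    body = json.dumps(events).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_import_request([
    {
        "event": "Signed up",
        "properties": {
            "time": 1618716477000,
            "distinct_id": "u1",
            "$insert_id": "29fc2962-6d9c-455d-95ad-95b84f09b9e4",
        },
    }
])
```

Since the payload is plain JSON over HTTP, the same request could be produced by any tool that can issue a POST, which is the point of the open-API design.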
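This approach lets us plug into the broader data ecosystem, which makes it easy both to stream real-time events into Mixpanel and to load data from sources of truth such as a data warehouse. Our APIs enable this hybrid reference architecture for bringing data into Mixpanel.
What this eliminates: Complex integrations and proprietary data formats.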
Comparison to other systems
Mixpanel makes a set of tradeoffs to achieve the above design goals. Here, we compare Mixpanel to other popular database systems to put these tradeoffs in context:
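Mixpanel is not an OLTP system like MySQL, Postgres, or DynamoDB. We do not support ACID transactions, indexes, point lookups, or arbitrary SQL. That said, Mixpanel can ingest change data capture events from an OLTP database to provide behavioral analytics.
Mixpanel is not a data warehouse like Snowflake or BigQuery. Data warehouses offer a relational model and SQL semantics. They are a great choice as a scalable system of record for all the business data you collect, and they enable a wide variety of queries. However, this generality comes with worse performance and ergonomics for ingesting real-time events and answering product analytics questions. Mixpanel can ingest data from a data warehouse via our APIs or reverse ETL tools.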
Mixpanel is similar to OLAP systems like ClickHouse, Druid, or Pinot. All are optimized for fast, self-serve analytics on immutable data and support both real-time and batch ingestion. The latter systems are open-source and support a SQL-like dialect, which means they can be plugged into open-source or off-the-shelf collection or visualization tools. However, they require developer time to maintain, are expensive because they do not decouple compute and storage, and can also be slower because they do not benefit from multi-tenancy. Finally, they are slow for the types of user joins that behavioral analytics requires. Mixpanel is fully managed, vertically integrated from the UX to the database, and built for user joins.
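Mixpanel's backend is also similar to modern serverless databases like Rockset. Both are cloud-native, real-time, and offer schema-on-read. Rockset is a lower-level primitive that requires provisioning compute clusters and writing SQL, which provides isolation and flexibility. It is a good choice when you want programmatic access to your data, for example to power a data application or to provide analytics to your application's end users. Mixpanel differs in that it is vertically integrated from the visualization layer to the database, is specifically optimized for queries like funnels, and is built for human-generated queries rather than direct programmatic access.
Ultimately, Mixpanel specializes in answering questions about user behavior (funnels, flows, retention) over large-scale, immutable event streams. By focusing on this use case and providing open APIs to plug into any other tool in the data ecosystem, we achieve a step-function improvement in performance, cost, and ease of use.
For more technical details about how Mixpanel is built, check out our engineering blog!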