YouTube: https://www.youtube.com/watch?v=4QjbfDDjGJ4
Text:
Let me start with my story. I'm an enterprise software guy. Before this I worked at SAP on BI software, a business intelligence product, and in 2012 I joined Splunk, where I spent about six years. As you probably know, Splunk made big data near real time, enabling IT and security operations people to make real-time decisions. But analytics overall is still very slow: a lot of analytics software is really just for dashboarding or reporting, and enabling an end user to actually make a decision in real time is hard. That's why, two years ago, we saw a great opportunity to build a real-time, latency-sensitive analytic application. And that's why I'm so lucky to collaborate with Charles at MemVerge; I think those two worlds can converge. Tony also mentioned a lot of use cases, like high-frequency trading, and Timeplus is working with financial customers in exactly this domain. These customers really care about latency, and when we talk about latency here, it's not nanoseconds but milliseconds, or at least below one second. So there are a lot of different use cases.
So, in general, a very quick introduction to our model. It's not only the technology; it's really a very different kind of database, what some in the industry call a streaming database. It's not a typical database, because it's not purely index driven; it's built on an append-log structure, which is very different from an index-heavy database. The second thing is the use cases. Charles already introduced one, the failover-recovery use case, but there are many use cases that are relevant to both CXL and a streaming database. The third item, which we're still working on, is a deep dive into failover-recovery integration with Gismo. We have a lot of early data, but that work is still in progress.
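To make the "append-log structure" concrete, here is a minimal, hypothetical Python sketch (not Timeplus's actual storage code): writes only ever go to the tail and get an immutable offset, and readers scan forward from an offset, in contrast to an index-heavy database that updates index pages in place on every write.

```python
class AppendLog:
    """Minimal append-only log: writes go to the tail, reads scan forward."""

    def __init__(self):
        self._events = []  # the log segment; offsets are list positions

    def append(self, event) -> int:
        """O(1) tail write; returns the event's immutable offset."""
        self._events.append(event)
        return len(self._events) - 1

    def scan(self, from_offset: int = 0):
        """Sequential read from an offset: the streaming-query access pattern."""
        return self._events[from_offset:]


log = AppendLog()
for event in ({"sym": "AAPL", "px": 189.1}, {"sym": "MSFT", "px": 402.7}):
    log.append(event)
# A continuous query would simply tail the log: scan(from_offset=last_seen + 1)
```

The point of the structure is that ingestion never pays index-maintenance cost, which is what keeps the write path fast enough for streaming workloads.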
But why real time? From our perspective, real-time analytics is now becoming mainstream. Take an aircraft example: for a single flight from San Francisco to London, the real-time data across many dimensions, sensor data, social network data, IT infrastructure data, is huge. Historically you could do a lot of reporting on that data, but we see customer requirements like security detection or improving the customer experience where acting on the data in real time is really actionable and valuable. That's the big reason we believe analytics should move to the real-time stage.
Looking at the use cases, finance and trading are the ones everybody understands, but it spans from manufacturing to transportation, financial services, observability, even customer experience. Compared to traditional reporting and data warehouse technology, this is a new revolution: it's about enabling the different business departments to make decisions. I think that's really critical.
Now, a deeper dive into the technology. This is not a totally new concept in the industry. Going back to around 2000 or even earlier, the focus was on the database side: a long-running or large query might take seconds, minutes, or even hours, so how do you enable a continuous query? A lot of people worked around that problem, and that stage was database-centric. The second stage is the streaming technologies you probably know: Flink, Kafka, Storm. They handle event ordering, late events, the whole event-flow model; it's a totally different programming model, and a lot of new technology grew up around it. Flink in particular has become the de facto standard for stream processing. But the story underneath is that real-time stream data has been increasing dramatically: for many customers, the data velocity, the EPS, events per second, is completely different from ten years ago, while much of the tooling hasn't fundamentally changed. So the bigger motivation is: what about a new model? Pure streaming compute and the database each have advantages, for accuracy and for throughput, so we decided it was a great time to converge them. That's pretty much Timeplus today.
We see it as a convergence, and we keep it simple. There's a streaming, real-time storage tier, which is just an append log, and a second tier of historical data that supports backfill and larger-throughput analytics, and we can connect the two. Underneath, we call this unified streaming storage. Why build it this way? Because we're trying to make analytics truly real time, with low latency. We've done a lot of testing: at around 10 million EPS, events per second, we can hold roughly 10 milliseconds of what we call end-to-end latency. That's really amazing performance. Anyway, that's the technology at a very high level.
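A hypothetical sketch of the two-tier "unified streaming storage" idea (all names here are illustrative, not Timeplus internals): recent events live in the hot streaming tier, older events age out into the historical tier, and a query reads both so backfill and live data appear unified.

```python
class UnifiedStreamingStorage:
    """Illustrative two-tier store: hot streaming tier + historical tier."""

    def __init__(self, hot_window_s: float = 60.0):
        self.hot_window_s = hot_window_s
        self.streaming_tier = []   # recent (ts, event) pairs, append-log style
        self.historical_tier = []  # older events, for backfill / big scans

    def append(self, ts: float, event):
        self.streaming_tier.append((ts, event))

    def age_out(self, now: float):
        """Move events older than the hot window into the historical tier."""
        cutoff = now - self.hot_window_s
        self.historical_tier += [(t, e) for t, e in self.streaming_tier if t < cutoff]
        self.streaming_tier = [(t, e) for t, e in self.streaming_tier if t >= cutoff]

    def query(self, since: float):
        """Unified read: backfill from history, then the live streaming tier."""
        hist = [e for t, e in self.historical_tier if t >= since]
        hot = [e for t, e in self.streaming_tier if t >= since]
        return hist + hot
```

The design choice the talk describes is exactly this split: the hot tier keeps the ingest-to-result path short, while the historical tier absorbs the large scans that would otherwise hurt latency.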
But when you deep dive into stream processing, say handling a million events per second, detecting a pattern or an insight at that scale and velocity is definitely a challenge. Here are some typical challenges. The first is data volume and velocity. Velocity in particular is crucial for streaming and real-time processing: we've used a lot of traditional databases, and even Splunk and other technologies, to analyze high-velocity trading data, and we saw a lot of latency; the data keeps coming fast while the analytics is slow, so the analytics falls behind the data. The second challenge is real-time processing itself. All of these use cases are about low latency, and that forces you to redesign a lot of the infrastructure underneath. Pipelines, index-driven storage, and ETL typically make sense for analytics, but for real time they take too long: if you spend a long time processing, or building heavy indexes, the data is obsolete, too old, by the time you're done. The third is fault tolerance, and Charles mentioned one such use case. Stream processing is a continuous query, catching each new event as it arrives.
When something fails, if it takes a long time to reprocess the data or to recover, that's a real problem: for fault tolerance we cannot afford a long recovery time. That's the third challenge. The fourth item is that stream processing cares a lot about event order. When events A, B, C happen, we care about the sequence. That's not typical of big-data use cases, but it matters a lot for the event model: if the sequence is in the wrong order, you get the wrong result. Number five is state management, and state is really critical here. Since we run long-running, continuous queries over the data, we keep state at a certain granularity, and that state is huge in stream processing; in many use cases we see 100 gigabytes or even terabytes of state. That creates a lot of scenarios for leveraging CXL, which I'll introduce a bit later. Scalability is another one: as data volume increases, how do you scale out, or scale in, dynamically? That's very typical of data-intensive applications.
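To illustrate challenge five, here is a hypothetical Python sketch of how a continuous aggregation accumulates per-key state and snapshots it at a fixed granularity; real engines like Flink or Timeplus do this with far more machinery, so treat every name here as made up.

```python
from collections import defaultdict


class RunningAggregation:
    """Continuous per-key count/sum; state grows with the key space."""

    def __init__(self, checkpoint_every: int = 3):
        self.state = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]
        self.checkpoints = []                       # immutable snapshots
        self.seen = 0
        self.checkpoint_every = checkpoint_every

    def on_event(self, key: str, value: float):
        cnt_sum = self.state[key]
        cnt_sum[0] += 1
        cnt_sum[1] += value
        self.seen += 1
        # Snapshot at a fixed granularity so recovery replays little data.
        if self.seen % self.checkpoint_every == 0:
            self.checkpoints.append({k: tuple(v) for k, v in self.state.items()})


agg = RunningAggregation()
for key, value in [("eu", 1.0), ("us", 2.0), ("eu", 3.0), ("us", 4.0)]:
    agg.on_event(key, value)
```

The sketch shows why state is the pain point: the `state` dict never shrinks on its own, and every snapshot copies it, which is exactly what becomes expensive at 100 GB or a terabyte.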
That's why we're so passionate about CXL and shared, bigger memory: it's really relevant, even critical, for streaming analytics. The first reason is volume and velocity: if our mentality shifts to memory-centric applications, that really changes a lot of the design and the architecture. For real-time processing, a very critical case is data shuffling: in MapReduce, or even Spark, the worst case is data shuffling; we see it pretty often, and it's really time-consuming. The third is failover, along with event ordering and state, which are very critical topics for how Timeplus, or any stream processor, manages state. The last one is job rescaling, which also comes down to memory requirements.
Let's talk about failover recovery. Traditionally, there's no doubt we have huge query state. When a query runs, we keep its state, and when node one crashes, we switch to node two, and node two's copy of the query state is a replica that was shipped over to it. That means a lot of disk I/O and network I/O, which is very traditional for data-intensive applications. When we integrate with Gismo, and in the future with CXL, the state lives in memory, so we remove the I/O bottleneck: we don't want to see that disk I/O or network I/O at all. That's really, really helpful here. Our initial tests were about 20 times faster, and in theory CXL should be even faster; on latency, compared to local disk, it could be 100 times better or more. So we see a lot of room to do even better.
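Here's a toy Python contrast of the two recovery paths (purely illustrative; Gismo's actual API is not shown): on the traditional path the standby node deserializes a replica that was shipped over disk or network, while on the shared-memory path it simply attaches to the same state.

```python
import pickle

query_state = {"window": [1, 2, 3], "aggregates": {"count": 3}}

# Traditional failover: state is serialized, shipped, and deserialized.
# The disk/network I/O on this path is what dominates recovery time.
replica_bytes = pickle.dumps(query_state)     # write to disk / send over network
recovered_copy = pickle.loads(replica_bytes)  # standby rebuilds its own copy

# Shared-memory failover (the CXL/Gismo idea): the standby attaches to the
# same state by reference; no serialization, no copy on the critical path.
attached_state = query_state                  # zero-copy "attach"

assert recovered_copy == query_state          # same content, but it was copied
assert attached_state is query_state          # same object, no I/O at all
```

The claimed 20x speedup in the talk comes from eliminating the serialize-ship-deserialize path entirely, which this sketch only gestures at.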
Data shuffling is really critical here. We really hate it, but it's normal: with partitioning, no single node holds all the data, so some node computes a partial result and then has to send that result to another compute job. That happens pretty often. So you can imagine: with zero copy over CXL memory, the data-shuffling challenge is really solved, and for most data infrastructure applications that's a huge performance boost.
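A hypothetical sketch of the zero-copy exchange, using Python's standard shared-memory module as a stand-in for CXL shared memory: the producer places a partition in a named segment, and the consumer attaches by name instead of receiving bytes over a socket.

```python
from multiprocessing import shared_memory

payload = b"partition-7:partial-sums"  # a shuffled partition's partial result

# Producer side: place the partition in a named shared-memory segment.
seg = shared_memory.SharedMemory(create=True, size=len(payload))
seg.buf[:len(payload)] = payload

# Consumer side: attach to the same segment by name.
# No socket send, no intermediate copy of the partition on the wire.
peer = shared_memory.SharedMemory(name=seg.name)
received = bytes(peer.buf[:len(payload)])

peer.close()
seg.close()
seg.unlink()  # release the segment once both sides are done
```

In a real shuffle the two sides would be separate worker processes (or, in the CXL vision, separate hosts sharing one memory pool); here both run in one process just to keep the sketch self-contained.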
Computing state: streaming state is very unique to stream processing. Unlike a traditional database, we have to keep the state step by step. Even today with Flink, managing state is a really big headache for a lot of customers: it's huge, and it's usually the performance bottleneck. Flink today uses RocksDB, and RocksDB takes a lot of tuning to make that work. Timeplus has its own state management, and later I can introduce a little of how we manage this kind of state. State spans the whole hierarchy, cache, memory, disk, and that's complicated for a data analytics platform. So that's something we're really excited about with CXL memory.
Job rescaling is also a very typical case. With node one and node two, we want to scale out to node three, which means moving a lot of state and metadata, repartitioning onto the new node. That used to be very expensive. When we get CXL memory and can leverage that kind of shared-memory model, we don't have to go through the traditional, very heavy and slow rescaling process. That's what we call scenario four.
For one deep dive, here's an example. Say we're monitoring a delivery company: we have a lot of orders, and for a certain user we want the top five most expensive items delivered within the last three days. We express this as SQL, a very typical Timeplus real-time streaming SQL query. When we optimize it into a DAG, we do aggregation and some joins, and at each step, the red boxes on the slide, we keep a lot of state, and for each state we write a checkpoint, which is immutable. A checkpoint coordinator handles all the state management. The challenge is when the state gets huge, 100 gigabytes or even a terabyte: writing the state while still keeping high performance and low latency is one of the very biggest challenges in stream processing. Here we're working closely with MemVerge on the Gismo project Charles mentioned: we have a design where the checkpoint talks to Gismo directly, as one of the checkpoint implementations. A terabyte of state couldn't fit in memory before, so now we see a big opportunity to leverage memory to make that happen.
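The query in that example can be sketched in plain Python (field names and values are made up for illustration; the real version is Timeplus streaming SQL): keep a three-day window of order events for the user and emit the top five by price.

```python
import heapq

THREE_DAYS = 3 * 24 * 3600  # window length in seconds


def top5_expensive(orders, user, now):
    """Top-5 priciest items delivered to `user` within the last 3 days."""
    window = [o for o in orders
              if o["user"] == user and now - o["delivered_at"] <= THREE_DAYS]
    return heapq.nlargest(5, window, key=lambda o: o["price"])


orders = [
    {"user": "alice", "item": "laptop", "price": 1200.0, "delivered_at": 90_000},
    {"user": "alice", "item": "mug",    "price": 9.5,    "delivered_at": 95_000},
    {"user": "alice", "item": "phone",  "price": 800.0,  "delivered_at": 1_000},
    {"user": "bob",   "item": "tv",     "price": 600.0,  "delivered_at": 96_000},
]
result = top5_expensive(orders, "alice", now=300_000)  # phone has aged out
```

In the engine this runs as a continuous query, re-evaluated incrementally per event rather than rescanning a list, and the per-user window contents are exactly the operator state that the checkpoints in the DAG have to persist.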
Now a little bit of code-level detail. We currently have three different checkpoint implementations. The first leverages a local file system, for example EBS or SSD, with throughput of something like 250 megabytes per second. The second writes checkpoints to S3, as a lot of databases and stream processors do; its throughput is super low, something like 10 megabytes per second, but it's unlimited and scalable, so S3 definitely has advantages. The last one, which we just added, uses Gismo as the checkpoint store. I wanted to show more data, but our existing numbers are conservative, around 20 times faster, and that's with a CXL emulator for Gismo; technically the Gismo checkpoint could reach something like 250 times the throughput of local SSD storage, and latency more than 100, even 160 times better, compared to local storage. S3 is simply too slow: for real time we don't even want to compare against S3. So Gismo is very, very powerful for these stream-processing scenarios.
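The three implementations can be thought of as one interface with three pluggable backends. This Python sketch is hypothetical (the real Timeplus and Gismo APIs differ): a local-file backend and an in-memory stand-in for the shared-memory backend are shown, with S3 noted only as a stub.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class CheckpointStore(ABC):
    """Common interface behind the three checkpoint implementations."""

    @abstractmethod
    def write(self, name: str, state: bytes): ...

    @abstractmethod
    def read(self, name: str) -> bytes: ...


class LocalFSCheckpoint(CheckpointStore):
    """EBS/SSD-style backend: durable, but bounded by disk I/O (~250 MB/s per the talk)."""

    def __init__(self, root: str):
        self.root = root

    def write(self, name, state):
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(state)

    def read(self, name):
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()


class SharedMemoryCheckpoint(CheckpointStore):
    """Stand-in for the Gismo backend: state stays in (CXL) memory."""

    def __init__(self):
        self._segments = {}

    def write(self, name, state):
        self._segments[name] = state  # no disk or network on this path

    def read(self, name):
        return self._segments[name]


# An S3 backend would implement the same interface: cheap and unlimited,
# but far lower throughput, so it stays off the real-time path.

store = LocalFSCheckpoint(tempfile.mkdtemp())
store.write("ckpt-001", b"query-state")
```

Keeping the interface uniform is what lets the engine pick a backend per deployment: local disk for a single box, S3 for cheap retention, and the shared-memory path where recovery latency matters most.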
The last slide is really about our passion. Today we really care about latency; we call these use cases latency sensitive. More and more analytic use cases really care about latency, and that's our real focus: financial trading, transportation, those kinds of things, where it's really obvious. Looking to the future, analytics should move to real time, and analytics should become actionable. We're really passionate about making that happen, and underneath it we're committed to building the platform. That's why we really want to work with you, with MemVerge, with CXL, with the hardware companies: the infrastructure is really critical to enable this journey. That's it. Any questions? No questions? Thank you.