Data compression is one of those settings most engineers configure once and never revisit, but the decisions baked into it have real consequences for data quality, storage, and analysis. We sat down with Jim Gavigan of Industrial Insights and Kevin Jones of dataPARC to unpack how traditional historian compression actually works, where it falls short, and what dataPARC did differently with the dataPARC Historian.
Industrial Data Compression Video:
Industrial Data Compression Transcript:
So one of the things that I wanted to specifically talk about with Kevin, because there’s a couple of different approaches to this whole idea of data compression, and I come from a legacy background. As a matter of fact, I worked with the PI System. I worked for OSIsoft for a couple of years. And so I kind of come from that background where we work with lossy compression.
And I would say technically and from an academic standpoint, it’s probably the right thing to do and maybe the best approach. The problem is nobody does it right.
And so Kevin and I attack this subject from a couple of different angles.
At my company, I’ve done something called a data fidelity study for probably fifteen companies, anywhere from, say, three thousand tags all the way up to over a million tags.
And you’d think it just starts out looking at compression, but I look holistically at what they’re doing with their data. And through that, I’ve been able to see some things that maybe most people don’t, or don’t really think about.
And what really got me into doing that was we were trying to build solutions for our customers. And what we were finding was that the underlying data quality wasn’t good enough.
And so I’ve got a couple of samples that I’m going to show. So I’m going to go ahead and share my screen.
But I wanna talk through a couple of things and talk about, okay, yeah, this is probably academically and technically maybe the best thing to do, but the problem is nobody does it right. So then what should we do?
Okay, so this is one I just told Kevin about. I found this. I actually have a whole playlist on data quality on our YouTube channel, and this is one I couldn’t actually find at first. I thought it was in something else.
But we actually did a conversion from another historian, not dataPARC, not PI, into PI, and we had to go get the data. And I know it’s a little bit difficult to see, but I’m going to go ahead and zoom in.
But what you see is this customer had collected the exact same data point every half a second for hours. And this was actually, we were trying to pull this out and put it in PI. We had to go take a completely different tack. So hence why you would think, okay, data compression is good. We need that.
Well, the problem is, you know, that sounds great, but in practice, here’s what actually happens. Okay. And this is why, you know, the dataPARC folks believe something a little bit different than what I’ve kind of been, say, brought up in the industry to believe. Right?
So here’s a situation where a customer for a couple hour period is looking at a flow.
We don’t know what kind of flow it is yet. I’ll tell you in a minute. But we have five hundred and eighty-six raw data points coming into, in this case, the PI system. Two were archived over that time period.
So what I always ask, I always teach this when I’m doing PI training, is I ask my students, hey, if we were doing a totalizer for this flow for these couple of hours, would we give the customer the right answer?
And almost everybody says no. And I tell them, well, it depends. There are some ways in PI analytics you could get them the right answer most of the time.
Yeah. Yeah.
However, not all the time. And so you might say, well, Jim, this is just thirty-two to twenty-nine and a half. Can’t be that big of a deal. What if I tell you this is the steam flow on a boiler, which it was, and the units of measure are thousand pounds an hour?
Well, that changes, you know, the whole conversation. Right? So context really matters.
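To put the totalizer question in concrete terms, here is a minimal sketch of a time-weighted totalizer using the trapezoidal rule. The flow values and timestamps below are hypothetical, not from Jim’s study; the point is how badly two archived points can misstate a total when the signal moves between them.

```python
# A time-weighted flow totalizer using the trapezoidal rule.
# All numbers here are hypothetical, chosen only to illustrate the point.

def totalize(samples):
    """samples: list of (time_hours, flow_kpph); returns thousand pounds."""
    total = 0.0
    for (t0, f0), (t1, f1) in zip(samples, samples[1:]):
        total += (f0 + f1) / 2.0 * (t1 - t0)  # average flow times duration
    return total

# Dense raw data: the flow dips hard between the two points that survived.
raw = [(0.0, 32.0), (0.5, 12.0), (1.0, 5.0), (1.5, 18.0), (2.0, 29.5)]
archived = [raw[0], raw[-1]]  # only two points made it into the archive

print(totalize(raw))       # ~32.9 thousand pounds over the two hours
print(totalize(archived))  # 61.5 -- nearly double, from just two points
```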
So then you think, just like our other customer did, you just throw all that out, and here’s what that looks like. You don’t compress anything, or you compress very little of it. Now you have twenty-one thousand plus snapshots in a few-hour time period, and you’ve thrown away a thousand of those. So basically, what ends up happening is you actually, in this case, have instrument noise that you almost can’t see. I zoomed in and I said, hey, guys, I don’t believe this data. I don’t think that flow can swing that hard.
And so fortunately, they had a redundant transmitter, and they were able to see that this transmitter was actually faulty, and they were able to actually go repair it.
And so what I’ve seen, Kevin, is data quality has been all over the map. And I’ll go ahead and stop sharing my screen. It’s kind of been all over the map. And even with lossy compression, you know, I could argue that, hey, it’s the best thing to do. But as you can see, it’s kind of a mess.
So I’d love for you to talk about the decisions you guys made. I guarantee you heard these complaints early on, and you took a different tack. And how do you handle some of these scenarios?
Yeah, it’s interesting. I think it’s helpful to kind of go back to why we ended up where we are today.
It kind of starts with the fact that originally dataPARC did not have a time series historian. We were primarily going to be a visualization and data analytics company.
We’re gonna sit on top of the existing historians.
And then after a couple of years realized that there was certainly still an opportunity and a need for our customers to have a high performance, dedicated process time series database. And so we built one. I remember those discussions, really led by Ron Baldus, our founder. He had a lot of experience with time series historians; he’d already built several of them back in the early 80s. So he’d kind of been through all of these considerations. And there were a few factors that we talked about.
Two were really the practical factors. The first was cost.
Back in the early 80s, you had to compress because the cost of disk space was so high that it just wasn’t feasible otherwise. And we kind of dismissed that one, because even in the early two thousands disk space had gotten pretty cheap, and we said, okay, well, this isn’t really the reason, so let’s keep moving on.
The next part was performance. If you do store too much data, then when you try to retrieve it, that has some issues. And so we said, this is real, we’ve gotta address this one. And then we got into, I think, some of the other parts you mentioned, around accuracy. You’ve gotta store enough of that data that you can be accurate.
Statistically, like, your steam flow is a great example: when we do the daily totalization and go back thirty days, do we have the statistically correct data to do that?
The other piece when we talk about accuracy is visual accuracy. As you showed in that chart, maybe with those two points, we could get statistically accurate information.
But, visually, if we were trying to troubleshoot some issue with that boiler and we just had those two points, we wouldn’t know where the peaks and the valleys were. Right?
Yep.
So that one, we said, hey, we’ve gotta solve this problem.
And then the other piece was we gotta make it easy. And this was really, I think, Ron’s biggest issue: having built, you know, at least two other systems that had some pretty sophisticated compression algorithms, he said we just struggled to make it live because of the administration burden. At this point, we’d been in enough sites that had, you know, whether it’s PI, IP.21, Honeywell PHD, it didn’t matter.
Enough sites where they just weren’t taking advantage of the sophisticated compression algorithms they had. Yep. So we said, okay. Let’s just make it really easy here.
And we kind of ended up with, okay, let’s make some really easy-to-use, I would say, exception reporting compression, basically to get rid of your instrument noise scenario there, where, hey, if something’s changed at the fourth decimal place, we don’t wanna store it. When data truly changes, we wanna store it.
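As a rough sketch of the exception-reporting idea Kevin describes, storing a value only when it differs from the last stored value by more than a deadband, something like the following. The thresholds and the percent-of-change option are illustrative assumptions, not dataPARC’s actual defaults or implementation.

```python
# A minimal sketch of exception-reporting compression: store a sample only
# when it moves past a deadband relative to the last value we stored.
# Thresholds here are illustrative, not dataPARC's actual defaults.

def exception_filter(samples, abs_deadband=0.0, pct_deadband=0.0):
    """samples: iterable of (timestamp, value); yields samples worth storing."""
    last = None
    for ts, value in samples:
        if last is None:
            yield ts, value            # always store the first point
            last = value
            continue
        delta = abs(value - last)
        limit = max(abs_deadband, abs(last) * pct_deadband)
        if delta > limit:              # a "true change" -- store it
            yield ts, value
            last = value
        # otherwise: fourth-decimal-place jitter, drop it

# Noise around 100.0 gets dropped; a real move is kept.
data = [(0, 100.0001), (1, 100.0002), (2, 100.0001), (3, 101.3)]
print(list(exception_filter(data, abs_deadband=0.01)))
# -> [(0, 100.0001), (3, 101.3)]
```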
And then having this kind of multi-archive architecture where we can store high-resolution data.
Basically, Ron said, hey, all the data, all the time, as long as it’s a true change. Right. So we get that statistically accurate piece. We can get the high granularity we need, but then have some other archives that can be our performance piece.
So really, the visual accuracy piece. We’ve got, you know, we call it our plot style archive, that just stores a visual representation of what the data would look like. So if you ask for a year of data, you only have so many pixels on your monitor. Yep.
So this way we can quickly retrieve the data. But, you know, your steam flow example: show all the peaks, show all the valleys, and that way we can visually represent it. So that’s kinda the way we ended up from an architecture standpoint: make it easy administratively and then really lean on the performance side.
Yeah, it’s interesting that you talked about performance, like that second one that I showed that was super noisy.
Yeah.
They were actually collecting that data every half a second. And I remember making the comment to them, like, let me guess, PI Vision runs really slow. And if you’re backfilling data in PI Analytics, it runs super, super slow.
Like, it’s gonna take a couple of days to backfill that. And they’re like, how’d you guess? I’m like, I don’t know, just a hunch.
Yeah, yeah, yeah.
And that was something they were complaining about, still complaining about today. Like I did that Data Fidelity study four or five years ago, and they haven’t fixed it. I told them what to do.
They still haven’t done anything, and they still have the same complaints today that they did then.
So back to your administrative, let’s take the administrative burden off because even if you tell somebody what the right thing to do is, that doesn’t guarantee that they’re actually going to do it.
We’ve had a couple of customers actually go and clean up some things, which was great. Yeah, I think you... Oh, go ahead.
It’s really the same thing. It works well when it’s done well and administered well.
But the reality is that I think those are the exceptions to the rule.
Yeah. So you had said you were looking for, like, a true change. That was the terminology you used. How did you guys determine that?
Because that’s a conversation I have with customers a lot. It’s like, okay, if I have, say, an RTD and it’s plus or minus a degree, versus a thermocouple and it’s plus or minus three degrees, and nobody really goes through and tells you what kind of an instrument it is, what its real accuracy is, how did you guys determine what meaningful or true change meant?
Because that’s one I’ve never really known how to do, except go get the instrument sheets out.
Yeah. And primarily, that’s where we would just look at, you know, some sort of a factor of change. So whether it’s, you know, a thermocouple that’s one degree accuracy or three degrees, how much is that value changing as it’s coming through the OPC interface?
And just kind of monitoring that percent of change and then looking at the data for how much is being stored. So it’s less about a sensor-by-sensor approach and more about looking at a percent of change as it’s coming through the system.
Gotcha. Yeah, I mean, it sounds like a good approach, because I’ve always kind of struggled with that. You know, like, what would be the appropriate way if I was redesigning this again?
You know, what would I tell someone to do? Right?
Yeah.
And so I actually wanna ask you a little bit about the architecture, because this kind of fascinated me when I first started doing training for you guys. Because truth be told, guys, I used to fight dataPARC. I was a PI guy, right? And people loved PARCview, and they didn’t like ProcessBook so much a lot of times. And I’m like, why are these guys kicking our butt so bad?
And then we brought out CoreSight, which became PI Vision, and people still wanted PARCview. And I’m like, what have these guys done that is just so cool? Right? And one of the things I thought was really interesting when I was watching that training and hearing kind of about how you guys engineered the solution was, let’s say you have a year of data, because sometimes when I look at PARCview trends, say in a paper mill or some other plant, a lot of times you’ll see those grids of trends, right?
And there’s multiple pens on each one. Say there’s a grid of nine of them, there’s four pens on each, and you want to try to do that for a year.
Like, how do you actually handle that? Because I think that’s a really unique approach. Because if it’s PI, you know, and I have a customer like the one I demonstrated, that’s gonna be all the points that were stored in the archive, and now coming in as snapshots every half a second for as long as that screen’s up and running.
And so it’s gonna just put a huge drag on the system, right? I’ve got, what did I say, nine times four, thirty-six tags for a year, with thirty-six potential new snapshots coming in every half a second. Right? That’s gonna bog down. So how did you guys architect that, and how did you kinda come up with what looks right?
Yeah. I mean, this was kinda one of those decisions from the very beginning. I mentioned this plot style archive when we designed our own historian. And then what we did is we extended it to, okay, we don’t care who the underlying historian is, because we’ve always tried to be very agnostic when it comes to what historian a site has.
Let’s take this PI site and be able to leverage it. So, you know, it’s really just an aggregation archive. I mean, there have been some plot style algorithms, I think PI has a plot style algorithm, but it still has to read the data off the disk. It just says, hey, for certain chunks of time, if I show you the first value, the last value, the highest value, and the lowest value, I will give you an accurate representation of that trend. And if I do that for every pixel on the screen, I can give you that accurate representation. So what we said was, let’s just take a chunk of time, it defaults to five minutes, and store that first value, last value, lowest value, highest value.
Now in your example, five minutes is three hundred seconds, so six hundred values would have been stored at the half-second resolution. Right. But we store four. So that’s over a hundred times less data being stored for that five-minute period of time.
Right. And visually, it looks exactly the same.
So that’s just simple math. The disk read and transmission of the data is orders of magnitude less. Now you do that across thirty-six tags, across a year of data, and you’ve really compressed the amount of data that you’re having to handle. And then we also put in some very smart switching: as you’re working with a trend, if you start looking at five-day, ten-day, long-term trend windows, let’s go to that plot style archive.
Then you say, hey, now we’re truly trying to get really high fidelity, very granular with our troubleshooting, let’s look at one day. Let me switch back to the higher resolution archive, whether that’s our historian or PI, to give you that data.
So it’s that combination of smart switching and then just, you know, having a plot style archive that’s storing that data. It really helps that performance problem.
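For readers who want to see the shape of this, here is a minimal sketch of a plot-style aggregate (first, last, lowest, highest per time bucket) plus the span-based switching Kevin describes. The bucket size and switch threshold are illustrative assumptions, not dataPARC’s implementation.

```python
# A minimal sketch of a plot-style archive: for each time bucket (five
# minutes by default here), keep only first, last, lowest, and highest.
# Bucket size and switch threshold are illustrative, not actual defaults.

def plot_archive(samples, bucket_seconds=300):
    """samples: sorted (epoch_seconds, value) pairs.
    Returns {bucket_start: (first, last, low, high)}."""
    buckets = {}
    for ts, v in samples:
        key = ts - ts % bucket_seconds
        if key not in buckets:
            buckets[key] = [v, v, v, v]   # first, last, low, high
        else:
            b = buckets[key]
            b[1] = v                      # most recent value in bucket
            b[2] = min(b[2], v)
            b[3] = max(b[3], v)
    return {k: tuple(b) for k, b in buckets.items()}

def pick_archive(span_seconds, switch_at_seconds=5 * 86400):
    """Span-based smart switching: long trend windows read the plot
    archive, short troubleshooting windows read the high-res archive."""
    return "plot" if span_seconds > switch_at_seconds else "high_resolution"

# At half-second resolution, 600 raw values per five-minute bucket
# collapse to 4 -- the hundred-plus-fold reduction described above.
```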
So, you know, you’ve talked about trying to reduce the administrative burden, but then you’ve kind of added some on the back end, right? Somebody’s got to set the aggregation server up. So if a customer comes and says, well, I’d just rather go through and try to set up compression, I don’t want to deal with all that having to set the aggregation server up, what do you say to them? Because if somebody asked me that question, I don’t know if I could answer it the right way. Somebody’s like, hey, we’re really looking at dataPARC, but we’re not sure we’re comfortable with this particular idea.
How would I talk to them about that? Yeah.
Well, we really tried to make the easy button there. You know, really, it’s a single click to enable the aggregate archive, and then you just tell it how far you want to backfill. So it’s, you know, less than thirty minutes of setup to make it happen.
Oh, cool.
And then, through the different administration consoles, if there are certain tags you wanted to exclude, that’s fine. But the fact is, the amount of data that we’re storing is really small.
Right. So the extra capacity that a site needs to add is not that big of a deal from an infrastructure standpoint. And then it’s just a single click to enable it. And because, you know, it’s a PI site, we’ve got access to the tags through the PI AF SDK, and we just start that collection, that aggregation.
Gotcha. Yeah. And actually, that was a genuine question. Like, just so you guys know, we didn’t really rehearse this.
We talked about what we’re going to talk about, but I was like, well, I didn’t make it that far in the training. A couple of my people did. I didn’t make it that far. So shame on me, but I was like, I didn’t really see how that was done.
So I know it’s done, but I didn’t know how.
So, one of the things I’m curious about too, because I’ll answer this, but I’m gonna ask you first.
Because you guys actually have an answer to this, but then I’m gonna ask you, like, what do you actually tell them?
So you get the request from the customer, and I guarantee you’ve heard this a number of times.
Hey, we’re doing this big machine learning slash AI project, and we want all the data, all the raw data for these two hundred sensors for the last three years.
What do you tell them?
Well I’m gonna tell you what I’m doing here.
Yeah. There’s, you know, I think that’s one of the differences as a vendor versus the consultant. I mean, we do try to tell them that you don’t really need that, you know, this is kind of what machine learning people have been telling us for decades.
You know, I can think back to some examples that we’ve had with our customers where that was the pitch, and we helped enable it. Because at the end of the day, our job is to help our customers do what they wanna do. Right. And so we’ve gotta be able to handle that use case.
But we’ve seen those cases where, without any context, that happens. And what comes out of it is nothing more than the obvious. You know, the example of, hey, we’re a paper mill, we’re trying to reduce the amount of sheet breaks that we have on the machine.
You know, it’s what every paper mill wants to do. Right? Yep. Run it through this big ML study, throw all the data at it, and it tells you that, hey, what you need to do is quit using that broke tank to feed your machine.
You know? And for those of you who don’t know the industry, the reason you’ve got material in your broke tank is because you had a sheet break. Right. It’s total correlation, not causation, but statistically, yeah, that would make sense. So first, that example says: to get real value out of this, you have to be smart and have some context to get real insights.
But then we’ve also had to, I think, add capabilities and enable the SDK layer and kinda the data egress side to get better at this, because at the end of the day, our customers want that.
For sure.
But I’m curious, what’s your answer to that?
So depending on who I’m talking to, like if it’s a customer I know and I get that request, I’m like, okay, tell me a little bit more.
And so they tell me and I’m like, they don’t know what they’re doing.
Yeah.
That’s the first thing I tell them. If they want all the raw data, they don’t know what they’re doing.
You know what they’re going to do? They’re going to downsample it.
They don’t need all the raw data because our experience with doing any kind of multivariate or machine learning, a lot of times you’re like, I was running good six months ago. I’m not running good today.
And the granularity depends on the process, whether you’re in a paper mill and it’s a paper machine and it’s grade-based, or it’s a continuous process.
I don’t think I’ve ever used anything less than five-minute interpolated values. I just needed to know about where the process was.
I’m looking for long range. Right.
So sometimes we’ll aggregate the batch data into, say, event frames, right, where we get, here’s what happened in the batch in one line. Right?
And that way, I can label, like, good batches versus bad batches, or good grade runs versus bad grade runs, and do comparisons. You’re always going to aggregate. You absolutely never need to go all the way down to the raw data.
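As a sketch of what that downsampling can look like in practice, here is one way to produce five-minute interpolated values with pandas. The tag data and timestamps are made up; a real pull would come from the historian’s own interface rather than a hand-built series.

```python
# A minimal sketch of "five-minute interpolated values" using pandas.
# The raw samples below are placeholders for one hypothetical tag.
import pandas as pd

# Irregular raw history, as it might come back from a historian query.
raw = pd.Series(
    [31.8, 32.1, 29.9, 30.4],
    index=pd.to_datetime(
        ["2024-01-01 00:00:07", "2024-01-01 00:03:41",
         "2024-01-01 00:11:02", "2024-01-01 00:14:55"]),
)

# Resample onto a regular 5-minute grid, filling empty bins by
# time-weighted interpolation between the raw points.
five_min = (
    raw.resample("5min").mean()   # average raw samples into 5-minute bins
       .interpolate("time")       # fill bins with no raw samples
)
print(five_min)
```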
And frankly, my theory on why they ask that question is because they’ve heard something about the compression in data historians, and they don’t think they’re getting everything they need. So they’re like, just give me all of it.
That way, I can decide what stays and goes. Yeah. And they don’t really trust, I guess, the system is kinda my theory. I don’t know if I’m right or not.
But I’m just like, five to fifteen minutes, even an hour, depending on how long the data sets are, or, you know, one value per batch or grade run or reel or whatever it is we’re looking for, you can find out a lot. It can point you in the right direction. Then if you need to do something more granular, you can pull more granular data, but I never start there. And that’s where a lot of these companies are starting. So I was just curious, because you collect all the raw data, essentially everything that’s a true change by what you guys were defining. So I’m like, you probably have it more so than one of my customers who’s just a pure PI customer, because I’m finding there’s not as much of it as there used to be.
What I’m actually finding with my customers, there’s less of that first example, like the boiler steam flow where people are filtering it out, right? That’s kind of legacy kind of issues where we don’t have enough disk space, we don’t have enough network bandwidth. So we’re gonna try to keep storage down. I still have some legacy customers. I see that from time to time.
I typically see the opposite. We walked into a refinery, one of the guys on my team, Nick, he walks into this refinery, they have ninety thousand tags, they have compression off on all of them. They have interfaces that are overloaded. They have all this stuff coming at them and their system’s not performing.
They don’t have something like that aggregation server, so they have high speed, high fidelity data coming in and not compressing at all, and they’re wondering why systems don’t perform well.
And so that’s what we run into probably more often than the first example. So you know? But then you get into these machine learning and AI, you know, ideas.
It’s like, okay, well, what if that particular refinery said, okay, the machine learning folks said, I want a five minute average for these two hundred points for the last three years.
And let’s say that data’s coming in every second and it’s not compressed.
How long is it going to take just to crunch every single average to put one value out, whether it’s in Excel or you’re doing an AF SDK script? It’s still going to take you a ton of time to crunch through that.
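Some rough arithmetic (ours, not from the interview) shows the scale of that crunch under those assumptions:

```python
# Back-of-envelope for the scenario above: 200 tags at one value per
# second for three years, crunched down to five-minute averages.
tags = 200
seconds = 3 * 365 * 24 * 3600        # ~94.6 million seconds in three years
raw_values = tags * seconds          # ~18.9 billion raw values to read
averages = raw_values // 300         # ~63 million five-minute averages out
print(f"{raw_values:,} raw values -> {averages:,} averages")
```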
For sure. You know, it’s a balance. I think when you think of the AI and ML vendors, a lot of them came from certainly a big data background, but I think primarily a transactional data background, where it is a little easier to take these huge amounts of information and get some insights out of it. So you think about the original data lake promise for the time series historian: hey, just put all your data up in those data lakes, all of it.
Then we’ll find some insights, kind of a, I’d say, build-it-and-they-will-come approach, which I think is that same mindset: hey, just give us all the data, and the ML algorithm will find what changed and why you’re not running in your optimal conditions. But I do think we’re getting back to a little bit more of a practical situation where things are a little more use case driven. And at least the customers we talk to say, hey, there is certainly value in these things, but know your use case. Understand more about what data you need for this use case, and just be a little bit more intentional in the approach. I think we’ve seen a lot more success when there is some intention and some thought put into this.
I think it gets back to, as much as we want there to be an easy button, just throw us all your data and we’re gonna give you the answer,
It’s just not that easy.
It’s not, it’s not. So it’s interesting that you mentioned the transactional data background. One of the things we had talked about kind of planning this was the hyperscalers. By that, I mean like the AWSs, the Googles, the Microsoft Azures. Right? They’ve dabbled in this whole industrial Internet of Things, which I hate that term, but they’ve dabbled in this space that we make our living in.
And it feels like they’ve poisoned the market a little bit, especially coming from my side, where we have a compression algorithm that people haven’t tuned well. So they literally come and talk to a lot of the customers we have, and they’re like, Oh no, you just store all the data, all the raw data. You put it in our system, and we’ve got oodles of bandwidth, oodles of capability. That’s not even a consideration anymore. And even Ben Still, who’s my VP, he’s been with me for over eight years. We’ve had this internal debate of is compression even necessary anymore with the advent of systems like that.
And I always come back to, I don’t care how much capability you have, you’re still gonna put stress on the system, and you’re still gonna have junk in your system if you don’t handle it. So how are you guys kinda handling that? Because I feel like, and I’m gonna ask you another question that I just thought about.
But how do you guys kind of handle that? If if you have a customer that is getting that mindset, what do you tell them? Because I’m gonna tell you what I tell them here.
Yeah. Well, I think there’s kind of a couple angles we’d go at that. I mean, I think one is, we do see, going forward, there’s gonna be more data, not less.
You know, there are more sensors, you know, there are gonna be use cases where, having higher resolution data is going to be helpful.
And so, this performance piece: I mean, on the product side, we continue to optimize how our archive handles high resolution data. We’ve had two or three major, I’d say, architecture design changes to just the binary time series archive to make sure it’s performant.
And then the ingress/egress piece, making sure that stays highly performant as data explodes. So we do see that as an advantage for dataPARC and as a value for our customers: for those use cases where they need that, we can do as well as anybody out there.
Right.
But then I think the second part to your question is the hyperscalers. And I think the thing that I would tell our customers is just make sure you know what you’re getting into in those situations, because they obviously have the infrastructure. Right. I mean, they’ve got the huge data centers; storing your terabytes of data is not going to be a problem.
Right.
Moving the terabytes of data is probably not gonna be a problem either. But the question is, at its core, at the kernel, are these systems really designed for time series data at scale?
Right. I’ve seen the architecture of some of these systems, and they’re really architected well for, hey, I’ve got one sensor, and I want to see the last year of data for that one sensor. Okay.
That’s great. But back to your example of the nine-box trend display with four tags on each. That’s thirty-six sensors, and that’s a very small unit operation. A lot of times you want to look at correlations across unit operations.
So the question is, I want to look at five hundred tags and a year of that data. Well, fundamentally, there’s a different architecture to bring back a year of data for five hundred tags versus one tag. So one, do hyperscalers understand process industry use cases?
And I think sometimes they also underestimate, when they say give us all your data, what that really means. I’ve seen a few case studies that have been presented for the time series world.
You know, they talk about, hey, here’s an example where we had five hundred tags that we brought in and really helped this customer make a lot of money and find these decisions. And, shoot, we’d say that’s a fish tank, you know, five hundred tags.
So if you really wanna look at these refineries or these large systems, this is a massive amount of data. Right. And the performance side? And then don’t forget the cost side, because how the hyperscalers make their money is those meters running.
And that’s what I usually talk to my customers about is remember these guys are incentivized to tell you that. They want all that data stored because that’s how they make money on bandwidth and on data storage. So of course they want you to store all the data.
Why would they tell you anything different? I’d tell you the same thing if that was how I made money. Doesn’t mean it’s the right answer. And I remember another story, from when I was doing some consulting. A company was trying to pick their enterprise historian, and they reached out to me and said, hey, can you help us kind of figure out what we want to do? And I said, sure, I’d be glad to help you.
So what they were looking at was PI, Canary, IP.21, a couple things from Rockwell, and AWS. So a really wide field. Right? And so kind of the way I talked to them about it is, you know, and this is no offense to anybody who’s out there listening.
But this is just me talking brass tacks. I just tell it like it is. If you like it, great. If you don’t, I’m sorry.
That’s just how I feel. And you asked me my opinion. Right? And they asked me my opinion.
I said, so IP.21 hasn’t been updated in years. I went back, and I said, the latest YouTube video that they have is five years old. The latest collateral I could find is three years old, and it doesn’t look like this thing has been updated since the system that Ben and I just converted over to PI. And that system was eight years old. So I wouldn’t even look at that unless you’re, like, an integrated chemical company and you’re looking at a lot of the other software that Aspen sells. Then it would make sense, but you’re not. That’s not who you are.
PI? Oh, wait, I’m gonna come back to that one. Rockwell, it’s always gonna be somebody else’s thing. They’ve brand-labeled PI. They’re brand-labeling some Microsoft thing.
It’s never gonna be their technology.
It’s not really what they’re known for. I worked for them for a number of years. It’s a me-too product. They wanna sell you a me-too product, and I don’t think that’s what you’re looking for.
So then it came down to PI, Canary, AWS. Well, unfortunately, for them, OSIsoft actually dropped out of that race. They were actually in the lead.
So then it was down to Canary and AWS. So IT kind of wanted AWS.
Sure, sure.
And the people in operations wanted Canary.
And so I told them, I said, the way I would do this is you can put a Canary historian at every site, put one in at corporate, then pipe that to your AWS instance, whatever data you want, and pipe it over there.
And I’ll never forget, the IT person was, like, calling me a Canary homer.
And I’m like, no. Here’s the deal. I know about what that ballpark price is, and I know about what the ballpark price is of AWS.
I’m just telling you, you’re going to make your money back in less than a year. And even if you outgrow it and you need something completely different, you’ve gotten your money out of it.
And you have these plants that are out in the remote places, and I can’t tell you what industry it is. I don’t want to give away who it is, but all their plants are in a remote place and they have a lot of people who don’t even have a high school education working there. So it’s gotta be like simple.
They’re not always gonna have connectivity to the internet. Go build something in AWS for a plant that has crappy internet.
Yeah, they’re not gonna get any value out of that. I’m like, I think it can play in the system, you know, if you wanna go do some specific use cases, but the data historian of some kind needs to stay.
And so that leads me to my last question. Is the industrial data historian dead?
Well, obviously a biased answer, Jim.
Coming from the vendor side, I think absolutely not. I think whether it’s, you know, Canary or dataPARC or PI, the vendors you’ve mentioned, I mean, that’s what we’re talking to the customers about. Absolutely. I think it’s the right mix, for a couple of reasons.
Right. One, the reliability of requiring Internet access. I think, you know, the latency and performance of everything that has to go up, as well as travel down, from an analysis and monitoring standpoint. But, you know, just as simple as: what else do you get when you go put your data into AWS?
You know, they tend to be a little bit more of a fundamental, very basic toolkit that requires, you know, some knowledge of building your displays.
You know, where you build everything. With the things you’ve mentioned, there’s always a built-in visualization and monitoring component that gives you kinda that ease of use, that self-service exploratory analysis piece that you absolutely need to have.
Yep. You put it in AWS. Well, now what?
You have to build your own visualization. You have to build everything. Like, one of my old PI customers tried to build their own PI system because they didn’t like the pricing model.
Yeah.
And I think they worked on it for about three or four years. I’ll never forget, I talked to the guy I used to deal with, and he said, you have to build your interfaces. You have to build your visualization. You have to build the data archive.
You have to build everything. Like, nothing is really there. Like, all the components are there, but you have to put them together. Like, you have to have a team of developers to do this.
Go ahead. Sorry.
I didn’t do that.
Yeah. So it’s really just another piece in the DIY option that companies have always had. There’s always been a subsection that says, hey, let’s do it ourselves. And that’s really what AWS will get you if you want it to be kind of your primary industrial historian.
I think it’s a much better fit when it becomes a complementary piece to your industrial historian stack. Yes. And really, it’s there for the use cases that are enterprise-wide, that your industrial historian doesn’t serve really well. Say you’ve got a chemical vendor who you wanna give access to a subset of data, and you also have another raw material supplier that you wanna give access to.
You put that data into AWS, and you build a report they can access, and you’ve siloed them outside of your network. I mean, there’s some really good reasons for that.
Or putting data in AWS and then putting on a Databricks add-on and doing some AI and ML. And you control what goes up, you control how long it stays, and you have complete control of that data. Makes total sense.
Yep. But as your kinda fundamental piece, again, you know, all the data, all the time, the reliability, and as you mentioned, the OT-to-IT interface into the cloud still takes a little bit of configuration and some work, where for the industrial historian vendors, this is what we’ve done for decades. Right. And so those robust store and forward technologies
I think are just a better purpose fit.
Yeah, I agree. And it’s interesting because, you know, we mentioned the remote sites. An industrial data historian gives you the ability. Most of our customers now are putting PI or whatever historian they have in Azure or in AWS. They’re putting it up there.
But, you know, every historian vendor has thought about, okay. What happens if that data historian goes away? Like, I’m gonna hold that data at the collector level, interface level, and I’m gonna store and forward it when that comes back. So it gives you some reliability. And if you have a plant that doesn’t have a good connection, you can always put that on-site.
And then as that connection’s kind of intermittent, you can always transmit. Because we’ve heard oil and gas companies, they’ll have a rig out in the middle of nowhere. Yeah. And the only way they can talk to it is through satellite and it’s slow, right?
Though that technology has gotten a lot better. So, you know, the reality is, I think the historian gives you a lot more flexibility. I think I’m like you. I think it’s a piece in the stack.
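The store-and-forward pattern mentioned here reduces to a small buffering loop. A minimal sketch, with is_connected and send_to_historian as hypothetical stand-ins for a real collector interface:

```python
# A minimal sketch of collector-level store-and-forward: buffer samples
# locally while the historian link is down, drain oldest-first when it
# returns. is_connected and send_to_historian are hypothetical stand-ins.
from collections import deque

buffer = deque()

def record(sample, is_connected, send_to_historian):
    """Queue every incoming sample, then flush while the link is up."""
    buffer.append(sample)
    while is_connected() and buffer:
        send_to_historian(buffer.popleft())
```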
You know, I think a lot of people have tried to build things in their historian technology that should be built elsewhere and vice versa.
Yep.
I’m a big fan of: you don’t take out a crescent wrench when you really need a hammer, right? You can eventually get that nail in with the crescent wrench, but it’s a lot of work, right? And I think I still see a lot of people doing those kinds of things. And you just want to bang your head against the wall and say, no, don’t do that. But I think absolutely it belongs in the whole tech stack, right? I think the historian is going to be here for years and years to come, long after you and I have hung up the cleats and said, we’re done with this.
It just feels like the market’s going to change, especially with the advent of AI, which that’s a whole different debate and talk. But I feel like it’s going to be around for a long time because of its purpose built nature.
I think so. And I think it’s gonna become even more of an important piece, just because the amount of data, again, is getting bigger, not smaller. The importance of this data is gonna get bigger, not smaller.
We think about AI, what does AI need? Well, it needs really good data to be successful, in the right context. Exactly. So I think being really focused and purpose built for these solutions is gonna be important.
And I really like your message about having the right tool, which is something I always tell our customers: you need the right tool for the problem you have. Right. And there’s not a one-size-fits-all situation here. There’s lots of use cases out there that don’t need AI and ML.
So so don’t use AI and ML for those use cases.
Let’s not get in the habit of, you know, solutions looking for a problem.
Identify your problem and have the right solution for it.
Yeah.
Well, Kevin, I appreciate it. Didn’t know where this conversation was going to go. We knew we wanted to talk a lot about compression, which is a very specific term and it’s near and dear to our hearts and talk about the merits of a couple of different approaches. But the big thing we wanted to say was you need to have an approach, right? There’s a couple of different ones, a couple of different tacks you can take. Each have their merits and potential drawbacks.
And we also want to make sure we talk about like, well, why is that even important? How does it play into a larger picture? So I appreciate you talking to me. I appreciate you guys agreeing to this and my harebrained idea.
So yeah, I think, just kind of in my closing thoughts, you know, we talked about different approaches to compression. But as I mentioned in our pre-call, I see those as kind of two sides of the same coin. You know, there’s different thoughts around it. But at the end of the day, what we’re talking about are very sound approaches to industrial historian storage, which I think has a very big purpose today and purpose in the future.
Yep. You know, whereas I think some of these hyperscalers still haven’t even really thought about these questions. And so it gets back to that role of the industrial historian. And the second thing I would encourage anybody who’s listening today to think about is just make sure you’ve got a plan and a strategy going forward.
And continue to evaluate the tech stack you have, because it’s an evolving world, and you don’t necessarily need to just leap to, you know, the stack and tech in the hyperscaler world. Make sure your foundation is always solid, and make sure you’ve got a really good organization around it, so that these systems live and you continue to give them the love they need and deserve.
Absolutely.
Well, again, Kevin, appreciate you talking. I know I’m kinda acting like this is mine, but this is y’all’s channel.
I’m glad that you guys humored my idea.
So Yeah.
No, I really appreciate you reaching out and having the idea. It was fun to join, and let’s do it again.
Absolutely.