Shinji Kim, Founder and CEO of Select Star, joins us to discuss what it means to govern and manage data in modern organizations, and how companies can set themselves up for success when getting started.
This episode was recorded live at the 2022 DataConnect Conference. Special thank you to WIA Advisory Board Member Dave Cherry for assistance in preparation.
About Shinji Kim
Shinji is the Founder & CEO of Select Star, an automated data discovery platform that helps you understand your data. Previously, she was the CEO of Concord Systems (concord.io), a NYC-based data infrastructure startup acquired by Akamai Technologies in 2016. She led building Akamai’s new IoT data platform for real-time messaging, log processing, and edge computing. Shinji studied Software Engineering at University of Waterloo and General Management at Stanford GSB. She advises early stage startups on product strategy, customer development, and company building.
Relevant Links
- Designing Data-Intensive Applications by Martin Kleppmann
- Concord Systems, Inc. (Shinji's previous company that was acquired by Akamai)
Follow Shinji
Follow Lauren
- Website
[00:00:00] Lauren Burke:
Welcome to Women in Analytics After Hours, the podcast where we hang out and learn with the WIA community. Each episode, we sit down with women in the data and analytics space to talk about what they do, how they got there and where they found analytics along the way.
I'm your host, Lauren Burke, and I'd like to thank you for joining us today.
Today, I'm here with Shinji Kim, who is the Founder and CEO of Select Star. Thank you so much for joining us here today.
[00:00:36] Shinji Kim:
Thanks for having me.
[00:00:37] Lauren Burke:
Absolutely. So just to start off, could you give us a little bit of background on yourself and on Select Star?
[00:00:44] Shinji Kim:
Sure. So my name is Shinji Kim. Currently Founder/CEO of Select Star. Select Star is an automated data discovery platform where we help everyone to be able to understand their company data more easily. We do this by understanding and connecting into data warehouses, BI tools and different applications, and bring out the insights from all the metadata and the user activities that are happening within the databases and BI tools.
My background's in computer science. I have worked as a software engineer, data engineer, data scientist, and also a product manager. So I've been in the side of producing data, transforming data as well as consuming and making business decisions on data in the past. And I would say, Select Star kind of like is a product that I wish I had for the last 15 years of working in tech and data overall.
But Select Star is also my second data company. In 2014, I started a company called Concord Systems, which was a distributed stream processing framework, focused on helping enterprises to process large volumes of data in real time. Now, it's acquired by Akamai Technologies and currently it's a platform service called IoT Edge Connect, which is an edge computing product for IoT manufacturers to collect and process data coming from different sensors and IoT devices.
[00:02:28] Lauren Burke:
That's awesome. I like the fact that you mentioned that you started this because it's a product that you wanted. A lot of innovation comes from exactly that. Where you see something missing and you decide to be the one to fix that problem for yourself and for others. So that's awesome.
So you mentioned your background, you have a very broad and expansive background that touches a lot of different fields and a lot of different roles. What are some of the key lessons that you've learned along the way that have helped you as you are creating and growing Select Star?
[00:03:01] Shinji Kim:
I mean, there are so many. I guess the reason why I'm here today, and a lot of things that I've realized over time, is that when there is an issue and there are blockers, I think my natural instinct is to really try to solve it. Try to solve it in a better way, not just to kind of like get by the problem, but so that the core problem is fixed for myself and everyone else.
And I think that's maybe where, I guess my founder nature is really coming from. But throughout my career, I've had, you know, just different experiences and also great mentorship and people that I have worked with that I've learned a lot from. And I think if there is anything that stands out as a lesson, it's not being afraid of trying something new.
Whether that's a new technology or a new method of doing something or switching jobs, or like quitting jobs to start a company, for example. And I think that really kind of comes down to the mindset of what's the worst thing that would happen if I'm not super happy at where I am. If I know that there are different things that can be better from what I can see, like why not just do it and just go ahead and do it myself if I can. So, yeah.
[00:04:39] Lauren Burke:
I love that. I think that's such a good mindset to have, especially if you're founding companies, especially if you're trying to start something new. So I love the sort of way you're approaching that.
So in your talk today, you talked about data governance and you said that data discovery supercharges data governance. So how do you connect the data to the critical decisions that they enable, so that we could really understand what data is useful and used, and what is not?
[00:05:13] Shinji Kim:
Yeah. So data governance is something that we've noticed as one of the major use cases of data discovery, primarily because trying to put standards and controls on data without understanding the nature of data, without understanding the structure and the relationships that data models have. It's a very difficult challenge for any company to take on. And data discovery, by its nature, is really focused on just finding and understanding of data.
And in aggregate - and well, before I go into aggregate. At the atomic level, that means whenever I look at a table or a field, it's about understanding where did the data come from? How was this generated? Who's using this inside the company and what are the dashboards or reports that are built on top of it? And examples of how other data analysts have utilized this data.
But in aggregate, looking at them all together, you can get insights like which are the data sets that have the most dependencies inside a company? Which are the data sets that are being used the most? And by whom? And vice versa. In this specific team, which are the data sets and dashboards that they use the most?
And I think because it's so concrete. And it's like, its own data. Because you're doing an analysis on top of metadata. Those points can really give you a lot more ideas of what is the right way and what are the right measures that you want to put in place in your organization for data governance.
And that's why we say data discovery really supercharges data governance.
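The aggregate questions described in this answer (which data sets have the most dependencies, which are used the most, and by which team) can be illustrated with a small sketch. This is not Select Star's implementation; the lineage edges, the query log, the table names, and the helper functions below are all hypothetical, chosen only to show what analysis on top of metadata might look like.

```python
from collections import Counter, defaultdict

# Hypothetical lineage metadata: (upstream, downstream) pairs.
LINEAGE = [
    ("raw.orders", "analytics.orders_clean"),
    ("analytics.orders_clean", "analytics.revenue_daily"),
    ("analytics.orders_clean", "dash.sales_overview"),
    ("analytics.revenue_daily", "dash.sales_overview"),
    ("raw.users", "analytics.users_clean"),
]

# Hypothetical usage metadata: one (team, dataset) pair per query.
QUERY_LOG = [
    ("finance", "analytics.revenue_daily"),
    ("finance", "analytics.revenue_daily"),
    ("finance", "analytics.orders_clean"),
    ("marketing", "analytics.users_clean"),
    ("marketing", "analytics.orders_clean"),
]


def downstream_counts(edges):
    """Count every direct and indirect dependent of each upstream data set."""
    children = defaultdict(set)
    for upstream, downstream in edges:
        children[upstream].add(downstream)

    counts = {}
    for node in list(children):
        seen = set()
        stack = [node]
        # Depth-first walk over the lineage graph starting at this node.
        while stack:
            current = stack.pop()
            for child in children.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        counts[node] = len(seen)
    return counts


def usage_by_dataset(log):
    """Count how many queries touched each data set."""
    return Counter(dataset for _, dataset in log)


def usage_by_team(log):
    """Count, per team, how many queries touched each data set."""
    per_team = defaultdict(Counter)
    for team, dataset in log:
        per_team[team][dataset] += 1
    return dict(per_team)


if __name__ == "__main__":
    deps = downstream_counts(LINEAGE)
    for name, count in sorted(deps.items(), key=lambda item: -item[1]):
        print(f"{name}: {count} downstream dependents")
    print(usage_by_dataset(QUERY_LOG).most_common(2))
    print(usage_by_team(QUERY_LOG)["finance"].most_common(1))
```

In this toy data, a table with many downstream dependents and heavy usage would be a natural first candidate for ownership, documentation, and certification, which is one way to read the point that discovery informs where governance measures should go.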
[00:07:22] Lauren Burke:
That's interesting because normally I feel like if you say data discovery, people are saying, let's just go look and see what we can find, but you're saying no, you have to have some sort of direction. And I think that's an important distinction to make.
[00:07:34] Shinji Kim:
Yeah, yes and no. I think a tool in itself can always be just a tool where you can go and discover anything, but a tool can also be used in order to achieve a certain purpose. And I see that as use cases of the tool. Hence, data governance is a big use case and big purpose that data discovery serves. And when it's used for that purpose, a lot of that will come down to analyzing the access, usage, and the relationships within data. And all three of them can be the same features of data discovery, but can be interpreted and used differently as well.
For example, just in terms of like data documentation, and I would say more around self-service analytics. Without having to do a lot of manual documentation, having auto-generated documentation based on the relationships that the dataset has can really be like, you know, different use cases, but it's coming from the same underlying technology and features of data discovery. I would say.
[00:08:50] Lauren Burke:
That's interesting. That's a very good explanation that I think people listening to this will appreciate as they are probably trying to get better at data governance in their organization, or even bring up why you might need that and how to get started with it.
Kind of leading into this, most first attempts at a data governance project fail. What can individuals do to make sure that theirs does not?
[00:09:15] Shinji Kim:
Yeah. I think this is always one of the first and major questions that we always get. And my take on it is that most of the data governance projects fail because it goes out of scope. It gets too many people, too many different things involved and it becomes a project where you are boiling the ocean. You're trying to boil the ocean, and that just is not gonna work.
[00:09:43] Lauren Burke:
You're trying to govern too much. Once you want to govern, you want to govern everything and you can't.
[00:09:48] Shinji Kim:
That's right. That's right. Yeah. That's a really good way to put it.
So I would always say that having a specific scope and goals you are trying to achieve is very important. And when you are defining goals, having a clearer picture of what it should look like at the end state and having that be consistent across the stakeholders that are part of the project. I think that's really important to have upfront rather than as we are working through the governance project. Because I think it's almost like a setting of a north star.
In the startup growth terms, people use north star as like the KPI that they are hitting. But I think that you can also think about north star as more of like where you are trying to get to and getting everyone aligned behind that goal and behind that initiative. Whether that is specifically for complying with regulations, or to get to a point where you are treating all the private data or personal data in a certain way. It could be getting to a point where everyone is using the certified metrics and the right KPIs. I think it should be a concrete goal that everyone is involved in and has the same understanding of. Then the rest of the process, whether that is defining which framework, which tools, what the process should be like can really follow afterwards.
[00:11:36] Lauren Burke:
The way you are describing the process makes a lot of sense. The way you are laying it out seems like one big thing is you need to take steps and you need to consider where you are, where your organization is, where your people are with that.
[00:11:48] Shinji Kim:
Yeah, I'm not saying that like because you want to do data democratization, privacy is not important. Obviously there are basic aspects of the platform you might want to get through, but it's really having the focus on that scope in order to get the program as a more phased approach is what I would say.
[00:12:12] Lauren Burke:
That makes a lot of sense. So going off what you mentioned with privacy, governance as a term, it sort of has a lot of more negative connotations. It's associated with control, bureaucracy, and maybe even the implication that there's going to be some friction due to the limitations that it could imply for your team or for your organization. Do you think there's a better term that we could use that would better embody what data governance stands for?
[00:12:41] Shinji Kim:
Yeah, I agree. And I actually was refraining from using the term data governance in the past. Until very recently. Mainly because that's kind of how everyone recognizes their initiative as. Every time I talk about, you know, anything related to governance.
I think a better term, that has more positive connotation, would be data management. Within data governance, there needs to be a data management office. Or more of a data management team, data management group. Acting more like a Switzerland of the company, as well as a team that builds and executes the center of excellence that all other data teams can examine and follow.
And I would say that the term of data management might be broader than what most people consider as quote unquote data governance. I think this is also because the term data governance also comes from a lot of people thinking or having a connotation of data governance mapped with access control.
And whenever there is a data governance project, there needs to be a very fine-grained access control list that you're getting into. Whether that's permission-based, attribute-based, or role-based. I think that might be also why it has a bad rep, but I think really the core of data governance is putting data under control that the organization can use and refer to. And that is something really supported by data management practices overall.
[00:14:25] Lauren Burke:
That makes a lot of sense. So for companies or organizations, what level of maturity within their data organization, data structure, do you think they should have before they sort of focus on adding data governance?
[00:14:42] Shinji Kim:
That's a great question. I would say most of the time when you have just, you know, one database, one or two databases with production data, you don't necessarily need a data transformation layer and a reporting layer. You know, there's really not much to govern. Like, your one data team, you know, knows everything and it's very clear where things are.
I would say for most companies, it's when they start hitting hundreds of tables or more. And most of the time we see this in companies that are fast growing, beyond about 150-200 employees with data teams. As a company grows, there's more emphasis and reliance on making business decisions on data. And this is one place that you would want to start putting good measures and modeling in place, so that you can really scale more consistently.
So I think that's like where. And even, I guess, rather than governance, it's really, you know, like discovery perspective. If you have even at least, let's say, a three to five person data team, how do you possibly know what everyone is creating? You know, if each analyst is, let's say, serving a different division of the company. And I would say that's really where data discovery and a way to standardize some of the documentation, metrics, and definitions around your own data becomes very important.
[00:16:28] Lauren Burke:
It makes sense that the size of your organization really helps to define what sort of structure you would need as you're building other aspects of your data and maybe analytics teams out.
So in your talk earlier, you mentioned centralized, decentralized and hybrid models. What are the key attributes of an organization that help you decide which one is best?
[00:16:53] Shinji Kim:
Yeah. I explained centralized versus decentralized based on kind of what it used to be in the past when anything that has to do with technology was maintained by the IT team versus today where technology is also used and maintained by many different business teams.
So the centralized model of data management has been more of a traditional way. But getting that into today's organization is much harder because each business unit actually operates more independently around the tools they use, as well as the data models and dashboards that they create.
So, hence, on the other side, a lot of today's organizations, I would say, follow more of a decentralized management model. And it really comes down to, they might be just building all different models on top of the same data warehouse, or sometimes they may have their own full data stack infrastructure that they manage as well.
So in terms of like, where does that sit in the org and how does that interact with the current engineering or IT team, that is a bit different per company. Where it is more important around data governance perspective and also more of like a usage of data, like where the data is going to. Whether it's going around business analysis or decision-making or model building that is where the underlying data models live. And how that is managed is where the data governance and data management teams really start.
So, yeah. So in that regard, a centralized data management team is more like the data team that acts as a place where everyone needs to kind of like create a ticket or create a request to the data team to get anything done. Whereas in a decentralized model, different businesses would work on their own. But that may cause different numbers or different characteristics of the same metrics that companies are looking to see as well.
[00:19:20] Lauren Burke:
So it sounds like some of these models were more popular in the past and some of these are coming with this new age data structure that we're starting to see at companies. So for you personally, how has your perspective of data governance changed over your career?
[00:19:34] Shinji Kim:
I guess when I was an individual contributor or manager of teams, I was on the other end of either producing data or consuming data. So as a user, sometimes it would take weeks and months to just get access to the data that I need. And sometimes I may have access to all the data of the company, but may not know which are the right data to use or whether this dataset is correct.
There is no way for me to find out whether the data set has been already filtered by some attribute, for example. Which kind of gets me to mostly trying to dig through where the raw data is, so that I can run my own transformation. But that's, that's a lot of work and a lot of resources that were wasted, to be honest with you.
And I think that all shaped my thinking around how data governance or more of data management really needs to be combined with centralized, but open access to be able to at least find the information and also get as much context about that data asset that you already have access to.
[00:21:00] Lauren Burke:
Yeah. I, uh, I think I kinda sprung that one on you. But, um, I think it's interesting, you have a lot of really good insights about data governance, or data management as I should say, it's clear that your background and your experience has shaped that and sort of helped you figure out the best way and the best approach for not only you to approach helping others grow theirs, but also to give advice and to help build it up in your own organizations.
So before we wrap up, I would like to ask, is there a resource that has helped you throughout your career or at some point in your career that you think might help others today that are listening?
[00:21:38] Shinji Kim:
Yeah, actually there is one book about data that I really like written by Martin Kleppmann. It's called the Data-Intensive Applications. I believe that's what it's called, I might be wrong. It's from O'Reilly.
Martin Kleppmann actually was one of our advisors of the company from my last one, Concord Systems. He spent a lot of time at LinkedIn, and also working with Kafka and Samza, which is a stream processor. This book talks a lot about the underlying data structures of different data frameworks. Which helps you to decide almost like for this type of application, what is the right type of database or type of message queue, or processing will be the best.
It does talk about a lot of technical topics, but it also is more on the design of these data frameworks, which I think is very interesting and useful. It's something that I would definitely recommend, especially if you are more on the data analyst side, but also want to be aware of, want to dabble and understand the data platform side of things.
[00:22:56] Lauren Burke:
That's a great answer. We can link that so people can find it and read it and hopefully learn some of the things that you have and go off and apply them.
[00:23:04] Shinji Kim:
Yeah. And what I'm now trying to do is try to write more about the best practices and also the great resources that we see in the industry. So you can also check out our blog and social media on Twitter and LinkedIn, because I will be sharing a lot of that as I am also, you know, learning from our customers and partners we work with day-to-day.
[00:23:28] Lauren Burke:
Awesome. So for our listeners, before we close out. Where can people keep up with you? Do you have any social media or a website?
[00:23:36] Shinji Kim:
So we are at selectstar.com and our Twitter handle is @selectstarhq. And my Twitter handle is just my full name, @shinjikim. We are also on LinkedIn.
So if you would like to follow our journey and hear about the newest and the best practices of data management, you can check us out there.
[00:23:58] Lauren Burke:
Awesome. Thank you so much for joining us today, Shinji. I hope everyone learned a lot about data governance and learned some things that they can take back and apply to their work and to their organization.
[00:24:10] Shinji Kim:
It's been really awesome to be involved in this conference today. So thank you so much for having me here.
[00:24:15] Lauren Burke:
Absolutely. Thank you.