Charting the CHAOSS: Insights into Open Source Health and Viability

In this episode of the Open at Intel podcast, Katherine Druckman talks about open source project health with Dawn Foster, Director of Data Science for the CHAOSS project. They discuss the importance of sustaining open source projects and the strategic benefits for companies that contribute employee time to projects they value. Foster highlights the need to address red flags in project health, such as responsiveness, release frequency, and contributor sustainability. She describes how CHAOSS (Community Health Analytics in Open Source Software) helps open source projects monitor their health by analyzing data and metrics with tools such as Augur and GrimoireLab, which gather data and provide visualizations for analyzing project health. The conversation concludes with a discussion about the rise of single-maintainer projects and the importance of assessing the viability of projects critical to an organization's products or infrastructure.

 

"I tend to look at responsiveness, so time to first response and change request closure ratio. So, are you keeping up with your pull requests, or are you getting gigantic backlogs? I also look at whether or not you're cutting releases often enough to include all of the security fixes that you should have, including the ones that come in from your dependencies, for example." - Dawn Foster

 

Katherine Druckman: I got to chat with Dawn Foster all about her work with the CHAOSS project, open source project health, and project contributor sustainability. This is an important conversation for anyone involved in open source software.

Hey, Dawn, thank you so much for joining me. I really appreciate it. This is your first time on the podcast, I believe, and you are somebody who's kind of been on my hit list for a while, someone I'd really like to talk to, so I'm glad we finally connected.
 

Dawn Foster: Yeah, it is my first time on the podcast, and I'm super excited to do it. Thanks for having me.

Katherine Druckman: Awesome. If you wouldn't mind, just give us a little background on ... Well, first, maybe what you're currently doing, but kind of the high level story of your open source adventures.

Dawn Foster: Yes, absolutely. I'm currently Director of Data Science for the CHAOSS project, so I am basically a freelancer. We've got some funding from the Alfred P. Sloan Foundation to fund some data science work for another two-and-a-half years, so that's what I'm focused on. I have been in this industry for a very long time. I came into it with kind of a traditional background, a computer science degree, and I managed to luck into a Unix system administrator job as my first job out of university. That was in 1995 I started doing that.

And then fast-forward a few years, I ended up at Intel, and they needed someone to look at open source developer tools and Linux developer tools. And since I had done Unix, that was a lot more than a lot of people at Intel back in 2000 had done, because it was very much a Windows space. So they had me looking strategically at a bunch of open source projects, and I got more and more interested in community, and I eventually turned that into full-time community manager-type roles, where I did that at a couple of startups. I freelanced for a while, and then I ended up back at Intel to be the community manager for the MeeGo project.

Katherine Druckman: Oh, yes, MeeGo. I have the T-shirt.

Dawn Foster: And then later ... Yeah. Oh, it was so much fun. It was a very fun project. And then after that, I worked a little bit on Tizen. So I was back at Intel from I guess 2010 through 2012 or so, and then I worked at Puppet. And then for my midlife crisis, I left the industry briefly and went back to school to get a Ph.D. I studied the Linux kernel, and I looked at how people collaborate within the Linux kernel, doing things like network analysis and statistical analysis on collaboration.

Katherine Druckman: Very cool.

Dawn Foster: And then I ended up at Pivotal, we got acquired by VMware, and I was the Director of Open Source Community Strategy there for a while. And then I managed to land my dream job, which is the Director of Data Science for CHAOSS.

Katherine Druckman: Cool. This is a subject that is near and dear to my heart, honestly, project health, right? So tell us what the CHAOSS project is, really.
 

The CHAOSS Project

Dawn Foster: Yeah, CHAOSS, it's CHAOSS with two Ss, stands for Community Health Analytics in Open Source Software. So we are focused on helping other open source projects be healthier, and the way we see that happening is by looking at the data, looking at the metrics, figuring out where your project is struggling and where it's doing well, and then focusing on the areas that you can improve. We have lots of different metrics, and we also have two different software packages that you can use to gather metrics. So we have one package called Augur, which is focused more on repository-type data, and then we have GrimoireLab, which has repository data plus a bunch of other stuff. Technically, they're very different projects.

Katherine Druckman: I tend to look at project health these days through kind of a security lens, because to me, a well-cared-for project is more likely to be a secure project, right?

Dawn Foster: Yep.

Katherine Druckman: Let's call it safer to use and safer to pull into your own project as a dependency and all of those things. But what are some big red flags that you can identify pretty easily, especially the ones that are kind of easy to address but haven't been? What's the low-hanging fruit in project health?

Identifying and Addressing Project Health Red Flags

Dawn Foster: Yeah, that's a really good question, and I tend to look at it from two different angles. So there's the projects that you're contributing to where you're interested in the inner workings of those projects and making that project more healthy from the inside out. For that, I tend to look at responsiveness, so time to first response and change request closure ratio. Are you keeping up with your pull requests, or are you getting gigantic backlogs? I also look at whether or not you're cutting releases often enough to include all of the security fixes that you should have, including the ones that come in from your dependencies, for example.
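As a rough illustration of the two responsiveness measures mentioned here, the Python sketch below computes a median time to first response and a simple change request closure ratio. The sample records, field names, and function names are hypothetical and not taken from Augur or GrimoireLab, and the ratio is a simplified reading (requests closed divided by requests opened in the same period) rather than the official CHAOSS definition.

```python
from datetime import datetime
from statistics import median

# Hypothetical change requests (pull/merge requests); all values are made up.
change_requests = [
    {"opened": datetime(2024, 1, 1, 9), "first_response": datetime(2024, 1, 1, 15),
     "closed": datetime(2024, 1, 3, 9)},
    {"opened": datetime(2024, 1, 2, 9), "first_response": datetime(2024, 1, 4, 9),
     "closed": None},
    {"opened": datetime(2024, 1, 5, 9), "first_response": None, "closed": None},
    {"opened": datetime(2024, 1, 6, 9), "first_response": datetime(2024, 1, 6, 11),
     "closed": datetime(2024, 1, 7, 9)},
]


def median_hours_to_first_response(requests):
    """Median hours between opening a request and its first response.

    Requests with no response yet are skipped, which flatters the number;
    a real report would also count how many are still waiting.
    """
    waits = [
        (r["first_response"] - r["opened"]).total_seconds() / 3600
        for r in requests
        if r["first_response"] is not None
    ]
    return median(waits) if waits else None


def closure_ratio(requests):
    """Share of the requests opened in the period that have been closed."""
    if not requests:
        return None
    closed = sum(1 for r in requests if r["closed"] is not None)
    return closed / len(requests)


print("Median hours to first response:", median_hours_to_first_response(change_requests))
print("Change request closure ratio:", closure_ratio(change_requests))
```

A ratio that stays well below 1.0 period after period is one sign of the growing backlog described above.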

And then I also look at it from a contributor sustainability standpoint. Are the vast majority of your contributions coming from one or two people? What would happen if they won the lottery and disappeared forever? Would the project be able to survive? Another one that I think is also important is, they call it the elephant factor, it's basically organizational participation. So if all of the contributions are coming from employees at a particular organization, what happens if that organization decides that this project isn't important anymore?
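One way to put numbers on both concerns is to ask how few people, or how few organizations, account for half of the contributions. The Python sketch below does that for a made-up commit list. The names, the 50% threshold, and the helper function are illustrative assumptions, not the CHAOSS tooling itself.

```python
from collections import Counter


def smallest_group_covering(counts, threshold=0.5):
    """Smallest number of entries whose combined count reaches the threshold share."""
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk from the largest contributor down until the threshold is met.
    for size, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= threshold * total:
            return size
    return len(counts)


# Hypothetical commit log as (author, employer) pairs; all names are made up.
commits = (
    [("alice", "ExampleCorp")] * 4
    + [("bob", "ExampleCorp")] * 3
    + [("carol", "OtherOrg")] * 2
    + [("dave", "Independent")] * 1
)

by_person = Counter(author for author, _ in commits)
by_org = Counter(org for _, org in commits)

# How many people would have to leave before half the contributions are gone?
contributor_factor = smallest_group_covering(by_person)
# How many organizations account for half the contributions (the elephant factor)?
elephant_factor = smallest_group_covering(by_org)

print("People covering 50% of commits:", contributor_factor)
print("Organizations covering 50% of commits:", elephant_factor)
```

In this sample, two people cover half of the commits but a single organization does, which is the single-organization dominance the elephant factor is meant to surface.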

Katherine Druckman: That's a really big one, actually, and something that I've had quite a few conversations about lately, which is first the notion of training up your successor, and I think that's kind of related there, but when you have that single company or single organization becoming very dominant in a single project, I think even then it's more difficult, right? Because it's a sign that there's a lack of community orientation, I guess, a lack of outward thinking and outward communication, which is really critical for exactly the reasons that you mentioned. So how do you address that problem when you have that kind of single organization dominance, the elephant problem?

Dawn Foster: Yeah. It depends. How you address it depends a little bit on what the project is like. So there are legitimately open source projects where that elephant, that organization doesn't actually care about incoming contributions from other companies, in which case there's really not much you can do, right? I would say that's definitely a minority. On the other hand, generally, projects are just looking for anyone outside their company to start contributing. And it can be really hard, because when you're the primary contributor to an open source project as an organization, it sort of sets the expectation that that's the way it works. So people from company X are the people who run this project, and so you really do have to be very proactive about pulling other people into the project.

You can look at some of the people that are using it, maybe ask them to contribute. As an organization, it's a little bit easier, because what you can actually do is you can leverage some of the business relationships that you have with other companies that you know are using the project and who might want to contribute, and you can encourage them to contribute in specific ways.

Katherine Druckman: That does make a lot of sense. I guess there's also a little bit of a, I don't know, encouraging the people who have a vested interest in the health and success of a project to then participate in its health and success. But at the same time, I can see scenarios where if there's a single company, for example, dominating a particular project, as an outsider, there's a disadvantage. At a basic level, you don't have quite the same access, right? And the communication platforms become a little bit siloed. The people are communicating within their own company and not so much with the outside world. But there's also a little bit of a, if I'm an outsider, I'm thinking, "Well, what's in it for me?" If you do have a vested interest, then sure, especially if it's business critical to something you're doing somewhere else, but in other scenarios, you might wonder, "Well, why would I contribute to this thing that's dominated by company X?"

Dawn Foster: Yeah, I think they do really have to have some vested interest in the project to contribute, and I think that's true of just about any project. You're going to contribute to a project that you need for something you're interested in, something you're using. Those are typically the places where you would contribute.
 

And then you mentioned also the communication channels. I think it's really, really important, even in these single organization open source projects, to do all the work in the open. So public reviews on GitHub, on PRs, on the issues being tracked in GitHub or GitLab or whatever your platform is and not in an internal tool, and pushing those private conversations out into the open wherever possible.

Katherine Druckman: And we, as open source people, we know those things, but even then, it's hard. It's easy to slip into what's easy. So if I am at the same company as such-and-such maintainer, it's so easy for me to use the back channel to get something moved forward or answer a question, and you just kind of fall back into whatever the path of least resistance is. And so that's kind of a hard thing to break out of, I think.

Dawn Foster: Yeah, absolutely.

Katherine Druckman: Yeah, it's a good reminder, though.

Dawn Foster: And you see that in all projects, right? Even in some CNCF stuff that I do, I'll see where somebody has created a little group of direct messages with three or four of us that are in leadership positions for something, and we'll be having discussions, and I'll look at it, and be like, "We really need to have this discussion in the regular Slack channel. There's no reason for the three of us to be having this discussion."

Katherine Druckman: Yeah, that's a good ... Yeah, and again ...

Dawn Foster: And you're just going to have to push that forward and remind people that there's no reason for us to be having a private conversation about this. This is something that anybody could chime in on.

CHAOSS Project Tools: Augur and GrimoireLab

Katherine Druckman: So tell me a little bit about the tools that you mentioned. You mentioned two.

Dawn Foster: Yes. So we have Augur, which is one of the CHAOSS tools. On the backend, it's a Postgres database, and the way this tool works is it gathers a whole bunch of data from Git-style repositories. So what we actually do behind the scenes is we clone the repository, gather as much data as we can, and then use the GitHub or GitLab API to gather the rest of it, so things like pull requests, merge requests, issues, the stuff that's not just in Git. If it's in the API for the platform that we're getting it from, we probably have that data in Augur.

And the reason I mentioned that it's a Postgres database on the backend is that we do have some front ends into Augur, but a lot of us that use it actually write our own database queries and pull custom data out of it. So Augur is really great from a data science perspective, so for people like me. We also have a lot of open source program offices using it as well, but it's really easy to customize from that standpoint.

And then on the other side, we have GrimoireLab, which is mostly driven by a company called Bitergia out of Madrid. And so they have absolutely beautiful visualizations.

Katherine Druckman: Oh, I love a good visualization!

Dawn Foster: Yeah, and it's based on OpenSearch on the backend. So it used to be Elasticsearch and Kibana. They've converted it over to OpenSearch. And what that gives us is you can do a lot more customization of the visualizations themselves from that front end without actually sort of diving into the database. And the other difference with GrimoireLab is it collects other types of data, so Slack, wikis, forums, other things. If you're trying to understand your community as a whole, and your community is spread out over lots of different places, then GrimoireLab is really good.

And the way I think about it, so I've used both of them extensively. At VMware when I was there, we used both of them. I used Augur within the open source program office because I wanted to heavily customize what people saw, and I wanted to give them four things to focus on. But our community manager team, on the other hand, used GrimoireLab because those visualizations allowed them to dive into every possible detail of their community right from that visualization front end. So it's a lot easier, I think, for community managers.

But either way, it's a lot of data, and so one of the things that we struggle with, and this is what my talk at SCALE was about, is what I affectionately refer to as a wall of visualizations or a wall of data, and it's kind of a tsunami. It's just all of this data coming in from these open source projects. Regardless of what tool you use, you still need to think about what questions you have, what areas you want to focus on, what you want to look at for your community, because you do need to narrow it down.

Katherine Druckman: Sure, yeah. There can be too much data, I guess. Maybe, maybe.

Dawn Foster: There can.

Katherine Druckman: I love data, though. I love information.

Dawn Foster: Yeah, me too.

The Importance of Data and Privacy in Open Source Communities

Katherine Druckman: It's funny. I am kind of a privacy nerd too. I love the data, I love the insights you get from picking through data. I don't like it being collected. It's funny.

Dawn Foster: Yeah, fair enough. Yeah. Have you read, there was an article in the ACM called Beyond the Repository?

Katherine Druckman: I have not seen that.

Dawn Foster: Okay. So it was written by Amanda Casari, Julia Ferraioli, and Juniper Lovato, and they talk a lot about some of the privacy elements of it. They talk about the fact that these open source communities and this data is made of people, so it's not just this collection of data to be used and abused.

Katherine Druckman: Yeah, it’s actual humans.

Dawn Foster: And so they talk a lot, not just about the privacy elements, but about how to get the community involved and things to think about when you're doing research and using open source data. And it's not a particularly hard read, it's not super academic or anything. It's like six pages, and I really encourage people to have a look at it.
 

Insights from Research on the Linux Kernel

Katherine Druckman: Cool. I'll tag a link when I publish this. Speaking of reading, I wanted to talk a little bit about your academic experiences and your research into the Linux kernel. That sounds very interesting, and I wonder if you could give us the high-level view of your findings?

Dawn Foster: So it was super fun, because I did take a complete diversion out of tech for three-and-a-half years to get this PhD, which also meant that I got to write code again, so I had sort of promoted myself out of that. When you're a director of whatever, you don't generally get to write a lot of code anymore. So from that aspect, for me, it was really interesting. And because the collaboration within the Linux kernel happens on mailing lists, that's what I focused on. I did also look at some things within the source code repositories as well, but it was primarily focused on the mailing list.

And what I did was, it's called a maximum likelihood model, and what I did was I looked at which people were more likely to reply to each other on the mailing list, and what characteristics do these people have in common? So are people in the same time zone more or less likely to reply to each other on the mailing list? And my thought would be that people in the same time zone would be a lot more likely, they would have more conversations with each other just because of the nature of being in the same time zone, and that wasn't true. It didn't matter what time zone people were in. I found that absolutely fascinating. So that was really interesting. It sort of validates what the kernel developers always say about the kernel.

The other thing that I found interesting was if people work at the same organization, within the kernel, at least, they were actually more likely to reply to each other on the mailing list than people who worked at different companies. And I would've thought, my assumption was that the opposite would've been true.

Katherine Druckman: Yeah, because they would've used the back channel, right?

Dawn Foster: They could use the back channels, whatever. But some of that might have to do with the fact that people working at the same company might also work kind of on the same things within the kernel, they would care about the same things, and that would drive some of those conversations. Also, and this is the bit that I looked at the source code for, was people who are working in similar areas within the kernel, they also tend to reply to each other more often. So if we're both working on a very particular subsystem, then we're more likely to reply to each other, and that sort of makes sense. You tend to reply to the other people who are working on the stuff that you're working on. So that was really interesting. And then if you're a maintainer, people are a lot more likely to reply to you as well, which makes sense.

 

Katherine Druckman: Yeah, that does make sense. I do another podcast, and we interviewed Greg Kroah-Hartman, and one of the questions I had for him was just, "Why is email still the best? It's basically a mailing list, email. Why is this still the best way to manage contributions and communications in the Linux kernel?" Because again, it seems like a very old school way of doing things, and there are a lot of other communication platforms, Slack and GitHub and all these other methods. But from an identity perspective, your email address is who you are, just like maybe your GitHub profile is who you are as a developer these days, which is a whole other conversation, and it's kind of weird, but it's the way it shakes out. I wondered what your observation is just about that method of communication versus others. Does it matter?

Dawn Foster: Yeah, it's a really good question. And Greg's great, by the way. He helped me a lot with my data, and he was sort of a sounding board for a lot of the work that I was doing. I have a huge amount of respect for the work that he does within the kernel. So the kernel is hard. So a lot of people say the kernel should just move to GitHub because it's a monorepo, and there are some very specific reasons that they can't really do this on GitHub just from a technical standpoint, and there are whole articles on that, because I've read those, and they're really interesting. But I do think that the mailing list approach ... It's a hard question to answer. On the one hand, the mailing list approach works really, really well for the kernel. So you tend to kind of stick with what works well, and moving to something else would be incredibly disruptive for the project.

Katherine Druckman: Oh, sure.

Dawn Foster: So I think that it would make things very difficult. I think not everybody would like it. They would lose developers, and I think it would be very challenging just because they've done it this way for a very long time. I think people underestimate switching costs when they think of things like this. So I do think it'd be very disruptive to move to anything else for the kernel.

However, I do think that it makes it a lot harder to get younger people involved in kernel development. I have been concerned for a very long time that the kernel has an aging problem. A lot of the maintainers are my age. They're in their 50s, so it's becoming a problem that they don't have a lot of younger folks coming in.

Katherine Druckman: Sustainability.

Dawn Foster: Now, they've done a lot in the last 5 or 10 years to try to improve this. So they've done stuff through Outreachy, a lot of the companies involved, they naturally get some people who are younger moving into kernel development. But I do worry that a lot of the people who are responsible for the kernel are approaching retirement.

Katherine Druckman: Yeah, it's a valid concern. We're talking about critical technology. Everything has Linux in it somewhere, right?

Dawn Foster: Yep, it does.

Katherine Druckman: It's in about everything. What happens if there are no longer maintainers? Well, Jorge Castro says there are zombies the next day.

 

Dawn Foster: Zombie maintainers.

Katherine Druckman: Well, no, I mean, so the whole society collapses, and we all turn into zombies, I think. I don't know. But yeah, it's a valid concern. Kubernetes, a much younger project, but here we are at KubeCon, it's the same thing, though. It's become mission-critical. Maybe there will be something else in the next X number of years, but all of this kind of foundational technology really does need to have successors in place, a line of succession.

Dawn Foster: No, for sure. And I think even Kubernetes struggles with getting enough people to contribute to the project to sustain it over time. It's a constant challenge to try to get enough contributors, and especially in an environment like we're in right now, where it's kind of financial downturn-y, lots of companies laying people off, refocusing on other things. And when that happens, and we've seen this with some CNCF projects, I mean, one of the reasons ... So etcd, which is the key-value store for Kubernetes, so Kubernetes doesn't really run without etcd, was for a while in dire, just dire straits. It just did not have enough developers to maintain it. And part of the reason was because a couple of companies had pulled their maintainers off of etcd to work on other things. And when that happens, it creates a big gap in the project. And through some focus and through some work, we were able to address that on the CNCF side to get more people contributing to etcd, but it was a big problem for a while.

Katherine Druckman: Yeah, I can see that. Yeah, everybody's stretched a little too thin, and there are consequences to that.

Dawn Foster: Totally.

Katherine Druckman: Yeah. So I wondered, again, going back to project health because, well, I just gave a whole talk kind of about it, so it's on my mind, and I wondered, statistics show that there is a tremendous rise in single maintainer projects, and that's an interesting conversation. I mean, obviously it's a red flag for project health, and everything has to start somewhere, everything probably starts as a single maintainer project. But I wondered from your perspective and the type of work you do, what is the significance of that data? Now, that doesn't mean they're the most popular projects or the most successful, but what does that mean in the most global terms?

Dawn Foster: Yeah, so I do think that the data is a little bit misleading because in a lot of cases, it looks at projects on GitHub that have nobody ...

Katherine Druckman: It could be abandoned, could be sandbox.

Dawn Foster: ... only one person contributing. So I have loads of those in my GitHub account as well. And it's fine, because people use them, but they're just trivial things that nobody really needs to contribute to. It's pretty common to have lots of those projects where people like me are just going to throw something out there so we could show it to people or just to give it a convenient place for it to live. But the concern is those single maintainer projects that are used by loads of people.

Katherine Druckman: Yes.

Dawn Foster: And those are the concern, and there are loads of those. And we've seen examples. Not that long ago, there was a project, and I don't remember what its name was, but the maintainer was thrown in jail, and he was in jail for six months and nothing happened on the project while he was in jail, because it was a single maintainer project, and it was a relatively popular project. And so those things happen, right?

Katherine Druckman: And if you're not on top of your dependencies and following up and paying attention to every single thing, and you don't know this, you don't even know that maybe you should fork it or maybe...

Dawn Foster: Yeah. And one of the things that we've been focusing on within the CHAOSS project as well, I mentioned earlier that we've spent a lot of time on the projects that you contribute to, so looking at projects from the inside out. Also, we just developed a whole set of metrics models for looking at viability, which is looking at it from kind of the outside in for consumption. So these are consumption, they're all around viability, and they're how you think about the projects that you pull into your products or that you pull into your infrastructure.

Katherine Druckman: Yeah. Exactly, yeah.

Dawn Foster: And these were developed by Gary White over at Verizon. And so he's been doing a lot of work on those. And it is something that I think we need to think more about. The way I talk about it when I talk to people about viability is the projects that you should spend a lot of time assessing the viability of are the projects that you could not ship your product without, because you couldn't easily substitute something else. There are loads of little tiny libraries that are single maintainer libraries, but you could rewrite them relatively easily. But you can't rewrite a Kubernetes, you can't rewrite a big piece of your infrastructure. So I encourage companies to think strategically about which projects fit into which bucket, which ones could you not live without and not easily replace and not rewrite yourself, and which ones are relatively trivial from a technology standpoint, because those you could easily replace, and the viability should really be focusing on the ones that you can't do without.

Katherine Druckman: Yeah, that's great. My interest these days is the consumer side of open source. In other words, being a little bit more careful. Because it's so easy now, it's so easy to use your little package manager and pull in whatever project you happen to see on GitHub, and there it's in your project, and it has several dependencies. I think most people are careful, obviously, especially if you're an enterprise or something like that, but I think everybody, anybody who's writing anything at all needs to really consider taking ownership of what you pull into your project, because you have then adopted it quite literally, and you need to take care of it, and you need to keep track of its health, and you need to take it to the vet or whatever.

Dawn Foster: Yeah, exactly. And I think a lot of times, people, especially individual developers, they don't necessarily know what else to think about. And I think we also fall into the trap of, "Well, oh, it's run by big company X. Surely, it'll always be around." And we know that big companies kill projects too. There's whole lists of projects that have been killed by-

Katherine Druckman: Yeah, killed by Google. That's a site, right?

Dawn Foster: That's what I was thinking of, actually. Yeah. Yeah, for sure. So people think, "Oh, well, this is a Google open source project. Surely, I should just use that because they're Google or Red Hat," or name your favorite company. But also, developers don't necessarily think about licensing, for example. When I worked at VMware, we had a team of developers, they spent months doing this technical evaluation of a project. And when they sent it to us to look at, they were like, "It's GPL. We have this GPL project we've spent all this time on. It works perfectly for us. Just can we use it?" And my response was, "It's not GPL, it's AGPL. And no, we can't use it, because we have some customers that won't take code that has GPL licensed components within it." So that was just like a deal breaker, and this team had spent months evaluating this technology, and I felt terrible. But had they come to us early, before they'd started this long evaluation, we could have told them, "No, you can't use that. Use something else." But they don't necessarily know all of the things to look at.

Katherine Druckman: Yep, that's right. Education is key.

Dawn Foster: Yeah.

Katherine Druckman: Well, cool. This has been fantastic. Is there anything that you wanted to share that I didn't ask?

The Future of Open Source: Sustainability and Viability

Dawn Foster: It's a good question. I do just want to encourage people to think a lot about contributor sustainability. It's something I care deeply about, and I think a lot of times, it gets kind of a cursory evaluation. I spend a lot of time thinking about responsiveness, for example, and one way to improve responsiveness is just to put more pressure on your developers to respond more quickly, but things like that don't solve the long-term problems. They just make it less sustainable over time because all you're doing is burning people out. So I really want people to think deeply about contributor sustainability and what it takes to sustain the projects that you care about.

And then especially as companies, contributing some employee time back to the projects that you care about. It's not charity. It's not like just for the good of the project, but it also gives you insight into what's going on in the project, what's the future direction, what's happening, what might we want to know about the project? And by having your employees deeply embedded in those projects, that's a strategic win for your organization. And I think people tend to think of it as charity, and it's not. It benefits your company to have those employees embedded in the projects that are strategic for you and important for your business.

Katherine Druckman: Yeah, I think that's good advice, wise words. Well, thank you so much for carving out some time for me and us and anyone listening. I appreciate it.

Dawn Foster: Yeah, thanks for having me. This was fun. Be cool.

Katherine Druckman: You've been listening to Open at Intel. Be sure to check out more from the Open at Intel podcast at open.intel.com/podcast and @OpenatIntel on Twitter. We hope you join us again next time to geek out about open source.

About the Author

Katherine Druckman, Open Source Evangelist, Intel

Katherine Druckman, an Intel open source evangelist, hosts the podcasts Open at Intel, Reality 2.0, and FLOSS Weekly. A security and privacy advocate, software engineer, and former digital director of Linux Journal, she’s a longtime champion of open source and open standards.