Patterns and surprises in rich but noisy network data – There has in recent years been a large amount of research interest in networks such as the Internet, the World Wide Web, citation networks, social networks, and biological networks such as metabolic networks and food webs. Empirical observations of networks like these are often noisy, containing measurement error, contradictory observations or missing data, but they can also be richly structured, with measurements of different types, repeated observations, annotations or metadata. In this talk I will address the problem of accurately estimating network structure from such rich but noisy data, particularly focusing on social and biological examples. In the process, we will see that the pattern of errors in network data is far from random and can teach us some intriguing lessons not only about the data but also about the underlying systems they describe.
Mark Newman: Patterns And Surprises In Rich But Noisy Network Data
Pleasure to be back here again, as always. I've enjoyed many of my conversations with you this week. A short visit, but it's great to be here. So I'm going to talk about networks today. This is an area that's grown very rapidly in the last 20 years, and I would say one of the reasons for that is that we have a lot of data. What's become available in the last couple of decades is really detailed data on the structure of a lot of networks, and that has driven a lot of the science that people do in this area. So I'm thinking of things like the Internet here, a network of computers connected by data connections, or transportation networks like airlines. This is a picture of a portion of the World Wide Web. This is a biological network, a metabolic network of chemical reactions in the cell, and this is a food web of predator-prey interactions between species in an ecosystem. A lot of my work is concerned with social networks like this one; this is a network of friendships amongst students in a high school. So the standard sort of study that one does in this area is one collects data for a network such as this, and then one assembles it into a nice picture like this and measures some things about the structure of the network, which hopefully answer some questions about the system we're interested in. And I would claim that there's a problem with this approach.
The problem is that this is not really a friendship network. So this is a network of kids in a school, and what you do is you go in and you say "Who are you friends with?" to each of the kids in the school, and you end up with this network here. This is the network that people analyze. But I would claim that this is not actually the network of who's friends with whom; this is the network of who says they're friends with whom, and those two things are different. It turns out they are really very different, in fact. But what am I talking about? So the network that we're looking at here in this picture, which incidentally comes from this big NIH-funded study (the picture was made by Jim Moody at Duke University), is a network of the raw data for what the experiment was, which is you go in and you ask people who they're friends with. And what we actually want to know is who actually is friends with whom. And then once we have that, we can ask questions about things. So another way of putting this would be to say that the data are not the network. This apparently is not obvious in this field. I assume that in every other field this is obvious. Everybody knows that the data are not the answer, right? If you're measuring the mass of the Higgs boson, then the data are, you know, some zillions of collisions between elementary particles in a particle collider, and things shoot out in different directions. That's not the mass of the Higgs boson, that's the data. Then you analyze it somehow to get the answer you want, which is the mass of the Higgs boson.
And everybody understands that there's this huge pipeline of computational analysis that takes weeks to complete before you end up with this one number at the end of it. What's confusing about network data is that often the raw data that you start off with themselves look kind of like a network. That's what this picture is. And I think that this has lulled people into not noticing this gap between what the data is and what the thing is you want to understand. So I want to talk about that gap today and how we cross it. Another way of saying all this is just that in networks, as in everything else, there is measurement error, right? We measure some data about a network, but it doesn't perfectly reflect the structure of it. For instance, you know, in a friendship network you ask who's friends with whom, but it doesn't perfectly reflect the structure of the network, and we need to allow for that. And as I say, in every other field of science people understand this, that there is measurement error in your data, and for some reason in this area, in the study of networks, this hasn't become an established part of the way we do science, and it really ought to be. You just can't publish things without error bars; everybody knows this, right? So at some level what I'm going to be talking about today is how do you work out the error bars on your network. Which sounds like a really boring thing to talk about, but it's not, actually; it turns out it's really interesting. Why is it interesting? It's interesting because it turns out that the errors can tell you something about the actual system that you're studying.
So just to give you an example, when you look at the friendship network, it turns out that when people say who they're friends with, some people are more accurate at saying who they're friends with than other people. And it turns out that which people are more accurate and which are less accurate is telling you something about the people.
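To make the gap between "who says they're friends with whom" and "who is friends with whom" concrete, here is a minimal Python sketch. It is not the method from the talk: the names, the `reports` data, and the `compare_reports` helper are all hypothetical, and it assumes friendship should be mutual, so that a pair where only one person names the other is a sign that at least one report does not match the underlying network.

```python
from itertools import combinations

# Hypothetical survey answers (made up for illustration):
# each student maps to the set of people they name as friends.
reports = {
    "ana": {"ben", "cy"},
    "ben": {"ana"},
    "cy": {"dee"},
    "dee": {"cy", "ana"},
}

def compare_reports(reports):
    """Sort every pair of students by how their two reports line up.

    Returns (mutual, one_sided): pairs where both name each other,
    and pairs where exactly one names the other. Pairs where neither
    names the other are left out of both sets.
    """
    mutual, one_sided = set(), set()
    for a, b in combinations(sorted(reports), 2):
        a_names_b = b in reports[a]
        b_names_a = a in reports[b]
        if a_names_b and b_names_a:
            mutual.add((a, b))
        elif a_names_b or b_names_a:
            one_sided.add((a, b))
    return mutual, one_sided

mutual, one_sided = compare_reports(reports)
print("both say so:", sorted(mutual))
print("only one says so:", sorted(one_sided))
```

The one-sided pairs are where the raw data disagree with themselves: the data say "ana names cy" but not the reverse, and nothing in the data alone tells you which report, if either, reflects the real friendship.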