Is correctness still the problem?

I agree. I think we know how to build a single piece of software correctly. The problems arise in integration.

Volatile and Decentralized: What I wish systems researchers would work on:

“Understanding interactions in a large, production system: The common definition of a ‘distributed system’ assumes that the interactions between the individual components of the system are fairly well-defined, and dictated largely by whatever messaging protocol is used (cf., two phase commit, Paxos, etc.) In reality, the modes of interaction are vastly more complex and subtle than simply reasoning about state transitions and messages, in the abstract way that distributed systems researchers tend to cast things.

Let me give a concrete example. Recently we encountered a problem where a bunch of jobs in one datacenter started crashing due to running out of file descriptors. Since this roughly coincided with a push of a new software version, we assumed that there must have been some leak in the new code, so we rolled back to the old version — but the crash kept happening. We couldn’t just take down the crashing jobs and let the traffic flow to another datacenter, since we were worried that the increased load would trigger the same bug elsewhere, leading to a cascading failure. The engineer on call spent many, many hours trying different things and trying to isolate the problem, without success. Eventually we learned that another team had changed the configuration of their system which was leading to many more socket connections being made to our system, which put the jobs over the default file descriptor limit (which had never been triggered before). The ‘bug’ here was not a software bug, or even a bad configuration: it was the unexpected interaction between two very different (and independently-maintained) software systems leading to a new mode of resource exhaustion.

Somehow there needs to be a way to perform offline analysis and testing of large, complex systems so that we can catch these kinds of problems before they crop up in production. Of course we have extensive testing infrastructure, but the ‘hard’ problems always come up when running in a real production environment, with real traffic and real resource constraints. Even integration tests and canarying are a joke compared to how complex production-scale systems are. I wish I had a way to take a complete snapshot of a production system and run it in an isolated environment — at scale! — to determine the impact of a proposed change. Doing so on real hardware would be cost-prohibitive (even at Google), so how do you do this in a virtual or simulated setting?

I’ll admit that these are not easy problems for academics to work on. Unless you have access to a real production system, it’s unlikely you’ll encounter this problem in an academic setting. Doing internships at companies is a great way to get exposure to this kind of thing. Replicating this problem in an academic environment may be difficult.”

(Via Volatile and Decentralized.)
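The failure mode in that story, by the way, is a process exhausting its per-process file descriptor limit (RLIMIT_NOFILE on POSIX systems). As a minimal sketch of the mechanism, a program can inspect and raise its soft limit with the standard getrlimit/setrlimit calls:

#include <sys/resource.h>
#include <cstdio>

int main() {
    rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {  // read current limits
        perror("getrlimit");
        return 1;
    }
    std::printf("soft limit: %llu, hard limit: %llu\n",
                (unsigned long long)rl.rlim_cur,
                (unsigned long long)rl.rlim_max);
    rl.rlim_cur = rl.rlim_max;  // raise the soft limit up to the hard cap
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    return 0;
}

Raising the limit only buys headroom, of course; the real lesson is that the resource budget was silently invalidated by a change in a completely different system.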

Safety margins for software

The ACM Queue article Software Needs Seatbelts and Airbags paints a rather grim picture of the state of software today.

The software industry is in a position similar to that of the automobile industry of the 1950s, delivering software with lots of horsepower and tailfins but no safety measures of any kind. Today’s software even comes complete with decorative spikes on the steering column to make sure that users will suffer if their applications crash.

I do believe we are much better at building software than this quote claims, in particular safety-critical software. However, for more general-purpose software such as web services, where the high development costs of safety-critical methods cannot be justified, there is still a lot to be done to improve software quality at a reasonable cost.

This article discusses an interesting approach to automatically adding safety to software. The article distinguishes between a bohrbug, a “good, solid bug” whose occurrence conditions are known, and a heisenbug, a bug that occurs non-deterministically and cannot easily be reproduced. This is an interesting idea, and the author’s group has developed a tool, DieHard, that adds this kind of safety automatically. Given my background in formal methods I’m inclined to view this kind of approach sceptically, as it would seem much better to write the software correctly in the first place. Nevertheless, the tool promises to eliminate dangling-pointer errors, which is definitely valuable, so I see no reason why this approach should not be incorporated into software development flows.

The approach is based on adding extra padding to memory allocations, thus alleviating the effects of overflows. It would seem that this allows a cost/safety tradeoff: adding more padding decreases the probability of overflow errors, while increasing the memory consumption of the application. I believe there is a lot of promise in this approach, as it would allow us to start talking in terms of adding, say, a 10% safety margin to certain aspects of a system. This would be an important step in transforming software quality from a black art into an engineering discipline.
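To make the idea concrete, here is a toy sketch of my own (not DieHard itself, which uses randomised, over-provisioned heaps) of what such a safety margin on allocations could look like:

#include <cstdlib>
#include <cstring>

// Toy allocator wrapper: over-allocate every request by a configurable
// safety margin, so that small overflows land in unused padding instead
// of corrupting a neighbouring heap object.
static const double kSafetyMargin = 0.10;  // the "10% safety margin"

void* padded_malloc(std::size_t size) {
    std::size_t padded = size + (std::size_t)(size * kSafetyMargin) + 16;
    return std::malloc(padded);
}

int main() {
    char* buf = (char*)padded_malloc(100);
    std::memset(buf, 'x', 100);
    buf[100] = '\0';  // a one-byte overflow now lands in the padding
    std::free(buf);
    return 0;
}

The tradeoff is exactly as described above: a larger margin lowers the probability that an overflow reaches a live object, at the price of proportionally higher memory consumption.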


A sober view on the future of computing

David Auerbach has a rather sober view on the future of computing in his post The Stupidity of Computers.

There is good news and bad news. The good news is that, because computers cannot and will not “understand” us the way we understand each other, they will not be able to take over the world and enslave us (at least not for a while). The bad news is that, because computers cannot come to us and meet us in our world, we must continue to adjust our world and bring ourselves to them. We will define and regiment our lives, including our social lives and our perceptions of our selves, in ways that are conducive to what a computer can “understand.” Their dumbness will become ours.

I used to be a believer in hard AI, but I have had to change my beliefs.

Academic entrepreneurship

This is a nice summary of things you need to think about if you are considering starting a company from academia. The point I would add is that it helps enormously if you can find somebody from outside academia who is interested in acting as the CEO of the company. This was crucial when we started Pecos. The company failed for other reasons, but without the CEO we would not have started at all.


Why We Should Be Teaching More Computer Science Classes

From Education Nation 2011: Why We Should Be Teaching More Computer Science Classes

Most people who write computer programs aren’t professional programmers. Scientists and engineers write programs on a daily basis. But even non-technical professionals rely on deep knowledge of computing. Graphic designers work with many images with multiple layers, and they write programs to automate operations. An estimate out of Carnegie Mellon University says that for every professional software developer in 2012, there will be four people who write programs but aren’t professional software developers.


Nice rant: "Personal cloud computing in 2020 (or not)"

Personal cloud computing in 2020 (or not)

My favourite coffeespot has been noticed by Nordic Coffee Culture Blog!

Café Art, Turku | Nordic Coffee Culture

CS Education Act introduced into Congress

This in the States: Robert P. Casey Jr. | United States Senator for Pennsylvania: Newsroom – Press Releases

In Finland, where IT is seen as one of the key competitive factors of our industry, we don’t have any proper CS in comprehensive school, let alone in high school.

(Via Computing Education Blog)

Swedish Multicore Day

Last week I participated in the Multicore Day in Kista, Sweden. The Multicore Day is a yearly event (I think this was the 5th in the series) that focuses on the new challenges exposed by the proliferation of multicore chips. The event is quite big: there were almost 200 registered participants, although maybe only half of them showed up. Still a much bigger audience than at the event I helped organize in May (Tackling the Multicore Challenge).

So why this interest in multicores? Most of us have at least a dual-core chip in our laptop or desktop computer, and many computers have even more (I have an 8-core iMac at home). There are also commercial chips like the Tilera TILEPro64, which has 64 cores on a chip. One of the main challenges posed by these chips is how to parallelize applications so that they make effective use of the computing resources, and this was the theme of this year’s event, which had three keynote speakers and focused on programming approaches.

According to Karl-Filip Faxén, there are 3 criteria that a programming approach should satisfy:

  1. It should support scalability. This means that programs should not be written for a predefined number of cores; instead, the program should adapt to the number of available cores, so that its performance scales as more cores become available.
  2. It should not be more difficult to use than sequential programming approaches.
  3. It should be useful for a wide range of problems.

One such approach is task-based parallelism. In task-based parallelism the program is split into tasks, which are smaller and more dynamic than threads. Threads are instead seen as implementation-level mechanisms used by the run-time system to execute the tasks.

The two morning keynote speakers presented Cilk, a task-based programming approach originally developed by the Supertech group under the leadership of Prof. Charles E. Leiserson, starting in 1994. The approach was later commercialised and is now offered by Intel as Cilk Plus. The presentations were very good, and gave good insights into how task-based parallelism works and where it gives good results. I was quite impressed by some of the results, e.g. how quickly a program had been parallelised using Cilk and what performance gains had been achieved.
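To give a flavour of the programming model, here is the classic Cilk example, a naive recursive Fibonacci (a minimal sketch; it needs a Cilk-enabled compiler, e.g. Intel’s Cilk Plus toolchain):

#include <cilk/cilk.h>

// cilk_spawn lets fib(n - 1) run in parallel with the continuation,
// i.e. the call to fib(n - 2); cilk_sync waits for all spawned
// children before the results are combined.
int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);
    int y = fib(n - 2);
    cilk_sync;
    return x + y;
}

Note how the code says nothing about the number of cores: the run-time scheduler maps the spawned tasks onto however many workers are available, which is precisely the scalability criterion above.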

After lunch there were three parallel sessions with talks from academics. I found these less interesting, as they were mainly project presentations, clearly aimed at showcasing ongoing projects to participants from industry. Given the industrial emphasis of the event, I see no fault in this.

The third keynote was given by Prof. Wen-mei Hwu, who is working on implementing HPC algorithms on GPUs. The talk highlighted an important issue: not all algorithms seem amenable to parallelisation. In the realm of graph algorithms, a version of breadth-first search exists that scales well on GPUs; however, nobody knows how to parallelise depth-first search. This is also probably the reason that the Murphi model-checker has been parallelised: Murphi is based on BDDs, and BDD model-checking is based on breadth-first search.
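The contrast is easy to see in code. Breadth-first search proceeds in level-synchronous rounds, and every vertex in the current frontier can be expanded independently. Here is a sketch of that structure in plain C++ (not a GPU implementation, just to show where the parallelism lives):

#include <utility>
#include <vector>

// Level-synchronous BFS. The loop over the frontier is the parallel
// part: each u can be expanded independently (on a GPU the update of
// level[v] would be an atomic compare-and-swap).
std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int src) {
    std::vector<int> level(adj.size(), -1);
    std::vector<int> frontier{src};
    level[src] = 0;
    for (int depth = 1; !frontier.empty(); ++depth) {
        std::vector<int> next;
        for (int u : frontier) {
            for (int v : adj[u]) {
                if (level[v] == -1) {
                    level[v] = depth;
                    next.push_back(v);
                }
            }
        }
        frontier = std::move(next);
    }
    return level;
}

Depth-first search has no such rounds: each step depends on the one before it, which is why it resists this kind of parallelisation.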

The day closed with a panel. Many of the panelists emphasized the need to introduce parallel computing at the undergraduate level, so that it becomes part of the standard toolkit of programmers. Taking up this point, I will be running a course, “Introduction to Manycore Programming”, later this fall, with the intent of presenting a number of different approaches to programming parallel applications. More about this later…


DIEM – Devices and Interoperability Ecosystem ends its 3rd year

The DIEM project recently had its third-year review. Simply put, the goal of the project is to develop technologies that enable devices to interoperate better. This is done within the conceptual framework of a Smart Space. A Smart Space is an abstraction of space that encapsulates both the information in a physical space and access to this information, allowing devices to join and leave the space. In this way the Smart Space becomes a dynamic environment whose membership changes over time as entities interact with it to share information. A Smart Space is essentially an information storage architecture, where the “smartness” comes from the applications or services that it provides.

For example, your smartphone could notice that your favorite program will start in 5 minutes, based on your profile information or a fan page on Facebook and the TV guide available on the broadcaster’s web page. It could then use GPS to find that you are not at home, and deduce that it needs to start the PVR at home. Clearly this kind of application is not possible today, because most devices are islands that do not expose their internal APIs to application designers.

The DIEM project has emphasized, on the one hand, the development of a Smart Space infrastructure, the Smart-M3 framework, and on the other hand the development of a number of domain-specific smart space solutions. Our work has been in the development of the Smart Space infrastructure, where our focus has been on application development tools. Our research group has been working with Domain-Specific Languages (DSLs) for many years, and we have proposed a DSL-based approach to smart space application development. Our goal is to hide as much as possible of the Smart Space infrastructure APIs and instead let the programmer concentrate on expressing the essence of the application logic. To give an example: a simple application for a smart space would be a “smart mute” function for your TV. Whenever you receive a phone call, the TV automatically moves into time-shift mode and starts recording the program you are watching; when you finish the call, the TV resumes showing the program. The application is actually a very simple rule that we can express in our DSL as follows:

with phone:- Phone(id="phone01"), pvr:- PVR(id="pvr01"):
when phone.IncommingCall: pvr.Pause
end

The with clause is a quantifier that binds variables to the results of queries. It works over sets of bindings, i.e. instead of binding to a particular phone with Phone(id="phone01") we could bind to all phones in the Space with phone:- Phone. This approach makes it very easy to express the typical use case of a Smart Space: listen for an event, and then react to it.
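To show roughly what such a rule amounts to, here is a hedged sketch in C++ with hypothetical types (the names Phone, PVR and install_smart_mute are mine, purely for illustration; the real DSL tooling generates code against the Smart Space APIs). The rule boils down to subscribing to an event and running the rule body when it fires:

#include <functional>
#include <string>

// Hypothetical runtime types, for illustration only.
struct Phone {
    std::string id;
    std::function<void()> on_incoming_call;  // event subscription slot
};
struct PVR {
    std::string id;
    void pause() { /* send the Pause command to the device */ }
};

// Rough equivalent of the DSL rule:
//   with phone:- Phone(id="phone01"), pvr:- PVR(id="pvr01"):
//   when phone.IncommingCall: pvr.Pause
void install_smart_mute(Phone& phone, PVR& pvr) {
    phone.on_incoming_call = [&pvr]() { pvr.pause(); };
}

The point of the DSL is precisely that the programmer never writes this plumbing; the quantification over bindings and the event subscription are handled by the generated code.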

Project Team

The people currently involved in the project are:

Publications

Below you can find links to our main publications on the topic.
