57 posts categorized "NStatic"

January 14, 2015

NStatic

I started working on NStatic a little under a decade ago. It went through several iterations of development, with major design changes that broke a lot of functionality, in order to achieve the product I dreamed of. For instance, one goal was to be able to prove the correctness of an implementation of red-black trees. I have seen papers in program verification in which every loop had to be proved with Hoare logic annotations, and the assertions were full of forall and exists quantifiers, when a function call would have been far simpler to write and read; function calls are also, I think, more efficient to implement.

I postponed development a few times because I thought I could generate money with some quick project that had the side benefit of teaching me a new technology like iOS or Android. Each time, the quick project turned into more work than I was prepared to emotionally invest in, when I really wanted to be working on my static analysis tool.

My savings are low, so I am planning a Kickstarter that would let me focus on NStatic and release it within the next year or two. The funding would allow me to purchase additional development tools to accelerate the product’s time to market, and it would make my life a bit more bearable and less stressful. It would be a disappointment if, after a whole decade, I never released this product.

There are some recent developments that may prove helpful to me: the commercially friendly open-sourcing of C# via Roslyn, CodeContracts, and .NET Core. I was planning a tool that would just read assembly IL and PDB files to perform code analysis, which is considerably simpler than my original approach of processing source code, but the Roslyn API might enable NStatic to process source code directly with high quality. NStatic could actually integrate the CodeContracts work and become a superset of it. .NET Core might lead to further benefits down the road.

A major tech company is currently considering me for a group that does development tools, but there’s no guarantee that I would actually work on static analysis there. I also do significant work in natural language processing, which the group is not involved with, but the company is big and has the best AI researchers.

If my kickstarter doesn’t take off, then I would probably accept an offer at the company.

The work that I am doing with NStatic might change software.

Plans for Year 2015

My company is SoftPerson, which specializes in “semantic computing” in everyday general-purpose applications. The past decade or so has been one of extensive reading, research, and experimentation with entirely new technologies and user experiences, but no product. I can’t really continue for more than another year without revenue, so I came up with some remedies.

These are my plans for the current year.

  • Releasing NConvert, a semantic code conversion tool, early this year
  • Kickstarter for NStatic, a static analysis tool, next month
    (In a future year, I would do the same for my natural language word processor product.)

In both cases, I would start out with a light version of the application, lacking much of the AI, to minimize risk and time to market. Then, I would follow up with the killer functionality. Consumers who buy the light version may receive the AI-enhanced version when it finally releases.

There are a variety of reasons why I was not able to release NStatic earlier; many of them have little to do with the AI.

  1. My code generation was originally written in Perl, which is difficult to refactor; it is now written completely in C#.
  2. I relied on, and was let down by, third-party components for some core functionality such as parsing C#, in an attempt to focus on my core value proposition. My codebase will now be based on Roslyn.
  3. Working with source code is time-consuming. NStatic will initially work with the IL from compiled assemblies and use the symbol information from PDB files.
  4. The mathematics is intense. One of the issues I have to deal with is ensuring that the analysis terminates.

I have been contacted by recruiters from two different major tech companies, so I may go back to work in the industry if I can leverage the ideas and techniques that I developed. The pros of working in industry are that I can share ideas with top researchers and my discoveries will be widely disseminated.

I may turn some of my experiences and notes into book form. The books may be self-published using the Gumroad service.

  • General-purpose Layout Engine
    • Building a Word Processor or Any Other Document-Based Editor
    • Structured Editing
  • Semantic Computing
    • Real Natural Language Processing (Focus on Parsing and Semantics)
    • Symbolic Reasoning Engine
  • High-Performance Immutable Data Structures
  • Practical Guide to Building a Software Startup
  • The Design and Architecture of Various Open Source Projects
    • Roslyn
    • F#
    • Python Math Libraries (SymPy, NumPy, SciPy, and others)
    • .NET Core
    • JetBrains IntelliJ IDEA & Nitra

Some of the code that I will write will be open-sourced.

January 19, 2008

Playing Chess with God

Although the title is appropriate, I wasn't actually planning on writing a post about the legendary chess grandmaster, Bobby Fischer, who died in the past few days, but I will take a few moments to pay my respects.

I was a long-time admirer of his, and I identified with him ever more over the years. Here was a young natural talent who defeated several highly trained Soviet chess players to end Soviet domination of the chess world title. He was also very individualistic, provocative, and controversial, breaking taboos and openly defying the government by traveling to Serbia despite sanctions. In the same vein, I now fancy myself competing against other companies and speaking freely.

Back to my intended post: I found an interesting post called "Play Chess with God" by an interesting new blogger, Russ Cox. This and his other post, SuperOptimizer, deal with computer solutions to traditionally human problems, mainly through brute-force techniques.

Starting in the late 1970s, Ken Thompson proved by example that computers could completely analyze chess endgames, which involve only a small number of pieces. Thompson's key insight was that when there are only a few pieces on the board, there might be very many possible move sequences, but there are relatively few distinct board positions.

Given six pieces on the board, determining the best next move becomes a simple lookup in Ken's endgame database. Russ quotes former championship chess player Tim Krabbe describing his experience using said database:

Playing over these moves is an eerie experience. They are not human; a grandmaster does not understand them any better than someone who has learned chess yesterday. The knights jump, the kings orbit, the sun goes down, and every move is the truth. It's like being revealed the Meaning of Life, but it's in Estonian.

While my software relies more on intelligent rules than on brute-force techniques, the feeling is more or less the same. My initial reaction on seeing a strange result calculated by my NStatic tool is sometimes denial, with the full belief that I am witnessing a software bug, only to discover after careful analysis, and to my surprise, that the computer's solution is in fact correct.

I don't claim this experience to be unique. This should be familiar to anyone who has ever played chess against the computer or used a declarative programming language with unpredictable control flow.

January 02, 2008

Human-like Reasoning

My goal in my static analysis work is not to solve general problems like the Four Color Theorem, but rather to closely simulate the human reasoning process in order to solve problems that are generated, comprehensible, and solvable by normal people. Practical and deep, rather than complete.

Humans write programs to be understood by other humans. Human understanding comes about instantaneously, or with relatively little effort, compared to the work performed by many analysis tools. We typically don't unroll loops and recursion; instead, we "get" the meaning of programs by recognizing a few standard patterns. We look at a program and instantly see a sorting operation taking place, for instance. Programs typically consist of these recurring patterns. In many programs, the only complex data structures used are a standard few such as arrays, hashtables, linked lists, and stacks. My approach to reasoning adheres closely to the way humans think, even when more efficient computational approaches exist, so as to align my software with human performance; the one exception is when a non-human-like approach is strictly better in every way yet yields identical results.
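
As a rough illustration of that pattern recognition (a snippet constructed for this post, not NStatic output): an experienced reader grasps the meaning of the loop below at a glance, without mentally unrolling a single iteration.

    class PatternExample
    {
        static void Main()
        {
            int[] values = { 3, 7, 2, 9, 4 };

            // An experienced reader "gets" this loop at a glance as "find the largest
            // element" without stepping through any of its iterations.
            int max = values[0];
            for (int i = 1; i < values.Length; i++)
            {
                if (values[i] > max)
                {
                    max = values[i];
                }
            }

            System.Console.WriteLine(max);  // 9 -- the recognized pattern: max == the maximum of values
        }
    }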

Stephen Wolfram, inventor of Mathematica, wrote a post, Mathematics, Mathematica and Uncertainty, touching on some of these themes. Even with the more advanced symbolic capabilities of Mathematica--virtually infinite compared to those of my own software package--Wolfram encountered limits trying to use his own system to verify itself.

Sometimes we use the symbolic capabilities of Mathematica to analyze the raw code of Mathematica. But pretty quickly we tend to run right up against undecidability: there's a theoretical limit to what we can automatically verify.

Yet I think our situations differ somewhat in that Mathematica isn't geared toward testing programs. Also, the Mathematica code base is not typical: being a program for doing mathematics, it naturally consists of advanced mathematics, difficult for mortals to understand and involving non-linear arithmetic, which is theoretically undecidable.

The code was pretty clean. But it was mathematically very sophisticated. And only a very small number of experts (quite a few of whom had actually worked on the code for us) could understand it.

Another point that Wolfram makes is that, contrary to my approach, Mathematica typically avoids imitating humans, instead using complicated, non-human-friendly procedures that run hundreds of pages in length--roughly the size of the entire analysis portion of my code base.

Think of almost any mathematical operation. Multiplying numbers. Factoring polynomials. Doing integrals. There are traditional human algorithms for doing these things.

And if the code of Mathematica just used those algorithms, it'd probably be quite easy to read. But it'd also be really slow, and fragile.

And in fact one of the things that's happened in Mathematica over the past decade or so is that almost all its internal algorithms have become very broad "polyalgorithms"---with lots of separate algorithms automatically being selected when they're needed. And routinely making use of algorithms from very different parts of the system.

So even if there's some algorithm for doing something that's written out as a page of pseudocode in a paper or a book, the chances are that the real production algorithm in Mathematica is hundreds of pages long---and makes use of perhaps hundreds of other sophisticated algorithms in other parts of the system.

But such algorithms were not constructed to be understandable---and much like things we see in physics or biology, we can see that they work efficiently, but they're really difficult for us humans to understand.

And in general, the way Mathematica works inside just isn't very "human friendly"... Like when Mathematica does an integral. It doesn't use all those little tricks people learn in calculus classes. It uses very systematic and powerful methods that involve elaborate higher mathematics.

November 14, 2007

Supercompilation

I noticed work by SuperCompilers, LLC on supercompilation that takes on the myth of the sufficiently smart compiler (which I previously wrote about in my post on Humans vs Computers). Their technology seems very similar to my work in its reliance on partial evaluation.

The company addresses the same sort of criticisms about NP-completeness and undecidability in its white paper "Supercompiling Java Programs."

The problem of fully optimizing an arbitrary computer program, when mathematically formalized, turns out to be an incomputable problem. There is no possibility of ever creating an algorithm that can fully optimize a general computer program. Fortunately, though, the practical problem of program optimization -- though still very difficult -- is a lot easier than the general mathematical problem to which it corresponds. Because, in reality, we do not need to optimize arbitrary mathematically given computer programs, we only need to optimize programs written by humans. And humanly-written programs tend to have a lot of special structure to them, including both obvious and subtle redundancies that lead to inefficiencies (this is related to algorithmic information theory; see Chaitin, 1987 and Bennett, 1986).

Supercompilation does not try to solve the impossibly hard problem of fully optimizing an arbitrary computer program. Rather, it is an extremely valuable heuristic technique, which seeks to find ways of significantly optimizing a wide variety of computer programs. Most real-world computer programs are chock full of inefficiencies, which can be removed via the supercompilation process. These inefficiencies are often hard to see in the source code, but they are highly transparent when the code is translated into a more mathematical form, as is done inside the supercompiler. Some of these inefficiencies are introduced by sloppy programming; others are placed there intentionally, by programmers interested in goals besides efficiency, such as maintainability and comprehensibility of code, or speed of software development. In many cases the most efficient way to implement a given piece of code is a truly humanly infeasible way, involving literally hundreds of times more lines of code than the ways that occur naturally to humans.
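
To make the idea of removing redundancy through partial evaluation concrete, here is a textbook specialization example; it is my own illustration and is not drawn from the SuperCompilers white paper. Once the exponent is known to be 3, the generic loop can be evaluated away, leaving only residual arithmetic.

    class PartialEvaluationSketch
    {
        // General-purpose form: flexible, but it pays for a loop and a counter on every call.
        static int Power(int x, int n)
        {
            int result = 1;
            for (int i = 0; i < n; i++)
            {
                result *= x;
            }
            return result;
        }

        // What a partial evaluator can derive once n is known to be 3: the loop is
        // unrolled away and only the residual arithmetic remains.
        static int Cube(int x)
        {
            return x * x * x;
        }

        static void Main()
        {
            System.Console.WriteLine(Power(5, 3));  // 125, via the generic routine
            System.Console.WriteLine(Cube(5));      // 125, via the specialized residual code
        }
    }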

Another post, "Illogical Arguments in the Name of Alan Turing," on O'Reilly's website criticizes bad arguments such as the claim that static code analyzers are not useful at all because they are limited by the halting problem, or that logic flaws can never be found because of the halting problem.

Humans are subject to the same theoretical constraints as computers, yet they do not perform the same laborious, brute-force search. Instead, humans employ a strategy of recursive pattern matching and replacement when thinking about and checking their programs. The kinds of programs that people write are those that they can easily reason about and review later; being maintainable, these programs belong to a small subset of possible programs and are inherently tractable for analysis.

One paper, "Solving difficult SAT instances in the presence of symmetry," suggests that the structure of a problem instance can affect how hard it is to solve in practice: SAT solving, even though satisfiability is NP-complete, often doesn't break down the way theory suggests it could. The authors point to the presence of a large number of symmetries in the problem as something of a prerequisite for explosive behavior.

Research in algorithms for Boolean satisfiability and their implementations [23, 6] has recently outpaced benchmarking efforts. Most of the classic DIMACS benchmarks [10] can be solved in seconds on commodity PCs. More recent benchmarks take longer to solve because of their large size, but are still solved in minutes [25]. Yet, small and difficult SAT instances must exist because Boolean satisfiability is NP-complete.

A 1994 Infoworld article quoted a researcher saying that deep static analysis requires years of analysis time. Upon reading that quote, I immediately thought that the researcher's approach might be very inefficient, because it implied that, after some depth, a human could outperform a computer on the same job. Perhaps, if we mimicked actual human reasoning, we could produce more effective tools.

October 05, 2007

Specifications

My main tasks in the past couple of months have been divided between bug-fixing and improving scalability. In the latter case, testing on large projects revealed bottlenecks caused by uncontrolled expression growth, for which I came up with a general solution. I have been spending a considerable amount of time on my simplifier for recursive lambda expressions.

There was one major addition: I added syntax and highlighting for eventual specification support. My immediate goal was to provide a clean way of adding breakpoints and assertions to the source code.

[Screenshot: syntax highlighting of the new specification keywords and comment delimiters]

The above visual shows what the new syntax highlighting looks like. The new keywords are bolded. Two new grayed-out comment delimiters were added: //@ and /*@. These delimiters act like whitespace within NStatic, with the exception that the new keywords are recognized only inside such comments.

Any arbitrary code, such as ghost fields, can be placed inside these comments; this serves as an alternative to using the preprocessor defines, NSTATIC and CODEANALYSIS.

The new keywords that I have included are require, ensure, assert, assume, pure, modifies, invariant, old, breakpoint, result, unreachable, and nonnull. For the first release, however, the only keywords that are actually effective are assert, assume, breakpoint, and old. Assert verifies that the expression is true and returns a warning otherwise (or an error in the case of an invalid code path); its behavior is stronger than a call to Debug.Assert. Assume does not require that the expression reduce directly to a true value, but it will produce an error if there is a false path leading to it.
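
A hypothetical sketch of how such annotations might look follows; the Account class, its Deposit method, and the exact annotation grammar are illustrative assumptions rather than syntax confirmed by NStatic. The //@ delimiters read as ordinary comments to the C# compiler, so the code builds unchanged.

    using System.Diagnostics;

    class Account
    {
        private int balance;

        public void Deposit(int amount)
        {
            //@ assume amount > 0;          // flagged only if some reaching path makes it false
            //@ int oldBalance = balance;   // ghost code, invisible to the compiler

            balance += amount;

            //@ assert balance == oldBalance + amount;   // checked symbolically; stronger than Debug.Assert
            Debug.Assert(balance >= amount);
        }
    }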

The specifications are modeled on both the Java Modeling Language (JML) and Spec#. There are quite a number of differences between the two. For instance, JML places method contracts before the method declaration, whereas Spec# places contracts just before the method block. While I followed Spec# in this case, my instinct is to prefer JML, because it is more of an industry standard and I am less likely to encounter Microsoft IP issues.

I strongly suspect that the next version of C# will incorporate some kind of design by contract.

There are a few differences from both JML and Spec#.

  • Arbitrary functions are supported in specification expressions, including impure functions (functions containing side effects). Side effects that occur inside the expression of a single specification are not visible to other parts of the program, including adjacent specifications.

I have never taken a course in formal methods, so I am not fully aware of the implications of my approach.

August 18, 2007

Scalability

I wrote my NStatic tool to perform quickly, with the goal of scaling linearly with the size of the code base. Analysis of a single terminal function is done in a single pass and is basically linear in the size of the function. Performance is impacted more by the call depth of some of the functions than by the number of functions in the code base. Even with deep calls, I found ways of keeping the analysis tractable through lazy and cached evaluation.
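
As a generic illustration of the caching idea (a sketch of the general technique, not NStatic's actual internals): each function body is summarized once, and the summary is reused at every call site, so deep call chains do not force the analysis to be repeated.

    using System.Collections.Generic;

    class SummaryCache
    {
        private readonly Dictionary<string, string> cache = new Dictionary<string, string>();

        public string GetSummary(string functionName)
        {
            string summary;
            if (!cache.TryGetValue(functionName, out summary))
            {
                summary = AnalyzeBody(functionName);   // expensive, performed once per function
                cache[functionName] = summary;
            }
            return summary;                            // cheap for every subsequent call site
        }

        private string AnalyzeBody(string functionName)
        {
            // Stand-in for the real per-function analysis pass.
            return "summary of " + functionName;
        }
    }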

The main issues I am currently grappling with are some scalability problems that I discovered while scanning Rotor (the Shared Source Common Language Implementation). I found some interesting errors in Rotor, which I'll eventually post on my blog.

I also found some gotchas. My tool zipped through hundreds of functions in a split second before stalling in a function that wasn't all that complex. Some of my transformations, while seemingly simple, exhibit exponential behavior. This occurs when a function application duplicates an argument:

f(x) => x op x

Repeated forward use of such transformations can lead to large expressions that take a long time to evaluate or even print.

I found a general, elegant solution that I have been implementing. Ideally, the normal form should be at least as compact as the original form [considering the fully expanded lambda expression representation of f], and preferably more so. With constant and atomic arguments, this is the case. For more complex arguments, the reverse transformation is applied: the left-hand side of the transform becomes the normal form. It's counterintuitive, but there's no reason why we can't continue to simplify this normal form, since the two sides of the equation have the same value.
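
The sketch below quantifies the blow-up; it is an illustration constructed for this post rather than NStatic's actual expression representation. With a rewrite like f(x) => x op x, fully expanding n nested applications yields an expression of roughly 2^n nodes, whereas keeping the applications themselves as the normal form grows only linearly.

    class ExpressionGrowth
    {
        // Size of the fully expanded form of f applied n times to an atom,
        // where f(x) => x op x duplicates its argument at every level.
        static long ExpandedSize(int n)
        {
            return n == 0 ? 1 : 2 * ExpandedSize(n - 1) + 1;
        }

        // Size when the application f(arg) itself is kept as the normal form.
        static long UnexpandedSize(int n)
        {
            return n + 1;
        }

        static void Main()
        {
            for (int n = 0; n <= 30; n += 10)
            {
                System.Console.WriteLine("depth " + n + ": expanded " + ExpandedSize(n)
                    + " nodes, unexpanded " + UnexpandedSize(n) + " nodes");
            }
        }
    }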

July 26, 2007

VB & Other .NET Languages

Send me mail if you are interested in beta testing experimental Visual Basic or IL+PDB support (for other .NET languages, including C++/CLR--please specify the language), if and when I include it in an NStatic beta. I don't know whether I'll be able to handle DLR-based languages.

There's a slight chance of support for other languages in the beta, but it would be experimental; the primary focus is C#. Any other language support will come in a minor update. I don't have much experience working with CLR languages other than C#.

If VB support is introduced, it will likely have rich editor and IntelliSense support, but some parts of the analysis may exhibit C# semantics, except when the analysis is performed indirectly through IL. All other languages will have error highlighting support.

July 22, 2007

Omniscient Debugging

Bill Lewis of Tufts University, a computer scientist who has worked on natural language processing, gave a fascinating Google TechTalks lecture on "Debugging Back in Time." He also has a page dedicated to this new idea of omniscient debugging, where the debugger knows everything.

The omniscient debugger records every state change in the entire run of the program. Bill states that omniscient debugging gives the programmer a unique view of the program that eliminates the worst aspects of breakpoints:

  • No guessing where to put breakpoints
  • No extra steps to debugging
    • Set breakpoints, start, examine, continue
  • No fatal mistakes... (no, "Whoops, I went too far")
  • No non-deterministic problems
  • Customers can email a debugging session to developers

An omniscient debugger can be extremely effective in understanding and fixing bugs. Bill's description and walkthrough of his prototype omniscient debugger (which, in his words, is already more effective than any commercial debugger) seem to offer a genuine "silver bullet" approach to debugging.

His approach to omniscient debugging logs every state change in the program. This has the obvious disadvantages of slowing down execution and consuming disk space, though Bill claims to have found a practical way of doing it. The drawbacks are similar to those of a profiler, but even more pronounced, because every state change is logged, not just function calls.

I noticed a long time ago that my own analysis tool, which uses a completely symbolic representation of a program's abstract state space, could in theory recreate the entire run of a program from just a compact set of critical assumptions.

This could be even more effective than logging, because of the superior time and space savings of my approach. NStatic can execute a long series of lengthy nested loops in less than a millisecond; it can allocate a gigabyte-sized array and fill it with values from any algorithm, and it would still take just the few kilobytes of memory needed to represent the algorithm itself.
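
A hypothetical program of that kind is sketched below, purely for illustration: a logging debugger must record every one of its hundreds of millions of element writes, while a symbolic representation needs only the loop bounds and the closed form a[i] == 2 * i to reconstruct any intermediate state on demand.

    class SymbolicVersusLogging
    {
        static void Main()
        {
            int[] a = new int[250000000];   // roughly a gigabyte of int storage
            for (int i = 0; i < a.Length; i++)
            {
                a[i] = 2 * i;               // every write is one more state change to log
            }
        }
    }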

There might even be some benefit to combining my "symbolic" debugger with Mike Stall's C#/IL managed debugger sample, so that I could retroactively incorporate symbolic reasoning into an actual just-in-time debugging session. One developer actually created Deblector, which uses Reflector to decompile and debug IL on the fly.

I did consider at one time supporting the ability to step forward and backward through the execution history. In NStatic Directions, I mentioned making NStatic more of a debugger than a static analysis tool.

Making C# feel more like a dynamic, interpreted, REPL language through the use of a symbolic interpreter. Code is always live and can be statically debugged. As Erik states, programming will “shake off the yoke of the dreaded (edit, compile, run, debug)* loop and replace it with a lightweight (edit=compile=run=debug) experience.”

NStatic does already feel like a debugger, with the ability to watch and execute code immediately, as well as to apply new assumptions.

July 21, 2007

Methods, Part 1

Method handling is difficult in static analysis. I will describe in a few parts some of the issues that I have had to deal with.

I demonstrated my handling of recursive functions in an earlier post, Loops, Part 1. In this post, I will show how I elegantly handle virtual functions. Below is the code for three classes: a base class, Animal, and two derived classes, Cat and Dog. Animal is an abstract base class, so no implementation actually exists for one of its member functions.

Each animal can make a noise and has a type.
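
The original listing appeared as an image; the following is a plausible reconstruction based on the description above, with member names Say and Type taken from the surrounding text (the exact original code may have differed).

    abstract class Animal
    {
        // No implementation exists in the abstract base class.
        public abstract string Say();

        // Each animal also exposes its type through a virtual property.
        public virtual string Type
        {
            get { return "Animal"; }
        }
    }

    class Dog : Animal
    {
        public override string Say() { return "Bark"; }
        public override string Type { get { return "Dog"; } }
    }

    class Cat : Animal
    {
        public override string Say() { return "Meow"; }
        public override string Type { get { return "Cat"; } }
    }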

To test the three Animal classes, I wrote the test method below. We have three variables of type Animal: "spot," "polly," and "tabby." We know Spot is a dog through an Assert statement. We know Tabby is a cat because we just created one from scratch. However, Polly is an unknown and sounds as if it may be a parrot, a class that we haven't actually accounted for; or we could be mistaken, and it could actually be a dog or a cat.
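
Likewise, here is a reconstruction of the test method just described, building on the classes sketched above; passing spot and polly in as parameters is an assumption made to stand for values whose concrete types the analysis cannot see.

    using System;
    using System.Diagnostics;

    class AnimalTest
    {
        static void Test(Animal spot, Animal polly)
        {
            Debug.Assert(spot is Dog);       // we know Spot is a dog through an Assert
            Animal tabby = new Cat();        // we know Tabby is a cat by construction

            Console.WriteLine(spot.Say());   // "Bark"
            Console.WriteLine(tabby.Say());  // "Meow"
            Console.WriteLine(polly.Say());  // unknown: Polly could be a dog, a cat, or something unaccounted for
        }
    }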

So, I run the code through NStatic and extract all the local variables and their member variables after the scan. Here's what the resulting Locals window looks like.

Whether we assign a variable or assume it, we know through the variables of the form "isXaY" that Spot is a dog and not a cat, and that Tabby is a cat and not a dog. We also know that Spot says "Bark" and Tabby says "Meow." We can also tell each animal's type from Animal's virtual property, Type. However, the Polly variable is mysterious. Thus, the Locals window has to represent all information about Polly's nature and behavior through conditional expressions and possibly generated values (in the case of Animal.Say(), which has no implementation).

 
