Wednesday, June 3, 2015

Gitlab Wish List

I started using gitlab at work. I looked at several git web guis, but I ended up liking gitlab the best. I'm using the community edition, and it's hosted on a machine on our internal network.

If you're not familiar with it, it's a lot like github, except that it is open source, and you can run it locally on your own hardware. The company behind it also sells an Enterprise Edition. And like github, you can host your repos at gitlab.com. They even offer private repos for free.

But to me, being able to host it locally is its biggest selling point over github. And it's hard to beat the price tag of free.

So if you need something like that, I do recommend gitlab, and it is what I'm using for that purpose.

This post is mainly going to be about the weird way I'm using gitlab, and about the features I wish it had. Don't take this to mean I don't like gitlab; it is what I'm using, after all.

Keep in mind that gitlab has very frequent updates. If you run into this post a year from now, it's possible none of my complaints will be valid.

My Setup

Several projects are normal git projects, but one project in particular, which is also the largest project, is not.

That project officially lives in svn.  I have a git-svn clone of it with a carefully created configuration. There's a cronjob that every five minutes does 'git svn fetch' and 'git push gitlab'. There's a special user, named svnbot, that the pushes are done under, so that they don't all show up as my user account under activity.

I think my setup is unusual in that I'm mirroring svn on gitlab on a continuous basis, instead of doing a one-time import and using only git from then on.

The Workflow

Many users checkout from svn, and commit with svn, and don't need to care about gitlab.

Others clone from gitlab, make their changes, and then 'git svn dcommit' back to svn. I have a script, hosted on our gitlab, that runs 'git-config' a whole bunch of times in order to configure git-svn the same way that svnbot's is configured, so that this can actually work. (It also takes care of the authors file and a few other things.)

So after doing a clone, there's a script you have to run before you can 'git svn dcommit', but after that things are pretty normal. You can fetch new changes with either 'git fetch' or 'git svn fetch'.

Topic Branches and Merge Requests

Things get interesting once people start using topic branches. Now people can submit Merge Requests in gitlab (which are like Pull Requests on github), and get their code reviewed before they commit it back to svn.

Now the flow is something like
  • Pull from gitlab
  • Create new topic branch
  • Push topic branch to gitlab
  • Iterate on topic branch for a while
  • Submit MR
  • Possibly push new commits in response to feedback from MR
  • Optionally rebase
  • git-svn dcommit

One Way

My svn to gitlab syncing only goes in one direction. Because of that, Merge Requests should be upvoted but not Merged.

As an aside, the upvote feature on Merge Requests was really hard to discover.

There's a lot of branches from svn. In particular, all tags show up as branches. So I've protected the most common branches so that only svnbot can push to them. But there's no mass protect button, so most branches that should be protected are not.

Problems

Who Got Notified?

I just pushed a new commit to a Merge Request, did the assignee get notified? I just commented on a commit, did the author get notified, or do I need to tell them to look at it?

I'm never certain who got notified by what I just did. I wish there was a way to just see that information somewhere. That way I could be sure.

In most cases, the answer seems to be that they did get notified. It's just the uncertainty that bothers me.

SVN Revisions

Svn has its own revision numbers. git-svn puts these numbers, along with the branch url, in a git-svn-id line at the end of the commit message of every commit.

Now, I don't expect gitlab to have special support for git-svn. But I do expect its search feature to search commit logs. If it did support that feature, then I could search by svn revision, since the svn revisions happen to be in the logs.

Sadly, it doesn't yet support that feature.

Performance

Gitlab's performance is ok most of the time. Since it's on our own hardware, most performance problems are our own fault and not gitlab's.

But it does get bogged down on some of the larger files I sometimes have to deal with.

One of the git repos is an import of the Asterisk source code. These guys used a one file per module approach for a long time, and a few of their files got way out of hand. The biggest offender is chan_sip.c, which is around 80,000 lines long. (I'm not kidding.)

When I first started using gitlab, it would fail to load that file, and it would time out with a 500 Internal Server Error. But the newer version we're currently running is a bit better and can load it, but not quickly.

One could argue that an 80,000 line .c file is unreasonable. And one might be right, but I'm not in a position to fix it. I just need to deal with it, and so do the tools I use.

Besides, vim and the cli git command deal with it just fine. In fact, with the exception of the blame command, git is always unreasonably fast. This causes me to expect any GUI on top of it to be just as fast.

Sharing Links

One nice thing about using gitlab is that it gives me a way to link to code. It even has a #linenumber in urls, so I can link someone to a specific line of code.

It could be better though. I can't link to a range of lines, and sometimes it fails to scroll to the linked-to line. It could probably do a better job of highlighting the linked-to line.

I can link to particular lines in a diff. Well, I think that feature actually broke for a while, and then they fixed it.

One thing I can't do is generate a diff of a subtree. I want to diff a...b of /foo/bar/, but I can only diff a...b of /.

Diffs

I really enjoy having syntax highlighted diffs. The ability to link to and to comment on specific lines is great. I can comment on both Merge Requests, and on specific commits. So I can still tell someone "this line here is wrong!" even if they skipped code review.

Besides using it as a place to pull from and push to (and anyone can do that), the Diff view is the main feature I use. (Often in conjunction with Merge Requests.)

But I do have some complaints.

The word-diff feature isn't very smart. If two lines in a row are changed, that shows up as two minus lines and two plus lines. That seems to be enough to defeat the per-word diff highlighting, so it doesn't activate. Which is a shame, because when it works it's a very useful feature.

It might be nice to have an option for "context" diffs, the ones that are like unified but have "!" in addition to "+" and "-". "!" means the line was modified. (Unified just shows it being removed and the new modified line being added.) I normally prefer Unified, but I wonder if with word diff support, context diffs might be nicer than unified, at least sometimes.

There's a "..." in the margin to show more context on the diff. Unfortunately, it doesn't let you control whether you want more from above or below the "...". There's also no way to say "just show me the whole file!".

There's a git TUI named "tig" that lets you increase the number of context lines (or decrease it) with the [ and ] keys. It would be neat if gitlab had a way to do that, although clicking the "..." is often closer to what I want. ('tig' is actually pretty cool, and I would encourage the gitlab people to use it and steal ideas from it where it makes sense.)

The hunk line is displayed way too lightly. That line displays the starting and ending line numbers of before and after the diff, which isn't needed in a GUI like this. But it also displays the enclosing function, which is very useful information.

It's stupid to make it too light to read. Either hide it completely, or make it readable. And do the second of those, because I want to see it.

The diff view doesn't respect the theme.

It would be awesome if it could do filetype highlighting and diff syntax highlighting at the same time. Github does that now, so hopefully it's just a matter of time.

It would be even more awesome if it could do code within code within diff highlighting. I might have some SQL inside of some PHP, and I might have added a new column to the SELECT list, and there might already be 10 columns selected, so I really need the word diff highlighting.

Blame

One useful command is 'git blame', and gitlab has support for a blame view of files in its interface. What this does is, for every line in the file, it shows you who last touched that line and what commit that was.

While the name implies it's for assigning blame, really it's for looking at the commit that introduced a given change in its entirety.  Sometimes it does lead to asking the author some questions though.

Gitlab used to have a bug where the blame view would always show the blame for the master branch, regardless of what revision you were on. This was really confusing. Fixing this bug is currently my only contribution for gitlab. (It was a one-liner.)

I do wish that gitlab supported a more advanced Blame view though. Often the first commit isn't the one that introduced a line, and often that's the one you're after. So I usually have to dig through multiple commits and do multiple blames to find what I'm after.

Things like ignoring white space and some of the other options git supports would be a start.

Just having a link is very useful, so that's already a good start.

But I feel like this is a place they could innovate. Some kind of onion-skin thing, like what's used for animation packages? I don't know.

Cherry-picking Management

At least one of the git web GUIs I ran into has support for Cherry-picking. I don't remember exactly what it supported, but I do have an idea of the features I would want.

So, the idea is bug fixes get committed to your development branch, but then need to be cherry-picked back to your release branch or branches. You can't merge because that would bring all dev changes to your release branch, and that would be bad.

Git itself doesn't really track cherry-picks though. This is unlike svn, whose mergeinfo property does track cherry picks.

Svn can list revisions available for cherry-picking (which it calls merging) with the 'svn mergeinfo' command. That same command can list revisions that have been cherry-picked.

I don't have much experience using the git-cherry-pick command. I know it has the "-x" option, which strangely used to be the default but isn't anymore. There's also 'git cherry' and 'git patch-id'.

Given two branches with a common ancestor, I want gitlab to be able to display commits available for cherry-picking, commits already cherry-picked, and commits that have been manually excluded from cherry-picking. 

Then I want to be able to mark commits as already cherry-picked, or "not going back". Then I want to be able to see the remaining list.

Being able to cherry-pick from inside of gitlab (a Cherry-Pick Request could be a special kind of Merge Request) would be nice too. And picking up on cherry picks done outside of gitlab automatically would also be nice. But those features aren't as important to me. It's the keeping the list of available cherry-picks that I consider most important. 


Summary of Complaints/Wish List

  • No way to tell who got notified when you comment, commit, etc
  • There's no way to Close a Merge Request and indicate that it was Accepted, any closed MRs that were not merged show in Red as if they were rejected
  • No way to protect branches in bulk
  • You can't search by commit log message, which means you can't search by svn revision
  • Diff highlighting: word diff sometimes doesn't work
  • Diff highlighting: it would be great if it could do code syntax highlighting and diff syntax highlighting at the same time
  • Performance sucks on very large files or very large diffs
  • The "..." that expands a diff doesn't let you choose up or down
  • For some reason, the file viewer is themed, but the diff viewer doesn't respect the theme. But the diff viewer is the main one I use!
  • The @@ hunk line in diffs is displayed too lightly, it's too hard to read, but contains useful info like the enclosing function
  • Can't diff a subpath
  • Better Blame
  • Cherrypick tracking/management


Sunday, May 4, 2014

Introducing emplacer: allocate subtypes of an abstract base class directly in a container

The Problem

A few weeks ago I was working on a toy C++ project. I needed something similar to the HTML DOM, so I created a few classes.

class Node {
public:
/* ... */
        virtual std::string nodeName() = 0;  // At least one pure virtual method
};

class TextNode : public Node {
/* ... */
};

class ElementNode: public Node {
/* ... */
};

But then I did something dumb. I tried to make a vector of Nodes.
typedef std::vector<Node> NodeList;

And of course, when I went to use my NodeList, the compiler spewed a page of errors at me.  I was confused for a moment, but then it became obvious. You can't create an instance of an abstract base class. That's what the "abstract" part of "abstract base class" means. But creating a vector of them is asking the STL to do exactly that.

If you're not familiar with abstract base classes in C++, I suggest you read a tutorial on polymorphism in C++, such as this one. But suffice it to say, you create pointers or references to them, and the pointers point to (and the references refer to) instances of concrete derived classes that inherit from the base class.  So you might have an ElementNode, and call a function that takes a Node&, and pass your ElementNode& to it instead, and it can call nodeName() on it, and the right function is called at runtime. (The compiler adds a pointer to a vtable somewhere in the object, and that vtable is consulted to see which function to call.)

The Real Problem

So I told the compiler I wanted a std::vector<Node>. And the compiler told me no, that doesn't make sense. And it's right, of course.

So what did I really want? Well, I wanted a vector that can hold any subclass of Node. So how do I ask the compiler to do that?

(As a side note, the normal way to do this is to create a std::vector<Node*>, or perhaps a smart pointer version of that. I didn't want to do that because I didn't want separate memory allocations for each element of the vector.)

The most obvious way is with a union.

struct NodeAny {
        enum {TEXT, ELEMENT} nodeType;
        union {
                TextNode textNode;
                ElementNode elementNode;
        };
};

But there's a bunch of problems with that.

  • You need a way to call the constructor of whichever node type it currently is.
  • The "nodeType" member is duplicating the purpose of the vtable. By that I mean, if you write some methods on this class, they're probably all going to switch on nodeType. But that's dumb, because that's what C++'s runtime subtype polymorphism does for you.
  • There's also no way to make this generic, you can't write a variadic template union that takes a series of types to union together.
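To make the first two problems concrete, here's a rough sketch of what the union version ends up looking like once the constructor and destructor calls are written by hand. The bodies of Node, TextNode and ElementNode are elided above, so the ones here are made-up stand-ins, and the constructor arguments are an assumption for illustration only.

```cpp
#include <cassert>
#include <new>
#include <string>
#include <utility>

// Made-up stand-ins for the classes above (their real bodies are elided).
class Node {
public:
        virtual ~Node() {}
        virtual std::string nodeName() = 0;
};

class TextNode : public Node {
public:
        explicit TextNode(std::string text) : text_(std::move(text)) {}
        std::string nodeName() override { return "#text"; }
private:
        std::string text_;
};

class ElementNode : public Node {
public:
        explicit ElementNode(std::string tag) : tag_(std::move(tag)) {}
        std::string nodeName() override { return tag_; }
private:
        std::string tag_;
};

struct NodeAny {
        enum NodeType {TEXT, ELEMENT} nodeType;
        union {
                TextNode textNode;
                ElementNode elementNode;
        };

        // The union members have non-trivial constructors and destructors,
        // so the compiler won't run them for us. We do it by hand.
        NodeAny(NodeType type, const std::string &arg) : nodeType(type) {
                if (type == TEXT)
                        new (&textNode) TextNode(arg);
                else
                        new (&elementNode) ElementNode(arg);
        }

        ~NodeAny() {
                if (nodeType == TEXT)
                        textNode.~TextNode();
                else
                        elementNode.~ElementNode();
        }

        // Copying would need yet another switch, so it's disabled in this sketch.
        NodeAny(const NodeAny &) = delete;
        NodeAny &operator=(const NodeAny &) = delete;

        // Every method ends up switching on nodeType, duplicating the vtable.
        std::string nodeName() {
                return nodeType == TEXT ? textNode.nodeName()
                                        : elementNode.nodeName();
        }
};

// Small usage example.
inline void union_example() {
        NodeAny text(NodeAny::TEXT, "hello");
        NodeAny element(NodeAny::ELEMENT, "div");
        assert(text.nodeName() == "#text");
        assert(element.nodeName() == "div");
}
```

It works, but every new node type means touching the enum, the union, the constructor, the destructor and every method.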

The Solution

So I complained to my friend Jeremy that C++ should be able to do this. This isn't Java, I shouldn't have to do a separate allocation for each object. (Side note: I've heard that Java's heap is actually faster than C's. Since it's garbage collected, it can basically just use a stack, and then compact things later.)

In the past, he had also wanted this functionality, and so my complaints spurred him to create emplacer. I then forked it and attempted to improve it, and added an example.

emplacer acts much like a smart pointer to the base class. This means you have to use * or -> to access the pointed-to class.

Originally, to define one, you would do something like this:

typedef type_collection<ElementNode, TextNode> NodeSubclasses;
typedef emplacer<Node, NodeSubclasses> NodeAny;

But once we started really using it, we realized a problem with that. The compiler and debugger (and also valgrind) always expand out the name instead of using the typedef. Once we had 12 or so subclasses, whenever we made a compiler error, we started seeing some of the longest typenames we had ever seen. (And in C++, that is saying something.)

Eventually I realized that we could replace the second typedef with a new class, like this:

struct NodeAny : public emplacer<Node, NodeSubclasses> {
using emplacer::emplacer;
};

The using line is a C++11ism, and pulls in the constructors from the base class emplacer.

How It Works

emplacer has two private members, "char data[]" and "bool live".

The size and alignment of "data" are determined at compile time by the template arguments. This is done using type_collection, which has constexpr methods to calculate the max alignment and max size of any of the types it holds.
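Here's roughly how a compile-time max size and max alignment can be computed over a list of types. This is only a sketch: type_collection_sketch and storage_sketch are names I made up for illustration, and the real type_collection may be written quite differently.

```cpp
#include <cstddef>

// Hypothetical stand-in for type_collection. Recursive case: compare the
// first type against the maximum of the remaining types.
template <typename T, typename... Rest>
struct type_collection_sketch {
        static constexpr std::size_t max_size() {
                return sizeof(T) > type_collection_sketch<Rest...>::max_size()
                        ? sizeof(T)
                        : type_collection_sketch<Rest...>::max_size();
        }
        static constexpr std::size_t max_align() {
                return alignof(T) > type_collection_sketch<Rest...>::max_align()
                        ? alignof(T)
                        : type_collection_sketch<Rest...>::max_align();
        }
};

// Base case: a single type.
template <typename T>
struct type_collection_sketch<T> {
        static constexpr std::size_t max_size() { return sizeof(T); }
        static constexpr std::size_t max_align() { return alignof(T); }
};

// A buffer big enough, and aligned strictly enough, for any type in the list.
template <typename Collection>
struct storage_sketch {
        alignas(Collection::max_align()) char data[Collection::max_size()];
};

// Everything is known at compile time, so it can be checked there too.
static_assert(type_collection_sketch<char, char[16], char[4]>::max_size() == 16,
              "largest member decides the size");
static_assert(type_collection_sketch<char, double>::max_align() == alignof(double),
              "strictest member decides the alignment");
static_assert(sizeof(storage_sketch<type_collection_sketch<char, double>>) >= sizeof(double),
              "storage can hold the largest member");
```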

So each emplacer instance is big enough to hold the largest subclass. It uses placement new and manual destructor calls on "data".

"live" keeps track of whether or not the emplacer instance has been "emplace"ed. Or in other words if it has called placement new on "data". Most operations throw a std::logic_error if the object is not live.

emplacer implements operator*() and operator->(). They both do "return *reinterpret_cast<Type *>(data);", where Type is the abstract base class, and where they're defined to return by reference. The rest of the magic is done for us by C++'s subtype polymorphism.
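To show how the pieces fit together, here's a stripped-down sketch of the idea. It is not the real emplacer: emplacer_sketch takes the size and alignment directly instead of a type_collection, it isn't copyable, and the Shape classes are made up for the example. Like the real thing, it assumes the base class sits at the start of each derived object, which holds for ordinary single inheritance.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <stdexcept>
#include <type_traits>
#include <utility>

// Hypothetical, simplified stand-in for emplacer.
template <typename Base, std::size_t Size, std::size_t Align>
class emplacer_sketch {
        static_assert(std::has_virtual_destructor<Base>::value,
                      "Base needs a virtual destructor");
public:
        emplacer_sketch() : live(false) {}
        ~emplacer_sketch() { destroy(); }
        emplacer_sketch(const emplacer_sketch &) = delete;
        emplacer_sketch &operator=(const emplacer_sketch &) = delete;

        // Destroy whatever is there, then construct a Derived in place.
        template <typename Derived, typename... Args>
        emplacer_sketch &emplace(Args &&... args) {
                static_assert(std::is_base_of<Base, Derived>::value, "not a subclass");
                static_assert(sizeof(Derived) <= Size, "Derived does not fit");
                static_assert(alignof(Derived) <= Align, "Derived is over-aligned");
                destroy();
                new (data) Derived(std::forward<Args>(args)...);
                live = true;
                return *this;
        }

        Base &operator*() { return *get(); }
        Base *operator->() { return get(); }

private:
        Base *get() {
                if (!live)
                        throw std::logic_error("emplacer_sketch is not live");
                // Assumes the Base subobject is at offset 0 of the derived object.
                return reinterpret_cast<Base *>(data);
        }
        void destroy() {
                if (live) {
                        live = false;
                        reinterpret_cast<Base *>(data)->~Base();  // virtual call
                }
        }

        alignas(Align) char data[Size];
        bool live;
};

// Made-up hierarchy for the example.
struct Shape {
        virtual ~Shape() {}
        virtual double area() const = 0;
};
struct Rect : Shape {
        Rect(double w, double h) : w(w), h(h) {}
        double area() const override { return w * h; }
        double w, h;
};
struct Circle : Shape {
        explicit Circle(double r) : r(r) {}
        double area() const override { return 3.141592653589793 * r * r; }
        double r;
};

typedef emplacer_sketch<Shape, sizeof(Rect), alignof(Rect)> ShapeAnySketch;

// Small usage example: one object, two different subclasses over its lifetime.
inline void emplacer_sketch_example() {
        ShapeAnySketch shape;
        shape.emplace<Rect>(3, 4);
        assert(shape->area() == 12);
        shape.emplace<Circle>(1);  // destroys the Rect, builds a Circle in place
        assert((*shape).area() > 3.14 && (*shape).area() < 3.15);
}
```

Note that operator-> here returns a plain pointer, which is the simplest thing that lets "->" chain through to the base class.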


How To Use It

First you'll need a class hierarchy with an abstract base class and some polymorphic subclasses. (I.e. the base class should have a bunch of pure virtual functions for the derived classes to implement.)

Then you'll want to create a typedef of the type_collection template, and subclass of emplacer. There's an example of that in the "The Solution" section, and also in the example .cc file in the gist.

You can then make an instance on the stack just like any other class.

ShapeAny my_shape;

If you're using my version, and if the first type in the type_collection is default constructible, then my_shape runs the default constructor of that first type. Otherwise, my_shape is not live, and trying to dereference it throws an exception.

In any case, you can run emplace on it to initialize or reinitialize it.

my_shape.emplace<Circle>(1);

If the object was live, it first calls the destructor by hand, and then calls the constructor, using placement new. Remember all of this is done on the stack memory (or heap, or global, or wherever it's allocated) that my_shape is already using. No additional allocations are done by emplacer. (But of course the class constructed can allocate memory in its constructor.)

My version also has copy and move constructors that accept the SubTypes, so you can do something like this:
ShapeAny my_shape(Circle(1));

(Speaking of copy constructors, emplacer implements a real copy constructor too, so you can copy instances of the emplacer.

This was kind of tricky to implement, since constructors cannot be virtual. At one point I was storing a pointer to the constructor of the last constructed type. But I didn't like taking up extra memory for that, so now I'm using type_collection and comparing typeids until I find the right class to construct. This uses less memory but is probably slower.)
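Here's a rough sketch of the typeid-comparing approach, for the curious. clone_into is a made-up helper and not the code in the gist; it just walks a list of candidate types until one matches the dynamic type of the source, then copy-constructs it with placement new. The Shape classes are also made up for the example.

```cpp
#include <cassert>
#include <new>
#include <typeinfo>

// Made-up hierarchy for the example.
struct Shape {
        virtual ~Shape() {}
        virtual double area() const = 0;
};
struct Rect : Shape {
        Rect(double w, double h) : w(w), h(h) {}
        double area() const override { return w * h; }
        double w, h;
};
struct Circle : Shape {
        explicit Circle(double r) : r(r) {}
        double area() const override { return 3.141592653589793 * r * r; }
        double r;
};

// Base case: ran out of candidate types, so nothing was constructed.
template <typename Base>
Base *clone_into(void *, const Base &) {
        return nullptr;
}

// Recursive case: if the dynamic type of src is exactly T, copy-construct a T
// into dst. Otherwise try the remaining candidates.
template <typename Base, typename T, typename... Rest>
Base *clone_into(void *dst, const Base &src) {
        if (typeid(src) == typeid(T))
                return new (dst) T(static_cast<const T &>(src));
        return clone_into<Base, Rest...>(dst, src);
}

// Small usage example: copy a Circle through a Shape reference.
inline void clone_example() {
        Circle original(2);
        const Shape &as_base = original;

        alignas(Rect) char buf[sizeof(Rect)];  // big enough for either type here
        Shape *copy = clone_into<Shape, Rect, Circle>(buf, as_base);

        assert(copy != nullptr);
        assert(typeid(*copy) == typeid(Circle));
        assert(copy->area() == original.area());
        copy->~Shape();  // placement new means we destroy by hand
}
```

The search is linear in the number of subclasses, which is the "probably slower" part.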

You can call any method of the base class on it by using "->" instead of ".". You can pass it to any function that takes a base class reference by passing in *my_shape, or if it takes a pointer, &*my_shape.

As alluded to earlier, you can also use it in containers. Here's an example:
 std::vector<ShapeAny> shapes;

 shapes.emplace(shapes.end())->emplace<Rect>(3, 4);
 shapes.emplace(shapes.end())->emplace<Square>(3);

 //emplace returns reference to *this, so you can use it like this too
 shapes.push_back(ShapeAny().emplace<Rect>(5, 6));
 shapes.push_back(ShapeAny().emplace<Circle>(1));

 for (auto shape: shapes) {
  std::cout << shape->area() << std::endl;
 }

Conclusion

I think that emplacer is pretty useful. I searched, but I couldn't find anything else online like it. Boost has a couple of classes like boost::any and boost::variant, but both of them solve related but different problems.

While I suspect that emplacer is fast, I haven't done any benchmarking yet. The copy/move constructors and assignment operators might be slow since they search the class hierarchy, though I'm hoping the compiler can make them fast anyway.

One thing that I find interesting is that subtype polymorphism is more or less equivalent to a tagged union. But the former tends to be much easier to work with in C++. But there's a gap with what you can do with base classes, and emplacer fills that gap.

I hope you find it useful.

Tuesday, November 6, 2012

Internet voting

Here's how I would do internet voting.

The objectives are: voting should be anonymous, verifiable, support any system (instant runoffs would be nice), and early voting/revoting should be allowed. Also only those allowed to vote should be able to cast a ballot.

First, you have to know what a public/private keypair is.

For our purposes, a keypair consists of a private key and a public key. A computer program generates the two of them. The private key must never be shared. The public key is shared with the world.

The private key can be used to sign things. The public key can be used to verify a signature. If a signature verifies, it must have been signed by whoever has the private key.

First, we give every voter a custom device that is similar to a USB key. Or maybe they have to purchase it from a vendor. Either way, they have to have this device to vote using this system.

The USB key isn't a normal USB key. It contains keypairs. It doesn't allow you to access or copy the private key, instead it allows you to submit data to be signed, and it returns a signature. It has a few other operations, as required to implement the scheme described here. We'll call the USB device a Voting Key, or VK for short.

The voter takes their VK down to the courthouse or DMV. The VK has a keypair on it that identifies the voter. The clerk checks the voter's ID, and then inserts the VK into their computer. They copy the public key and sign it with their own key, and upload it to their server. At the same time, the voter enters a password or passphrase into the computer. This process only has to happen once per voter. They can repeat it if they lose or break their VK.

Now they go home, go onto their computer, and vote. Here's what happens.

The voter inserts their VK into their computer. They have or install a special driver. They load a program or go to a special website. The first step is to obtain a ballot.

Their VK contains 1000 additional keypairs preloaded on it. (Or maybe they're generated after it's purchased.) These keypairs serve as ballots.

The program uses the VK and gets the public key of a new ballot. It signs the ballot with the voter's private key, and then sends this as a request to the server. The voter also enters his password, and this is hashed with a nonce and included in the request, to provide two-factor authentication. The server verifies the signature against the voter's registered public key, marks the voter as having obtained a ballot, and adds the ballot public key to a (publicly accessible) list of valid ballots for this election. It also signs the ballot, marking it as valid.

The original signature of the ballot made with the voter's private key is then discarded, and never made public. The server software will have to be carefully audited to ensure this happens. In this way, we're ensuring only registered voters can vote, but we are also keeping the ballots anonymous.

The server will refuse to sign a ballot if one has already been issued for this election, if the voter is dead, or if he's lost the right to vote. In this way we have ensured only people allowed to vote can vote.

The voter then uses the program to obtain the items he's voting on, makes his choices, and submits. The VK then uses the ballot private key to sign the filled out ballot, and this signature is uploaded to the server.

All of the completed ballots are made publicly accessible.  The voter can download all of the ballots, and the program can determine the election results. The voter can determine that his ballot is among those cast and that it is correct. But no one without access to the voter's VK can tell which ballot belongs to the voter.

If the election date isn't past yet, the voter can at any time recast his ballots, making different choices. The program uploads a new ballot. The ballot contains a sequence number, and the server signs it with a timestamp, and rejects invalid ballots or ballots with lower sequence numbers. In this way, we can allow early voting and allow changing your vote up until the closing of the polls.
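The recasting rule is simple enough to sketch. This leaves out all of the signing and timestamping and only shows the bookkeeping: BallotBox, CastBallot and their members are made-up names, the ballot key is just a string standing in for the ballot's public key, and treating an equal sequence number as a rejection is my assumption (the rule above only says lower ones are rejected).

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical record of one cast ballot. Signature checks are out of scope.
struct CastBallot {
        std::string ballotKey;  // stands in for the ballot's public key
        unsigned sequence;      // goes up each time the voter recasts
        std::string choices;    // format deliberately left unspecified
};

class BallotBox {
public:
        // Called when the server has signed a ballot key as valid.
        void issue(const std::string &ballotKey) { issued_.insert(ballotKey); }

        // Returns true if the ballot was accepted and is now the current one.
        bool cast(const CastBallot &ballot) {
                if (issued_.count(ballot.ballotKey) == 0)
                        return false;  // not on the list of valid ballots
                auto it = latest_.find(ballot.ballotKey);
                if (it != latest_.end() && ballot.sequence <= it->second.sequence)
                        return false;  // stale (or repeated) sequence number
                latest_[ballot.ballotKey] = ballot;
                return true;
        }

        // The ballot that currently counts for this key, or nullptr if none.
        const CastBallot *current(const std::string &ballotKey) const {
                auto it = latest_.find(ballotKey);
                return it == latest_.end() ? nullptr : &it->second;
        }

private:
        std::set<std::string> issued_;
        std::map<std::string, CastBallot> latest_;
};

// Small usage example: vote, change your mind, then try to replay the old vote.
inline void revote_example() {
        BallotBox box;
        box.issue("ballot-key-1");

        assert(!box.cast({"unknown-key", 1, "A"}));
        assert(box.cast({"ballot-key-1", 1, "A"}));
        assert(box.cast({"ballot-key-1", 2, "B"}));
        assert(!box.cast({"ballot-key-1", 1, "A"}));
        assert(box.current("ballot-key-1")->choices == "B");
}
```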

Then, instead of relying on polls and surveys, we can watch the actual election results in real time, starting maybe 90 days before the end of the election. Instead of "go out and vote or your guy might lose!", it will be "your guy is losing, go vote!".

The format of the cast ballot is mostly unspecified here. That's because it doesn't matter, the system works regardless of what it is. In that way, it should work with any alternative voting system we want to use.

Friday, December 9, 2011

Hello World

This is my first blog post. There are many like it, but this one is mine.