Sunday, December 4, 2016

Building libvirt from git on macOS

Libvirt is available on macOS via homebrew, however, building a git version (rather than from a release tarball) might be a little tricky.

Recently, I've fixed some minor issues concerning building on macOS, so here is a little guide on building that on macOS.

The first thing needed is a relatively fresh rpcgen, as the one that comes with macOS is too old. I've created a homebrew formula for rpcgen from FreeBSD, you can grab it here: https://gist.github.com/novel/0d74cdbc7b71f60640a42b52c9cc1459 and install it. Please consult homebrew docs if you're not sure how to do that.

Now it's just a matter to point it to the proper rpcgen and build it as usual:

$ ./configure ac_cv_path_RPCGEN=/usr/local/bin/bsdrpcgen --without-sasl
$ make
$ make check

There are some tests failing, need to figure out why, but at least test suite is working.

Sunday, November 13, 2016

Playing around with valgrind in libvirt

Valgrind is a well known memory debugging tool. It doesn't seem to officially support FreeBSD, but, luckily, it's available in FreeBSD ports.

Using it is not hard (at least on a basic level), but requires some patience and attention.

Fortunately, libvirt comes provides an easy interface for adding unit tests, and it could be used for running with valgrind as well. Obviously, tests should cover code paths we want to test, but it's still much more convenient than triggering all the stuff manually.

Individual tests could be executed like this:

VIR_TEST_RANGE=22 VIR_TEST_DEBUG=1 ./tests/bhyvexml2argvtest

bhyvexml2argvtest is a test that checks that the bhyve driver converts XML to a proper set of actual bhyve(8) commands. VIR_TEST_RANGE tests to run only test #22 from the suite, and VIR_TEST_DEBUG=1 makes it produce more verbose output.

Adding valgrind into the game is fairly simple:

VIR_TEST_RANGE=22 VIR_TEST_DEBUG=1 valgrind --trace-children=yes --tool=memcheck --leak-check=full   ./tests/bhyvexml2argvtest

Note: I also configured libvirt with '-O0 -g' in CFLAGS for debugging.

It produces a bunch of information about leaks, for example:

==28853== 128 bytes in 1 blocks are definitely lost in loss record 134 of 173
==28853==    at 0x4C246C0: malloc (in /usr/local/lib/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==28853==    by 0x7D906D8: vasprintf_l (in /lib/libc.so.7)
==28853==    by 0x512876F: virVasprintfInternal (virstring.c:481)
==28853==    by 0x512886D: virAsprintfInternal (virstring.c:502)
==28853==    by 0x40D9A9: testCompareXMLToArgvHelper (bhyvexml2argvtest.c:109)
==28853==    by 0x40DFD8: virTestRun (testutils.c:180)
==28853==    by 0x40DD50: mymain (bhyvexml2argvtest.c:177)
==28853==    by 0x40FA77: virTestMain (testutils.c:992)
==28853==    by 0x40DE0E: main (bhyvexml2argvtest.c:193)
==28853== 
==28853== 128 bytes in 1 blocks are definitely lost in loss record 135 of 173
==28853==    at 0x4C246C0: malloc (in /usr/local/lib/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==28853==    by 0x7D906D8: vasprintf_l (in /lib/libc.so.7)
==28853==    by 0x512876F: virVasprintfInternal (virstring.c:481)
==28853==    by 0x512886D: virAsprintfInternal (virstring.c:502)
==28853==    by 0x40D9F5: testCompareXMLToArgvHelper (bhyvexml2argvtest.c:111)
==28853==    by 0x40DFD8: virTestRun (testutils.c:180)
==28853==    by 0x40DD50: mymain (bhyvexml2argvtest.c:177)
==28853==    by 0x40FA77: virTestMain (testutils.c:992)
==28853==    by 0x40DE0E: main (bhyvexml2argvtest.c:193)

Moving from bottom to top of these traces, it's fair to assume that the issue is somewhere in testCompareXMLToArgvHelper. It looks like this:

 96 
 97 static int
 98 testCompareXMLToArgvHelper(const void *data)
 99 {
100     int ret = -1;
101     const struct testInfo *info = data;
102     char *xml = NULL;
103     char *args = NULL, *ldargs = NULL, *dmargs = NULL;
104 
105     if (virAsprintf(&xml, "%s/bhyvexml2argvdata/bhyvexml2argv-%s.xml",
106                     abs_srcdir, info->name) < 0 ||
107         virAsprintf(&args, "%s/bhyvexml2argvdata/bhyvexml2argv-%s.args",
108                     abs_srcdir, info->name) < 0 ||
109         virAsprintf(&ldargs, "%s/bhyvexml2argvdata/bhyvexml2argv-%s.ldargs",
110                     abs_srcdir, info->name) < 0 ||
111         virAsprintf(&dmargs, "%s/bhyvexml2argvdata/bhyvexml2argv-%s.devmap",
112                     abs_srcdir, info->name) < 0)
113         goto cleanup;
114 
115     ret = testCompareXMLToArgvFiles(xml, args, ldargs, dmargs, info->flags);
116 
117  cleanup:
118     VIR_FREE(xml);
119     VIR_FREE(args);
120     return ret;
121 }

We can see that we call virAsnprintf a few times to populate xml, args, ldargs and dmargs, but later, in the cleanup section, we only free xml and args, leaking ldargs and dmargs, just like valgrind reported.

Now trying to fix that by adding two VIR_FREE calls for leaked variables and try again.

After running the test again, this warning is gone (yay!), so moving to the next one:

==29877== 296 (200 direct, 96 indirect) bytes in 1 blocks are definitely lost in loss record 146 of 171
==29877==    at 0x4C25C90: calloc (in /usr/local/lib/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==29877==    by 0x50C47D8: virAllocVar (viralloc.c:560)
==29877==    by 0x510D3A1: virObjectNew (virobject.c:193)
==29877==    by 0x510D4FC: virObjectLockableNew (virobject.c:219)
==29877==    by 0x51EBFC1: virGetConnect (datatypes.c:127)
==29877==    by 0x40D680: testCompareXMLToArgvFiles (bhyvexml2argvtest.c:34)
==29877==    by 0x40DA1B: testCompareXMLToArgvHelper (bhyvexml2argvtest.c:115)
==29877==    by 0x40DFF0: virTestRun (testutils.c:180)
==29877==    by 0x40DD68: mymain (bhyvexml2argvtest.c:179)
==29877==    by 0x40FA8F: virTestMain (testutils.c:992)
==29877==    by 0x40DE26: main (bhyvexml2argvtest.c:195)
==29877== 

And indeed, it looks like connect object is leaking here:

 22 static int testCompareXMLToArgvFiles(const char *xml,
 23                                      const char *cmdline,
 24                                      const char *ldcmdline,
 25                                      const char *dmcmdline,
 26                                      unsigned int flags)
 27 {
 28     char *actualargv = NULL, *actualld = NULL, *actualdm = NULL;
 29     virDomainDefPtr vmdef = NULL;
 30     virCommandPtr cmd = NULL, ldcmd = NULL;
 31     virConnectPtr conn;
 32     int ret = -1;
 33 
 34     if (!(conn = virGetConnect()))
 35         goto out;
  --- skip ---
 82  out:
 83     VIR_FREE(actualargv);
 84     VIR_FREE(actualld);
 85     VIR_FREE(actualdm);
 86     virCommandFree(cmd);
 87     virCommandFree(ldcmd);
 88     virDomainDefFree(vmdef);
 89     return ret;
 90 }

So adding virObjectUnref call for the connection object and trying again. And after this there's only one warning left:

==30532== 7 bytes in 1 blocks are definitely lost in loss record 6 of 169
==30532==    at 0x4C246C0: malloc (in /usr/local/lib/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==30532==    by 0x7D361DF: strdup (in /lib/libc.so.7)
==30532==    by 0x5128C71: virStrdup (virstring.c:714)
==30532==    by 0x410B8F: bhyveBuildNetArgStr (bhyve_command.c:94)
==30532==    by 0x411410: virBhyveProcessBuildBhyveCmd (bhyve_command.c:305)
==30532==    by 0x40D6FF: testCompareXMLToArgvFiles (bhyvexml2argvtest.c:46)
==30532==    by 0x40DA27: testCompareXMLToArgvHelper (bhyvexml2argvtest.c:116)
==30532==    by 0x40DFFC: virTestRun (testutils.c:180)
==30532==    by 0x40DD74: mymain (bhyvexml2argvtest.c:180)
==30532==    by 0x40FA9B: virTestMain (testutils.c:992)
==30532==    by 0x40DE32: main (bhyvexml2argvtest.c:196)

This one is a little harder. Reading the code of bhyveBuildNetArgStr, it's not exactly obvious where it leaks, because net->ifname was freed in every code path.

 46 static int
 47 bhyveBuildNetArgStr(virConnectPtr conn,
 48                     const virDomainDef *def,
 49                     virDomainNetDefPtr net,
 50                     virCommandPtr cmd,
 51                     bool dryRun)
 52 {
 53     char macaddr[VIR_MAC_STRING_BUFLEN];
 54     char *realifname = NULL;
 55     char *brname = NULL;
 56     char *nic_model = NULL;
 57     int ret = -1;
 58     virDomainNetType actualType = virDomainNetGetActualType(net);
 59 
 60     if (STREQ(net->model, "virtio")) {
 61         if (VIR_STRDUP(nic_model, "virtio-net") < 0)
 62             return -1;
 63     } else if (STREQ(net->model, "e1000")) {
 64         if ((bhyveDriverGetCaps(conn) & BHYVE_CAP_NET_E1000) != 0) {
 65             if (VIR_STRDUP(nic_model, "e1000") < 0)
 66                 return -1;
 67         } else {
 68             virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
 69                            _("NIC model 'e1000' is not supported "
 70                              "by given bhyve binary"));
 71             return -1;
 72         }
 73     } else {
 74         virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
 75                        _("NIC model '%s' is not supported"),
 76                        net->model);
 77         return -1;
 78     }
 79 
 80     if (actualType == VIR_DOMAIN_NET_TYPE_BRIDGE) {
 81         if (VIR_STRDUP(brname, virDomainNetGetActualBridgeName(net)) < 0)
 82             goto out;
 83     } else {
 84         virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
 85                        _("Network type %d is not supported"),
 86                        virDomainNetGetActualType(net));
 87         goto out;
 88     }
 89 
 90     if (!net->ifname ||
 91         STRPREFIX(net->ifname, VIR_NET_GENERATED_PREFIX) ||
 92         strchr(net->ifname, '%')) {
 93         VIR_FREE(net->ifname);
 94         if (VIR_STRDUP(net->ifname, VIR_NET_GENERATED_PREFIX "%d") < 0)
 95             goto out;
 96     }
 97 
 98     if (!dryRun) {
 99         if (virNetDevTapCreateInBridgePort(brname, &net->ifname, &net->mac,
100                                            def->uuid, NULL, NULL, 0,
101                                            virDomainNetGetActualVirtPortProfile(net),
102                                            virDomainNetGetActualVlan(net),
103                                            VIR_NETDEV_TAP_CREATE_IFUP | VIR_NETDEV_TAP_CREATE_PERSIST) < 0)
104             goto out;
105 
106         realifname = virNetDevTapGetRealDeviceName(net->ifname);
107 
108         if (realifname == NULL)
109             goto out;
110 
111         VIR_DEBUG("%s -> %s", net->ifname, realifname);
112         /* hack on top of other hack: we need to set
113          * interface to 'UP' again after re-opening to find its
114          * name
115          */
116         if (virNetDevSetOnline(net->ifname, true) != 0)
117             goto out;
118     } else {
119         if (VIR_STRDUP(realifname, "tap0") < 0)
120             goto out;
121     }
122 
123 
124     virCommandAddArg(cmd, "-s");
125     virCommandAddArgFormat(cmd, "%d:0,%s,%s,mac=%s",
126                            net->info.addr.pci.slot, nic_model,
127                            realifname, virMacAddrFormat(&net->mac, macaddr));
128 
129     ret = 0;
130  out:
131     VIR_FREE(net->ifname);
132     VIR_FREE(realifname);
133     VIR_FREE(brname);
134     VIR_FREE(nic_model);
135 
136     return ret;
137 }

Also, I had an assumption (wrong!) that I was running with dryRun == true in the test. Anyway, when things go in unexpected ways, it's good to run lldb:

(lldb) breakpoint set --file bhyve_command.c --line 94
Breakpoint 1: where = bhyvexml2argvtest`bhyveBuildNetArgStr + 731 at bhyve_command.c:94, address = 0x0000000000410b3d

Then a couple of next's and we can set a watchpoint and see what happens:

(lldb) print net->ifname
(char *) $1 = 0x0000000807343678 "vnet%d"
(lldb) watchpoint set var net->ifname
Watchpoint created: Watchpoint 1: addr = 0x80727f108 size = 8 state = enabled type = w
    declare @ '/home/novel/code/libvirt/src/bhyve/bhyve_command.c:49'
    watchpoint spec = 'net->ifname'
    new value: 0x0000000807343678
(lldb) c
Process 53109 resuming

Watchpoint 1 hit:
old value: 0x0000000807343678
new value: 0x0000000000000000
Process 53109 stopped
* thread #1: tid = 100513, 0x0000000800ccadb6 libvirt.so.0`virStrdup(dest=0x000000080727f108, src="vnet0", report=true, domcode=57, filename="bhyvexml2argvmock.c", funcname="virNetDevTapCreateInBridgePort", linenr=32) + 43 at virstring.c:712, stop reason = watchpoint 1
    frame #0: 0x0000000800ccadb6 libvirt.so.0`virStrdup(dest=0x000000080727f108, src="vnet0", report=true, domcode=57, filename="bhyvexml2argvmock.c", funcname="virNetDevTapCreateInBridgePort", linenr=32) + 43 at virstring.c:712
   709            size_t linenr)
   710  {
   711      *dest = NULL;
-> 712      if (!src)
   713          return 0;
   714      if (!(*dest = strdup(src))) {
   715          if (report)
(lldb) bt
* thread #1: tid = 100513, 0x0000000800ccadb6 libvirt.so.0`virStrdup(dest=0x000000080727f108, src="vnet0", report=true, domcode=57, filename="bhyvexml2argvmock.c", funcname="virNetDevTapCreateInBridgePort", linenr=32) + 43 at virstring.c:712, stop reason = watchpoint 1
  * frame #0: 0x0000000800ccadb6 libvirt.so.0`virStrdup(dest=0x000000080727f108, src="vnet0", report=true, domcode=57, filename="bhyvexml2argvmock.c", funcname="virNetDevTapCreateInBridgePort", linenr=32) + 43 at virstring.c:712
    frame #1: 0x0000000800862168 bhyvexml2argvmock.so`virNetDevTapCreateInBridgePort(brname="virbr0", ifname=0x000000080727f108, macaddr=0x000000080727f004, vmuuid="\x04\x11㮰P+874\a\b", tunpath=0x0000000000000000, tapfd=0x0000000000000000, tapfdSize=0, virtPortProfile=0x0000000000000000, virtVlan=0x0000000000000000, fakeflags=9) + 83 at bhyvexml2argvmock.c:32
    frame #2: 0x0000000000410bf3 bhyvexml2argvtest`bhyveBuildNetArgStr(conn=0x000000080725e000, def=0x000000080727c000, net=0x000000080727f000, cmd=0x0000000807245280, dryRun=false) + 913 at bhyve_command.c:99
    frame #3: 0x00000000004113eb bhyvexml2argvtest`virBhyveProcessBuildBhyveCmd(conn=0x000000080725e000, def=0x000000080727c000, dryRun=false) + 610 at bhyve_command.c:304
    frame #4: 0x000000000040d70a bhyvexml2argvtest`testCompareXMLToArgvFiles(xml="/home/novel/code/libvirt/tests/bhyvexml2argvdata/bhyvexml2argv-net-e1000.xml", cmdline="/home/novel/code/libvirt/tests/bhyvexml2argvdata/bhyvexml2argv-net-e1000.args", ldcmdline="/home/novel/code/libvirt/tests/bhyvexml2argvdata/bhyvexml2argv-net-e1000.ldargs", dmcmdline="/home/novel/code/libvirt/tests/bhyvexml2argvdata/bhyvexml2argv-net-e1000.devmap", flags=0) + 225 at bhyvexml2argvtest.c:48
    frame #5: 0x000000000040da26 bhyvexml2argvtest`testCompareXMLToArgvHelper(data=0x000000000063e460) + 405 at bhyvexml2argvtest.c:117
    frame #6: 0x000000000040dfe3 bhyvexml2argvtest`virTestRun(title="BHYVE XML-2-ARGV net-e1000", body=(bhyvexml2argvtest`testCompareXMLToArgvHelper at bhyvexml2argvtest.c:101), data=0x000000000063e460) + 245 at testutils.c:180
    frame #7: 0x000000000040dd5b bhyvexml2argvtest`mymain + 789 at bhyvexml2argvtest.c:179
    frame #8: 0x000000000040fa82 bhyvexml2argvtest`virTestMain(argc=1, argv=0x00007fffffffe6e8, func=(bhyvexml2argvtest`mymain at bhyvexml2argvtest.c:127)) + 1675 at testutils.c:992
    frame #9: 0x000000000040de19 bhyvexml2argvtest`main(argc=1, argv=0x00007fffffffe6e8) + 50 at bhyvexml2argvtest.c:195
    frame #10: 0x000000000040d44f bhyvexml2argvtest`_start + 383
(lldb) 

But virStrdup is not really helpful, so going to the previous frame to see the logic behind that:

(lldb) fr s 1
frame #1: 0x0000000800862168 bhyvexml2argvmock.so`virNetDevTapCreateInBridgePort(brname="virbr0", ifname=0x000000080727f108, macaddr=0x000000080727f004, vmuuid="\x04\x11㮰P+874\a\b", tunpath=0x0000000000000000, tapfd=0x0000000000000000, tapfdSize=0, virtPortProfile=0x0000000000000000, virtVlan=0x0000000000000000, fakeflags=9) + 83 at bhyvexml2argvmock.c:32
   29                                      virNetDevVlanPtr virtVlan ATTRIBUTE_UNUSED,
   30                                      unsigned int fakeflags ATTRIBUTE_UNUSED)
   31   {
-> 32       if (VIR_STRDUP(*ifname, "vnet0") < 0)
   33           return -1;
   34       return 0;
   35   }
(lldb) 

Finally, things are getting clear: bhyveBuildNetArgStr was calling virNetDevTapCreateInBridgePort that is mocked for tests. And the mocked version does not free ifname before calling strdup on it, so it's never freed again. The rest of the fix is trivial, problem solved.

All in all, this appears like an interesting and useful exercise. This is probably too detailed, but hopefully it'll be useful for somebody who's just starting the journey of searching for memory leaks (like myself).

Friday, November 4, 2016

bhyve vs VirtualBox benchmarking, part2

"There are in order of increasing severity: lies, damn lies, statistics, and computer benchmarks",
man 8 diskinfo

I got some feedback to my previous post about benchmarking of bhyve and VirtualBox:

So I decided to do some more tests and include e1000 for networking tests and try different drivers and image types for I/O tests.

Networking test

This is a little more extensive test than the one in my previous blog post, now it includes e1000 and virtio-net for both VirtualBox and bhyve. Setting for the test is still the same: iperf and bridged network mode. Commands remain the same, on VM side I run:

iperf -s

On the host, I run:

iperf -c $vmip

And calculate the average of 8 runs. Results:

And actual values (in Gbits/sec) are:

VirtualBoxbhyve
e10001.428751.57375
virtio-net0.432752.4

The most shocking part here is that in VirtualBox e1000 is more than 3x times faster than virtio-net. This seems a little strange, esp. considering that e1000 performance in bhyve is almost the same, but virtio-net in bhyve is approx. 1.5x times faster than e1000 (that's probably a very huge difference too, but at least it's expected virtio to be faster).

I/O testing

I decided to check things suggested in the tweet above and started with disk configuration. I've converted my image to the "fixed size" type image like this:

VBoxManage clonehd uefi_fbsd_20gigs.vdi uefi_fbsd_20gigs_fixed.vdi --variant Fixed

Then I conducted tests for the following configurations:

  • VirtualBox + fixed size image
  • VirtualBox + dynamic size image
  • bhyve + virtio-blk
  • bhyve + ahci-hd

The same image was used for all tests. Fixed size image was produced using the command specified above, raw image for bhyve was created from the VirtualBox image using the qemu-img(1) tool.

I started with bonnie++ test at results surprised me, to put it softly:

Here, we can see that bhyve with virtio-blk shows the best write speed (that is expected), but then shows the worst rewrite speed (which is a little surprising, but the gap is minimal) and shows the worst read speed (more than 2 times slower than VirtualBox; extremely surprising). After that I decided to take a few days break and then to try some different ways of benchmarking.

So, for read performance I used the diskinfo(8) tool this way:

diskinfo -tv ada0 | grep middle

and use the average of 16 runs. For writing, I used dd(1):

dd bs=1M count=2048 if=/dev/zero of=test conv=sync; sync; rm test

and also use the average of 16 runs. I got the following results:

And the numbers (all values are kbytes/sec):

vbox (fixed size img)vbox (dynamic img)bhyve (ahci-hd)bhyve (virtio-blk)
diskinfo1232397152877912960552647685
dd113737135088113889115924

Frankly, this didn't help to figure out state of things, because these results are kind of opposite to what bonnie++ showed: bhyve with virtio-blk shows very high read speed as demonstrated by diskinfo, 1.7x faster than VirtualBox. On the other hand, the numbers that diskinfo is showing are crazy: 2647 mbytes/sec. This feels more like RAM transfer rates, so it looks like some sort of caching is involved here.

As for the dd(1) test part, bhyve with virtio-blk is 16.5% slower than VirtualBox using dynmanic size image.

Conclusion

  • For VirtualBox, if choosing between e1000 and virtio-net, e1000 definitely provides better performance over virtio-net, at least on FreeBSD hosts with FreeBSD guests
  • virtio-net in bhyve is approx. 1.5x faster than VirtualBox with e1000
  • I'll refrain from comments on I/O tests.

Further Reading

Sunday, October 16, 2016

bhyve vs VirtualBox benchmark

I've always been curious how bhyve performance compares to other hypervisors, so I've decided to compare it with VirtualBox. Target audience of these projects is somewhat different, though. VirtualBox is targeted more to desktop users (as it's easy to use, has a nice GUI and guest additions for running GUI within a guest smoothly). And bhyve appears to be targeting more experienced users and operators as it's easier to create flexible configurations with it, as well as automate things. Anyway, it's still interesting to find out which one is faster.

Setup Overview

I used my development box for this benchmarking, and it has no additional load, so results should be more or less clean in that matter. It's running 12-CURRENT:

FreeBSD 12.0-CURRENT amd64

as of Oct, 4th.

It has Intel i5 CPU, 16 gigs of RAM and some old 7200 IDE HDD:

CPU: Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz (3491.99-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x306c3  Family=0x6  Model=0x3c  Stepping=3
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x21<LAHF,ABM>
  Structured Extended Features=0x2fbb<FSGSBASE,TSCADJ,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,NFPUSG>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 17179869184 (16384 MB)
avail memory = 16448757760 (15686 MB)
ada0: <ST3200822A 3.01> ATA-6 device

For tests I used two VMs, one in bhyve and one in VirtualBox. A bhyve VM was started like this:

bhyve -c 2 -m 4G -w -H -S \
        -s 0,hostbridge \
        -s 4,ahci-hd,/home/novel/img/uefi_fbsd_20gigs.raw \
        -s 5,virtio-net,tap1 \
        -s 29,fbuf,tcp=0.0.0.0:5900,w=800,h=600,wait \
        -s 30,xhci,tablet \
        -s 31,lpc -l com1,stdio \
        -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
        vm0

Main pieces of the VirtualBox VM configuration are displayed on these screenshots:

Guest is:

FreeBSD 11.0-RELEASE-p1 amd64

on UFS. VirtualBox version 5.1.6 r110634 installed via pkg(8).

World Compilation

The first test is running:

make buildworld buildkernel -j4

on releng/11.0 source tree.

The result is that bhyve is approx. 15% slower (106 minutes vs 92 minutes for VirtualBox):

iperf test

The next test is network performance check using the iperf(8) tool. In order to avoid interaction with hardware NICs and switches (with unpredictable load), all the traffic for testing stays withing the host system:

vm# iperf -s
host# iperf -c $vmip

As a reminder, both bhyve and VirtualBox VMs are configured to use bridged networking. Also, both are using virtio-net NICs.

Result is a little strange because bhyve appears to be more than 4 times faster here:

That seemed to be strange to me, so I ran the test multiple times for both VirtualBox and bhyve, however, all the time the results were pretty close. I'm yet to find out if that's really correct result or something wrong with my testing or setup.

bonnie++ test

bonnie++ is a benchmarking tool for hard drivers and filesystems. Let's jump to the results right away:

This makes bhyve write and read speeds approx. 15% higher than VirtualBox. There's a note though: for VirtualBox VM I used VDI disk format (as it's native for VirtualBox) and SATA emulation because there's no virtio-blk support in VirtualBox. In case of bhyve I also used SATA emulation (I actually intended to use virtio-blk here, but forgot to change commandline), but on a raw image instead.

xz(1) on memory disk test

The last test is to unpack and pack an XZ archive on memory disk. I used memory disk specifically to exclude disk I/O from the equation. For a sample archive I choose:

ftp://ftp.freebsd.org/pub/FreeBSD/releases/ISO-IMAGES/11.0/FreeBSD-11.0-RELEASE-amd64-memstick.img.xz

And the commands to unpack and pack were:

unxz FreeBSD-11.0-RELEASE-amd64-memstick.img.xz 
xz FreeBSD-11.0-RELEASE-amd64-memstick.img

Results are almost the same here:

Summary

  • CPU and RAM performance seems to be identical (though it's not clear why buildworld takes longer on bhyve)
  • I/O performance is 15% better with bhyve
  • Networking performance is 4x better with bhyve (this looks suspicious and requires additional research; also, I'm not familiar with bridged networking implementation in VirtualBox, maybe it could explain the difference)

Further Reading

PS: Initially I was going to use phoronix-test-suite. However, it appears that a lot of important tests fail to run on FreeBSD. The ones that did run, such as apache or sqlite tests, do not seem very representative to me. I'll probably try to run similar tests but with Linux guest.

Tuesday, October 4, 2016

Bhyve Networking Options

Once in a while people on the #bhyve IRC channel on freenode ask questions about bhyve networking configuration, i.e. how to configure things to let a VM have network access.

There are at least 3 ways to do that (that I'm aware of, maybe there are more):

  • Bridged networking
  • NAT
  • NIC Passthrough

I'll try to go over each of those and describe how things work in each scheme.

Common configuration

Common things for all the setups: I'm running FreeBSD 12-CURRENT amd64, I'm having two NICs (re0 and re1), both connected to a home router.

Bridged Networking

Bridged networking, just like name suggests, means bridging together VM interfaces and the uplink interface, putting those in the same L2 segment.

Configuration is relatively straight-forward. Let's start from a completely fresh configuration, where we don't even have re1 (uplink) configured:

kloomba# ifconfig re1
re1: flags=8802 metric 0 mtu 1500
        options=8209b
        ether 18:a6:f7:01:66:52
        nd6 options=29
        media: Ethernet autoselect (100baseTX )
        status: active
kloomba#

Now let's create a bridge named brextand add re1 to it. Also, we'll run dhclient on it to obtain an IP address (in my setup it comes from a DHCP server running on my home router):

kloomba# ifconfig bridge create name brext                                                                                                                                                
kloomba# ifconfig brext addm re1
kloomba# ifconfig brext up
kloomba# ifconfig re1 up
kloomba# dhclient brext

As a result we have an IP address assigned to the brext bridge:

brext: flags=8843 metric 0 mtu 1500
        ether 02:29:bb:66:56:01
        inet 192.168.87.46 netmask 0xffffff00 broadcast 192.168.87.255 
        nd6 options=1
        groups: bridge 
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: re1 flags=143
                ifmaxaddr 0 port 2 priority 128 path cost 200000

Now we need to create a tap (that will be tap1 in my case) interface for VM and boot it up:

kloomba# ifconfig tap create up
kloomba# ifconfig brext addm tap1

And boot a VM like in a way you like, for example:

bhyve -c 2 -m 4G -w -H \
        -s 0,hostbridge \
        -s 3,ahci-cd,/home/novel/FreeBSD-11.0-CURRENT-amd64-20160217-r295683-disc1.iso \
        -s 5,virtio-net,tap1 \
        -s 29,fbuf,tcp=0.0.0.0:5900,w=800,h=600,wait \
        -s 30,xhci,tablet \
        -s 31,lpc -l com1,stdio \
        -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
        vm0

Now we can open up a VNC client and connect to this VM (I use vncviewer :0) and do the following:

vm# dhclient vtnet0

If things go well, you'll get an IP address from the same subnet as IP address on host's re1 and, obviously, it's served by the same DHCP server that serves the host's re1.

### host ###
kloomba# ifconfig brext
brext: flags=8843 metric 0 mtu 1500
        ether 02:29:bb:66:56:01
        inet 192.168.87.46 netmask 0xffffff00 broadcast 192.168.87.255
        nd6 options=1
        groups: bridge 
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: tap1 flags=143
                ifmaxaddr 0 port 10 priority 128 path cost 2000000
        member: re1 flags=143
                ifmaxaddr 0 port 2 priority 128 path cost 200000
kloomba# 

### vm ###
root@vm0:~ # ifconfig vtnet0
vtnet0: flags=8943 metric 0 mtu 1500
        options=80028
        ether 00:a0:98:1b:c8:07
        inet 192.168.87.47 netmask 0xffffff00 broadcast 192.168.87.255
        nd6 options=29
        media: Ethernet 10Gbase-T 
        status: active
root@vm0:~ # 

To understand a little better what's going on here, let's run ping on a VM:

vm# ping 8.8.8.8

... and check how it looks like on the host:

kloomba# tcpdump -qni brext -c2 -e host 8.8.8.8 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on brext, link-type EN10MB (Ethernet), capture size 65535 bytes
16:09:44.755083 00:a0:98:1b:c8:07 > 40:4a:03:76:de:1d, IPv4, length 98: 192.168.87.47 > 8.8.8.8: ICMP echo request, id 25603, seq 4, length 64
16:09:44.783499 40:4a:03:76:de:1d > 00:a0:98:1b:c8:07, IPv4, length 98: 8.8.8.8 > 192.168.87.47: ICMP echo reply, id 25603, seq 4, length 64
2 packets captured
6 packets received by filter
0 packets dropped by kernel
kloomba# 

What we can see here? Packets from our VM are leaving the host with the IP address it has on vtnet0, MAC address is also vtnet0's MAC. BTW, 40:4a:03:76:de:1d is MAC of my router.

That is it, it works, does not need stuff like firewalls or routing configuration to work, so this approach is relatively easy. There are downsides of that, however. It's pretty common that the router your PC is connected to is configured to only pass only single MAC per networking port, or maybe even only a whitelisted MAC (that's quite common in office environments) or if your home router does not support that. In that case you'll have to go with the NAT approach that I'll describe next.

NAT Networking

Let's assume we're starting from scratch and don't have that brext bridge we created in the previous section. And we're again starting with creation of the new bridge:

kloomba# ifconfig bridge create name brnat up
kloomba# ifconfig tap create up
kloomba# ifconfig brnat addm tap1
brnat: flags=8843 metric 0 mtu 1500
        ether 02:29:bb:66:56:01
        nd6 options=1
        groups: bridge 
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: tap1 flags=143
                ifmaxaddr 0 port 10 priority 128 path cost 55
kloomba# 

As we can see, we no longer have our uplink interface re1 in the bridge. Now let's start a VM, command will be exactly the same as in the previous section, so I won't repeat it here.

Now, if we go to the VM and try to do dhclient vtnet0 nothing happens, because there are no DHCP server reachable from this VM. It's a good time to decide what IP range we'll use for our VM(s). Let's go with something like 10.0.0.0/24. Let's configure pf to do NATing for us. Basic /etc/pf.conf for this purpose might look like this:

ext_if="re1"

virt_net="10.0.0.0/24"

scrub all

nat on $ext_if from $virt_net to any -> ($ext_if)

pass log all

What we're doing here? For packets coming from $virt_net (our VMs range) we're translating its source address from 10.0.0.0/24 internal net to address of our external interface (re1). Now we can load the rules, enable pf and check if that works.

kloomba# pfctl -f /etc/pf.conf
kloomba# pfctl -e
pfctl: pf already enabled
kloomba# 

We also need to assign a proper IP address to our bridge:

kloomba# ifconfig brnat inet 10.0.0.1/24

Also, it's a good time to ensure that IP forwarding is enabled on the host: sysctl net.inet.ip.forwarding=1.

Now VM is expected to get connectivity if we manually assign an IP address to it:

vm0# ifconfig vtnet0 inet 10.0.0.2/24 up
vm0# route add default 10.0.0.1

Now things should work and we can actually see what's going on with the packets. Let's start ping again in our VM: ping 8.8.8.8 and tcpdump on interfaces on the host. Let's start with brnat:

kloomba# tcpdump -ni brnat -c 2 -e host 8.8.8.8
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on brnat, link-type EN10MB (Ethernet), capture size 65535 bytes
17:28:36.101514 00:a0:98:1b:c8:07 > 02:29:bb:66:56:01, ethertype IPv4 (0x0800), length 98: 10.0.0.2 > 8.8.8.8: ICMP echo request, id 21507, seq 62, length 64
17:28:36.129840 02:29:bb:66:56:01 > 00:a0:98:1b:c8:07, ethertype IPv4 (0x0800), length 98: 8.8.8.8 > 10.0.0.2: ICMP echo reply, id 21507, seq 62, length 64
2 packets captured
2 packets received by filter
0 packets dropped by kernel
kloomba# 

And on our uplink interface re1:

kloomba# tcpdump -ni re1 -c 2 -e host 8.8.8.8                                                                                                                                                                       
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on re1, link-type EN10MB (Ethernet), capture size 65535 bytes
17:31:37.898499 18:a6:f7:01:66:52 > 40:4a:03:76:de:1d, ethertype IPv4 (0x0800), length 98: 192.168.87.44 > 8.8.8.8: ICMP echo request, id 19102, seq 236, length 64
17:31:37.926781 40:4a:03:76:de:1d > 18:a6:f7:01:66:52, ethertype IPv4 (0x0800), length 98: 8.8.8.8 > 192.168.87.44: ICMP echo reply, id 19102, seq 236, length 64
2 packets captured
3 packets received by filter
0 packets dropped by kernel
kloomba# 

We can see that at this point no information about VM is exposed here (i.e. no VM subnet 10.0.0.0/24, no vtnet0 MACs etc); NAT works as expected.

As you can see, NAT networking is a little more complex configuration-wise. Though it's probably the most general solution, you don't have to rely on external routers configuration, bridging support in the hardware/drivers and so forth.

This configuration can be simplified though, a good step to it would be configuring DHCP server on brnat to serve IP addresses from our VM range. This could be done for example using the dns/dnsmasq tiny DHCP server.

NIC Passthrough

This is a somewhat fun way to setup networking because a) you'll need 1 (one) physical NIC per VM b) you'll need one more physical NIC for host if you want it to keep connected. This might be much better with SR-IOV though I've never tried SR-IOV cards on FreeBSD. Anyway, back to the point.

I'm going to passthrough re1, I'm running pciconf -l -v to find its PCI address:

re1@pci0:3:0:0:        class=0x020000 card=0x85051043 chip=0x816810ec rev=0x09 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

So I add pptdevs="3/0/0" to /boot/loader.conf and reboot. After reboot it looks this way:

ppt0@pci0:3:0:0:        class=0x020000 card=0x85051043 chip=0x816810ec rev=0x09 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

Now starting a VM like this:

bhyve -c 2 -m 1G -w -H -S \
        -s 0,hostbridge \
        -s 4,ahci-hd,/home/novel/img/uefi_fbsd.raw \
        -s 6,passthru,3/0/0 \
        -s 29,fbuf,tcp=0.0.0.0:5900,w=800,h=600,wait \
        -s 30,xhci,tablet \
        -s 31,lpc -l com1,stdio \
        -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
        vm0

If things go well (i.e.: host supports IOMMU, device supports passthrough, ...), we'll see this device in a VM exactly like it would appear in host:

re0@pci0:0:6:0: class=0x020000 card=0x85051043 chip=0x816810ec rev=0x09 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

At this point it can be used just like it was not a VM but another host connected to a network with its own NIC. Once can run dhclient re0 etc.

Further reading

Update Oct, 17th, 2016: added pics.

Thursday, February 18, 2016

Re-using arguments and environment variables for autotools' configure script

A thing that I'm going to describe is nowhere difficult, however it's not very obvious and actually I was pretty surprised that I only managed to think about it after so many years working with autotools.

From time to time I find myself needing to re-run the configure script because I either need to add or remove some option or just run it with sh -x for debugging purposes or with scan-build for example.

Usually I have a number of arguments and environment variables defined and I don't keep that in my head. I just hit ctrl-r and find the last execution of configure with all the stuff included and continue with that. But if I don't touch it for some time then it goes away from the HISTFILE. In that case I have to go the config.log file and copy/paste stuff from there. However, copy/pasting sucks and, moreover, it doesn't have proper quotes.

Apparently, there's a better solution for that: the config.status can provide everything that's needed.

$ ./config.status --config
'--with-hal' '--without-uml' '--without-polkit' 'CFLAGS=-g -I/usr/local/include' 'LDFLAGS=-L/usr/local/lib'
$

For example, to re-run all the same using scan-build just do:

$ eval scan-build ./configure `./config.status --config`

So it's pretty neat.

More info on the config.status scripts is available in its official documentation.

Wednesday, February 3, 2016

ZFS on Linux support in libvirt

Some time ago I wrote about ZFS storage driver for libvirt. At that time it worked on FreeBSD only because ZFS on Linux had some minor limitations, mainly some command line arguments for generating machine-readable output.

As of version 0.6.4 ZFS on Linux actually supports all we need. And I pushed the libvirt related changes today.

It will be available in 1.3.2, but if you're interested, you're welcome to play around with that right now checking out libvirt from git.

Again, most of the usage details are available in my older post. Also, feel free to poke me either on twitter or through email novel@freebds.org (note the intentional typo). It could take some time to get a reply though.