Knorrieyeah but this btrfs-show-super thing I'm using as an example is already broken
Knorriein the official repo
Knorrieand the fix kdave did is not sufficient
Knorriebecause it only works by accident if you first build other things
Knorrieor I'm stupid
Knorriejust doing "make" first now
Knorrie [CC] cmds-inspect-dump-super.o
Knorrieok so the content of the c file is definitely not the problem
Knorrieand V=1 doesn't seem to make any difference by the way
Knorrievim Makefile
darkling# V=1 verbose, print command lines (default: quiet)
darklingHmm. Yes, that does seem a bit broken.
Knorriethere's only 1 occurrence of btrfs-show-super in the makefile
Knorrieprogs_extra = btrfs-fragments btrfs-calc-size btrfs-show-super
darklingAh, try "make V=1 ..."
Knorrieif I add baby1 to that line, trying random things, no change in behavior
Knorrieyup that's more output
darklingFundamentally, you need to add _all_ of the .o files which define functions used by the program (transitively) to the linker command line.
darklingExactly how you do that within the makefile infrastructure that's used here is less obvious...
Knorriemaybe I need to rerun some magic autorunconfgenerate./configure again
darklingI don't think it's been automaked.
darkling(Which is probably a good thing)
KnorrieI had to do ./
Knorrieand other stuff
Knorriewell, enough for today
darklingYes, that's autoconf, which automatically writes an include file.
darklingIt also substitutes variables into *.in files to make them into * files.
darklingIt doesn't do anything more than that, though.
Knorriererunning autogen and configure doesn't help
Knorriebaby step 1 failure so far :o
darklingThere's a *.in file which looks like it contains all of the magic substitution variables for the Makefile.
darklingBasically, the Makefile is a genuine ordinary bog-standard human-written makefile, as far as I can see.
darklingSo nothing to do with ./configure or autogen will do much to it.
darklingThe misconfiguration you're seeing is a problem with the makefile and _only_ the makefile.
darklingIt just needs the rule for the binary you're building to have the right set of .o files (or source files) defined.
darklingbtrfs_show_super_objects = cmds-inspect-dump-super.o
darklingAbout four lines below that is some unpleasant magic that iterates over the several similar variables and puts them into "standalone_deps"
darklingThen there are two more lines which use standalone_deps, which are the rules for all the btrfs-* tools and btrfs-*.static tools.
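(A sketch of the shape darkling is describing, pieced together from the lines quoted above; this is not the exact btrfs-progs Makefile, and everything beyond the two quoted lines is illustrative only.)
    progs_extra = btrfs-fragments btrfs-calc-size btrfs-show-super
    btrfs_show_super_objects = cmds-inspect-dump-super.o

    # gather every per-tool object list into one variable (illustrative)
    standalone_deps = $(foreach p,$(subst -,_,$(progs_extra)),$($(p)_objects))

    # the link rule then has to pull the tool's own objects in as well (illustrative)
    btrfs-%: btrfs-%.o $(standalone_deps) $(libs_static)
    	$(CC) -o $@ $@.o $($(subst -,_,$@)_objects) $(libs_static) $(LDFLAGS) $(LIBS)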
ElladanHey so if I run dd writing zeros to btrfs, the dd stream will massively dominate all IO, causing massive latency for other, interactive btrfs users and causing my VMs to experience >60s disk timeouts.
Elladan(I also become almost unable to log in and other fun stuff).
ElladanI guess there wasn't a question there. Is there any hope for my precious latency? :-)
kpcyrdhey, I've noticed I can `btrfs su cr foo` as regular user, but I need to be root to `btrfs su del foo`, is there a reason for that?
kpcyrdI would expect that I can delete the subvolumes I created as non-root
ElladanPresumably it's because allowing you to do that would be a security violation.
Elladani.e. if other users created files in your directory.
ElladanSimilarly you can create a directory, but if root goes and makes a subdirectory with some files in it, you're not allowed to delete it any more.
kpcyrdhmm, interesting
ElladanSince deleting a subvolume is recursive, it would go and wipe out root's files in that situation.
ElladanTBH I'm pretty surprised creating a subvolume is allowed in that situation. It seems obviously wrong.
Kamilionworks well for me; I keep xen VM disk images in subvolumes. The unpriv'd daemon may create new subvols to house new VM images, but you either need capabilities, uid0 or to be logged in as root to remove them.
Kamilioncan't delete snapshots of the subvolumes either (which IIRC, are subvolumes themselves)
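(An aside not raised in the discussion above: btrfs does have a mount option, user_subvol_rm_allowed, that relaxes this; whether enabling it is wise given the caveats Elladan raises is another matter. The mount point below is a placeholder.)
    mount -o remount,user_subvol_rm_allowed /mnt
    btrfs subvolume create /mnt/foo    # already allowed as a regular user
    btrfs subvolume delete /mnt/foo    # with the option set, also allowed without root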
KamilionElladan: Add a second drive to the volume with the RAID1 profile; it's helped most of my servers out a lot.
Kamilionsomething about how native command queuing works; when I have a two drive RAID1, I can still read with a heavy write load.
KamilionACTION shrugs
ElladanKamilion, I have three drives and it has no effect.
Kamilion... wha?
Kamilion... how? What? That's supported?!
ElladanIt's not actually RAID-1.
ElladanIt's object mirroring.
Kamilionnegative, it's chunk based mirroring.
Kamilionobject mirroring only comes into play with Ceph
ElladanSame thing, where the chunks are objects.
Kamilionbtrfs's gigablocks sort of prevent any chunk from being larger than a gigabyte.
ElladanIt's only called RAID-1 to make the term familiar for people, but the semantics are very different.
Kamilionyou can have a 20GB file, and it'll be spread across 21 gigablocks in various extents.
Kamilionthe extents are mirrored.
Kamilionyeah, I know.
KamilionStripe, mirror, stripe and mirror, parity, stripe and parity.
Kamilion0, 1, 0+1, 5, 6.
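(For reference, the chunk-level allocation they are arguing about can be inspected directly; exact output varies by version, and the mount point is a placeholder.)
    btrfs filesystem df /mnt       # per-profile totals, e.g. "Data, RAID1: total=..., used=..."
    btrfs filesystem usage /mnt    # per-device view of where the mirrored chunks ended up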
ElladanAlso, I don't understand why command queuing would be expected to help there.
ElladanI mean, each drive has the same write commands, but one drive receives each read. The end result should be hotspots.
Kamilionthe read command ends up getting queued on the opposing device while a write's executing
Kamilionthat's the thing, linux's IO elevator doesn't promise they'll be getting the same write commands, as you noted, this is not really RAID.
Kamilionat least in the sense of a hardware caching RAID adapter
ElladanThe latency issue seems like it must be related to load balancing / transaction size / something. Switching to a different IO scheduler also has no effect.
Kamilionfair queue elevator makes no difference?
Kamilionmaybe my experience is different because I'm on server grade hardware.
Kamilionis it possible you're saturating some bus link?
ElladanThough again that shouldn't really matter with proper load balancing.
KamilionACTION scratches his chin
KamilionWell, I'm working on a system right now with 12X 800GB SAS SSDs and 2X intel SATA SSDs
ElladanWhat does matter is reducing the write load, exactly as you'd expect if the problem was queuing up too much write work instead of load balancing properly.
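(An aside, not something either of them suggests here: one common way to bound how much write work gets queued ahead of foreground IO is to lower the kernel's dirty-data thresholds; the values below are purely illustrative.)
    sysctl -w vm.dirty_background_bytes=67108864   # 64 MiB before background writeback starts
    sysctl -w vm.dirty_bytes=268435456             # 256 MiB hard cap on dirty data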
KamilionI see a large difference in performance on the intel SSDs between two independent btrfs volumes, and one volume in the RAID1 (metadata and data) personality.
ElladanAlso BTW, "the read command ends up getting queued on the opposing device..." doesn't really make sense. Both devices should receive the write at approximately the same time, no?
Kamilionah, i've seen that with DD and DC3DD
Kamilionapproximately, but it's never entirely deterministic
ElladanAnd reads are load balanced to devices in btrfs last time I checked, they're just assigned through some sort of hokey process affinity.
Elladaner, aren't.
KamilionOh. Perhaps that's it -- I am using ionice on everything that can do heavy I/O
ElladanOh, you're using large numbers of SSDs. The reason you're not seeing latency is probably because the aggregate throughput is very high.
Kamilionwhich for me, ends up to be qemu-dm
ElladanAlso because the SSDs' command latency is extremely low.
Kamilionuh, two different sets
Kamilionthe SAS SSDs are dual channel
Kamilionbut they have nothing to do with it
KamilionI'm talking about the pair of intels, the boot disks
ElladanOh also ionice has no effect BTW.
KamilionIntel model 320 SSDs, 160GB
ElladanIntel SSDs are still massively fast.
Kamilionsingle channel SATA 6Gb
Kamilionyeah, about 300MB/sec
Kamilionnot too much faster than my spinning disks at 180MB/sec
ElladanAnd what 100,000 IOPs or something.
Kamilionpfft, no f--kin way
Kamilionmaybe for the DC3500s
Kamilionbut not for these consumer SSDs
Kamilionthe DC series does more like 570MB/sec, saturating the link
ElladanYour spinning disks get like what 100 iops. I'd assume the intel SSDs are on the order of 1000x faster.
Kamilionthese get nowhere even close
Kamilionthe spinning disks do about 4800 IOPS
KamilionSeagate ST3000DM001s, just about the fastest consumer rust you can get your hands on.
ElladanLink speed is seldom relevant except for streaming to high performance SSDs.
KamilionSequential IO at about 180MB/sec
KamilionRandom at around 10-20MB/sec, thanks to the "massive" 128MB of DDR cache
Kamilionthat's 4K though
Kamiliongo down to random 512 and it sharts due to the 4K sectors
ElladanThe Seagate ST3000DM001 is a 7200 rpm drive, it doesn't get 4800 IOPs
Kamilionalso doesn't like 16K random I/O benchmarks.
KamilionSure does.
KamilionOh, nevermind, my mistake
ElladanIt'll do 4800 sequential iops, but so will basically any hard disk.
Kamiliononly gets that behind the LSI 2008
Kamilioner, no, that one's an LSI 2108 with a gig of cache
KamilionEither way
KamilionI have a 24 bay machine and I can't fill it with the 800GB SSDs
Kamilionbecause they saturate the LSI's bus interface
KamilionI'd have to add a second adapter
Kamilionwell, third adapter
ElladanA quick google indicates your Intel 320 series SSD gets 40,000 read / 20,000 write iops as tested.
Kamilionthe intel S2600GZ motherboard's got a pair of SAS connectors
Kamilionand then a RAID riser with another pair of SAS connectors
ElladanA 7200 RPM disk will get on the order of 100-200 iops on the same test.
Kamilionyeah, the artificially high figure I'm seeing is due to the LSI's cache
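(If one wanted to settle the IOPS argument, a hedged sketch of how it is usually measured with fio against a scratch file; the path and sizes are made up, and direct IO is used to sidestep the controller cache as far as possible.)
    fio --name=randread --filename=/mnt/fio.test --size=4G --direct=1 \
        --ioengine=libaio --rw=randread --bs=4k --iodepth=32 --runtime=30 --time_based
    # same again sequentially, for comparison with the "4800 sequential iops" figure
    fio --name=seqread --filename=/mnt/fio.test --size=4G --direct=1 \
        --ioengine=libaio --rw=read --bs=4k --iodepth=32 --runtime=30 --time_based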
KamilionStill, I haven't found better drives than the ST3000DM001s for mass storage at low cost. They suffer from infant mortality; but if they've run for more than a month, they tend to last years.
KamilionI've got velociraptors, but those aren't cheap.
ElladanYeah so in conclusion I think your btrfs installation probably suffers from more latency than it should, but the high throughput and low latency of your setup means that performance is acceptable for you.
Kamiliondoubt it.
ElladanI actually don't really have any particular performance needs, but it would be nice if my IRC VM didn't become nonresponsive when I write a large file. :-)
Kamilion... wat
Kamilionyeah, something is seriously wrong then.
Kamilionthe perforce and gitlab servers are always hammering the disk
Kamilionand yet none of the windows VMs hosting games seem to notice.
ElladanNone of what you're saying is really inconsistent with what I'm saying. You have a very different number of disks than my server does, and the disks have radically different performance characteristics.
KamilionThis is a two disk system.
ElladanAlso you mentioned "write load" but have you actually tried the test I mentioned?
ElladanAh OK, I see that what you described up above was a system you're "working on"
Kamilion^C3046+0 records in
Kamilion3046+0 records out
Kamilion3193962496 bytes (3.2 GB, 3.0 GiB) copied, 45.423 s, 70.3 MB/s
ElladanCan you show me your command line?
Kamiliondd if=/dev/zero of=bigfile bs=1M
ElladanThat's funny, I got 150 MB/s
ElladanI guess I have 3 disks instead of 2.
Kamilionit certainly was using the entire disk bandwidth during that
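(Worth noting as an aside: without a sync at the end, dd numbers partly measure the page cache; a variant like the one below, with a made-up count, gives rates that are easier to compare between machines.)
    # conv=fdatasync makes dd flush the file before reporting the rate
    dd if=/dev/zero of=bigfile bs=1M count=4096 conv=fdatasync status=progress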
ElladanDo you have compression turned on?
Kamilion... Pfffthahahahaha
Kamilionthat's not really going to work out so well for 40GB-200GB .img files
ElladanIt works great for /dev/zero though ;-)
Kamilionhm, really? lemme see.
KamilionACTION asks pixz to compress /dev/zero
ElladanAre your VMs COW or NOCOW?
Kamilionboth at the same time.
Kamilionthe running subvol is NOCOW, the snapshots are COW.
Kamilionnothing reads the snapshots though, so in practice, it's all nocow except for the backup files.
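(For context, NOCOW as described here is normally arranged like this; the path is hypothetical, and the attribute only takes effect on files created after it is set, or on empty files.)
    mkdir /mnt/vm-images          # hypothetical location for the images
    chattr +C /mnt/vm-images      # new files created inside inherit NOCOW
    lsattr -d /mnt/vm-images      # the 'C' flag confirms it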
Kamilionthe interior filesystems are ext3.
Kamilion(not a good idea to stack btrfs if you ever want to use btrfs recover)
ElladanIt should work fine if your IO sequencing is correct :-D
Kamilionthey are periodically "defragmented" by resize2fs
KamilionI used to use zerofree, but it was easier to ask resize2fs to move the extents with a shrink.
ElladanSo if you run your dd command, and then do "touch foo;sync" in a VM, how long does it take sync to return? Leave dd running until it's done.
Kamilionsame VM? Two different VMs? VM and host?
ElladanHost and one VM
Kamilioninstant return
Kamilionof course.
Kamilion/ is a tmpfs.
Kamilion(on the host)
ElladanWell obviously on btrfs and not in a tmpfs...
Kamilionroot@gitlab:~# touch foo; time sync
Kamilionreal 0m20.622s
Kamilionin the same VM
ElladanOk, so it took your VM 20 seconds to return from writing a trivial amount of data.
Kamilionroot@bdutahxeon:/mnt/btrfs/big-storage# touch foo; time sync
Kamilionreal 0m1.233s
Kamiliontook the host about 1.2 sec
Kamilionroot@perforce:~# touch foo; time sync
Kamilionreal 0m1.653s
Kamilion1.6 seconds on another VM
Kamilionthis is kernel 4.10.0-40-generic #44~16.04.1-Ubuntu
Kamilionaka linux-image-hwe
KamilionPerformance changed drastically when I switched from the stock 4.4.0
Kamilionand will probably increase slightly more when I get 4.13/4.14, due to some of the fixes that went in
Kamilioneither way though, xen 4.6.5 seems to be dealing with the latency well.
Kamilionand i doubt 17.10/18.04's move to xen 4.9 will change much.
KamilionAnyway, my #1 concern is recoverability. btrfs is the only filesystem so far that puts that anywhere close to first class
KamilionI have spent way too much time cleaning up after LVM2 metadata failures, XFS arrays losing a drive, all kinds of stupid shenanigans that should not have resulted in lost data.
KamilionSince going to btrfs in 2011, I haven't lost a single byte.
Kamilion(on btrfs)
Kamilionmeanwhile I've lost two ZFS arrays since, thanks to freenas.
Kamilionaaaaand that's why I don't use freenas anymore.
KamilionAnd if freenas's freebsd zfs driver can make those mistakes; I bet the linux driver might trip too.
Kamilionoh, and if you happen to pooch a ZFS volume? Get out the hex editor. Ain't no tools for fixing that.
KamilionACTION runs quite a number of servers
Kamilionbad disks are a regular occurrence.
KamilionI freaking wish I could afford Drobos.
Kamilionhey, the red light over there is blinking, go change that drive.
Kamilionand I cannot seem to find a single linux based storage server distro that doesn't do SOMETHING retarded (usually picking PHP for their webUI)
Kamiliondoes anyone know of a btrfs storage appliance using nothing but python or go?
Kamilionno perl, no php, no nodejs, no cgi-bin, no query parameter bullshit
ElladanThe host had dd running at the time?
Kamilionnegative; VM was running DD
Kamilioni don't dare do any operations like that on the host
ElladanOn host: real 0m36.635s
Kamilionall the VMs qemu-dm instances are ionice'd, so dd on the host would absolutely override all the VM I/O
Kamilionplus my VMs do have their own IO priorities set
ElladanKamilion, I might have missed a few lines of what you said
Kamilion[18:20:43] <Kamilion> negative; VM was running DD
Kamilion[18:20:54] <Kamilion> i don't dare do any operations like that on the host
Kamilion[18:21:44] <Kamilion> all the VMs qemu-dm instances are ionice'd, so dd on the host would absolutely override all the VM I/O
Kamilion[18:22:26] <Kamilion> plus my VMs do have their own IO priorities set
ElladanAh then your test isn't comparable at all
KamilionI've spent years tweaking.
btrfs998i am facing an issue
btrfs998my root partition is going into read-only mode
btrfs998i have tried btrfsck but the problem persists
xnxsany relevant logs as to what caused it?
Mobtrfs998: Once I got it mounted rw again after some of the --init-* options of btrfsck, but only do that if you have a backup. Anyway, it was still broken after the repair, just a bit less.
MoQuestion, for a subvolume with only nodatacow files, set by chattr, what happens if I snapshot that? Doesn't snapshotting rely on COW?
multicoreMo: new writes will be COWed once (...until you make another snapshot)
Momulticore: So even with nodatacow I will get one COW per snapshot. After deleting the snapshots, the extents created by the snapshot COW are still kept, and I would need to defragment to get the structure back as if no COW had been done?
MoThe question is because I disabled COW for the subvolume containing virtual machines. Now for backing up I snapshotted that and transferred the snapshots, then deleted the local snapshots.
multicoreMo: no-cow extent A is snapshotted to extent B, B is newly written data and its file keeps the NOCOW property (writes to extent B will overwrite the data)
KeMo: that is quite right, snapshotting will cause fragmentation when the workload includes writes
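(A small sketch of the sequence multicore and Ke describe; the paths are made up.)
    btrfs subvolume create /mnt/vms
    chattr +C /mnt/vms                                  # files created here are NOCOW
    # ... create the VM image in /mnt/vms and run the VM ...
    btrfs subvolume snapshot -r /mnt/vms /mnt/vms-snap  # the snapshot shares the extents
    # the first overwrite of a shared extent after the snapshot is COWed once;
    # subsequent writes go in place again until the next snapshot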
MoOk, then I will make sure the snapshotting is only done without running VMs..
Kein theory, you can always run your vms in snapshot mode and then sync periodically
Keand block the sync, when doing backups
MoKe: But then I would need a deep copy of the whole subvolume that I'm going to snapshot? So that rsync A/ B/ would leave a full copy in B/?
KeI don't follow your meaning
Kecheck snapshot in qemu manual
Kenot sure if there are good automated implementations for it though
MoSry, I misread "rsync periodically"... and with snapshot mode you mean snapshotting with VM utilities? I'm running VirtualBox. And the snapshotting I do with btrfs, because I like to send/receive them.
Kein my opinion that is the technically proper solution
Kedo cow on the short term and merge changes to original image periodically
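(A rough illustration of what Ke is suggesting, in qemu-img terms; Mo runs VirtualBox, so this is only the qemu flavour of the idea, and the file names are made up.)
    qemu-img create -f qcow2 -b base.qcow2 -F qcow2 overlay.qcow2   # COW overlay on the image
    # ... run the VM against overlay.qcow2 ...
    qemu-img commit overlay.qcow2   # periodically merge the accumulated changes back into base.qcow2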
MoSo I had the idea, running the VM in A/, rsync to B/ and only snapshot B/ while skipping rsync in the snapshot-phase.
Kersync is not atomic
MoYes, using the VM snapshot utilities I need to check, but as for btrfs snapshots that is easy to transfer to a backup with all other of my snapshots.
MoKe: You mean rsync on a running VM on A/ could leave a broken copy on B/ ?
Keit is unlikely, but possible
MoI see, therefore I like btrfs snapshots.
Kewell on btrfs it's actually even likely, as root is continuously cowed
Keon ext4 rsync would often work
Kei have done live block level copies every now and then, when I have been desperate
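(The btrfs-native way to get the atomic copy Mo is after, sketched with made-up paths: a read-only snapshot is a consistent point-in-time image, and send/receive moves it to the backup filesystem. A snapshot of a running VM is still only crash-consistent, of course.)
    btrfs subvolume snapshot -r /mnt/A /mnt/A-snap    # atomic, read-only
    btrfs send /mnt/A-snap | btrfs receive /backup/   # transfer to the backup filesystem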
MoNone of the other actions blocks IO as badly as defragment, neither duperemove nor subvolume delete. But defragmenting some subvolume makes working with that btrfs nearly impossible, even when prefixed with ionice -c 3 schedtool -D -e
Knorriethat's quite a lot
darklingLooks like it's sdi now
Knorrieuncorrectable... so, no other good copy, or, it failed to re-write on the drive that had it wrong?
darklingI'm hoping the latter.
Knorrieerror messages could be improved
Knorrieerror details: read=<blah>, so maybe it didn't even try to write correct data back again, because it couldn't even read the wrong data
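(The counters behind messages like that can also be read, and reset, per device; the mount point below is a placeholder.)
    btrfs device stats /mnt      # write/read/flush/corruption/generation error counters
    btrfs device stats -z /mnt   # reset them, e.g. after replacing the offending drive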
optywhy do you use hastebin? no js, no content :/
Knorriedarkling likes to annoy us
darkling90% of the web is no-js-no-content...
DusXMTdarkling: But most paste services aren't
DusXMTAnd 90% is by far a stretch
DusXMTACTION could use RISC OS's NetSurf browser, which doesn't have js, on a surprisingly large number of sites relatively problem-free
darklingThis is interesting: Dec 1 15:00:15 s_src@amelia kernel: BTRFS error (device sdi1): bdev /dev/sdi2 errs: wr 38677790, rd 38641891, flush 991, corrupt 0, gen 0
darklingNote that it's "device sdi1", but /dev/sdi2
Knorrieand just looking by devid instead of path?
Zygo'sdi1' could be the mount point of the filesystem, but even if it is, why another partition on the same disk?
Zygoeither way something is quite wrong there
darklingsdi1 isn't a mountpoint of the FS, and isn't a part of the FS.
Zygoeither the kernel or IO layers are badly confused, or user error
Zygo(or obscure test scenario)
darklingI haven't been messing around with anything.
darkling/dev/sdi1 is a BIOS boot partition (not quite sure why it's there, actually... probably something to do with gpt)
Zygo"badly confused" is sounding more and more likely
darklingYeah. :(
ZygoUSB anywhere? mismatched MBR and GPT partition tables?
darklingNo, it's all internal SATA. Not been touched in months.
darklingWell, it all seems to be less confusicated after a reboot. I should do another scrub, though.
darklingI rebooted my server to deal with the disk confusion, and now my media player doesn't boot.
darklingOK, looks like I may have two devices failing at the same time here. :(
Keperhaps just a transient failure?
darklingDon't think so. Loads of this:
darklingAlso, on the server side of the NFS root for the media player:
KeI don't know how to read those, but I have had ATA errors
darkling(Both of those are on the server)
darklingHmm. This is looking more and more like a controller failure.
darklingI can do smartctl --all on sdc and sdd, but not sde, sdf, sdg or sdh.
darklingYup, all four of those are on the same controller.
darklingReplaced the controller, and we're back in business.
darklingHad to shift four drives to an external eSATA PM enclosure.
darklingAnyone got an opinion on the likely quality/usability of this thing?
darklingOr this?
Zygoyou know you have a lot of files when 'rm -rf' is using 3.07GB of RAM to delete them
djwongACTION wonders what kind of .... i don't even ... would require 3G of RAM to delete?
Knorriemy first limitation with rm is usually that the command line gets too long
ZygoI think it's mostly extent buffer cache
Zygo10-20m files in the rm
Knorrienext time put them in a subvol and sub del :)
Zygo~38 GB of metadata in the FS
Zygoactually I do it with rm to work around what happens when you delete a subvol with a file open on it
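(Knorrie's suggestion from above, sketched with a made-up path; subvolume deletion returns quickly and the actual cleanup happens in the background, though as Zygo notes it behaves differently if files in the subvolume are still open.)
    btrfs subvolume create /mnt/scratch
    # ... generate the tens of millions of files inside /mnt/scratch ...
    btrfs subvolume delete /mnt/scratch   # returns almost immediately; cleanup is asynchronous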