23/03/2021

OVH on fire

As you may have heard, on March 10th a large fire destroyed part of a big datacenter in Strasbourg owned by OVH (maybe the biggest European service provider), and yes, this blog burned with it.

After the accident there was a huge discussion on the web, with flames (sigh…) on Twitter and Reddit about this crazy provider which doesn’t have a disaster recovery plan or some sort of automagic backup, so people got stuck with no option other than restarting their site/service from scratch…

Some of you may think I’m mad about it and that I would run away from this provider… well, I’m not, and I’ll stay with OVH.

The reasons are very simple. First of all, as you can see, the blog is back (maybe better than before: things like this always make you think about how you can improve stuff, or at least that’s how it works for me) because (surprise surprise!) I had a backup every 6 hours in another location (thanks, restic).
The second reason why I decided to stay with OVH is that their VPS offer is perfect for my needs: it costs about as much as a shared hosting service and runs so much better, and obviously I can do whatever I want with my private VPS instead of getting stuck with a WordPress-only hosting service.

And no, I’m not mad at OVH, because even without carefully reading the contract I signed, I knew from the beginning that I had to take care of backups myself, even if they had been included in the service (and they’re not in my case).
Why? Because I want backups made my way, so I can control them, I can check them, and I can figure out the best recovery plan for me.

I understand those who complained about backups kept in the same location where the fire happened: they paid for a service and it had a flaw (a big one, don’t get me wrong).
But from my perspective there was a bigger flaw, and it was their thinking: “ok, I paid someone to take care of the backup, job’s done”.
No… no…. NOOOOO!
If you own a service, you have the responsibility to take care of the backup, to understand it, to figure out the recovery plan, and to test it; if their backups burned with the servers, it’s because they missed one, many or all of those points.
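
For the curious, a backup job along those lines could look roughly like the sketch below. It is only an illustration, not the setup actually used for this blog: the repository URL, the password file, the /var/www path, the retention of 28 snapshots and the RUN switch are all assumptions made up for the example.

```shell
#!/bin/sh
# Sketch of an off-site backup job with restic.
# ASSUMPTIONS (not from the post): repository URL, password file and the
# /var/www path are placeholders; adapt them to your own setup.
export RESTIC_REPOSITORY="${RESTIC_REPOSITORY:-sftp:backup@backup.example.org:/srv/restic/blog}"
export RESTIC_PASSWORD_FILE="${RESTIC_PASSWORD_FILE:-/root/.restic-password}"

# RUN is a hypothetical switch: set RUN=echo to print the commands
# instead of executing them (a poor man's dry run).
RUN="${RUN:-}"

backup_blog() {
    # Take a snapshot of the site data.
    $RUN restic backup /var/www || return 1
    # Keep a bounded history (28 is an arbitrary choice) and free unused data.
    $RUN restic forget --keep-last 28 --prune || return 1
    # Check the repository for errors.
    $RUN restic check
}

# backup_blog   # uncomment to run the job when this file is executed

# Example crontab entry for a run every 6 hours (script path is a placeholder):
# 0 */6 * * * /usr/local/bin/backup-blog.sh
```

The check step only verifies the repository; the real test of a recovery plan is restoring somewhere else now and then, for example with "restic restore latest --target" pointed at a scratch directory.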

That’s it, for me the case is closed.

22/03/2021

Dell iDrac java patch

Every now and then people ask me which server manufacturer is my favorite, and every time I honestly don’t know how to reply because they all work pretty well.
What really changes between competitors is the technical support and some of the small bits that many people consider irrelevant but that, imho, are very important. One of them, maybe the most important, is the lights-out management (LOM) interface.
Every server manufacturer has its own LOM interface, but my favorite (and one of the reasons why I prefer Dell servers) is the Dell Drac.

One of the most common problems with the Dell Drac is the virtual console, which requires a Java JRE, and obviously this makes people angry because… well, basically because people are lazy, leave their brain turned off most of the time and don’t read errors and exceptions…

If you search online for “dell drac java error” you’ll find a whole bunch of forum threads, Reddit posts and even useless Chrome extensions for making the damn virtual console work. Sometimes those sources are crap, sometimes they contain small bits of the solution, which keeps changing because there are several versions of Drac devices and obviously they evolved over the years.
These errors always come from the java.security settings: the Drac encrypts data transmissions, and old Drac cards use old encryption protocols and cipher suites. So I decided to make a simple patchfile for the java.security file, for a quick change and rollback (it’s not a good idea to leave old insecure protocols turned on in your JRE).

First of all you have to locate your java.security file, which is inside your JAVA_HOME/lib/security; after that, apply this java.security patchfile.
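
Since the whole point is a quick change and rollback, here is a minimal sketch of the safety net around that step. It does not reproduce the patchfile itself; it only keeps a pristine copy of java.security, shows the properties that usually block old protocols, and restores the original afterwards. The function names are made up for this example, and the default path simply follows the location given above.

```shell
#!/bin/sh
# Sketch: keep a pristine copy of java.security so weakened settings can
# be rolled back quickly. Function names are hypothetical.
# The default path follows the post (JAVA_HOME/lib/security).
SEC_FILE="${SEC_FILE:-${JAVA_HOME:-}/lib/security/java.security}"

save_original() {
    # Never overwrite an existing backup: it may be the only clean copy.
    [ -e "$SEC_FILE.orig" ] || cp -p "$SEC_FILE" "$SEC_FILE.orig"
}

show_disabled_algorithms() {
    # Print (with line numbers) the first line of the properties that
    # usually disable old protocols, ciphers and signature algorithms.
    grep -n -E '^jdk\.(tls|certpath|jar)\.disabledAlgorithms' "$SEC_FILE"
}

rollback() {
    # Put the pristine file back once you are done with the old Drac.
    cp -p "$SEC_FILE.orig" "$SEC_FILE"
}
```

Typical use would be save_original, apply the change, do the work on the console, then rollback. If the change is applied with the patch utility, "patch -R" with the same patchfile should undo it as well.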

After that, open your Java settings and add the URL of your Drac web interface to the exception site list (“Security > Exception Site List”).

That’s all, now you’ll be able to open the virtual console even on an old Drac 5 with the latest JRE (tested right now with JRE 1.8.0_261).

03/04/2020

AWS EC2 instance migration

Recently I received some complaints about load problems on an AWS EC2 t2.medium instance running CentOS 7: despite being a development environment, it was under heavy load.
I checked logs and monitoring and ruled out any kind of attack; after a chat with the dev team it was clear that the load was normal for the applications running (some kind of elasticsearch scheduled bullshit).

The load was 100% CPU, but I noticed some interesting behavior: for a couple of weeks there had been a lot of steal time.

Looking at the EC2 CPU credits it was crystal clear that we had run out of them, which turned on some heavy throttling.
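
If you want to put a number on the steal time from inside the instance, a rough sketch is to sample the aggregate "cpu" line of /proc/stat twice and compare. The function names and the 5 second interval below are choices made for this example, not something taken from the original troubleshooting.

```shell
#!/bin/sh
# Sketch: share of CPU time stolen by the hypervisor between two samples
# of the aggregate "cpu" line of /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal ...
# Function names are made up for this example.
steal_percent() {
    # $1 and $2 are two "cpu ..." lines taken some seconds apart.
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= 9; i++) total += $i   # user .. steal
          t[NR] = total; s[NR] = $9 }            # $9 is the steal counter
        END { d = t[2] - t[1]
              if (d > 0) printf "%.1f\n", 100 * (s[2] - s[1]) / d
              else print "0.0" }'
}

sample() { grep '^cpu ' /proc/stat; }

# Usage on a live Linux box:
#   a=$(sample); sleep 5; b=$(sample); steal_percent "$a" "$b"
```

The credit side of the story lives outside the instance, in the CloudWatch CPUCreditBalance metric.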

Since the developers couldn’t reduce the load from the applications and the management wouldn’t move away from EC2, the solution I suggested was to move to a different kind of instance, specifically designed for heavy computational workloads and without CPU credits.

So I made some snapshots and launched a new C5 instance, piece of cake, right?
Well, no… as soon as I started the new instance it wouldn’t boot, and the logs showed “/dev/centos/root does not exist”. :\

So what’s going on here?
Pretty simple: there are significant hardware differences between EC2 instance types. For example, C5 instances present their storage as NVMe devices, which requires a specific kernel module (nvme), and the same goes for the network interface, which needs the ena module.

The goal here is to build a new init image with these two modules inside, so that during boot the kernel can use these devices and find a usable volume to boot from and a nic for the network. The only problem is that we can’t simply boot the system from a live distro and build a new init image with those modules already loaded: remember, we’re on AWS, not on a good old VMware instance (sigh…).

First of all I terminated the new instance, since it was basically useless, and went back to the original T2 instance.
Check which kernel version you’re running with “uname -a” and build a new init image including the nvme and ena modules using mkinitrd, for example:

mkinitrd -v --with=nvme --with=ena -f /boot/initramfs-3.10.0-1062.18.1.el7.x86_64-nvme-ena.img 3.10.0-1062.18.1.el7.x86_64

Using lsinitrd you can check that your new init image has nvme and ena module files inside.
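
That check can be scripted; the sketch below reads a file listing on stdin (such as the output of lsinitrd) and fails if either driver is missing. The function name is made up for this example, and the match on ".ko" assumes the modules show up as nvme.ko and ena.ko, possibly compressed.

```shell
#!/bin/sh
# Sketch: confirm that an initramfs listing contains both drivers.
# has_boot_modules is a hypothetical helper; it reads the listing on stdin.
has_boot_modules() {
    listing=$(cat)
    for mod in nvme ena; do
        # Matches ".../nvme.ko" as well as ".../nvme.ko.xz".
        printf '%s\n' "$listing" | grep -q "/$mod\.ko" || {
            echo "missing module: $mod" >&2
            return 1
        }
    done
}

# Usage on the T2 instance, assuming the image name from the mkinitrd example:
#   kver=$(uname -r)
#   lsinitrd "/boot/initramfs-$kver-nvme-ena.img" | has_boot_modules && echo "ok"
```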

Now you have to edit your grub config file (/boot/grub2/grub.cfg) and change your first menu entry switching from the old init image to the new one.
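
If you prefer not to hand-edit the file, one cautious option is to generate a modified copy and review the diff before installing it. The sketch below swaps only the first occurrence of the old image name, which is assumed to sit in the first menu entry; the function name is made up for this example and nothing is edited in place.

```shell
#!/bin/sh
# Sketch: print a copy of a grub config in which only the FIRST occurrence
# of the old init image name is replaced by the new one.
# swap_first_initrd is a hypothetical helper; the input file is not modified.
swap_first_initrd() {
    # $1 = grub config file, $2 = old image file name, $3 = new image file name
    awk -v old="$2" -v new="$3" '
        !changed && (p = index($0, old)) {
            $0 = substr($0, 1, p - 1) new substr($0, p + length(old))
            changed = 1
        }
        { print }' "$1"
}

# Usage, with the file names from the mkinitrd example:
#   swap_first_initrd /boot/grub2/grub.cfg \
#       initramfs-3.10.0-1062.18.1.el7.x86_64.img \
#       initramfs-3.10.0-1062.18.1.el7.x86_64-nvme-ena.img > /tmp/grub.cfg.new
#   diff /boot/grub2/grub.cfg /tmp/grub.cfg.new
```

Only copy the new file over the real one once the diff shows exactly the one line you expected.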

Save the /boot/grub2/grub.cfg file, CHECK AGAIN THAT YOU HAVE A GOOD SNAPSHOT OR AN AMI, and reboot; nothing should have changed.

Now you can make a new snapshot or AMI and build a new instance from it: choose a C type instance and this time it should boot properly.

The new C5 instance has different storage device names, a new nic driver (ena), and both the ena and nvme modules loaded.

Life should be easier without the cloud… again.

05/08/2019

On containers and orchestration

I don’t know what you think about containers and orchestration tools, but it’s been a while since I started discussing these topics on forums and various platforms, and I can’t say it’s an easy discussion.
To be honest I’m quite tired of repeating the same concepts, so I thought this blog could help me express my point of view; at least in the future I’ll only have to copy and paste this post’s URL and not waste time anymore :)

To some people my point of view on containers and microservices may sound a bit grumpy, but I can assure you that I’m not against them, nor do I think they are bad or just some nasty trend.

First of all, it’s imperative to distinguish between two main scenarios: development/test and production environments.

For the first one containers are wonderful, they are Nirvana and Shangri-La all together: they give developers the opportunity to set up a dev environment in no time, with no installations and no hard requirements; everything is perfectly consistent and works the same way as the tools they are familiar with (think about git or other version control software), and everything is perfectly reproducible on whatever platform: their work pc, home pc, a server, a laptop, everything.

For production… well, it’s not the same.
First of all: the most important requirement for a production environment is reliability, then comes security, then everything else.
Literature, research, experience and logic have taught us that these two basic requirements can be met only by following the KISS principle (Keep It Simple, Stupid), which is not a joke: it’s real, it’s true and it works, period.
Second: when your application moves from development to production it moves from developers to sysadmins. It’s the sysadmin who must maintain it, provide resources and guarantee that the application (and all its requirements) is working, reliable and kept secure.
Third: if you think about the lifecycle of an application, most of it is spent in production. An application could require a few months of development but will remain online for years, and usually the longer the development takes, the longer it will stay in production.

Following the KISS principle means REDUCE COMPLEXITY, containers ADD COMPLEXITY :\
This may sound strange to container users because “well, it’s a piece of cake, launch a couple of commands and you’re ready to go”. Well, stop for a moment and think about it.

  • In a “legacy” environment you have applications (your site, your database, etc.) on top of services (webservers, application servers, rdbms, etc., shared by several applications) on top of an OS (shared by several services) on top of some sort of hardware (which could be very complex in production, and could be shared between several OSs if you’re using virtual machines).
  • With containers and orchestrators (like Kubernetes or OpenShift) you have a LOT more complexity: you still have your applications, on top of services (which are the same as before…), on top of containers, on top of a container environment (for example Docker), on top of orchestrators, on top of an OS (usually installed on a vm, so add the hypervisor complexity too), on top of hardware.

More complexity means less reliability and less security, or at least a lot more work and variables to manage to reach the same level on those requirements.

Like everything, there are pros and cons, and someone could argue that besides that added complexity containers and orchestrators give you a lot of benefits, mainly:

  1. reproducibility
  2. horizontal scalability (adding more “nodes” and distributing load across them).

We already talked about the first one: it’s awesome for a developer, but for a sysadmin in a production environment?
Well, not really. We simply don’t need it, because moving an application between different environments is really rare (in 20+ years of IT consulting work it happened to me only once, moving an Oracle 10g rdbms from Windows Server 2003 to RHEL) and usually any service has its own backup and restore procedures to accomplish this goal.

Scalability is another story: in theory it’s an awesome thing, in the real world it’s not a big deal for 99% of companies or services.
The idea of having some black magic that will add more and more instances of your application and distribute the load across them is good, and managers love it. They simply think this is the solution to every problem because, let’s be honest, they don’t understand the complexity behind an application and they think that all problems come from a lack of resources.

In the real world any experienced sysadmin can confirm that resources are usually enough, and when an application has problems it’s all about some bug, exception or unmanaged situation (for example the application uses a third party service which doesn’t work, and the application does not handle it). All those things can lead to a slow or unresponsive application even if you have a lot of resources available.
Adding more and more application instances and balancing load across them will lead to a simple result: more and more exceptions.

Ok, let’s ride our fantasy horse and think about a bug free application, can we scale up with containers with no worries?
Sure you can, but do you really need it?
For 99% of companies nowadays it’s really rare that an application doesn’t have enough resources, or gets such a huge amount of requests that it runs out of them.
If you are Google, Facebook, Netflix, Amazon or any other huge global company maybe you really need horizontal scalability, so orchestrators and containers are very useful (it’s not surprising that one of the most popular orchestrators, Kubernetes, came from Google), otherwise… well no, you don’t need it.

So that’s all?
No, there are a lot of smaller pros and cons (mostly cons, to be honest, imho…) on this topic, like security, access to third party resources, logging, service redundancy vs consolidation and many more.
Most of these are huge problems with containers (and most of the time they are ignored) while with a traditional architecture they simply aren’t. Even a simple stdout append to a log is a pain in the ass with containers, and requires adding a lot of complexity to reach this simple goal (remember the KISS principle).

Let me add a small personal thought on this subject that can be extended to many other areas of the IT universe.
As you have probably understood, I think that containers are a really good tool for developers and not a good one for sysadmins; containers were born from developers, for developers.
Why do people always think that a development tool should work for production?
Why don’t people think from a production perspective when they develop tools for production?

If you are a developer and you’re working on a tool that will be used in production, please ask your sysadmin what he thinks about it, what his requirements are, and what problems he usually has in production and needs to fix.
Maybe if we start to develop tools from the right perspective we’ll have better results, otherwise we will keep having the same problems over and over.

[EDIT 23/04/2021]
Let me just add a little contribution from John Carmack, one of the greatest developers of all time, the father of DooM and Quake, and in general of modern FPS videogames.

26/02/2019

Windows 10 1809 RDP “black screen of death”

Recently I noticed some serious issues connecting from my Windows 10 1809 laptop to other Windows 10 1809 hosts via RDP over VPN (OpenVPN over TCP).
The RDP server replied correctly to the initial connection and asked for login, but after entering username and password… nothing, only a deep, dark, frightening black screen of death :(

There were no problems with RDP servers on previous Windows versions (Windows 7, Server 2012, Server 2016). Looking at my OpenVPN server I found no errors in the logs, but on my OpenVPN client (v 2.4.7-I603) I found some interesting log records:

Mon Feb 25 15:13:58 2019 us=737935 openvpnclient/xxx.xxx.xxx.xxx:48404 MULTI: packet dropped due to output saturation (multi_process_incoming_tun)

Looking through the OpenVPN forums and mailing lists I found no help: a few old issues with this kind of error but no solution at all, and nothing that made sense with my setup.
So I started to look around for Windows issues or new features introduced with the 1809 service pack (forgive me, but I still call those updates the old way :P ) and I found some interesting news: MS introduced RDP UDP Transport Extension Version 2 and enabled the UDP protocol by default for the RDP client.
It seems this causes some fragmentation issue with UDP that makes the VPN connection drop.

Currently there’s no patch to solve this problem, and it seems the few threads on Technet haven’t received any official reply from MS; for now the only workaround is to enable “Turn Off UDP On Client” using local group policies:

  • start the Local Group Policy Editor with gpedit.msc
  • navigate to Computer Configuration\Administrative Templates\Windows Components\Remote Desktop Services\Remote Desktop Connection Client
  • open “Turn Off UDP On Client”, enable it and apply the changes

Now without any reboot you should be able to connect to 1809 hosts without the ugly “black screen of death”.
