This post describes steps for fixing a robot that is broken. Since robots are computers, many of the steps also apply to fixing a broken phone, tablet or PC, although the details differ.
Hardware
The first step is to determine whether the hardware itself is working. The hardware subsystems that most commonly fail are 1) the power subsystem, 2) the compute subsystem (RAM/CPU/motherboard), and 3) the long-term storage subsystem (hard drive or SD card). You therefore want to check each subsystem systematically, in that order, to verify that it is working correctly.
Sometimes it is necessary to take the robot apart to check these subsystems. Be careful when you do: the pieces can be very fragile and delicate, and they aren’t made for outside conditions. Keep track of all the pieces you unscrew, and it’s always good to have spares in case. Take care when unscrewing, because you don’t want to strip the screws. Static charge can build up on you, which is dangerous when you get near power systems to fix them, so always wear an antistatic wrist strap when fixing power systems to ensure you don’t discharge any static into the robot. Whenever you are working on the robot while the power is on, make sure you are near the e-stop. Finally, while it is rarely the case, dust build-up inside your robot may be causing ventilation issues. If you have the robot apart and see a lot of dust, take a second to clear it out!
Power Subsystem. The most common symptom of a power problem is that the robot or device just won’t turn on. This could be because the power supply failed, because the battery died, or because your wall outlet doesn’t have power. (Batteries have a limited number of charging cycles; they wear out and need to be replaced!) Most robots or devices have some kind of LED that indicates that power is reaching the device, and the location, color, and meaning of these LEDs are often documented in the device spec sheet or on the board itself. For example, the Raspberry Pi Model B+ has two LEDs, labeled ACT (activity) and PWR (power), which you can see in this picture. If you can’t verify that power is reaching the device, get a multimeter and check that the voltage coming from the battery or power supply matches what is specified in the datasheet. You might have to take the device apart to access the power supply, and you might have to do some digging to find the datasheet. For example, here is the information on the voltages required to power the Raspberry Pi. Don’t forget to check the wall! Maybe the wall outlet isn’t receiving power, and your robot is fine. It’s also possible that a short circuit has occurred and something is fried. In this case, you’ll most likely need to replace a fuse, a wire, or another damaged component. One quick way to check for this is the “smell test”: if the inside of your robot smells slightly smoky or burned, you most likely had a short and should look for something that looks burned.
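
The multimeter check above comes down to comparing a reading against the datasheet’s nominal voltage and tolerance. Here is a minimal Python sketch of that comparison. The 5 V nominal value, the 5% tolerance, and the two sample readings are hypothetical placeholders, not values taken from any particular datasheet; substitute the numbers for your own device.

```python
def voltage_in_spec(measured_v, nominal_v, tolerance_fraction):
    """Return True if a multimeter reading falls within the allowed band.

    tolerance_fraction is the allowed deviation as a fraction of nominal,
    e.g. 0.05 for +/-5%.
    """
    low = nominal_v * (1.0 - tolerance_fraction)
    high = nominal_v * (1.0 + tolerance_fraction)
    return low <= measured_v <= high


# Hypothetical example values: a 5 V supply rail with a +/-5% tolerance.
NOMINAL_V = 5.0
TOLERANCE = 0.05

for reading in (5.08, 4.60):
    in_spec = voltage_in_spec(reading, NOMINAL_V, TOLERANCE)
    status = "within spec" if in_spec else "OUT OF SPEC"
    print(f"{reading:.2f} V: {status}")
```

A reading well below the band under load often points at a worn battery or an undersized supply, which is why it is worth measuring while the robot is actually running if you can do so safely.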
Compute Subsystem. The compute subsystem fails less frequently than the other two. However, it is next in the debugging process because it must be working before you can check whether your long-term storage is working. Once a computer is powered on, it conducts a power-on self-test (POST). The POST occurs immediately after powering on, before the computer boots into its operating system (which requires access to long-term storage). It verifies that each of the hardware components of the machine is working. A PC that fails its POST will emit BIOS beep codes, and you need to figure out what the beep codes mean. This requires figuring out exactly which BIOS you have, either by looking up the specifications for that PC or by opening it up to look at the motherboard. The documentation for the BIOS will then indicate what the beep codes mean. The Raspberry Pi doesn’t have beep codes; instead it has LED flash codes to indicate different errors.
Of course, the best way to get more information about what’s going on with the POST is to plug a monitor and keyboard into the computer. A PC may already have these attached, but a robot may not have a monitor plugged in by default. The BIOS beep codes and LED flashes indicate problems without needing a monitor, but plugging in a monitor will show exactly what’s going on. Most robots support this somehow; for example, our MOVO robot has an HDMI port on the back to plug into. The Raspberry Pi Model B+ also has an HDMI port, and many times my students are surprised to realize that you can plug in a monitor and keyboard and suddenly their quadrotor drone is a PC! Once you’ve verified that the POST ran successfully, you can be fairly confident the CPU, motherboard, and RAM are working. If it did not, it will indicate what’s wrong, and you can try replacing those components.
Long-term Storage Subsystem. If the POST succeeds, the next step is for the computer to boot into its operating system, which is held on the long-term storage device. Hard drives and SD cards are another common failure point. Both can sustain only a limited number of read/write operations before they fail. Hard drives can develop bad sectors, which are parts of the disk that are permanently bad, and can also fail entirely. You want to verify that your robot is able to read the long-term storage and boot into its operating system. One of the simplest ways to do this is to plug in a monitor and keyboard and see if it boots up! This lets you watch the whole boot process to see whether it completes successfully, and it also lets you check for other problems, like networking and software issues. Fortunately the hard drive or SD card is relatively easy to replace, if you backed up your data. In many cases there may be a standard disk image for the system you are working with; for example, the SD card in our Kuri robot failed, but we were able to restore it using an image we could download from their website.
Networking
The next major goal is to connect to your robot over the network. You don’t want to start working on this until you are reasonably sure the robot is passing its POST and booting into its operating system. The robot’s networking can be configured in one of three ways.
First, the robot’s wifi can be in Master mode, where it acts as its own wifi hot spot. You will see an SSID on your base station laptop or desktop that corresponds to the robot’s network. In this mode, when you connect to the robot (or PC), it will give your base station an IP address. You can find the robot’s IP address by using a tool like nmap to scan the network, by using route to look up the gateway machine, or by looking at your base station’s logs to find the IP address of the DHCP server.
Second, the wifi can be in Managed mode, where the robot looks for a wireless network to join. Typically in this case it is configured to look for a network with a particular SSID, and this configuration can be stored on the hard drive. Sometimes for commercial robots, there is a process where this is configured via an app. For example, I recently installed the Mysa smart thermostat. First my phone asked which network I wanted the Mysa to connect to. Then it connected to the Mysa wireless network, where the Mysa itself was the AP master, and neither my phone nor the Mysa had internet access. Then the phone told the Mysa what SSID and password to use, and both devices connected to that SSID (the “house” internet) and could talk to each other. The Kuri robot works the same way.
Third, the robot can be configured to use a fixed, static IP address. Many home routers work this way; in this case the static IP should hopefully be documented in the robot’s or device’s documentation. You can configure your base station to use a different static IP address in the same subnet, and the two should be able to connect. A common cause of problems here is a robot that you believe has a static IP but whose address is actually being changed dynamically by a DHCP server. You can also snoop for the robot’s address by using the ARP cache or network snooping tools, but that is beyond the scope of this article.
A simple test is to make sure you can ping your robot, and that your robot can ping you. Sometimes the networking problem comes from firewalls. Make sure the firewalls aren’t blocking network traffic your robot needs, but don’t just take them down completely; a full discussion of this is beyond the scope of this article, but note that we scanned the internet for robots and found a lot!
If your network setup uses ethernet cables, make sure there are no issues with the wires. They can break quite easily without it being obvious. You should also be aware of your network bandwidth and latency. If your robot is jittering around or sending incomplete sensor data, you may need to throttle things or switch from wifi to a wired connection.
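
Two of the checks described in this section, that the base station’s static IP is in the robot’s subnet and that the robot answers a ping, can be sketched in Python using only the standard library. This is a rough sketch rather than a complete diagnostic tool: the addresses and the /24 prefix length are hypothetical examples, and `can_ping` simply shells out to the system `ping` command, so it assumes that command is installed and allowed through your firewall.

```python
import ipaddress
import platform
import subprocess


def same_subnet(base_ip, robot_ip, prefix_len):
    """Return True if base_ip lies in the subnet that contains robot_ip."""
    # strict=False lets us pass a host address instead of a network address.
    network = ipaddress.ip_network(f"{robot_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(base_ip) in network


def ping_command(host, count=1):
    """Build a ping command line; Windows uses -n for the count, others use -c."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def can_ping(host, run=subprocess.run):
    """Return True if one ping to host succeeds.

    `run` is injectable so the function can be exercised without a network.
    """
    try:
        result = run(ping_command(host), capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# Hypothetical addresses: robot at 192.168.1.10 on a /24 network.
ROBOT_IP = "192.168.1.10"
for base_ip in ("192.168.1.50", "192.168.2.50"):
    print(base_ip, "same subnet as robot:", same_subnet(base_ip, ROBOT_IP, 24))
```

Calling `can_ping(ROBOT_IP)` from the base station, and running the same function on the robot against the base station’s address, covers both directions of the ping test. Keep in mind that a failed ping does not by itself distinguish a firewall dropping ICMP from a robot that is actually unreachable.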
Software
Next you want to make sure the software on your robot is working. The details of this are also beyond the scope of this article, but at a high level, you want to make sure that 1) the software that makes the robot move started up without errors and 2) it can connect to whatever it needs in order to do its job. Typically this means using tools like ps to show running processes, checking log files to see if there are errors on startup, and writing and running client programs to check that each program is responding. The key is to be systematic: check each sensor or actuator subsystem on the robot to verify that it started and is working, because you may find other hardware issues in this process. Typically each sensor comes with its own drivers and mini-packages for checking that it works, and these drivers are used within a larger system like ROS to connect things together. Always make sure that the drivers for your sensors work on your robot platform first. For example, a common problem on our drones is being unable to connect to the flight controller because the USB socket has failed, so the drone can’t send a command to arm or spin its motors. ROS does not make this debugging process easy, because people often configure their robot to use a single roslaunch file to start the entire robot stack, which makes it easy to miss errors in one subsystem or another and hard to audit a substack. For example, to debug our MOVO robot, we manually started the MoveIt stack from the command line so that we could see the error logs and try different configurations to fix problems.
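
As a rough illustration of being systematic about processes and logs, the Python sketch below checks a list of expected process names against text in the form printed by `ps -e -o comm=` (one command name per line) and pulls error lines out of a log. The process names, the log lines, and the assumption that errors are marked with the words ERROR or FATAL are all hypothetical; adapt them to whatever your robot’s stack actually prints.

```python
import re

# Assumed convention: error lines contain the word ERROR or FATAL.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def find_error_lines(log_lines):
    """Return (line_number, text) pairs for log lines that look like errors."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if ERROR_PATTERN.search(line)
    ]


def missing_processes(expected_names, ps_output):
    """Return the expected names that do not appear in the ps output text."""
    running = {line.strip() for line in ps_output.splitlines() if line.strip()}
    return [name for name in expected_names if name not in running]


# Hypothetical sample data standing in for real ps output and a real log file.
sample_ps_output = "systemd\nroscore\nbash\n"
sample_log = [
    "[INFO] camera driver started\n",
    "[ERROR] could not open /dev/ttyUSB0\n",
    "[WARN] low battery\n",
]

print("Not running:", missing_processes(["roscore", "flight_bridge"], sample_ps_output))
for number, text in find_error_lines(sample_log):
    print(f"log line {number}: {text}")
```

Running a check like this once per subsystem, rather than once for the whole stack, makes it harder for a single failed driver to hide behind an otherwise healthy-looking launch.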
Software Dependencies.
Calibration. A very common source of errors is miscalibration of some kind. For example, our MOVO robot has a Kinect 2 sensor mounted on its head. Even though we carefully calibrated the sensor about one year ago, it got knocked or moved in that time, and when we checked it yesterday it was off by quite a bit. If your robot does not know where its sensors are relative to its base frame, then it cannot effectively use the sensor data. To check this, use a tool such as RViz to visualize the sensor data overlaid on the robot’s model, or use some kind of fiducial such as AprilTags to check the calibration or recalibrate.
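
One simple way to put a number on a fiducial check is to compare where the robot thinks a tag is, after transforming the detection into the base frame with the current calibration, against where you measured the tag to be by hand. The Python sketch below does only that comparison. The coordinates and the 2 cm tolerance are hypothetical, and it assumes you already have both positions expressed in the robot’s base frame in meters; obtaining the detection itself is left to your fiducial library.

```python
import math


def position_error(expected_xyz, measured_xyz):
    """Euclidean distance in meters between two 3D points in the same frame."""
    return math.dist(expected_xyz, measured_xyz)


def calibration_ok(expected_xyz, measured_xyz, tolerance_m):
    """Return True if the detected tag position is within tolerance."""
    return position_error(expected_xyz, measured_xyz) <= tolerance_m


# Hypothetical values: tag position measured by hand vs. reported by the sensor.
expected_tag = (1.00, 0.00, 0.50)
detected_tag = (1.03, -0.04, 0.50)
TOLERANCE_M = 0.02  # assumed acceptable error; choose one that suits your task

error = position_error(expected_tag, detected_tag)
print(f"position error: {error:.3f} m")
print("calibration OK" if calibration_ok(expected_tag, detected_tag, TOLERANCE_M)
      else "recalibration recommended")
```

This checks position only. A sensor that has been rotated slightly can look fine at one tag location and be far off at another, so it is worth repeating the comparison with the tag at a few different distances and angles.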