This page contains notes from my various tests and experiments. It is a raw record of what I did, without correction for errors or later updates for things that I learn. Use at your own risk.
Jetbot ROS2 #3: Use jetbot locally
2022-03-30
The issue today is how to organize my code.
I have a single repo, bdbd2, with my ROS2 packages. The package ‘bdbd2’ itself is a subdirectory of the repo ‘bdbd2’ containing just documentation. Under that are separate packages, all prefixed with bdbd2: bdbd2_msgs has messages, and bdbd2_jetbot contains code that runs on Jetson-based robots.
Where to put the docker nodes? Is ‘docker’ even a package? Really it is a build system. This should be as independent of platform as possible.
Which means that I also need a place to store ROS2 nodes that could run in multiple environments, including jetson, docker, cloud, workstation, etc. What to call this? And how granular do I make it? Should a node that will run, for example, dialog in docker have its own package? If such nodes have messages, I will end up with packages that contain just a single message, which does not seem optimal. But maybe inevitable.
For now, a node that has complex dependencies (like transformers for Dialog) probably should have its own ROS2 package. But let me leave the messages in bdbd2_msgs until I decide to make this more public, at which point I might break them out.
The problem with a separate directory for docker-based nodes is that I have to check in to test, which seems unhelpful. Try instead putting docker in the package, perhaps with another repo that will contain common prerequisites.
So goals:
- ROS2 package with conversation
- docker-based repo to build
How about different vscode workspaces? Should each be its own repo, or could I combine them? I need to keep the idea of a vscode workspace separate from that of a ROS2 workspace.
2022-03-31
I did a test of combining Dockerfile fragments, and came up with this shell code as an example:
```bash
#!/bin/bash
# Demo of concatenating various files together
frags="from.frag sox.frag nano.frag"
cat $frags > Dockerfile.3
```
I’m going to recast my Dockerfiles using this method.
But https://github.com/athackst/dockerfiles, which is my starting point, used Jinja for this. Let me look at that first. No, too complex. Let me start with simple bash scripts.
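As a sketch of where this could go, here is a minimal build script that assembles a Dockerfile from fragments and then builds an image from it. The fragment list is from the demo above; the image tag is just an assumption for illustration:
```bash
#!/bin/bash
# Assemble a Dockerfile from fragments, then build an image from it.
set -e
frags="from.frag sox.frag nano.frag"
cat $frags > Dockerfile.3
docker build -f Dockerfile.3 -t bdbd2/ros2base .
```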
2022-04-01
Today I hope to implement the ability to start docker processes on remote machines. In bdbd, this was done by running a ros node on each system, but this requires that ros be installed. I would rather simply require docker, and run ros2 in docker containers.
So I want to write a ros2 node that can start docker containers on remote systems.
With ssh, it is pretty straightforward:
```bash
kent@nano:~ $ ssh uburog docker run bdbd2/ros2base ros2 topic list
/parameter_events
/rosout
```
But I really need to test loading a node in the bdbd2 package.
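Until there is a real ROS2 node for this, a thin wrapper over the ssh pattern above captures the idea. A minimal sketch (the --rm and --network host flags are my assumptions, and cross-host DDS discovery would still need network configuration):
```bash
#!/bin/bash
# Sketch: run a ROS2 command in a docker container on a remote host over ssh.
# Usage: ./remote_run.sh uburog bdbd2/ros2base ros2 topic list
host="$1"; image="$2"; shift 2
# --network host (an assumption) so ROS2/DDS traffic is visible outside the container
ssh "$host" docker run --rm --network host "$image" "$@"
```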
Dependencies
How am I going to manage the complex dependencies of individual nodes? I don’t want the entire bdbd2_core package to depend on complex things like torch or transformers. Plus I can’t fully use the rosdep system anyway, since the rosdep keys that I need in many cases do not exist.
I need an install script for specific nodes, which will presumably be kept in the bdbd2_core package. That means the script needs to be part of the docker context, so I need to change slightly how docker is organized.
(Update 2022-04-06): I’ve done this; the workspace has a dockers folder with fragments for different features.
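To illustrate how a node-specific install script could plug into the fragment scheme, something like this sketch (the fragment and script names here are hypothetical):
```bash
# Hypothetical fragment for a node with heavy dependencies.
# The install script lives in the bdbd2_core package, so it is inside the docker context.
cat > dialog.frag <<'EOF'
COPY bdbd2_core/install/dialog_install.sh /tmp/
RUN bash /tmp/dialog_install.sh && rm -f /tmp/dialog_install.sh
EOF
```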
2022-04-02
Xavier setup
I’m going to bring up the Xavier kit. I want to use JetPack 4.5.1 rather than 4.6.1, as some of the jetbot stuff did not work on 4.6.1. I also want the larger 256 GB card.
Going to https://developer.nvidia.com/embedded/downloads, I scrolled down to find the download https://developer.nvidia.com/embedded/l4t/r32_release_v5.1/r32_release_v5.1/jetson_xavier_nx/jetson-nx-jp451-sd-card-image.zip I’m going to flash that to my 256 GB card.
Oops, my Xavier does not accept SD cards. I ordered an NVMe M.2 SSD instead. I also tried removing that drive; the Xavier booted to the ‘other’ card which I had previously set up.
I’ve got to get the serial console working to mess with boot. Looks like the tool for that is minicom. Installed it on uburog.
2022-04-03
Had lots of problems with the USB serial connection. I never really got to something I would call a boot console.
What I’ll do instead is try to flash an OS onto the new NVMe SSD that I just bought. Maybe not even booting from it; just the /home directory or something.
Installed the new SSD; the system now boots only to the 16 GB eMMC memory. Previously that was configured with user kent and hostname bdbd. I want xavier as the hostname instead.
Hmmm, the card I bought is not seen by the Xavier. It has a slightly different connector, with an extra notch compared to what the slot provides. Seems I have the wrong part!
2022-04-05
I bought a proper NVMe SSD, but it still does not work; lspci does not show the new disk.
I wonder if I can use the Xavier NX on bdbd with the Nano dev kit carrier board? https://forums.developer.nvidia.com/t/considerations-when-using-jetson-xavier-nx-module-with-nano-devkit-carrier/119786 seems to say that I can. Let me try that.
Oops, the SSD is on the Nano module, and there is no M.2 slot on the Nano dev kit, so I would not have enough disk capacity. Boo.
But I tried running the Nano board on the Xavier carrier, and it works. I can see the NVMe SSD supplied by Seeed (using ls /dev and Disks). How about the new Crucial SSD that is not seen by the Xavier NX?
Yes, ls /dev shows /dev/nvme0 and /dev/nvme0n1. Let me try to partition and format that. Yep, with the Nano I can partition and format that disk. Let me now switch back to the Xavier NX. Yep, now I can see it. This is similar to https://forums.developer.nvidia.com/t/970-pro-nvme-m-2-1tb-ssd-is-not-detected-on-xavier-dev-kit/129079/36
So it seems like the way to get a new SSD configured on the Xavier NX is to first add a GPT partition table and format using the Nano SOM, then finish setting up on the Xavier SOM. Let me now try to flash that on the Xavier.
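For the record, one shell equivalent of that partition/format step (assumes the disk is /dev/nvme0n1; verify with lsblk first, since this wipes the disk):
```bash
# Create a GPT partition table, a single ext4 partition spanning the disk, and a filesystem.
sudo parted --script /dev/nvme0n1 mklabel gpt
sudo parted --script /dev/nvme0n1 mkpart primary ext4 0% 100%
sudo mkfs.ext4 /dev/nvme0n1p1
```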
I followed the bootFromExternalStorage instructions from JetsonHacks (this video: https://www.youtube.com/watch?v=drBKVbOmPEk).
That worked! I now have my Xavier booting from my new Crucial 256 GB SSD.
Now I am trying that with the SDK Manager. I went to https://developer.nvidia.com/nvidia-sdk-manager on uburog and installed it.
But it does not work. Per https://forums.developer.nvidia.com/t/sdk-manager-no-available-releases-for-ubuntu-20-04/175492, I have to be running Ubuntu 18 on the host for it to work. Grrr. Can I run it under Docker? Let me try.
I got out an old PC and installed Ubuntu 18.04 on it. Now sdkmanager works. Trying to flash the Xavier now. When I did, there was no choice for the NVMe. Maybe I am flashing the 16 GB internal drive instead? I’m using an older JetPack version, so I should be able to tell at the end.
I got a screen to install SDK components, but I had to skip it. It seemed to want me to enter a name and password, but I had not set that up yet.
I seem to have reinstalled the 16 GB eMMC memory. I can see the SSD, but it did not boot to that. Did a cold boot; same thing.
Now I am going to try to reinstall, this time with the SDK components and JetPack 4.6.1.
I see that the SDK components screen had options for the target device. For now I picked the eMMC memory.
I got to that screen again where I had to enter the name and password. But this time, I could already log in to the Xavier system directly. So for best usage, both host and target should have a screen, keyboard, and mouse connected.
The reboot brought me to the NVMe system. It seems to be using that system to do the eMMC install. Is that because the 4.6.1 install on the eMMC made it possible to boot to the NVMe? Not sure. When this is done, I’ll try installing JetPack 4.5.1 on the NVMe.
Install failed with the multimedia API. It seems to require JetPack 4.7.1. [Nope, not right. I am guessing this is because I booted into the NVMe, but the install was on the eMMC. Redo with the NVMe SSD removed.]
I might have to install this manually. The failed package is nvidia-l4t-jetson-multimedia-api.
Looking back at my notes, the jetson-inference compile failed because it does not support TensorRT 8, which is in JetPack 4.6.1. I mainly need that for jetson-utils, since I could not get the camera working from Docker.
So I think that I will use JetPack 4.6.1 on the NVMe as the standard configuration. But let me boot to the eMMC first to make sure that is stable (which I do by removing the SSD).
Oops, by installing all of that, I have virtually filled the eMMC memory! Need to redo with just the OS.
Checking jetson-inference, it has an update that claims it now supports JetPack 4.6.1. I should try that.
Ubutower config
On the existing ubutower, if I run nvidia-smi, it claims it is CUDA 11.4. I could not figure out where 11.4 is installed. So I went to the nvidia site to reinstall.
From https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local I did the following commands:
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda-repo-ubuntu2004-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-6-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
```
Now nvidia-smi fails with “Failed to initialize NVML: Driver/library version mismatch”. (This usually means the userspace driver libraries and the loaded kernel module are from different versions; a reboot reloads the module.)
For pytorch, if I go to https://pytorch.org/get-started/locally/, the choices for CUDA are 10.2 or 11.3. So I think what I want to do is install CUDA 11.3 on ubutower.
First let me see what I can uninstall. I’ll reverse the bash steps above:
```bash
sudo apt-get remove cuda
sudo shutdown -r now
```
I did apt list --installed | grep nvidia to find other nvidia things. Then I removed:
```bash
kent@ubutower:/etc/apt/sources.list.d $ sudo rm cuda-ubuntu2004-11-6-local.list
sudo apt-get remove libnvidia-compute-510
sudo apt-get remove linux-objects-nvidia-470-5.13.0-39-generic
sudo apt-get remove linux-signatures-nvidia-5.13.0-39-generic
```
I did ```sudo apt-get install nvidia-driver-470```. Here's what that does:
```bash
kent@ubutower:~ $ sudo apt-get install nvidia-driver-470
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
dctrl-tools dkms libatomic1:i386 libbsd0:i386 libdrm-amdgpu1:i386 libdrm-intel1:i386 libdrm-nouveau2:i386 libdrm-radeon1:i386 libdrm2:i386
libedit2:i386 libelf1:i386 libexpat1:i386 libffi7:i386 libgl1:i386 libgl1-mesa-dri:i386 libglapi-mesa:i386 libglvnd0:i386
libglx-mesa0:i386 libglx0:i386 libllvm12:i386 libnvidia-cfg1-470 libnvidia-common-470 libnvidia-compute-470 libnvidia-compute-470:i386
libnvidia-decode-470 libnvidia-decode-470:i386 libnvidia-encode-470 libnvidia-encode-470:i386 libnvidia-extra-470 libnvidia-fbc1-470
libnvidia-fbc1-470:i386 libnvidia-gl-470 libnvidia-gl-470:i386 libnvidia-ifr1-470 libnvidia-ifr1-470:i386 libpciaccess0:i386
libsensors5:i386 libstdc++6:i386 libvulkan1:i386 libwayland-client0:i386 libx11-6:i386 libx11-xcb1:i386 libxau6:i386 libxcb-dri2-0:i386
libxcb-dri3-0:i386 libxcb-glx0:i386 libxcb-present0:i386 libxcb-randr0:i386 libxcb-shm0:i386 libxcb-sync1:i386 libxcb-xfixes0:i386
libxcb1:i386 libxdmcp6:i386 libxext6:i386 libxfixes3:i386 libxshmfence1:i386 libxxf86vm1:i386 mesa-vulkan-drivers:i386
nvidia-compute-utils-470 nvidia-dkms-470 nvidia-kernel-common-470 nvidia-kernel-source-470 nvidia-prime nvidia-settings nvidia-utils-470
screen-resolution-extra xserver-xorg-video-nvidia-470
Suggested packages:
debtags menu lm-sensors:i386
The following NEW packages will be installed:
dctrl-tools dkms libatomic1:i386 libbsd0:i386 libdrm-amdgpu1:i386 libdrm-intel1:i386 libdrm-nouveau2:i386 libdrm-radeon1:i386 libdrm2:i386
libedit2:i386 libelf1:i386 libexpat1:i386 libffi7:i386 libgl1:i386 libgl1-mesa-dri:i386 libglapi-mesa:i386 libglvnd0:i386
libglx-mesa0:i386 libglx0:i386 libllvm12:i386 libnvidia-cfg1-470 libnvidia-common-470 libnvidia-compute-470 libnvidia-compute-470:i386
libnvidia-decode-470 libnvidia-decode-470:i386 libnvidia-encode-470 libnvidia-encode-470:i386 libnvidia-extra-470 libnvidia-fbc1-470
libnvidia-fbc1-470:i386 libnvidia-gl-470 libnvidia-gl-470:i386 libnvidia-ifr1-470 libnvidia-ifr1-470:i386 libpciaccess0:i386
libsensors5:i386 libstdc++6:i386 libvulkan1:i386 libwayland-client0:i386 libx11-6:i386 libx11-xcb1:i386 libxau6:i386 libxcb-dri2-0:i386
libxcb-dri3-0:i386 libxcb-glx0:i386 libxcb-present0:i386 libxcb-randr0:i386 libxcb-shm0:i386 libxcb-sync1:i386 libxcb-xfixes0:i386
libxcb1:i386 libxdmcp6:i386 libxext6:i386 libxfixes3:i386 libxshmfence1:i386 libxxf86vm1:i386 mesa-vulkan-drivers:i386
nvidia-compute-utils-470 nvidia-dkms-470 nvidia-driver-470 nvidia-kernel-common-470 nvidia-kernel-source-470 nvidia-prime nvidia-settings
nvidia-utils-470 screen-resolution-extra xserver-xorg-video-nvidia-470
0 upgraded, 68 newly installed, 0 to remove and 803 not upgraded.
Need to get 310 MB of archives.
After this operation, 1,303 MB of additional disk space will be used.
Do you want to continue? [Y/n]
```
That is, no cuda. How to get a specific version?
https://forums.developer.nvidia.com/t/older-versions-of-cuda/108163 suggests I go to https://developer.nvidia.com/cuda-toolkit-archive. There I found CUDA Toolkit 11.3.0. Selecting the correct architecture, I end up with these instructions:
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-3-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
```
Hmmm, install errors. There is some evidence that CUDA 11.3 wants the 465 drivers:
```
Recommended packages:
  libnvidia-compute-465:i386 libnvidia-decode-465:i386 libnvidia-encode-465:i386 libnvidia-ifr1-465:i386 libnvidia-fbc1-465:i386
  libnvidia-gl-465:i386
The following packages will be REMOVED:
  libnvidia-cfg1-470 libnvidia-common-470 libnvidia-compute-470 libnvidia-compute-470:i386 libnvidia-decode-470 libnvidia-decode-470:i386
  libnvidia-encode-470 libnvidia-encode-470:i386 libnvidia-extra-470 libnvidia-fbc1-470 libnvidia-fbc1-470:i386 libnvidia-gl-470
  libnvidia-gl-470:i386 libnvidia-ifr1-470 libnvidia-ifr1-470:i386 nvidia-compute-utils-470 nvidia-dkms-470 nvidia-driver-470
  nvidia-kernel-common-470 nvidia-kernel-source-470 nvidia-utils-470 xserver-xorg-video-nvidia-470
...
Removing nvidia-driver-470
```
I’ll reboot and reinstall the drivers.
Uninstalling:
```bash
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '.*nvidia-.*'
```
Then I repeated the same CUDA 11.3 install sequence as above (wget the pin file, dpkg -i the local repo package, apt-key add, apt-get update, and apt-get -y install cuda).
Now I get lots of errors like:
```
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list.d/nvidia-container-runtime.list:1 and /etc/apt/sources.list.d/nvidia-docker.list:1
```
So I am going to remove those sources.
And again I repeated the same install sequence.
Again errors. Here is the start:
```
Setting up nvidia-dkms-465 (465.19.01-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)
A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau
from loading. This can be reverted by deleting the following file:
/etc/modprobe.d/nvidia-graphics-drivers.conf
A new initrd image has also been created. To revert, please regenerate your
initrd by running the following command after deleting the modprobe.d file:
`/usr/sbin/initramfs -u`
*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can ***
*** be loaded. ***
*****************************************************************************
INFO:Enable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
Loading new nvidia-465.19.01 DKMS files...
Building for 5.13.0-39-generic
Building for architecture x86_64
Building initial module for 5.13.0-39-generic
Error! Bad return status for module build on kernel: 5.13.0-39-generic (x86_64)
Consult /var/lib/dkms/nvidia/465.19.01/build/make.log for more information.
dpkg: error processing package nvidia-dkms-465 (--configure):
 installed nvidia-dkms-465 package post-installation script subprocess returned error exit status 10
Setting up libcusolver-dev-11-3 (11.1.1.58-1) ...
Setting up cuda-cupti-11-3 (11.3.58-1) ...
dpkg: dependency problems prevent configuration of cuda-drivers-465:
 cuda-drivers-465 depends on nvidia-dkms-465 (>= 465.19.01); however:
  Package nvidia-dkms-465 is not configured yet.
```
OOPS! This was on uburog, not ubutower! Try the scripts above again, this time on ubutower.
Still showing errors with nvidia-dkms-465.
Another confusing point is that I STILL seem to require nvidia-driver-465, yet I keep seeing comments that a specific driver version is not needed.
If I try to remove it with sudo apt remove nvidia-driver-465, I get the same errors. So now I have totally borked ubutower.
2022-04-07
I was able to get back control with this:
```bash
sudo apt-get remove --purge '*nvidia*' 'cuda*' 'nsight*'
sudo apt autoremove
```
Now dpkg -l | grep -i nvidia shows nothing.
Now I’ll try the instructions at https://towardsdatascience.com/installing-multiple-cuda-cudnn-versions-in-ubuntu-fcb6aa5194e2 to install multiple CUDA versions:
sudo ubuntu-drivers autoinstall
That installed 470. Per https://blog.kovalevskyi.com/multiple-version-of-cuda-libraries-on-the-same-machine-b9502d50ae77 and others, “You need to have latest Nvidia driver that is required by the highest CUDA that you’re going to install.” Following the instructions there, CUDA 11.4 needs nvidia driver 470. So I should be good.
Let me first try to install CUDA 11.4. Starting at https://developer.nvidia.com/cuda-toolkit-archive, I select 11.4.4 with a network install, Ubuntu 20.04. That gives me these instructions:
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
```
Executing. (Also note I cleared the cuda directories from /usr/local.)
That fails: “Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation:”
```
The following packages have unmet dependencies:
 cuda : Depends: cuda-11-6 (>= 11.6.2) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
```
Checking the errors, I followed down a tree of uninstallable packages, until finally I was able to install this:
sudo apt-get install cuda-drivers-470
but there were still problems.
I followed these instructions to try to fix broken packages:
```bash
sudo apt update --fix-missing
sudo apt install -f
```
But still no go :( It wants cuda-drivers. apt show cuda-drivers -a shows all the versions. I’ll try:
sudo apt install cuda-drivers=470.103.01-1
OK, that did the trick! sudo apt install cuda-11-4 worked. Now trying sudo apt install cuda-11-3.
That worked in the sense that I have a directory /usr/local/cuda-11.3, but the ‘cuda’ link there still points at 11.4 via /etc/alternatives/cuda.
I also added this to .bashrc:
```bash
# cuda: see link in /etc/alternatives/cuda for version
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```
and modified the links in /etc/alternatives/cuda to point to 11.3. Now nvcc --version shows me 11.3.
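Rather than editing the links by hand, update-alternatives can do the switch, assuming the CUDA packages registered a ‘cuda’ alternatives group (check with --list first; this is a sketch, not something I ran):
```bash
# List the registered CUDA installations, then point 'cuda' at 11.3.
update-alternatives --list cuda
sudo update-alternatives --set cuda /usr/local/cuda-11.3
```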
Now I am going to reinstall the nvidia container stuff on ubutower. Following https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
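Once that is in place, a quick smoke test that containers can see the GPU; the image tag here is my assumption, any CUDA-enabled image should do:
```bash
# Run nvidia-smi inside a container via the nvidia container runtime.
sudo docker run --rm --gpus all nvidia/cuda:11.3.0-base-ubuntu20.04 nvidia-smi
```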
Testing GPU on ubutower
First, I want to confirm that I can run pytorch locally with the GPU. I need a script for that, but first let me install.
Start at https://pytorch.org/get-started/locally/. That gives me:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
What I actually ran was:
python3 -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
I got this error:
```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tts 0.4.2 requires numpy==1.19.5, but you have numpy 1.22.1 which is incompatible.
tts 0.4.2 requires torch>=1.7, but you have torch 1.6.0 which is incompatible.
```
So I uninstalled tts: sudo python3 -m pip uninstall tts
I can test CUDA using this:
```python
import torch
torch.cuda.is_available()
```
That works locally.
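The same check as a one-liner from the shell, handy after reinstalls:
```bash
# Print the torch version and whether CUDA is usable.
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```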
But I am confused about why my version of torch, 1.6, is so old. Current pip shows 1.11.0. I uninstalled transformers, torch, torchvision, and torchaudio, and tried their actual suggested command:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
Now I have torch-1.11.0+cu113.