This page contains notes from my various tests and experiments. It is a raw record of what I did, without corrections for errors or later updates for things I learn. Use at your own risk.
Jetbot ROS2 #4: Continue setup including Xavier
2022-04-07
Ubutower config
The existing ubutower, if I run nvidia-smi, claims CUDA 11.4, but I could not figure out where 11.4 is installed. So I went to the NVIDIA site to reinstall.
From https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local I did the following commands:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda-repo-ubuntu2004-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-6-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
Now nvidia-smi fails with “Failed to initialize NVML: Driver/library version mismatch”.
For pytorch, if I go to https://pytorch.org/get-started/locally/ the choices for CUDA are 10.2 or 11.3. So I think what I want to do is install CUDA 11.3 on ubutower.
First let me see what I can uninstall. I’ll reverse the bash steps above:
sudo apt-get remove cuda
sudo shutdown -r now
I did apt list --installed | grep nvidia
to find other nvidia things. Then I removed:
kent@ubutower:/etc/apt/sources.list.d $ sudo rm cuda-ubuntu2004-11-6-local.list
sudo apt-get remove libnvidia-compute-510
sudo apt-get remove linux-objects-nvidia-470-5.13.0-39-generic
sudo apt-get remove linux-signatures-nvidia-5.13.0-39-generic
I did ```sudo apt-get install nvidia-driver-470```. Here's what that does:
```bash
kent@ubutower:~ $ sudo apt-get install nvidia-driver-470
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
dctrl-tools dkms libatomic1:i386 libbsd0:i386 libdrm-amdgpu1:i386 libdrm-intel1:i386 libdrm-nouveau2:i386 libdrm-radeon1:i386 libdrm2:i386
libedit2:i386 libelf1:i386 libexpat1:i386 libffi7:i386 libgl1:i386 libgl1-mesa-dri:i386 libglapi-mesa:i386 libglvnd0:i386
libglx-mesa0:i386 libglx0:i386 libllvm12:i386 libnvidia-cfg1-470 libnvidia-common-470 libnvidia-compute-470 libnvidia-compute-470:i386
libnvidia-decode-470 libnvidia-decode-470:i386 libnvidia-encode-470 libnvidia-encode-470:i386 libnvidia-extra-470 libnvidia-fbc1-470
libnvidia-fbc1-470:i386 libnvidia-gl-470 libnvidia-gl-470:i386 libnvidia-ifr1-470 libnvidia-ifr1-470:i386 libpciaccess0:i386
libsensors5:i386 libstdc++6:i386 libvulkan1:i386 libwayland-client0:i386 libx11-6:i386 libx11-xcb1:i386 libxau6:i386 libxcb-dri2-0:i386
libxcb-dri3-0:i386 libxcb-glx0:i386 libxcb-present0:i386 libxcb-randr0:i386 libxcb-shm0:i386 libxcb-sync1:i386 libxcb-xfixes0:i386
libxcb1:i386 libxdmcp6:i386 libxext6:i386 libxfixes3:i386 libxshmfence1:i386 libxxf86vm1:i386 mesa-vulkan-drivers:i386
nvidia-compute-utils-470 nvidia-dkms-470 nvidia-kernel-common-470 nvidia-kernel-source-470 nvidia-prime nvidia-settings nvidia-utils-470
screen-resolution-extra xserver-xorg-video-nvidia-470
Suggested packages:
debtags menu lm-sensors:i386
The following NEW packages will be installed:
dctrl-tools dkms libatomic1:i386 libbsd0:i386 libdrm-amdgpu1:i386 libdrm-intel1:i386 libdrm-nouveau2:i386 libdrm-radeon1:i386 libdrm2:i386
libedit2:i386 libelf1:i386 libexpat1:i386 libffi7:i386 libgl1:i386 libgl1-mesa-dri:i386 libglapi-mesa:i386 libglvnd0:i386
libglx-mesa0:i386 libglx0:i386 libllvm12:i386 libnvidia-cfg1-470 libnvidia-common-470 libnvidia-compute-470 libnvidia-compute-470:i386
libnvidia-decode-470 libnvidia-decode-470:i386 libnvidia-encode-470 libnvidia-encode-470:i386 libnvidia-extra-470 libnvidia-fbc1-470
libnvidia-fbc1-470:i386 libnvidia-gl-470 libnvidia-gl-470:i386 libnvidia-ifr1-470 libnvidia-ifr1-470:i386 libpciaccess0:i386
libsensors5:i386 libstdc++6:i386 libvulkan1:i386 libwayland-client0:i386 libx11-6:i386 libx11-xcb1:i386 libxau6:i386 libxcb-dri2-0:i386
libxcb-dri3-0:i386 libxcb-glx0:i386 libxcb-present0:i386 libxcb-randr0:i386 libxcb-shm0:i386 libxcb-sync1:i386 libxcb-xfixes0:i386
libxcb1:i386 libxdmcp6:i386 libxext6:i386 libxfixes3:i386 libxshmfence1:i386 libxxf86vm1:i386 mesa-vulkan-drivers:i386
nvidia-compute-utils-470 nvidia-dkms-470 nvidia-driver-470 nvidia-kernel-common-470 nvidia-kernel-source-470 nvidia-prime nvidia-settings
nvidia-utils-470 screen-resolution-extra xserver-xorg-video-nvidia-470
0 upgraded, 68 newly installed, 0 to remove and 803 not upgraded.
Need to get 310 MB of archives.
After this operation, 1,303 MB of additional disk space will be used.
Do you want to continue? [Y/n]
```
That is, no cuda. How to get a specific version?
https://forums.developer.nvidia.com/t/older-versions-of-cuda/108163 suggests I go to https://developer.nvidia.com/cuda-toolkit-archive. There I found CUDA Toolkit 11.3.0. Selecting the correct architecture, I end up with these instructions:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-3-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
Hmmm, install errors. There is some evidence that CUDA 11.3 wants driver 465:
Recommended packages:
libnvidia-compute-465:i386 libnvidia-decode-465:i386 libnvidia-encode-465:i386 libnvidia-ifr1-465:i386 libnvidia-fbc1-465:i386
libnvidia-gl-465:i386
The following packages will be REMOVED:
libnvidia-cfg1-470 libnvidia-common-470 libnvidia-compute-470 libnvidia-compute-470:i386 libnvidia-decode-470 libnvidia-decode-470:i386
libnvidia-encode-470 libnvidia-encode-470:i386 libnvidia-extra-470 libnvidia-fbc1-470 libnvidia-fbc1-470:i386 libnvidia-gl-470
libnvidia-gl-470:i386 libnvidia-ifr1-470 libnvidia-ifr1-470:i386 nvidia-compute-utils-470 nvidia-dkms-470 nvidia-driver-470
nvidia-kernel-common-470 nvidia-kernel-source-470 nvidia-utils-470 xserver-xorg-video-nvidia-470
...
Removing nvidia-driver-470
I’ll reboot and reinstall the drivers.
Uninstalling:
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '.*nvidia-.*'
Then I repeated:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-3-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
Now I get lots of errors like:
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list.d/nvidia-container-runtime.list:1 and /etc/apt/sources.list.d/nvidia-docker.list:1
So I am going to remove those sources.
And again I repeat:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-3-local_11.3.0-465.19.01-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-3-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
Again errors. Here is the start:
Setting up nvidia-dkms-465 (465.19.01-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)
A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau
from loading. This can be reverted by deleting the following file:
/etc/modprobe.d/nvidia-graphics-drivers.conf
A new initrd image has also been created. To revert, please regenerate your
initrd by running the following command after deleting the modprobe.d file:
`/usr/sbin/initramfs -u`
*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can ***
*** be loaded. ***
*****************************************************************************
INFO:Enable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
Loading new nvidia-465.19.01 DKMS files...
Building for 5.13.0-39-generic
Building for architecture x86_64
Building initial module for 5.13.0-39-generic
Error! Bad return status for module build on kernel: 5.13.0-39-generic (x86_64)
Consult /var/lib/dkms/nvidia/465.19.01/build/make.log for more information.
dpkg: error processing package nvidia-dkms-465 (--configure):
installed nvidia-dkms-465 package post-installation script subprocess returned error exit status 10
Setting up libcusolver-dev-11-3 (11.1.1.58-1) ...
Setting up cuda-cupti-11-3 (11.3.58-1) ...
dpkg: dependency problems prevent configuration of cuda-drivers-465:
cuda-drivers-465 depends on nvidia-dkms-465 (>= 465.19.01); however:
Package nvidia-dkms-465 is not configured yet.
OOPS! This was on uburog, not ubutower! Try the scripts above again, this time on ubutower.
Still showing errors with nvidia-dkms-465.
Another confusing point is that the install STILL seems to require nvidia-driver-465, yet I keep seeing comments that a specific driver version is not needed.
If I try to remove it with sudo apt remove nvidia-driver-465
I get the same errors. So now I have totally borked ubutower.
2022-04-07
I was able to get back control with this:
sudo apt-get remove --purge '*nvidia*' 'cuda*' 'nsight*'
sudo apt autoremove
Now dpkg -l | grep -i nvidia
shows nothing.
Now I’ll try the instructions at https://towardsdatascience.com/installing-multiple-cuda-cudnn-versions-in-ubuntu-fcb6aa5194e2 to install multiple CUDA versions.
sudo ubuntu-drivers autoinstall
That installed 470. Per https://blog.kovalevskyi.com/multiple-version-of-cuda-libraries-on-the-same-machine-b9502d50ae77 and others, “You need to have latest Nvidia driver that is required by the highest CUDA that you’re going to install.” Following instructions there, cuda 11.4 needs nvidia driver 470. So I should be good.
Let me first try to install cuda 11.4. Starting at https://developer.nvidia.com/cuda-toolkit-archive I select 11.4.4 with a network install, ubuntu 20.04. That gives me these instructions:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
Executing. (Also note I cleared cuda directories from /usr/local)
That fails: “Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation:
The following packages have unmet dependencies:
cuda : Depends: cuda-11-6 (>= 11.6.2) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.”
Checking errors, I followed down a tree of uninstallable packages, until finally I was able to install this:
sudo apt-get install cuda-drivers-470
but still problems.
I follow the following instructions to try to fix broken packages:
sudo apt update --fix-missing
sudo apt install -f
But still no go :( It wants cuda-drivers. apt show cuda-drivers -a
shows all versions. I’ll try:
sudo apt install cuda-drivers=470.103.01-1
OK that did the trick! sudo apt install cuda-11-4
worked. Now trying sudo apt install cuda-11-3
That worked in the sense that I now have a directory /usr/local/cuda-11.3, but the /usr/local/cuda link still points at 11.4 via /etc/alternatives/cuda.
I also added this to .bashrc:
# cuda See link in /etc/alternatives/cuda for version
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
and modified the links in /etc/alternatives/cuda to point to 11.3. Now nvcc --version
shows me 11.3.
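Since the CUDA debs register an /etc/alternatives entry for cuda, update-alternatives can do that switch instead of hand-editing the links:
```bash
# list the registered CUDA installs and pick one interactively
sudo update-alternatives --config cuda
# or point the link at 11.3 directly
sudo update-alternatives --set cuda /usr/local/cuda-11.3
nvcc --version   # should now report 11.3
```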
Now I am going to reinstall the nvidia container stuff on ubutower. Following https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
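Roughly, those steps amount to the following (a sketch from memory; the key-handling commands in the guide may differ, so check the page itself):
```bash
# re-add the nvidia-docker apt repo and install the container runtime support
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
```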
Testing gpu on ubutower
First, I want to confirm that I can run pytorch locally with gpu. I need a script for that, but first let me install.
Start at https://pytorch.org/get-started/locally/ That gives me:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
What I actually ran was:
python3 -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
I got this error:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tts 0.4.2 requires numpy==1.19.5, but you have numpy 1.22.1 which is incompatible.
tts 0.4.2 requires torch>=1.7, but you have torch 1.6.0 which is incompatible.
So I uninstalled tts: sudo python3 -m pip uninstall tts
I can test cuda using this:
import torch
torch.cuda.is_available()
That works locally.
But I am confused why the version of torch I have, 1.6, is so old. The current release on pip is 1.11.0. I uninstalled transformers, torch, torchvision, and torchaudio and tried their actual suggested command:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
Now I have torch-1.11.0+cu113.
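A quick check that python is picking up the cu113 build:
```bash
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```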
Xavier setup
Previously I learned how to flash the Xavier, and reflashed both the eMMC image and the SSD. Now I am running from the SSD. I need to do a standard configuration of it.
Things I did:
- Disabled screen blank and lock
- Copied .ssh from uburog
- Added github to .ssh/config
- added stuff to .bashrc
- cloned bdbd2_ws
- added kent to docker group
- install python3-pip
- install jetson-stats
Now I am trying to get pytorch working as a dockerfile.
There are jetson images at https://developer.nvidia.com/embedded/jetson-cloud-native which are supposed to work. But when I tried the pytorch container images, they do not work: running python3 and then import torch fails with a missing cuda-related library.
Docs claim that the jetpack release must be the same between container and device. They weren't: the container is 4.6.1 while the device SSD was 4.6. I doubt that is the issue, but I am reinstalling the OS on the device to 4.6.1 (using the sdkmanager rather than the jetson hacks scripts).
The answer
Turns out, it was not the 4.6.1 that was key. What was key was adding --runtime nvidia to the docker run command.
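For reference, the run command ends up looking something like this (the image tag is just an example; it needs to match the L4T release on the device):
```bash
sudo docker run -it --rm --runtime nvidia --network host \
    nvcr.io/nvidia/l4t-pytorch:r32.6.1-pth1.9-py3
```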
Now I have torch in the xavier docker container - but I cannot see cuda :(
I created a base container; now I will install torch per https://forums.developer.nvidia.com/t/pytorch-for-jetson-version-1-10-now-available/72048. No, instead per https://elinux.org/Jetson_Zoo#PyTorch_.28Caffe2.29
Nope. But when I followed the instructions from https://github.com/dusty-nv/jetson-containers to run the container, cuda IS available! Playing more with my own stuff, it seems like cuda fails for a non-root user. Why?
From https://forums.developer.nvidia.com/t/nvidia-docker-seems-unable-to-use-gpu-as-non-root-user/80276/4 it seems that the user must be a member of the group video. Tested and it works. Modifying the build files now.
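A minimal sketch of that fix (the user name is mine; in the build files it becomes a RUN step, without sudo):
```bash
# add the non-root user to the "video" group so it can reach the GPU device nodes
sudo usermod -aG video kent
```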
I wanted to try power modes. See https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/power_management_jetson_xavier.html for a chart. To change modes, use sudo nvpmodel -m 8
to change to mode 8, and sudo nvpmodel -q
to see the current mode.
Performance of various devices
I can pretty much now get dialogpt working on a variety of devices. Time to test performance.
I use dialogpt.py both with and without the GPU. It starts repeating after a few responses, with a repeatable speed. I take the mean of the last three times.
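dialogpt.py itself is not reproduced here; as a rough stand-in, the measurement amounts to something like this (model, prompt, and generation settings are placeholders):
```bash
python3 - <<'EOF'
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium").to(device)

ids = tok.encode("How are you today?" + tok.eos_token, return_tensors="pt").to(device)
times = []
for _ in range(6):  # responses settle into repetition with a stable speed
    start = time.time()
    model.generate(ids, max_length=100, pad_token_id=tok.eos_token_id)
    times.append(time.time() - start)
print("mean of last three:", sum(times[-3:]) / 3, "seconds")
EOF
```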
Results:
- xavier: CPU 4.33 s, GPU 0.49 s
- ubutower: CPU 0.52 s, GPU 0.10 s
2022-04-19
- nano: CPU 7.55 s, GPU 1.01 s
Note that --gpus all is not needed on the nano (and probably not on the xavier).
TensorRT
Following the blog at https://developer.nvidia.com/blog/optimizing-t5-and-gpt-2-for-real-time-inference-with-tensorrt/ I optimized GPT2 from HuggingFace on ubutower. Made and ran the Dockerfile, then went through the notebook:
https://github.com/NVIDIA/TensorRT/blob/7e190bed1eed0abb91a797b08fe521427faa288d/demo/HuggingFace/notebooks/gpt2.ipynb
The result showed no difference in the before and after performance.
Trying this on Xavier failed because I need to specify the architecture.
Still having troubles, see jtensorrt.Dockerfile
I tried various ways to install, but from https://github.com/NVIDIA/Torch-TensorRT/issues/969 apparently I will need to do a source install. That requires bazel, which I need to install.
2022-04-20
I’m making my way painfully through all of this.
I figured out how to find the bazel binaries using bazelisk, then added them directly in Docker. Now I can run bazel from the xavier docker container jtensorrt.
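A minimal sketch of the Docker side (the bazelisk version and release asset name are assumptions; what I actually copied in were the binaries bazelisk had fetched):
```bash
# ship bazelisk as `bazel`; it pulls the matching bazel release on first use
wget -O /usr/local/bin/bazel \
    https://github.com/bazelbuild/bazelisk/releases/download/v1.11.0/bazelisk-linux-arm64
chmod +x /usr/local/bin/bazel
bazel version
```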
Now I am trying to figure out how to compile torch-tensorrt from https://github.com/NVIDIA/Torch-TensorRT
I have bazel installed, and confirmed that cudnn is on the xavier. I’m going to adapt https://github.com/NVIDIA/Torch-TensorRT/blob/master/docker/Dockerfile to build.
I’m having issues accessing libcurand.so.10 during docker build:
Step 44/48 : RUN python3 setup.py bdist_wheel --use-cxx11-abi $* || exit 1
---> Running in 01e86221bbf3
Traceback (most recent call last):
File "setup.py", line 12, in <module>
from torch.utils import cpp_extension
File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 196, in <module>
_load_global_deps()
File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 149, in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory
Following https://stackoverflow.com/questions/59691207/docker-build-with-nvidia-runtime/61737404#61737404 I’m trying this:
Edit/create the /etc/docker/daemon.json with content:
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
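For that change to take effect, the Docker daemon needs a restart before rerunning the build:
```bash
sudo systemctl restart docker
```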
That fixed that, now I have a new issue.
2022-04-26
Previously I got torch-tensorrt to compile. Now when I try to run the demo at https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/GPT2/trt.py it fails:
[04/26/2022-19:13:11] [TRT] [W] ShapedWeights.cpp:173: Weights transformer.h.11.mlp.c_proj.weight has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[04/26/2022-19:13:15] [TRT] [W] DLA requests all profiles have same min, max, and opt value. All dla layers are falling back to GPU
python3: /root/gpgpu/MachineLearning/myelin/src/compiler/optimizer/reshape_ppg.cpp:950: void myelin::ir::reshape_ppg_t::transform_op(myelin::ir::bb_t*, myelin::ir::operation_t*): Assertion `op->outputs()[0]->dimensions().size() == 3' failed.
Aborted (core dumped)
Not sure where to go with this. Can I make another demo work?
Conversation is perhaps the most important thing to run remotely. So let me instead try to get the existing up-to-date stuff from dusty-nv working, like jetson-voice.