Program Tip

Alpine Linux에 Pandas를 설치하는 데 시간이 걸리는 이유

programtip 2021. 1. 8. 22:14
반응형

Alpine Linux에 Pandas를 설치하는 데 시간이 걸리는 이유


기본 OS Alpine 대 CentOS 또는 Debian을 사용하여 Docker 컨테이너에 Pandas 및 Numpy (종속성)를 설치하는 데 훨씬 오래 걸린다는 사실을 확인했습니다. 시차를 보여주기 위해 아래에 작은 테스트를 만들었습니다. Alpine이 Pandas 및 Numpy를 설치하기 위해 빌드 종속성을 업데이트하고 다운로드하는 데 몇 초가 걸리는 것 외에도 setup.py가 데비안 설치보다 약 70 배 더 많은 시간이 걸리는 이유는 무엇입니까?

Alpine을 기본 이미지로 사용하여 설치 속도를 높일 수있는 방법이 있습니까? 아니면 Pandas 및 Numpy와 같은 패키지에 사용하는 것이 더 좋은 Alpine과 비슷한 크기의 다른 기본 이미지가 있습니까?

Dockerfile.debian

FROM python:3.6.4-slim-jessie

RUN pip install pandas

Pandas 및 Numpy로 Debian 이미지를 빌드합니다.

[PandasDockerTest] time docker build -t debian-pandas -f Dockerfile.debian . --no-cache
    Sending build context to Docker daemon  3.072kB
    Step 1/2 : FROM python:3.6.4-slim-jessie
     ---> 43431c5410f3
    Step 2/2 : RUN pip install pandas
     ---> Running in 2e4c030f8051
    Collecting pandas
      Downloading pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl (26.2MB)
    Collecting numpy>=1.9.0 (from pandas)
      Downloading numpy-1.14.1-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)
    Collecting pytz>=2011k (from pandas)
      Downloading pytz-2018.3-py2.py3-none-any.whl (509kB)
    Collecting python-dateutil>=2 (from pandas)
      Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
    Collecting six>=1.5 (from python-dateutil>=2->pandas)
      Downloading six-1.11.0-py2.py3-none-any.whl
    Installing collected packages: numpy, pytz, six, python-dateutil, pandas
    Successfully installed numpy-1.14.1 pandas-0.22.0 python-dateutil-2.6.1 pytz-2018.3 six-1.11.0
    Removing intermediate container 2e4c030f8051
     ---> a71e1c314897
    Successfully built a71e1c314897
    Successfully tagged debian-pandas:latest
    docker build -t debian-pandas -f Dockerfile.debian . --no-cache  0.07s user 0.06s system 0% cpu 13.605 total

Dockerfile.alpine

FROM python:3.6.4-alpine3.7

RUN apk --update add --no-cache g++

RUN pip install pandas

Pandas & Numpy로 알파인 이미지 빌드 :

[PandasDockerTest] time docker build -t alpine-pandas -f Dockerfile.alpine . --no-cache
Sending build context to Docker daemon   16.9kB
Step 1/3 : FROM python:3.6.4-alpine3.7
 ---> 4b00a94b6f26
Step 2/3 : RUN apk --update add --no-cache g++
 ---> Running in 4b0c32551e3f
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.7/community/x86_64/APKINDEX.tar.gz
(1/17) Upgrading musl (1.1.18-r2 -> 1.1.18-r3)
(2/17) Installing libgcc (6.4.0-r5)
(3/17) Installing libstdc++ (6.4.0-r5)
(4/17) Installing binutils-libs (2.28-r3)
(5/17) Installing binutils (2.28-r3)
(6/17) Installing gmp (6.1.2-r1)
(7/17) Installing isl (0.18-r0)
(8/17) Installing libgomp (6.4.0-r5)
(9/17) Installing libatomic (6.4.0-r5)
(10/17) Installing pkgconf (1.3.10-r0)
(11/17) Installing mpfr3 (3.1.5-r1)
(12/17) Installing mpc1 (1.0.3-r1)
(13/17) Installing gcc (6.4.0-r5)
(14/17) Installing musl-dev (1.1.18-r3)
(15/17) Installing libc-dev (0.7.1-r0)
(16/17) Installing g++ (6.4.0-r5)
(17/17) Upgrading musl-utils (1.1.18-r2 -> 1.1.18-r3)
Executing busybox-1.27.2-r7.trigger
OK: 184 MiB in 50 packages
Removing intermediate container 4b0c32551e3f
 ---> be26c3bf4e42
Step 3/3 : RUN pip install pandas
 ---> Running in 36f6024e5e2d
Collecting pandas
  Downloading pandas-0.22.0.tar.gz (11.3MB)
Collecting python-dateutil>=2 (from pandas)
  Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
Collecting pytz>=2011k (from pandas)
  Downloading pytz-2018.3-py2.py3-none-any.whl (509kB)
Collecting numpy>=1.9.0 (from pandas)
  Downloading numpy-1.14.1.zip (4.9MB)
Collecting six>=1.5 (from python-dateutil>=2->pandas)
  Downloading six-1.11.0-py2.py3-none-any.whl
Building wheels for collected packages: pandas, numpy
  Running setup.py bdist_wheel for pandas: started
  Running setup.py bdist_wheel for pandas: still running...
  Running setup.py bdist_wheel for pandas: still running...
  Running setup.py bdist_wheel for pandas: still running...
  Running setup.py bdist_wheel for pandas: still running...
  Running setup.py bdist_wheel for pandas: still running...
  Running setup.py bdist_wheel for pandas: still running...
  Running setup.py bdist_wheel for pandas: finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/e8/ed/46/0596b51014f3cc49259e52dff9824e1c6fe352048a2656fc92
  Running setup.py bdist_wheel for numpy: started
  Running setup.py bdist_wheel for numpy: still running...
  Running setup.py bdist_wheel for numpy: still running...
  Running setup.py bdist_wheel for numpy: still running...
  Running setup.py bdist_wheel for numpy: finished with status 'done'
  Stored in directory: /root/.cache/pip/wheels/9d/cd/e1/4d418b16ea662e512349ef193ed9d9ff473af715110798c984
Successfully built pandas numpy
Installing collected packages: six, python-dateutil, pytz, numpy, pandas
Successfully installed numpy-1.14.1 pandas-0.22.0 python-dateutil-2.6.1 pytz-2018.3 six-1.11.0
Removing intermediate container 36f6024e5e2d
 ---> a93c59e6a106
Successfully built a93c59e6a106
Successfully tagged alpine-pandas:latest
docker build -t alpine-pandas -f Dockerfile.alpine . --no-cache  0.54s user 0.33s system 0% cpu 16:08.47 total

Debian based images use only python pip to install packages with .whl format:

  Downloading pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl (26.2MB)
  Downloading numpy-1.14.1-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)

WHL format was developed as a quicker and more reliable method of installing Python software than re-building from source code every time. WHL files only have to be moved to the correct location on the target system to be installed, whereas a source distribution requires a build step before installation.

Wheel packages pandas and numpy are not supported in images based on Alpine platform. That's why when we install them using python pip during the building process, we always compile them from the source files in alpine:

  Downloading pandas-0.22.0.tar.gz (11.3MB)
  Downloading numpy-1.14.1.zip (4.9MB)

and we can see the following inside container during the image building:

/ # ps aux
PID   USER     TIME   COMMAND
    1 root       0:00 /bin/sh -c pip install pandas
    7 root       0:04 {pip} /usr/local/bin/python /usr/local/bin/pip install pandas
   21 root       0:07 /usr/local/bin/python -c import setuptools, tokenize;__file__='/tmp/pip-build-en29h0ak/pandas/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n
  496 root       0:00 sh
  660 root       0:00 /bin/sh -c gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -DTHREAD_STACK_SIZE=0x100000 -fPIC -Ibuild/src.linux-x86_64-3.6/numpy/core/src/pri
  661 root       0:00 gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -DTHREAD_STACK_SIZE=0x100000 -fPIC -Ibuild/src.linux-x86_64-3.6/numpy/core/src/private -Inump
  662 root       0:00 /usr/libexec/gcc/x86_64-alpine-linux-musl/6.4.0/cc1 -quiet -I build/src.linux-x86_64-3.6/numpy/core/src/private -I numpy/core/include -I build/src.linux-x86_64-3.6/numpy/core/includ
  663 root       0:00 ps aux

If we modify Dockerfile a little:

FROM python:3.6.4-alpine3.7
RUN apk add --no-cache g++ wget
RUN wget https://pypi.python.org/packages/da/c6/0936bc5814b429fddb5d6252566fe73a3e40372e6ceaf87de3dec1326f28/pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl
RUN pip install pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl

we get the following error:

Step 4/4 : RUN pip install pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl
 ---> Running in 0faea63e2bda
pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl is not a supported wheel on this platform.
The command '/bin/sh -c pip install pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl' returned a non-zero code: 1

Unfortunately, the only way to install pandas on an Alpine image is to wait until build finishes.

Of course if you want to use the Alpine image with pandas in CI for example, the best way to do so is to compile it once, push it to any registry and use it as a base image for your needs.

EDIT: If you want to use the Alpine image with pandas you can pull my nickgryg/alpine-pandas docker image. It is a python image with pre-compiled pandas on the Alpine platform. It should save your time.


[Update:]

ANSWER: IT DOESN'T!

In any Alpine Dockerfile you can simply do*

RUN apk add py2-numpy@community py2-scipy@community py-pandas@edge

This is because numpy, scipy and now pandas are all available prebuilt on alpine:

https://pkgs.alpinelinux.org/packages?name=*numpy

https://pkgs.alpinelinux.org/packages?name=*scipy&branch=edge

https://pkgs.alpinelinux.org/packages?name=*pandas&branch=edge

One way to avoid rebuilding every time, or using a Docker layer, is to use a prebuilt, native Alpine Linux/.apk package, e.g.

https://github.com/sgerrand/alpine-pkg-py-pandas

https://github.com/nbgallery/apks

You can build these .apks once and use them wherever in your Dockerfile you like :)

This also saves you having to bake everything else into the Docker image before the fact - i.e. the flexibility to pre-build any Docker image you like.

PS I have put a Dockerfile stub at https://gist.github.com/jtlz2/b0f4bc07ce2ff04bc193337f2327c13b that shows roughly how to build the image. These include the important steps (*):

RUN echo "@community http://dl-cdn.alpinelinux.org/alpine/edge/community" >> /etc/apk/repositories
RUN apk update
RUN apk add --update --no-cache libgfortran

LATEST UPDATE on 17/09/2019

So, py3-pandas & py3-numpy packages moved to the testing alpine repository, so, you can download it by adding these lines in to the your Dockerfile:

RUN echo "http://dl-8.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories \
  && apk update \
  && apk add py3-numpy py3-pandas

Hope it helps someone!

Alpine packages links:
- py3-pandas
- py3-numpy

Alpine repositories docks info.


Just going to bring some of these answers together in one answer and add a detail I think was missed. The reason certain python libraries, particularly optimized math and data libraries, take so long to build on alpine is because the pip wheels for these libraries include binaries precompiled from c/c++ and linked against glibc, a common set of c standard libraries. Debian, Fedora, CentOS all (typically) use glibc, but alpine, in order to stay lightweight, uses musl-libc instead. c/c++ binaries build on a glibc system will not work on a system without glibc and the same goes for musl.

Pip looks first for a wheel with the correct binaries, if it can't find one, it tries to compile the binaries from the c/c++ source and links them against musl. In many cases, this won't even work unless you have the python headers from python3-dev or build tools like make.

Now the silver lining, as others have mentioned, there are apk packages with the proper binaries provided by the community, using these will save you the (sometimes lengthy) process of building the binaries.

ReferenceURL : https://stackoverflow.com/questions/49037742/why-does-it-take-ages-to-install-pandas-on-alpine-linux

반응형