Installing lxml for Python 3 on Windows 10

Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API.

Project description

lxml is a Pythonic, mature binding for the libxml2 and libxslt libraries. It
provides safe and convenient access to these libraries using the ElementTree
API.

It extends the ElementTree API significantly to offer support for XPath,
RelaxNG, XML Schema, XSLT, C14N and much more.

To contact the project, go to the project home page or see our bug tracker at
https://launchpad.net/lxml

In case you want to use the current in-development version of lxml,
you can get it from the github repository at
https://github.com/lxml/lxml . Note that this requires Cython to
build the sources, see the build instructions on the project home
page. To the same end, running easy_install lxml==dev will
install lxml from
https://github.com/lxml/lxml/tarball/master#egg=lxml-dev if you have
an appropriate version of Cython installed.

After an official release of a new stable series, bug fixes may become
available at
https://github.com/lxml/lxml/tree/lxml-4.9 .
Running easy_install lxml==4.9bugfix will install
the unreleased branch state from
https://github.com/lxml/lxml/tarball/lxml-4.9#egg=lxml-4.9bugfix
as soon as a maintenance branch has been established. Note that this
requires Cython to be installed at an appropriate version for the build.

4.9.2 (2022-12-13)

Bugs fixed

  • CVE-2022-2309: A Bug in libxml2 2.9.1[0-4] could let namespace declarations
    from a failed parser run leak into later parser runs. This bug was worked around
    in lxml and resolved in libxml2 2.10.0.
    https://gitlab.gnome.org/GNOME/libxml2/-/issues/378

Other changes

  • LP#1981760: Element.attrib now registers as collections.abc.MutableMapping.

  • lxml now has a static build setup for macOS on ARM64 machines (not used for building wheels).
    Patch by Quentin Leffray.

I’m trying to install lxml on my Windows 8.1 laptop with Python 3.4 and failing miserably.

First off, I tried the simple and obvious solution: pip install lxml. However, this didn’t work. Here’s what it said:

Downloading/unpacking lxml
  Running setup.py (path:C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml\setup.py) egg_info for package lxml
    Building lxml version 3.4.2.
    Building without Cython.
    ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n"
    ** make sure the development packages of libxml2 and libxslt are installed **

    Using build configuration of libxslt
    C:\Python34\lib\distutils\dist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)

    warning: no previously-included files found matching '*.py'
Installing collected packages: lxml
  Running setup.py install for lxml
    Building lxml version 3.4.2.
    Building without Cython.
    ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n"
    ** make sure the development packages of libxml2 and libxslt are installed **

    Using build configuration of libxslt
    building 'lxml.etree' extension
    C:\Python34\lib\distutils\dist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)
    error: Unable to find vcvarsall.bat
    Complete output from command C:\Python34\python.exe -c "import setuptools, tokenize;__file__='C:\\Users\\CARTE_~1\\AppData\\Local\\Temp\\pip_build_carte_000\\lxml\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record C:\Users\CARTE_~1\AppData\Local\Temp\pip-l8vvrv9g-record\install-record.txt --single-version-externally-managed --compile:
    Building lxml version 3.4.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n"
** make sure the development packages of libxml2 and libxslt are installed **

Using build configuration of libxslt
running install
running build
running build_py
creating build
creating build\lib.win32-3.4
creating build\lib.win32-3.4\lxml
copying src\lxml\builder.py -> build\lib.win32-3.4\lxml
copying src\lxml\cssselect.py -> build\lib.win32-3.4\lxml
copying src\lxml\doctestcompare.py -> build\lib.win32-3.4\lxml
copying src\lxml\ElementInclude.py -> build\lib.win32-3.4\lxml
copying src\lxml\pyclasslookup.py -> build\lib.win32-3.4\lxml
copying src\lxml\sax.py -> build\lib.win32-3.4\lxml
copying src\lxml\usedoctest.py -> build\lib.win32-3.4\lxml
copying src\lxml\_elementpath.py -> build\lib.win32-3.4\lxml
copying src\lxml\__init__.py -> build\lib.win32-3.4\lxml
creating build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\__init__.py -> build\lib.win32-3.4\lxml\includes
creating build\lib.win32-3.4\lxml\html
copying src\lxml\html\builder.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\clean.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\defs.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\diff.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\ElementSoup.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\formfill.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\html5parser.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\soupparser.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\usedoctest.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\_diffcommand.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\_html5builder.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\_setmixin.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\__init__.py -> build\lib.win32-3.4\lxml\html
creating build\lib.win32-3.4\lxml\isoschematron
copying src\lxml\isoschematron\__init__.py -> build\lib.win32-3.4\lxml\isoschematron
copying src\lxml\lxml.etree.h -> build\lib.win32-3.4\lxml
copying src\lxml\lxml.etree_api.h -> build\lib.win32-3.4\lxml
copying src\lxml\includes\c14n.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\config.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\dtdvalid.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\etreepublic.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\htmlparser.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\relaxng.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\schematron.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\tree.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\uri.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xinclude.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xmlerror.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xmlparser.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xmlschema.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xpath.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xslt.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\etree_defs.h -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\lxml-version.h -> build\lib.win32-3.4\lxml\includes
creating build\lib.win32-3.4\lxml\isoschematron\resources
creating build\lib.win32-3.4\lxml\isoschematron\resources\rng
copying src\lxml\isoschematron\resources\rng\iso-schematron.rng -> build\lib.win32-3.4\lxml\isoschematron\resources\rng
creating build\lib.win32-3.4\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\RNG2Schtrn.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\XSD2Schtrn.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl
creating build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_abstract_expand.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_dsdl_include.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_message.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_skeleton_for_xslt1.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_svrl_for_xslt1.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\readme.txt -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
C:\Python34\lib\distutils\dist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
  warnings.warn(msg)
error: Unable to find vcvarsall.bat

----------------------------------------
Cleaning up...
Command C:\Python34\python.exe -c "import setuptools, tokenize;__file__='C:\\Users\\CARTE_~1\\AppData\\Local\\Temp\\pip_build_carte_000\\lxml\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record C:\Users\CARTE_~1\AppData\Local\Temp\pip-l8vvrv9g-record\install-record.txt --single-version-externally-managed --compile failed with error code 1 in C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml
Storing debug log for failure in C:\Users\carte_000\pip\pip.log

So then I looked on this great and helpful thing called The Internet, and a lot of people have the same error about needing libxml2 and libxslt. They recommend Christoph Gohlke’s page, where he provides some sort of binary thingy for a bunch of packages. You can find it here (quicklink to the lxml part).

So after I gave up on trying to find libxml2 and libxslt for pip, I decided to go there, and found an absolute ton of downloads. I know I need a 64-bit one, but I have no idea which "cp" I need.

So an answer either giving me a solution on the pip method or the Gohlke index method would be great.
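
On the "cp" question: the cp part of a wheel filename names the CPython version the file was built for (cp34 means CPython 3.4), and the win32 or win_amd64 part has to match the bitness of the Python interpreter, which is not necessarily the bitness of Windows. The following is a minimal sketch using only the standard library; the helper name wheel_tag_hint is made up for illustration, and the cp prefix assumes the standard CPython interpreter.

```python
import struct
import sys


def wheel_tag_hint():
    """Return (python_tag, bits) for the interpreter running this script.

    Assumes CPython; other interpreters use a different tag prefix.
    """
    # cpXY: CPython major and minor version, e.g. cp34 for Python 3.4
    python_tag = "cp{}{}".format(sys.version_info[0], sys.version_info[1])
    # Pointer size in bytes times 8 gives the interpreter's bitness
    bits = struct.calcsize("P") * 8
    return python_tag, bits


tag, bits = wheel_tag_hint()
print("Look for a file tagged", tag, "built for", bits, "bit Python")
```

Once the matching .whl file is downloaded, it can be installed by passing its path to pip install.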

lxml is a Python library that makes it easy to process XML and HTML files, and it can also be used for web scraping. There are many off-the-shelf XML parsers, but for better results developers sometimes prefer to write their own XML and HTML parsers. That is when the lxml library comes into play. The key benefits of this library are that it is easy to use, extremely fast when parsing large documents, very well documented, and provides easy conversion of data to Python data types, which simplifies file manipulation.

In this tutorial we take a deep dive into Python's lxml library, starting with how to set it up on different operating systems, then discussing its benefits and the wide range of functionality it offers.

There are several ways to install lxml on your system. We cover some of them below.

Using pip

pip is a Python package manager that downloads and installs libraries on your local system; it also downloads and installs all dependencies of the package you are installing.

If pip is installed on your system, simply run the following command in a terminal or command prompt:

$ pip install lxml

apt-get

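If you are using a Debian-based Linux distribution such as Ubuntu, you can install lxml by running this command in your terminal (apt-get is not available on macOS, and the python3-lxml package is the one built for Python 3):

$ sudo apt-get install python3-lxml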

easy_install

You will probably not need this section, but if neither of the commands above works for some reason, try easy_install (note that easy_install is deprecated in current versions of setuptools):

$ easy_install lxml

Note: if you want to install a specific version of lxml, state it when running the command, for example pip install lxml==3.x.y, where x and y stand for the minor and patch numbers.

By now you should have a copy of the lxml library installed on your local machine. Let's see what can be done with it.
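
Before moving on, it can help to confirm that the installation actually worked for the interpreter you are running. The sketch below tries the import and, if it succeeds, prints the version tuples that lxml.etree exposes for itself and for the libxml2 and libxslt libraries it is using.

```python
try:
    from lxml import etree
except ImportError:
    # lxml is missing, or its compiled extension failed to load
    etree = None

if etree is None:
    print("lxml is not installed for this interpreter")
else:
    # Version tuples of lxml and of the C libraries it is using
    print("lxml:", etree.LXML_VERSION)
    print("libxml2:", etree.LIBXML_VERSION)
    print("libxslt:", etree.LIBXSLT_VERSION)
```

If the first message is printed, pip most likely installed into a different Python installation than the one running the script.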

Functionality

To use the lxml library in your program, you first need to import it:

from lxml import etree as et

This imports the etree module, the one we are interested in, from the lxml library.

Creating HTML and XML documents

Using the etree module, we can create XML and HTML elements and their subelements, which is very useful when writing or manipulating a file. Let's create the basic structure of an HTML file with etree:

root = et.Element('html', version="5.0")

# Pass the parent node, name of the child node,
# and any number of optional attributes
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize='22')
et.SubElement(root, 'body', fontsize="15")

In the code above, note that the Element function requires at least one argument, while SubElement requires at least two. Element only needs the name of the element to create, whereas SubElement needs both the parent node and the name of the child node to create.

Both functions have only a lower bound on the number of arguments and no upper bound, because you can attach as many attributes as you like. To add an attribute to an element, pass an extra keyword argument to Element or SubElement in the form attributeName='attribute value'.

Let's run the code we wrote above to better understand these functions:

# Use pretty_print=True to indent the HTML output
print(et.tostring(root, pretty_print=True).decode("utf-8"))

Output:

<html version="5.0">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>
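
Keyword arguments only work for attribute names that are valid Python identifiers. For names such as data-id or class, both Element and SubElement also accept a dictionary through the attrib parameter. A minimal sketch follows; the element and attribute names are invented for illustration, and the import falls back to the standard library's ElementTree, which shares these calls, so the snippet runs even where lxml is not installed yet.

```python
try:
    from lxml import etree as et
except ImportError:
    # Fallback: these calls are part of the shared ElementTree API
    import xml.etree.ElementTree as et

root = et.Element('html')
# 'data-id' and 'class' cannot be written as keyword arguments,
# so they are passed in a dictionary via attrib instead
div = et.SubElement(root, 'div', attrib={'data-id': '7', 'class': 'box'})

print(div.get('data-id'))
print(div.get('class'))
```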

There is another way to create elements and arrange them hierarchically. Let's explore it as well:

root = et.Element('html')
# Create standalone elements with Element, then attach them to the parent
root.append(et.Element('head'))
root.append(et.Element('body'))

In this approach, each time we create a new element we simply append it to the root or parent node.

Parsing HTML and XML documents

So far we have only looked at creating new elements, assigning attributes to them, and so on. Let's now look at an example where we already have an HTML or XML tree and want to parse it to extract certain information. Assuming we have the HTML tree we created in the first example, let's get the tag name of one specific element and then print the tag names of all elements.

print(root.tag)

Output:

html

Now, to iterate over all child elements of the root node and print their tags:

for e in root:
    print(e.tag)

Output:

head
title
body

Working with attributes

Let's now see how to attach attributes to existing elements and how to retrieve the value of a particular attribute of a given element.

Using the same root element as before, try the following code:

root.set('newAttribute', 'attributeValue') 

# Print root again to see if the new attribute has been added
print(et.tostring(root, pretty_print=True).decode("utf-8"))

Output:

<html version="5.0" newAttribute="attributeValue">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>

Here we can see that newAttribute="attributeValue" has indeed been added to the root element.

Now let's retrieve the values of the attributes set in the code above. We access a child element by indexing into the root element, then use the get() method to read the attribute:

print(root.get('newAttribute'))
print(root[1].get('alpha')) # root[1] accesses the `title` element
print(root[1].get('bgcolor'))

Output:

attributeValue
None
red
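
Two related details may be useful here: get() accepts a fallback value to return instead of None, and the attrib property exposes all attributes of an element as a dictionary-like object. The sketch below rebuilds a small tree for illustration and falls back to the standard library's ElementTree, which shares these calls, if lxml cannot be imported.

```python
try:
    from lxml import etree as et
except ImportError:
    # Fallback: these calls are part of the shared ElementTree API
    import xml.etree.ElementTree as et

root = et.Element('html', version="5.0")
title = et.SubElement(root, 'title', bgcolor="red", fontsize="22")

# get() accepts a fallback value that is returned instead of None
print(title.get('alpha', 'not set'))

# attrib behaves like a dictionary of attribute names and values
for name, value in sorted(title.attrib.items()):
    print(name, '=', value)
```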

Getting text from elements

Now that we have covered the basic functionality of the etree module, let's do some more interesting things with our HTML and XML files. These files almost always have text between their tags, so let's see how to add text to our elements:

# Copying the code from the very first example
root = et.Element('html', version="5.0")
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize="22")
et.SubElement(root, 'body', fontsize="15")

# Add text to the Elements and SubElements
root.text = "This is an HTML file"
root[0].text = "This is the head of that file"
root[1].text = "This is the title of that file"
root[2].text = "This is the body of that file and would contain paragraphs etc"

print(et.tostring(root, pretty_print=True).decode("utf-8"))

Output:

<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>
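
Reading text back works through the same text property, and when the text of a whole subtree is needed, itertext() walks the element and all of its descendants. A minimal sketch with made-up sample text, which falls back to the standard library's ElementTree (it shares these calls) if lxml cannot be imported:

```python
try:
    from lxml import etree as et
except ImportError:
    # Fallback: these calls are part of the shared ElementTree API
    import xml.etree.ElementTree as et

root = et.Element('html')
head = et.SubElement(root, 'head')
body = et.SubElement(root, 'body')
head.text = "Head text"
body.text = "Body text"

# itertext() yields the text of the element and all its descendants
# in document order, skipping elements that have no text
all_text = " ".join(root.itertext())
print(all_text)
```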

How to check whether an element has children

Next, there are two very important things we should be able to check, since many web scraping applications need them for exception handling. The first is whether an element has children, and the second is whether a node is an Element.

Let's do this for the nodes we created above:

if len(root) > 0:
    print("True")
else:
    print("False")

The code above prints "True" because the root node has child nodes. However, if we run the same check on the children of the root node, as in the code below, the output is "False".

for i in range(len(root)):
    if (len(root[i]) > 0):
        print("True")
    else:
        print("False")

Output:

False
False
False

Now let's do the same to see whether each of the nodes is an Element:

for i in range(len(root)):
    print(et.iselement(root[i]))

Output:

True
True
True

The iselement function is useful for determining whether you have a valid Element object and can therefore keep traversing it with the methods shown here.

How to check whether an element has a parent

We have just shown how to move down the hierarchy, that is, how to check whether an element has children. In this section we move up the hierarchy, that is, we check for and retrieve the parent of a child node.

print(root.getparent())
print(root[0].getparent())
print(root[1].getparent())

The first line should print None, since the root node itself has no parent. The other two should both point to the root element, that is, the html tag. Let's check the output to confirm that it matches our expectations.

Output:

None
<Element html at 0x1103c9688>
<Element html at 0x1103c9688>

Getting an element's siblings

In this section we learn how to move sideways in the hierarchy, that is, how to retrieve the siblings of an element in the tree.

Moving sideways in the tree is very similar to moving vertically. For the latter we used getparent and the length of the element; for the former we will use the getnext and getprevious functions. Let's try them on the nodes we created earlier to see how they work:

# root[1] is the `title` tag
print(root[1].getnext()) # The tag after the `title` tag
print(root[1].getprevious()) # The tag before the `title` tag

Output:

<Element body at 0x10b5a75c8>
<Element head at 0x10b5a76c8>

Here you can see that root[1].getnext() retrieved the body tag, since it is the next element, and root[1].getprevious() retrieved the head tag.

Similarly, calling getprevious on root would return None, and calling getnext on root[2] would also return None.

Parsing XML from a string

Moving on, if we have the contents of an XML or HTML file as a raw string and want to parse it to extract or modify information, we can do so as in the example below:

root = et.XML('<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>')
root[1].text = "The title text has changed!"
print(et.tostring(root, xml_declaration=True).decode('utf-8'))

Output:

<?xml version='1.0' encoding='ASCII'?>
<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">The title text has changed!</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>

As you can see, we successfully changed the text in the HTML document. The XML declaration was also added automatically because of the xml_declaration parameter we passed to the tostring function.
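
When no XML declaration is needed, tostring can also return a str directly, which avoids the decode() call. The sketch below uses a shortened, made-up document and passes encoding="unicode"; it falls back to the standard library's ElementTree, which shares these calls, if lxml cannot be imported.

```python
try:
    from lxml import etree as et
except ImportError:
    # Fallback: these calls are part of the shared ElementTree API
    import xml.etree.ElementTree as et

root = et.XML('<html><title>Old title</title><body/></html>')
root[0].text = "New title"

# encoding="unicode" makes tostring() return str instead of bytes,
# so no decode() call is needed
markup = et.tostring(root, encoding="unicode")
print(markup)
```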

Finding elements

The last thing we discuss is very handy when parsing XML and HTML files. We look at ways to check whether an element has children of a particular type and, if so, what they contain.

This has many practical uses, such as finding all link elements on a web page. The example below uses the tree built earlier by setting root.text and the text of its three children, before the title text was changed.

print(root.find('a')) # No <a> tags exist, so this will be `None`
print(root.find('head').tag)
print(root.findtext('title')) # Directly retrieve the title tag's text

Output:

None
head
This is the title of that file
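
For the link-finding use case mentioned above, note that find() with a plain tag name only looks at direct children and returns the first match, while iter() walks the entire tree. A minimal sketch with a small sample page written for this illustration, falling back to the standard library's ElementTree (it shares these calls) if lxml cannot be imported:

```python
try:
    from lxml import etree as et
except ImportError:
    # Fallback: these calls are part of the shared ElementTree API
    import xml.etree.ElementTree as et

# Sample markup made up for this example
page = et.XML(
    '<html><body>'
    '<p><a href="https://lxml.de/">lxml</a></p>'
    '<a href="https://pypi.org/project/lxml/">PyPI</a>'
    '</body></html>'
)

# The <a> tags are not direct children of <html>, so find() misses them
print(page.find('a'))

# iter() visits every element in the tree and yields each matching one
links = [a.get('href') for a in page.iter('a')]
print(links)
```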

Conclusion

In this tutorial we started with a basic introduction to what the lxml library is and what it is used for. We then saw how to install it in different environments such as Windows and Linux. Moving on, we explored functions that help us traverse an HTML or XML tree both vertically and sideways. Finally, we discussed ways to find elements in the tree and to retrieve information from them.

I’m trying to install lmxl on my Windows 8.1 laptop with Python 3.4 and failing miserably.

First off, I tried the simple and obvious solution: pip install lxml. However, this didn’t work. Here’s what it said:

Downloading/unpacking lxml
  Running setup.py (path:C:UsersCARTE_~1AppDataLocalTemppip_build_carte_000lxmlsetup.py) egg_info for package lxml
    Building lxml version 3.4.2.
    Building without Cython.
    ERROR: b"'xslt-config' is not recognized as an internal or external command,rnoperable program or batch file.rn"
    ** make sure the development packages of libxml2 and libxslt are installed **

    Using build configuration of libxslt
    C:Python34libdistutilsdist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)

    warning: no previously-included files found matching '*.py'
Installing collected packages: lxml
  Running setup.py install for lxml
    Building lxml version 3.4.2.
    Building without Cython.
    ERROR: b"'xslt-config' is not recognized as an internal or external command,rnoperable program or batch file.rn"
    ** make sure the development packages of libxml2 and libxslt are installed **

    Using build configuration of libxslt
    building 'lxml.etree' extension
    C:Python34libdistutilsdist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)
    error: Unable to find vcvarsall.bat
    Complete output from command C:Python34python.exe -c "import setuptools, tokenize;__file__='C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('rn', 'n'), __file__, 'exec'))" install --record C:UsersCARTE_~1AppDataLocalTemppip-l8vvrv9g-recordinstall-record.txt --single-version-externally-managed --compile:
    Building lxml version 3.4.2.

Building without Cython.

ERROR: b"'xslt-config' is not recognized as an internal or external command,rnoperable program or batch file.rn"

** make sure the development packages of libxml2 and libxslt are installed **



Using build configuration of libxslt

running install

running build

running build_py

creating build

creating buildlib.win32-3.4

creating buildlib.win32-3.4lxml

copying srclxmlbuilder.py -> buildlib.win32-3.4lxml

copying srclxmlcssselect.py -> buildlib.win32-3.4lxml

copying srclxmldoctestcompare.py -> buildlib.win32-3.4lxml

copying srclxmlElementInclude.py -> buildlib.win32-3.4lxml

copying srclxmlpyclasslookup.py -> buildlib.win32-3.4lxml

copying srclxmlsax.py -> buildlib.win32-3.4lxml

copying srclxmlusedoctest.py -> buildlib.win32-3.4lxml

copying srclxml_elementpath.py -> buildlib.win32-3.4lxml

copying srclxml__init__.py -> buildlib.win32-3.4lxml

creating buildlib.win32-3.4lxmlincludes

copying srclxmlincludes__init__.py -> buildlib.win32-3.4lxmlincludes

creating buildlib.win32-3.4lxmlhtml

copying srclxmlhtmlbuilder.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtmlclean.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtmldefs.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtmldiff.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtmlElementSoup.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtmlformfill.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtmlhtml5parser.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtmlsoupparser.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtmlusedoctest.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtml_diffcommand.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtml_html5builder.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtml_setmixin.py -> buildlib.win32-3.4lxmlhtml

copying srclxmlhtml__init__.py -> buildlib.win32-3.4lxmlhtml

creating buildlib.win32-3.4lxmlisoschematron

copying srclxmlisoschematron__init__.py -> buildlib.win32-3.4lxmlisoschematron

copying srclxmllxml.etree.h -> buildlib.win32-3.4lxml

copying srclxmllxml.etree_api.h -> buildlib.win32-3.4lxml

copying srclxmlincludesc14n.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesconfig.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesdtdvalid.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesetreepublic.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludeshtmlparser.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesrelaxng.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesschematron.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludestree.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesuri.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesxinclude.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesxmlerror.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesxmlparser.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesxmlschema.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesxpath.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesxslt.pxd -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludesetree_defs.h -> buildlib.win32-3.4lxmlincludes

copying srclxmlincludeslxml-version.h -> buildlib.win32-3.4lxmlincludes

creating buildlib.win32-3.4lxmlisoschematronresources

creating buildlib.win32-3.4lxmlisoschematronresourcesrng

copying srclxmlisoschematronresourcesrngiso-schematron.rng -> buildlib.win32-3.4lxmlisoschematronresourcesrng

creating buildlib.win32-3.4lxmlisoschematronresourcesxsl

copying srclxmlisoschematronresourcesxslRNG2Schtrn.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsl

copying srclxmlisoschematronresourcesxslXSD2Schtrn.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsl

creating buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1

copying srclxmlisoschematronresourcesxsliso-schematron-xslt1iso_abstract_expand.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1

copying srclxmlisoschematronresourcesxsliso-schematron-xslt1iso_dsdl_include.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1

copying srclxmlisoschematronresourcesxsliso-schematron-xslt1iso_schematron_message.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1

copying srclxmlisoschematronresourcesxsliso-schematron-xslt1iso_schematron_skeleton_for_xslt1.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1

copying srclxmlisoschematronresourcesxsliso-schematron-xslt1iso_svrl_for_xslt1.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1

copying srclxmlisoschematronresourcesxsliso-schematron-xslt1readme.txt -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1

running build_ext

building 'lxml.etree' extension

C:Python34libdistutilsdist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'

  warnings.warn(msg)

error: Unable to find vcvarsall.bat

----------------------------------------
Cleaning up...
Command C:Python34python.exe -c "import setuptools, tokenize;__file__='C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('rn', 'n'), __file__, 'exec'))" install --record C:UsersCARTE_~1AppDataLocalTemppip-l8vvrv9g-recordinstall-record.txt --single-version-externally-managed --compile failed with error code 1 in C:UsersCARTE_~1AppDataLocalTemppip_build_carte_000lxml
Storing debug log for failure in C:Userscarte_000pippip.log

So then I looked on this great and helpful thing called The Internet and a lot of people have the same error of needing libxml2 and libxlst. They recommend a guy called Christoph Gohlke’s page where he provides some sort of binary thingy for a bunch of packages. You can find it here (quicklink to the lxml part).

So after I gave up on trying to find libxml2 and libxslt for pip, I decided to go there, and found an absolute ton of downloads. I know I need a 64-bit one, but I have no idea which «cp» I need.

So an answer either giving me a solution on the pip method or the Gohlke index method would be great.

I’m trying to install lmxl on my Windows 8.1 laptop with Python 3.4 and failing miserably.

First off, I tried the simple and obvious solution: pip install lxml. However, this didn’t work. Here’s what it said:

Downloading/unpacking lxml
  Running setup.py (path:C:UsersCARTE_~1AppDataLocalTemppip_build_carte_000lxmlsetup.py) egg_info for package lxml
    Building lxml version 3.4.2.
    Building without Cython.
    ERROR: b"'xslt-config' is not recognized as an internal or external command,rnoperable program or batch file.rn"
    ** make sure the development packages of libxml2 and libxslt are installed **

    Using build configuration of libxslt
    C:Python34libdistutilsdist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)

    warning: no previously-included files found matching '*.py'
Installing collected packages: lxml
  Running setup.py install for lxml
    Building lxml version 3.4.2.
    Building without Cython.
    ERROR: b"'xslt-config' is not recognized as an internal or external command,rnoperable program or batch file.rn"
    ** make sure the development packages of libxml2 and libxslt are installed **

    Using build configuration of libxslt
    building 'lxml.etree' extension
    C:Python34libdistutilsdist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)
    error: Unable to find vcvarsall.bat
    Complete output from command C:Python34python.exe -c "import setuptools, tokenize;__file__='C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('rn', 'n'), __file__, 'exec'))" install --record C:UsersCARTE_~1AppDataLocalTemppip-l8vvrv9g-recordinstall-record.txt --single-version-externally-managed --compile:
    Building lxml version 3.4.2.
    Building without Cython.
    ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n"
    ** make sure the development packages of libxml2 and libxslt are installed **

    Using build configuration of libxslt
    running install
    running build
    running build_py
    creating build
    creating build\lib.win32-3.4
    creating build\lib.win32-3.4\lxml
    copying src\lxml\builder.py -> build\lib.win32-3.4\lxml
    copying src\lxml\cssselect.py -> build\lib.win32-3.4\lxml
    copying src\lxml\doctestcompare.py -> build\lib.win32-3.4\lxml
    copying src\lxml\ElementInclude.py -> build\lib.win32-3.4\lxml
    copying src\lxml\pyclasslookup.py -> build\lib.win32-3.4\lxml
    copying src\lxml\sax.py -> build\lib.win32-3.4\lxml
    copying src\lxml\usedoctest.py -> build\lib.win32-3.4\lxml
    copying src\lxml\_elementpath.py -> build\lib.win32-3.4\lxml
    copying src\lxml\__init__.py -> build\lib.win32-3.4\lxml
    creating build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\__init__.py -> build\lib.win32-3.4\lxml\includes
    creating build\lib.win32-3.4\lxml\html
    copying src\lxml\html\builder.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\clean.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\defs.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\diff.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\ElementSoup.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\formfill.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\html5parser.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\soupparser.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\usedoctest.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\_diffcommand.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\_html5builder.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\_setmixin.py -> build\lib.win32-3.4\lxml\html
    copying src\lxml\html\__init__.py -> build\lib.win32-3.4\lxml\html
    creating build\lib.win32-3.4\lxml\isoschematron
    copying src\lxml\isoschematron\__init__.py -> build\lib.win32-3.4\lxml\isoschematron
    copying src\lxml\lxml.etree.h -> build\lib.win32-3.4\lxml
    copying src\lxml\lxml.etree_api.h -> build\lib.win32-3.4\lxml
    copying src\lxml\includes\c14n.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\config.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\dtdvalid.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\etreepublic.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\htmlparser.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\relaxng.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\schematron.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\tree.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\uri.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\xinclude.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\xmlerror.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\xmlparser.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\xmlschema.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\xpath.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\xslt.pxd -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\etree_defs.h -> build\lib.win32-3.4\lxml\includes
    copying src\lxml\includes\lxml-version.h -> build\lib.win32-3.4\lxml\includes
    creating build\lib.win32-3.4\lxml\isoschematron\resources
    creating build\lib.win32-3.4\lxml\isoschematron\resources\rng
    copying src\lxml\isoschematron\resources\rng\iso-schematron.rng -> build\lib.win32-3.4\lxml\isoschematron\resources\rng
    creating build\lib.win32-3.4\lxml\isoschematron\resources\xsl
    copying src\lxml\isoschematron\resources\xsl\RNG2Schtrn.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl
    copying src\lxml\isoschematron\resources\xsl\XSD2Schtrn.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl
    creating build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_abstract_expand.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_dsdl_include.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_message.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_skeleton_for_xslt1.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_svrl_for_xslt1.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\readme.txt -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    running build_ext
    building 'lxml.etree' extension
    C:\Python34\lib\distutils\dist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
      warnings.warn(msg)
    error: Unable to find vcvarsall.bat

----------------------------------------
Cleaning up...
Command C:\Python34\python.exe -c "import setuptools, tokenize;__file__='C:\\Users\\CARTE_~1\\AppData\\Local\\Temp\\pip_build_carte_000\\lxml\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record C:\Users\CARTE_~1\AppData\Local\Temp\pip-l8vvrv9g-record\install-record.txt --single-version-externally-managed --compile failed with error code 1 in C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml
Storing debug log for failure in C:\Users\carte_000\pip\pip.log

So then I looked on this great and helpful thing called The Internet, and a lot of people have the same error about needing libxml2 and libxslt. They recommend a guy called Christoph Gohlke’s page, where he provides some sort of binary thingy for a bunch of packages. You can find it here (quicklink to the lxml part).

So after I gave up on trying to find libxml2 and libxslt for pip, I decided to go there, and found an absolute ton of downloads. I know I need a 64-bit one, but I have no idea which “cp” I need.

So an answer either giving me a solution on the pip method or the Gohlke index method would be great.

Hello geeks, I hope you are all doing great. Python’s large collection of libraries is one of the language’s biggest strengths, because these libraries extend it to many more domains. In this article, we briefly introduce one such library, lxml, and walk through its installation.

lxml Module

This open-source library makes it easy to process XML and HTML in Python. It is a Pythonic binding for the C libraries libxml2 and libxslt, combining their speed and feature completeness with the simplicity of a native Python API. It is mostly compatible with the ElementTree API but extends it considerably. We will not go deeper into the module here; this article focuses only on its installation.

Requirements

Before heading to the installation procedure, check the requirements for the lxml library.

  • Python 2.7, or Python 3.4 or later, must be installed on the system.
  • If you are not using a static binary distribution (e.g. from a Windows binary installer), you need supporting libraries installed in the system. They are as follows:
    • libxml2 version 2.9.2 or later.
    • libxslt version 1.1.27 or later.

Installing lxml on Linux

To build lxml on a Linux system, first install the development packages it depends on with the distribution’s package manager, for example apt-get:

sudo apt-get install libxml2-dev libxslt-dev python-dev

Alternatively, the following command installs the same build dependencies without naming each of them.

sudo apt-get build-dep python3-lxml

Installing lxml Using PIP

Windows/Linux

If you are a pip user, you can use the following command.

pip install lxml

Inside an active virtual environment, this command installs the library into that environment only. To install it system-wide instead, you can use the following command.

sudo pip install lxml

Note: The sudo form works only on Linux systems.

Alternatively, you can pin a specific version in the installation command.

pip install lxml==3.4.2

To speed up the build in test environments, e.g., on a continuous integration server, disable the C compiler optimizations by setting the CFLAGS environment variable:

CFLAGS="-O0"  pip install lxml

After the installation, we can check the installed version from the Python prompt.

>>> import lxml
>>> lxml.__version__
'4.7.1'
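
Beyond the package version, it can help to see which libxml2 and libxslt builds lxml is actually using, since the requirements name minimum versions for both. A minimal sketch, assuming lxml is already installed; the helper name report_versions is made up for this illustration, while the version constants are attributes of lxml.etree:

```python
from lxml import etree


def report_versions():
    """Collect the version tuples that lxml.etree exposes."""
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2": etree.LIBXML_VERSION,  # library loaded at runtime
        "libxml2 (compiled)": etree.LIBXML_COMPILED_VERSION,
        "libxslt": etree.LIBXSLT_VERSION,
    }


versions = report_versions()
for name, version in versions.items():
    # Each value is a tuple of integers, so join it into dotted form
    print(name, ".".join(str(part) for part in version))
```

If the libxml2 line shows something older than 2.9.2, the installed build does not meet the requirement quoted earlier.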

Install lxml in Debian based System

On Debian-based systems, the following command installs the packages needed to build lxml.

sudo apt-get build-dep python3-lxml

Installing Python lxml in MacOS

On macOS, you can use the following command to install the package. With STATIC_DEPS=true, the build downloads the required libraries and links them statically, so we need not take care of them separately.

STATIC_DEPS=true sudo pip install lxml

Install python lxml in CentOS

To install lxml on CentOS, we first need to install its dependencies. To do that, we will use the following command.

sudo yum install libxml2 libxml2-devel libxml2-python libxslt libxslt-devel

Then install lxml itself with pip install lxml (or easy_install lxml).

Installing lxml Using Conda

If you are an Anaconda user, you can install it using the following command.

conda install -c anaconda lxml

Installing lxml in Pycharm

To install lxml in PyCharm, follow these steps:

  • Open File > Settings > Project from the PyCharm menu.
  • Select your current project.
  • Click the Python Interpreter tab within your project tab.
  • Click the "+" symbol to add a new library to the project.
  • Type the name of the library to be installed, in this case lxml, and click Install Package.
  • Wait for the installation to finish and close all pop-ups.

Installing lxml in Jupyter Notebook

To install lxml in jupyter notebook, you can run the following command in the Jupyter notebook code cell.

!pip install lxml

Using lxml with python-libxml2

If you want to use lxml together with the official libxml2 Python bindings (python-libxml2), lxml must be built statically, which the following command does. Otherwise, the two packages will interfere in places where the libxml2 library requires global configuration, which may crash the program.

STATIC_DEPS=true pip install lxml

Use Binary wheel files to install lxml

Besides the commands above, there is another option: download a prebuilt binary wheel file for lxml and install that file with pip. The wheels can be downloaded from the following website.

Unofficial Windows binaries, Click here.

Once the download is done, install the file using the following command.

pip install lxml-4.6.5-cp39-cp39-win_amd64.whl
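
The cp part of a wheel filename names the CPython version the wheel was built for (cp39 is CPython 3.9, cp34 is CPython 3.4), and the last part names the platform (win32 for 32-bit, win_amd64 for 64-bit). What counts is the bitness of the Python interpreter, not of Windows: the lib.win32-3.4 directory names in the pip log quoted earlier on this page suggest a 32-bit Python 3.4, which would call for a cp34 win32 wheel. A small sketch for checking your own interpreter, assuming CPython; the function name wheel_tag_hints is made up for this illustration:

```python
import struct
import sys


def wheel_tag_hints():
    """Return the CPython tag and pointer width of the running interpreter."""
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    bits = struct.calcsize("P") * 8  # size of a C pointer: 32 or 64
    return python_tag, bits


python_tag, bits = wheel_tag_hints()
print("Look for a wheel tagged", python_tag, "built for", bits, "bit Python")
```

Run it with the same interpreter that pip belongs to, since several Python installations can coexist on one machine.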

Installing lxml in RedHat

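To install lxml on Red Hat, we first prepare the build tools, pip, and virtualenv with the following series of commands; lxml itself can then be installed with pip install lxml.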

sudo yum install make automake gcc gcc-c++ kernel-devel git-core -y 

sudo yum install python-devel -y 

sudo curl -o /tmp/ez_setup.py https://sources.rhodecode.com/setuptools/raw/bootstrap/ez_setup.py 

sudo /usr/bin/python /tmp/ez_setup.py 

sudo /usr/bin/easy_install pip 

sudo rm setuptools-*.tar.gz 

sudo pip install -i https://pypi.rhodecode.com/ --upgrade pip 

sudo pip install virtualenv 

FAQs on Python Install lxml

Does lxml come with Python?

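No. lxml is a third-party package and has to be installed separately.

Is lxml faster than BeautifulSoup?

Generally, yes. lxml is a binding to the C libraries libxml2 and libxslt, whereas BeautifulSoup is a pure-Python layer on top of a parser, so lxml is usually considerably faster.

Do you need to install a parser library lxml to use BeautifulSoup?

Not necessarily. BeautifulSoup can fall back to Python’s built-in html.parser. lxml has to be installed only if you want BeautifulSoup to use it as its parser, which is a common choice because of its speed.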

Conclusion

So, today in this article, we have seen how we can install the lxml library on different platforms. We have taken examples of different environments where we can install the library. I hope this article has helped you. Thank You.

In this lxml Python tutorial, we will explore the lxml library. We will go through the basics of creating XML documents and then jump onto processing XML and HTML documents. Finally, we will put together all the pieces and see how to extract data using lxml. Each step of this tutorial is complete with practical Python lxml examples.

Prerequisite

This tutorial is aimed at developers who have at least a basic understanding of Python. A basic understanding of XML and HTML is also required. Simply put, if you know what an attribute is in XML, that is enough to understand this article. 

This tutorial uses Python 3 code snippets but everything works on Python 2  with minimal changes as well.

What is lxml in Python?

lxml is one of the fastest and feature-rich libraries for processing XML and HTML in Python. This library is essentially a wrapper over C libraries libxml2 and libxslt. This combines the speed of the native C library and the simplicity of Python.

Using the Python lxml library, XML and HTML documents can be created, parsed, and queried. It is a dependency of many other packages, such as Scrapy.

Installation

The lxml library is published on the Python Package Index (PyPI). If you are on Linux (Debian-based), you can instead install the distribution’s own package by simply running:

sudo apt-get install python3-lxml

Another way is to use the pip package manager. This works on Windows, Mac, and Linux:

pip3 install lxml

On Windows, just use pip install lxml, assuming you are running Python 3.

Creating a simple XML document

Any XML or any XML compliant HTML can be visualized as a tree. A tree has a root and branches. Each branch optionally may have further branches. All these branches and the root are represented as an Element.

A very simple XML document would look like this:

<root>
    <branch>
        <branch_one>
        </branch_one>
        <branch_one>
        </branch_one>
    </branch>
</root>

If an HTML is XML compliant, it will follow the same concept. 

Note that HTML may or may not be XML compliant. For example, if an HTML has <br> without a corresponding closing tag, it is still valid HTML, but it will not be a valid XML. In the later part of this tutorial, we will see how these cases can be handled.  For now, let’s focus on XML compliant HTML.

The Element class

To create an XML document using python lxml, the first step is to import the etree module of lxml:

>>> from lxml import etree

Every XML document begins with the root element. This can be created using the Element type. The Element type is a flexible container object which can store hierarchical data. This can be described as a cross between a dictionary and a list.

In this python lxml example, the objective is to create an HTML document that is XML compliant. It means that the root element will have its name as html:

>>> root = etree.Element("html")

Similarly, every html will have a head and a body:

>>> head = etree.Element("head")
>>> body = etree.Element("body")

To create parent-child relationships, we can simply use the append() method.

>>> root.append(head)
>>> root.append(body)

This document can be serialized and printed to the terminal with the help of tostring() function. This function expects one mandatory argument, which is the root of the document. We can optionally set pretty_print to True to make the output more readable. Note that tostring() serializer actually returns bytes. This can be converted to string by calling decode():

>>> print(etree.tostring(root, pretty_print=True).decode())
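
Put together as a script, the steps above can be sketched as follows. This is a minimal illustration that only uses the calls already introduced in this section; the variable name document is chosen for the example:

```python
from lxml import etree

root = etree.Element("html")
head = etree.Element("head")
body = etree.Element("body")

# append() attaches an existing element as the last child of its parent
root.append(head)
root.append(body)

# tostring() returns bytes; decode() turns them into a str for printing
document = etree.tostring(root, pretty_print=True).decode()
print(document)
```

Because head and body are still empty, the serializer writes them as self-closing tags.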

The SubElement class

Creating an Element object and then calling append() for every node quickly makes the code verbose and hard to read. A more convenient way is the SubElement factory function. It takes at least two arguments, the parent node and the element name, and attaches the new element to the parent. Using SubElement, the following two lines of code can be replaced by just one.

body = etree.Element("body")
root.append(body)
# is same as 
body = etree.SubElement(root,"body")

Setting text and attributes

Setting text is very easy with the lxml library. Every element, whether created with Element or SubElement, exposes a text attribute and a set() method. The former is used to specify the text and the latter is used to set attributes. Here are the examples:

para = etree.SubElement(body, "p")
para.text="Hello World!"

Similarly, attributes can be set using key-value convention:

para.set("style", "font-size:20pt")

One thing to note here is that the attribute can be passed in the constructor of SubElement:

para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara")
para.text = "Hello World!"
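
Reading attributes back works along the same lines. The following sketch, with names chosen only for this example, combines the constructor keyword, set(), get(), and the dictionary-like attrib property:

```python
from lxml import etree

body = etree.Element("body")
para = etree.SubElement(body, "p", id="firstPara")
para.text = "Hello World!"

# set() adds a new attribute or overwrites an existing one
para.set("style", "font-size:20pt")

# get() reads one attribute; attrib behaves like a dictionary of all of them
paragraph_id = para.get("id")
attributes = dict(para.attrib)
print(paragraph_id, attributes)
```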

The benefit of this approach is saving lines of code and clarity. Here is the complete code. Save it in a python file and run it. It will print an HTML which is also a well-formed XML.

from lxml import etree
 
root = etree.Element("html")
head = etree.SubElement(root, "head")
title = etree.SubElement(head, "title")
title.text = "This is Page Title"
body = etree.SubElement(root, "body")
heading = etree.SubElement(body, "h1", style="font-size:20pt", id="head")
heading.text = "Hello World!"
para = etree.SubElement(body, "p",  id="firstPara")
para.text = "This HTML is XML Compliant!"
para = etree.SubElement(body, "p",  id="secondPara")
para.text = "This is the second paragraph."
 
etree.dump(root)  # prints everything to console. Use for debug only

Note that here we used etree.dump() instead of calling etree.tostring(). The difference is that dump() simply writes everything to the console and doesn’t return anything, whereas tostring() is used for serialization and returns bytes, which you can store in a variable or write to a file. dump() is good for debugging only and should not be used for any other purpose.

Add the following lines at the bottom of the snippet and run it again:

with open('input.html', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))
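
As an alternative sketch, the root can be wrapped in an ElementTree, whose write() method serializes straight to a file. The temporary directory below is an assumption made so that the example does not touch real files:

```python
import os
import tempfile

from lxml import etree

root = etree.Element("html")
etree.SubElement(root, "body").text = "Hello"

# A temporary directory keeps the example from touching real files
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "input.html")
    etree.ElementTree(root).write(path, pretty_print=True)

    # Read the file back to confirm that it round-trips
    reloaded = etree.parse(path).getroot()
    reloaded_text = reloaded.findtext("body")

print(reloaded.tag, reloaded_text)
```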

This will save the contents to input.html in the folder from which you ran the script. Again, this is well-formed XML, which can be interpreted as XML or HTML.

How do you parse an XML file using LXML in Python?

The previous section was a Python lxml tutorial on creating XML files. In this section, we will look at traversing and manipulating an existing XML document using the lxml library.

Before we move on, save the following snippet as input.html.

<html>
  <head>
    <title>This is Page Title</title>
  </head>
  <body>
    <h1 style="font-size:20pt" id="head">Hello World!</h1>
    <p id="firstPara">This HTML is XML Compliant!</p>
    <p id="secondPara">This is the second paragraph.</p>
  </body>
</html>

When an XML document is parsed, the result is an in-memory ElementTree object.

The raw XML contents can be in a file system or a string. If it is in a file system, it can be loaded using the parse method. Note that the parse method will return an object of type ElementTree. To get the root element, simply call the getroot() method.

from lxml import etree
 
tree = etree.parse('input.html')
elem = tree.getroot()
etree.dump(elem) #prints file contents to console

The lxml.etree module exposes another function, fromstring(), that can be used to parse contents from a valid XML string:

xml = '<html><body>Hello</body></html>'
root = etree.fromstring(xml)
etree.dump(root)

One important difference to note here is that the fromstring() function returns an Element object, the root of the parsed document. There is no need to call getroot().
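
The difference between the two return types can be checked directly. A short sketch, using an in-memory io.BytesIO object in place of a file on disk (an assumption made to keep the example self-contained):

```python
import io

from lxml import etree

source = b"<html><body>Hello</body></html>"

# parse() reads from a file name or file-like object and returns a tree
tree = etree.parse(io.BytesIO(source))
root_from_tree = tree.getroot()

# fromstring() takes the markup itself and returns the root element
root_from_string = etree.fromstring(source)

print(type(tree).__name__, type(root_from_string).__name__)
```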

If you want to dig deeper into parsing, we have already written a tutorial on BeautifulSoup, a Python package used for parsing HTML and XML documents. But to quickly answer what is lxml in BeautifulSoup, lxml can use BeautifulSoup as a parser backend. Similarly, BeautifulSoup can employ lxml as a parser. 

Finding elements in XML

Broadly, there are two ways of finding elements using the Python lxml library. The first is the find() and findall() methods, which accept ElementPath expressions, a simplified path syntax modeled on XPath. For example, the following code will return the first paragraph element.

Note that the selector looks very similar to XPath. Also note that the root element name is not part of the path, because elem already is the root of the XML tree.

tree = etree.parse('input.html')
elem = tree.getroot()
para = elem.find('body/p')
etree.dump(para)
 
# Output 
# <p id="firstPara">This HTML is XML Compliant!</p>

Similarly, findall() will return a list of all the elements matching the selector.

elem = tree.getroot()
para = elem.findall('body/p')
for e in para:
    etree.dump(e)
 
# Outputs
# <p id="firstPara">This HTML is XML Compliant!</p>
# <p id="secondPara">This is the second paragraph.</p>
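
It is worth knowing what these methods give back when nothing matches. The following sketch uses an inline document shaped like input.html (shortened for the example) and also shows findtext(), which returns the text of the first match:

```python
from lxml import etree

elem = etree.fromstring(
    "<html><head><title>This is Page Title</title></head>"
    "<body><p id='firstPara'>One</p><p id='secondPara'>Two</p></body></html>"
)

first = elem.find("body/p")          # first match, or None
missing = elem.find("body/div")      # there is no <div>, so this is None
all_paras = elem.findall("body/p")   # always a list, possibly empty
title = elem.findtext("head/title")  # text of the first match

print(first.get("id"), missing, len(all_paras), title)
```

Checking the result of find() for None before using it avoids an AttributeError on documents that lack the expected element.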

The second way of selecting the elements is by using XPath directly. This approach is easier to follow by developers who are familiar with XPath. Furthermore, XPath can be used to return the instance of the element, the text, or the value of any attribute using standard XPath syntax.

para = elem.xpath('//p/text()')
for e in para:
    print(e)
 
# Output
# This HTML is XML Compliant!
# This is the second paragraph.
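
The type of the result depends on the expression. A brief sketch of the common cases, using an inline copy of the two paragraphs (the variable names are chosen for this example):

```python
from lxml import etree

elem = etree.fromstring(
    "<html><body>"
    "<p id='firstPara'>This HTML is XML Compliant!</p>"
    "<p id='secondPara'>This is the second paragraph.</p>"
    "</body></html>"
)

elements = elem.xpath("//p")                         # list of elements
ids = elem.xpath("//p/@id")                          # list of attribute values
second = elem.xpath("//p[@id='secondPara']/text()")  # list of text nodes
count = elem.xpath("count(//p)")                     # XPath numbers are floats

print(len(elements), ids, second, count)
```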

Handling HTML with lxml.html

Throughout this article, we have been working with well-formed HTML that is XML compliant. A lot of the time, this will not be the case. For these scenarios, you can simply use lxml.html instead of lxml.etree.

The fromstring() function used here expects the markup itself rather than a file name, so the file contents are read into a string first (lxml.html also offers parse() for reading from a file directly). Here is the code to print all paragraphs from the same HTML file.

from lxml import html
with open('input.html') as f:
    html_string = f.read()
tree = html.fromstring(html_string)
para = tree.xpath('//p/text()')
for e in para:
    print(e)
 
# Output
# This HTML is XML Compliant!
# This is the second paragraph.
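
To see why lxml.html matters, compare how the two parsers treat markup that is valid HTML but not well-formed XML. In this sketch, the sample string is made up for the illustration:

```python
from lxml import etree, html

# Not well-formed XML: <br> and the first <p> are never closed
broken = "<div><p>First<br>Second<p>Third</div>"

try:
    etree.fromstring(broken)
    xml_parse_failed = False
except etree.XMLSyntaxError:
    xml_parse_failed = True

# The HTML parser closes the open tags itself
tree = html.fromstring(broken)
paragraphs = [p.text_content() for p in tree.xpath("//p")]

print(xml_parse_failed, paragraphs)
```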

lxml web scraping tutorial

Now that we know how to parse and find elements in XML and HTML, the only missing piece is getting the HTML of a web page.

For this, the ‘requests’ library is a great choice. It can be installed using the pip package manager:

pip install requests

Once the requests library is installed, HTML of any web page can be retrieved using a simple get() method. Here is an example.

import requests
 
response = requests.get('http://books.toscrape.com/')
print(response.text)
# prints source HTML

This can be combined with lxml to retrieve any data that is required.

Here is a quick example that prints a list of countries from Wikipedia:

import requests
from lxml import html
 
response = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010')
 
tree = html.fromstring(response.text)
countries = tree.xpath('//span[@class="flagicon"]')
for country in countries:
    print(country.xpath('./following-sibling::a/text()')[0])

In this code, the HTML returned by response.text is parsed into the variable tree. This can be queried using standard XPath syntax. The XPaths can be concatenated. Note that the xpath() method returns a list and thus only the first item is taken in this code snippet.

This can easily be extended to read any attribute from the HTML. For example, the following modified code prints the country name and image URL of the flag.

for country in countries:
    flag = country.xpath('./img/@src')[0]
    # use a separate name so the loop variable is not overwritten
    name = country.xpath('./following-sibling::a/text()')[0]
    print(name, flag)
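
Indexing with [0] raises an IndexError as soon as one entry lacks a link or an image, which happens easily on real pages. A more defensive sketch follows; the markup is a made-up stand-in for a downloaded page, and the function name extract_countries is hypothetical:

```python
from lxml import html

# Hypothetical markup standing in for a downloaded page
page = (
    "<ul>"
    '<li><span class="flagicon"><img src="/flags/a.png"></span>'
    ' <a href="/a">Country A</a></li>'
    '<li><span class="flagicon"><img src="/flags/b.png"></span>'
    " no link here</li>"
    "</ul>"
)


def extract_countries(markup):
    """Return (name, flag_url) pairs, skipping entries with missing parts."""
    tree = html.fromstring(markup)
    results = []
    for icon in tree.xpath('//span[@class="flagicon"]'):
        names = icon.xpath("./following-sibling::a/text()")
        flags = icon.xpath("./img/@src")
        if names and flags:  # xpath() returns [] when nothing matches
            results.append((names[0], flags[0]))
    return results


countries = extract_countries(page)
print(countries)
```

Keeping the parsing in a function that takes a string also makes it possible to test the XPath expressions without a network connection.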

Conclusion

In this Python lxml tutorial, various aspects of XML and HTML handling using the lxml library have been introduced. Python lxml library is a light-weight, fast, and feature-rich library. This can be used to create XML documents, read existing documents, and find specific elements. This makes this library equally powerful for both XML and HTML documents. Combined with requests library, it can also be easily used for web scraping.

You can read up and learn more on web scraping using Selenium or other useful libraries like Beautiful Soup in our blog.

lxml is a Python library that allows for easy handling of XML and HTML files and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML processing code. This is where the lxml library comes into play. Its key benefits are that it is easy to use, extremely fast when parsing large documents, very well documented, and provides easy conversion of data to Python data types, resulting in easier file manipulation.

In this tutorial, we will deep dive into Python’s lxml library, starting with how to set it up for different operating systems, and then discussing its benefits and the wide range of functionalities it offers.

Installation

There are multiple ways to install lxml on your system. We’ll explore some of them below.

Using Pip

Pip is the Python package manager, used to download and install Python libraries on your local system. It also downloads and installs the dependencies of the package you’re installing.

If you have pip installed on your system, simply run the following command in terminal or command prompt:

$ pip install lxml

Using apt-get

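If you’re using a Debian-based Linux distribution such as Ubuntu, you can install lxml by running this command in your terminal (apt-get is not available on macOS, and for Python 3 the package is named python3-lxml):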

$ sudo apt-get install python-lxml

Using easy_install

You probably won’t get to this part, but if none of the above commands works for you for some reason, try using easy_install:

$ easy_install lxml

Note: If you wish to install a particular version of lxml, state it in the install command, for example pip install lxml==3.x.y, replacing 3.x.y with the version you need.

By now, you should have a copy of the lxml library installed on your local machine. Let’s now get our hands dirty and see what cool things can be done using this library.

Functionality

To be able to use the lxml library in your program, you first need to import it. You can do that by using the following command:

from lxml import etree as et

This will import the etree module, the module of our interest, from the lxml library.

Creating HTML/XML Documents

Using the etree module, we can create XML/HTML elements and their subelements, which is a very useful thing if we’re trying to write or manipulate an HTML or XML file. Let’s try to create the basic structure of an HTML file using etree:

root = et.Element('html', version="5.0")

# Pass the parent node, name of the child node,
# and any number of optional attributes
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize='22')
et.SubElement(root, 'body', fontsize="15")

In the code above, note that the Element function requires at least one argument, whereas the SubElement function requires at least two. Element only needs the name of the element to be created, whereas SubElement needs the parent element (the object itself, not its name) followed by the name of the child element to be created.

It’s also important to know that both these functions only have a lower bound to the number of arguments they can accept, but no upper bound because you can associate as many attributes with them as you want. To add an attribute to an element, simply add an additional parameter to the (Sub)Element function and specify your attribute in the form of attributeName='attribute value'.

Let’s try to run the code we wrote above to gain a better intuition regarding these functions:

# Use pretty_print=True to indent the HTML output
print (et.tostring(root, pretty_print=True).decode("utf-8"))

Output:

<html version="5.0">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>
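
SubElement also returns the element it creates, so the return value can serve as the parent of the next level. A short sketch that nests title under head, as real HTML would; the text Demo is a placeholder:

```python
from lxml import etree as et

root = et.Element("html")

# SubElement returns the new child, so it can serve as the next parent
head = et.SubElement(root, "head")
title = et.SubElement(head, "title")
title.text = "Demo"
et.SubElement(root, "body")

nested = et.tostring(root, pretty_print=True).decode("utf-8")
print(nested)
```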

There’s another way to create and organize your elements in a hierarchical manner. Let’s explore that as well:

root = et.Element('html')
# SubElement needs a parent, so a standalone node is created with Element
root.append(et.Element('head'))
root.append(et.Element('body'))

So in this case whenever we create a new element, we simply append it to the root/parent node.

Parsing HTML/XML Documents

Until now, we have only considered creating new elements, assigning attributes to them, etc. Let’s now see an example where we already have an HTML or XML file, and we wish to parse it to extract certain information. Assuming that we have the HTML file that we created in the first example, let’s try to get the tag name of one specific element, followed by printing the tag names of all the elements.

print(root.tag)

Output:

html 

Now to iterate through all the child elements in the root node and print their tags:

for e in root:
    print(e.tag)

Output:

head
title
body

Working with Attributes

Let’s now see how we associate attributes to existing elements, as well as how to retrieve the value of a particular attribute for a given element.

Using the same root element as before, try out the following code:

root.set('newAttribute', 'attributeValue') 

# Print root again to see if the new attribute has been added
print(et.tostring(root, pretty_print=True).decode("utf-8"))

Output:

<html version="5.0" newAttribute="attributeValue">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>

Here we can see that the newAttribute="attributeValue" has indeed been added to the root element.

Let’s now try to get the values of the attributes we have set in the above code. Here we access a child element using array indexing on the root element, and then use the get() method to retrieve the attribute:

print(root.get('newAttribute'))
print(root[1].get('alpha')) # root[1] accesses the `title` element
print(root[1].get('bgcolor'))

Output:

attributeValue
None
red

Retrieving Text from Elements

Now that we have seen basic functionalities of the etree module, let’s try to do some more interesting things with our HTML and XML files. Almost always, these files have some text in between the tags. So, let’s see how we can add text to our elements:

# Copying the code from the very first example
root = et.Element('html', version="5.0")
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize="22")
et.SubElement(root, 'body', fontsize="15")

# Add text to the Elements and SubElements
root.text = "This is an HTML file"
root[0].text = "This is the head of that file"
root[1].text = "This is the title of that file"
root[2].text = "This is the body of that file and would contain paragraphs etc"

print(et.tostring(root, pretty_print=True).decode("utf-8"))

Output:

<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>
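
Reading text back has one subtlety: the text attribute only holds the text before an element's first child. Text that follows a child is stored on that child's tail attribute. A minimal sketch with a made-up paragraph:

```python
from lxml import etree as et

para = et.fromstring("<p>Hello<br/>World</p>")
line_break = para[0]

leading = para.text          # text before the first child
trailing = line_break.tail   # text right after the <br/> element
combined = "".join(para.itertext())

print(leading, trailing, combined)
```

When all the text inside an element is wanted regardless of nesting, itertext() is usually the simpler choice.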

Check if an Element has Children

Next, there are two very important things that we should be able to check, as that is required in a lot of web scraping applications for exception handling. First thing we’d like to check is whether or not an element has children, and second is whether or not a node is an Element.

Let’s do that for the nodes we created above:

if len(root) > 0:
    print("True")
else:
    print("False")

The above code will output "True" since the root node does have child nodes. However, if we check the same thing for the root’s child nodes, like in the code below, the output will be "False".

for i in range(len(root)):
    if (len(root[i]) > 0):
        print("True")
    else:
        print("False")

Output:

False
False
False

Now let’s do the same thing to see if each of the nodes is an Element or not:

for i in range(len(root)):
    print(et.iselement(root[i]))

Output:

True
True
True

The iselement function is helpful for determining whether you have a valid Element object, and thus whether you can continue traversing it using the methods we’ve shown here.
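
The two checks can be combined into one guard. This is only a sketch: has_children is a hypothetical helper name, not part of lxml:

```python
from lxml import etree as et

root = et.Element('html')
head = et.SubElement(root, 'head')

def has_children(node):
    """Hypothetical helper: True only if node is an Element with at least one child."""
    return et.iselement(node) and len(node) > 0

print(has_children(root))    # True
print(has_children(head))    # False
print(has_children("html"))  # False: a plain string is not an Element
print(has_children(None))    # False
```

Checking iselement first means the helper can be called on values such as None without raising an error.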

Check if an Element has a Parent

So far we have gone down the hierarchy by checking whether an element has children. In this section we go up the hierarchy and get the parent of a child node with getparent:

print(root.getparent())
print(root[0].getparent())
print(root[1].getparent())

The first line should print None, as the root node itself doesn’t have a parent. The other two should both point to the root element, i.e. the html tag. Let’s check the output to see if it is what we expect:

Output:

None
<Element html at 0x1103c9688>
<Element html at 0x1103c9688>
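
To climb more than one level, lxml also offers iterancestors(). A short sketch with an illustrative, deeper tree:

```python
from lxml import etree as et

root = et.Element('html')
body = et.SubElement(root, 'body')
div = et.SubElement(body, 'div')
para = et.SubElement(div, 'p')

print(para.getparent().tag)       # div
print(root.getparent() is None)   # True

# iterancestors() yields the parent first, then each ancestor up to the root.
ancestors = [el.tag for el in para.iterancestors()]
print(ancestors)                  # ['div', 'body', 'html']
```

Comparing the result of getparent() with None, as in the second print, is the usual way to detect the root.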

Retrieving Element Siblings

In this section we will learn how to traverse the hierarchy sideways, that is, how to retrieve an element’s siblings in the tree.

Traversing the tree sideways is quite similar to navigating it vertically. To move vertically, we used getparent and the length of the element. To move sideways, we’ll use the getnext and getprevious functions. Let’s try them on the nodes that we previously created to see how they work:

# root[1] is the `title` tag
print(root[1].getnext()) # The tag after the `title` tag
print(root[1].getprevious()) # The tag before the `title` tag

Output:

<Element body at 0x10b5a75c8>
<Element head at 0x10b5a76c8>

Here you can see that root[1].getnext() retrieved the "body" tag since it was the next element, and root[1].getprevious() retrieved the "head" tag.

Similarly, if we had used the getprevious function on root, it would have returned None, and if we had used the getnext function on root[2], it would also have returned None.
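
Because getnext and getprevious return None at either end, it helps to guard before using the result. The sketch below also uses itersiblings(), which returns all siblings in one direction; the tree is a simplified copy of the one above:

```python
from lxml import etree as et

root = et.Element('html')
for tag in ('head', 'title', 'body'):
    et.SubElement(root, tag)

title = root[1]

# itersiblings() yields following siblings by default ...
following = [el.tag for el in title.itersiblings()]
# ... and preceding siblings when preceding=True.
preceding = [el.tag for el in title.itersiblings(preceding=True)]
print(following)   # ['body']
print(preceding)   # ['head']

# getnext() returns None for the last child, so check before using it.
last = root[-1]
nxt = last.getnext()
print(nxt.tag if nxt is not None else "no next sibling")   # no next sibling
```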

Parsing XML from a String

Moving on, if we have XML or HTML markup as a raw string and wish to parse it in order to obtain or manipulate the required information, we can do so with the XML function, as in the example below:

root = et.XML('<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>')
root[1].text = "The title text has changed!"
print(et.tostring(root, xml_declaration=True).decode('utf-8'))

Output:

<?xml version='1.0' encoding='ASCII'?>
<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">The title text has changed!</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>

As you can see, we successfully changed some text in the HTML document. The XML declaration (the first line of the output) was also added automatically because of the xml_declaration parameter that we passed to the tostring function.
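
Strings that are not well-formed XML make the parser raise etree.XMLSyntaxError. A minimal sketch of catching it, where parse_or_none is a hypothetical helper and the sample strings are made up:

```python
from lxml import etree as et

good = '<root><item>1</item></root>'
bad = '<root><item>1</root>'   # <item> is never closed

def parse_or_none(xml_string):
    """Hypothetical helper: return the root Element, or None if parsing fails."""
    try:
        return et.fromstring(xml_string)
    except et.XMLSyntaxError as exc:
        print("Could not parse:", exc)
        return None

print(parse_or_none(good).tag)   # root
print(parse_or_none(bad))        # None (printed after the error message)
```

Here et.fromstring is used in place of et.XML; both parse a string and return the root element.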

Searching for Elements

The last thing we’re going to discuss is quite handy when parsing XML and HTML files. We will look at ways to check whether an Element has any children of a particular type and, if it does, what they contain.

This has many practical use-cases, such as finding all of the link elements on a particular web page.

print(root.find('a')) # No <a> tags exist, so this will be `None`
print(root.find('head').tag)
print(root.findtext('title')) # Directly retrieve the title tag's text

Output:

None
head
The title text has changed!
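
Note that find only looks at direct children unless it is given a path. For the link-finding use case mentioned above, a sketch with a made-up page might use findall with './/a', iter, or xpath:

```python
from lxml import etree as et

# A made-up page used only for illustration.
page = et.XML(
    '<html><body>'
    '<a href="/home">Home</a>'
    '<div><a href="/about">About</a></div>'
    '<a>No target</a>'
    '</body></html>'
)

# 'a' matches direct children only; './/a' matches <a> at any depth.
print(page.find('a'))   # None: <a> is not a direct child of <html>
links = [a.get('href') for a in page.findall('.//a')]
print(links)            # ['/home', '/about', None]

# iter() walks the whole subtree in document order.
texts = [a.text for a in page.iter('a')]
print(texts)            # ['Home', 'About', 'No target']

# xpath() can filter elements and extract attribute values in one expression.
hrefs = page.xpath('//a[@href]/@href')
print(hrefs)            # ['/home', '/about']
```

The xpath call skips the link without an href, while findall returns every a element and leaves the filtering to the caller.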

Conclusion

In the above tutorial, we started with a basic introduction to what the lxml library is and what it is used for. After that, we learned how to install it on different environments like Windows, Linux, etc. Moving on, we explored different functionalities that help us traverse the HTML/XML tree vertically as well as sideways. In the end, we also discussed ways to find elements in our tree, as well as to obtain information from them.
