Project description
lxml is a Pythonic, mature binding for the libxml2 and libxslt libraries. It
provides safe and convenient access to these libraries using the ElementTree
API.
It extends the ElementTree API significantly to offer support for XPath,
RelaxNG, XML Schema, XSLT, C14N and much more.
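As a small illustration of the XPath support mentioned above, here is a minimal sketch. The `catalog` document and its `price` attribute are made up for this example and are not part of lxml.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
doc = etree.XML(
    '<catalog>'
    '<book price="10"><title>First</title></book>'
    '<book price="25"><title>Second</title></book>'
    '</catalog>'
)

# Select the title text of every book whose price attribute is above 20.
expensive_titles = doc.xpath('//book[@price > 20]/title/text()')
print(expensive_titles)
```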
To contact the project, go to the project home page or see our bug tracker at
https://launchpad.net/lxml
In case you want to use the current in-development version of lxml,
you can get it from the GitHub repository at
https://github.com/lxml/lxml . Note that this requires Cython to
build the sources, see the build instructions on the project home
page. To the same end, running easy_install lxml==dev will
install lxml from
https://github.com/lxml/lxml/tarball/master#egg=lxml-dev if you have
an appropriate version of Cython installed.
After an official release of a new stable series, bug fixes may become
available at
https://github.com/lxml/lxml/tree/lxml-4.9 .
Running easy_install lxml==4.9bugfix will install
the unreleased branch state from
https://github.com/lxml/lxml/tarball/lxml-4.9#egg=lxml-4.9bugfix
as soon as a maintenance branch has been established. Note that this
requires Cython to be installed at an appropriate version for the build.
4.9.2 (2022-12-13)
Bugs fixed
- CVE-2022-2309: A bug in libxml2 2.9.1[0-4] could let namespace declarations
  from a failed parser run leak into later parser runs. This bug was worked around
  in lxml and resolved in libxml2 2.10.0.
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/378
Other changes
- LP#1981760: Element.attrib now registers as collections.abc.MutableMapping.
- lxml now has a static build setup for macOS on ARM64 machines (not used for building wheels).
  Patch by Quentin Leffray.
I’m trying to install lxml on my Windows 8.1 laptop with Python 3.4 and failing miserably.
First off, I tried the simple and obvious solution: pip install lxml. However, this didn’t work. Here’s what it said:
Downloading/unpacking lxml
Running setup.py (path:C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml\setup.py) egg_info for package lxml
Building lxml version 3.4.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n"
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
C:\Python34\lib\distutils\dist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
warning: no previously-included files found matching '*.py'
Installing collected packages: lxml
Running setup.py install for lxml
Building lxml version 3.4.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n"
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
building 'lxml.etree' extension
C:\Python34\lib\distutils\dist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
error: Unable to find vcvarsall.bat
Complete output from command C:\Python34\python.exe -c "import setuptools, tokenize;__file__='C:\\Users\\CARTE_~1\\AppData\\Local\\Temp\\pip_build_carte_000\\lxml\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record C:\Users\CARTE_~1\AppData\Local\Temp\pip-l8vvrv9g-record\install-record.txt --single-version-externally-managed --compile:
Building lxml version 3.4.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n"
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
running install
running build
running build_py
creating build
creating build\lib.win32-3.4
creating build\lib.win32-3.4\lxml
copying src\lxml\builder.py -> build\lib.win32-3.4\lxml
copying src\lxml\cssselect.py -> build\lib.win32-3.4\lxml
copying src\lxml\doctestcompare.py -> build\lib.win32-3.4\lxml
copying src\lxml\ElementInclude.py -> build\lib.win32-3.4\lxml
copying src\lxml\pyclasslookup.py -> build\lib.win32-3.4\lxml
copying src\lxml\sax.py -> build\lib.win32-3.4\lxml
copying src\lxml\usedoctest.py -> build\lib.win32-3.4\lxml
copying src\lxml\_elementpath.py -> build\lib.win32-3.4\lxml
copying src\lxml\__init__.py -> build\lib.win32-3.4\lxml
creating build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\__init__.py -> build\lib.win32-3.4\lxml\includes
creating build\lib.win32-3.4\lxml\html
copying src\lxml\html\builder.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\clean.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\defs.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\diff.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\ElementSoup.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\formfill.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\html5parser.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\soupparser.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\usedoctest.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\_diffcommand.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\_html5builder.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\_setmixin.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\__init__.py -> build\lib.win32-3.4\lxml\html
creating build\lib.win32-3.4\lxml\isoschematron
copying src\lxml\isoschematron\__init__.py -> build\lib.win32-3.4\lxml\isoschematron
copying src\lxml\lxml.etree.h -> build\lib.win32-3.4\lxml
copying src\lxml\lxml.etree_api.h -> build\lib.win32-3.4\lxml
copying src\lxml\includes\c14n.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\config.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\dtdvalid.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\etreepublic.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\htmlparser.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\relaxng.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\schematron.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\tree.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\uri.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xinclude.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xmlerror.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xmlparser.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xmlschema.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xpath.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xslt.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\etree_defs.h -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\lxml-version.h -> build\lib.win32-3.4\lxml\includes
creating build\lib.win32-3.4\lxml\isoschematron\resources
creating build\lib.win32-3.4\lxml\isoschematron\resources\rng
copying src\lxml\isoschematron\resources\rng\iso-schematron.rng -> build\lib.win32-3.4\lxml\isoschematron\resources\rng
creating build\lib.win32-3.4\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\RNG2Schtrn.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\XSD2Schtrn.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl
creating build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_abstract_expand.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_dsdl_include.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_message.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_skeleton_for_xslt1.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_svrl_for_xslt1.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\readme.txt -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
C:\Python34\lib\distutils\dist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
error: Unable to find vcvarsall.bat
----------------------------------------
Cleaning up...
Command C:\Python34\python.exe -c "import setuptools, tokenize;__file__='C:\\Users\\CARTE_~1\\AppData\\Local\\Temp\\pip_build_carte_000\\lxml\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record C:\Users\CARTE_~1\AppData\Local\Temp\pip-l8vvrv9g-record\install-record.txt --single-version-externally-managed --compile failed with error code 1 in C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml
Storing debug log for failure in C:\Users\carte_000\pip\pip.log
So then I looked on this great and helpful thing called The Internet, and a lot of people have the same error of needing libxml2 and libxslt. They recommend a guy called Christoph Gohlke’s page where he provides some sort of binary thingy for a bunch of packages. You can find it here (quicklink to the lxml part).
So after I gave up on trying to find libxml2 and libxslt for pip, I decided to go there, and found an absolute ton of downloads. I know I need a 64-bit one, but I have no idea which "cp" I need.
So an answer either giving me a solution on the pip method or the Gohlke index method would be great.
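For what it's worth, the "cp" part of a binary package filename normally encodes the CPython version (for example, cp34 for CPython 3.4), and the platform part distinguishes 32-bit from 64-bit interpreters. A minimal sketch for checking both on the interpreter you actually run; the helper name `wheel_tags` is made up for this example, and the two platform strings are the Windows tags only:

```python
import struct
import sys


def wheel_tags():
    """Return the CPython tag and the Windows platform tag of this interpreter."""
    # e.g. "cp34" for CPython 3.4
    cp_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    # Pointer size tells us whether the interpreter itself is 32- or 64-bit,
    # which is what matters, not the bitness of the operating system.
    bits = struct.calcsize("P") * 8
    platform_tag = "win_amd64" if bits == 64 else "win32"
    return cp_tag, platform_tag


print(wheel_tags())
```

Note that a 32-bit Python on 64-bit Windows still needs the win32 download.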
lxml is a Python library that makes it easy to process XML and HTML files and can also be used for web scraping. There are plenty of off-the-shelf XML parsers, but for better results developers sometimes prefer to write their own XML and HTML parsers. This is where the lxml library comes into play. Its key benefits are that it is easy to use, extremely fast when parsing large documents, very well documented, and converts data to Python data types easily, which simplifies file manipulation.
In this tutorial we will take a deep dive into Python's lxml library, starting with how to set it up on different operating systems and then discussing its benefits and the wide range of functionality it offers.
There are several ways to install lxml on your system. We'll look at some of them below.
Using Pip
Pip is a Python package manager that downloads and installs libraries on your local system, including all the dependencies of the package you are installing.
If pip is installed on your system, simply run the following command in a terminal or command prompt:
$ pip install lxml
Using apt-get
If you are using a Debian-based Linux distribution such as Ubuntu, you can install lxml by running this command in your terminal (on older releases the Python 2 package is named python-lxml):
$ sudo apt-get install python3-lxml
Using easy_install
You probably won't get to this part, but if none of the above commands works for you for some reason, try easy_install:
$ easy_install lxml
Note: If you want to install a particular version of lxml, you can specify it when running the command in the command prompt or terminal, for example pip install lxml==3.x.y.
By now you should have a copy of the lxml library installed on your local machine. Let's now see what cool things can be done with this library.
Functionality
To use the lxml library in your program, you first need to import it. You can do that with the following statement:
from lxml import etree as et
This imports the etree module, the one we are interested in, from the lxml library.
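A quick way to confirm that the import works and to see which versions you ended up with is to print the version tuples that etree exposes. This is just a sanity-check sketch; the values printed depend on your installation.

```python
from lxml import etree as et

# Each of these is a tuple of integers, e.g. (4, 9, 2, 0) for lxml 4.9.2.
print("lxml:   ", et.LXML_VERSION)
print("libxml2:", et.LIBXML_VERSION)
print("libxslt:", et.LIBXSLT_VERSION)
```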
Creating HTML/XML Documents
Using the etree module, we can create XML/HTML elements and their subelements, which is very useful when we are writing or manipulating a file. Let's try to create the basic structure of an HTML file using etree:
root = et.Element('html', version="5.0")

# Pass the parent node, name of the child node,
# and any number of optional attributes
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize='22')
et.SubElement(root, 'body', fontsize="15")
In the code above, note that the Element function requires at least one parameter, whereas SubElement requires at least two. This is because Element only needs the name of the element to create, whereas SubElement needs both the parent node and the name of the child node to create.
It is also worth knowing that both functions have only a lower bound on the number of arguments they accept, with no upper bound, because you can attach as many attributes as you like. To add an attribute to an element, simply pass an additional keyword argument to the (Sub)Element function in the form attributeName='attribute value'.
Let's run the code we wrote above to get a better feel for these functions:
# Use pretty_print=True to indent the HTML output
print(et.tostring(root, pretty_print=True).decode("utf-8"))
Output:
<html version="5.0">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>
There is another way to create and arrange your elements hierarchically. Let's explore that as well:
root = et.Element('html')
root.append(et.Element('head'))
root.append(et.Element('body'))
In this case, whenever we create a new element, we simply append it to the root or parent node. Note that append() takes a standalone Element; SubElement always needs a parent as its first argument, so it cannot be used here.
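Keyword arguments only work for attribute names that are valid Python identifiers. For names such as data-id or class, Element and SubElement also accept a dictionary of attributes. A minimal sketch, where the attribute names and the URL are made up for illustration:

```python
from lxml import etree as et

root = et.Element('html')

# Attributes whose names are not valid Python identifiers (or are reserved
# words) go into a dict; ordinary ones can still be passed as keywords.
link = et.SubElement(
    root,
    'a',
    {'data-id': '42', 'class': 'external'},
    href='https://example.com',
)

print(et.tostring(root).decode('utf-8'))
```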
Parsing HTML/XML Documents
So far we have only looked at creating new elements, assigning attributes to them, and so on. Now let's look at an example where we already have an HTML or XML file and want to parse it to extract certain information. Assuming we have the HTML tree created in the first example, let's get the tag name of one specific element and then print the tag names of all elements.
print(root.tag)
Output:
html
Now, to iterate over all the child elements of the root node and print their tags:
for e in root:
    print(e.tag)
Output:
head
title
body
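Looping over an element only visits its direct children. If the tree is nested more deeply, the iter() method walks the whole subtree in document order, starting with the element itself. A small sketch with a slightly deeper tree than the one above (here title is placed inside head purely for illustration):

```python
from lxml import etree as et

root = et.Element('html')
head = et.SubElement(root, 'head')
et.SubElement(head, 'title')   # nested one level deeper than before
et.SubElement(root, 'body')

# Direct children only
child_tags = [child.tag for child in root]

# Every element in the subtree, including root itself
all_tags = [element.tag for element in root.iter()]

print(child_tags)
print(all_tags)
```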
Working with Attributes
Let's now see how to attach attributes to existing elements and how to retrieve the value of a particular attribute of a given element.
Using the same root element as before, try the following code:
root.set('newAttribute', 'attributeValue')

# Print root again to see if the new attribute has been added
print(et.tostring(root, pretty_print=True).decode("utf-8"))
Output:
<html version="5.0" newAttribute="attributeValue">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>
Here we can see that newAttribute="attributeValue" has indeed been added to the root element.
Now let's retrieve the values of the attributes we set in the code above. Here we access a child element by indexing into the root element and then use the get() method to fetch the attribute:
print(root.get('newAttribute'))
print(root[1].get('alpha')) # root[1] accesses the `title` element
print(root[1].get('bgcolor'))
Output:
attributeValue
None
red
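Besides set() and get(), every element exposes its attributes through the dictionary-like attrib property, and get() accepts a default value for missing attributes. A short sketch, reusing the title attributes from above; the 'color' attribute and the 'black' default are made up for illustration:

```python
from lxml import etree as et

title = et.Element('title', bgcolor="red", fontsize="22")

# Snapshot of all attributes as a plain dict
attributes = dict(title.attrib)

# get() with a fallback for an attribute that does not exist
color = title.get('color', 'black')

# attrib supports item assignment, just like a dict
title.attrib['fontsize'] = '24'

print(attributes)
print(color)
print(title.get('fontsize'))
```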
Retrieving Text from Elements
Now that we have seen the basic functionality of the etree module, let's do some more interesting things with our HTML and XML files. These files almost always have text between the tags, so let's see how we can add text to our elements:
# Copying the code from the very first example
root = et.Element('html', version="5.0")
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize="22")
et.SubElement(root, 'body', fontsize="15")

# Add text to the Elements and SubElements
root.text = "This is an HTML file"
root[0].text = "This is the head of that file"
root[1].text = "This is the title of that file"
root[2].text = "This is the body of that file and would contain paragraphs etc"

print(et.tostring(root, pretty_print=True).decode("utf-8"))
Output:
<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>
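The text property only holds the text that comes before an element's first child. Text that follows a child element is stored on that child's tail property, and itertext() collects all the pieces in document order. A minimal sketch with a made-up snippet of mixed content:

```python
from lxml import etree as et

# Hypothetical mixed-content snippet, for illustration only.
body = et.XML('<body>Intro <b>bold</b> and <i>italic</i> text</body>')

leading_text = body.text      # text before the first child
after_bold = body[0].tail     # text right after the <b> element

# All text pieces of the subtree, in document order
pieces = list(body.itertext())
full_text = ''.join(pieces)

print(repr(leading_text))
print(repr(after_bold))
print(full_text)
```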
Checking if an Element has Children
Next, there are two very important things we should be able to check, since many web scraping applications need them for exception handling. First, whether an element has children, and second, whether a node is an Element.
Let's do that for the nodes we created above:
if len(root) > 0:
    print("True")
else:
    print("False")
The code above prints "True" because the root node has child nodes. However, if we check the same thing for the root's child nodes, as in the code below, the output is "False".
for i in range(len(root)):
    if len(root[i]) > 0:
        print("True")
    else:
        print("False")
Output:
False
False
False
Now let's do the same to see whether each of the nodes is an Element or not:
for i in range(len(root)):
    print(et.iselement(root[i]))
Output:
True
True
True
The iselement method is useful for determining whether you have a valid Element object, and therefore whether you can continue traversing it using the methods shown here.
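The two checks above are often combined in scraping code, so that a value that is not an element at all (for example None from a failed lookup) is handled the same way as a leaf node. A minimal sketch; the helper name `has_children` is our own and not part of lxml:

```python
from lxml import etree as et


def has_children(node):
    """Return True only if node is an Element with at least one child."""
    return et.iselement(node) and len(node) > 0


root = et.Element('html')
head = et.SubElement(root, 'head')

print(has_children(root))   # root has one child
print(has_children(head))   # head is a leaf
print(has_children(None))   # not an element at all
```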
Checking if an Element has a Parent
We have just shown how to go down the hierarchy, i.e. how to check whether an element has children. In this section we go up the hierarchy, i.e. we check for and retrieve the parent of a child node.
print(root.getparent())
print(root[0].getparent())
print(root[1].getparent())
The first line should return nothing (that is, None), since the root node itself has no parent. The other two should both point to the root element, i.e. the html tag. Let's check the output to see whether it matches our expectations.
Output:
None
<Element html at 0x1103c9688>
<Element html at 0x1103c9688>
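To climb more than one level, you can either call getparent() repeatedly or use iterancestors(), which yields every ancestor from the direct parent up to the root. A small sketch on a made-up nested document:

```python
from lxml import etree as et

# Hypothetical nested document, for illustration only.
root = et.XML('<html><body><div><p>text</p></div></body></html>')
paragraph = root[0][0][0]   # html -> body -> div -> p

# Ancestors are yielded nearest first: parent, grandparent, ..., root
ancestor_tags = [ancestor.tag for ancestor in paragraph.iterancestors()]

print(paragraph.tag)
print(ancestor_tags)
```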
Retrieving Element Siblings
In this section we will learn how to move sideways in the hierarchy, which retrieves an element's siblings in the tree.
Traversing the tree sideways is quite similar to navigating it vertically. For the latter we used getparent and the length of the element; for the former we will use the getnext and getprevious functions. Let's try them on the nodes we created earlier to see how they work:
# root[1] is the `title` tag
print(root[1].getnext()) # The tag after the `title` tag
print(root[1].getprevious()) # The tag before the `title` tag
Output:
<Element body at 0x10b5a75c8>
<Element head at 0x10b5a76c8>
Here you can see that root[1].getnext() retrieved the body tag, since it is the next element, and root[1].getprevious() retrieved the head tag.
Similarly, calling getprevious on root would return None, and calling getnext on root[2] would also return None.
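getnext and getprevious return a single neighbour. To collect all siblings on one side, lxml also provides itersiblings(). A minimal sketch using the same three-child layout as above:

```python
from lxml import etree as et

root = et.XML('<html><head/><title/><body/></html>')
head, title = root[0], root[1]

# Siblings that come after `head`
after_head = [sibling.tag for sibling in head.itersiblings()]

# Siblings that come before `title`
before_title = [sibling.tag for sibling in title.itersiblings(preceding=True)]

print(after_head)
print(before_title)
```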
Parsing XML from a String
Moving on, if we have XML or HTML content as a raw string and want to parse it to obtain or manipulate the required information, we can do so by following the example below:
root = et.XML('<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>')
root[1].text = "The title text has changed!"
print(et.tostring(root, xml_declaration=True).decode('utf-8'))
Output:
<?xml version='1.0' encoding='ASCII'?>
<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">The title text has changed!</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>
As you can see, we successfully changed some text in the HTML document. The XML declaration was also added automatically because of the xml_declaration parameter we passed to the tostring function.
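Strings that come from the outside world are not always well-formed, and et.XML raises et.XMLSyntaxError when they are not. A minimal sketch of guarding against that; the helper name `parse_or_none` is our own, and returning None is just one possible way to handle the failure:

```python
from lxml import etree as et


def parse_or_none(markup):
    """Parse an XML string, returning None instead of raising on bad input."""
    try:
        return et.XML(markup)
    except et.XMLSyntaxError:
        return None


good = parse_or_none('<a><b/></a>')
bad = parse_or_none('<a><b></a>')   # <b> is never closed

print(good.tag)
print(bad)
```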
Searching for Elements
The last thing we will discuss is quite handy when parsing XML and HTML files. We will look at ways to check whether an element has a particular type of child and, if it does, what that child contains.
This has many practical use cases, such as finding all the link elements on a particular web page.
# Assumes `root` is the tree from the text example, before the title text was changed
print(root.find('a')) # No <a> tags exist, so this will be `None`
print(root.find('head').tag)
print(root.findtext('title')) # Directly retrieve the title tag's text
Output:
None
head
This is the title of that file
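find only returns the first match among the direct children. For the link-finding use case mentioned above, findall with a './/' path searches the whole subtree and returns every match. A small sketch; the page content and the href values are made up for illustration:

```python
from lxml import etree as et

# Hypothetical page with one top-level link and one nested link.
page = et.XML(
    '<html><body>'
    '<a href="/one">One</a>'
    '<p><a href="/two">Two</a></p>'
    '</body></html>'
)

# Only <a> elements that are direct children of <body>
direct_links = page.findall('body/a')

# Every <a> element anywhere below the root, in document order
all_links = page.findall('.//a')
hrefs = [link.get('href') for link in all_links]

print(len(direct_links))
print(hrefs)
```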
Conclusion
In this tutorial we started with a basic introduction to what the lxml library is and what it is used for. After that, we learned how to install it in different environments such as Windows, Linux, etc. Moving on, we explored various functions that help us traverse an HTML/XML tree both vertically and sideways. Finally, we discussed ways to find elements in our tree and to obtain information from them.
I’m trying to install lmxl
on my Windows 8.1 laptop with Python 3.4 and failing miserably.
First off, I tried the simple and obvious solution: pip install lxml
. However, this didn’t work. Here’s what it said:
Downloading/unpacking lxml
Running setup.py (path:C:UsersCARTE_~1AppDataLocalTemppip_build_carte_000lxmlsetup.py) egg_info for package lxml
Building lxml version 3.4.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,rnoperable program or batch file.rn"
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
C:Python34libdistutilsdist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
warning: no previously-included files found matching '*.py'
Installing collected packages: lxml
Running setup.py install for lxml
Building lxml version 3.4.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,rnoperable program or batch file.rn"
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
building 'lxml.etree' extension
C:Python34libdistutilsdist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
error: Unable to find vcvarsall.bat
Complete output from command C:Python34python.exe -c "import setuptools, tokenize;__file__='C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('rn', 'n'), __file__, 'exec'))" install --record C:UsersCARTE_~1AppDataLocalTemppip-l8vvrv9g-recordinstall-record.txt --single-version-externally-managed --compile:
Building lxml version 3.4.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,rnoperable program or batch file.rn"
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
running install
running build
running build_py
creating build
creating buildlib.win32-3.4
creating buildlib.win32-3.4lxml
copying srclxmlbuilder.py -> buildlib.win32-3.4lxml
copying srclxmlcssselect.py -> buildlib.win32-3.4lxml
copying srclxmldoctestcompare.py -> buildlib.win32-3.4lxml
copying srclxmlElementInclude.py -> buildlib.win32-3.4lxml
copying srclxmlpyclasslookup.py -> buildlib.win32-3.4lxml
copying srclxmlsax.py -> buildlib.win32-3.4lxml
copying srclxmlusedoctest.py -> buildlib.win32-3.4lxml
copying srclxml_elementpath.py -> buildlib.win32-3.4lxml
copying srclxml__init__.py -> buildlib.win32-3.4lxml
creating buildlib.win32-3.4lxmlincludes
copying srclxmlincludes__init__.py -> buildlib.win32-3.4lxmlincludes
creating buildlib.win32-3.4lxmlhtml
copying srclxmlhtmlbuilder.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtmlclean.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtmldefs.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtmldiff.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtmlElementSoup.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtmlformfill.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtmlhtml5parser.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtmlsoupparser.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtmlusedoctest.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtml_diffcommand.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtml_html5builder.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtml_setmixin.py -> buildlib.win32-3.4lxmlhtml
copying srclxmlhtml__init__.py -> buildlib.win32-3.4lxmlhtml
creating buildlib.win32-3.4lxmlisoschematron
copying srclxmlisoschematron__init__.py -> buildlib.win32-3.4lxmlisoschematron
copying srclxmllxml.etree.h -> buildlib.win32-3.4lxml
copying srclxmllxml.etree_api.h -> buildlib.win32-3.4lxml
copying srclxmlincludesc14n.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesconfig.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesdtdvalid.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesetreepublic.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludeshtmlparser.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesrelaxng.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesschematron.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludestree.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesuri.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesxinclude.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesxmlerror.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesxmlparser.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesxmlschema.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesxpath.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesxslt.pxd -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludesetree_defs.h -> buildlib.win32-3.4lxmlincludes
copying srclxmlincludeslxml-version.h -> buildlib.win32-3.4lxmlincludes
creating buildlib.win32-3.4lxmlisoschematronresources
creating buildlib.win32-3.4lxmlisoschematronresourcesrng
copying srclxmlisoschematronresourcesrngiso-schematron.rng -> buildlib.win32-3.4lxmlisoschematronresourcesrng
creating buildlib.win32-3.4lxmlisoschematronresourcesxsl
copying srclxmlisoschematronresourcesxslRNG2Schtrn.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsl
copying srclxmlisoschematronresourcesxslXSD2Schtrn.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsl
creating buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1
copying srclxmlisoschematronresourcesxsliso-schematron-xslt1iso_abstract_expand.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1
copying srclxmlisoschematronresourcesxsliso-schematron-xslt1iso_dsdl_include.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1
copying srclxmlisoschematronresourcesxsliso-schematron-xslt1iso_schematron_message.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1
copying srclxmlisoschematronresourcesxsliso-schematron-xslt1iso_schematron_skeleton_for_xslt1.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1
copying srclxmlisoschematronresourcesxsliso-schematron-xslt1iso_svrl_for_xslt1.xsl -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1
copying srclxmlisoschematronresourcesxsliso-schematron-xslt1readme.txt -> buildlib.win32-3.4lxmlisoschematronresourcesxsliso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
C:Python34libdistutilsdist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
error: Unable to find vcvarsall.bat
----------------------------------------
Cleaning up...
Command C:Python34python.exe -c "import setuptools, tokenize;__file__='C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('rn', 'n'), __file__, 'exec'))" install --record C:UsersCARTE_~1AppDataLocalTemppip-l8vvrv9g-recordinstall-record.txt --single-version-externally-managed --compile failed with error code 1 in C:UsersCARTE_~1AppDataLocalTemppip_build_carte_000lxml
Storing debug log for failure in C:Userscarte_000pippip.log
So then I looked on this great and helpful thing called The Internet and a lot of people have the same error of needing libxml2
and libxlst
. They recommend a guy called Christoph Gohlke’s page where he provides some sort of binary thingy for a bunch of packages. You can find it here (quicklink to the lxml part).
So after I gave up on trying to find libxml2 and libxslt for pip, I decided to go there, and found an absolute ton of downloads. I know I need a 64-bit one, but I have no idea which «cp
» I need.
So an answer either giving me a solution on the pip
method or the Gohlke index method would be great.
I’m trying to install lmxl
on my Windows 8.1 laptop with Python 3.4 and failing miserably.
First off, I tried the simple and obvious solution: pip install lxml
. However, this didn’t work. Here’s what it said:
Downloading/unpacking lxml
Running setup.py (path:C:UsersCARTE_~1AppDataLocalTemppip_build_carte_000lxmlsetup.py) egg_info for package lxml
Building lxml version 3.4.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,rnoperable program or batch file.rn"
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
C:Python34libdistutilsdist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
warning: no previously-included files found matching '*.py'
Installing collected packages: lxml
Running setup.py install for lxml
Building lxml version 3.4.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,rnoperable program or batch file.rn"
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
building 'lxml.etree' extension
C:Python34libdistutilsdist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
error: Unable to find vcvarsall.bat
Complete output from command C:Python34python.exe -c "import setuptools, tokenize;__file__='C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('rn', 'n'), __file__, 'exec'))" install --record C:UsersCARTE_~1AppDataLocalTemppip-l8vvrv9g-recordinstall-record.txt --single-version-externally-managed --compile:
Building lxml version 3.4.2.
Building without Cython.
ERROR: b"'xslt-config' is not recognized as an internal or external command,rnoperable program or batch file.rn"
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
running install
running build
running build_py
creating build
creating build\lib.win32-3.4
creating build\lib.win32-3.4\lxml
copying src\lxml\builder.py -> build\lib.win32-3.4\lxml
copying src\lxml\cssselect.py -> build\lib.win32-3.4\lxml
copying src\lxml\doctestcompare.py -> build\lib.win32-3.4\lxml
copying src\lxml\ElementInclude.py -> build\lib.win32-3.4\lxml
copying src\lxml\pyclasslookup.py -> build\lib.win32-3.4\lxml
copying src\lxml\sax.py -> build\lib.win32-3.4\lxml
copying src\lxml\usedoctest.py -> build\lib.win32-3.4\lxml
copying src\lxml\_elementpath.py -> build\lib.win32-3.4\lxml
copying src\lxml\__init__.py -> build\lib.win32-3.4\lxml
creating build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\__init__.py -> build\lib.win32-3.4\lxml\includes
creating build\lib.win32-3.4\lxml\html
copying src\lxml\html\builder.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\clean.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\defs.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\diff.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\ElementSoup.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\formfill.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\html5parser.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\soupparser.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\usedoctest.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\_diffcommand.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\_html5builder.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\_setmixin.py -> build\lib.win32-3.4\lxml\html
copying src\lxml\html\__init__.py -> build\lib.win32-3.4\lxml\html
creating build\lib.win32-3.4\lxml\isoschematron
copying src\lxml\isoschematron\__init__.py -> build\lib.win32-3.4\lxml\isoschematron
copying src\lxml\lxml.etree.h -> build\lib.win32-3.4\lxml
copying src\lxml\lxml.etree_api.h -> build\lib.win32-3.4\lxml
copying src\lxml\includes\c14n.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\config.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\dtdvalid.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\etreepublic.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\htmlparser.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\relaxng.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\schematron.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\tree.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\uri.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xinclude.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xmlerror.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xmlparser.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xmlschema.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xpath.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\xslt.pxd -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\etree_defs.h -> build\lib.win32-3.4\lxml\includes
copying src\lxml\includes\lxml-version.h -> build\lib.win32-3.4\lxml\includes
creating build\lib.win32-3.4\lxml\isoschematron\resources
creating build\lib.win32-3.4\lxml\isoschematron\resources\rng
copying src\lxml\isoschematron\resources\rng\iso-schematron.rng -> build\lib.win32-3.4\lxml\isoschematron\resources\rng
creating build\lib.win32-3.4\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\RNG2Schtrn.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl
copying src\lxml\isoschematron\resources\xsl\XSD2Schtrn.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl
creating build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_abstract_expand.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_dsdl_include.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_message.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_skeleton_for_xslt1.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_svrl_for_xslt1.xsl -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\readme.txt -> build\lib.win32-3.4\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
C:\Python34\lib\distutils\dist.py:260: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
error: Unable to find vcvarsall.bat
----------------------------------------
Cleaning up...
Command C:\Python34\python.exe -c "import setuptools, tokenize;__file__='C:\\Users\\CARTE_~1\\AppData\\Local\\Temp\\pip_build_carte_000\\lxml\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record C:\Users\CARTE_~1\AppData\Local\Temp\pip-l8vvrv9g-record\install-record.txt --single-version-externally-managed --compile failed with error code 1 in C:\Users\CARTE_~1\AppData\Local\Temp\pip_build_carte_000\lxml
Storing debug log for failure in C:\Users\carte_000\pip\pip.log
So then I looked on this great and helpful thing called The Internet, and a lot of people have the same error about needing libxml2 and libxslt. They recommend Christoph Gohlke’s page, where he provides some sort of binary thingy for a bunch of packages. You can find it here (quicklink to the lxml part).
So after I gave up on trying to find libxml2 and libxslt for pip, I decided to go there, and found an absolute ton of downloads. I know I need a 64-bit one, but I have no idea which "cp" I need.
So an answer giving me a solution for either the pip method or the Gohlke index method would be great.
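For the Gohlke index method, the "cp" part of a wheel's file name is the CPython version tag (cp34 means CPython 3.4), and the 32/64-bit part has to match the Python interpreter itself rather than Windows. A minimal sketch for checking both from the interpreter that will use lxml, assuming a standard CPython install (the variable names are only illustrative):

```python
import struct
import sys

# CPython version tag as it appears in wheel file names,
# e.g. "cp34" for Python 3.4 or "cp39" for Python 3.9.
cp_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)

# Pointer size of the running interpreter: 32 or 64 bits.
# A 64-bit Windows Python needs a "win_amd64" wheel, a 32-bit one "win32".
bits = struct.calcsize("P") * 8

print(cp_tag, bits)
```

The wheel to download is then the one whose name contains both that tag and the matching platform suffix.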
Hello geeks, I hope all are doing great. Python’s large collection of libraries is one of the language’s biggest strengths: they extend what the language can do and the domains in which it can be used. In this article, we briefly introduce one such library, lxml, and walk through its installation.
lxml Module
This open-source library makes it easy to process XML and HTML in Python. It is a Pythonic binding for the C libraries libxml2 and libxslt. It combines the speed and completeness of these XML libraries with the simplicity of a native Python API. It is compatible with the ElementTree API and extends it. However, we will not go deeper into the module here. In this article, we only focus on its installation.
Requirements
Before heading to the installation procedure, let us first look at the requirements for the lxml library.
- You need Python 2.7, or Python 3.4 or later, installed on the system.
- If you are not using a static binary distribution (e.g. from a Windows binary installer), you need the supporting libraries installed on the system. They are as follows:
  - libxml2 version 2.9.2 or later.
  - libxslt version 1.1.27 or later.
Installing lxml on Linux
On a Debian-based Linux system, the libraries and headers needed to build lxml can be installed with apt-get. The following command installs these development packages.
sudo apt-get install libxml2-dev libxslt-dev python-dev
Alternatively, the following command pulls in the build dependencies without naming each one.
sudo apt-get build-dep python3-lxml
Installing lxml Using PIP
Windows/Linux
If you are a pip user, you can use the following command.
pip install lxml
This command installs the library into your current environment, for example an active virtual environment. To install it system-wide instead, you can use the following command.
sudo pip install lxml
Note: This works only for Linux systems.
You can also pin a specific version in the installation command.
pip install lxml==3.4.2
To speed up the build in test environments, e.g., on a continuous integration server, disable the C compiler optimizations by setting the CFLAGS environment variable:
CFLAGS="-O0" pip install lxml
We can check the installed version from the Python prompt.
>>> import lxml
>>> lxml.__version__
'4.7.1'
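Since the requirements above name minimum libxml2 and libxslt versions, it can also help to see which versions an installed lxml is actually using. A small sketch using the version tuples that lxml.etree exposes (the variable meets_requirements is only illustrative):

```python
from lxml import etree

# Version tuples exposed by lxml.etree.
print("lxml:              ", etree.LXML_VERSION)
print("libxml2 (running): ", etree.LIBXML_VERSION)
print("libxml2 (compiled):", etree.LIBXML_COMPILED_VERSION)
print("libxslt (running): ", etree.LIBXSLT_VERSION)
print("libxslt (compiled):", etree.LIBXSLT_COMPILED_VERSION)

# Compare against the minimum versions listed in the requirements.
meets_requirements = (
    etree.LIBXML_VERSION >= (2, 9, 2)
    and etree.LIBXSLT_VERSION >= (1, 1, 27)
)
print("meets listed minimums:", meets_requirements)
```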
Install lxml in Debian based System
For Debian-based systems, we can use the following command.
sudo apt-get build-dep python3-lxml
Installing Python lxml in MacOS
On macOS, you can use the following command to install the package. With STATIC_DEPS=true, the build also fetches and builds the required libraries, so we need not take care of them separately.
STATIC_DEPS=true sudo pip install lxml
Install python lxml in CentOS
To install lxml on CentOS, we first need to install its dependencies with yum, and then install lxml itself with pip or easy_install.
sudo yum install libxml2 libxml2-devel libxml2-python libxslt libxslt-devel
pip install lxml
or
easy_install lxml
Installing lxml Using Conda
If you are an Anaconda user, you can install it using the following command.
conda install -c anaconda lxml
Installing lxml in Pycharm
To install lxml in PyCharm, follow these steps:
- Open File > Settings > Project from the PyCharm menu.
- Select your current project.
- Click the Python Interpreter tab within your project tab.
- Click the "+" symbol to add a new library to the project.
- Type in the library to be installed, in this case "lxml" without quotes, and click Install Package.
- Wait for the installation to finish and close all pop-ups.
Installing lxml in Jupyter Notebook
To install lxml in jupyter notebook, you can run the following command in the Jupyter notebook code cell.
!pip install lxml
Using lxml with python-libxml2
If you want to use lxml together with the official libxml2 Python bindings (python-libxml2), build lxml statically using the following command. Otherwise, the two packages will interfere in places where the libxml2 library requires global configuration, which may crash the program.
STATIC_DEPS=true pip install lxml
Use Binary wheel files to install lxml
Besides the commands above, there is another option: a prebuilt binary wheel. In this method, we first download the binary wheel file for lxml and then install it with pip install. We can download the file from the Unofficial Windows Binaries website.
Once the download is done, we can install it using the following command.
pip install lxml-4.6.5-cp39-cp39-win_amd64.whl
Installing lxml in RedHat
To install lxml on RedHat, we need to run the following series of commands.
sudo yum install make automake gcc gcc-c++ kernel-devel git-core -y
sudo yum install python-devel -y
sudo curl -o /tmp/ez_setup.py https://sources.rhodecode.com/setuptools/raw/bootstrap/ez_setup.py
sudo /usr/bin/python /tmp/ez_setup.py
sudo /usr/bin/easy_install pip
sudo rm setuptools-*.tar.gz
sudo pip install -i https://pypi.rhodecode.com/ --upgrade pip
sudo pip install virtualenv
FAQs on Python Install lxml
Does lxml come with Python?
No, we need to download it separately.
Is lxml faster than BeautifulSoup?
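Yes, lxml is generally much faster than BeautifulSoup with its default parser. BeautifulSoup can also use lxml as its underlying parser.
Do you need to install a parser library lxml to use BeautifulSoup?
No. BeautifulSoup works with Python’s built-in html.parser. lxml is an optional parser and only needs to be installed if you want BeautifulSoup to use it.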
Conclusion
So, today in this article, we have seen how we can install the lxml library on different platforms. We have taken examples of different environments where we can install the library. I hope this article has helped you. Thank You.
In this lxml Python tutorial, we will explore the lxml library. We will go through the basics of creating XML documents and then jump onto processing XML and HTML documents. Finally, we will put together all the pieces and see how to extract data using lxml. Each step of this tutorial is complete with practical Python lxml examples.
Prerequisite
This tutorial is aimed at developers who have at least a basic understanding of Python. A basic understanding of XML and HTML is also required. Simply put, if you know what an attribute is in XML, that is enough to understand this article.
This tutorial uses Python 3 code snippets but everything works on Python 2 with minimal changes as well.
What is lxml in Python?
lxml is one of the fastest and feature-rich libraries for processing XML and HTML in Python. This library is essentially a wrapper over C libraries libxml2 and libxslt. This combines the speed of the native C library and the simplicity of Python.
Using the Python lxml library, XML and HTML documents can be created, parsed, and queried. It is a dependency of many other complex packages, such as Scrapy.
Installation
The easiest ways to get the lxml library are your system's package manager and the Python Package Index (PyPI). If you are on Linux (Debian-based), simply run:
sudo apt-get install python3-lxml
Another way is to use the pip package manager, which installs from PyPI. This works on Windows, Mac, and Linux:
pip3 install lxml
On Windows, just use pip install lxml, assuming you are running Python 3.
Creating a simple XML document
Any XML or any XML compliant HTML can be visualized as a tree. A tree has a root and branches. Each branch optionally may have further branches. All these branches and the root are represented as an Element.
A very simple XML document would look like this:
<root>
  <branch>
    <branch_one>
    </branch_one>
    <branch_one>
    </branch_one>
  </branch>
</root>
If an HTML is XML compliant, it will follow the same concept.
Note that HTML may or may not be XML compliant. For example, if an HTML has <br> without a corresponding closing tag, it is still valid HTML, but it will not be a valid XML. In the later part of this tutorial, we will see how these cases can be handled. For now, let’s focus on XML compliant HTML.
The Element class
To create an XML document using Python lxml, the first step is to import the etree module of lxml:
>>> from lxml import etree
Every XML document begins with the root element. This can be created using the Element type. The Element type is a flexible container object which can store hierarchical data. This can be described as a cross between a dictionary and a list.
In this Python lxml example, the objective is to create an HTML document which is XML compliant. It means that the root element will have its name as html:
>>> root = etree.Element("html")
Similarly, every html will have a head and a body:
>>> head = etree.Element("head")
>>> body = etree.Element("body")
To create parent-child relationships, we can simply use the append() method.
>>> root.append(head)
>>> root.append(body)
This document can be serialized and printed to the terminal with the help of tostring() function. This function expects one mandatory argument, which is the root of the document. We can optionally set pretty_print to True to make the output more readable. Note that tostring() serializer actually returns bytes. This can be converted to string by calling decode():
>>> print(etree.tostring(root, pretty_print=True).decode())
The SubElement class
Creating an Element object and calling the append() function can make the code messy and unreadable. The easiest way is to use SubElement(). It takes at least two arguments: the parent node and the element name. Using SubElement, the following two lines of code can be replaced by just one.
body = etree.Element("body")
root.append(body)
# is the same as
body = etree.SubElement(root, "body")
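As a quick check of that equivalence, the sketch below builds the same tiny tree both ways and compares the serialized results. The variable names (root_a, root_b and so on) are only illustrative:

```python
from lxml import etree

# Variant 1: create the element, then attach it with append().
root_a = etree.Element("html")
body_a = etree.Element("body")
root_a.append(body_a)

# Variant 2: create and attach in one step with SubElement().
root_b = etree.Element("html")
body_b = etree.SubElement(root_b, "body")

# Both variants serialize to the same document.
same = etree.tostring(root_a) == etree.tostring(root_b)
print(same)  # True
```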
Setting text and attributes
Setting text is very easy with the lxml library. Every element, whether created with Element or SubElement, exposes a text attribute and a set() method: the former is used to specify the text, and the latter is used to set attributes. Here are the examples:
para = etree.SubElement(body, "p")
para.text="Hello World!"
Similarly, attributes can be set using key-value convention:
para.set("style", "font-size:20pt")
One thing to note here is that the attribute can be passed in the constructor of SubElement:
para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara")
para.text = "Hello World!"
This approach saves lines of code and improves clarity. Here is the complete code. Save it in a Python file and run it. It will print an HTML document which is also well-formed XML.
from lxml import etree
root = etree.Element("html")
head = etree.SubElement(root, "head")
title = etree.SubElement(head, "title")
title.text = "This is Page Title"
body = etree.SubElement(root, "body")
heading = etree.SubElement(body, "h1", style="font-size:20pt", id="head")
heading.text = "Hello World!"
para = etree.SubElement(body, "p", id="firstPara")
para.text = "This HTML is XML Compliant!"
para = etree.SubElement(body, "p", id="secondPara")
para.text = "This is the second paragraph."
etree.dump(root) # prints everything to console. Use for debug only
Note that here we used etree.dump() instead of calling etree.tostring(). The difference is that dump() simply writes everything to the console and doesn’t return anything, whereas tostring() is used for serialization and returns the serialized document (bytes by default), which you can store in a variable or write to a file. dump() is good for debugging only and should not be used for any other purpose.
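To make the return type of tostring() concrete, here is a minimal sketch on a separate two-element tree. It relies on the encoding="unicode" option of tostring(), which returns a str instead of bytes:

```python
from lxml import etree

root = etree.Element("html")
etree.SubElement(root, "body").text = "Hello"

as_bytes = etree.tostring(root)                     # bytes by default
as_text = etree.tostring(root, encoding="unicode")  # str when asked for "unicode"

print(type(as_bytes).__name__, type(as_text).__name__)  # bytes str
print(as_text)  # <html><body>Hello</body></html>
```

Bytes are the natural choice when writing to a file opened in binary mode, while the str form is convenient for printing or logging.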
Add the following lines at the bottom of the snippet and run it again:
with open('input.html', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))
This will save the contents to input.html in the same folder you were running the script. Again, this is a well-formed XML, which can be interpreted as XML or HTML.
How do you parse an XML file using LXML in Python?
The previous section was a Python lxml tutorial on creating XML files. In this section, we will look at traversing and manipulating an existing XML document using the lxml library.
Before we move on, save the following snippet as input.html.
<html>
<head>
<title>This is Page Title</title>
</head>
<body>
<h1 style="font-size:20pt" id="head">Hello World!</h1>
<p id="firstPara">This HTML is XML Compliant!</p>
<p id="secondPara">This is the second paragraph.</p>
</body>
</html>
When an XML document is parsed, the result is an in-memory ElementTree object.
The raw XML contents can be in a file system or a string. If it is in a file system, it can be loaded using the parse method. Note that the parse method will return an object of type ElementTree. To get the root element, simply call the getroot() method.
from lxml import etree
tree = etree.parse('input.html')
elem = tree.getroot()
etree.dump(elem) #prints file contents to console
The lxml.etree module exposes another method that can be used to parse contents from a valid xml string — fromstring()
xml = '<html><body>Hello</body></html>'
root = etree.fromstring(xml)
etree.dump(root)
One important difference to note here is that the fromstring() method returns an Element. There is no need to call getroot().
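The two entry points can be compared side by side without touching the file system. This sketch assumes an in-memory io.BytesIO object standing in for a file, since parse() also accepts file-like objects:

```python
import io

from lxml import etree

xml = b"<html><body>Hello</body></html>"

# parse() takes a file name or file-like object and returns an ElementTree.
tree = etree.parse(io.BytesIO(xml))
root_from_tree = tree.getroot()

# fromstring() takes the markup itself and returns the root Element directly.
root_from_string = etree.fromstring(xml)

print(root_from_tree.tag, root_from_string.tag)  # html html
```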
If you want to dig deeper into parsing, we have already written a tutorial on BeautifulSoup, a Python package used for parsing HTML and XML documents. But to quickly answer what is lxml in BeautifulSoup, lxml can use BeautifulSoup as a parser backend. Similarly, BeautifulSoup can employ lxml as a parser.
Finding elements in XML
Broadly, there are two ways of finding elements using the Python lxml library. The first is the find() and findall() methods, which accept ElementPath expressions, a small query language modelled on a subset of XPath. For example, the following code will return the first paragraph element.
Note that the selector is very similar to XPath. Also note that the root element name was not used because elem contains the root of the XML tree.
tree = etree.parse('input.html')
elem = tree.getroot()
para = elem.find('body/p')
etree.dump(para)
# Output
# <p id="firstPara">This HTML is XML Compliant!</p>
Similarly, findall() will return a list of all the elements matching the selector.
elem = tree.getroot()
para = elem.findall('body/p')
for e in para:
    etree.dump(e)
# Outputs
# <p id="firstPara">This HTML is XML Compliant!</p>
# <p id="secondPara">This is the second paragraph.</p>
The second way of selecting the elements is by using XPath directly. This approach is easier to follow by developers who are familiar with XPath. Furthermore, XPath can be used to return the instance of the element, the text, or the value of any attribute using standard XPath syntax.
para = elem.xpath('//p/text()')
for e in para:
    print(e)
# Output
# This HTML is XML Compliant!
# This is the second paragraph.
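The same xpath() method can return attribute values or elements selected by a predicate. A short sketch using an inline copy of the sample document, so that it runs without input.html:

```python
from lxml import etree

root = etree.fromstring(
    "<html><body>"
    '<h1 style="font-size:20pt" id="head">Hello World!</h1>'
    '<p id="firstPara">This HTML is XML Compliant!</p>'
    '<p id="secondPara">This is the second paragraph.</p>'
    "</body></html>"
)

# Attribute values: the id of every <p> element.
ids = root.xpath("//p/@id")
print(ids)  # ['firstPara', 'secondPara']

# Elements: the <p> whose id attribute equals "secondPara".
second = root.xpath('//p[@id="secondPara"]')
print(second[0].text)  # This is the second paragraph.
```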
Handling HTML with lxml.html
Throughout this article, we have been working with well-formed HTML which is XML compliant. This will not be the case a lot of the time. For these scenarios, you can simply use lxml.html instead of lxml.etree.
Note that html.fromstring() expects the document as a string rather than a file name, so the file contents are read into a string first (lxml.html also provides a parse() function that reads from a file). Here is the code to print all paragraphs from the same HTML file.
from lxml import html
with open('input.html') as f:
    html_string = f.read()
tree = html.fromstring(html_string)
para = tree.xpath('//p/text()')
for e in para:
    print(e)
# Output
# This HTML is XML Compliant!
# This is the second paragraph.
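To see why lxml.html matters for HTML that is not XML compliant, the sketch below feeds the same markup, containing an unclosed <br>, to both parsers. The sample string is made up for illustration:

```python
from lxml import etree, html

# Valid HTML, but not well-formed XML: <br> is never closed.
broken = "<html><body><p>First line<br>Second line</p></body></html>"

# The XML parser rejects the document.
try:
    etree.fromstring(broken)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False
print(xml_ok)  # False

# The HTML parser accepts it and still exposes the text around <br>.
tree = html.fromstring(broken)
lines = tree.xpath("//p/text()")
print(lines)  # ['First line', 'Second line']
```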
lxml web scraping tutorial
Now that we know how to parse and find elements in XML and HTML, the only missing piece is getting the HTML of a web page.
For this, the ‘requests’ library is a great choice. It can be installed using the pip package manager:
pip install requests
Once the requests library is installed, HTML of any web page can be retrieved using a simple get() method. Here is an example.
import requests
response = requests.get('http://books.toscrape.com/')
print(response.text)
# prints source HTML
This can be combined with lxml to retrieve any data that is required.
Here is a quick example that prints a list of countries from Wikipedia:
import requests
from lxml import html
response = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010')
tree = html.fromstring(response.text)
countries = tree.xpath('//span[@class="flagicon"]')
for country in countries:
    print(country.xpath('./following-sibling::a/text()')[0])
In this code, the HTML returned by response.text is parsed into the variable tree. This can be queried using standard XPath syntax. The XPaths can be concatenated. Note that the xpath() method returns a list and thus only the first item is taken in this code snippet.
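Because xpath() returns a list, indexing with [0] raises an IndexError whenever a node has no match, which is common on real pages. A hedged sketch of a more defensive loop, using a small made-up HTML fragment instead of a live request:

```python
from lxml import html

# Hypothetical fragment: the second flag icon has no link next to it.
tree = html.fromstring(
    "<ul>"
    '<li><span class="flagicon"></span><a href="/a">Alpha</a></li>'
    '<li><span class="flagicon"></span></li>'
    "</ul>"
)

names = []
for span in tree.xpath('//span[@class="flagicon"]'):
    links = span.xpath("./following-sibling::a/text()")
    # Take the first match if there is one, otherwise record None.
    names.append(links[0] if links else None)

print(names)  # ['Alpha', None]
```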
This can easily be extended to read any attribute from the HTML. For example, the following modified code prints the country name and image URL of the flag.
for country in countries:
    flag = country.xpath('./img/@src')[0]
    country = country.xpath('./following-sibling::a/text()')[0]
    print(country, flag)
Conclusion
In this Python lxml tutorial, various aspects of XML and HTML handling using the lxml library have been introduced. Python lxml library is a light-weight, fast, and feature-rich library. This can be used to create XML documents, read existing documents, and find specific elements. This makes this library equally powerful for both XML and HTML documents. Combined with requests library, it can also be easily used for web scraping.
You can read up and learn more on web scraping using Selenium or other useful libraries like Beautiful Soup in our blog.
lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers. This is when the lxml library comes into play. The key benefits of this library are that it is easy to use, extremely fast when parsing large documents, very well documented, and provides easy conversion of data to Python data types, resulting in easier file manipulation.
In this tutorial, we will deep dive into Python’s lxml library, starting with how to set it up for different operating systems, and then discussing its benefits and the wide range of functionalities it offers.
Installation
There are multiple ways to install lxml on your system. We’ll explore some of them below.
Using Pip
Pip is a Python package manager which is used to download and install Python libraries to your local system with ease, i.e. it also downloads and installs the Python dependencies of the package you’re installing.
If you have pip installed on your system, simply run the following command in terminal or command prompt:
$ pip install lxml
Using apt-get
If you’re using a Debian-based Linux distribution such as Ubuntu, you can install lxml by running this command in your terminal:
$ sudo apt-get install python-lxml
Using easy_install
You probably won’t get to this part, but if none of the above commands works for you for some reason, try using easy_install:
$ easy_install lxml
Note: If you wish to install any particular version of lxml, you can simply state it when you run the command in the command prompt or terminal like this, lxml==3.x.y.
By now, you should have a copy of the lxml library installed on your local machine. Let’s now get our hands dirty and see what cool things can be done using this library.
Functionality
To be able to use the lxml library in your program, you first need to import it. You can do that by using the following command:
from lxml import etree as et
This will import the etree module, the module of our interest, from the lxml library.
Creating HTML/XML Documents
Using the etree module, we can create XML/HTML elements and their subelements, which is a very useful thing if we’re trying to write or manipulate an HTML or XML file. Let’s try to create the basic structure of an HTML file using etree:
root = et.Element('html', version="5.0")
# Pass the parent node, name of the child node,
# and any number of optional attributes
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize='22')
et.SubElement(root, 'body', fontsize="15")
In the code above, you need to know that the Element function requires at least one parameter, whereas the SubElement function requires at least two. This is because the Element function only requires the name of the element to be created, whereas the SubElement function requires both the parent element and the name of the child node to be created.
It’s also important to know that both these functions only have a lower bound to the number of arguments they can accept, but no upper bound because you can associate as many attributes with them as you want. To add an attribute to an element, simply add an additional parameter to the (Sub)Element function and specify your attribute in the form of attributeName='attribute value'.
Let’s try to run the code we wrote above to gain a better intuition regarding these functions:
# Use pretty_print=True to indent the HTML output
print (et.tostring(root, pretty_print=True).decode("utf-8"))
Output:
<html version="5.0">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>
There’s another way to create and organize your elements in a hierarchical manner. Let’s explore that as well:
root = et.Element('html')
root.append(et.Element('head'))
root.append(et.Element('body'))
So in this case, whenever we create a new element, we simply append it to the root/parent node.
Parsing HTML/XML Documents
Until now, we have only considered creating new elements, assigning attributes to them, etc. Let’s now see an example where we already have an HTML or XML file, and we wish to parse it to extract certain information. Assuming that we have the HTML file that we created in the first example, let’s try to get the tag name of one specific element, followed by printing the tag names of all the elements.
print(root.tag)
Output:
html
Now to iterate through all the child elements in the root node and print their tags:
for e in root:
    print(e.tag)
Output:
head
title
body
Working with Attributes
Let’s now see how we associate attributes to existing elements, as well as how to retrieve the value of a particular attribute for a given element.
Using the same root element as before, try out the following code:
root.set('newAttribute', 'attributeValue')
# Print root again to see if the new attribute has been added
print(et.tostring(root, pretty_print=True).decode("utf-8"))
Output:
<html version="5.0" newAttribute="attributeValue">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>
Here we can see that the newAttribute="attributeValue" has indeed been added to the root element.
Let’s now try to get the values of the attributes we have set in the above code. Here we access a child element using array indexing on the root element, and then use the get() method to retrieve the attribute:
print(root.get('newAttribute'))
print(root[1].get('alpha')) # root[1] accesses the `title` element
print(root[1].get('bgcolor'))
Output:
attributeValue
None
red
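Since get() returns None for a missing attribute, it is often convenient to pass a default or to check membership first. A small sketch on a standalone title element (the "n/a" default is just an example value):

```python
from lxml import etree as et

title = et.Element("title", bgcolor="red", fontsize="22")

print(title.get("alpha"))         # None: the attribute does not exist
print(title.get("alpha", "n/a"))  # n/a: explicit default instead of None
print("bgcolor" in title.attrib)  # True: attrib behaves like a dictionary
print(sorted(title.items()))      # [('bgcolor', 'red'), ('fontsize', '22')]
```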
Retrieving Text from Elements
Now that we have seen basic functionalities of the etree module, let’s try to do some more interesting things with our HTML and XML files. Almost always, these files have some text in between the tags. So, let’s see how we can add text to our elements:
# Copying the code from the very first example
root = et.Element('html', version="5.0")
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize="22")
et.SubElement(root, 'body', fontsize="15")
# Add text to the Elements and SubElements
root.text = "This is an HTML file"
root[0].text = "This is the head of that file"
root[1].text = "This is the title of that file"
root[2].text = "This is the body of that file and would contain paragraphs etc"
print(et.tostring(root, pretty_print=True).decode("utf-8"))
Output:
<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>
Check if an Element has Children
Next, there are two very important things that we should be able to check, as that is required in a lot of web scraping applications for exception handling. First thing we’d like to check is whether or not an element has children, and second is whether or not a node is an Element.
Let’s do that for the nodes we created above:
if len(root) > 0:
    print("True")
else:
    print("False")
The above code will output "True" since the root node does have child nodes. However, if we check the same thing for the root’s child nodes, like in the code below, the output will be "False".
for i in range(len(root)):
    if len(root[i]) > 0:
        print("True")
    else:
        print("False")
Output:
False
False
False
Now let’s do the same thing to see if each of the nodes is an Element or not:
for i in range(len(root)):
print(et.iselement(root[i]))
Output:
True
True
True
The iselement method is helpful for determining if you have a valid Element object, and thus if you can continue traversing it using the methods we’ve shown here.
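The two checks can be combined into one small helper. This is only a sketch, and has_children is a hypothetical name rather than part of lxml. It uses len() explicitly, because testing an element's truth value directly (if node:) depends on its child count and is discouraged:

```python
from lxml import etree as et

root = et.Element("html")
head = et.SubElement(root, "head")


def has_children(node):
    """Return True only if node is an Element with at least one child."""
    return et.iselement(node) and len(node) > 0


print(has_children(root))        # True
print(has_children(head))        # False
print(has_children("not-node"))  # False: a plain string is not an Element
```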
Check if an Element has a Parent
Just now, we showed how to go down the hierarchy, i.e. how to check if an element has children or not, and now in this section we will try to go up the hierarchy, i.e. how to check and get the parent of a child node.
print(root.getparent())
print(root[0].getparent())
print(root[1].getparent())
The first line should return nothing (aka None) as the root node itself doesn’t have any parent. The other two should both point to the root element i.e. the HTML tag. Let’s check the output to see if it is what we expect:
Output:
None
<Element html at 0x1103c9688>
<Element html at 0x1103c9688>
Retrieving Element Siblings
In this section we will learn how to traverse sideways in the hierarchy, which retrieves an element’s siblings in the tree.
Traversing the tree sideways is quite similar to navigating it vertically. For the latter, we used getparent() and the length of the element; for the former, we’ll use the getnext() and getprevious() functions. Let’s try them on nodes that we previously created to see how they work:
# root[1] is the `title` tag
print(root[1].getnext()) # The tag after the `title` tag
print(root[1].getprevious()) # The tag before the `title` tag
Output:
<Element body at 0x10b5a75c8>
<Element head at 0x10b5a76c8>
Here you can see that root[1].getnext() retrieved the "body" tag since it was the next element, and root[1].getprevious() retrieved the "head" tag.
Similarly, if we had called getprevious on root, it would have returned None because the root has no siblings, and if we had called getnext on root[2], it would also have returned None because "body" is the last child.
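Because getnext returns None after the last sibling, it can drive a simple loop. The sketch below uses a hypothetical helper, following_siblings, on a reduced version of the tree; lxml also provides an itersiblings method that covers the same ground.

```python
from lxml import etree as et

# Reduced version of the tutorial tree (text content left out for brevity).
root = et.XML("<html><head/><title/><body/></html>")

def following_siblings(node):
    """Hypothetical helper: tag names of every sibling that comes after `node`."""
    tags = []
    sibling = node.getnext()
    while sibling is not None:
        tags.append(sibling.tag)
        sibling = sibling.getnext()
    return tags

print(following_siblings(root[0]))  # ['title', 'body']
print(following_siblings(root[2]))  # []
print(following_siblings(root))     # []
```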
Parsing XML from a String
Moving on, if we have XML (or well-formed HTML) markup as a raw string and want to parse it in order to read or change the information it holds, we can follow the example below:
root = et.XML('<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>')
root[1].text = "The title text has changed!"
print(et.tostring(root, xml_declaration=True).decode('utf-8'))
Output:
<?xml version='1.0' encoding='ASCII'?>
<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">The title text has changed!</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>
As you can see, we successfully changed some text in the HTML document. The XML declaration on the first line was added automatically because of the xml_declaration parameter that we passed to the tostring function.
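Strings that come from the web are not always well-formed, so it can help to wrap the parse in a try block. The sketch below assumes a hypothetical helper named parse_or_none and uses made-up markup. It also passes an explicit encoding to tostring, which changes the encoding named in the declaration.

```python
from lxml import etree as et

def parse_or_none(markup):
    """Hypothetical helper: return the root element, or None if the markup is not well-formed."""
    try:
        return et.fromstring(markup)
    except et.XMLSyntaxError:
        return None

root = parse_or_none('<html version="5.0"><title>Old title</title></html>')
broken = parse_or_none("<html><title>Unclosed</html>")  # mismatched tags

print(broken)  # None
root[0].text = "New title"

# tostring() returns bytes; the declaration reports the encoding we asked for.
serialized = et.tostring(root, xml_declaration=True, encoding="UTF-8")
print(serialized.decode("utf-8"))
```

Returning None for bad input is only one possible design; re-raising with a clearer message or logging the error would work equally well.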
Searching for Elements
The last thing we’re going to discuss is quite handy when parsing XML and HTML files. We will look at ways to check whether an Element has children of a particular type and, if it does, what they contain.
This has many practical use cases, such as finding all of the link elements on a particular web page.
print(root.find('a')) # No <a> tags exist, so this will be `None`
print(root.find('head').tag)
print(root.findtext('title')) # Directly retrieve the title tag's text
Output:
None
head
This is the title of that file
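find only returns the first match. To collect every match, such as all links on a page, findall and iter can be used. The following is a sketch on made-up markup, not the tree from earlier in this tutorial.

```python
from lxml import etree as et

# Sample markup with one top-level link and one nested link (invented for the example).
root = et.XML(
    "<html><body>"
    '<a href="/home">Home</a>'
    '<p>Read the <a href="/docs">docs</a></p>'
    "</body></html>"
)

body = root.find("body")

# A bare tag name only matches direct children of the element.
print([a.get("href") for a in body.findall("a")])     # ['/home']

# The './/' prefix searches all descendants.
print([a.get("href") for a in root.findall(".//a")])  # ['/home', '/docs']

# iter() walks the whole subtree in document order.
print([a.text for a in root.iter("a")])               # ['Home', 'docs']
```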
Conclusion
In this tutorial, we started with a basic introduction to what the lxml library is and what it is used for. After that, we learned how to install it in different environments such as Windows and Linux. Moving on, we explored functionality that helps us traverse the HTML/XML tree vertically as well as sideways. Finally, we discussed ways to find elements in our tree and to obtain information from them.