Questions and Answers
What is the main advantage of open file formats over proprietary file formats?
HTML files are binary files that create static web pages.
False
One of the advantages of LaTeX is that it is a WYSIWYG (What You See Is What You Get) language.
False
Which programming language is used by Perl to perform regular expression matching?
Give an example of a regular expression that matches a string starting with 'M', followed by a vowel, at least one letter, a space, and ending with at least one digit.
What is the role of the 'chop' instruction in Perl?
Explain the difference between the 'print' and 'return' statements in Perl.
The Chomsky hierarchy classifies grammars into 4 types: Type 0, Type 1, Type 2, and Type 3. Type 0 is the most complex and contains all types of grammars, and Type 3 is the simplest and corresponds to regular grammars.
Match each type of grammar with its recognition complexity.
Study Notes
Introduction to Perl and Texts
- Course: M1 ÉdNITL-LTTAC 2024-2025
- Instructor: Fabien TORRE
- Institution: Université de Lille
Motivations and Context
- Reasons for learning Perl:
- Mastering data
- Working with unstructured text
- Exploring Big Data and open data
- Discovering the hidden web and data journalism
- Automatically creating text corpora from the web
- Automating the creation of documents
- Converting between different formats (text, HTML, LaTeX)
Some Ideas (1)
- Motivations:
- Mastering data
- Utilizing unstructured texts
- Exploring big data, open data, etc.
- Discovering hidden web and data journalism
- Creating automatic corpora from the web
- Automating document generation
- Handling format changes (text, HTML, LaTeX)
Some Ideas (2)
- Examples of applications:
- Text generation (prefixes/suffixes, conjugations, proper nouns, etc.)
- Entity recognition and typing in text
- Automatic text annotation
- Discovering co-occurrences
- Concordances
- Anagram generators
- Automatic text classification
- Access to Medline, Wikileaks, Enron documents
- Retrieving information from various sources
Perl Overview (1)
- Perl's strengths:
- "Glue language"
- Simple syntax for files
- Easy handling of regular expressions
- Useful for text processing
- Turing complete
Perl Overview (2)
- Characteristics:
- Natural language-like syntax
- Multiple ways to express the same thing
- Can generate poems or complete programs in one line
- Semantics depend on context
- No mandatory variable declaration
- Default variables exist
- Note: These characteristics run counter to typical algorithmic principles.
Perl and Algorithmics (1)
- Good practices:
- `use strict;`
- `use warnings;`
- The `affiche_tableau` subroutine is given as an example showcasing these good practices.
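The notes name `affiche_tableau` but do not reproduce its body. The following is a minimal sketch of what such a subroutine could look like under `use strict` and `use warnings`; the body, the index display, and the return value are assumptions made for illustration, not the course's actual code.

```perl
use strict;
use warnings;

# Hypothetical version of affiche_tableau: prints each element of an
# array on its own line, preceded by its index.
sub affiche_tableau {
    my (@tableau) = @_;    # copy the arguments into a lexical array
    for my $i (0 .. $#tableau) {
        print "$i: $tableau[$i]\n";
    }
    return scalar @tableau;    # number of elements printed
}

affiche_tableau('Perl', 'HTML', 'LaTeX');
```

With `use strict`, every variable has to be declared with `my`, so a misspelled variable name is reported at compile time instead of silently creating a new variable.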
Perl and Algorithmics (2)
- Bad practices:
- Example of a poorly written subroutine, `affiche`.
- Highlights potential mistakes in Perl programming.
Work Environments
- Required tools: Perl, console, and text editor
- Installation methods:
- Linux: Perl usually pre-installed.
- MacOS: Perl might need installation.
- Windows: Use Windows Subsystem for Linux, Strawberry Perl, or a virtual machine.
- Online: Use a web-based interpreter.
- Recommendation: prioritize Linux on university machines (often dual-boot).
Linux (Overview)
- Linux topics:
- Motivations and context
- Free and open-source software, Linux distributions
- Linux in practice
- Open formats
Linux (General Information)
- Components of a Linux distribution:
- Operating system
- File system
- Graphical environment
- Applications (console, text editor, office suite, archivers/compressors, etc.)
- Web browser, mail reader
Linux (Distributions)
- Examples of Linux distributions: Ubuntu, Mint, Debian, Mandriva, Gentoo, Fedora
- Different graphical environments (Xfce, KDE, Gnome, Cinnamon)
- Different strategies for software choices and updates
Linux (File System)
- Principles: Hierarchical structure of directories and files
- Permissions: read (r), write (w), execute (x), for users, groups, and others.
- Notations:
- `/`: root directory
- `~`: user's home directory
- `~user`: another user's home directory
- `.`: current directory
- `..`: parent directory
- a path: the specified file or folder
Linux (Syntax and Commands)
- Command syntax: parameters, options, background execution, redirection, the error channel, pipes, and running programs
Basic Linux Commands (1/2)
- Navigation and management:
- `cd`: Change directory
- `ls`: List directory contents
- `pwd`: Show current directory
- `cp`: Copy files/directories
- `mv`: Move files/directories
- `rm`: Remove files/directories
- `mkdir`: Create a directory
- Archiving and compression:
- `tar`: Archive a directory
- `bzip2`: Compress a file
Basic Linux Commands (2/2)
- Information about commands:
- `history`: Previous commands
- `man`: Manual of a command
- `apropos`: Command search
- `which`: Command lookup
- `top`: Active processes
- Text editors/viewers:
- `gnome-text-editor`: Text editor (varies by distribution)
- `evince`: PDF viewer
- `eog`: Image viewer
- Web access:
- `wget`: Retrieve files
- `firefox-esr`: Web browser
Text File Management
- Basic text file utilities:
- `wc`: counting lines, words, and characters
- `grep`: searching within content
- `cut`: extracting data
- `sort`: sorting lines
- `uniq`: removing duplicate lines
- `more`: viewing content page by page
- `head`/`tail`: displaying the beginning or the end of a file
- `diff`: differences between files
Commands for Other Documents
- Search and information: `find`, `file`
- Document text extraction: `pdftotext`, `tesseract`
- Conversion between formats: `pandoc`
- PDF manipulation: `pdfinfo`, `xournal`, `pdftk`, `pdfjam`
Terminal
- Terminal shortcuts:
- `clear` or `Ctrl+L`: Clear the console
- `Ctrl++` and `Ctrl+-`: Change character size
- `Tab`: Completion of instructions and file names
- Arrow keys (up/down): Access history
- `!deb`: Search and launch the last command starting with 'deb'
- `Alt+Tab`: Window switching
- No mouse needed
Open Formats (Overview)
- Overview of open formats
- Discussion of closed formats, binary format, proprietary formats.
- Advantages and disadvantages of open formats for automatic processing
HTML Overview
- HTML files are editable text files with tags (e.g., `<html>`, `<body>`, `<h1>`, `<p>`, `<table>`).
- HTML has tags, attributes (e.g., `align`, `size`), nesting, and a tree-like structure. An example of HTML is provided.
HTML plus Semantics Example
- Example HTML with semantic structure highlighting the use of `<head>`, `<link>`, and semantic elements for improved organization.
CSS
- Styling of HTML with CSS (Cascading Style Sheets) is demonstrated, showing how CSS allows for formatting without affecting the basic HTML structure. The example includes style definitions for the `<body>` and `<h1>` elements.
Markdown
- Introduction to Markdown formatting.
- Usage of different levels of headings and lists with examples.
LaTeX
- LaTeX overview, principles, formatting, example document.
LaTeX Software
- Description of LaTeX as a program for transforming LaTeX files into PDF files (`pdflatex`).
- Highlights of LaTeX's features, reliability (rare bugs), quality for printing, and its historical development.
Number Encodings
- Numbers are represented in binary form (0s and 1s).
- Principles similar to the decimal system (base 10) but limited by binary/machine constraints.
- Encoding and decoding, and addition are discussed.
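As an illustration of encoding and decoding in base 2, here is a minimal Perl sketch. The function names `to_binary` and `from_binary` are hypothetical and chosen for this example; Perl's built-in `sprintf('%b', $n)` and `oct("0b$bits")` do the same work, but the loops make the principle explicit.

```perl
use strict;
use warnings;

# Encode a non-negative integer as a string of binary digits
# by repeated division by 2.
sub to_binary {
    my ($n) = @_;
    return '0' if $n == 0;
    my $bits = '';
    while ($n > 0) {
        $bits = ($n % 2) . $bits;    # the remainder is the next bit, right to left
        $n    = int($n / 2);
    }
    return $bits;
}

# Decode a string of binary digits back into an integer.
sub from_binary {
    my ($bits) = @_;
    my $n = 0;
    for my $bit (split //, $bits) {
        $n = $n * 2 + $bit;          # shift left by one position, then add the bit
    }
    return $n;
}

print to_binary(13), "\n";            # 1101
print from_binary('1101'), "\n";      # 13
print to_binary(13 + 3), "\n";        # 10000
```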
Text Encodings
- Characters are numbered, and these numbers are then encoded.
- Number of bits used per character.
- ASCII, ISO-Latin1, UTF-16, UTF-8 are mentioned as encoding standards.
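To make the difference between these encodings concrete, the sketch below counts how many bytes the same character occupies in each of them, using Perl's core `Encode` module. The helper name `byte_length` is hypothetical, and the character is written as the escape `\x{E9}` (é) so that the example does not depend on how the source file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes needed to store a string in a given encoding.
sub byte_length {
    my ($text, $encoding) = @_;
    return length(encode($encoding, $text));
}

my $e_acute = "\x{E9}";    # the character é, number 233

print ord('A'), "\n";                             # 65, the ASCII number of 'A'
print byte_length($e_acute, 'ISO-8859-1'), "\n";  # 1 byte in ISO-Latin1
print byte_length($e_acute, 'UTF-8'), "\n";       # 2 bytes in UTF-8
print byte_length($e_acute, 'UTF-16BE'), "\n";    # 2 bytes in UTF-16
```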
Perl (Language Basics)
- Core Perl language components
- Syntax elements
- Control Structures
- Functions and procedures
- Arrays
- Files (input/output)
Syntax and Comments
- Basic Perl syntax elements
- `use strict;`
- `use warnings;`
- Handling accents with `use utf8`
- Handling of standard output and error streams in UTF-8
- Example of displaying text
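The elements listed above can be put together as a program header. This is a sketch under the assumption that the source file is saved in UTF-8; the `greeting` subroutine and its text are hypothetical and only there to have something to display.

```perl
use strict;
use warnings;
use utf8;    # the source code itself contains UTF-8 characters

# Encode everything sent to standard output and standard error as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)');

# Hypothetical text containing an accented character.
sub greeting {
    return "Université de Lille";
}

print greeting(), "\n";
```

Without `use utf8`, Perl would treat the accented letter in the source as two separate bytes; without `binmode`, printing it would usually trigger a "Wide character" warning or produce garbled output.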
Minimal Perl Syntax
- Basic rules for Perl programs
- Statements, curly braces, comments (`#`)
- Character strings (single quotes or double quotes)
- Discussion of backslashes, print statements, and newline characters
Interpreted Text in Perl
- Concept of interpreted vs. non-interpreted character strings in Perl
- Examples demonstrating string handling and output
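A short sketch of the distinction, with hypothetical variable names: double-quoted strings are interpreted (variables and escape sequences such as `\n` are expanded), while single-quoted strings are kept as written.

```perl
use strict;
use warnings;

my $name = 'Perl';

my $interpreted     = "Hello $name\n";    # $name and \n are expanded
my $not_interpreted = 'Hello $name\n';    # kept literally, character by character

print $interpreted;              # Hello Perl
print $not_interpreted, "\n";    # Hello $name\n
```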
Variables in Perl
- Data types (booleans, integers, reals, characters, strings).
- Variables prefixed by `$`.
- Variable assignment and equality tests.
- Example demonstrating variable use.
Perl Operators
- Arithmetic operators (+, -, *, /).
- Integer part (`int`), random number generation (`rand`).
- String concatenation.
- Logical operators (AND, OR, NOT).
- Comparisons (numbers and strings).
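The operators listed above can be sketched as follows. The variable names and values are arbitrary choices for this illustration.

```perl
use strict;
use warnings;

my $x = 7;
my $y = 2;

my $quotient = $x / $y;              # 3.5: "/" is not integer division
my $whole    = int($x / $y);         # 3: integer part
my $die_roll = 1 + int(rand(6));     # random integer between 1 and 6
my $label    = 'x=' . $x;            # concatenation uses "."

# Numbers are compared with ==, !=, <, >; strings with eq, ne, lt, gt.
print "numeric equality\n" if $x == 7.0;
print "string equality\n"  if $label eq 'x=7';

# Logical operators: && (and), || (or), ! (not).
print "both conditions hold\n" if $x > $y && !($y > $x);
```

Using `==` on strings or `eq` on numbers is a frequent source of errors, since Perl converts between the two silently; `use warnings` reports many of these cases.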
String Manipulation
- String functions (e.g., `length`, `substr`).
- Demonstrating extraction/manipulation of the parts of a string
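A brief sketch of `length` and `substr` on an arbitrary example string. Offsets start at 0, and a negative offset counts from the end of the string.

```perl
use strict;
use warnings;

my $text = 'regular expression';

my $size  = length($text);          # 18 characters
my $first = substr($text, 0, 7);    # 'regular': 7 characters from offset 0
my $rest  = substr($text, 8);       # 'expression': from offset 8 to the end
my $last  = substr($text, -1);      # 'n': the last character

print "$size $first $rest $last\n";
```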
Control Structures in Perl
- Conditional structures (`if...else` statements).
- Iterative structures (loops).
- `while` loop
- `for` loop
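The three structures can be sketched together in a few lines. The subroutine name `sign` and the values used are assumptions made for the example.

```perl
use strict;
use warnings;

# Conditional structure: if / elsif / else.
sub sign {
    my ($n) = @_;
    if    ($n > 0) { return 'positive'; }
    elsif ($n < 0) { return 'negative'; }
    else           { return 'zero'; }
}

# while loop: repeats as long as the condition holds.
my $countdown = 3;
while ($countdown > 0) {
    print "$countdown\n";
    $countdown--;
}

# for loop: initialization; condition; increment.
for (my $i = -1; $i <= 1; $i++) {
    print "$i is ", sign($i), "\n";
}
```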
Functions in Perl
- Defining and calling subroutines/functions.
- Returning values.
- Examples showing how to define and use simple functions.
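The sketch below contrasts a subroutine that returns a value with one that only displays it, which is also the difference between `return` and `print` raised in the questions above. Both subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Returns a value to the caller and displays nothing.
sub square {
    my ($n) = @_;
    return $n * $n;
}

# Displays a value; the caller receives nothing it can compute with.
sub print_square {
    my ($n) = @_;
    print $n * $n, "\n";
    return;
}

my $sum = square(3) + square(4);    # returned values can be reused
print "$sum\n";                     # 25
print_square(5);                    # displays 25
```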
Arrays in Perl
- Array characteristics.
- Array creation and modification.
- Using arrays with `for` and `foreach` loops in Perl.
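A minimal sketch of array creation, modification, and traversal. The array contents are arbitrary.

```perl
use strict;
use warnings;

my @formats = ('text', 'HTML');     # creation
push @formats, 'LaTeX';             # append at the end
$formats[0] = 'plain text';         # modify by index (indices start at 0)

my $count = scalar @formats;        # number of elements: 3

# for loop with an explicit index
for (my $i = 0; $i < @formats; $i++) {
    print "$i: $formats[$i]\n";
}

# foreach loop directly over the elements
foreach my $format (@formats) {
    print "format: $format\n";
}
```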
File Handling in Perl
- Writing to files using `open`, `print`, `close`, and `binmode` for specific encodings.
- Reading from files using `open`, a `while` loop for reading lines, and `chop` to remove newline characters.
Perl's Advantages
- Perl's ability to handle different text formats such as CSV, Markdown, HTML, XML (including TEI), and LaTeX is highlighted.
Regular Expressions Introduction
- Introduction to regular expressions (regex) in the context of language theory.
- Chomsky hierarchy, including regular languages.
- Perl regex operators and functions are introduced.
Regular Expressions in Perl
- Basic regex concepts and notations for character classes, positions (start/end).
- Quantifiers `?`, `*`, and `+`.
- Example using regex for string matching and extraction, showing how to find specific patterns in a string
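As one possible illustration, the sketch below combines anchors, character classes, and the `+` quantifier in a pattern for the exercise stated in the questions above: a string starting with 'M', followed by a vowel, at least one letter, a space, and ending with at least one digit. The subroutine name `trailing_number` and the choice to treat `y` as a vowel are assumptions.

```perl
use strict;
use warnings;

# ^M         the string starts with 'M'
# [aeiouy]   exactly one vowel
# [a-zA-Z]+  at least one letter
# " "        one space
# (\d+)$     at least one digit, up to the end of the string (captured)
sub trailing_number {
    my ($text) = @_;
    if ($text =~ /^M[aeiouy][a-zA-Z]+ (\d+)$/) {
        return $1;    # the text matched by the parentheses
    }
    return;           # no match
}

print trailing_number('Mars 2024'), "\n";                            # 2024
print defined(trailing_number('Mr 42')) ? "match\n" : "no match\n";  # no match
```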
Regular Expression Operators in Perl
- `=~` (match operator), `i` (case-insensitive flag), and `s///` (substitution operator).
- `g`: global flag for multiple replacements, example demonstrating matching a particular pattern and capturing parts of the string.
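A short sketch of these operators and flags on an arbitrary sentence. Variable names are hypothetical.

```perl
use strict;
use warnings;

my $text = 'Perl handles text. perl handles regex.';

# =~ with the i flag: case-insensitive match with a capture.
my $object;
if ($text =~ /PERL handles (\w+)/i) {
    $object = $1;    # 'text': only the first match is used
}

# s/// replaces the first occurrence; with g, every occurrence.
# The copies keep $text itself unchanged.
(my $once = $text) =~ s/handles/processes/;
(my $all  = $text) =~ s/handles/processes/g;

# In list context, a match with g returns every captured string.
my @objects = $text =~ /perl handles (\w+)/gi;

print "$object\n$once\n$all\n@objects\n";
```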
Operators in Perl
- Splitting strings/data into pieces using the `split` function, with an explanation of its usage.
- Examples of extracting elements from strings with a delimiter.
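A minimal sketch of `split` on a delimited string. The semicolon-separated record is made-up example data.

```perl
use strict;
use warnings;

my $record = 'Ubuntu;Debian;Fedora';    # hypothetical semicolon-separated line

# The first argument is a regular expression describing the delimiter.
my @fields = split(/;/, $record);
my ($first, $second) = @fields;         # unpack the first two fields

# Special case: a single-space string splits on runs of whitespace
# and ignores leading whitespace.
my @words = split(' ', "  open   formats ");

print scalar(@fields), " fields, starting with $first and $second\n";
print join('|', @words), "\n";          # open|formats
```

The inverse operation is `join`, which glues the elements of a list back together with a separator.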
Regular Expression Summary
- Recap of regular expression capabilities and limitations.
- Strengths and weaknesses for different kinds of linguistic tasks.
- Overview of practical use and limitations.