Technology and Translation
Qassim University
Dr. Manal Alghannam
Summary
This document explains technology, specifically XML and its use in translation. It highlights infrastructure technologies, the XML family (including HTML and XHTML), and the importance of Unicode. The document also discusses open-source software examples and how they relate to translation.
Technology and Translation
Dr. Manal Alghannam
Week 4

1. Infrastructure technologies

This lecture describes some of the less ‘visible’ technologies that underpin the globalization environment by making it possible to create and share translation data.

1.1 XML and family

One of the most powerful technologies providing a platform for globalization is XML, which stands for eXtensible Markup Language (http://www.w3.org/XML). It is a way to store and organize data so that both humans and computers can understand it. Think of it as a method for writing information in a structured way. It is so important because it is increasingly the medium in which text is delivered for translation and in which translation resources are shared.

XML Family

XML is part of a bigger family of technologies that help in different ways. Here are some important members of the XML family:

XHTML: This is HTML written as XML. HTML is the language used to create web pages; XHTML makes web pages more structured and strict.

Comparison with HTML: Whereas HTML indicates how information is to be displayed in a browser, XML describes what pieces of information mean.

XML simplifies the transport and sharing of content across otherwise incompatible platforms. It also makes content more accessible by making it available, for example, to devices that can ‘publish’ it as text for the Deaf and as speech for the blind.

The set of tags is not closed but extensible, allowing communities of users to agree on the definition of new tags for particular applications. In other words, XML is a metalanguage, used to create many new languages in different domains of knowledge and activity. Among the most important of these for globalization are XLIFF, TBX, TMX and DITA.

XLIFF (XML Localisation Interchange File Format)
TBX (Term Base eXchange) makes it possible to consistently reuse the same terminology in CAT and MT tools as was used in the original authoring process.
TMX (Translation Memory eXchange)

Simple Example

Imagine you have a recipe written down. In normal text it is simply a title followed by lines of ingredients and steps. Written in XML, each part of the recipe is wrapped in a tag that says what that part is.

Why Use XML?

1. Organization: XML organizes data in a clear, structured way.
2. Flexibility: You can create your own tags to fit your needs, whether the content is a recipe, a translation or anything else.
3. Readability: Both humans and computers can read XML.

How XML Helps in Translation

In translation, XML is often used to store and exchange texts that need to be translated. For example, a website might use XML to separate the content (text) from the design (layout). This way, translators can work on the text without affecting the design.

1.2 Unicode and open source

Unicode is the character encoding standard for XML and has been widely adopted by global organizations, since its use can hugely facilitate software localization.

Open source software is code that is designed to be publicly accessible: anyone can see, modify, and distribute the code as they see fit. While open-source activity in translation technology remains relatively low, there are some notable exceptions in MT and TM, and we can expect this model of software development to become more widespread in translation.

Examples:
Web Browsers: Firefox is an example of open source software.
Office Suites: LibreOffice is another example; it provides free tools such as a word processor and a spreadsheet.

Unicode:

Text Representation
Universal System: Unicode is like a big international alphabet. It gives every letter, number, and symbol a unique number, no matter what language it is in.
One Standard for All: Think of it as a shared code table that every system agrees on. Because the numbers are the same everywhere, a character from one language still appears correctly when it is placed in a text written in another language, without getting garbled.
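The idea that every character has its own number can be checked directly. The following is a minimal sketch, not part of the lecture: the sample strings and the helper name code_points are chosen here purely for illustration. It prints the Unicode code point of each character and shows that text survives a round trip through UTF-8, one common way of storing Unicode text as bytes.

```python
# Illustrative sample strings (an assumption of this sketch); any text would do.
SAMPLES = {"Arabic": "سلام", "Chinese": "你好", "Latin": "A"}


def code_points(text):
    """Return the Unicode code point of each character as a U+XXXX string."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    for name, text in SAMPLES.items():
        utf8_bytes = text.encode("utf-8")      # how the text is stored or sent
        restored = utf8_bytes.decode("utf-8")  # what the receiving device rebuilds
        print(name, code_points(text), len(utf8_bytes), "bytes,",
              "round trip ok" if restored == text else "round trip FAILED")
```

The same character always maps to the same number, whichever device or language setting is in use; only the number of bytes needed to store it varies from script to script.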
Why It Matters
Consistency: If you write a message in Arabic on your phone and send it to a friend who speaks Chinese, Unicode makes sure your message looks correct on their phone too.
Inclusivity: Unicode includes thousands of characters from different languages, and even emojis, making digital communication more inclusive and understandable for everyone.

Open source web examples for translation
https://www.designrush.com/agency/software-development/trends/open-source-software-examples
https://github.com/LibreTranslate/LibreTranslate
https://ai.meta.com/tools/translate/

Putting It Together

Imagine you have a recipe book that you want to share with friends around the world:
Open Source: You can share the book for free, and your friends can read it, add their own recipes, and share it back with you. Together, you make the book even better.
Unicode: When you write a recipe with special symbols or in a different language, Unicode ensures that all your friends see the recipe just as you wrote it, no matter what language they speak or what device they use.

By using open source software and Unicode, people around the world can collaborate and communicate more effectively and inclusively.

1.3 Corpus data and tools

A corpus (plural corpora) is ‘a collection of naturally occurring language data’. To be usable for the purposes of translation technology, a corpus must be machine readable (which is typically the case whatever the purpose) and large, consisting of tens of millions rather than tens of thousands of words. Corpora are the raw resource for many applications such as extracting terminology, creating authoring and MT systems, and reusing previous translations.

What is Corpus Data in Translation?

1. Corpus (Plural: Corpora)
   Collection of Texts: Imagine a big library filled with books, but instead of physical books, it is a digital collection of texts. This digital collection is called a corpus.
   Real-Life Language Use: The texts in a corpus are examples of how people actually use language in real life. These could be books, articles, conversations, or any other written or spoken content.
2. Machine-Readable
   Digital Format: To be useful for computers, these texts are stored in a format that computers can read and analyze. It is like having all the books in the library in a special format that computers can understand.

Why is Corpus Data Important in Translation?

1. Understanding Language Patterns
   Learning from Examples: Translators and translation tools can look at many examples of how words and phrases are used in different contexts. This helps them understand the best way to translate something.
   Context Matters: Language can be tricky because words can have different meanings depending on how they are used. By looking at lots of examples in a corpus, translators can see the different ways words and phrases are used.
2. Improving Translation Quality
   Consistency: Using a corpus, translators can ensure they are using terms consistently throughout a translation. This is especially important for technical documents where specific terms need to be translated the same way every time.
   Accuracy: By referencing real-life examples from a corpus, translations can be more accurate and sound more natural.

What are Corpus Tools?

1. Corpus Analysis Tools
   Software Programs: These are computer programs that help translators search and analyze the texts in a corpus. They can find how often a word is used, show example sentences, and much more.
   Concordancers: One common tool is a concordancer. It shows all the places a word or phrase appears in the corpus, along with some surrounding text for context.
2. Building and Using Corpora
   Creating Corpora: Sometimes translators or companies need to build their own corpora. They collect texts related to their specific field (like medical texts or legal documents) and convert them into a digital, searchable format.
   Using Existing Corpora: There are also many ready-made corpora that translators can use. These might be general language corpora or specialized ones for different subjects.

Examples of Corpus Data and Tools

1. British National Corpus (BNC)
   Large Collection of English: The BNC is a large collection of English texts from various sources, like books, newspapers, and conversations. It helps translators understand how English is used in different contexts.
2. European Union’s Acquis Communautaire
   Multilingual Corpus: This is a collection of legal texts from the European Union, translated into multiple languages. It helps translators working with legal documents see how terms and phrases are translated in different languages.
3. AntConc
   Free Concordancer Tool: AntConc is a free tool that lets you search and analyse texts in a corpus. You can see how often words are used, their common contexts, and much more.

1.4 Alignment

What is Alignment in Technology?

Imagine you have a friend who speaks a different language, and you both want to understand each other. You need a way to match words and sentences from one language to another so you can communicate. This process of matching or lining things up so they make sense in both languages is similar to what alignment in technology does.

Alignment in Translation

In technology, especially in translation, alignment is about making sure that sentences or phrases in one language correspond correctly to sentences or phrases in another language. It is like having a dictionary that not only tells you what a word means but also how to use it in a sentence that makes sense.

Alignment is a process of matching source and target segments of text. The purpose of alignment is to capture relations of equivalence or correspondence in a translation. Automatically pairing the corresponding segments (sentences, headings, bulleted items) of the source and target texts may not be simple, as Somers (2003a: 34–7) explains. Why?
This is because the delimitation of segments in the first place usually relies on punctuation, but punctuation conventions and even the notion of a sentence vary from language to language. Alignment tools therefore allow the user to specify how punctuation should be taken into account. In addition, the translator may have distributed the content of a long source sentence over two or more target sentences, or merged several source sentences into one, to conform to target-language norms.

How Does it Work?

1. Source and Target Texts: You start with a text in the original language (source text) and the translated text in the new language (target text).
2. Matching Sentences: The goal is to line up sentences from the source text with the sentences in the target text that mean the same thing. Think of it as drawing lines between matching sentences in two columns.
3. Technology’s Role: Computers help with this by using special programs and algorithms. These programs analyze the text, look for patterns, and try to find the best matches between the source and target texts.

Why is Alignment Important?

1. Accuracy: Proper alignment pairs each target segment with the source segment it translates, so that reused translations reflect the original meaning.
2. Consistency: It helps maintain consistency in translations, especially in large documents or across multiple translations.
3. Improvement: It allows translators and software to learn from previous translations and improve over time.

Example

Imagine translating a simple story from English to Spanish. The alignment process would look at each English sentence and find the matching Spanish sentence.

English: "The cat sat on the mat."
Spanish: "El gato se sentó en la alfombra."

Alignment would pair these two sentences together, ensuring that the meaning is preserved.

2. Terminology tools

‘Terminology’ is both the process of identifying, organizing and presenting terms to users and the product of this process: collections of domain-specific expressions, often multi-word expressions (MWEs).
In translation applications, terminology can be massively multilingual. Terms are lexical items which have specialized reference within a particular subject domain.

The benefits of terminology tools:

1. A possibly significant cut in the time spent on research and revision.
2. More consistent and accurate translations.

2.1 Term extraction

Extracting, or ‘mining’, terminology from monolingual or parallel corpora may be done by a language service provider (LSP) in preparation for a job or, in the case of an MT vendor, prospectively to extend the system’s domain coverage.

The technology exploits two main approaches for finding candidate terms:

1. Linguistic approaches require part-of-speech tagged data to identify word combinations that match predetermined patterns (e.g. noun + noun: water pressure).
2. Statistical approaches rely on the fact that the component parts of terminological MWEs (multi-word expressions) tend to co-occur more often than would be predicted by chance.

A particular tool may combine elements of both approaches.

Searching for patterns such as noun + of + noun (e.g. part of speech, best of breed) or adjective + noun (e.g. hard drive) will successfully find matching terms however infrequent, but also tends to return many false or irrelevant candidates that need to be eliminated manually (e.g. cup of tea, long walk). So the initial list of candidates may be filtered according to various statistical criteria and the survivors ranked according to their likely ‘termhood’.

A further disadvantage of the linguistic approach is that the patterns need to be redefined for every language processed. Purely statistical methods escape this drawback and are language-independent, but they overlook terms whose frequency of occurrence is low.
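The combination of the two approaches described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy corpus and its part-of-speech tags are invented and hand-tagged, the frequency threshold min_count is arbitrary, and pointwise mutual information (PMI) is used as one common association measure; the lecture does not name a specific statistic or tool.

```python
import math
from collections import Counter

# Hypothetical hand-tagged toy corpus: one sentence per line, word/TAG tokens.
CORPUS = """
check/VERB the/DET water/NOUN pressure/NOUN daily/ADV
low/ADJ water/NOUN pressure/NOUN can/AUX damage/VERB the/DET pump/NOUN
the/DET pump/NOUN needs/VERB steady/ADJ water/NOUN pressure/NOUN
replace/VERB the/DET hard/ADJ drive/NOUN
the/DET hard/ADJ drive/NOUN is/VERB full/ADJ
take/VERB a/DET long/ADJ walk/NOUN
use/VERB a/DET long/ADJ cable/NOUN
"""

# Linguistic step: part-of-speech patterns that candidate terms must match.
PATTERNS = {("NOUN", "NOUN"), ("ADJ", "NOUN")}


def parse(corpus):
    """Turn the word/TAG text into a list of sentences of (word, tag) pairs."""
    return [[tuple(tok.rsplit("/", 1)) for tok in line.split()]
            for line in corpus.strip().splitlines()]


def extract_terms(sentences, min_count=2):
    """Return (pattern candidates with counts, filtered terms ranked by PMI)."""
    word_counts = Counter(w for s in sentences for w, _ in s)
    total = sum(word_counts.values())
    candidates = Counter()
    for s in sentences:
        for (w1, t1), (w2, t2) in zip(s, s[1:]):
            if (t1, t2) in PATTERNS:
                candidates[f"{w1} {w2}"] += 1
    ranked = []
    for term, count in candidates.items():
        if count < min_count:      # statistical filter: drop rare candidates
            continue
        w1, w2 = term.split()
        # PMI: how much more often the pair occurs than chance would predict.
        pmi = math.log2(count * total / (word_counts[w1] * word_counts[w2]))
        ranked.append((term, round(pmi, 2)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return candidates, ranked


if __name__ == "__main__":
    candidates, ranked = extract_terms(parse(CORPUS))
    print("pattern candidates:", dict(candidates))
    print("ranked survivors:", ranked)
```

In this toy data the pattern step also picks up "long walk", and the frequency filter removes it. The same filter would remove a genuine term that happened to occur only once, which is the weakness of statistical methods noted above.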
2.2 Term management

Term storage mechanisms for translation range from a simple two-column table or spreadsheet holding just the paired terms to a complex relational database capable of presenting equivalents across any or all of a large number of languages. Such databases typically contain a wealth of other data: linguistic (synonyms, variants, equivalents and so on), conceptual (e.g. domain, definition, related concepts), pragmatic (usage restrictions), bibliographic (source) and management (such as date, creator and reliability).

For productivity, an organization’s term management system is integrated with the translation environment to scan the source text for known terms and propose them for insertion at the press of a key, avoiding error-prone retyping.

What is a termbase?

Terms are stored in a termbase. A termbase is a multilingual database which includes each term, its translations and associated metadata, such as a description or definition and the rules for usage or formatting.

3. Authoring tools

3.1 Single-source content management

Single-source authoring is a methodology commonly used by technical writers to increase the re-use of existing written content instead of rewriting information. When preparing materials for marketing abroad, single-source authoring software can significantly reduce the amount of technical translation you will need to complete for foreign audiences.

Benefits of single-sourcing

Single-sourcing follows the principle of separating content from format, so that a single piece of content can be published as, for example, a Word document (.doc), a web page (.html) or online help (.chm). It also aims to write content only once and maintain it in a single place while publishing it in many places, thus reducing redundancy.

3.2 Controlled language checking

A controlled language (CL) is a version of a human language that embodies explicit restrictions on vocabulary, grammar and style for the purpose of authoring technical documentation. The initial objective was to minimize ambiguity and maximize clarity for human readers (e.g. Simplified English).

Several commercial tools are available for checking automatically that a technical author’s text conforms to the rules of the particular CL in use. These tools can be customized to the company’s lexical, syntactic and stylistic rules, which might include detecting typical errors made by non-native writers of the authoring language.

Conclusion

Unicode: A universal system that ensures every character from every language can be represented consistently on computers and the internet.

Open Source: A way of making software where the source code is freely available for anyone to use, modify, and share.

Alignment in technology, particularly in translation, is like making sure two people speaking different languages can understand each other perfectly. It involves matching sentences from one language to another to maintain meaning, accuracy, and consistency, with the help of technology.

XML is like a special way of writing information that makes it easy to organize, read, and use. Its family includes various tools and languages that help manage and transform XML data in different ways, making it a versatile tool in technology and translation.
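To make the XML summary concrete, here is a minimal sketch built on the recipe idea from section 1.1. The recipe, its tag names (recipe, title, ingredient, step) and the Spanish translations are all invented for illustration; they are not taken from the lecture. The sketch uses Python’s standard xml.etree.ElementTree module to pull out the text a translator would work on and to put translations back without touching the tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe; the tag names are made up for this illustration.
RECIPE_XML = """<recipe>
  <title>Mint tea</title>
  <ingredient>fresh mint</ingredient>
  <ingredient>hot water</ingredient>
  <step>Steep the mint in the water.</step>
</recipe>"""


def translatable_text(xml_string):
    """List (tag, text) pairs for every element that carries real text."""
    root = ET.fromstring(xml_string)
    return [(el.tag, el.text.strip())
            for el in root.iter() if el.text and el.text.strip()]


def translate(xml_string, translations):
    """Replace element text found in `translations`; tags stay untouched."""
    root = ET.fromstring(xml_string)
    for el in root.iter():
        if el.text and el.text.strip() in translations:
            el.text = translations[el.text.strip()]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(translatable_text(RECIPE_XML))
    spanish = {"Mint tea": "Té de menta", "hot water": "agua caliente"}
    print(translate(RECIPE_XML, spanish))
```

Because the tags describe what each piece of text is, the translator only ever sees the text, and the structure comes back unchanged. This is the separation of content from layout described in section 1.1.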