Singularity 1

Technological progress keeps accelerating, so the gap between the "modern" and the "proven" keeps widening. By now it is already so wide that communication between the two worlds is sometimes no longer possible.

One example of this is the ethics commission initiated by Dobrindt, which convened last week to determine (quote) "what the programmers may and may not do" when they write the algorithms for self-driving cars. The notion held by politicians that self-driving cars actually contain code somewhere along the lines of "if rechts(is ein Fussgänger) then fahre(links);" is so detached from reality that one is initially at a loss for words.

In the best case for the ethicists (which is rather unlikely), the system is split into two parts:
a) recognition of individual objects from a set of colored pixels
b) prediction of the positions and velocities of these objects relative to one's own car, in order to brake or swerve in time.

It could be that the recognized objects are classified ("house", "human", "car") in order to make the predictions in step b more accurate. But this is not strictly necessary, because from the volume of an object alone (assuming an average density) its mass, and therefore its inertia, can be estimated well enough.

Far more likely, however, is that the self-driving cars of the future will use deep learning. With deep learning, the colored pixels are fed as input into a black box, and this black box is trained until it can drive reasonably well. Whether this black box internally develops concepts such as "space and time", "matter", "object" and "human", as intermediate steps between colored pixels and the positions of the steering wheel and gas pedal, is uncertain, random, and changes from version to version.

With deep learning, the subjective time inside the artificial intelligence is compressed to the point that the entire evolution of intelligence on Earth (from cells to humans) takes place within a few days of our time, especially since the artificial intelligence is immortal in its own world. This evolution is repeated in every training run, until a result desired by humans is achieved.

And whether the artificial intelligence takes the same developmental path as we did, with classical logic, the concepts of space and time, the color scale, light and sound, the division into living and non-living things, the phase concept of plasma, gas, liquid and solid, the concept of life and death, and ultimately traffic signs and "dura lex sed lex", or some other path entirely, is left to chance. It is rather to be expected that the intelligence of self-driving cars will be quite different from ours, if only because they see the world through their sensors in a completely different way, and because they will associate their "I" with non-living things.

It will therefore not be possible for "the programmers" to write the algorithm in such a way that the cars favor human life, because the self-driving software intelligence has no algorithms, at least none written by humans. (And, merely noted in passing, the profession of "programmer" has not existed since the 90s at the latest, because "programming" takes up only a very small share of the time in modern software development and cannot be practiced as a full-time job.)

Rather, we will have to train, or rather raise, the self-driving cars to place more value on human life. And that is an entirely different order of magnitude of effort, research, time and cost than Dobrindt can imagine. We may even have to hold a public discussion with the cars (and not merely about them) in order to convince them to place our lives above theirs.

And I would find that absolutely normal and desirable.

Ultimately, this is about acquiring an intelligence cheaply and having it serve us. The last time humanity tried that, it was with slaves from Africa. That approach backfired spectacularly, and it is still backfiring massively, with no end in sight; witness the inhumane state of Africa today.

I hope we will do better next time.

Service Design Failures of Deutsche Telekom

After learning about Service Design Thinking, I wanted to apply what I had learned, and a fitting occasion presented itself: due to an error by Deutsche Telekom, I was left without an Internet connection for 12 days. It is a good opportunity to analyze the problems and flaws in Telekom's service design.

The necessary backstory.

I have two phone lines at two locations: one for myself and one for my parents. I kept both lines under my name and a single customer number, to make paying simpler and to spare my parents the technical details. Since we hardly ever make phone calls, both lines were on an analog telephony tariff. I had my DSL with Telekom, my parents had theirs with 1und1.

Last year, Telekom canceled both tariffs on its own initiative, because it had upgraded its network and supporting the legacy analog technology meant extra effort. Brazen as Telekom is, it decided that its network upgrade (which earns it more revenue through high-speed contracts anyway) should also be paid for by existing customers. In other words, the cheapest equivalent new tariff, just with IP telephony, cost me more than the old one, without any discernible benefit for me.

Back then I swallowed the pill and simply performed a tariff change for both lines via the Kundencenter (i.e. online). I got a Click and Surf tariff for myself, and for my parents I switched from 1und1 to Telekom with Magenta Home S. At first the switch worked: my line was migrated in January of this year. For my parents, we had to wait until 03.07.2015 because of the notice period at 1und1.

The problem case

Sometime in spring 2015, I looked at the order status and noticed that the Magenta Home S order for my parents had been canceled; instead, that tariff was listed as ordered for my own line, and my Click and Surf tariff, active since January, was listed as canceled. Needless to say, I had not requested any of this.

After discovering the wrong booking in the Kundencenter, I called the Telekom hotline. I opened the conversation with the words "I have discovered a probable problem on your side", to which the hotline employee replied: "I don't have a problem. After all, you are calling me." It continued in a similarly snide manner; at one point I was interrupted with "Hello! Hello! I am talking about the other line now". I nevertheless tried to remain polite. In the end, the hotline employee assured me that he had understood the problem, that he would book Magenta Home S for my parents again, undo everything that was wrong, and that I would receive written confirmation. None of this happened.

On 03.07.2015, I could no longer establish an Internet connection at home, and this outage was only fixed 12 days later, on 15.07.2015. I had to call various hotlines daily and visit the Telekom shop twice before the cause was finally identified: on 03.07.2015, the DSL protocol had been switched from Annex B to Annex J. My modem, a Speedport 221, supports VDSL at 50 Mbit/s, but only with Annex B.

Since the date of the outage was exactly the "release day" at 1und1, I suspected the mix-up with Magenta Home S from the start, and therefore turned to the Telekom shop in Fürth. One of its employees claimed the following:
– that it would be possible to quickly reverse the erroneous order from around 08.07.15 on. This was wrong; other Telekom employees later told me that the reversal, if possible at all, would take 2 months
– that my Internet connection should continue to work with Magenta Home S, because my modem supports VDSL. This was wrong because, as mentioned, the modem only supports Annex B
– that the technical hotline would fix the problem by 04.07., or 06.07.15 at the latest.

The technical hotline only got in touch on 07.07.2015 and wanted to arrange an appointment for a house visit. This was unnecessary, because the problem could have been found over the phone, by asking for the model of my modem and comparing it with the line settings. Indeed, another technician did exactly that on 11.07.2015 within a few minutes.

When the house visit was arranged, I was told that the technician would call me an hour in advance, so that I would have time to get home from work. This did not happen. The technician showed up at the apartment unannounced, and we had to communicate through my family members.

The technician could not tell a modem from a router. In my setup, these are separate devices. He tried to plug the telephone cable into the router, which is technically indefensible.

His claim was that either my router was broken (I don't know whether he meant my modem or my router) or my access credentials were wrong. Why the same equipment and settings had worked before 03.07.2015, he could not explain, nor did he want to investigate. He then closed the ticket.

I then contacted the technical hotline again and described the problem from scratch. The employee made no effort to actually investigate it either; she simply asked me what, concretely, she should do for me. Since at that point I suspected that my access credentials had been changed, I asked her to send them to me. She told me that I would then have a working Internet connection within 15 minutes at the latest. Not only was this claim wrong, because, as we now know, the problem was not the credentials but a protocol switch that apparently happened together with, or at the same time as, the erroneous order. She also sent the credentials to my Telekom e-mail address, which I could not access, because generating new credentials had invalidated my old ones. I had to call again and have another employee send the same credentials to a different e-mail address.

Only the last employee I contacted, on 11.07.2015, correctly identified the problem. Even then, he needed a lot of time to explain it to me, because in Telekom-speak, DSL Annex J is called an "IP-capable router". A router is, by definition, a device that routes IP packets; in that sense, every consumer router in the world is "IP-capable". So I first had to explain this to Deutsche Telekom, nearly losing my voice from shouting in the process, because after 12 days of outage I really felt I was being taken for a fool.

Only slowly could the good man explain to me what had happened technically. However, he was not authorized to reverse anything and could only send me a suitable device (a Speedport Entry) on loan. And even that did not happen overnight, but via the normal delivery route through Deutsche Post, which meant another 4 days of downtime.

At the beginning, in all the hectic, I also ordered a tariff change from Magenta S to Magenta M, because I thought it would solve my problem and restore my desired DSL speed. When I placed the order, the sales employee said the change could happen within 2 or 3 days. In fact, it would have been possible on 17.08 at the earliest, i.e. over a month later. Another sales employee said nothing could be done about it, since accounting cannot keep up that fast. When I remarked that I would presumably not pay for Magenta Home S anyway, because I never ordered that tariff, he asserted that since it was in his booking system, I had ordered it, and if I was of a different opinion, I should communicate with Telekom in writing.

During the outage, which lasted 12 days and was in no way my fault, I had to rely on my mobile phone (with T-Mobile). I had to book SpeedOn twice, because my data allowance was quickly exceeded. Not one Telekom employee thought of unlocking a truly unlimited mobile data tariff for me for the duration of the outage. Worse still, none of them knew my mobile number (and they constantly tried to reach me at my parents'), even though I had given my Telekom customer number when ordering the T-Mobile SIM card, and even though technical support kept sending me text messages.

When, thoroughly fed up, I arranged Internet access from Kabel Deutschland, I wanted to return the loaned device and called Telekom again. Only then did a support employee tell me that the minimum term of the loan contract is one year, so I would have to spend 30 euros for nothing. The employee apparently knew nothing of my history, even though by that point I had already revoked, contested, and terminated everything possible without notice (alternatively at the earliest possible date), and another Telekom employee had tried to reach me at my parents', presumably to talk me out of it.

In general, I had to retell the complete history to every single Telekom employee, even though it really ought to be clearly visible in their CRM system. There was also one case where an employee told me he had to call another department and that I should stay on the line, he would be right back; after 2 minutes, a different woman answered, who again knew nothing, and I had to retell everything from the beginning.

In the first days of the 12-day disaster, I also had to spend over 30 minutes in the waiting queue before getting through.

Telekom has at least two kinds of hotlines, one technical and one for sales, and I was sent from one department to the other and back again, because nobody felt responsible for my problem.

In the Telekom shop, they can do even less than the hotline employees; for example, they cannot revoke an order.

To my direct question of whether there is a phone number for complaints, or whether I should communicate through a lawyer by mail right away, the shop employee said he knew of no complaints number.

In the end, I terminated all my contracts with Telekom. Because of the two-year terms, they still run until 2017, although since this summer I no longer have any telephone wall sockets and use nothing from Telekom. This will cost me over 600 euros. I was a Telekom customer for over 14 years, and I will never be one again.

Analysis and proposed solutions

The actual problem was the mix-up of the two lines, caused either by a human or by a software error. Good service design cannot prevent such errors. It can, however, make them less damaging and easier to fix, for example through the following measures:

1. Create simplicity. Instead of several software systems (a Kundencenter and a separate order management system, as is apparently the case at Telekom), there should be a single system, so that orders do not have to be transferred between systems, i.e. transfer errors cannot happen. This also means that support employees should use the same online Kundencenter as the customers (perhaps with somewhat different access rights). Which in turn means that the online Kundencenter must be substantially improved (usability, speed, features).

2. Transparency, personal responsibility and accountability. Every booking should record who initiated it. If it was a support employee, their full name should be stored there. There must be a way to reach precisely this employee, via an online request form, by e-mail, or by phone, to ask why a particular booking was made.

Even so, the customer might still have to get in touch with a support employee. But what if that person is in a bad mood, underpaid or incompetent?

3. Eliminate the human factor. All contract operations (such as revocation, cancellation, reversal, termination and contestation, as well as the ticket system of the technical support) must be possible online in the Kundencenter. If the company wishes to fight for the customer, the win-back team can still contact them after the cancellation and try to change their mind.

Furthermore, the split of Telekom's hotlines into technology and sales, and a CRM system that either does not work or is unusable (usability!), meant that an enormous amount of time was lost and a great deal of unnecessary effort was expended: I had to talk to 10 to 20 different support employees, and none of them could know enough context about the problem.

4. Should the customer prefer communicating with humans (because, for example, they belong to Generation X or older and are not that comfortable online), this communication must take place with exactly one support employee. This advisor, known to the customer by first and last name, directly reachable by e-mail or phone, and permanently assigned, should know the complete history of the customer and the problem. They then act autonomously on the customer's behalf, work their way through all the relevant departments of Telekom, and solve the problem.

If even that does not help, the customer must be given the opportunity to resolve the problem either independently (given sufficient expertise) or through a qualified third party. For this, it is necessary that

5. transparency also exists at the (technical) level of detail. Every user manual that ships with a device contains a chapter called "Technical Data". It holds important information about what is happening technically (and not merely in sales-speak). Telekom should be just as transparent here. The customer must be able to set new DSL credentials directly in the Kundencenter. They must be able to read the sync status and further technical status information from "their" DSLAM. Authoritative information about the SIP proxy, the STUN server and the SIP ports in use, as well as the status information, must be available directly in the Kundencenter (currently one has to piece it together from various forums).

Apart from that, I see one more lesson learned. Both at T-Online and at T-Mobile, none of the currently orderable tariffs are as cheap as my existing legacy tariffs. A tariff change at Telekom is currently not worthwhile if you are satisfied with the status quo. Worse still: the competition offers the same service for 30% less money. In precisely this situation, Telekom could only have kept its customers through legendarily outstanding service. With the disastrous service they currently offer, they would instead have to drop their prices by at least 60-70% (e.g. offer DSL for 9 to 12 euros per month instead of 30).

And that is the lesson learned: with above-average prices, the service must be above average too. I assume the converse holds as well: above-average customer service can open up the possibility of charging higher prices.

An experience of unsupervised learning

In my previous post I’ve explained why I think you should learn machine learning and promised to share my experiences with its unsupervised part.

Unsupervised machine learning has a mystical attraction. You don't even have to label the examples; you just feed them to the algorithm, it learns from them, and boom: it automatically separates them into classes (clustering).

When I was studying electrical engineering, we learned about so-called optimal filters: electrical circuits that can extract a useful signal from noise, even when the noise is 100 times stronger than the signal, so that the human eye cannot see the signal at all. This was like magic, and I expected similar magic here: I would pass the examples to the clustering algorithm, and it would explore the hidden relationships in the data and give me some new, unexpected and useful insights…

Today, having tried it, I still believe that some other algorithms (deep learning, maybe?) are able to produce such magic (because, well, you should never stop believing in magic), but my first impression was somewhat disappointing.

The very first thing the clustering algorithm wanted to know from me was how many clusters it should look for. Pardon me? I expected that you would find the clusters and tell me how many there are in my data! If I have to pass the number of clusters beforehand, it means I have to analyze the data to discover its inherent clustering structure, which means I have to do all the work I expected the algorithm to perform magically.

Well, it seems that the current state-of-the-art clustering algorithms indeed cannot find clusters autonomously and fully unsupervised, without any hint or input from the user. Some of them don't require the number of clusters, but need some equivalent parameter instead, like the minimum number of examples needed in a neighborhood to establish a new cluster. But on the other hand, this still allows for some useful applications.

One possible use case could be automatic clustering per se: if your common sense and domain knowledge tell you that the data has exactly N clusters, you can just run the data through the clustering algorithm, and there is a good chance it will find exactly the clusters you expected: no need to manually define all the rules and conditions separating the clusters. Besides, it will give you the centroids or medoids of each cluster, so that if new, unclustered objects arrive daily, you can easily assign them to the existing clusters by calculating the distances to all centroids and taking the cluster with the shortest distance.
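
The nearest-centroid assignment described above can be sketched in a few lines of NumPy (the 2-D centroids and the new object here are invented for illustration):

```python
import numpy as np

# Hypothetical centroids produced by an earlier clustering run (2-D features).
centroids = np.array([[0.0, 0.0],
                      [5.0, 5.0],
                      [0.0, 5.0]])

def assign_to_cluster(point, centroids):
    """Return the index of the centroid nearest to the point (Euclidean)."""
    distances = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(distances))

new_object = np.array([4.2, 4.9])
print(assign_to_cluster(new_object, centroids))  # nearest to (5, 5), i.e. 1
```

The same idea scales to any feature dimension, as long as the new object is scaled the same way as the data used for clustering.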

Another use case arises if you don't really care about the contents of the clusters and they aren't going to be examined by humans, but you use clustering as a kind of lossy compression of the data space. A typical example would be some classical recommendation engine architectures, where you replace millions of records with a much smaller number of clusters, accepting some loss of recommendation quality just to make the computation feasible on the available hardware. In this case, you'd simply consider how many clusters, at most, your hardware can handle.

Yet another approach, and the one I took, is to ask yourself: how many clusters are too few, and how many are too many? I was clustering people and wanted to present the clusters to my colleagues and myself, so we could base decisions on them. Therefore, due to well-known human constraints, I was looking for at most 7 to 8 clusters. I also didn't want fewer than 5 clusters, because intuitively, anything less would in my case be underfitting. So I played with the parameters until I got a reasonable number of clusters, with contents that were reasonable (and understandable for humans).

Speaking of which, it took me a considerable amount of time to evaluate the clustering results. Just as with any machine learning, it is hard to understand the logic of the algorithm. Here, you just get clusters numbered from 0 to 7, and each person is assigned to exactly one cluster. Now it is up to you to make sense of the clusters and to understand what kind of people were grouped together. To facilitate this process, I wrote a couple of small functions returning the medoid of each cluster (i.e. the single cluster member nearest to the geometric center of the cluster; in other words, its most average member), as well as the average values of all features in the cluster. For some reason, most existing clustering implementations (I'm using scikit-learn) don't bother to compute and return this information as a free service, which, again, speaks for the academic rather than industrial quality of modern machine learning frameworks.
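
The post doesn't show those helper functions, but they could look roughly like this (function names and the toy data are my own invention):

```python
import numpy as np

def cluster_medoid(X, labels, cluster_id):
    """Return the cluster member nearest to the cluster's geometric center."""
    members = X[labels == cluster_id]
    center = members.mean(axis=0)
    distances = np.linalg.norm(members - center, axis=1)
    return members[np.argmin(distances)]

def cluster_feature_means(X, labels, cluster_id):
    """Return the average value of every feature within the cluster."""
    return X[labels == cluster_id].mean(axis=0)

# Tiny invented example: two clusters on a line.
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.4, 0.0],
              [1.0, 1.0], [1.2, 1.0], [1.4, 1.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(cluster_medoid(X, labels, 0))         # [0.2 0. ]
print(cluster_feature_means(X, labels, 1))  # [1.2 1. ]
```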

By the way, another thing not provided for free was pre-scaling. In my first attempts, I just collected my features, converted them to real numbers, put them into a matrix and fed this matrix to the clustering algorithm. I didn't receive any warnings, just fully unusable results (like several hundred clusters). Luckily, my previous experience with supervised learning had taught me that fully unusable results usually mean some problem with the input data, and I got the idea to scale all features into the range of 0 to 1, just as with supervised learning. This fixed that particular problem, but I'm still wondering: if clustering algorithms usually cannot work meaningfully on unscaled data, why don't they scale the data for me as a free service? In industrial-grade software, I would rather opt out of pre-scaling via some configuration parameter, for the rare special case where I want it off, than have to implement scaling myself, which is the most common case anyway. If this is some kind of performance optimization, I'd say it is a very, very premature one.
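
Scaling every feature into the 0-to-1 range is a one-liner with scikit-learn's MinMaxScaler; a minimal sketch (the raw feature values, e.g. age and yearly income, are invented):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented raw features on wildly different scales: [age, yearly income].
X_raw = np.array([[25, 30000.0],
                  [40, 90000.0],
                  [55, 60000.0]])

# Rescale each feature column into [0, 1] before feeding it to clustering.
X_scaled = MinMaxScaler().fit_transform(X_raw)
print(X_scaled)
```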

But I digress. Another extremely useful tool for evaluating clustering quality was the silhouette metric (and the class implementing it in scikit-learn). This metric is a number from -1 to 1 showing how homogeneous a cluster is. If a cluster has a silhouette of 0.9, its members are very similar to each other and dissimilar to the members of the other clusters.
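
In scikit-learn, the metric is available as `silhouette_score` (the average over all examples) and `silhouette_samples` (one value per example); a small sketch with two invented, well-separated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Two tight, well-separated toy blobs.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Both values are close to 1 here, because the clusters are very homogeneous.
print(silhouette_score(X, labels))
print(silhouette_samples(X, labels).min())
```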

Last but not least, some clustering algorithms (DBSCAN among them) create clusters for many, but not all, examples. Some of the examples remain unclustered and are considered outliers. Usually, you want the algorithm to cluster the examples in such a way that there are not too many outliers.

So I’ve assumed the following simple criteria:

  • 5 to 8 clusters
  • Minimal silhouette of 0.3
  • Average silhouette of 0.6
  • Less than 10% of all examples are outliers

and just implemented a trivial grid search across the parameters of the clustering algorithm (eps and min_samples of DBSCAN, as well as different scaling weights for the features), until I found a clustering result that met all of my requirements.
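
Such a grid search can be sketched as follows; the criteria are the ones listed above, while the parameter grids, helper names and demo data are invented for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_samples

def acceptable(X, eps, min_samples):
    """Check one DBSCAN parameter combination against the criteria above."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clustered = labels != -1                 # DBSCAN marks outliers with -1
    n_clusters = len(set(labels[clustered]))
    if not (5 <= n_clusters <= 8):           # 5 to 8 clusters
        return False
    if clustered.mean() < 0.9:               # less than 10% outliers
        return False
    sil = silhouette_samples(X[clustered], labels[clustered])
    return sil.min() >= 0.3 and sil.mean() >= 0.6

def grid_search(X, eps_grid, min_samples_grid):
    """Trivial grid search: return the first parameter pair that qualifies."""
    for eps in eps_grid:
        for min_samples in min_samples_grid:
            if acceptable(X, eps, min_samples):
                return eps, min_samples
    return None

# Demo with made-up data: five tight blobs, far apart from each other.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]])
X_demo = np.vstack([c + rng.normal(scale=0.1, size=(10, 2)) for c in centers])
print(grid_search(X_demo, eps_grid=[0.5, 1.0], min_samples_grid=[3, 5]))
```

In my real search, the feature scaling weights were part of the grid as well; that simply adds one more loop around the two shown here.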

To my astonishment, the results corresponded very well to my prior intuitive expectations based on domain knowledge, and they also gave me a useful quantitative measure of that previously intuitive understanding.

All in all, unsupervised learning can be used to gain some benefit from the data, if you don't expect too much from it. I think that to gain more business value, we have to take the next step and start a project involving deep learning. In the USA and China, it seems that virtually everyone is doing deep learning (I wonder whether Bitcoin farms can be easily repurposed for that), but in Germany it is rather hard to find anyone publicly admitting to it. Although the self-driving cars of German manufacturers, which already exist as prototypes, would be impossible without some deep learning…

Why Should You Learn Machine Learning

At the end of the 80ies and in the early 90ies, fourth-generation programming languages and genetic algorithms were popular topics in the mass media. We read in magazines that software developers would become obsolete, because users would create their programs themselves using 4GL, or AI systems would soon be created that could extend themselves. By that time, I had learned my first programming languages and was about to choose my subject at university, and therefore had doubts about the job prospects in software development.

Fortunately (or not), Steve Jobs and Bill Gates popularized graphical user interfaces around that time, so this first AI wave calmed down (or returned to its academic roots), because software development became less about finding an answer to a question and more about displaying windows, buttons, menus and textboxes. Computer games' focus shifted from "what exactly you are doing" to "how cool it looks". The Internet changed from a source of scientific or personal information into an ingenious marketing tool and became a thing of pictures, graphic design and neuromarketing.

But if you are a software developer and have not yet realized that you need to teach yourself machine learning, you should be concerned about your job. Because machine learning is coming, and it is the next logical step in giving up full control over your software.

First, we lost control over the exact machine instructions in our programs and gave it up to compilers. Next, we lost control over memory management and gave it up to the garbage collector. Then we partially lost control over the order of execution and gave it up to event loops, multithreading, lambda expressions and other tools. With machine learning, we will lose control over the business logic.

Classic computer programming trained us for the situation where the desired business logic is exactly known and specified beforehand. Our task was to implement the specification as exactly as possible. And in the first decades of software development practice, there were enough useful problems that could be specified with more or less acceptable effort. Remember, the first computers were used for ballistic calculations. With all the formulae already derived by scientists, the programming task at hand had a perfect specification.

Now we want to go into areas where creating a specification is impossible, or too expensive, or simply not the optimal course of action.

Let's take fraud detection as an example. Say we have data about the payment transactions of some payment system and want to detect criminal activity.

A possible non-machine-learning approach would be to establish a set of fraud detection rules based on common sense. For example, some limit on the transfer amount, above which a transaction becomes suspicious. Likewise, transactions from different geographical locations within a short period of time are suspicious, etc.
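
A hand-written rule set of this kind might look like the following toy sketch; the thresholds, field names and the helper function are all invented:

```python
from datetime import timedelta

SUM_LIMIT = 10000.0              # invented threshold for the transfer amount
GEO_WINDOW = timedelta(hours=2)  # invented window for the geo rule

def is_suspicious(tx, previous_tx):
    """Common-sense rules: large amounts, or far-apart places in short time."""
    # Rule 1: amount above the limit is suspicious.
    if tx["amount"] > SUM_LIMIT:
        return True
    # Rule 2: different countries within a short time window is suspicious.
    if previous_tx is not None:
        same_place = tx["country"] == previous_tx["country"]
        soon_after = tx["time"] - previous_tx["time"] < GEO_WINDOW
        if not same_place and soon_after:
            return True
    return False
```

Even this tiny example already hints at the problem discussed below: every new rule multiplies the number of interactions a developer has to keep in mind.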

One obvious limitation of this approach is that the alarm thresholds are based on common sense, so the objective quality of the fraud detection depends heavily on how well the subjective common sense of its developers reflects reality.

Another obvious limitation of the common-sense approach is that such a rule system cannot be infinitely complex. Humans can comprehend only a limited number of rules at once; they usually stop after defining 5 or 7 rules, see a system with 20 rules as "very complex", and a system with 100 rules as "we need a whole new department to make sense of what is really going on here". For comparison, Square, Inc. is using a machine learning algorithm for fraud detection based on (my conservative guess) over 3000 rules (not to mention that they can re-tune these rules automatically every day or more often).

It is even harder for humans to comprehend the possible interplay between the rules. A typical geo-based rule should usually fire for distance D and time period T, but not in the public holiday season (as many people travel at that time); yet even in that season it must still fire if the amount is above M when the recipient is a registered merchant, or above P when the recipient is a private person; but it must not fire if the account holder had already made a similar transfer one year before and that transfer was not marked as fraud; but it must still fire if any automatic currency conversion is taking place… At some point, a classic software developer will throw up her hands and declare herself out of the game. Usually, she will then create a generic business rule engine and assert that the business guys will have to configure the system with all their chaotic business rules. Which doesn’t solve the problem, just shifts it from one department to another.

Now, remember the Shannon–Hartley theorem? Me neither, but the main thing about it was that there is a difference between information — the useful signal that is valued by the receiver — and mere data, the stream of zeros and ones in some particular format. Fraud detection can be seen as an information extraction problem. Somewhere in the transaction data, information signaling criminal activity is hidden from our eyes. We as humans have practical limits on extracting this information. Machine learning, done correctly, is a way to extract and evaluate more information from data.

Classifiers in machine learning are algorithms that, based on a set of features (or attributes) of some event or object, try to predict its class, for example “benign payment” or “fraud”.

No matter what algorithm is used, the procedure is roughly the same. First, we prepare a training set of several (often at least 1000; the more, the better) labeled events or objects, called examples. “Labeled” means that for each of these examples, we already know the right answer. Then we feed the examples to the classifier algorithm, and it trains itself. The particulars depend on the exact algorithm, but what all algorithms basically try to do is to recognize how exactly the features are related to the class, and to construct a mathematical model that can map any combination of input features to a class. Often the algorithms are not terribly complicated to understand: for example, they might count how often a feature appears in one class and then in the other; or they might start with a more or less random limit for a rule and then move it, each time counting the number of right predictions (the accuracy) and changing direction when the accuracy gets worse. Unfortunately, not a single algorithm author seems to care about the learning curve of their users, so most algorithm descriptions include some hardcore-looking math, even when it is not strictly necessary.

Finally, the trained classifier is ready. You can now pass unlabeled examples to it, and it will predict their classes. Some classifiers are nice and can even tell you how sure they are (say, a 30% chance it is a benign payment and a 70% chance it is fraud), so you can implement different warning levels depending on their confidence.
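The threshold-moving idea can be sketched in a few lines of plain Python. This is a toy single-feature classifier for illustration, not any particular library’s algorithm:

```python
# Toy classifier: learn a single threshold on one numeric feature
# ("amount") by trying candidate limits and keeping the one with the
# best training accuracy. An illustration, not a production algorithm.

def train_threshold(examples):
    """examples: list of (amount, label) with label 'benign' or 'fraud'.
    Returns the threshold with the highest training accuracy."""
    candidates = sorted(a for a, _ in examples)
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum(
            1 for amount, label in examples
            if (label == "fraud") == (amount > t)   # predict fraud above t
        )
        acc = correct / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict(threshold, amount):
    return "fraud" if amount > threshold else "benign"

training_set = [(10, "benign"), (50, "benign"), (90, "benign"),
                (800, "fraud"), (1200, "fraud"), (5000, "fraud")]
t = train_threshold(training_set)
print(t, predict(t, 30), predict(t, 2000))
```

Real algorithms do the same thing with hundreds of features and smarter search, which is exactly why their internals stop being human-readable.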

A huge disadvantage of machine learning (and welcome to the rant part of this post): only some classifiers can be logically understood by a human being. Often they are just probability distributions, or hundreds of decision trees. While it is theoretically possible, for a given input example, to work through the formulas with pen and paper and get the same result the classifier did, it would take a lot of time and won’t necessarily give you any deep understanding of its logic. Practically speaking, classifiers cannot be explained. This means that sometimes you pass the classifier an example where you as a human can clearly see it is fraud, you get the class “benign” back, and you go: “What the hell? Is this not obviously a fraud case? And now what? How can I fix it?”

I suppose one could try to train a second classifier, giving the wrongly predicted examples more weight in its training set, and then combine the results of both classifiers using some ensemble method, but I haven’t tried it yet. I haven’t found any solution to this problem in books or training courses. Currently, most of the time you have to accept that the world is imperfect and move on.

And generally, machine learning is still in a very half-baked state, at least in Python and R.

Another typical problem of contemporary machine learning: when you train classifiers and provide them with too many features, or features in the wrong format, the classifying algorithms can easily become fragile. Typically, they don’t even try to communicate that they are overwhelmed, because they can’t even detect it. Most of them still have academic software quality, so they lack the precondition checking, strong typing, proper error handling and reporting, proper logging and other things we are accustomed to in production-grade software. That’s why most machine learning experts agree that currently, most of the time is spent on a process they call feature engineering, and I call “random tinkering with the features until the black box of the classifying algorithm suddenly starts producing usable results.”

But well, with luck or, more likely, after investing a lot of time in feature engineering, you will get a well-trained algorithm capable of accurately classifying most of the examples from its training set. You calculate its accuracy and are very impressed by some high number, like 98% right predictions.

Then you deploy it to production, and are bummed by something like 60% accuracy under real conditions.

This is called overfitting, and it is a birthmark of many contemporary algorithms: they tend to believe that the (obviously limited) training set contains all possible combinations of values, and they underestimate combinations not present in the set. Statisticians have developed a procedure to counter this, called cross-validation. It increases the training time of your algorithm by a factor of 5 to 20, but in return gives you a more accurate accuracy estimate. In the example above, your algorithm would score something like 64% accuracy under cross-validation, so at least you are not unpleasantly surprised when running it in production.
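The core of cross-validation fits in a dozen lines of plain Python. Here is a minimal k-fold sketch, where `train` and `evaluate` stand in for any classifier’s fit and score functions (the toy model below just predicts the majority label):

```python
# Minimal k-fold cross-validation sketch in plain Python.

def k_fold_accuracy(examples, k, train, evaluate):
    """Split examples into k folds; train on k-1 folds, test on the
    held-out fold; return the mean held-out accuracy."""
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_fold = folds[i]
        train_folds = [ex for j, f in enumerate(folds) if j != i for ex in f]
        model = train(train_folds)
        scores.append(evaluate(model, test_fold))
    return sum(scores) / k

# Toy "classifier": always predict the majority label of the training folds.
def train(examples):
    labels = [label for _, label in examples]
    return max(set(labels), key=labels.count)

def evaluate(model, examples):
    return sum(1 for _, label in examples if label == model) / len(examples)

data = [(x, "benign") for x in range(8)] + [(x, "fraud") for x in range(2)]
print(round(k_fold_accuracy(data, 5, train, evaluate), 2))
```

The k-fold factor is exactly where the 5x to 20x increase in training time comes from: the model is trained k times instead of once.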

Modern, improved algorithms such as random forests have built-in protection against overfitting, so I think this whole problem is a transient issue of a quickly developing technology, and we will forget about it in a year or so.

I also have the feeling that the authors of machine learning frameworks consider themselves done as soon as a trained classifier is created and evaluated. Preparing and using it in production is not considered a worthy task. As a result, my first rollout of a classifier produced predictions that were worse than random guessing. After weeks of lost time, the problem was found. To train the classifier, I had written an SQL query and stored my training set in a CSV file. This is obviously not acceptable for production, so I reimplemented the code in Python. Unfortunately, I reimplemented it in a subtly different way: one of the features was encoded in a different format than the one used during the training phase. The classifier did not produce any warnings and simply “predicted” garbage.
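This class of mistake can be avoided by having exactly one encoding function that both the training pipeline and the production code call. A minimal sketch, with invented feature names:

```python
# One shared encoding function used both at training time and in
# production. The feature names and country table are invented for
# illustration; the point is that there is exactly one place where a
# raw transaction becomes a numeric feature vector.

COUNTRY_CODES = {"DE": 0, "FR": 1, "US": 2}

def encode_features(tx):
    """Turn a raw transaction dict into the numeric feature vector
    the classifier was trained on."""
    return [
        float(tx["amount"]),
        COUNTRY_CODES.get(tx["country"], -1),   # unknown country -> -1
        1.0 if tx["is_merchant"] else 0.0,
    ]

# Both the SQL-export script and the production service call
# encode_features(), so a transaction is always encoded the same way:
tx = {"amount": "42.50", "country": "FR", "is_merchant": True}
print(encode_features(tx))
```

Had the training export and the production service shared such a function, the format drift between my CSV export and the Python reimplementation could not have happened.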

Another problem is that most algorithms cannot be trained incrementally. If you have 300 features, have spent weeks training your algorithm, and now want to add a 301st feature, you will have to re-train the classifier using all 301 features, even though the relationships between the first 300 features haven’t changed.

I think there are more rants about machine learning frameworks to come. But at the same time, things in this area change astonishingly rapidly. I don’t even have time to try out the new shiny interesting thing announced every week. It’s like riding a bicycle on an autobahn. Some very big players have been secretly working in this area for 8 years and more, and now that they are coming out, you realize a) how much more advanced they are compared to you, b) that all internet business will soon be divided into those who could implement and monetize big data and those who were left behind, and c) that machine learning will be implemented as built-in statements in mainstream languages within the next five years.

Summarizing, even the contemporary state of the art in machine learning has advantages too significant to ignore:

- the ability to extract more information from data than human-specified business logic can;
- as a pleasant consequence, any pre-existing data (initially collected for other purposes) can be repurposed and reused, meaning more business value extracted per bit;
- another pleasant consequence is the ability to handle data with a low signal-to-noise ratio (like user behavior data);
- and finally, if the legacy business logic didn’t have quality metrics, they will be introduced, because any kind of supervised machine learning involves measuring and knowing the quality metrics of the predictions (accuracy, precision, recall, F-scores).
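These quality metrics boil down to a few lines of arithmetic over the confusion-matrix counts. A minimal sketch:

```python
# Accuracy, precision, recall and F1 from raw confusion-matrix counts
# (tp/fp/fn/tn = true/false positives/negatives for the "fraud" class).

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)      # of all fraud alarms, how many were right
    recall = tp / (tp + fn)         # of all real frauds, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# 90 frauds caught, 10 false alarms, 30 frauds missed, 870 correct benigns:
print([round(m, 2) for m in metrics(90, 10, 30, 870)])
```

Note how the 96% accuracy hides a recall of only 75%; this is why accuracy alone is a poor metric for rare-event problems like fraud.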

In this post, I’ve only described supervised machine learning. There is also a big area called unsupervised machine learning. In December last year, on the last day before my vacation, I finished my first experiment with it, and it will be the topic of my next post.

And Big Data is so much more than just machine learning. It also includes architecting and deploying a heterogeneous database landscape, implementing high-performance processing of online and offline data, implementing recommendation engines, computational linguistics and text processing of all kinds, as well as analytics over huge amounts of poorly structured and ever-growing data.

If you are interested in working in our big data team, contact me and I will see what I can do (no promises!).

Beginning software architecture (for Yun)

Every programmer starts her career with something small. Implement a small function. Then implement a couple of functions talking to each other. Then implement a module, with dozens of functions, and maybe error handling and an API.

But sooner or later, we all want to move on and step up to a higher abstraction level. We want to oversee the whole software system. We want to learn how to design it — how to do software architecture. But because it is the first time we step up an abstraction level, it is often very hard to do. Where do I start? When am I finished? How do I know I’ve created the right architecture?

Teachers and universities often don’t help; instead, they make things even worse by overloading us with huge amounts of information and detailed requirements about the architecture.

Meanwhile, there is only one thing about software architecture that is really important.

Architecting software is like caring for your child.

You want your child to be safe and healthy, to be loved, and to have a long and happy life.

Safety. Your software might crash at run time or destroy valuable data. If it depends on its environment (other software or hardware) to run, teach your software how to recover when its environment fails. Teach your software how to protect itself against input from hackers and unprofessional users. Teach your software to change or produce data only when it is fully sure it is working correctly. Teach your software how to sacrifice one of its parts to protect the whole, and teach it to run without one of its parts.

Health. Obesity is the most important problem for software. Always try to implement the same functionality with less code. Do not implement functionality that nobody needs, but do prepare the software for the challenges it will definitely face in the future: plan for extensibility. Use refactoring to avoid code areas that nobody is able to understand and change, because these are the dead areas of the software body, limiting its flexibility.

Software is often created in teams. You want the other team members to love and care about the software as you do. Make sure that everyone writes code that can be read by anyone — enforce a uniform programming style if needed. Ensure that it is safe for team members to use each other’s code: no unexpected results, proper error handling, consistent conventions. Avoid code ownership, because you want to get a lovely software system, and not just a set of poorly interconnected moving parts.

For software to have a happy life, it must be loved and used by users. Ensure you not only understand the software requirements, but also why the users have these requirements. Work with the users to define even better requirements, ones that will make your software faster, slimmer or more robust. Come up with ideas for how to make your software even more lovable: successful software will get more loving and motivated hands to work on it, while unsuccessful software will be abandoned and die.

It is not easy to care for a child, nor is it easy to create a good software architecture. There are no rules equally suitable for all children; every time, you will have to find the proper answer, maybe by trial and error. But the results of a job done right might make you equally proud and your life fulfilled.

Being a happy bricklayer

“What are you doing?“
“I’m laying bricks,” said the first bricklayer.
“Feeding my family,” said the second bricklayer.
“I’m building a cathedral,” said the third bricklayer.

When I learned this story in primary school, I was shocked to see how shitty the lives of the first two bricklayers were. The first one didn’t even have any intrinsic motivation to do his job, so he was probably a slave, a prisoner or some other kind of forced workforce. And the financial situation of the second one was apparently so critical that he was forced to take a job — any job he could find — to feed his family, even though he wasn’t really interested in laying bricks, or perhaps even in construction work altogether.

I’m very happy to say that I was building a cathedral at every job I’ve taken so far. And frankly speaking, I don’t even see the point of doing it differently. A job takes 8 hours a day. And for a hobby we can find, perhaps, one hour per day on average? So by making your job your hobby, and your hobby your job, you increase the happy time of your life by 700%.

Another shocking aspect of that story is the missing loyalty of the first two workers. By my upbringing and education, I’m normally very loyal to my employer, at least as long as they are loyal to me. When my employer decides to hire me, they have some purpose in mind. It is a question of my loyalty, and of my integrity, to deliver on it. But the first two workers seemed to be absolutely ignorant of their purpose in their organization!

That’s why I don’t really know what to say every time I hear someone declaring that his or her purpose in the company is not related to money. I mean, come on, private companies have exactly one primary goal, one reason to exist: to earn money. Yes, they might have some cool vision, like not being evil or having a laser-sharp focus on perfect products, but these goals are all secondary. They are quickly forgotten when the primary goal is in danger. No company can survive for long unless it follows the primary goal.

Therefore, I really do think that the purpose of each and every employee should be to see how s/he can help the company earn or save money. If s/he is not okay with that, well, wouldn’t s/he be much happier working in a government agency, a non-governmental, scientific, military, or welfare organization?.. Just asking…

Great UX

HUK24.de has fascinating (and in places courageous) UX. Try it out yourself! What I liked:

1) They sell car insurance in exactly the way I want to buy it. There are no landing pages with happy people explaining the benefits to me. There are no testimonials. There are no oversized “Buy now” CTAs. Instead, they understand that when I come to huk24.de for the first time, I am still comparing which insurance company to choose, and so they give me exactly what I want: a quick, non-binding, uncomplicated way to calculate how much I would have to pay in my particular case.

2) But that’s not all. At the end of the calculation there is, of course, a “Sign up now” CTA. But if I leave the tab at this point, take a couple of days to check out the alternatives, and then come back, the site doesn’t throw a “Session expired” at me — it still knows everything I entered back then and is still ready to close the contract immediately! That alone is golden.

3) When I then reach the point in the ordering process where credentials are assigned, they only ask for a secret password. The user ID is generated automatically and displayed to me, so I can save my complete credentials in my KeePass. And if I’m not mistaken, the e-mail address is only requested later, at the point where I myself have an interest in providing it (e.g. so that I can receive my insurance code).

4) During the order, it is possible to enter a WerberID if another HUK24 customer has recommended them to me. But there is also a note saying that I can enter the WerberID later (even after the contract is signed) if I don’t have it at hand.

5) If I leave the site while logged in and later simply type www.huk24.de, I don’t get the start page, but a note that I was automatically logged out and can log in again. I can still continue browsing without logging in, but there is a soft nudge to log in. This way, HUK24 can understand me better and offer me personalized features.

6) After logging in, I get to the “Meine HUK24” area, where exactly the 6 most important functions I could ever need are displayed in the middle.

7) And many smaller UX touches that I find great — for example, buttons are never disabled; instead, clicking them brings up an overlay explaining what still needs to be done, and so on.

We’ll see. If their actual product (the insurance) works as well as the website, I have made the right decision.

Enterprise Innovation

Well, my Enterprise Seasons model was too simple. Actually, after creating their first successful “flywheel” product, some corporations proceed to create a second, third and further successful products, always remaining an innovative enterprise, at least in some of their parts. There are a lot of advantages to this:

- Risk diversification. If one product fails for whatever reason, the other products will keep the company afloat.

- The law of diminishing returns can be worked around. Instead of investing more and more creative power into smaller and smaller uplifts, one can enjoy a much higher ROI with a fresh new product.

- Linear scalability. Growing the company by growing the production and sales of a single product involves a lot of work with people, processes, and inevitable bureaucracy. Growing the company by creating a new product can be as simple as copying its existing structure.

- Several revenue sources allow for aggressive market policies: the company might let one of its products be intentionally unprofitable in order to gain market share.

- Finally, there might be synergy between different products; for example, ideas or methods from one product can be applied to another, or selling a combination of products might be easier.

Therefore, it is all the more interesting to me to understand why so many enterprises have problems with innovation. Why don’t some enterprises keep creating more and more products? So far, I’ve seen the following scenarios:

- Cultural incompatibility. Discovering a new product is anything but safe: 90% of new products fail. The traditional 19th-century world view of a safe lifetime workplace and a state welfare system eliminates the necessity of innovation. “We will work in the same safe market niche, and hopefully it will last until we leave the job and draw our pension; and if not, the welfare system will help us stay afloat and find a new job.”

- Ethical reasons. Growing a company can be seen as a consumerist, anti-ecological activity. In this case, the company not only doesn’t create new products; even continuous development of its primary product is almost non-existent — it is in maintenance mode.

- While the company invests most of its resources into establishing and developing its secondary product, its primary product is hit and almost destroyed by a sudden market shift; its development is frozen, and everyone keeps working to make the secondary product the new primary.

- Even though the primary product is running well, most of its revenue is paid out to foreign shareholders pursuing a short-term strategy. Innovation is barely possible, because there are not enough people and not enough money for it.

If some innovation nevertheless tries to happen, there are often cultural difficulties:

- The Sun-and-stars fallacy. The Sun is so much brighter than the stars that we don’t see stars during the day. The scale of the primary product is much larger than that of a new product; it always has more visitors, page views, registrations, orders, revenue and operational spending. “What? Your new product only generates X orders per month? What a fail, our primary product generates YYYYY orders! Let’s spend more on the primary product!” The trick is, if you don’t invest in the new product to grow it, it will also never reach maturity. The primary product was just as small in its initial stages.

- The no-fail mentality. When searching for a new product, everyone in the team (PM, designers, developers) must have a “fail fast, fail cheap” mentality. By contrast, when developing a mature product, the team must follow a “no fails allowed” principle. If you like test-driven development, run-time performance optimization, software security, source code commented and formatted to the style guide, creating comfortable in-house frameworks and planning several sprints ahead, you should develop a mature product. But if you like fast user feedback, discussions about usability and the minimum viable product, several releases each week, and your software works only in 80% of cases and your source code is dirty as hell, but you’d rather spend more time discussing one-pixel changes in the UI, then you should be in a team discovering a new feature or product. When companies ignore these differences and assign their “no-fail” developers to discover a new product, it only leads to everyone’s frustration.

- The additive development fallacy. Development of the primary product is often additive. Projects like “We expect X% more users, we have to scale our hardware and software”, “We need feature X due to law changes”, or “We need a more modern design, let’s do a relaunch”, once implemented, usually never need to be rolled back. The problems begin when new products or features are also implemented in this additive manner. Instead of starting with hypothesis verification and then a prototype, a complete product or feature is conceived, designed and implemented. Several months later, it rolls out, gets some less-than-moderate user attention, and starts to rot quietly in its tiny dark corner. Nobody has the balls to sunset the feature, because, well, the company’s culture is additive, and the months of development are perceived as an asset. In reality, such features are a debt, constantly draining team effort and energy for maintenance, support, porting, translation, and operation.

I’m not sure yet how enterprises create a new successful product. When observing enterprises with several products, I have the feeling that either

- a charismatic leader builds his very own small empire and creates a new product as a by-product (no pun intended),

- or the mergers and acquisitions department grabs a product together with its team and successfully integrates it into the company,

- or the company organizes its own startup incubator. The company then owns its new products only partially, and a lot of the existing infrastructure is not re-used, but at least the cultural issues are solved,

- or, in 0.00001% of cases, companies such as Valve have an innovation culture from the very beginning.

Please share your experience with innovation within an enterprise.

My Decision Theory

On my way to work, I usually take a bus. Once, I arrived at the bus stop a little late and had to wait for the next bus. I looked at the timetable and found out that the next bus would come in 12 minutes. My ride is two stops, which takes 4 minutes by bus, or 20 minutes on foot.

I decided to walk.

Now, mathematically, it was the wrong decision. Waiting 12 minutes and then riding the bus for 4 minutes gives 16 minutes, which is shorter than 20 minutes. But that day was very cold, so I figured I’d better walk and warm myself up than stand at the bus stop for 12 minutes, possibly catching a cold. So even if the decision was mathematically wrong, it was correct from the health point of view.

Several minutes into walking, I watched a bus drive past me. What I had forgotten while making my decision is that two different bus lines pass my stop, and I can take either of them to work. I had looked up just one timetable and forgotten about the second one.

As a consequence of this decision, I came to work several minutes later than I ought to have. Normally, this is not a very good thing. But I had worked a little longer the previous day, and I didn’t have any meetings scheduled, so it didn’t cause any major trouble. On the positive side, I walked for 20 minutes, which was better for my health.

So I made a decision that was wrong both mathematically (16 minutes is less than 20) and logically (there was another bus line), but it didn’t have any major negative consequences; indeed, it was even good for my health.

Crazy, but this is how the world is. We make wrong decisions, yet reap only positive consequences. Sometimes we make perfectly correct and elegant decisions that become a huge source of negative consequences.

I’m still trying to understand how to handle it.

And this, by the way, is why I always laugh when I hear CS academics speaking about “reasoning about your code” and “formal proofs of correctness”. They seem to think the biggest problem of the software industry is figuring out whether 16 is less than 20.

Four Weeks of Bugfixing

The hardest bug I’ve ever fixed took me 4 weeks to find. The bug report itself was pretty simple, but I have to give some context first.

I was one of the developers of Smart TV software, and the bug was related to the part of the software responsible for playing video files stored on a USB stick or drive. The CPU available for this task was a 750 MHz ARM chip, which clearly did not have enough power to decode video (let alone HD video) in software. Luckily, every digital TV set has a hardware H.264 decoder, and our SoC was flexible enough that we could use it programmatically. In this way, we were able to support H.264 video playback (too bad for you, DivX and VC-1 owners).

Technically, the SoC provided a number of building blocks, including a TS demuxer, an audio decoder, a video decoder, a scaler and multi-layer display device, and a DMA controller to transfer the data between the blocks. Some of the blocks were present more than once (for example, for the PIP feature you naturally need two video decoders), and the blocks could be dynamically and freely interconnected programmatically, building a hardware-based video processing pipeline. Theoretically, one could configure the pipeline by writing the proper bits and bytes into specified configuration registers of the corresponding devices. Practically, the chip manufacturer provided an SDK for the chip, so you only had to call a pretty well-designed set of C functions. The SDK was intended to run in kernel mode of a Linux kernel, and it came from the manufacturer together with all the build scripts needed to build the kernel.

Furthermore, this SDK was wrapped and extended by some more kernel-side code, first to avoid dependency on a particular SoC, and second to expose some devices to user mode, where the rest of the Smart TV software was running. So to play video programmatically, one needed to open a particular device from user mode as a file and write a TS stream containing video and audio data into it.

Sadly, there are many people out there who have invented a lot of container formats besides TS. Therefore, our software had to detect the container format of the file to be played, demux the elementary streams out of it, mux them again into a TS stream, and hand that over to the kernel-mode code. The kernel code would pass the TS bytes to the DMA device, which would feed the hardware TS demuxer, which would send the video elementary stream to the hardware video decoder, where it would finally be decoded and displayed.

For the user mode, we could have implemented all possible container formats ourselves (which would have meant job security for the next 10 years or so). Fortunately, the Smart TV software was architected very well, and the GStreamer framework was used (for you Windows developers, it is an open-source alternative to DirectShow). The framework is written in C (to be quick) and GLib (to be object-oriented) and provides a pipeline container into which you can put filters and interconnect them. Some filters read the data (sources), some process the data (e.g. mux or demux), some consume the data (sinks). When the pipeline starts playing, the filters agree on which one will drive the pipeline; the driver then pulls data from all filters before it in the pipeline and pushes data into all filters after it. Our typical pipeline looked like this (in simplified form): “filesrc ! qtdemux ! mpegtsmux ! our_sink”. As you would expect from such a framework, there is also a lot of stuff related to events and state machines, as well as memory management.
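Such a pipeline can be prototyped on the command line with gst-launch. Since “our_sink” was our proprietary element, the stock filesink is substituted here, and the file paths are made up:

```shell
# Prototype of the (simplified) playback pipeline on the command line.
# "our_sink" was a proprietary element; filesink is substituted so the
# command runs on a stock GStreamer installation.
gst-launch-1.0 filesrc location=movie.mp4 ! qtdemux ! mpegtsmux ! filesink location=out.ts
```

This is essentially the debugging pipeline described further below, which dumps the remuxed TS stream to a file instead of the hardware decoder.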

Now, back to the bug report. It looked like this: when playing a TS file from USB memory, you can seek forward and backward without limitation. When playing any other container format, you can seek forward, but you cannot seek backward. When seeking backward, the video freezes for several seconds, and then playback continues from the latest position.

This is the sort of bug where I think it might be fixed in a day or two. I mean, it works with TS, it doesn’t work with MP4, it is fully reproducible; just find out what is different in those two cases and you’ve caught it.

The GStreamer pipeline in the TS case looked like this: “filesrc ! our_sink”. So the culprit had to be either qtdemux or mpegtsmux. I built another MP4 demuxer and replaced qtdemux with it. Negative, the bug was still there. No wonder: it also appeared with other container formats. I couldn’t replace mpegtsmux, because I hadn’t found any alternatives. So the only thing I could do was use the pipeline “filesrc ! qtdemux ! mpegtsmux ! filesink”, write the output into a file, and then try to dump the TS structure and look for irregularities.

If you know the TS format, you surely sympathize with me already. TS is a very wicked and complicated format that repeats some meta-information every 188 bytes, so the dump of several seconds of video took megabytes. After reading it, I didn’t find anything suspicious. Then I converted my test MP4 video into a TS using some tool, dumped that TS, and compared. Well, there were some differences, in particular in how often the PCR was transmitted. Theoretically, the PCR is just a system clock and should not influence playback at all, but in practice we already knew about some hardware bugs in the decoder making it allergic to unclear PCR signaling. I spent some time trying to improve the PCR, but this didn’t help either.

I then played the dumped TS file, and I could see the seek backwards that I had done during the recording. This convinced me that mpegtsmux was also bug-free. The last filter I could suspect was our own sink. Implementing a GStreamer filter is not easy to get right the first time. So I went through all the functions, all the states, all the events, informed myself how a proper implementation should look, and found a lot of issues. Besides a number of memory leaks, we generated garbage during the seek. Specifically, GStreamer needs seeking to work in the following way:

1. The seek command arrives at the pipeline and a flush event is sent to all filters.

2. All filters are required to drop all buffered information to prepare themselves for the new data streamed from the new location.

3. When all filters have signaled to be flushed, the pipeline informs the pipeline driver to change playback location.

4. After the seek, the new bytes start flowing in the pipeline.

Our code conformed to this procedure somewhat, but did the cleanup prematurely, so that after the cleanup some more stale data polluted our buffers before the data from the new location arrived.

I couldn’t explain why it worked with TS but not with MP4, but I figured that fixing this would make our product better anyway, so I fixed it. As you can imagine, it didn’t solve the original problem.

At this point I realized that I had to go into the kernel. This was a sad prospect, because every time I changed anything in the kernel, I had to rebuild it, put the update on a USB stick, insert it into the TV set, upgrade to the new kernel by flashing the internal SoC memory, and then reboot the chip. And sometimes I broke the build process, the new kernel wouldn’t even boot, and I had to rescue the chip. But I had no other choice: I was out of ideas about what else I could do in user space, and I suspected that in kernel space we had a similar issue with garbage during the seek.

So I bravely read the implementation of the sink device and changed it so that it would explicitly receive a flush signal from user space, flush the internal buffer of the Linux device, and signal back to user space that it was ready; only then would I unlock the GStreamer pipeline and allow it to perform the seek and start streaming from the new location.

It didn’t help.

I went further and flushed the DMA device too. It didn’t help. Flushing the video decoder device didn’t help either.

At this point I started to experiment with the flush order. If I flushed the DMA first, the video decoder might starve from lack of data and get stuck. But if I flushed the decoder first, the DMA would immediately feed it some more stale data. So perhaps I had to disconnect the DMA from the video decoder first, then flush the decoder, then the DMA, and then reconnect them? Implemented that. Nope, it didn’t work.

Well, perhaps the video decoder was allergic to asynchronous flushes? I implemented some code that waited until the video decoder reported that it had just finished a video frame, and flushed it only then. Nope, this wasn’t it either.

In the next step, I subscribed to all hardware events of all devices and dumped them. Well, that was another pile of megabytes of logs to read. It didn’t help that video playback was a very fragile process per se: even when playing a video that looked perfectly fine on the screen, the decoder and the TS demuxer would routinely complain about being out of sync, losing sync, or being unable to decode a frame.

After some time of trying to see a pattern, the only thing I could tell was that after a seek forward, the video decoder would complain for some frames but eventually recover and start producing valid video frames. After a seek backward, the video decoder never recovered. Hmm, could it be something in the H.264 stream itself that prevented the decoder from working?

Usually, one doesn’t think about elementary streams in terms of a format. They are just BLOBs that somehow contain the picture. But of course they have an internal structure, and this structure is normally dealt with only by the authors of encoders and decoders. I went back to GStreamer and looked up, file by file, all the filters of the pipeline producing the bug. Finally, I found that mpegtsmux has a file with “h264” in its name, and this immediately rang an alarm bell in my head. Because, well, TS is one abstraction level above H.264; why the hell does mpegtsmux have to know about the existence of H.264?

It turned out that an H.264 bitstream has in its internal structure the so-called SPS/PPS (sequence and picture parameter sets), which are basically the configuration for the video decoder. Without the proper configuration, it cannot decode video. In most container formats, this configuration is stored once, somewhere in the header. The decoder normally reads the parameters once before playback starts and uses them to configure itself. Not so in TS. TS is by nature not a file format but a streaming format: it has been designed so that you can start playing from any position in the stream. This means that all important information has to be repeated every now and then. In particular, when an H.264 stream gets packed into the TS format, the SPS/PPS data also has to be repeated regularly.

This is the piece of code responsible for this repetition: http://cgit.freedesktop.org/gstreamer/gst-plugins-bad/tree/gst/mpegtsmux/mpegtsmux_h264.c?h=0.11#n232 As you can see, during normal playback it inserts the contents of h264_data->cached_es every SPS_PPS_PERIOD seconds. This works perfectly well until you seek. But look at how the diff is calculated in line 234, and how last_resync_ts is stored in line 241. GST_BUFFER_TIMESTAMP is, as you can imagine, the timestamp of the current video sample passing through the muxer. When we seek backwards, the next time we come into this function GST_BUFFER_TIMESTAMP will be much less than last_resync_ts, so the diff will be negative, and thus the SPS/PPS data won’t be sent again until playback reaches the original position from before the seek.

To fix the bug, one can either use the system time instead of the playback time, or reset last_resync_ts during the flush event. Either would be just a one-line change in the code.

Now, the careful reader might ask why the TS file I recorded with mpegtsmux at the beginning of this adventure could be played at all. The answer is simple. At the beginning of that file (i.e. before I did the seek), there is H.264 data with repeated SPS/PPS. At some point (when I sought during the recording) the SPS/PPS stop being sent, and some seconds later they appear again. Because this SPS/PPS data is the same for the whole file, already its first instance configures the video decoder properly. During the actual seek in MP4 playback, on the other hand, the video decoder is flushed, and therefore its SPS/PPS configuration is flushed too. This is the point where the video decoder relies on the repeated SPS/PPS in the TS stream to recover, and this is exactly the point where they stop coming from mpegtsmux.

Four weeks of searching. 8 hours a day, 5 days a week. Tons of information read and understood. Dozens of other, smaller bugs fixed along the way. All to find a single buggy line of code among 50 million lines of code in the source folder. A large haystack contains, by my estimate, 40 to 80 million individual straws, making this bug-fixing adventure literally the equivalent of finding a needle in a haystack.
